To complete this assignment, students need a basic understanding of Python or R, including data import, data processing and visualization, and statistical inference.
If you are not very familiar with plotting in Python, you can:
Take a look at the Matplotlib library and at Matplotlib for data science in Python.
Also use any other tools you are familiar with for analyzing the data.
At the end of this assignment, students should be able to
This assignment contains five tasks that help you analyze data sets from different perspectives. Please read all instructions before starting.
All these exercises can be done using Python or any other available software (such as R or MATLAB), as long as the results are consistent and correct.
Recommendations:
Take a look at the lecture “Data Analysis” in the lecture notes section. Several sources on the internet (books, articles, and so on) about data visualization and analysis can also serve as a guide, such as the books An Introduction to Statistical Learning, Fundamentals of Data Visualization, and Data Analytics.
Depending on the chosen tool (Python or R), take a look at the cheat sheets available on the internet for the language and its libraries/packages (pandas, matplotlib). They summarize the most useful information on syntax, functions, variables, conditions, formulas, and more.
All the data files required in this exercise are in the /work/courses/unix/T/ELEC/E7130/general/r-data directory. The path is also available in the RDATA environment variable if you have sourced the use.sh file.
Note: You may type the command kinit before accessing the directory to avoid permission issues.
$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
$ ...
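The same directory can be reached from Python once use.sh has been sourced. The following is a minimal sketch, not part of the course material: the helper name data_file and the fallback to the current directory when RDATA is unset are assumptions made for illustration.

```python
import os
from pathlib import Path


def data_file(name, env_var="RDATA"):
    """Return the path of a data file inside the directory named by env_var.

    data_file is a hypothetical helper; falling back to the current
    directory when the variable is unset is an assumption of this sketch.
    """
    base = Path(os.environ.get(env_var, "."))
    return base / name


if __name__ == "__main__":
    path = data_file("flows.txt")
    print(path, "exists" if path.exists() else "not found")
```

Reading the directory from the environment keeps the scripts free of hard-coded paths, so they also work on a local copy of the data.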
In the first task, explain the following plots briefly. For example, describe what the x-axis and y-axis represent, the appropriate usage scenarios, and the limitations of each plot.
In this task, draw various kinds of plots on linear and logarithmic scales, and then analyze them.
Download the file flows.txt, which contains flow lengths in bytes captured from a network, and study the flow length variable using your favorite software.
Provide concise answers to the following sections.
Plot the flow data using:
Note: Include the plots in your report together with the commands or functions used to produce them.
Describe the distribution by choosing suitable summary variables. Here, a variable means a summary statistic of the distribution, such as a measure of central tendency (mean, median, mode) or an extreme value (maximum, minimum).
Note: Provide the commands used to obtain the results, and explain the reasons for your selections based on the information gathered in the previous section.
Replot the data using logarithmic values, and explain why and when logarithmic values are more suitable.
Finally, conclude whether some methods describe the data better than others and why, and briefly explain the behavior of the flow data based on the methods used.
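As an illustration of the summary step, the sketch below computes a few statistics on the raw values and on their base-10 logarithms using only the Python standard library. The sample values are made up to keep the example runnable, and the helper name summarize is hypothetical; neither comes from flows.txt.

```python
import math
import statistics

# Made-up flow lengths in bytes, used only to keep the sketch runnable;
# replace them with the values read from flows.txt.
flows = [40, 52, 52, 60, 120, 1500, 4000, 250000]


def summarize(values):
    """Return common summary statistics of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
    }


linear = summarize(flows)
# Base-10 logarithms compress the long right tail of the sample.
logged = summarize([math.log10(v) for v in flows])

print("linear:", linear)
print("log10: ", logged)
```

For this made-up sample the mean (31978) lies far above the median (90), which is the kind of gap that suggests a skewed distribution and motivates looking at the logarithmic values.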
Tips:
Useful Python functions include plt.scatter(), plt.hist(), plt.boxplot(), and ecdf().
Useful R functions include plot(), hist(), boxplot(), ecdf(), and log().
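The four plot types can be drawn side by side as in the sketch below. It is only an illustration under stated assumptions: the data is a synthetic log-normal sample rather than flows.txt, the helper ecdf_points is hypothetical and computes the empirical CDF by hand with NumPy, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Synthetic heavy-tailed sample standing in for the flow lengths (assumption).
rng = np.random.default_rng(0)
flows = rng.lognormal(mean=7.0, sigma=2.0, size=1000)


def ecdf_points(values):
    """Return sorted values and the fraction of samples <= each of them."""
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y


fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].scatter(np.arange(len(flows)), flows, s=4)
axes[0, 0].set(title="Scatter", xlabel="flow index", ylabel="length (bytes)")
axes[0, 0].set_yscale("log")

# Bins that are evenly spaced in log space suit a log-scaled x-axis.
bins = np.logspace(np.log10(flows.min()), np.log10(flows.max()), 40)
axes[0, 1].hist(flows, bins=bins)
axes[0, 1].set(title="Histogram", xlabel="length (bytes)", ylabel="count")
axes[0, 1].set_xscale("log")

axes[1, 0].boxplot(flows)
axes[1, 0].set(title="Box plot", ylabel="length (bytes)")
axes[1, 0].set_yscale("log")

x, y = ecdf_points(flows)
axes[1, 1].step(x, y, where="post")
axes[1, 1].set(title="ECDF", xlabel="length (bytes)", ylabel="P(X <= x)")
axes[1, 1].set_xscale("log")

fig.tight_layout()
fig.savefig("flows_log.png")
```

Removing the set_xscale("log") and set_yscale("log") calls, and replacing the logarithmic bins with a plain bin count, gives the linear-scale versions for comparison.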
In task 3, produce different kinds of plots that are useful for analyzing properties of network data such as stability and correlation.
Download the files linkload-*X*.txt, which contain link load information (in bits per second) for different links at one-second intervals.
Inspect the data, especially for stability and for whether previous values contribute to the present value (short- and long-range memory).
Explain your own understanding of each data set (i.e. each link).
Tips:
Useful Python functions could be plot(), lag_plot(), and autocorrelation_plot().
Useful R functions could be lag.plot() and acf().
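To make clear what an autocorrelation plot displays, the sketch below computes the sample autocorrelation at a given lag in plain Python. The function name autocorrelation and the short load series are made up for illustration; for the real link load files the plotting functions listed above are more convenient.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of series at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    deviations = [value - mean for value in series]
    variance_sum = sum(d * d for d in deviations)
    if variance_sum == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    # Products of deviations that are `lag` steps apart.
    covariance_sum = sum(
        deviations[t] * deviations[t + lag] for t in range(n - lag)
    )
    return covariance_sum / variance_sum


# Made-up link loads in bits per second, one value per second (assumption).
load = [1.0, 2.0, 3.0, 4.0, 5.0]
for lag in range(3):
    print(lag, round(autocorrelation(load, lag), 3))
```

Values that stay well above zero over many lags indicate that past loads carry information about the present one, whereas values that drop to around zero after a few lags point to short memory.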
In this task, draw a pairs plot of the variables in the data set to examine the correlations and relationships between them.
Download the bytes.csv dataset, which contains time series data in 4 relevant columns: transmitted bytes, received bytes, transmitted packets, and received packets.
Draw the pairs plot for these values.
Answer the following questions:
Tips:
A useful Python function could be scatter_matrix().
A useful R function could be pairs().
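A pairs plot for four columns could be produced roughly as follows. This is a sketch under assumptions: the data frame is synthetic, the column names tx_bytes, rx_bytes, tx_packets and rx_packets are hypothetical because the header of bytes.csv is not given here, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for interactive use
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for bytes.csv; the column names are hypothetical.
rng = np.random.default_rng(1)
tx_packets = rng.integers(10, 1000, size=200)
rx_packets = rng.integers(10, 1000, size=200)
frame = pd.DataFrame({
    "tx_bytes": tx_packets * 800 + rng.normal(0, 5000, size=200),
    "rx_bytes": rx_packets * 600 + rng.normal(0, 5000, size=200),
    "tx_packets": tx_packets,
    "rx_packets": rx_packets,
})

# One scatter panel per pair of columns, histograms on the diagonal.
axes = scatter_matrix(frame, figsize=(8, 8), diagonal="hist", alpha=0.5)
axes[0, 0].figure.savefig("bytes_pairs.png")

# Pearson correlation coefficients give a numeric counterpart to the panels.
print(frame.corr().round(2))
```

With the real file, the frame would instead come from pd.read_csv, and the correlation table helps to back up what the panels suggest visually.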
For this task, visualize the data set by creating a time series plot, which helps in understanding patterns and trends over time.
Download the querytime.csv dataset, which records the query time to a distant website whose server is located in Belgium.
Plot the time series.
By observing and analyzing the plot, answer the following questions:
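When reading a time series plot, a smoothed curve drawn next to the raw values often makes trends easier to see. The sketch below shows one simple option, a moving average, in plain Python. The function name moving_average and the sample query times are made up for illustration and are not taken from querytime.csv.

```python
def moving_average(values, window):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = []
    running = sum(values[:window])
    averages.append(running / window)
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


# Made-up query times in milliseconds (assumption, not from querytime.csv).
query_ms = [30, 32, 31, 90, 33, 34, 35, 36]
print(moving_average(query_ms, 4))
```

A single spike such as the 90 ms value raises several neighbouring averages, so the raw series should still be shown alongside the smoothed one when discussing outliers.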
To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get at most 15 points.
You can get up to 30 points for this assignment:
Task 1
Task 2
Task 3
Task 4
Task 5
The quality of the report (bonus 2p)
Your submission must contain the following (please do not include the original data in your submission):
Your report must have: