ELEC-E7130 Assignment 5. Data Analysis

Esa Hyytiä

Markus Peuhkuri

Tran Thien Thi

Weixuan Jiang

César Iván Olvera Espinosa

Yu Fu

Prerequisites

  1. To complete this assignment, students need a basic understanding of Python or R, including data import, data processing, visualization, and inference.

    If you are not very familiar with plotting in Python, you can

    1. Take a look at the Matplotlib library and at Matplotlib for data science in Python.

    2. Use any other tools you are familiar with for analyzing data.

Learning outcomes

At the end of this assignment, students should be able to

  1. Understand the techniques available in data analysis and visualization.
  2. Know which summary statistics best describe the characteristics of a data set.
  3. Make graphs that represent information in an easy-to-understand way.
  4. Analyze the data set from different perspectives and characteristics such as correlation, stability, trend, seasonality or stationarity.

Introduction

This assignment contains five tasks that help you analyze data sets from different perspectives. Please read all instructions before starting.

All these exercises can be done using Python or any other available software (such as R, Matlab) as long as the results are consistent and correct.

Recommendations:

All the data files required in this exercise are found in the /work/courses/unix/T/ELEC/E7130/general/r-data directory. The path is also available in the RDATA environment variable if you have sourced the use.sh file.

Note: You may need to run the command kinit before accessing the directory to avoid permission issues.

$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
...
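
For those working in Python, the same directory can be located from a script. The following is a minimal sketch, assuming the RDATA variable is set as described above and falling back to the path quoted in the text otherwise; the helper name data_path is hypothetical.

```python
import os
from pathlib import Path

# Fallback copied from the directory named in the text; RDATA is set by use.sh.
DEFAULT_RDATA = "/work/courses/unix/T/ELEC/E7130/general/r-data"


def data_path(filename: str) -> Path:
    """Return the full path of a data file inside the RDATA directory."""
    base = os.environ.get("RDATA", DEFAULT_RDATA)
    return Path(base) / filename


# Example: build the path of one of the files used in the tasks below.
print(data_path("flows.txt"))
```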

Task 1: Understanding different plots

In the first task, explain the following plots briefly. For each plot, cover for example what the x-axis and y-axis represent, the appropriate usage scenarios, and the limitations.

  1. Autocorrelation plot
  2. Boxplot
  3. Lag plot
  4. Parallel plot
  5. Scatter Matrix Plot
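
To see what these plot types look like before explaining them, they can be drawn from a small synthetic data set. This is only a sketch using the helpers in pandas.plotting; the data, the column names (a, b, c, group) and the figure layout are made up for illustration and are not part of the assignment data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import (
    autocorrelation_plot,
    lag_plot,
    parallel_coordinates,
    scatter_matrix,
)

# Synthetic data (NOT the course data): a random walk and two noise columns.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "a": rng.normal(size=n).cumsum(),
    "b": rng.normal(size=n),
    "c": rng.exponential(size=n),
})
# A class label is needed by the parallel plot to colour the lines.
df["group"] = np.where(df["b"] > 0, "high", "low")

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
autocorrelation_plot(df["a"], ax=axes[0, 0])      # autocorrelation vs. lag
df[["a", "b", "c"]].boxplot(ax=axes[0, 1])        # one box per column
lag_plot(df["a"], lag=1, ax=axes[1, 0])           # y(t+1) against y(t)
parallel_coordinates(df, "group", ax=axes[1, 1])  # one line per row
fig.tight_layout()

# The scatter matrix creates its own figure: one panel per pair of columns.
sm_axes = scatter_matrix(df[["a", "b", "c"]], figsize=(8, 8), diagonal="hist")
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figures.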

Report, task 1:

Task 2: Plot data

In this task, draw various kinds of plots on linear and logarithmic scales, and then analyze them.

Download the file flows.txt, which contains flow lengths in bytes captured from a network, and study the flow length variable using your favorite software.

Provide concise answers to the following sections.

  1. Plot the flow data using:

    Note: Provide the plots including the commands or functions used to plot the data in your report.

  2. Describe the distribution by choosing suitable summary statistics, that is, values that summarize the distribution, such as the mean, median, mode, maximum, and minimum.

    Note: Provide the commands used to get the results as well as explain the reasons for your selections based on the information you gathered during the previous section.

  3. Replot the data using logarithmic values and explain why and when logarithmic values are more suitable.

Finally, conclude which methods describe the data best and why, and briefly explain how the flow data behaves based on the methods used.
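
The effect of the logarithmic transform on summary statistics can be illustrated with a small sketch. It uses synthetic, heavy-tailed values as a stand-in for flow lengths (the lognormal parameters are arbitrary assumptions, not properties of flows.txt), and the helper names summarize and skewness are hypothetical.

```python
import numpy as np

# Synthetic stand-in for flow lengths in bytes (NOT the course data).
rng = np.random.default_rng(1)
flows = rng.lognormal(mean=7.0, sigma=2.0, size=10_000)


def summarize(x):
    """Return a few summary statistics of a one-dimensional sample."""
    x = np.asarray(x, dtype=float)
    return {
        "min": x.min(),
        "median": float(np.median(x)),
        "mean": x.mean(),
        "max": x.max(),
    }


def skewness(x):
    """Sample skewness: the mean of the cubed standardized values."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


linear = summarize(flows)
logged = summarize(np.log10(flows))

print("linear scale:", linear, "skewness:", skewness(flows))
print("log10 scale: ", logged, "skewness:", skewness(np.log10(flows)))
```

In this synthetic sample a few very large values pull the mean far above the median on the linear scale, while on the logarithmic scale the two are close and the skewness is near zero.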

Report, task 2

Tips:

Task 3: Link loads

In task 3, produce different kinds of plots that are useful for analyzing properties of network data such as stability and correlation.

Download the files linkload-*X*.txt, which contain link load information (in bits per second) for different links at one-second intervals.

  1. Plot the data of each link through:
  2. Inspect the results, especially for stability and for whether previous values contribute to the present value (short-range and long-range memory).

  3. Explain your own understanding of each data set (i.e. each link).
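
Whether previous values contribute to the present value can be checked numerically with the sample autocorrelation function. The sketch below compares white noise with a first-order autoregressive series; both are synthetic (not the link load files), the coefficient 0.9 is an arbitrary assumption, and sample_acf is a hypothetical helper name.

```python
import numpy as np


def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


rng = np.random.default_rng(2)
n = 5000
noise = rng.normal(size=n)  # no memory: each value is independent

# AR(1) series: each value depends on the previous one (memory).
ar1 = np.empty(n)
ar1[0] = noise[0]
for t in range(1, n):
    ar1[t] = 0.9 * ar1[t - 1] + noise[t]

acf_noise = sample_acf(noise, 20)
acf_ar1 = sample_acf(ar1, 20)

print("lag-1 autocorrelation, noise:", round(acf_noise[1], 3))
print("lag-1 autocorrelation, AR(1):", round(acf_ar1[1], 3))
```

For the noise series the autocorrelation drops to about zero right after lag 0, whereas for the AR(1) series it decays gradually over many lags.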

Tips:

Report, task 3

Task 4: Pairs plot

In this task, draw a pairs plot of the variables in the data set to examine the correlations and relationships between them.

Download the bytes.csv data set, which contains time series data in four relevant columns: transmitted bytes, received bytes, transmitted packets, and received packets.
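
The relationships that a pairs plot shows visually can also be checked numerically with a correlation matrix. The following sketch uses synthetic columns as stand-ins; the names tx_bytes, rx_bytes, tx_packets and rx_packets, the packet sizes and the noise levels are all assumptions for illustration and do not describe bytes.csv.

```python
import numpy as np

# Synthetic stand-ins for the four columns (NOT the course data).
rng = np.random.default_rng(3)
n = 1000
tx_packets = rng.poisson(lam=200, size=n).astype(float)
rx_packets = rng.poisson(lam=150, size=n).astype(float)
# Bytes follow the packet counts, plus some noise.
tx_bytes = tx_packets * 800 + rng.normal(scale=5000, size=n)
rx_bytes = rx_packets * 600 + rng.normal(scale=5000, size=n)

names = ["tx_bytes", "rx_bytes", "tx_packets", "rx_packets"]
data = np.vstack([tx_bytes, rx_bytes, tx_packets, rx_packets])

# np.corrcoef treats each row as one variable and returns the
# matrix of pairwise Pearson correlation coefficients.
corr = np.corrcoef(data)

for name, row in zip(names, corr):
    print(f"{name:>10}", np.round(row, 2))
```

Each off-diagonal entry corresponds to one scatter panel of the pairs plot: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate none.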

  1. Draw the pairs plot for these values.

  2. Answer the following questions:

Tips:

Report, task 4

Task 5: Understanding time series concepts

For this task, visualize the data set by creating a time series plot, which helps in understanding the patterns and trends over time.

Download the querytime.csv data set, which records the query time to a distant website whose server is located in Belgium.
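
Trend and periodic patterns in a time series are often easier to see after smoothing. The sketch below applies a moving average and a straight-line fit to a synthetic series; the trend, period and noise level are invented for illustration and say nothing about querytime.csv, and moving_average is a hypothetical helper name.

```python
import numpy as np


def moving_average(x, window):
    """Simple moving average; the result has len(x) - window + 1 points."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")


# Synthetic query times in ms (NOT the course data):
# slow upward trend + periodic component + noise.
rng = np.random.default_rng(4)
t = np.arange(600)
series = (
    50
    + 0.02 * t
    + 5 * np.sin(2 * np.pi * t / 60)
    + rng.normal(scale=2, size=t.size)
)

# A window equal to the period averages the periodic component out.
smooth = moving_average(series, window=60)

# A least-squares straight line gives a rough estimate of the trend.
slope, intercept = np.polyfit(t, series, deg=1)

print("estimated trend per sample:", round(slope, 4))
print("smoothed first/last value:", round(smooth[0], 2), round(smooth[-1], 2))
```

Plotting series and smooth on the same axes shows the raw fluctuations together with the underlying trend.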

  1. Plot the time series.

  2. By observing and analyzing the plot, answer the following questions:

Report, task 5

Grading standard

To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get a maximum of 15 points.

You can get up to 30 points for this assignment:

Task 1

Task 2

Task 3

Task 4

Task 5

The quality of the report (bonus 2p)

Assignment instructions

Your submission must contain the following (please do not include the original data in your submission):

Your report must have: