To complete this assignment, students need prior knowledge of R or Python, basic statistics, and the libraries available for processing, analyzing, and plotting data.
If you lack the relevant skills, you may want to:
Take a look through the related slides.
Refer to earlier assignments, where you can learn some R and Python.
Read the supporting materials.
At the end of this assignment, students should be able to
This assignment covers the main topics related to probability distributions: fitting different distributions to data, validating the fitted model, and the first steps in sampling, so that students become aware of its utility and computational simplicity. This assignment contains three tasks:
Thus, students must understand how to find the distribution of an unknown dataset by fitting and related techniques, and learn the importance, advantages, and disadvantages of using samples taken from a dataset.
All data can be found in the sampling-data.zip archive located in the directory /work/courses/unix/T/ELEC/E7130/general/r-data or, using the environment variable, $RDATA/sampling-data.zip (extracted into the directory $RDATA/sampling/) on Aalto IT computers.
Note: You may need to run the command kinit before accessing the directory to avoid permission issues.
$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
$ ...
In the first task, answer the following questions:
What is sampling in statistics, and how does it help us understand data distributions?
Choose at least three distributions from the following options and explain their respective parameters and typical applications.
What are the components of the following goodness-of-fit plots used to validate a model?
The present task addresses the modeling of measurement data with distributions.
Note: There are several benefits to finding a suitable distribution to fit the data. For example, a distribution concisely describes the underlying data values and can also be used to generate new data when a larger dataset is needed. Furthermore, some learning algorithms assume that the data follow a particular distribution, so knowing the distribution can help us understand the low-level details of how those algorithms work.
Download the following three datasets, each drawn from one of the distributions presented in the lectures:
distr_a.txt
distr_b.txt
distr_c.txt
Study each dataset to choose a good distribution for it.
Tips:
- Useful Python functions could be distfit().
- Useful R functions could be fitdist().
For each dataset:
Reminder: Document the process and operations.
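The fitting and validation workflow described in this task can be sketched as follows. This is only an illustrative sketch that uses the Python standard library instead of distfit(): it fits two candidate distributions (normal and exponential, chosen here purely as examples) by maximum likelihood, compares them with the Akaike information criterion (AIC), and computes the Kolmogorov-Smirnov distance between the data and each fitted CDF. The synthetic data, the helper names fit_candidates and ks_statistic, and the choice of candidates are assumptions made for the example; they are not part of the assignment data or of any library.

```python
import math
import random
import statistics


def fit_candidates(data):
    """Fit a normal and an exponential model by maximum likelihood.

    Returns a dict mapping model name to its parameters, AIC, and CDF.
    """
    n = len(data)
    mean = statistics.fmean(data)

    # Normal: the MLE is the sample mean and the population standard deviation.
    sigma = statistics.pstdev(data, mean)
    ll_normal = (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
                 - sum((x - mean) ** 2 for x in data) / (2 * sigma ** 2))

    # Exponential: the MLE of the rate is 1 / mean (assumes non-negative data).
    rate = 1.0 / mean
    ll_exp = n * math.log(rate) - rate * sum(data)

    return {
        "normal": {
            "params": {"mu": mean, "sigma": sigma},
            "aic": 2 * 2 - 2 * ll_normal,  # two fitted parameters
            "cdf": statistics.NormalDist(mean, sigma).cdf,
        },
        "exponential": {
            "params": {"rate": rate},
            "aic": 2 * 1 - 2 * ll_exp,  # one fitted parameter
            "cdf": lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0,
        },
    }


def ks_statistic(data, cdf):
    """Largest gap between the empirical CDF of data and a model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Synthetic stand-in for a dataset such as distr_a.txt (an assumption).
rng = random.Random(42)
data = [rng.expovariate(0.5) for _ in range(2000)]

results = fit_candidates(data)
best = min(results, key=lambda name: results[name]["aic"])
for name, r in results.items():
    print(name, r["params"], "AIC:", round(r["aic"], 1),
          "KS:", round(ks_statistic(data, r["cdf"]), 3))
print("lowest AIC:", best)
```

A lower AIC and a smaller KS distance both point to a better fit, but because the parameters are estimated from the same data, the KS distance should be read as a descriptive measure rather than plugged directly into a standard KS p-value table. Goodness-of-fit plots remain the main validation tool.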
This task provides an opportunity to practice random sampling, analyze the results, and understand the significance of sampling techniques in data analysis.
Download the file flowdata.txt, which contains the following information for a set of flows, as seen before:
Complete the following tasks:
Note: The average packet size of a flow is the number of bytes in the flow divided by its number of packets:
$$ \text{Average packet size of a flow} = \frac{\text{total bytes of the flow}}{\text{total packets of the flow}} $$
Note: The average throughput of a flow is the number of bytes transferred divided by the transfer time, that is, the difference between the arrival times of the last and the first packet.
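The two per-flow metrics defined in the notes above can be computed as in the following minimal sketch. The Flow class and its field names (packets, total_bytes, first_ts, last_ts) are hypothetical and introduced only for illustration; the actual columns of flowdata.txt may be named and ordered differently. Timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    # Hypothetical field names; adapt them to the real columns of flowdata.txt.
    packets: int        # total packets of the flow
    total_bytes: int    # total bytes of the flow
    first_ts: float     # arrival time of the first packet (seconds, assumed)
    last_ts: float      # arrival time of the last packet (seconds, assumed)


def average_packet_size(flow: Flow) -> float:
    """Total bytes of the flow divided by its total packets."""
    return flow.total_bytes / flow.packets


def average_throughput(flow: Flow) -> Optional[float]:
    """Bytes transferred divided by the transfer time, in bytes per second.

    Returns None when the transfer time is zero (for example, a
    single-packet flow), because the throughput is then undefined.
    """
    duration = flow.last_ts - flow.first_ts
    if duration <= 0:
        return None
    return flow.total_bytes / duration


example = Flow(packets=10, total_bytes=15000, first_ts=1.0, last_ts=4.0)
print(average_packet_size(example))   # bytes per packet
print(average_throughput(example))    # bytes per second
```

Flows with zero duration are returned as None here so that they can be filtered out explicitly; how to treat them in the analysis is a choice that should be documented in the report.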
Tips:
- Useful Python functions could be pandas.DataFrame.sample(), pandas.plotting.parallel_coordinates(), matplotlib.pyplot.plot().
- Useful R functions could be sample(), ggparcoord(), plot().
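As a rough illustration of why sample size matters, the sketch below draws simple random samples without replacement and measures how far the sample mean lies from the population mean. It uses only the Python standard library; with pandas, the sampling step would typically be done with DataFrame.sample() instead. The synthetic population and the helper names relative_mean_error and average_error are assumptions made for the example and do not come from flowdata.txt.

```python
import random
import statistics


def relative_mean_error(population, n, rng):
    """Draw n items without replacement and return
    |sample mean - population mean| / population mean."""
    sample = rng.sample(population, n)
    pop_mean = statistics.fmean(population)
    return abs(statistics.fmean(sample) - pop_mean) / pop_mean


def average_error(population, n, repeats, rng):
    """Average the relative error over several independent samples of size n."""
    errors = [relative_mean_error(population, n, rng) for _ in range(repeats)]
    return statistics.fmean(errors)


# Synthetic, heavy-tailed stand-in for a per-flow metric (an assumption).
rng = random.Random(7)
population = [rng.lognormvariate(6.0, 1.0) for _ in range(10_000)]

for n in (10, 100, 1000):
    print(n, round(average_error(population, n, repeats=50, rng=rng), 4))
```

Repeating the draw several times for each sample size, as average_error does, separates the effect of the sample size from the luck of a single draw.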
To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get at most 15 points.
You can get up to 30 points for this assignment:
Task 1
Task 2
Task 3
The quality of the report (bonus 2p)
For the assignment, your submission must contain the following (please do not include the original data in your submission):
The report must have: