To complete this assignment, students are required to know how to use R or Python, basic statistics, and how to leverage the available libraries to process, analyze, and plot data.
If you lack the relevant skills, you may want to
At the end of this assignment, students should be able to:
This assignment contains three tasks covering important topics related to different sampling applications: off-line estimation (collecting samples from a stored data set), on-line estimation (collecting samples in real time, i.e., from streaming data), and stratified random sampling, a sampling technique used for machine learning purposes. Please read all instructions before starting, because it helps to identify work that is common to several tasks.
All data for tasks 1 and 2 can be found in the sampling-data.zip
archive on the assignment page, or at $RDATA/sampling-data.zip
(extracted to the $RDATA/sampling/
directory) on Aalto IT computers. The file for task 3 is located in the /work/courses/unix/T/ELEC/E7130/general/ml-data
directory, also reachable through the MLDATA
environment variable.
The first task aims to familiarize you with sampling, sampling distributions, and the effect of sample size when estimating statistics from a stored data set (off-line sampling).
Download the file sampling.txt,
which contains session inter-arrival times, to study estimation of the mean inter-arrival time with different sample sizes.
Complete the next action points:
Plot the histogram of the original data and compute the mean.
Select 5000 random samples from the original data (i.e., you should have a vector of 5000 values). Plot its histogram and compute the mean.
Select n random elements from the data 10000 times and compute the mean of these n values each time. As a result, you should have a vector of 10000 values, each of which is the mean of n random elements. In this case, we will consider 3 scenarios:
Note: The sampling error of a single mean refers to the discrepancy between a sample statistic (x̄), the average value of the sample, and the corresponding population parameter (μ), the true mean of the entire population. Mathematically,
the sampling error is calculated as:
Sampling error = x̄ − μ
Note: Each mean contained in these vectors represents a different result you could get for your statistic in a random sample, and can be seen as a draw from the sampling distribution of the sample mean for samples of size n.
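The resampling experiment described above can be sketched as follows. This is a minimal illustration, not the required solution: it assumes sampling.txt holds one inter-arrival time per line, and uses synthetic exponential data as a stand-in for the real file so the snippet is self-contained.

```python
import numpy as np

# Stand-in for the real data; replace with np.loadtxt("sampling.txt").
rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=50_000)

def sampling_distribution(data, n, repeats=10_000, rng=None):
    """Draw `repeats` random samples of size n and return the vector of sample means."""
    rng = rng if rng is not None else np.random.default_rng()
    return np.array([rng.choice(data, size=n).mean() for _ in range(repeats)])

for n in (10, 100, 1000):
    means = sampling_distribution(data, n, rng=rng)
    print(f"n={n:4d}  mean={means.mean():.3f}  sd={means.std(ddof=1):.3f}")
```

As n grows, the standard deviation of the vector of means shrinks (roughly as 1/√n), which is exactly the effect the discussion points below ask you to explain.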
Discuss the following points:
- Explain the effects of sample size on the sampling distribution and on the accuracy of the estimate, based on the results (mean and standard deviation) and plots obtained for the different values of n and for the 5000 random samples, with respect to the original data.
- What observations can be made regarding the presence of sampling bias in each scenario?
Tips:
- Useful Python libraries could be pandas, matplotlib, numpy, seaborn, fitdist, scipy, and statistics (for variance()). Besides, sm.qqplot() can be used to plot a Q-Q plot.
- Useful R functions could be hist(), fitdistr(), rnorm(), qqplot(), mean(), sd(), and var().
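As a sketch of the Q-Q plot tip, scipy.stats.probplot serves the same purpose as sm.qqplot(); the vector of sample means here is synthetic placeholder data, not your actual results.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder vector of 10000 sample means (stand-in for your own results).
rng = np.random.default_rng(0)
sample_means = rng.normal(loc=2.0, scale=0.2, size=10_000)

fig, ax = plt.subplots()
stats.probplot(sample_means, dist="norm", plot=ax)  # Q-Q plot against a normal
ax.set_title("Q-Q plot of sample means vs. normal distribution")
fig.savefig("qqplot_means.png")
```

If the points fall close to the reference line, the sampling distribution of the mean is approximately normal.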
Reminder: Add commands that generated the plots and how statistics are computed.
This task demonstrates the effects of high variability in network measurements by estimating means with on-line sampling, i.e., on “real-time” data, whereas the previous task focused on off-line estimation from a stored data set. High variability can, for example, make measurements unpredictable in the long term.
Download the file flows.txt,
which again contains flow lengths in packets and in bytes captured from a network.
Complete the following action points:
Original data
On-line measurement: write a function running_mean
to calculate mean_n, that is, the sample mean of the first n flow lengths in bytes. The function should produce the y-axis values (the sample means) and the x-axis values (the number of flows passed). Hint: For example, if there are 6 flows (flow1, flow2, flow3, and so on) and n is 3, calculate the sample mean of the first 3 flow lengths and record both axis values.
Note: This mimics a kind of on-line measurement; we assume that the flows depart one by one and our estimate of the mean flow size in bytes is updated each time.
Using running_mean,
plot the mean estimate after each flow, i.e., plot the mean statistic of the first n observations as a function of n. Explain your observations with respect to the original data and this scenario.
Suppose that the statistic of interest is the median instead of the mean (running_median)
in an on-line scenario where a measurement system provides you with a large number of samples every second. How would you proceed in the function to calculate median_n?
Draw your conclusions about the means and medians obtained, and the plots generated, for both the on-line scenario and the original data set.
running_mean.
Reminder: Document your operations and justify your answers.
The purpose of the last task is to introduce preparing a data set before choosing or training a model. During this stage it is important to select samples appropriately. One technique is stratified random sampling, where the population is divided into subgroups, known as strata, and a specific number of samples is selected from each subgroup. This ensures a balance of information across subgroups with respect to the chosen feature(s), reducing selection bias and the chance of sampling error, and giving higher accuracy than simple random sampling.
Note: Data pre-processing is the most important step in most machine learning procedures. Without the data in a suitable form, learning would take longer or would simply be impossible for the ML model.
Download the file simple_flow_data.csv,
which contains simplified NetMate output with only 6 columns: source IP address, source port, destination IP address, destination port, protocol number, and duration of the flow (in microseconds).
Notes:
- The file can be found in the directory /work/courses/unix/T/ELEC/E7130/general/ml-data, or through the MLDATA environment variable if you have sourced the use.sh file.
- It is important to treat source and destination IP addresses as non-numerical values; the rest are numerical values.
Write a function to prepare the whole data set through the steps below. Furthermore, you can use the skeleton code skeleton_ml_0.py
to solve the task.
Delete the instances that have empty values
Perform stratified random sampling where:
Encode the non-numerical values, i.e., srcip and dstip.
Standardize the values
Normalize the values between 0 and 1
Return the new data set pre-processed.
Note: In the end, the data set must contain 200 instances, and it would look something like the following (rows shuffled here):
         srcip   srcport     dstip   dstport   proto  duration
109   0.500000  0.835867  0.242857  0.006227  0.3125  0.196619
115   0.500000  0.547144  0.628571  0.000431  0.3125  0.193576
181   0.142857  0.287867  0.157143  0.278581  1.0000  0.964189
87    0.500000  0.751349  0.171429  0.159641  0.3125  0.000003
163   0.500000  0.616890  0.542857  0.006227  0.3125  0.098573
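The steps above could be sketched as follows. This is an illustration only: stratifying on the proto column with an equal allocation per stratum is an assumption made here (the task's own stratification criteria apply instead), and the preprocessing classes come from the scikit-learn tips below.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler

def preprocess(df, n_total=200, strat_col="proto", seed=0):
    """Prepare the flow data set: drop empties, stratified-sample,
    encode, standardize, and normalize to [0, 1]."""
    # 1. Delete the instances that have empty values.
    df = df.dropna()
    # 2. Stratified random sampling: draw the same number from each stratum
    #    (assumed allocation scheme; adapt to the task's actual criteria).
    per_stratum = n_total // df[strat_col].nunique()
    df = (df.groupby(strat_col, group_keys=False)
            .apply(lambda g: g.sample(min(len(g), per_stratum), random_state=seed))
            .copy())
    # 3. Encode the non-numerical columns (srcip, dstip).
    for col in ("srcip", "dstip"):
        df[col] = LabelEncoder().fit_transform(df[col])
    # 4. Standardize the values, then 5. normalize them between 0 and 1.
    values = StandardScaler().fit_transform(df)
    values = MinMaxScaler().fit_transform(values)
    # 6. Return the pre-processed data set.
    return pd.DataFrame(values, columns=df.columns, index=df.index)
```

After MinMaxScaler, every column lies in [0, 1], matching the shape of the example output above.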
Answer the following points:
Tips:
- A useful Python function could be fit_transform().
- Search the documentation of LabelEncoder(), StandardScaler(), and MinMaxScaler() in the scikit-learn library to perform the steps above.
Reminder: Document operations and code used.
To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get a maximum of 15 points.
You can get up to 30 points for this assignment:
Task 1
Task 2
Task 3
- Prepare the data set for machine learning purposes (2p)
- Answer the questions appropriately (4p)
The quality of the report (bonus 2p)
For the assignment, your submission must contain (please do not include the original data in your submission):
Regarding the report, your report must have: