ELEC-E7130 Assignment 6. Distributions and sampling

Markus Peuhkuri

Esa Hyytiä

Seyud Mortezaei

Tran Thien Thi

Weixuan Jiang

César Iván Olvera Espinosa

Prerequisites

  1. To complete this assignment, students need prior knowledge of R or Python and of statistics, and should know how to use the available libraries to process, analyze, and plot data.

    If you lack the relevant skills, you may want to:

    1. Look through the related slides.

    2. Refer to earlier assignments, which introduce some R and Python.

    3. Read the supporting materials.

Learning outcomes

At the end of this assignment, students should be able to

  1. Understand the purpose of using distributions
  2. Know the different probability distributions available
  3. Find the best-fitting distribution for an unknown dataset
  4. Validate the fitted distribution with appropriate plots
  5. Understand how to sample a data set to simplify computations

Introduction

This assignment covers the main topics related to probability distributions: fitting different distributions to data, validating the fitted model, and taking the first steps in sampling, with attention to how sampling simplifies computations. This assignment contains three tasks:

  1. Task 1: Introduction to distribution and sampling
  2. Task 2: Distributions
  3. Task 3: Sampling

Students should thereby learn how to find a distribution that fits an unknown dataset, and understand the importance, advantages, and disadvantages of working with samples taken from a data set.

All data can be found in the sampling-data.zip archive located in the directory /work/courses/unix/T/ELEC/E7130/general/r-data or, using the environment variable, at $RDATA/sampling-data.zip (extracted into the directory $RDATA/sampling/) on Aalto IT computers.

Note: You may run the command kinit before accessing the directory to avoid permission issues.

$ source /work/courses/unix/T/ELEC/E7130/general/use.sh
$ cd $RDATA
$ ls
...

Task 1: Introduction to distribution and sampling

In the first task, answer the following questions:

  1. What is sampling in statistics, and how does it help us understand data distributions?

  2. Choose at least three distributions from the following options and explain their respective parameters and typical applications.

  3. What are the components of the following goodness-of-fit plots used to validate a model?

Report, task 1:

Task 2: Distributions

The present task addresses the modeling of measurement data with distributions.

Note: Finding a suitable distribution to fit the data has several benefits. For example, a distribution concisely describes the underlying data values and can also be used to generate new data when a larger dataset is needed. Furthermore, some learning algorithms assume that the data follow a particular distribution, so knowing the distribution helps in understanding the low-level details of how those algorithms work.

Tips:

  - A useful Python function could be distfit() (from the distfit package).
  - A useful R function could be fitdist() (from the fitdistrplus package).
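As a rough illustration of the fitting step, the sketch below uses only the Python standard library instead of distfit(). It fits three candidate distributions by maximum likelihood and ranks them by the Kolmogorov-Smirnov statistic (the largest gap between the empirical and the fitted CDF). The candidate set, the function names (fit_exponential, fit_normal, fit_lognormal, ks_statistic, rank_fits), and the synthetic data are assumptions made for this example; they are not part of the assignment data.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_exponential(data):
    """MLE for the exponential rate: 1 / sample mean."""
    rate = 1.0 / fmean(data)
    return {"rate": rate}, lambda x: 1.0 - math.exp(-rate * x) if x > 0 else 0.0


def fit_normal(data):
    """MLE for the normal distribution: sample mean and population std."""
    dist = NormalDist(fmean(data), pstdev(data))
    return {"mu": dist.mean, "sigma": dist.stdev}, dist.cdf


def fit_lognormal(data):
    """MLE for the lognormal: fit a normal to log(x). Needs x > 0."""
    logs = [math.log(x) for x in data]
    dist = NormalDist(fmean(logs), pstdev(logs))
    return {"mu": dist.mean, "sigma": dist.stdev}, lambda x: dist.cdf(math.log(x))


def ks_statistic(data, cdf):
    """Largest distance between the empirical CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    return max(
        max((i + 1) / n - cdf(x), cdf(x) - i / n)
        for i, x in enumerate(xs)
    )


def rank_fits(data):
    """Return (ks, name, params) tuples, best (smallest ks) first."""
    candidates = {
        "exponential": fit_exponential,
        "normal": fit_normal,
        "lognormal": fit_lognormal,
    }
    results = []
    for name, fit in candidates.items():
        params, cdf = fit(data)
        results.append((ks_statistic(data, cdf), name, params))
    return sorted(results, key=lambda r: r[0])


# Synthetic, strictly positive data (hypothetical; replace with a real dataset).
rng = random.Random(42)
demo_data = [rng.expovariate(0.5) for _ in range(2000)]
for ks, name, params in rank_fits(demo_data):
    print(f"{name:12s} KS={ks:.4f} params={params}")
```

A small KS statistic alone does not validate a model; it is still worth checking the fit visually, for example with the goodness-of-fit plots discussed in Task 1.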

Report, task 2

For each dataset:

Reminder: Document the process and operations.

Task 3: Sampling

This task provides an opportunity to practice random sampling, analyze the results, and understand the significance of sampling techniques in data analysis.

Download the file flowdata.txt, which contains the following information for a set of flows, as seen before:

Complete the following tasks:

  1. Overview of the data set
  2. Number of bytes against packets

    Note: The average packet size of a flow is the number of bytes in the flow divided by its number of packets:
    $$ \text{average packet size of a flow} = \frac{\text{total bytes of the flow}}{\text{total packets of the flow}} $$

  3. Average throughput

    Note: The average throughput of a flow is the number of bytes transferred divided by the transfer time, that is, the difference between the arrival times of the last and the first packet.
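The two definitions in the notes above can be sketched as follows. The Flow record and its field names (n_packets, n_bytes, first_ts, last_ts) are hypothetical and chosen only for this example; the actual column layout of flowdata.txt may differ. Returning None for flows with zero duration is likewise an assumption of this sketch, since the throughput is undefined when the first and last packet share the same timestamp.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Flow:
    """Hypothetical per-flow record; field names are assumptions."""
    n_packets: int
    n_bytes: int
    first_ts: float  # arrival time of the first packet, in seconds
    last_ts: float   # arrival time of the last packet, in seconds


def avg_packet_size(flow: Flow) -> float:
    """Total bytes of the flow divided by its total packets."""
    return flow.n_bytes / flow.n_packets


def avg_throughput(flow: Flow) -> Optional[float]:
    """Bytes per second over the transfer time, or None if undefined."""
    duration = flow.last_ts - flow.first_ts
    if duration <= 0:
        # For example a single-packet flow: the transfer time is zero.
        return None
    return flow.n_bytes / duration


example = Flow(n_packets=10, n_bytes=15000, first_ts=0.0, last_ts=2.0)
print(avg_packet_size(example), avg_throughput(example))
```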

Tips:

  - Useful Python functions could be pandas.DataFrame.sample(), pandas.plotting.parallel_coordinates(), and matplotlib.pyplot.plot().
  - Useful R functions could be sample(), ggparcoord() (from the GGally package), and plot().
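To get a feel for what random sampling does to a summary statistic, the sketch below draws repeated simple random samples (without replacement) and compares their means with the mean of the full data set. It uses only the standard library; the function name sample_means, the sample sizes, and the synthetic population are assumptions for illustration, not values prescribed by the assignment.

```python
import random
from statistics import fmean, pstdev


def sample_means(population, sample_size, repeats, seed=0):
    """Means of `repeats` simple random samples drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        fmean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]


# Synthetic stand-in for one column of the flow data (hypothetical values).
population = list(range(1, 1001))
print("full mean:", fmean(population))
for size in (10, 100, 500):
    means = sample_means(population, size, repeats=50, seed=1)
    print(f"n={size:4d} mean of sample means={fmean(means):.1f} "
          f"spread={pstdev(means):.1f}")
```

With a pandas DataFrame, a comparable draw can be made with DataFrame.sample(n=..., random_state=...). Larger samples should give means that vary less around the full-data mean, which is the trade-off between accuracy and computational cost that this task explores.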

Report, task 3

Grading standard

To pass this course, you need to achieve at least 15 points in this assignment. If you submit the assignment late, you can get a maximum of 15 points.

You can get up to 30 points for this assignment:

Task 1

Task 2

Task 3

The quality of the report (bonus 2p)

Assignment instructions

For the assignment, your submission must contain the following (please do not include the original data in your submission):

The report must have: