2023-09-13
For this assignment you need to know:
Programming basics (preferably Python)
Especially using an editor or IDE, such as Vim, Visual Studio Code, or Sublime Text 3, to debug and run your code
The use of some basic Linux commands
If you are not very familiar with Linux and Python, you can:
Watch the introductory video for this assignment. (You can find it in the video section of the course.) The video shows how to use the awk
command, which you may wish to apply in the tasks.
Additional information is provided in the supporting documents.
Some code snippets from the above documents are available in an archive for easier use.
An introduction video about data formats.
Suggestions for setting up the environment; see the examples for VS Code and Windows Subsystem for Linux.
At the end of this assignment, students should be able to
This assignment contains three tasks:
awk
For each task, complete the exercises and write the report. In addition to the task-specific questions, describe your solution and include samples of the produced data (a few lines). You may also attach your scripts/programs as a zip archive (submission is instructed separately).
Always make sure you have included all details in your answers and have answered every item.
Note: You can also perform tasks 2 and 3 on your own computer (real or virtual) by downloading the needed files.
Recommendation: Several cheat sheets are available for the tools (
awk
, Python, R) and libraries (pandas, matplotlib), summarizing the most useful syntax, functions, variables, conditions, formulas, and more.
In the first task, answer the following questions:
What is the function of the awk
command? How does it work? Give at least three examples highlighting its usefulness.
Compare the similarities and differences between Python and R, and explain in which situations Python is more suitable and in which situations R is more suitable. Provide three examples for each.
HINT: Consider in terms of programming experience, applications, plotting, or more.
What are three commonly used data analysis libraries in Python and R? Provide a brief description of the functionality of each library.
How would you personally define latency and throughput based on your understanding? Please provide two methods for measuring latency and two methods for measuring throughput.
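To ground your definitions, one simple way to measure latency in practice is to time a TCP handshake. The sketch below is our own illustration (not part of the assignment material); it measures connect time against a throwaway local listener so it is self-contained:

```python
import socket
import time

def tcp_connect_latency(host, port):
    """Approximate latency as the wall-clock time of one TCP handshake.
    (ICMP ping, as used in task 3.1, is an alternative method.)"""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass
    return (time.perf_counter() - start) * 1000.0  # milliseconds

# Demo: connect to a local listening socket so no external server is needed.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))      # port 0: the OS picks a free port
listener.listen(1)
port = listener.getsockname()[1]
rtt_ms = tcp_connect_latency("127.0.0.1", port)
listener.close()
print(f"TCP connect latency: {rtt_ms:.3f} ms")
```

Against a real remote server the same function reports a value close to the network round-trip time.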
awk
For this task, you need to compute statistical values from a large (462 MiB) CSV file called log_tcp_complete
, which reports every TCP connection tracked by the tool tstat
(more information in the documentation).
! The file is a space-separated CSV file with 130 columns and 886467 records (the first line is the header). As it is of significant size, you may not be able to copy it to your Aalto home directory. Tip: use a symbolic link as a shorthand instead of copying.
The course has its folder on the Aalto Linux computers in the directory /work/courses/unix/T/ELEC/E7130/
. The file for this task is under its subdirectory general/trace/tstat/2017_04_11_18_00.out/
.
Note: You may need to type the command
kinit
beforehand to get access to the folder.
Provide the following answers (in addition to the description of your solution):
HINT: Use the awk
command (awk cheatsheet) to process the information requested in the second exercise. You can also use the built-in variable FNR
as a condition; it refers to the record number (typically the line number) in the current file.
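The same skip-the-header, stream-one-pass pattern that awk's FNR enables can be written in Python; the sketch below is only an illustration (the file name and column index are placeholders, not the assignment's actual columns), equivalent in spirit to `awk -F' ' 'FNR > 1 { ... }' file`:

```python
import csv
import os
import tempfile

def column_stats(path, col_index, sep=" "):
    """Stream a large separator-delimited CSV and compute count/min/mean/max
    of one numeric column, skipping the header line (awk: FNR > 1)."""
    n, total = 0, 0.0
    lo, hi = float("inf"), float("-inf")
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=sep)
        next(reader)                      # skip the header record
        for row in reader:
            v = float(row[col_index])
            n += 1
            total += v
            lo, hi = min(lo, v), max(hi, v)
    return {"count": n, "min": lo, "mean": total / n, "max": hi}

# Demo on a tiny space-separated sample written to a temporary file.
sample = "a b c\n1 10 x\n2 30 y\n3 20 z\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write(sample)
stats = column_stats(tmp.name, 1)
os.unlink(tmp.name)
print(stats)  # column 1 holds 10, 30, 20
```

Because the file is read line by line, memory use stays constant even for the 462 MiB trace.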
You must develop code to process two CSV files, one with latency data and another with throughput data, computing basic measurements and plotting basic graphs using Python or R, as a first approach to processing and analyzing the data.
Note: The files can be found in the directory:
/work/courses/unix/T/ELEC/E7130/general/basic_data
.
3.1 Latency data using ping
The file ping_data.csv
contains the latency data with the following information:
Datetime | Server | Transmitted packets | Successful packets | Avg RTT (ms) |
---|---|---|---|---|
1656437401.124931 | 195.148.124.36 | 5 | 5 | 0.92 |
1656438001.463204 | 195.148.124.36 | 5 | 0 | inf |
1656438602.081979 | 195.148.124.36 | 5 | 3 | 0.949 |
… | … | … | … | … |
NOTE: Every row of the CSV file corresponds to one
ping
run that sends 5 ICMP echo requests (packets) to the server 195.148.124.36
, executed every 10 minutes. Moreover, there are some packet losses to consider during computing and plotting.
The CSV file was created from ping
outputs by extracting the parameters relevant to latency and connectivity (as shown in the figure below).
The next figure shows the scenario with packet losses during transmission, which is important for measurement and analysis: when all packets sent were lost, the ‘Avg RTT (ms)’ value is ‘inf’; when only some packets succeeded, the column still holds a value even though not all packets arrived.
! Be aware that some pings received no answer at all and others suffered partial losses.
You need to complete the following points:
HINT: The column ‘Avg RTT (ms)’ only averages the RTTs of the successfully answered packets.
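When parsing this file, the ‘inf’ marker needs explicit handling, since including it in an average would make the result infinite. The sketch below uses only the standard library; the lowercase column names and the comma delimiter are assumptions for illustration, so adjust them to match the actual file:

```python
import csv
import io
import math

# Hypothetical sample mirroring the ping_data.csv table shown above.
sample = """datetime,server,transmitted,successful,avg_rtt_ms
1656437401.124931,195.148.124.36,5,5,0.92
1656438001.463204,195.148.124.36,5,0,inf
1656438602.081979,195.148.124.36,5,3,0.949
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Packet loss across all runs: packets sent minus packets answered.
sent = sum(int(r["transmitted"]) for r in rows)
ok = sum(int(r["successful"]) for r in rows)
loss_pct = 100.0 * (sent - ok) / sent

# float("inf") parses the 'inf' marker directly; math.isfinite() then
# excludes fully-lost runs from the RTT average.
finite = [float(r["avg_rtt_ms"]) for r in rows
          if math.isfinite(float(r["avg_rtt_ms"]))]
mean_rtt = sum(finite) / len(finite)

print(f"loss: {loss_pct:.1f}%  mean RTT: {mean_rtt:.4f} ms")
```

Note that this treats a run with ‘inf’ as 5 lost packets but zero RTT samples, matching the hint above that ‘Avg RTT (ms)’ only covers successful packets.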
3.2 Throughput data using iperf3
The CSV file called iperf_data.csv
contains the throughput data for both the normal and reverse directions, that is, client-server (client sends, server receives) and server-client (server sends, client receives) respectively, from which you will compute and plot the requested measurements.
Datetime | Server | Port | Type | Mode | Sent bitrate (bps) | Sent bytes | Retransmissions |
---|---|---|---|---|---|---|---|
1656437461 | ok1.iperf.comnet-student.eu | 5206 | TCP | 0 | 141364340.31318378 | 176706680 | 23 |
1656437472 | ok1.iperf.comnet-student.eu | 5201 | TCP | 1 | 620303404.7144992 | 775452064 | 13 |
-1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
… | … | … | … | … | … | … | … |
NOTE: Every row of the file corresponds to one
iperf3
run, executed every hour in either normal or reverse mode. The ‘Mode’ column defines the direction: ‘0’ refers to the normal mode (client-server) and ‘1’ to the reverse mode (server-client).
The CSV file was created from the JSON files produced by running iperf3
, extracting the parameters relevant to throughput and connectivity (as shown in the figure below).
It is important to mention that connection issues between client and server sometimes cause an error or failure (as shown in the next figure) when measuring the throughput; such runs are represented with ‘-1’ values in the CSV file.
You must be aware that there are JSON files containing an error, as in the next sample.
You need to complete the following exercises:
Hint: We recommend handling the data sets with DataFrames in either Python or R, since this skill will also be useful later when working with machine learning.
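As a starting point for the DataFrame approach, the sketch below shows one way to drop the ‘-1’ failure rows and split the data by direction with pandas. The lowercase column names and comma delimiter are assumptions for illustration; adjust `read_csv` to the actual file:

```python
import io
import pandas as pd

# Hypothetical sample mirroring the iperf_data.csv table shown above.
sample = """datetime,server,port,type,mode,sent_bitrate_bps,sent_bytes,retransmissions
1656437461,ok1.iperf.comnet-student.eu,5206,TCP,0,141364340.31,176706680,23
1656437472,ok1.iperf.comnet-student.eu,5201,TCP,1,620303404.71,775452064,13
-1,-1,-1,-1,-1,-1,-1,-1
"""

df = pd.read_csv(io.StringIO(sample))

# Failed runs carry -1 in every column; drop them before any statistics.
ok = df[df["datetime"] != -1]

# Mean sent bitrate per direction: mode 0 = normal, mode 1 = reverse.
mean_bps = ok.groupby("mode")["sent_bitrate_bps"].mean()
print(mean_bps)
```

Keeping the failure rows out of the averages, but counting how many there are, also gives you a simple connectivity metric for the report.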
To pass this course, you need to achieve at least 15 points in this assignment. Moreover, if you submit the assignment late, you can get a maximum of 15 points.
You can get up to 30 points for this assignment:
Task 1
Task 2
Task 3
ping
iperf3
The quality of the report (bonus 2p)
For the assignment, your submission must contain (please do not include the original data in your submission):
Regarding the report, your report must have: