ELEC-E7130 Assignment 1. Basic programming and processing data

Markus Peuhkuri

César Iván Olvera Espinosa

Yu Fu

2023-09-13

Prerequisites

For this assignment you need to know:

  1. Programming basics (preferably Python)

    In particular, you should be able to run and debug your code in an editor or IDE such as Vim, Visual Studio Code, or Sublime Text 3

  2. The use of some basic Linux commands

    If you are not very familiar with Linux and Python, the following resources can help:

    1. Watch the introductory video for this assignment. (You can find it in the video section of the course.) The video shows how to use the awk command, which you may want to apply in the tasks.

    2. Additional information is provided in the supporting documents.

    3. Some code snippets from the above documents are available in an archive for easier use.

    4. An introductory video about data formats is also available.

    5. For ways to set up the environment, see the examples for VS Code and Windows Subsystem for Linux.

Learning outcomes

At the end of this assignment, students should be able to

  1. Learn the main tools that can be useful for the course: awk, Python and R
  2. Be aware of the leading libraries useful for statistical plotting and data analysis
  3. Identify the tool most suitable for them
  4. Develop code to process data

Introduction

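This assignment contains three tasks: programming tools (Task 1), processing CSV data using awk (Task 2), and processing throughput and latency data (Task 3).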

For each task, complete the exercises and write the report. In addition to answering the task-specific questions, describe your solution and include samples of the produced data (a few lines). You can also add your scripts/programs as a zip archive (submission is instructed separately).

Always make sure you have included all details in your answers and have answered every item.

Note: You can also perform Tasks 2 and 3 on your own computer (real or virtual) by downloading the files needed.

Recommendation: Several cheat sheets are available for the tools (awk, Python, R) and libraries (pandas, matplotlib). They summarize the most useful information on syntax, functions, variables, conditions, formulas, and more.

Task 1: Programming tools

In the first task, answer the following questions:

  1. What is the function of the command awk? How does the awk command work? Could you give at least three examples highlighting its usefulness?

  2. Compare the similarities and differences between Python and R, and explain in which situations Python is more suitable and in which situations R is more suitable. Provide three examples for each.

    HINT: Consider aspects such as programming experience, applications, and plotting.

  3. What are three commonly used data analysis libraries in Python and R? Provide a brief description of the functionality of each library.

  4. How would you personally define latency and throughput based on your understanding? Please provide two methods for measuring latency and two methods for measuring throughput.

Task 2: Processing CSV data using awk

For this task, you need to compute statistical values from a large (462 MiB) CSV file called log_tcp_complete, which reports every TCP connection that has been tracked by the tool called tstat (more information in the documentation).

Note: The file is a space-separated CSV file with 130 columns and 886467 records (the first line is the header). Because of its size, you may not be able to copy it to your Aalto home directory. Tip: use a symbolic link as a shorthand.

The course folder on the Aalto Linux computers is /work/courses/unix/T/ELEC/E7130/. The file for this task is located under that path in the directory general/trace/tstat/2017_04_11_18_00.out/.

Note: You may need to run the command kinit first to get access to the folder.
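The symbolic-link tip above can be sketched as follows. This is only an illustration in Python: the helper name `link_to` is hypothetical, and the example path in the comment is simply the course folder, subdirectory, and file name given above joined together. On the command line, `ln -s` achieves the same result.

```python
import os
from pathlib import Path


def link_to(target: str, link_name: str) -> Path:
    """Create a symbolic link `link_name` that points at `target`.

    No data is copied, so this works even when the target is too
    large for the home directory quota.
    """
    link = Path(link_name)
    if link.is_symlink():
        link.unlink()  # replace a stale link left by an earlier run
    os.symlink(target, link)
    return link


# Example call (path assembled from the directories named above):
# link_to("/work/courses/unix/T/ELEC/E7130/general/trace/tstat/"
#         "2017_04_11_18_00.out/log_tcp_complete", "log_tcp_complete")
```

After the link exists, the short name can be used wherever the full path would otherwise be typed.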

Provide the following answers (in addition to the description of your solution):

  1. How can you peek at a file if it is too large to fit into memory?
  2. Print the headers (i.e. the first line) of columns 3, 7, 10, 17, 21, 24
  3. Calculate the average of the columns 3, 7, 10, 17, 21, 24
  4. Calculate the percentage of records where column10/column7 exceeds a) 0.01, b) 0.10, c) 0.20 (in other words, the value in column 10 is divided by the value in column 7 for each line and the result must exceed the values indicated)
  5. Calculate the maximum of each column: 3, 9, 17, 23, 31
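The common idea behind these exercises is to read the file one line at a time instead of loading it whole. The sketch below shows that idea in Python for comparison with awk, which the task asks you to use. The function names `peek` and `column_stats` are hypothetical, column numbers are 1-based as in awk, and the selected columns are assumed to contain numeric values.

```python
from itertools import islice


def peek(path, n=5):
    """Return the first n lines; only those lines are read from disk."""
    with open(path) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def column_stats(path, columns):
    """Single pass over a space-separated file whose first line is a header.

    `columns` are 1-based (like awk's $1, $2, ...).
    Returns {column: (mean, maximum)} over the data records.
    """
    sums = {c: 0.0 for c in columns}
    maxs = {c: float("-inf") for c in columns}
    count = 0
    with open(path) as f:
        next(f, None)  # skip the header line
        for line in f:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            for c in columns:
                value = float(fields[c - 1])  # assumes a numeric column
                sums[c] += value
                maxs[c] = max(maxs[c], value)
            count += 1
    if count == 0:
        raise ValueError("no data records found")
    return {c: (sums[c] / count, maxs[c]) for c in columns}
```

Memory use stays constant regardless of file size because only the running sums and maxima are kept, which is the same reason awk handles large files comfortably.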

HINT: Use the awk command (awk cheatsheet) to process the information requested from the second exercise onward. You can also use the built-in variable FNR as a condition; it holds the record number (typically the line number) in the current file.
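To illustrate what a record-number condition such as FNR does, here is a minimal Python sketch of the ratio exercise. The counter from `enumerate` plays the role of FNR, so the header (record 1) can be skipped. The function name `ratio_exceed_percent` is hypothetical, and treating records with a zero denominator as never exceeding a threshold is an assumption of this sketch, not a rule stated in the task.

```python
def ratio_exceed_percent(path, num_col, den_col, thresholds):
    """Percentage of data records where num_col / den_col exceeds each threshold.

    Columns are 1-based, as in awk.
    """
    hits = {t: 0 for t in thresholds}
    total = 0
    with open(path) as f:
        # record_no plays the role of awk's FNR
        for record_no, line in enumerate(f, start=1):
            if record_no == 1:
                continue  # header line
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            total += 1
            den = float(fields[den_col - 1])
            if den == 0:
                # Assumption: an undefined ratio counts in the total
                # but never as exceeding a threshold.
                continue
            ratio = float(fields[num_col - 1]) / den
            for t in thresholds:
                if ratio > t:
                    hits[t] += 1
    if total == 0:
        raise ValueError("no data records found")
    return {t: 100.0 * hits[t] / total for t in thresholds}
```

Whichever tool you use, state in the report how you handled records where the divisor is zero, since that choice changes the percentages.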

Task 3: Processing throughput and latency data

You must develop code to process two CSV files, one with latency data and one with throughput data, to compute some basic measurements and plot basic graphs using Python or R, as a first approach to processing and analyzing the data.

Note: The files can be found in the directory: /work/courses/unix/T/ELEC/E7130/general/basic_data.

3.1 Latency data using ping

HINT: The column ‘Avg RTT (ms)’ considers only the successful RTT measurements
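The effect of the hint can be shown with a small Python sketch: lost probes are excluded from the average and reported separately as loss. The function name `summarize_ping` and the use of `None` to mark a lost probe are assumptions for illustration; they do not describe the layout of the course data files.

```python
from statistics import mean


def summarize_ping(rtts_ms):
    """Summarize one ping run; None marks a probe that got no reply."""
    ok = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    return {
        "sent": sent,
        "received": len(ok),
        "loss_percent": 100.0 * (sent - len(ok)) / sent if sent else 0.0,
        # The average is taken over successful probes only.
        "avg_rtt_ms": mean(ok) if ok else None,
    }
```

A consequence is that a run with heavy loss can still show a low average RTT, so the average should be read together with the number of successful measurements.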

3.2 Throughput data using iperf3

Hint: We recommend handling the data set using DataFrames in either Python or R, because this will be useful later when you work with machine learning.

Grading standard

To pass this course, you need to achieve at least 15 points in this assignment. Moreover, if you submit the assignment late, you can get a maximum of 15 points.

You can get up to 30 points for this assignment:

Task 1

Task 2

Task 3

The quality of the report (bonus 2p)

Assignment instructions

For the assignment, your submission must contain (please do not include the original data in your submission):

The report must have: