Final Assignment — From measurements to conclusions

Markus Peuhkuri

Tran Thien Thi

Weixuan Jiang

Yu Fu

2023-10-09

General guidelines for this final assignment

Please note that this assignment may require a substantial amount of time and work, especially if you are not familiar with software-oriented analysis methods and tools, so take this into account when planning your schedule and deadlines!

Some tasks are repetitive, i.e. the same analysis is done for multiple distinct data sets. This is much easier if you create small functions or scripts that simply take different data as input, or even run the whole analysis in one go for all the data. Also note that you can take advantage of scripts and tools used in earlier assignments.
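For example, the repeated per-file analysis can be driven by a single loop. The snippet below is only a sketch: the file names and the awk one-liner are placeholders for your own data sets and analysis scripts.

```shell
# Sketch: run the same per-file summary for several data sets in one go.
# File names are placeholders; replace the awk one-liner with your real
# analysis (or a call to your own script).
mkdir -p results
for f in ps1.csv ps2.csv ps3.csv; do
    [ -f "$f" ] || continue   # skip data sets that are not present
    awk 'END { print FILENAME ": " NR " lines" }' "$f" > "results/${f%.csv}.summary"
done
```

With this structure, adding another data set is just one more name in the list.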

Assessment

Please note that at the review session, the assignment must include at least a draft of most sections, even if final graphs, tables, and conclusions are not yet available.

The final assignment has a total weight of 60% in the final grade and will be graded on a continuous scale from 0 to 100 points, where fewer than 50 points is considered a fail. Both the final assignment and the weekly assignments must be completed successfully (you should get at least grade 1 for each) in order to pass the course.

The assignment is individual work. You may cooperate with others by discussing the tasks - this is in fact encouraged - but all output should be produced by yourself. The detailed scoring rules can be found in the Grading standard section.

Support

The assignment is meant to be individual work, but there are three kinds of support available for the students:

Remember the correct discussion principles: write a descriptive subject in the forums and clearly describe your question or problem. Also describe what you have already tried and where you ran into problems. Course staff monitor the Zulip channels mainly during office hours and may not always be able to respond promptly because of other tasks. For code debugging and quick questions, Zulip works nicely. If you have very long text (more than a few tens of lines of output or commands), use a pastebin service such as dpaste.org, fpaste.org, pastebin.ca, paste.ee, or gist.github.com, or an attachment.

Introduction

This final exercise will cover almost all the concepts taught in this course, ranging from data measurements to deriving results and conclusions from datasets. Upon completing this exercise, students will have a solid understanding of how to obtain the desired data and final results from measured data related to network traffic.

This final assignment contains three main tasks, each with several sub-tasks and final conclusions:

In Task 1, you will capture your own “data set PS,” which you will utilize to solve the required tasks. In Task 2, you will be provided with “data set FS,” which you will need to use to solve the required tasks. In Task 3, you will analyze active measurement data, “data set AS”.

Prerequisites

This exercise requires students to have a good understanding and hands-on experience with all concepts and techniques mentioned so far in this course to properly answer the questions.

More information about available tutorials can be found in the material section of the course web page on MyCourses. There is an ELEC-E7130 Network capture tutorial in the supporting material section.

Task 1: Capturing data

Data set I is obtained by packet capture, so first you will capture packets on your own. The captured data will then be pre-processed in three different ways, so that at the end of pre-processing you have three data sets: PS1, PS2, and PS3. PS1 will contain packets, PS2 will contain flows, and PS3 will contain only TCP connections. All these data sets will be analysed separately in the data analysis phase.

Acquiring packet capture data

The recommended way to get the packet trace is to carry out your own measurements. You will need to use your own computer, or a network to which you have access and permission to perform packet capture.

You can use dumpcap (Wireshark) or tcpdump to capture the data. More information about Wireshark and tcpdump can be found in the material section of the course web page on MyCourses.

The measurement period should be at least two hours long, while a day-long trace is much better: the more data there is, the more interesting it is. You can use your own computer to perform the packet capture. If you do not have a personal computer to do so, ask the course staff for instructions on how to loan a computer that can be used for the capture. As a last resort, you can use some publicly available traces.

For this part, your report must clearly include packet capture metadata:

Data pre-processing

After you have the raw packet data, you need to convert it to a suitable format. The data will be analyzed both at packet level and at flow level.

In the first phase, you can anonymise your traces using the crl_to_pcap utility. This is not mandatory, but if you choose to anonymise the trace, use the anonymised trace consistently in all your analyses to avoid confusion. Note that anonymisation makes geo-locating IP addresses impossible (which can be problematic in task 1.6).

Three data sets will be distilled from the raw data. We refer to these as PS1, PS2, and PS3, respectively.

For this part, your report must include:

Following is the precise structure we need for each dataset:

Cleaning the data packets (PS1)

The pre-processing of PS1 varies based on the specific requirements of the data analysis tasks. To determine what information on individual packets is needed for each section, refer to the required tasks in the data analysis section, and clean the collected data to retain only the relevant columns. In other words, pre-processing PS1 depends on the specific tasks. Remember to thoroughly document the selections made during pre-processing.

Converting packet trace to flow data (PS2)

For the pre-processing of PS2, you have multiple options to convert the captured packets into flow data. You could use the crl_flow utility from the CoralReef package with a time-out of 60 seconds, you could use tstat, or you could use your own script to extract the flow data. Any of these is an effective way to pre-process the PS2 data.

TCP connection statistics (PS3)

Regarding pre-processing of PS3, you can use tcptrace command on your captured file to produce statistics from TCP connections as follows:

tcptrace -l -r -n --csv myown.pcap > myown-tcp.csv

The provided command will generate statistics for each TCP connection observed in the captured file. If you omit the --csv option, you will receive more detailed output (feel free to try it to get an overview of the data items, but keep in mind that the CSV format is easier to parse by programs). For additional information, you can refer to the manual page of the tcptrace command by using man tcptrace.
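Before parsing the CSV programmatically, it helps to locate the retransmission columns by number. The snippet below is a sketch: the field names (tcptrace uses names such as rexmt_data_pkts_a2b) and the exact header layout depend on your tcptrace version, so verify against your own file.

```shell
# Print numbered column names from the CSV header and narrow the list to
# retransmission-related fields. ASSUMPTION: the header is the first line
# and the field names contain "rexmt"; check your own tcptrace output.
if [ -f myown-tcp.csv ]; then
    head -1 myown-tcp.csv | tr ',' '\n' | grep -n -i rexmt
fi
```

The printed line numbers correspond to the column positions you can then use in awk or a spreadsheet.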

Data analysis

Analyse the data set carefully. The minimum requirements are detailed below, but additional plots and insights are welcome. Each plot should contain a short description and descriptive labels for the axes.

Packet data PS1

Flow data PS2

TCP connection data PS3

For the TCP connection statistics, we are interested in retransmissions. Study the association of retransmissions to:

Conclusions

Explain your conclusions for:

Task 2: Flow data

In task 2, we will use data set II, which will be provided to you. First, you need to obtain access to the dataset. Once you have access, pre-process the data set to extract only the relevant subnetwork data. After completing the pre-processing, proceed with the data analysis and work on the required tasks.

Acquiring flow data

Data set II consists of anonymised flow measurements from an access network (if interested, see how they were created in the Network capture tutorial). A sample of users has been selected for the data collection. The time stamps on the flows are given in terms of UNIX epoch time.

This flow data is available at /work/courses/unix/T/ELEC/E7130/general/trace under three directories (please note the file sizes!). After sourcing the use script, the directory is available in the environment variable $TRACE.

Directories contain the following data:

Note: Performing file-handling operations in these directories is not possible with normal user privileges. You will need to redirect all operations to, for example, your home directory, or to the /tmp directory if your home folder does not have enough space. Note that files in the /tmp folder can be deleted at any time, so use it only for intermediate files, not for your code files.

Data pre-processing

The given data set FS1 contains flow data from an entire day, which can be quite large. For your analysis, you do not need to examine the entire data set (except for task 2.3). Instead, you can select one of the three directories that best suits your analysis type. Please focus on a single /24 network from the list below, based on the last digit of your student number. This selected data set will be referred to as FS2.

Subnetwork based on the last digit of student number.
digit subnetwork
0 163.35.10.0/24
1 163.35.158.0/24
2 163.35.94.0/24
3 163.35.139.0/24
4 163.35.138.0/24
5 163.35.93.0/24
6 163.35.92.0/24
7 163.35.250.0/24
8 163.35.235.0/24
9 163.35.116.0/24

As an example, let’s assume you have selected the 1200.t2 file from the tstat-log directory. If you want to extract relevant data for your own network with the IP address range of 192.0.2.0/24, you can use the gawk command as follows:

gawk '$1~/^192\.0\.2\./ || $15~/^192\.0\.2\./' 1200.t2 > ~/my_1200.t2

In this command, the gawk program searches for rows in the 1200.t2 file where the IP address in either the 1st column or the 15th column matches the pattern “192.0.2.”. The matched rows are then saved into a new file named my_1200.t2 in your home directory.

Please note that in tstat-log files, IP addresses can be found in the 1st and 15th fields.

In addition to this, other pre-processing may be needed. Document it in your notes.

Data analysis

After pre-processing, analyse the data set FS2 carefully. The minimum requirements are detailed below, but additional insights and plots supporting them are welcome. Each plot should contain a short description and descriptive labels for the axes.

2.1: Plot traffic volume

Select one of the previous tasks (1.4-1.5, 1.7-1.9) and perform the same analysis for the FS2 data set. This means that you should choose either tasks 1.4 and 1.5, or tasks 1.7, 1.8, and 1.9. Once you have chosen the task, apply the analysis steps to the FS2 data set.

2.2: Per user data volume

Compute the aggregate data volume for each user and draw a histogram to visualise the distribution of the per-user aggregated data. In other words, make one histogram that contains all users; there is no need to distinguish individual users. (A user is one IP address within your assigned subnetwork.)
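One way to compute the per-user aggregate is an awk accumulation over the flow file. The field numbers below are assumptions (field 1 as the client IP, field 9 as a byte count); check the column layout of your own tstat files before relying on them.

```shell
# Sum bytes per client IP and sort by volume, largest first.
# ASSUMPTION: $1 = client IP, $9 = byte count; adjust to your file format.
if [ -f my_1200.t2 ]; then
    awk '{ bytes[$1] += $9 }
         END { for (ip in bytes) print ip, bytes[ip] }' my_1200.t2 \
        | sort -k2 -rn > per_user_volume.txt
fi
```

The resulting two-column file can then be fed to your plotting tool to draw the histogram.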

2.3: Flow sampling

For this task, use FS1 and take ALL flow data into account (i.e., do not limit the scope to your subnetwork).

Make two random selections from all flows by sampling flows from the 24-hour flow data: the first selection includes only IPv4 traffic and the other only IPv6. Define your sampling process such that you get about the same number of flows from this full data set as in your assigned subnetwork. Document your selection process.
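Assuming you have already split the 24-hour data into IPv4-only and IPv6-only flow files (the file names and the target size N below are placeholders), a simple uniform random selection can be drawn with shuf:

```shell
# Draw a uniform random sample of N flows from the IPv4-only file.
# N is a placeholder: set it to roughly the flow count of your subnetwork.
N=1000
if [ -f ipv4_flows.txt ]; then
    shuf -n "$N" ipv4_flows.txt > ipv4_sample.txt
fi
```

Repeat the same command for the IPv6-only file to obtain the second sample.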

Select one of the previous tasks (2.1-2.2) and perform the same analysis for both sampled data sets you just collected. Compare the results to the original task where you used only your subnetwork (FS2). Can you say that the characteristics of your subnetwork are representative? Is there a difference between IPv4 and IPv6?

2.4: Conclusions

Based on the results above, explain your conclusions on data for:

Please feel free to use additional visualisations to support your claims and conclusions if necessary.

Task 3: Analysing active measurements

As a result of the Basic Measurements, you should have at least two weeks' worth of measurement data:

3.1 Latency data plots (AS1.x)

3.2 Latency data time series

3.3 Throughput

3.4 Throughput time series

Conclusion

Discuss your conclusions on Task 3 for at least the following topics:

Final conclusions

After you have completed Tasks 1-3, you are almost done. Based on these tasks, answer the following questions.

Grading standard

To pass this course, you need to achieve at least 50 points in this assignment. If you submit the assignment late, you can get at most 50 points.

You can get up to 100 points for this assignment:

Task 1

Task 2

Task 3

Final conclusion

The quality of the report (bonus 5p)

Assignment instructions

For the assignment, your submission must contain (please do not include the original data in your submission):

Report

You should prepare a report based on your analysis by including all the details of the results in a written report. Submission of the report consists of two phases:

The report should have two parts:

  1. A main document explaining results and findings without technical details. This is like the information that would be given to a customer who hired you to do the analysis.

  2. An appendix containing detailed explanations of what has been done, supplemented by the commands used to get a result or draw a figure, where appropriate. Plain commands, scripts, or code without comments are not sufficient. This is like the information you would hand to a colleague who needs to do a similar analysis for another customer.

    Also include samples of the data sources, such as the 5-10 first relevant lines when appropriate. Do not include full data.

When you are asked to plot or visualise a certain parameter, make sure that your figures are as informative as possible and really visualise the parameter(s) in question, by selecting an appropriate plot type, units, and scales (linear vs. logarithmic, ranges), rather than just plotting some numbers with default settings.

It is recommended to go through the following processes for each dataset:

Address all the sections carefully, in the order in which they appear. Organise your report clearly, using sections for data sets and subsections for pre-processing, analysis, and conclusions of each data set. Always refer to the task number in your report; the easiest way is to use the same numbering scheme in your chapters.

It is recommended that each plot contains a short description and descriptive labels for the axes. Pay enough attention to the conclusions, as they are considered one of the most important parts of the evaluation.

Of course, you need a cover page indicating your name, student ID, and e-mail address.