TP Parallel Programming: TLP⚓
Source code can be found in this repository: Sources
A good introductory resource for OpenMP: Programming parallel computers
OpenMP documentation: Documentation
Exercise I : Warm up⚓
The code below is a simple program that spawns several threads -- the number of which is determined by the environment variable OMP_NUM_THREADS -- each of which prints its thread ID before exiting. The only OpenMP construct here is the #pragma omp parallel section. It tells OpenMP to execute the code contained in the section on all the configured threads. The omp_get_thread_num() and omp_get_num_threads() functions return the executing thread's ID and the total number of threads, respectively.
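In outline, the program looks like this (the exact printed message and variable names may differ from the provided source):

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    // Spawn a team of threads; the team size is controlled by OMP_NUM_THREADS.
    #pragma omp parallel
    {
        // Each thread queries its own id and the size of the team.
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        std::printf("Hello from thread %d out of %d\n", tid, nthreads);
    }
    return 0;
}
```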
Question 1
Before running this code, try to predict the following:
- What happens when OMP_NUM_THREADS \(= 1\)?
- What happens when OMP_NUM_THREADS \(> 1\)?
- In what order will the different lines be printed when OMP_NUM_THREADS \(> 1\)?
- Are you anticipating any issues with this code that would cause it to execute differently from expected (think about sharing)?
The code for this exercise can be compiled and run as:
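For example, with GCC (adjust the compiler, flags, and source file name to your setup):

```bash
g++ -fopenmp -O2 main_1.cpp -o main_1
OMP_NUM_THREADS=4 ./main_1
```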
The code for the following exercises can be compiled with the same command, adjusting the source file name.
Question 2
Verify your hypotheses by running the code!
Exercise II : Parfor⚓
In this exercise you will create an std::vector filled with samples of a signal evaluated at times \(t_n = n\, d_t\), using the skeleton provided for this exercise. The important parts of the code are reproduced here.
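In outline, the serial generation looks like the sketch below; the exact signal formula, function name, and parameters in the provided skeleton may differ:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative serial version: fill v with samples of a sine wave,
// v[n] = sin(2*pi*freq*t_n) with t_n = n*dt.
std::vector<double> generate_signal(std::size_t num_samples, double freq, double dt) {
    const double pi = std::acos(-1.0);
    std::vector<double> v(num_samples);
    for (std::size_t n = 0; n < num_samples; ++n) {
        const double t_n = n * dt;
        v[n] = std::sin(2.0 * pi * freq * t_n);
    }
    return v;
}
```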
Question 3
First, analyze the code:
- What does it do? What do the values stored in the vector correspond to?
- Can it be efficiently parallelized? If so, why? With what technology?
- Can issues similar to Exercise 1 be encountered when parallelizing this code?
Now you can use main_2.cpp, which implements this signal generation along with some other boilerplate. Compile it and run it with:
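For example (same compiler invocation as before, adjusting the file name):

```bash
g++ -fopenmp -O2 main_2.cpp -o main_2 && OMP_NUM_THREADS=4 ./main_2
```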
2.1: Let's speed things up by hand⚓
OpenMP provides several levels of increasingly complex constructs for parallelization. One of the simpler concepts is the OpenMP task. Tasks are portions of code, enclosed in a #pragma omp parallel section, that can be executed independently from each other (no ordering, no shared state). Their independence means that the tasks can trivially be executed in parallel without particular care for ordering or side effects.
Create a #pragma omp single section (executed only once) in which you will define your tasks. This nested structure allows the tasks to be executed by several threads (since it is inside a #pragma omp parallel section), but the tasks must only be created by a single thread (otherwise there would be \(3N\) tasks instead of 3, where \(N\) is the number of threads). Task creation is different from task execution: typically, tasks are created and put into a task pool by a single thread; the other threads then remove and execute tasks from the pool one by one.
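The intended structure is sketched below; fill_range is a hypothetical helper that fills v[begin, end) like the serial loop above, and v, num_samples, freq, dt come from the surrounding code:

```cpp
#pragma omp parallel
{
    #pragma omp single
    {
        // Tasks are created once, by a single thread...
        #pragma omp task
        fill_range(v, 0, num_samples / 2, freq, dt);

        #pragma omp task
        fill_range(v, num_samples / 2, num_samples, freq, dt);
        // ...but they may be executed by any thread of the team.
    }
    // Implicit barrier: all tasks are guaranteed to have finished here.
}
```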
Question 4
Use 2 and 3 OpenMP tasks to accelerate the computation of the vector.
- Is the speedup as expected?
- Analyze the speedup as a function of the number of tasks, the number of threads, and the number of tasks per thread!
- Is the whole vector initialized correctly for several values (even, odd, ...) of num_samples? Check the range of the plot to make sure everything is fine! If there is an issue, why does it occur and how can you fix it?
2.2: Stand on the shoulders of giants⚓
OpenMP provides the #pragma omp for construct for parallelizing for loops. Check section 2.11.4 of the official OpenMP documentation (available here) and use #pragma omp for to compute \(v\).
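For reference, the same loop written with the combined parallel for construct could look like this (variable names follow the earlier sketch and are illustrative):

```cpp
const double pi = std::acos(-1.0);
// Iterations are distributed over the threads of the team.
#pragma omp parallel for
for (std::size_t n = 0; n < num_samples; ++n) {
    const double t_n = n * dt;
    v[n] = std::sin(2.0 * pi * freq * t_n);
}
```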
Question 5
Compare the two parallelization schemes you have implemented (tasks and #pragma omp for). How do they differ in flexibility, ease of use, ...?
Question 6
Now that your code has been parallelized, you must check the efficiency of the parallelization.
- How does the number of threads impact the speed of the program? (Check whether you see the effect of hyperthreading.)
- What happens with a number of OpenMP threads up to the number of cores?
- What happens when you have more OpenMP threads than hardware threads?
Exercise III : Fierce competition⚓
Verify that \(\sum_{i=0}^{10^4} i^3 = 2500500025000000,\) but fast!
Question 7
Using the code provided, verify that
- everything is working as expected without parallelization
- everything is working as expected with parallelization
Question 8
You should have identified issues in the previous step. Try to solve them using partial sums for each thread that are summed together after the end of the for loop.
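A possible shape for this approach is sketched below; the names are illustrative, not the provided code:

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

std::int64_t sum_of_cubes(std::int64_t n_max) {
    // One slot per thread, so each thread accumulates without touching the others.
    std::vector<std::int64_t> partial(omp_get_max_threads(), 0);

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (std::int64_t i = 0; i <= n_max; ++i)
            partial[tid] += i * i * i;
    }

    // Combine the per-thread results after the parallel loop.
    std::int64_t total = 0;
    for (std::int64_t p : partial) total += p;
    return total;
}
```

Note that adjacent slots of partial can share a cache line (false sharing); accumulating into a local variable inside the loop and writing it to partial[tid] once at the end avoids this.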
Question 9
The problem you have identified and solved is a standard one that can be solved using (among other techniques):
- std::atomic;
- OpenMP reductions (see the sketch below);
- #pragma omp atomic;
- ...
Use at least two of these techniques (one at a time) to solve the issues with the original code, and compare their ease of use and the potential performance impact.
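For instance, an OpenMP reduction condenses the whole pattern into a single clause; a minimal sketch:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    std::int64_t total = 0;
    // OpenMP gives each thread a private copy of `total`,
    // accumulates into it, and combines the copies at the end of the loop.
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i <= 10000; ++i)
        total += i * i * i;
    std::printf("%lld\n", static_cast<long long>(total));
    return 0;
}
```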
Exercise IV : Intermission⚓
Now that you have an idea of how to spread the work between threads, you need to learn how to divide the work among them (or how to have OpenMP do it for you...).
We will focus on the following code, in which the call to sleep simulates a call to a function that takes more or less time to complete, depending on the value of its parameter.
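In outline, the loop looks like the sketch below; the iteration count and the exact sleep durations in the provided source may differ:

```cpp
#include <unistd.h>  // sleep

int main() {
    const int n = 10;  // number of simulated work items (assumed)
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Simulates a function whose runtime grows with its parameter:
        // iteration i "works" for i seconds.
        sleep(i);
    }
    return 0;
}
```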
Details on scheduling can be found here.
Question 10
First, analyze the code above and the corresponding code in main_3.5.cpp:
- How do the two codes differ, and why?
- How long should the code take to complete with a single thread?
- How long should the code take to complete with \(N\) threads, ideally?
Now check your hypotheses by compiling and timing the execution with a single thread, and then with multiple threads:
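For example (file names and thread counts are placeholders, adjust them to your setup):

```bash
g++ -fopenmp -O2 main_3.5.cpp -o main_3.5
export OMP_NUM_THREADS=1
time ./main_3.5
export OMP_NUM_THREADS=4
time ./main_3.5
```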
Question 11
Can you explain why the execution time in parallel is harder to predict than in the previous exercises?
One way to improve the situation is to split the work among threads in a more balanced way. Look at the scheduling part of the OpenMP documentation for omp for here, and experiment with the different schedulers.
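A schedule clause is attached directly to the loop construct; in the sketch below, work(i) and n are placeholders and the chunk sizes are arbitrary:

```cpp
// Static: iterations are split into fixed chunks ahead of time.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) work(i);

// Dynamic: each thread grabs the next chunk when it becomes idle,
// which helps when iterations have very different costs.
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; ++i) work(i);

// Guided: like dynamic, but chunk sizes shrink over time.
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; ++i) work(i);
```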
Question 12
Looking at your results:
- Do you always see an improvement?
- Which is the best scheduler for the task at hand and why?
Question 13
Go back to the signal processing exercise and experiment with the different schedulers.
- Are there some cases in which you lose performance?
- Why?
Exercise V : Putting it all together⚓
In this exercise, we will combine the previous results to compute the discrete Fourier transform (DFT) of the signal you computed in Exercise II. You will need to compute the DFT matrix \(W\), with entries \(W_{jk} = \omega^{jk}\) for \(j, k = 0, \dots, N-1\), where \(\omega = \exp\left(-2 \pi \mathrm{i} / N\right)\), and implement a (parallel) matrix-vector product (MVP) \(\hat{v} = W v\).
This way of computing the DFT is extremely inefficient and is only used here for didactic purposes. The fast Fourier transform (FFT) algorithm should be used in any other context!
Question 14
The code provided in main_4.cpp contains all you need to compute the DFT matrix by completing the TODOs.
- How would you store the DFT matrix in memory? What are the alternatives, and what are their respective advantages?
- Choose one of these alternatives and implement it, along with the signal and DFT computation (one possible layout is sketched below).
- Does the DFT graph displayed correspond to what you would expect given your signal?
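One possible layout is to store the \(N \times N\) matrix row-major in a flat std::vector and to parallelize the MVP over rows; a sketch under those assumptions (not the provided skeleton):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Row-major storage: W[j*N + k] = omega^(j*k), omega = exp(-2*pi*i/N).
std::vector<std::complex<double>> dft_matrix(std::size_t N) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> W(N * N);
    #pragma omp parallel for
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t k = 0; k < N; ++k)
            W[j * N + k] = std::polar(1.0, -2.0 * pi * double(j) * double(k) / double(N));
    return W;
}

// Parallel matrix-vector product: each thread handles a block of rows.
std::vector<std::complex<double>> mvp(const std::vector<std::complex<double>>& W,
                                      const std::vector<double>& v) {
    const std::size_t N = v.size();
    std::vector<std::complex<double>> out(N);
    #pragma omp parallel for
    for (std::size_t j = 0; j < N; ++j) {
        std::complex<double> acc{0.0, 0.0};
        for (std::size_t k = 0; k < N; ++k)
            acc += W[j * N + k] * v[k];
        out[j] = acc;
    }
    return out;
}
```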
Exercise VI : Brute force it, try again! (Bonus)⚓
The program in main_5.cpp attempts to find the string that was used to generate the sha512 hash you will provide as input. Hash functions are functions that map an arbitrary-length sequence of bits to a fixed-length sequence of bits. By construction these functions are not injective and do not admit an inverse, which makes them well suited for security applications. One of their most common use cases is password verification: any decent website or service provider will not store your password on their system, as it would be at risk of being divulged if a data breach were to occur; instead, they store a hash of your password. When you send your password for authentication, they compute its hash and compare it to the stored value, which (almost) unequivocally verifies that your password is correct. Hashes can also be used to verify data authenticity, integrity, ..., in addition to applications outside of cryptography, such as hash tables (see std::unordered_map).
When attackers obtain password hashes from a data breach, they usually want to recover the original passwords. Given the non-invertibility of hash functions, they have to use brute force and try many candidate words until they find one that generates the correct hash (more advanced techniques exist to speed up the process).
The objective of this exercise is to recover the original word from the hash using a brute-force approach on the computer at your disposal. You will have to compute the hashes of all the words in the dictionary file using several threads, until you find the correct one.
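A possible shape for the parallel search is sketched below; sha512_hex stands in for whatever hashing helper main_5.cpp actually provides, and the signatures are illustrative:

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Placeholder for the hashing helper provided by main_5.cpp.
std::string sha512_hex(const std::string& word);

// Returns the word whose hash matches `target`, or an empty string if none does.
std::string find_word(const std::vector<std::string>& dictionary,
                      const std::string& target) {
    std::string result;
    std::atomic<bool> found{false};

    #pragma omp parallel for
    for (std::size_t i = 0; i < dictionary.size(); ++i) {
        // We cannot break out of an omp for loop, so skip cheaply once found.
        if (found.load(std::memory_order_relaxed)) continue;
        if (sha512_hex(dictionary[i]) == target) {
            #pragma omp critical
            result = dictionary[i];
            found.store(true, std::memory_order_relaxed);
        }
    }
    return result;
}
```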
To set up the environment, you will download a dictionary of words and passwords specially designed for the kind of task we want to accomplish! The download and decompression might take a bit of time...
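For example (the URL points to the CrackStation wordlist; double-check it against the original instructions):

```bash
wget https://crackstation.net/files/crackstation-human-only.txt.gz
gunzip crackstation-human-only.txt.gz
```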
Question 15
Compare the dictionary you have downloaded to the system dictionary /usr/share/dict/words in terms of size, content, ...
To verify that your code runs, start by generating the hash of a common word:
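For example, with the GNU coreutils sha512sum tool (the -n keeps the trailing newline out of the hash):

```bash
echo -n "zoo" | sha512sum
```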
Then run the program with this hash, using the crackstation-human-only.txt file you have downloaded as the dictionary:
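The exact invocation depends on how main_5.cpp parses its arguments; it is presumably along these lines:

```bash
./main_5 <sha512-hash> crackstation-human-only.txt
```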
Question 16
The serial version of the code should run out of the box. Using it, answer these questions:
- Load the entire dictionary (63941069 words); how much time does it take to find the words azerty and zoo? Is this normal?
- How would you parallelize the brute-force search of the hash?
- With this in mind, you can start implementing the parallel search function.
Question 17
Now that you have a running code, answer these questions:
- How does the computation time scale with the number of threads/words?
- Try loading all the words in crackstation-human-only.txt; are you encountering any issues?
- How much time does it take to find the words azerty and zoo when using your parallel code? Is this optimal? If not, why? How can this be fixed?