TP Parallel Programming: TLP⚓
Source code can be found in this repository: Sources
A good introductory resource for OpenMP: Programming parallel computers
OpenMP documentation: Documentation
Exercise I : Warm up⚓
The code below is a simple program that spawns several threads -- the number of which is determined by the environment variable OMP_NUM_THREADS -- each of which prints its thread ID before exiting. The only OpenMP construct here is the #pragma omp parallel section. It tells OpenMP to execute the code contained in the section on all the configured threads. The omp_get_thread_num() and omp_get_num_threads() functions return the executing thread's ID and the total number of threads, respectively.
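In outline, the program looks like this (the exact printed message and variable names may differ from the provided source):

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    // Spawn a team of threads; the team size is controlled by OMP_NUM_THREADS.
    #pragma omp parallel
    {
        // Each thread queries its own id and the size of the team.
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        std::printf("Hello from thread %d out of %d\n", tid, nthreads);
    }
    return 0;
}
```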
Question 1
Before running this code, try to predict the following:
- What happens when OMP_NUM_THREADS \(= 1\)?
- What happens when OMP_NUM_THREADS \(> 1\)?
- In what order will the different lines be printed when OMP_NUM_THREADS \(> 1\)?
- Are you anticipating any issues with this code that would cause it to execute differently from expected (think about sharing)?
The code for this exercise can be compiled and run as:
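For example, with GCC (adjust the compiler, flags, and source file name to your setup):

```bash
g++ -fopenmp -O2 main_1.cpp -o main_1
OMP_NUM_THREADS=4 ./main_1
```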
The code for the following exercises can be compiled with the same command, adjusting the source file name.
Question 2
Verify your hypotheses by running the code!
Exercise II : Parfor⚓
In this exercise you will create an std::vector filled with samples of a signal evaluated at times \(t_n = n\, d_t\), using the skeleton provided for this exercise. The important parts of the code are reproduced here.
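In outline, the serial generation looks like the sketch below; the exact signal formula, function name, and parameters in the provided skeleton may differ:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative serial version: fill v with samples of a sine wave,
// v[n] = sin(2*pi*freq*t_n) with t_n = n*dt.
std::vector<double> generate_signal(std::size_t num_samples, double freq, double dt) {
    const double pi = std::acos(-1.0);
    std::vector<double> v(num_samples);
    for (std::size_t n = 0; n < num_samples; ++n) {
        const double t_n = n * dt;
        v[n] = std::sin(2.0 * pi * freq * t_n);
    }
    return v;
}
```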
Question 3
First, analyze the code:
- What does it do? What do the values stored in the vector correspond to?
- Can it be efficiently parallelized? If so, why? With what technology?
- Can issues similar to Exercise 1 be encountered when parallelizing this code?
Now you can use main_2.cpp, which implements this signal generation along with some other boilerplate. Compile it and run it with:
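For example (same compiler invocation as before, adjusting the file name):

```bash
g++ -fopenmp -O2 main_2.cpp -o main_2 && OMP_NUM_THREADS=4 ./main_2
```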
2.1: Let's speed things up by hand⚓
OpenMP provides several levels of increasingly complex constructs for parallelization. One of the simpler concepts is the OpenMP task. Tasks are portions of code, enclosed in a #pragma omp parallel section, that can be executed independently from each other (no ordering, no shared state). Their independence means that the tasks can trivially be executed in parallel without particular care for ordering or side effects.
Create a #pragma omp single section (executed only once) in which you will define your tasks. This nested structure allows the tasks to be executed by several threads (since it is inside a #pragma omp parallel section), but the tasks must only be created by a single thread (otherwise there would be \(3N\) tasks instead of 3, where \(N\) is the number of threads). Task creation is different from task execution: typically, tasks are created and put into a task pool by a single thread; the other threads then remove and execute tasks from the pool one by one.
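The intended structure is sketched below; fill_range is a hypothetical helper that fills v[begin, end) like the serial loop above, and v, num_samples, freq, dt come from the surrounding code:

```cpp
#pragma omp parallel
{
    #pragma omp single
    {
        // Tasks are created once, by a single thread...
        #pragma omp task
        fill_range(v, 0, num_samples / 2, freq, dt);

        #pragma omp task
        fill_range(v, num_samples / 2, num_samples, freq, dt);
        // ...but they may be executed by any thread of the team.
    }
    // Implicit barrier: all tasks are guaranteed to have finished here.
}
```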
Question 4
Use 2 and 3 OpenMP tasks to accelerate the computation of the vector.
- Is the speedup as expected?
- Analyze the speedup as a function of the number of tasks, the number of threads, and the number of tasks per thread!
- Is the whole vector initialized correctly for several values (even, odd, ...) of num_samples? Check the range of the plot to make sure everything is fine! If there is an issue, why does it occur and how can you fix it?
2.2: Stand on the shoulders of giants⚓
OpenMP provides the #pragma omp for construct for parallelizing for loops. Check section 2.11.4 of the official OpenMP documentation (available here) and use #pragma omp for to compute \(v\).
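For reference, the same loop written with the combined parallel for construct could look like this (variable names follow the earlier sketch and are illustrative):

```cpp
const double pi = std::acos(-1.0);
// Iterations are distributed over the threads of the team.
#pragma omp parallel for
for (std::size_t n = 0; n < num_samples; ++n) {
    const double t_n = n * dt;
    v[n] = std::sin(2.0 * pi * freq * t_n);
}
```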
Question 5
Compare the two parallelization schemes you have implemented (tasks and #pragma omp for). How do they differ in flexibility, ease of use, ...?
Question 6
Now that your code has been parallelized, you must check the efficiency of the parallelization.
- How does the number of threads impact the speed of the program? (Check whether you see the effect of hyperthreading.)
- What happens with a number of OpenMP threads up to the number of cores?
- What happens when you have more OpenMP threads than hardware threads?
Exercise III : Fierce competition⚓
Verify that \(\sum_{i=0}^{10^4} i^3 = 2500500025000000,\) but fast!
Question 7
Using the code provided, verify that
- everything is working as expected without parallelization
- everything is working as expected with parallelization
Question 8
You should have identified issues in the previous step. Try to solve them using partial sums for each thread that are summed together after the end of the for loop.
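A possible shape for this approach is sketched below; the names are illustrative, not the provided code:

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

std::int64_t sum_of_cubes(std::int64_t n_max) {
    // One slot per thread, so each thread accumulates without touching the others.
    std::vector<std::int64_t> partial(omp_get_max_threads(), 0);

    #pragma omp parallel
    {
        const int tid = omp_get_thread_num();
        #pragma omp for
        for (std::int64_t i = 0; i <= n_max; ++i)
            partial[tid] += i * i * i;
    }

    // Combine the per-thread results after the parallel loop.
    std::int64_t total = 0;
    for (std::int64_t p : partial) total += p;
    return total;
}
```

Note that adjacent slots of partial can share a cache line (false sharing); accumulating into a local variable inside the loop and writing it to partial[tid] once at the end avoids this.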
Question 9
The problem you have identified and solved is a standard one that can be solved using (among other techniques):
- std::atomic;
- OpenMP reductions (see the sketch below);
- #pragma omp atomic;
- ...
Use at least two of these techniques (one at a time) to solve the issues with the original code, and compare their ease of use and the potential performance impact.
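For instance, an OpenMP reduction condenses the whole pattern into a single clause; a minimal sketch:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    std::int64_t total = 0;
    // OpenMP gives each thread a private copy of `total`,
    // accumulates into it, and combines the copies at the end of the loop.
    #pragma omp parallel for reduction(+ : total)
    for (std::int64_t i = 0; i <= 10000; ++i)
        total += i * i * i;
    std::printf("%lld\n", static_cast<long long>(total));
    return 0;
}
```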
Exercise IV : Intermission⚓
Now that you have an idea of how to spread the work between threads, you need to learn how to divide the work among them (or how to have OpenMP do it for you...).
We will focus on the following code, in which the call to sleep simulates a call to a function that takes more or less time to complete, depending on the value of its parameter.
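In outline, the loop looks like the sketch below; the iteration count and the exact sleep durations in the provided source may differ:

```cpp
#include <unistd.h>  // sleep

int main() {
    const int n = 10;  // number of simulated work items (assumed)
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        // Simulates a function whose runtime grows with its parameter:
        // iteration i "works" for i seconds.
        sleep(i);
    }
    return 0;
}
```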
Details on scheduling can be found here.
Question 10
First, analyze the code above and the corresponding code in main_3.5.cpp:
- How do the two codes differ, and why?
- How long should the code take to complete with a single thread?
- How long should the code take to complete with \(N\) threads, ideally?
Now check your hypotheses by compiling and timing the execution with a single thread, and then with multiple threads:
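For example (file names and thread counts are placeholders, adjust them to your setup):

```bash
g++ -fopenmp -O2 main_3.5.cpp -o main_3.5
export OMP_NUM_THREADS=1
time ./main_3.5
export OMP_NUM_THREADS=4
time ./main_3.5
```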
Question 11
Can you explain why the execution time in parallel is harder to predict than in the previous exercises?
One way to improve the situation is to split the work among threads in a more balanced way. Look at the scheduling part of the OpenMP documentation for omp for here, and experiment with the different schedulers.
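A schedule clause is attached directly to the loop construct; in the sketch below, work(i) and n are placeholders and the chunk sizes are arbitrary:

```cpp
// Static: iterations are split into fixed chunks ahead of time.
#pragma omp parallel for schedule(static)
for (int i = 0; i < n; ++i) work(i);

// Dynamic: each thread grabs the next chunk when it becomes idle,
// which helps when iterations have very different costs.
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < n; ++i) work(i);

// Guided: like dynamic, but chunk sizes shrink over time.
#pragma omp parallel for schedule(guided)
for (int i = 0; i < n; ++i) work(i);
```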
Question 12
Looking at your results:
- Do you always see an improvement?
- Which is the best scheduler for the task at hand and why?
Question 13
Go back to the signal processing exercise and experiment with the different schedulers.
- Are there some cases in which you lose performance?
- Why?
Exercise V : Putting it all together⚓
In this exercise, we will combine the previous results to compute the discrete Fourier transform (DFT) of the signal you computed in Exercise II. You will need to compute the DFT matrix \(W\), with entries \(W_{jk} = \omega^{jk}\) for \(j, k = 0, \dots, N-1\), where \(\omega = \exp\left(-2 \pi \mathrm{i} / N\right)\), and implement a (parallel) matrix-vector product (MVP) \(\hat{v} = W v\).
This way of computing the DFT is extremely inefficient and is only used here for didactic purposes. The fast Fourier transform (FFT) algorithm should be used in any other context!
Question 14
The code provided in main_4.cpp contains all you need to compute the DFT matrix by completing the TODOs.
- How would you store the DFT matrix in memory? What are the alternatives, and what are their respective advantages?
- Choose one of these alternatives and implement it, along with the signal and DFT computation (one possible layout is sketched below).
- Does the DFT graph displayed correspond to what you would expect given your signal?
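One possible layout is to store the \(N \times N\) matrix row-major in a flat std::vector and to parallelize the MVP over rows; a sketch under those assumptions (not the provided skeleton):

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Row-major storage: W[j*N + k] = omega^(j*k), omega = exp(-2*pi*i/N).
std::vector<std::complex<double>> dft_matrix(std::size_t N) {
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> W(N * N);
    #pragma omp parallel for
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t k = 0; k < N; ++k)
            W[j * N + k] = std::polar(1.0, -2.0 * pi * double(j) * double(k) / double(N));
    return W;
}

// Parallel matrix-vector product: each thread handles a block of rows.
std::vector<std::complex<double>> mvp(const std::vector<std::complex<double>>& W,
                                      const std::vector<double>& v) {
    const std::size_t N = v.size();
    std::vector<std::complex<double>> out(N);
    #pragma omp parallel for
    for (std::size_t j = 0; j < N; ++j) {
        std::complex<double> acc{0.0, 0.0};
        for (std::size_t k = 0; k < N; ++k)
            acc += W[j * N + k] * v[k];
        out[j] = acc;
    }
    return out;
}
```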
Exercise VI : Brute force it, try again! (Bonus)⚓
The program in main_5.cpp attempts to find the string that was used to generate the sha512 hash you will provide as input. Hash functions are functions that map an arbitrary-length sequence of bits to a fixed-length sequence of bits. By construction these functions are not injective and do not admit an inverse, which makes them well suited for security applications. One of their most common use cases is password verification: any decent website or service provider will not store your password on their system, as it would be at risk of being divulged if a data breach were to occur; instead, they store a hash of your password. When you send your password for authentication, they compute its hash and compare it to the stored value, which (almost) unequivocally verifies that your password is correct. Hashes can also be used to verify data authenticity, integrity, ..., in addition to applications outside of cryptography, such as hash tables (see std::unordered_map).
When attackers obtain password hashes from a data breach, they usually want to recover the original passwords. Given the non-invertibility of hash functions, they have to use brute force and try many candidate words until they find one that generates the correct hash (more advanced techniques exist to speed up the process).
The objective of this exercise is to recover the original word from the hash using a brute-force approach on the computer at your disposal. You will have to compute the hashes of all the words in the dictionary file using several threads, until you find the correct one.
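A possible shape for the parallel search is sketched below; sha512_hex stands in for whatever hashing helper main_5.cpp actually provides, and the signatures are illustrative:

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Placeholder for the hashing helper provided by main_5.cpp.
std::string sha512_hex(const std::string& word);

// Returns the word whose hash matches `target`, or an empty string if none does.
std::string find_word(const std::vector<std::string>& dictionary,
                      const std::string& target) {
    std::string result;
    std::atomic<bool> found{false};

    #pragma omp parallel for
    for (std::size_t i = 0; i < dictionary.size(); ++i) {
        // We cannot break out of an omp for loop, so skip cheaply once found.
        if (found.load(std::memory_order_relaxed)) continue;
        if (sha512_hex(dictionary[i]) == target) {
            #pragma omp critical
            result = dictionary[i];
            found.store(true, std::memory_order_relaxed);
        }
    }
    return result;
}
```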
To set up the environment, you will download a dictionary of words and passwords specially designed for the kind of task we want to accomplish! The download and decompression might take a bit of time...
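For example (the URL points to the CrackStation wordlist; double-check it against the original instructions):

```bash
wget https://crackstation.net/files/crackstation-human-only.txt.gz
gunzip crackstation-human-only.txt.gz
```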
Question 15
Compare the dictionary you have downloaded to the system dictionary /usr/share/dict/words in terms of size, content, ...
To verify that your code runs, start by generating the hash of a common word:
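For example, with the GNU coreutils sha512sum tool (the -n keeps the trailing newline out of the hash):

```bash
echo -n "zoo" | sha512sum
```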
Then run the program with this hash, using the crackstation-human-only.txt file you have downloaded as the dictionary:
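The exact invocation depends on how main_5.cpp parses its arguments; it is presumably along these lines:

```bash
./main_5 <sha512-hash> crackstation-human-only.txt
```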
Question 16
The serial version of the code should run out of the box. Using it, answer these questions:
- Load the entire dictionary (63941069 words); how much time does it take to find the words azerty and zoo? Is this normal?
- How would you parallelize the brute-force search of the hash?
- With this in mind, you can start implementing the parallel search function.
Question 17
Now that you have a running code, answer these questions:
- How does the computation time scale with the number of threads/words?
- Try loading all the words in crackstation-human-only.txt; are you encountering any issues?
- How much time does it take to find the words azerty and zoo when using your parallel code? Is this optimal? If not, why? How can this be fixed?