
4th Year High Performance Computing

Practical Sessions

Session 1 - OpenMP

  1. Write an OMP program that will print
    "Hello world from thread" my_thread_number "of" total_num_threads in parallel on as many threads as available (a C sketch follows after this list). Remember:
    • you need a PARALLEL ... END PARALLEL block (#pragma omp parallel in C/C++)
    • you can use the OMP_GET_NUM_THREADS and OMP_GET_THREAD_NUM functions
  2. Write a serial program that will calculate the sum of the integers 1->N as follows:
    • allocate an array of size N (e.g. N = 1000000)
    • set each value of the array to its index (i.e. array(1)=1, etc)
    • calculate the sum of all the elements of the array using a simple do-loop
    • print out the result of the sum
    Is your code correct? You should know the analytic result: N(N+1)/2!
  3. Now parallelise your program from (2) using OMP. What speedup can you gain? Is your answer correct? A second C sketch follows after this list. Remember:
    • In Fortran, add a PARALLEL DO directive immediately before the start of the do-loop and a matching END PARALLEL DO at the end.
    • In C/C++, add a parallel for directive immediately before the start of the for-loop.
    • consider which variables should be shared, which private, and which need to have a REDUCTION at the end of the do-loop.
    Extension: how can you time this program? It would be good to have a timer per thread, so use the OMP_GET_WTIME() function within a PARALLEL section, before and after the lines of code you wish to time.
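
A minimal C sketch for exercise 1 (an illustrative outline, not the model solution; the Fortran version makes the same two library calls inside a PARALLEL ... END PARALLEL block). Compile with gcc -fopenmp or your compiler's equivalent OpenMP flag.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* every thread in the team executes the body of the parallel region */
    #pragma omp parallel
    {
        printf("Hello world from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}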
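
And a C sketch covering exercises 2 and 3 together (again only an outline): the reduction(+:sum) clause gives each thread a private partial sum that is combined when the loop finishes, and omp_get_wtime() provides a per-thread timer.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    const long long n = 1000000;                 /* N: the analytic answer is N*(N+1)/2 */
    long long *array = malloc(n * sizeof(long long));
    long long sum = 0;

    for (long long i = 0; i < n; i++)
        array[i] = i + 1;                        /* element i holds the value i+1 */

    #pragma omp parallel
    {
        double t0 = omp_get_wtime();             /* per-thread timer */
        #pragma omp for reduction(+:sum)         /* array shared, loop index private, sum reduced */
        for (long long i = 0; i < n; i++)
            sum += array[i];
        printf("thread %d took %f s\n", omp_get_thread_num(), omp_get_wtime() - t0);
    }

    printf("sum = %lld (expected %lld)\n", sum, n * (n + 1) / 2);
    free(array);
    return 0;
}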

Sample solutions to these OpenMP programs are omp_hello.f90 (with a solution also available in C++) and omp_sum.f90 (with a C version also available).

OpenMP Hints

Session 2 - more advanced OMP

Download the file advanced_OMP.tar. DO NOT UNPACK IT USING A GUI as this will cause a permissions problem. Instead, unpack it using

tar -xf advanced_OMP.tar; cd advanced_OMP

This contains all the source material for the following exercises, in both C and F90 formats.

  1. Look at the MolDyn directory - it contains a suite of routines to do a simple MD simulation. Use the Makefile to build a binary by just typing

    make

    Then execute the binary, time how long it takes in serial, and record the correct answer for the output. Each time you change the code, you can rebuild the binary by just typing make. You can also change the compiler flags by editing the FFLAGS or CFLAGS variable in the Makefile.
    • The computational bottleneck is in the force calculation. Try speeding up the code by adding a simple OMP "parallel do" to the i loop. What is the speedup going from 1 -> 8 cores? And beyond? Check the answer is unchanged ...
    • Whilst this shows the parallel speedup potential, the code is now broken. Can you work out why? HINT: look into the i loop and check for race conditions.
    • There are various potential solutions to a race condition, including OMP CRITICAL and OMP ATOMIC. Try them both and see if they fix the problem. How is the parallel speedup affected? (See the first sketch after this list.)
    • Another approach that will work here (Fortran only) is to consider an array reduction - can you work out how to do this WITHOUT needing an ATOMIC or CRITICAL section? What is the parallel speedup now?
  2. Look at the Mandelbrot directory - it contains a simple program to generate a Mandelbrot fractal and measure the area of the figure. Use the Makefile to build a binary, then execute it, time how long it takes in serial, and record the correct answer for the area.
    • Add a simple OMP "parallel do" to speed up the code. What is the speedup going from 1 -> 8 cores? And beyond? Check the answer is unchanged!
    • The problem is that the innermost "while loop" has a very variable amount of work, and this impacts the load balancing. Try changing the OMP SCHEDULE and see if you can make it better. (See the second sketch after this list.)
    • Try packaging the work as an OMP TASK. Does it matter at which level you create the TASK? Try doing it inside the nested i,j loops, then inside the i loop, and then outside the i loop. Can you explain the speedup (or lack of it) that you see?
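
For the MolDyn exercise (1), here is a schematic C toy (deliberately NOT the course's force routine; the arrays and the pair force below are made up) showing why the naive parallel i loop races, and how ATOMIC protects the updates:

#include <stdio.h>

#define N 2000

int main(void)
{
    static double x[N], f[N];
    for (int i = 0; i < N; i++) { x[i] = (double)i; f[i] = 0.0; }

    /* Each (i,j) pair updates BOTH f[i] and f[j]. Once the i loop is    */
    /* parallelised, two threads can write the same element at once:     */
    /* thread A updates f[k] as its "i", while thread B hits f[k] as a   */
    /* "j" partner. The atomic directives serialise just those updates.  */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        for (int j = i + 1; j < N; j++) {
            double fij = 1.0 / ((x[j] - x[i]) * (x[j] - x[i]) + 1.0);  /* toy pair force */
            #pragma omp atomic
            f[i] += fij;
            #pragma omp atomic
            f[j] -= fij;
        }
    }

    printf("f[0] = %f  f[N-1] = %f\n", f[0], f[N - 1]);
    return 0;
}

The Fortran-only alternative in the last bullet replaces the ATOMIC directives with a REDUCTION(+:f) clause on the PARALLEL DO, so each thread accumulates into its own private copy of the force array, which is summed once at the end of the loop.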
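
For the Mandelbrot exercise (2), a self-contained C sketch of where the scheduling clause goes (the grid extents, resolution and normalisation here are my own illustrative choices, not the values in the course's source):

#include <stdio.h>

int main(void)
{
    const int npts = 1000, maxiter = 2000;
    int ninside = 0;

    /* schedule(dynamic) hands out rows as threads become free, which helps */
    /* because points near the set boundary need far more while-loop trips  */
    #pragma omp parallel for schedule(dynamic) reduction(+:ninside)
    for (int i = 0; i < npts; i++) {
        for (int j = 0; j < npts; j++) {
            double cr = -2.0 + 2.5 * i / npts;   /* real part in [-2.0, 0.5] */
            double ci = 1.125 * j / npts;        /* imag part in [0, 1.125]  */
            double zr = 0.0, zi = 0.0;
            int iter = 0;
            while (zr * zr + zi * zi < 4.0 && iter < maxiter) {
                double t = zr * zr - zi * zi + cr;
                zi = 2.0 * zr * zi + ci;
                zr = t;
                iter++;
            }
            if (iter == maxiter) ninside++;      /* point never escaped */
        }
    }

    /* box area * fraction inside, doubled for the symmetry about the real axis */
    printf("area estimate = %f\n", 2.0 * 2.5 * 1.125 * ninside / ((double)npts * npts));
    return 0;
}

For the TASK experiment, the same loop body can be wrapped in #pragma omp task, with the loops driven from inside a #pragma omp single region; creating one task per point, per row, or per block of rows changes the ratio of useful work to task-creation overhead, which is what the final bullet asks you to explore.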

If you want to learn more about OMP loop scheduling, and the differences between STATIC, DYNAMIC and GUIDED, you might be interested in this Python visualization, written by Jacob Wilkins, which uses F2Py to convert an F90 OMP code into Python and the PyQt5 library for the GUI.

Model solutions (in Fortran) are available:

  1. MolDyn forces:
  2. Mandelbrot area:
and in C:
  1. MolDyn forces using CRITICAL: forces.c
  2. Mandelbrot area using TASK: area_task.c
NB It is also straightforward to extend the task-parallel Mandelbrot code to parallelise over the creation of tasks (there are lots of them) and avoid the SINGLE section. This gives an additional speedup and is left as an exercise for the student!

Session 3 - basic MPI

  1. Write an MPI program that will print
    "Hello world from rank" my_rank_number "of" size in parallel on as many nodes as available (see the first sketch after this list). Remember:
    • include appropriate header files
    • call MPI_Init which will create MPI_COMM_WORLD
    • call MPI_Comm_size and MPI_Comm_rank to get the information
    • call MPI_Finalize to shut down MPI
  2. Write an MPI program - Ping-Pong - that will send/receive data as follows (see the second sketch after this list):
    • include appropriate header files etc as before
    • get each rank and total size as before
    • now set my_data to be the rank of the current node
    • use simple blocking sends and receives (i.e. MPI_Send and MPI_Recv) to send my_data from rank 0 to rank 1
    • print out my_data on all nodes
    • then reset my_data to be the current rank
    • and then send it back again!
    • print out final values and shut down nicely
  3. Hello World Extension - force all the processes to write the "Hello world" message in the correct order. HINT: you will need a loop and an MPI_Barrier.
  4. Ping-Pong Extension - Pass the Parcel - if time allows (a skeleton sketch follows after this list)
    • on RANK 0 ONLY - set a random integer 'numpasses' between 0 and 100. This is the time until the music stops! You should consider reusing the random number code from simple_md.f90.
    • use MPI_Bcast to send this time to all other nodes
    • on RANK 0 ONLY - set a random double precision 'parcel' between -1 and +1
    • send the value of 'parcel' from rank 0 to 1. This is the first pass. Then perform multiple hops:
      • rank 1 passes to rank 2
      • rank 2 passes to rank 3
      • ... rank (size-1) passes to rank 0 which then completes the circle.
    • unwrap a layer of the parcel by adding a random number between -1 and +1 to 'parcel'.
    • continue until the number of passes is numpasses.
    • when the last pass is complete, the node currently holding the parcel should print its contents to STDOUT.
    Test your code on 2,4,8 and 16 processes.
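
A minimal C sketch for exercise 1 (an outline only; the Fortran version makes the same calls):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                    /* sets up MPI_COMM_WORLD */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of ranks  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's rank    */

    printf("Hello world from rank %d of %d\n", rank, size);

    MPI_Finalize();                            /* shut down MPI cleanly  */
    return 0;
}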
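
A C sketch for the Ping-Pong exercise (2), assuming it is run on at least two ranks; the message tags and the use of a plain int for my_data are just illustrative choices:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, my_data;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_data = rank;                            /* each rank starts holding its own rank */

    if (rank == 0)                             /* ping: 0 -> 1 */
        MPI_Send(&my_data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&my_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

    printf("rank %d now holds %d\n", rank, my_data);

    my_data = rank;                            /* reset before the return leg */

    if (rank == 1)                             /* pong: 1 -> 0 */
        MPI_Send(&my_data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    else if (rank == 0)
        MPI_Recv(&my_data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);

    printf("rank %d finishes with %d\n", rank, my_data);

    MPI_Finalize();
    return 0;
}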
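
And a skeleton C sketch for the Pass-the-Parcel extension (4), for at least two ranks. The C library rand()/srand() calls stand in for the random number code from simple_md.f90, and the way the ring is driven (every rank loops over the pass count and only the current holder/receiver communicate on each pass) is just one possible organisation:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, numpasses;
    double parcel = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand((unsigned)time(NULL) + rank);          /* crude per-rank seeding; the course suggests reusing simple_md.f90's RNG */

    if (rank == 0) {                             /* rank 0 decides when the music stops */
        numpasses = rand() % 101;                /* 0..100 passes */
        parcel = 2.0 * rand() / RAND_MAX - 1.0;  /* initial parcel in [-1, +1] */
    }
    MPI_Bcast(&numpasses, 1, MPI_INT, 0, MPI_COMM_WORLD);

    for (int pass = 0; pass < numpasses; pass++) {
        int holder = pass % size;                /* who holds the parcel before this pass */
        int next = (holder + 1) % size;          /* ... and who receives it               */
        if (rank == holder) {
            MPI_Send(&parcel, 1, MPI_DOUBLE, next, 0, MPI_COMM_WORLD);
        } else if (rank == next) {
            MPI_Recv(&parcel, 1, MPI_DOUBLE, holder, 0, MPI_COMM_WORLD, &status);
            parcel += 2.0 * rand() / RAND_MAX - 1.0;   /* unwrap a layer */
        }
    }

    if (rank == numpasses % size)                /* whoever holds it when the music stops */
        printf("music stopped at rank %d, parcel = %f\n", rank, parcel);

    MPI_Finalize();
    return 0;
}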

Sample solutions to these MPI programs are mpi_hello.f90, mpi_pingpong.f90 and mpi_passparcel.f90. There are also C versions (mpi_hello.c, mpi_pingpong.c and mpi_passparcel.c) and C++ versions (mpi_hello.cpp, mpi_pingpong.cpp and mpi_passparcel.cpp).

NB You may encounter C++ examples on the internet (and older versions of this course) that use MPI C++ bindings. These were declared obsolete in v2.2 of the MPI standard and removed from v3.0. Hence codes that use these bindings will NOT work with newer versions of the MPI libraries and are best avoided. The recommended C++ approach is to use the C bindings.

Cluster Queues

As your parallel programs become more demanding in this session, you must run on GViking and use the queuing system.

To execute a program on a remote computer, e.g. GViking, you will need to copy your files onto it, e.g.
scp * uid@gviking.york.ac.uk:scratch
and then log on to it:
ssh -X uid@gviking.york.ac.uk
where "uid" is your username. You should move your files onto the scratch partition and only run from there.

If you have never used GViking before then you should see the GViking wiki page for the latest configuration and queuing information.

GViking has 2 x 24 cores per node, with hyperthreading giving an apparent 96 cores per node. Hence you can get a speedup with a large number of threads per node. You must only ever execute programs on GViking using the queuing system - this is to be fair to other users and so that you get exclusive access to the node.

Normally you would just load some modules to get access to the MPI compilers etc, but GViking is a little odd: the login node is a single-core VM on an Intel Haswell CPU, whilst the slave nodes use Intel Xeon Gold 6240 chips which support AVX512 instructions (which the login node does not). Hence code built for the slave nodes with AVX512 instructions will crash if executed on the login node - and this includes the MPI compilers! So we need to tell GViking to use only the AVX2 versions of the modules:

module unuse /opt/apps/easybuild/modules/all
module use /opt/apps/easybuild-avx2/modules/all

and then we can load the appropriate compilers:
module load toolchain/foss
which will then give you access to the gfortran/gcc versions of mpif90/mpicc.

At the moment GViking does not have the Intel compilers, but Viking and other UoY systems do, so there you would do something like
module load toolchain/iccifort
to get access to the Intel Fortran/C/C++ versions of mpif90/mpicc, etc.

You must NOT run any calculations interactively - everything should be done via the queuing system. This is to ensure a "fair share" policy.

MPI Hints

Session 4 - more advanced MPI

In this session you will do some (very simple) physics using a mixture of point-to-point and collective communications. The aim is to compute the total potential energy of a system of 4 beads connected in a ring by springs in 2D as shown:

[springs]
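
A C sketch of one possible decomposition, intended for 4 ranks with one bead per rank. The harmonic spring form 0.5*k*(r - L0)^2, the spring constant, the natural length and the bead coordinates below are all placeholders standing in for the values given in the figure - the point is the communication pattern: a point-to-point exchange of neighbouring positions followed by a collective reduction of the per-spring energies.

#include <stdio.h>
#include <math.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);        /* intended to be 4: one bead per rank */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const double k = 1.0, L0 = 1.0;              /* placeholder spring constant and natural length */
    const double xs[4] = {0.0, 1.0, 1.0, 0.0};   /* placeholder bead positions on a unit square    */
    const double ys[4] = {0.0, 0.0, 1.0, 1.0};
    double mypos[2] = { xs[rank % 4], ys[rank % 4] };
    double nextpos[2];

    /* Point-to-point: send my position to the previous bead in the ring and */
    /* receive the next bead's position (MPI_Sendrecv avoids deadlock).      */
    int next = (rank + 1) % size, prev = (rank - 1 + size) % size;
    MPI_Sendrecv(mypos, 2, MPI_DOUBLE, prev, 0,
                 nextpos, 2, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &status);

    /* Harmonic energy of the spring joining this bead to the next one */
    double dx = nextpos[0] - mypos[0], dy = nextpos[1] - mypos[1];
    double r = sqrt(dx * dx + dy * dy);
    double elocal = 0.5 * k * (r - L0) * (r - L0);

    /* Collective: sum the per-spring energies onto rank 0 */
    double etotal;
    MPI_Reduce(&elocal, &etotal, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total potential energy = %f\n", etotal);

    MPI_Finalize();
    return 0;
}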

Here are model solutions in F90, F2008 and C.