4th Year High Performance Computing
Practical Sessions
- Session 1 - OpenMP
- OpenMP Hints
- Session 2 - more advanced OMP
- Session 3 - basic MPI
- MPI Hints
- Cluster queues
- Session 4 - more advanced MPI
Session 1 - OpenMP
- Write an OMP program that will print
"Hello world from thread"my_thread_number "of" total_num_threads in parallel on as many threads as available. Remember:- need a PARALLEL ... END PARALLEL block
- can use OMP_GET_NUM_THREADS and OMP_GET_THREAD_NUM functions
- Write a serial program that will calculate the sum of the integers 1->N
as follows:
- allocate an array of size N (e.g. N = 1000000)
- set each value of the array to its index (i.e. array(1)=1, etc)
- calculate the sum of all the elements of the array using a simple do-loop
- print out the result of the sum
- Now parallelise your program from (2) using OMP (a minimal C sketch using a reduction follows the solutions note below).
What speedup can you gain? Is your answer correct? Remember:
- In Fortran, add a PARALLEL DO directive immediately before the start of the do-loop and a matching END PARALLEL DO at the end.
- In C/C++, add a parallel for directive immediately before the start of the for-loop.
- consider which variables should be shared, which private, and which need to have a REDUCTION at the end of the do-loop.
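As a rough guide to the first exercise, here is a minimal C sketch (not the course's omp_hello solution; the file name in the comment is just an example) showing the parallel region and the two runtime calls:

/* Minimal sketch, assuming a C compiler with OpenMP support,
 * e.g.  gcc -fopenmp omp_hello_sketch.c                        */
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel   /* C equivalent of PARALLEL ... END PARALLEL */
    {
        int tid      = omp_get_thread_num();    /* this thread's number    */
        int nthreads = omp_get_num_threads();   /* threads in the team     */
        printf("Hello world from thread %d of %d\n", tid, nthreads);
    }
    return 0;
}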
Sample solutions to these OpenMP programs are omp_hello.f90 (a C++ solution is also available) and omp_sum.f90 (a C version is also available).
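For the summation exercise, a minimal C sketch (not the omp_sum solution; the array size and names are illustrative only) of a parallel loop with a reduction might look like this:

/* Minimal sketch: parallel sum of 1..N with a reduction clause.
 * Compile with e.g.  gcc -fopenmp omp_sum_sketch.c              */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long N = 1000000;
    long *array = malloc(N * sizeof *array);
    long sum = 0;

    for (long i = 0; i < N; i++)
        array[i] = i + 1;              /* array(1)=1 etc., 0-based in C */

    /* The loop index is private to each thread, the array is shared, and
     * each thread's partial sum is combined by the reduction at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++)
        sum += array[i];

    printf("sum = %ld (expected %ld)\n", sum, N * (N + 1) / 2);
    free(array);
    return 0;
}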
OpenMP Hints
- To compile a Fortran program with OpenMP use
- ifort -openmp -fpp myOMP.f90
- pathf90 -mp myOMP.f90
- gfortran -fopenmp myOMP.f90
- Depending on system configuration you may not have access to all the compilers. You may find that different versions are available using the module system:
- module avail lists the available modules
- module spider prog shows which modules contain prog
- module load X loads module X
- module purge resets all the loaded modules so you can start again
- module save saves the current set of modules for future use, once you have found a combination that suits
- To set the number of OpenMP threads, use export OMP_NUM_THREADS=1 if using a bash shell or equivalently setenv OMP_NUM_THREADS 1 with a csh shell.
- To execute an OpenMP program on one of the lab computers, just execute it from the command-line. The timing of your results will vary according to the instantaneous load on the PC (use top to see what is running).
- The OpenMP website has full details and resources for all supported languages.
- FORTRAN programmers might like this concise guide to OpenMP v4 for FORTRAN. There is also a longer and more detailed tutorial on OpenMP for F95.
- C/C++ programmers might like this concise guide to OpenMP v4 for C/C++.
Session 2 - more advanced OMP
Download the file advanced_OMP.tar. Do NOT unpack it using a GUI, as this will cause a permissions problem. Instead, unpack it from the command-line using
tar -xf advanced_OMP.tar; cd advanced_OMP
This contains all the source materials for the following exercises, in both C and Fortran 90 format.
- Look at the MolDyn directory - it contains a suite of routines to do a simple MD simulation. Use the Makefile to build a binary by just typing
make
Then execute the binary, time how long it takes in serial, and record the correct answer for the output. Each time you change the code, you can rebuild the binary by just typing make. You can also change the compiler flags by editing the FFLAGS or CFLAGS variable in the Makefile.
- The computational bottleneck is in the force calculation. Try speeding up the code by adding a simple OMP "parallel do" to the i loop. What is the speed up going from 1 -> 8 cores? And beyond? Check the answer is unchanged ...
- Whilst this shows the parallel speedup potential the code is now broken. Can you work out why? HINT: look into the i loop and check for race conditions.
- There are various potential solutions to a race condition, including OMP CRITICAL and OMP ATOMIC. Try them both and see whether this fixes the problem. How is the parallel speedup affected? (A minimal C sketch using ATOMIC appears after this list.)
- Another approach that will work here (Fortran only) is to consider an array reduction - can you work out how to do this WITHOUT needing an ATOMIC or CRITICAL section? What is the parallel speedup now?
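As a rough guide only, here is a minimal, self-contained C sketch of the idea. It is NOT the course's forces code: the toy pair "force" and the array names are made up for illustration. It shows why the force accumulation races between threads and how OMP ATOMIC protects the shared updates:

/* Minimal sketch (NOT the course's forces code): a toy O(N^2) pair-force
 * accumulation that shows the race on the shared force array and one fix
 * using OMP ATOMIC.  Compile with e.g.  gcc -fopenmp -O2 toy_forces.c   */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 2000;
    double *x = malloc(n * sizeof *x);     /* toy 1D positions             */
    double *f = calloc(n, sizeof *f);      /* forces, initially zero       */

    for (int i = 0; i < n; i++)
        x[i] = (double)i / n;

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            double fij = x[j] - x[i];      /* made-up pair "force"         */
            /* Without protection these updates race: the f[i] written by
             * this thread is some other thread's f[j], and vice versa.    */
            #pragma omp atomic
            f[i] += fij;
            #pragma omp atomic
            f[j] -= fij;
        }
    }

    printf("f[0] = %f, f[%d] = %f\n", f[0], n - 1, f[n - 1]);
    free(x);
    free(f);
    return 0;
}

An OMP CRITICAL section around the two updates would also be correct but serialises more work; in Fortran the synchronisation can be avoided entirely with an array REDUCTION on the force array.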
- Look at the Mandelbrot directory - it contains a simple program to generate a Mandelbrot fractal and measure the area of the figure. Use the Makefile to build a binary, then execute it, time how long it takes in serial, and record the correct answer for the area.
- Add a simple OMP "parallel do" to speed up the code. What is the speed up going from 1 -> 8 cores? And beyond? Check the answer is unchanged!
- The problem is that the innermost "while loop" has a very variable amount of work, and this impacts the load balancing. Try changing the OMP SCHEDULE and see if you can make it better (a minimal C sketch using schedule(dynamic) appears after this list).
- Try packaging the work as an OMP TASK. Does it matter at which level you create the TASK? Try doing it inside the nested i,j loops, and then inside the i loop, and then outside the i loop. Can you explain the speedup you see (or not?)
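The following minimal C sketch is not the course's area code - the grid size, bounds and iteration limit are illustrative assumptions - but it shows the shape of the calculation and where a non-static schedule helps:

/* Minimal sketch: grid estimate of the Mandelbrot set area.  The
 * escape-time while loop does wildly different amounts of work per
 * point, so schedule(dynamic) (or guided) usually balances the threads
 * better than the default static schedule.
 * Compile with e.g.  gcc -fopenmp -O2 area_sketch.c -lm                 */
#include <stdio.h>
#include <complex.h>

int main(void)
{
    const int npts = 2000, maxiter = 2000;
    long inside = 0;

    #pragma omp parallel for schedule(dynamic) reduction(+:inside)
    for (int i = 0; i < npts; i++) {
        for (int j = 0; j < npts; j++) {
            double complex c = (-2.0 + 2.5 * i / npts)
                             + (1.25 * j / npts) * I;   /* upper half plane */
            double complex z = 0.0;
            int iter = 0;
            while (cabs(z) <= 2.0 && iter < maxiter) {  /* variable work!   */
                z = z * z + c;
                iter++;
            }
            if (iter == maxiter) inside++;              /* did not escape   */
        }
    }

    /* Cell area times count, doubled because only the upper half was sampled. */
    printf("Estimated area = %f\n",
           2.0 * (2.5 * 1.25) / ((double)npts * npts) * inside);
    return 0;
}

The same loop body can instead be packaged as an explicit task (#pragma omp task, created from inside a parallel region), which is one way to experiment with the task-creation level asked about above.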
If you want to learn more about OMP loop scheduling, and the differences between STATIC, DYNAMIC and GUIDED, you might be interested in this Python visualization written by Jacob Wilkins, which uses F2Py to convert an F90 OMP code into Python and the PyQt5 library for the GUI.
Model solutions are available, in Fortran:
- MolDyn forces:
- using CRITICAL forces_critical.f90
- using ATOMIC forces_atomic.f90
- using array reduction forces_reduction.f90
- Mandelbrot area:
- using loops area_loop.f90
- using TASK area_task.f90
and in C:
- MolDyn forces using CRITICAL forces.c
- Mandelbrot area using TASK area_task.c
Session 3 - basic MPI
- Write an MPI program that will print
"Hello world from rank"my_rank_number "of" size in parallel on as many nodes as available. Remember:- include appropriate header files
- call MPI_Init which will create MPI_COMM_WORLD
- call MPI_Comm_size and MPI_Comm_rank to get the information
- call MPI_Finalize to shut down MPI
- Write an MPI program - Ping-Pong - that will send/receive data as follows:
- include appropriate header files etc as before
- get each rank and total size as before
- now set my_data to be the rank of the current node
- use simple blocking sends and receives (i.e. MPI_Send and MPI_Recv) to send my_data from rank 0 to rank 1 (a minimal C sketch of this pattern appears after these exercises)
- print out my_data on all nodes
- then reset my_data to be the current rank
- and then send it back again!
- print out final values and shut down nicely
- Hello World Extension - force all the processes to write the "Hello world" message in the correct order. HINT: you will need a loop and an MPI_Barrier (see the sketch after the sample solutions below).
- Ping-Pong Extension - Pass the Parcel - if time allows
- on RANK 0 ONLY - set a random integer 'numpasses' between 0 and 100. This is the time until the music stops!
- use MPI_Bcast to send this time to all other nodes
- on RANK 0 ONLY - set a random double precision 'parcel' between -1 and +1
- send the value of 'parcel' from rank 0 to 1. This is the first pass. Then perform multiple hops:
- rank 1 passes to rank 2
- rank 2 passes to rank 3
- ... rank (size-1) passes to rank 0 which then completes the circle.
- unwrap a layer of the parcel by adding a random number between -1 and +1 to 'parcel'.
- continue until the number of passes is numpasses.
- when the last pass is complete, the node currently holding the parcel should print its contents to STDOUT.
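As a rough guide only, here is a minimal C sketch (not the course's solution; the file name is just an example) of the blocking send/receive pattern used in the Hello World and Ping-Pong exercises - the buffer name my_data follows the exercise text:

/* Minimal sketch (not the course's mpi_pingpong solution): the basic
 * blocking send/receive pattern between ranks 0 and 1.  Build/run with
 * e.g.  mpicc pingpong_sketch.c && mpirun -np 2 ./a.out                */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, my_data;

    MPI_Init(&argc, &argv);                    /* creates MPI_COMM_WORLD  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my_data = rank;                            /* set to the current rank */

    if (rank == 0)
        MPI_Send(&my_data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&my_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("rank %d of %d now holds my_data = %d\n", rank, size, my_data);

    /* ... reset my_data = rank here and send it back from 1 to 0 ...     */

    MPI_Finalize();
    return 0;
}

For the Pass-the-Parcel extension, the broadcast step is a single collective call made on every rank, e.g. MPI_Bcast(&numpasses, 1, MPI_INT, 0, MPI_COMM_WORLD).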
Sample solutions to these MPI programs are mpi_hello.f90, mpi_pingpong.f90 and mpi_passparcel.f90. There are also C versions (mpi_hello.c, mpi_pingpong.c and mpi_passparcel.c) and C++ versions (mpi_hello.cpp, mpi_pingpong.cpp and mpi_passparcel.cpp).
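For the Hello World extension, one possible approach (a minimal sketch, not the model solution) is to loop over ranks and let one rank print per "turn", synchronising with MPI_Barrier between turns:

/* Minimal sketch: ordered output, one rank per barrier-separated turn.
 * Build/run with e.g.  mpicc hello_ordered_sketch.c && mpirun -np 4 ./a.out */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int turn = 0; turn < size; turn++) {
        if (rank == turn)
            printf("Hello world from rank %d of %d\n", rank, size);
        fflush(stdout);                    /* push output before the barrier */
        MPI_Barrier(MPI_COMM_WORLD);       /* everyone waits for this turn   */
    }

    MPI_Finalize();
    return 0;
}

Note that even with barriers the ordering of forwarded stdout is not strictly guaranteed by the MPI standard, but in practice this works on most systems.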
NB You may encounter C++ examples on the internet (and older versions of this course) that use MPI C++ bindings. These were declared obsolete in v2.2 of the MPI standard and removed from v3.0. Hence codes that use these bindings will NOT work with newer versions of the MPI libraries and are best avoided. The recommended C++ approach is to use the C bindings.
MPI Hints
- To compile an MPI program use the wrapper scripts:
- Fortran mpif90 myMPI.f90
- C mpicc myMPI.c
- C++ mpic++ myMPI.cpp
- To compile a Fortran 2008 (or later) program that uses the mpi_f08 module (which gives full type checking), you will need an appropriate MPI library (e.g. a recent OpenMPI) and you may need to change the compiler name, e.g.
- F08 mpifort myMPI.f08
- You cannot execute MPI programs on any of the lab desktop computers.
- If you have never used Viking before then you ought to download viking.modulerc, edit it as required, add its contents to the end of the .bashrc file in your home directory, and read the Viking docs page.
- IBM provide a useful MPI guide in PDF format.
- The Software Carpentry team also have nice multi-language web-based tutorials available.
Cluster Queues
As your parallel programs become more demanding in this session, you must run on Viking and use the queuing system.
To execute a program on a remote computer, e.g. Viking, you will need to copy your files onto it, e.g.
scp * uid@viking.york.ac.uk:scratch
and then log on to it:
ssh -X uid@viking.york.ac.uk
where "uid" is your username. You should move your files onto the scratch partition and only run from there.
If you have never used Viking before then you should see the Viking docs page for the latest configuration and queuing information.
Viking has 2x 48 cores/node with hyperthreading and so an apparent 96 cores/node. Hence you can get a speedup with a lot of threads/node. You must only ever execute programs on Viking using the queuing system - this is to be fair to other users and for you to get exclusive access to the node.
You must NOT run any calculations interactively - everything should be done via the queuing system. This is to ensure a "fair share" policy:
- To execute an MPI program on Viking, edit the run_mpi_viking.sh script to point to your own directories and executable name, and then submit it to the queuing system using sbatch run_mpi_viking.sh.
- On Viking, you can only submit jobs from the local /scratch disk and not the shared $HOME disk. You should have a directory /scratch/uid where uid is your username and you should put your run_mpi_viking.sh script and your compiled binary (and any necessary input files) in that directory, and submit from that directory.
- You can also use Viking for OpenMP programs, and use the queuing system in order to get systematic timings. You can use the run_omp_viking.sh script as a starting point.
- You submit your job to the queuing system using sbatch run_mpi_viking.sh etc. Your program will then execute on the slave nodes when there is a sufficient number free - in the meantime your job will be queued.
- If you try to request too many nodes or an inappropriate amount of time, you will get the error message sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit). If this happens, you should edit your script to request fewer resources and resubmit it.
- You can examine the state of the queues using squeue to see whether your job is still waiting (state "PD", pending) or which slave nodes your job is executing on (state "R", running), etc. You can examine the status of all your jobs using sacct to see which jobs have succeeded and which have failed. You can also use sinfo to learn about all the different queues and their limits.
- If your program hangs, then you can delete it from the queuing system using scancel <jobid> where <jobid> is the queue job id number.
- Once your program has finished you will find a new file in your directory: MPI_test_<jobid>.log where <jobid> is the queue job id number. This is the output of the submission script, the result of your program including anything written to stdout (the screen), and any system messages (including efficiency information). You can examine these files using more MPI_test_<jobid>.log to see what has happened.
Session 4 - more advanced MPI
In this session you will do some (very simple) physics using a mixture of point-to-point and collective communications. The aim is to compute the total potential energy of a system of 4 beads connected in a ring by springs in 2D, as follows:
- Assume the spring constant k=1 for all springs, and the natural length of each spring is zero. Hence the energy stored in any one spring is 0.5*x^2, where x is the length of that spring (the distance between the two beads it connects).
- Use a distributed data paradigm for modelling the ring. In this instance
there will be four beads and your code is to run on four processes.
Each process is initially only aware of the two-dimensional co-ordinates
for a SINGLE bead. You will need a 2-element array
r0 on each process with initial values as follows:
- r0 = (0.0,0.0) on process rank 0.
- r0 = (1.0,1.0) on process rank 1.
- r0 = (0.5,0.5) on process rank 2.
- r0 = (0.2,0.7) on process rank 3.
- Now each bead can calculate part of its PE by sending its own co-ordinates clockwise to its left neighbour and receiving the co-ordinates of its right neighbour. This can then be used for part of the local_energy. Then repeat this in the anti-clockwise direction, sending to the right and receiving from the left. You can now complete the local_energy calculation. Beware deadlock and memory leaks. (A minimal C sketch of a ring exchange appears after these exercises.)
- Now compute the PE of the entire ring using MPI_Reduce and store result in energy on rank 0 (the root node).
- The correct answer should be 1.58 ...
- If time allows, experiment with using a periodic communicator to automatically handle the "edge cases": use MPI_Cart_create to create a 1D periodic Cartesian communicator, then use MPI_Cart_shift to automatically calculate the source/destination ranks for each operation. NB - Cartesian communicators are numbered from zero, hence a 1D ring has shifts in dimension 0.
- You might like to look at this F90 code for a demonstration of using Cartesian communicators.
- You could also consider generalising your code to an arbitrary number of processes and beads.
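As a rough guide, here is a minimal, self-contained C sketch of one possible approach. It is NOT the model solution: it assigns each spring to the process on its left, so only one exchange direction is needed, and it uses MPI_Sendrecv, which combines the send and receive into a single deadlock-free call. The file name is just an example.

/* Minimal sketch (NOT the model solution): each process owns one bead
 * and the spring to its right-hand neighbour (k = 1, natural length 0).
 * Build/run with e.g.  mpicc ring_sketch.c && mpirun -np 4 ./a.out     */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const double coords[4][2] = { {0.0, 0.0}, {1.0, 1.0},
                                  {0.5, 0.5}, {0.2, 0.7} };
    double r0[2], right[2], local_energy, energy;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 4) {
        if (rank == 0) fprintf(stderr, "Run with exactly 4 processes\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    r0[0] = coords[rank][0];                   /* this process's single bead */
    r0[1] = coords[rank][1];

    int left      = (rank - 1 + size) % size;  /* neighbours on the ring     */
    int rightrank = (rank + 1) % size;

    /* Send my coordinates to the left neighbour while receiving my right
     * neighbour's coordinates - one combined, deadlock-free call.           */
    MPI_Sendrecv(r0, 2, MPI_DOUBLE, left, 0,
                 right, 2, MPI_DOUBLE, rightrank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    double dx = right[0] - r0[0], dy = right[1] - r0[1];
    local_energy = 0.5 * (dx * dx + dy * dy);  /* 0.5*x^2 for "my" spring    */

    /* Sum the per-process energies onto rank 0 (the root).                  */
    MPI_Reduce(&local_energy, &energy, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) printf("Total PE = %f\n", energy);   /* should be 1.58    */

    MPI_Finalize();
    return 0;
}

If you try the Cartesian-communicator extension, MPI_Cart_create with periods[0] = 1 followed by MPI_Cart_shift(comm, 0, 1, &source, &dest) returns the same neighbour ranks without the modulo arithmetic.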
Here are model solutions in F90, F2008 and C.