4th Year High Performance Computing

Weekly Problems

Each week you will be set a 'practical challenge' that is to be submitted via the VLE by 4pm Friday. It will be marked and returned via the VLE. You can find your personal mark & feedback by clicking on your name at the very top RHS of the VLE screen and then click 'My Marks'. A model solution will also be posted on the VLE. Please read and absorb.

For each question you should submit your answer as a PDF or DOC formatted file (NOT ODT or similar) and a source code file if appropriate. Please name your uploaded files as QX_abc123.doc and QX_abc123.c, etc. where 'X' is the problem number and subsitute your username for 'abc123'. You may also submit a selection of files as a single TAR or ZIP file (NOT RAR or similar), with the same naming convention for the archive AND the included files.

NB The questions are also repeated on the VLE module website and the VLE Y4 problem site - in case of discrepancy use the version on the VLE module website as the definitive version. Submission is ONLY via the VLE.

Q0) Write a program in Fortran or C or C++, to calculate a value for pi by evaluating this function [pi series formula]
with different values of N. What factors affect the accuracy of the value of pi that you calculate? What is the most accurate value of pi you can get, and what is the corresponding value of N?

Submit your answer using the submission point on the VLE. You should include a document containing a discussion of these issues, a code listing and your numerical results (either Word DOC or PDF format with embedded fonts - other formats are not acceptable) AND a separate file containing your source code. Please name all your files in the style Q0_abc123.doc and Q0_abc123.c etc with your username substituted for abc123 etc.

Q1) Consider two arrays of data A and B each containing 1024 4-byte numbers. Assume that B is stored immediately after A in main memory (A and B are contiguous). The following operation is to be performed:

FORTRAN

do i = 1,1024
A(i) = A(i) + B(I)
end do

C

for (i=0;i<1024;i++){
A[i] = A[i] + B[i];
}

Explain (with a diagram if needed) why this sequence of operations will suffer poor performance on a machine with a 4kB (= 4096 bytes) direct mapped cache. (Hint: make sure you understand the different types of cache policy). What changes could be made to the hardware to overcome the problem? What changes could be made to the algorithm to overcome the problem?

b) A simple computer is built with both L1 and L2 caches. The CPU can obtain data from the L1 cache in 1 clock cycle. If the data is not available, then there is an L1 cache miss, and so it must go to L2 cache which takes a total of 10 clock cycles. If there is an L2 cache miss then it must go to main memory which takes a total of 100 clock cyles.

Analysis of a particular program shows that it requires 1000 memory references. If this generates 40 L1 cache misses and 20 L2 cache misses, what is the average memory access time (in clock cycles)?

Submit your answer using the submission point on the VLE. You should include a document containing a discussion of these issues and your numerical results (either Word DOC or PDF format with embedded fonts - other formats are not acceptable). Please name all your files in the style Q1_abc123.doc with your username substituted for abc123 etc.

Q2) For 2021: the Benchmarks lecture has been moved to much later in the course - now #17 - so this problem should not be attempted until the end of the course!

Download the stream.tar file and unpack in order to create the STREAMS benchmarks, e.g.

tar -xf stream.tar

Compile these simple benchmarking programs on Viking and any other machines to which you have access, using

make

If you need an account on Viking, please ask as soon as possible. If you need help with compiling programs, see the basic introduction to compilers and flags at http://www-users.york.ac.uk/~mijp1/teaching/4th_year_HPC/intro_linux.shtml

Then execute each benchmark:

./fort_stream
./c_stream

You should repeat each benchmark run several times in order to eliminate timing noise.

Compare your results from Viking and any other machines with RELEVANT machines tabulated at the STREAMS website.

Find out as much as you can about the underlying memory architecture (e.g. cat /proc/cpuinfo /proc/meminfo ) in each system and use this to interpret your findings.

If you wish to experiment with compiler flags you should do this by editing the variables CFLAGS and FFLAGS in the Makefile and then doing

make clean && make

and then rerunning the benchmarks as before.

Submit your answer using the submission point on the VLE. You should include a document containing a discussion of these issues and your numerical results (either Word DOC or PDF format with embedded fonts - other formats are not acceptable). Please name all your files in the style Q2_abc123.doc and Q2_abc123.c etc with your username substituted for abc123 etc.

Q3) Download the test_inv.f90 code and compile & execute it using at least 2 different compilers, e.g. gfortran & ifort:

a) Hence determine the number of inverses of x, where (1 < x < 1000) that are not perfectly accurate.

b) Investigate what happens when you change the compiler optimisation settings - can you explain what has happened and why? You can get extra insight by using the "compiler explorer" at godbolt.org. Explain the effect on the number of improper inverses of commenting out the print statement at line 14.

c) If you change the precision of all the real variables, e.g. by changing line 3 to
integer, parameter :: my_kind=kind(1.d0)
then how do your answers to 'a' and 'b' change? Explain what is happening and why.

d) Write your own program to determine the number of digits of precision available with both kind(1.0) and kind(1.d0) [Fortran] or float and double [C/C++] . Include a listing of your code in the doc AND attach your code as a separate file. State clearly what the final results are.

Submit your answer using the submission point on the VLE.

Q4) For 2021: optional question - not set for assessment. Download the simple_md.tar file and unpack it, e.g.

tar -xf simple_md.tar

This contains two versions of a very basic serial MD program, one in Fortran and one in C. Build either

Fortran: make simple_md
or C: make simple_md_C

Your task is to profile either version of this program, identify the bottlenecks, and optimize it without significantly changing any of the answers. Hence you should first establish a reference time for the unmodified code using no optimization [this may take several minutes to run depending on the speed of your computer] and then back up the results:

cp simple_md.log simple_md.log.ref

You can then investigate the effect of compiler optimization:

Fortran: make simple_md_opt
or C: make simple_md_C_opt

and compare timings. Can you explain what is happening and why? You should also check the results by comparing the simple_md.log file with the reference one.

Now build a profiled binary using:

Fortran: make simple_md_pg
or C: make simple_md_C_pg

and use the profiling tools (gprof and gcov) to identify the bottlenecks. What improvements can you make (including floating point optimizations, physics-based and maths-based algorithmic optimizations, compiler flags, loop optimizations, etc but NOT adding any parallelism) can you make to improve the runtime further without significantly changing any results generated? Run your new code, and verify that you have not signicantly changed any calculated results.

Include a listing of your code improvements in your answer AND attach your code as a separate file.

NB It should be possible to get a dramatic speedup with a little effort! Compiler flags alone should get you at least a x4 speedup, and code improvements should get at least x8 improvement ON TOP OF the compiler improvements.

Submit your answer using the submission point on the VLE.

Q5) For 2021: optional question - not set for assessment. Write a program using the 4 different methods (standard, fast, matmul and BLAS) described in the lectures to perform matrix multiplication (A=B*C) with B and C being NxN matrices. Test your program works correctly and gives the same answer for each method. If you are using C or C++ you do not need to implement the matmul method.

Then test and time your program with N=100, 128,250, 256, 500, 512, 1000 and 1024, using different matrices filled with random double-precision real numbers. [An example of using the Fortran built-in random number generator was given in the simple_md.f90 used in Q4.]

Plot a graph of the %peak performance achieved against N (as in the lectures) and explain all the main features seen. Include a printout of the code with your answer.

For each of the different algorithms you should assume the same fundamental amount of floating point arithmetic, just different load/store patterns, hence the number of FLOPS is 2*N^3 for all the algorithms. You will probably find that there is a degree of noise in the timings of any single run, so it is always a good idea to average over a number of different runs, each containing a different set of initial random numbers.

NB You can use 'cat /proc/cpuinfo' to find the theoretical peak performance capability of each CPU.

If you need help with linking libraries such as BLAS see the basic instructions at http://www-users.york.ac.uk/~mijp1/teaching/4th_year_HPC/intro_linux.shtml

Include a listing of your code in your answer AND attach your code as a separate file.

Submit your answer using the submission point on the VLE.

Q6)
a) Write an OpenMP program to correctly do the following:

iterate over N=10^3,2*10^3,5*10^3, 10^4,2*10^4,5*10^4, 10^5,2*10^5,5*10^5, 10^6,2*10^6,5*10^6
allocate two double precision vectors, A and B, of size N
start timer
sum=0
loop over i=1,N
     A(i)=i
     B(i)=A(i)*A(i)
     A(i)=A(i)+SQRT(ABS(SIN(B(i))))*2.34
     sum=sum+A(i)
end loop
stop timer
print N, sum, time for this loop
deallocate A and B
repeat for next value of N

If you need help with compiler flags for OpenMP code see the instructions at http://www-users.york.ac.uk/~mijp1/teaching/4th_year_HPC/practicals.shtml

What is the value of sum for N=10^6? Make sure you get the same result for one and multiple threads! What is the speedup going from 1 to 2 and multiple threads? You should test on at least 20 threads with Viking and repeat on any other multi-core machines to which you have access. If the code is too fast to time reliably, you should add an extra loop over Nrepetitions within the timer block.

Include a graph showing parallel speedup. Include a listing of your code in your answer AND attach your code as a separate file.

NB You can change the number of threads by typing export OMP_NUM_THREADS=1 in a Bash shell. You can make sure you have the correct number of threads from within your code by adding the following to F90 (and similar in C):

!$OMP PARALLEL DEFAULT(PRIVATE)
    nthreads=OMP_GET_NUM_THREADS()
    tnumber=OMP_GET_THREAD_NUM()
    print *, "Thread",tnumber+1," of",nthreads
!$OMP END PARALLEL

See the OpenMP website if you want more details on OpenMP.

You should also check out the Wikipedia article on the "Kahan summation algorithm".

Submit your answer using the submission point on the VLE.

Q7) THIS IS A 2-WEEK QUESTION. Consider the following system:

A square with corners at (0,0), (0,10), (10,0) and (10,10) is surrounded by a heatbath which maintains a T=0 boundary on all sides. There are 3 heat sources inside the square:

i) T= +10.0 at (x,y)=(5,5)
ii) T= +7.2 at (x,y)=(4,6)
iii)T= -1.2 at (x,y)=(7,2.5)

The temperature at each point on a 2D grid may be found using an iterative approach, where the new temperature at each point is calculated as the average of its four nearest neighbours and itself, until a steady state is reached. You will need two copies of the temperature grid - one for the current values and one for the new values - and you should update all the new data in one sweep through the grid, and then repeat for next iteration. In general, you would want to increase the number of points in the grid until the calculation was converged, but for this question, you will just work with 1 fixed size grid.

a) Write a serial program that solves this problem to find the equilibrium temperature distribution of the system. What is the equilibrium temperature at the point (5.5,5.5) for a 100x100 grid?

b) Parallelise your program using MPI with the grid distributed over a 1D arrangement of processors. Your program should be capable of running on different numbers of processors. For simplicity, it is reasonable to restrict this to only those numbers of processors that are an integer factor of the number of grid points, otherwise there are extra complications in the load balancing. Hence you should only consider 1,2,4,5,10 processors for this size grid (ie for 10 processors the local data should be 10x100 on each process).

What is the parallel speedup your code can achieve on Viking? Can you interpret your findings using your knowledge of hardware?

Include a graph showing parallel speedup. Include a listing of both serial and parallel codes in your answer AND attach as separate files.

Hint: you might also like to try visualising the 2D grid data: you can use gnuplot or a picture file format, such as pgm using this simple F90 code snippet for generating a grayscale picture from a 2D data grid, which you can view using the 'display' command or convert to other formats using 'pnmtopng' etc. See 'man pgm' etc for more options.

Submit your answer using the submission point on the VLE.