Universität Karlsruhe (TH)
Rechenzentrum
KaSC Benchmark Suite
Version 1.0
AUGUST 2004
1 DIRECTORY STRUCTURE
1.1 KERNELS
2 BENCHMARK PROGRAMS
2.1 CONFIGURATION OF BENCHMARK PROGRAMS
2.2 COMPILATION OF BENCHMARK PROGRAMS
2.3 RUNNING THE BENCHMARK PROGRAMS
2.4 RUNNING DIFFERENT COMMUNICATION BENCHMARKS
2.4.1 Communication within one SMP node
2.4.1.1 Bandwidth and latency for single ping pong between two tasks within one SMP node
2.4.1.2 Bandwidth and latency for single exchange between two tasks within one SMP node
2.4.1.3 Bisection bandwidth for ping pong within one SMP node
2.4.1.4 Bisection bandwidth for exchange within one SMP node
2.4.2 Communication between two SMP nodes
2.4.2.1 Single ping pong test between two tasks running on different SMP nodes
2.4.2.2 Single exchange between two tasks running on different SMP nodes
2.4.2.3 Multiple ping pong test between all CPUs of two SMP nodes
2.4.2.4 Multiple exchange between all CPUs of two SMP nodes
2.4.3 Bisection Bandwidth
2.4.4 Latency Hiding
3 CONTACT
1 Directory structure
All files belonging to this benchmark are supplied in a compressed tar file, benchmark.tar.gz, which
should be uncompressed and unpacked using the commands
gunzip benchmark.tar.gz
tar -xvf benchmark.tar
The current directory also contains a file make.inc. This file defines the compiler names and options, library
paths, etc. that are needed to compile the various benchmark programs. The Makefile includes this file in order to
read all these definitions. For some systems, sample files make_<SYSTEM>.inc are included in the current
directory.
1.1 Kernels
The current directory contains a set of low-level kernels that are typical for applications in the
field of scientific computing. They measure specific performance data of one processor, one SMP node or the
whole system. Some programs run only on a single CPU; others run on a single CPU
as well as on all CPUs of an SMP node. The benchmark suite also includes some programs that
measure the performance of typical MPI point-to-point communications.
2 Benchmark Programs
All programs are written in Fortran 90 and call a C function seconds to measure CPU time and wall clock
time. The C function in seconds.c calls the function gettimeofday from the system library to get the
timing data. Two options may have to be added to the C compiler options (CFLAGS in make.inc). Depending on
whether the Fortran compiler appends an underscore to the names of external functions, you
should add the option -DFTNLINKSUFFIX. Depending on whether the Fortran compiler converts the
name seconds to uppercase (SECONDS), you should add the option -DUPPERCASE. All programs contain a
structure that is similar to
call SECONDS(tim1,tdum)
do j=1,repfactor
   call DUMMY(x1,x2,x3,n)
   do i=1,n
      simple operation
   enddo
enddo
call SECONDS(tim2,tdum)
With this loop construct we want to measure the time needed to complete the inner loop do i=1,n. In order
to lengthen the measured time and thus increase the accuracy of the measurement, this loop is
executed repeatedly (outer loop do j=1,repfactor). The call to the external subroutine DUMMY is included
to guarantee that the outer loop is executed as often as requested; otherwise the compiler's optimizer
could change this loop construct. You should make sure that neither the compiler nor the linker performs an
optimization that changes this sequence and nesting of loops. In particular, interprocedural optimization
should be switched off: the options in FFLAGS (in make.inc) should not enable any
interprocedural optimization during the compile or link step.
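The following C sketch illustrates what a timer like seconds.c and the timing construct above could look like. It is not the seconds.c shipped with the suite. The mapping of FTNLINKSUFFIX and UPPERCASE to the symbol name, the argument order (wall clock time first, CPU time second), the use of clock() for CPU time, the helper time_vector_add and the volatile function pointer dummy_ptr (a stand-in for a separately compiled DUMMY) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>   /* gettimeofday */
#include <time.h>       /* clock, CLOCKS_PER_SEC */

/* Assumed mapping of the two options to the symbol name the Fortran
   compiler expects; the real seconds.c may handle this differently. */
#if defined(UPPERCASE) && defined(FTNLINKSUFFIX)
#define SECONDS_NAME SECONDS_
#elif defined(UPPERCASE)
#define SECONDS_NAME SECONDS
#elif defined(FTNLINKSUFFIX)
#define SECONDS_NAME seconds_
#else
#define SECONDS_NAME seconds
#endif

/* Fortran passes arguments by reference, hence the pointers.
   Assumption: first argument = wall clock seconds, second = CPU seconds. */
void SECONDS_NAME(double *wall, double *cpu)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    *wall = (double)tv.tv_sec + 1.0e-6 * (double)tv.tv_usec;
    *cpu  = (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Empty routine playing the role of DUMMY. */
static void dummy(double *x1, double *x2, double *x3, int n)
{
    (void)x1; (void)x2; (void)x3; (void)n;
}

/* Calling through a volatile pointer keeps the compiler from inlining
   the call, so the outer loop cannot be collapsed. In the suite the
   same effect comes from DUMMY being an external subroutine. */
static void (*volatile dummy_ptr)(double *, double *, double *, int) = dummy;

/* Times repfactor passes of a vector add and returns the average
   wall clock time of one pass (hypothetical helper). */
double time_vector_add(double *x1, double *x2, double *x3,
                       int n, int repfactor)
{
    double tim1, tim2, tdum;

    assert(repfactor > 0);
    SECONDS_NAME(&tim1, &tdum);
    for (int j = 0; j < repfactor; j++) {
        dummy_ptr(x1, x2, x3, n);
        for (int i = 0; i < n; i++)
            x3[i] = x1[i] + x2[i];
    }
    SECONDS_NAME(&tim2, &tdum);
    return (tim2 - tim1) / repfactor;
}
```

Calling time_vector_add(x1, x2, x3, n, repfactor) returns the average wall clock time of one pass of the inner loop. Dividing by repfactor is a choice made for this sketch; the benchmark programs may report their timings differently.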
The benchmark suite includes the following serial programs running on one processor (a parallel version
exists for each of these programs except bm116):
bm111: vector add                         x3(i) = x1(i) + x2(i)
bm112: vector multiply                    x3(i) = x1(i) * x2(i)
bm113: vector divide                      x3(i) = x1(i) / x2(i)
bm114: vector triad with scalar (saxpy)   x3(i) = x1(i) + s * x2(i)
bm116: vector compound operation          x7(i) = (x1(i)+x2(i))*x3(i)+(x4(i)-x5(i))/x6(i)
bm117: dot product                        s = s + x1(i) * x2(i)
bm118: vector triad                       x4(i) = x1(i) + x2(i) * x3(i)
The benchmark suite includes the following serial programs that run on one processor only:
bm111str: vector add with stride                  x3(i*istr) = x1(i*istr) + x2(i*istr)
bm111gth: vector add with gather                  x3(i) = x1(ix(i)) + x2(i)
bm111sct: vector add with scatter                 x3(ix(i)) = x1(i) + x2(i)
bm117gth: dot product with gather                 s = s + x1(ix(i)) * x2(i)
bm121: matrix multiply - rowwise version          C = A * B
bm122: matrix multiply - dot product version      C = A * B
bm123: matrix multiply - columnwise version       C = A * B
bm123u4: matrix multiply - columnwise version
         with 4-fold unrolling                    C = A * B
bm124: matrix multiply - library version          C = A * B
bm131: scalar performance test 1                  a = a * b + c
bm132: scalar performance test 2                  a = b*c + d; b = a - c*b; c = a/b
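The stride, gather and scatter variants listed above differ from the plain kernels only in how memory is addressed. A minimal C sketch of the four access patterns is given here for illustration. The function names are hypothetical, the suite itself implements these kernels in Fortran 90, and the indices are 0-based here whereas the Fortran formulas are 1-based.

```c
#include <assert.h>
#include <stddef.h>

/* Pattern of bm111str: only every istr-th element is touched.
   The arrays must hold at least (n-1)*istr + 1 elements. */
void vector_add_stride(const double *x1, const double *x2, double *x3,
                       size_t n, size_t istr)
{
    assert(istr >= 1);
    for (size_t i = 0; i < n; i++)
        x3[i * istr] = x1[i * istr] + x2[i * istr];
}

/* Pattern of bm111gth: x1 is read indirectly through the index vector ix. */
void vector_add_gather(const double *x1, const double *x2, double *x3,
                       const size_t *ix, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x3[i] = x1[ix[i]] + x2[i];
}

/* Pattern of bm111sct: x3 is written indirectly through ix. */
void vector_add_scatter(const double *x1, const double *x2, double *x3,
                        const size_t *ix, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x3[ix[i]] = x1[i] + x2[i];
}

/* Pattern of bm117gth: dot product with a gathered first operand. */
double dot_gather(const double *x1, const double *x2,
                  const size_t *ix, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x1[ix[i]] * x2[i];
    return s;
}
```

With contiguous access every loaded cache line is used completely, while strided and indirect access use only part of each line, so these variants mainly expose the memory system rather than the arithmetic units.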
The benchmark suite includes the following programs that usually run in parallel on one node and measure
the performance on the first processor:
bm111smp: vector add on all processors                  x3(i) = x1(i) + x2(i)
bm112smp: vector multiply on all processors             x3(i) = x1(i) * x2(i)
bm113smp: vector divide on all processors               x3(i) = x1(i) / x2(i)
bm114smp: vector triad with scalar on all processors    x3(i) = x1(i) + s * x2(i)
bm117smp: dot product on all processors                 s = s + x1(i) * x2(i)
bm118smp: vector triad on all processors                x4(i) = x1(i) + x2(i) * x3(i)
bm1148smp: vector triad with scalar on the first processor and vector triad
           with large, constant vector length on all other processors
             first processor:  x3(i) = x1(i) + s * x2(i)
             other processors: x4(i) = x1(i) + x2(i) * x3(i)
bm1178smp: dot product on the first processor and vector triad
           with large, constant vector length on all other processors
             first processor:  s = s + x1(i) * x2(i)
             other processors: vector triad
bm1188smp: vector triad on the first processor and vector triad
           with large, constant vector length on all other processors
             first processor:  x4(i) = x1(i) + x2(i) * x3(i)
             other processors: vector triad
The programs measuring the communication performance are:
bmmpi1: MPI ping-pong benchmark
bmmpi2: MPI double ping-pong (exchange)
bmmpi3: MPI overlap test (short messages)
bmmpi4: MPI overlap test (long messages)
2.1 Configuration of Benchmark Programs
Before starting the installation process you should check and adapt the file make.inc. A sample make.inc is
shown below:
# make.inc
#
# Utility file used by the KaSC Benchmark.
#
########################################################################
#
# Part 1: Utilities
#
# Name of the make utility that will be used to compile and link the
# benchmarking program.
# Most of the benchmarking programs are based on using the GNU make utility.
MAKE = gmake
# Number of processors of one node
NP =
# Number of processors for the communication benchmarks
NP_COMM =
# Name of the command to run parallel programs, e.g. mpirun
PAR_CMD =
# Option for the number of processors (spatially) in front of the executable
# together with the number of processors for the serial run, e.g. -np 1
PAR_OPTS1 =
# together with the number of processors for the parallel run, e.g. -np $(NP)
PAR_OPTP1 =
# Option for the number of processors (spatially) behind the executable together
# with the number of processors for the serial run, e.g. -procs 1
PAR_OPTS2 =
# with the number of processors for the parallel run, e.g. -procs $(NP) -nodes 1
PAR_OPTP2 =