HPCC with open64 compiler, ACML and base FFT
Make
Here are some modifications made on Mills, based on recommendations from "Using ACML (AMD Core Math Library) In High Performance Computing Challenge (HPCC)".
Changes to the make file hpl/setup/Make.Linux_ATHLON_FBLAS copied to hpl/Make.open64-acml
- Comment out lines beginning with `MP` or `LA`
- Change `/usr/bin/gcc` to `mpicc`
- Change `/usr/bin/g77` to `mpif77`
- Append `-DHPCC_FFT_235` to `CCFLAGS`
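The four makefile edits can also be applied mechanically. The following is a minimal sketch using `sed`, not the procedure used on Mills; `patch_hpl_makefile` is a hypothetical helper name, and the patterns assume the variable assignments start in column one, as they do in the stock HPL makefiles.

```shell
# patch_hpl_makefile: hypothetical helper, not part of HPCC or HPL.
# $1 = source makefile, $2 = destination makefile (the source is left untouched).
patch_hpl_makefile() {
    sed -e 's|^MP|#MP|' \
        -e 's|^LA|#LA|' \
        -e 's|/usr/bin/gcc|mpicc|' \
        -e 's|/usr/bin/g77|mpif77|' \
        -e '/^CCFLAGS *=/ s|$| -DHPCC_FFT_235|' \
        "$1" > "$2"
}

# Intended use, run from the top of the HPCC source tree:
#   patch_hpl_makefile hpl/setup/Make.Linux_ATHLON_FBLAS hpl/Make.open64-acml
```

Writing to a new file rather than editing in place keeps the original template available for comparison with `diff`.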
The Valet commands are
vpkg_devrequire acml/5.2.0-open64-fma4
vpkg_devrequire openmpi/1.6.1-open64
Exported variables (to set values for commented LAinc and LAlib)
export LAinc="$CPPFLAGS"
export LAlib="$LDFLAGS -lacml"
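Because `LAinc` and `LAlib` are commented out in the makefile, `make` picks up their values from the environment, so a build started from a shell where they were never exported will link without ACML. A small pre-flight check along the following lines may help catch that early. This is only a sketch: `check_la_env` is a hypothetical helper name, and it assumes `env` and `grep` are on the `PATH`.

```shell
# check_la_env: hypothetical helper, not part of HPCC or Valet.
# Returns 0 only if LAinc and LAlib are exported and LAlib names the ACML library.
check_la_env() {
    # LAinc may legitimately be empty, so only test that it is exported.
    env | grep -q '^LAinc=' || { echo "LAinc is not exported" >&2; return 1; }
    env | grep -q '^LAlib=' || { echo "LAlib is not exported" >&2; return 1; }
    # Pad with spaces so -lacml is matched as a whole word.
    case " $LAlib " in
        *" -lacml "*) return 0 ;;
        *) echo "LAlib does not contain -lacml" >&2; return 1 ;;
    esac
}

# Intended use, before starting the build:
#   check_la_env && make -j 4 arch=open64-acml
```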
Make command with 4 parallel jobs
make -j 4 arch=open64-acml
runs: N = 30000
Packages: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
N = 30000, NB = 100, P = 6, Q=8
These runs need 48 processes, arranged as a P x Q = 6 x 8 process grid. The same 48 processes are run whether the job requests 48 or 96 slots.
| Options | Grid Engine | MPI flags |
|---|---|---|
| NCPU=1 | -pe openmpi 48 | --bind-to-core |
| NCPU=2 | -pe openmpi 96 | --bind-to-core --bycore --cpus-per-proc 2 -np 48 |
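The mapping from the NCPU option to the MPI flags in the table can be captured in a small shell function. This is a sketch only: `mpi_flags` is a hypothetical helper name, the flag sets are copied from the table (written with double hyphens), and the `./hpcc` path in the usage comment is an assumption about where the benchmark binary lives.

```shell
# mpi_flags: hypothetical helper that prints the mpirun flags for a given NCPU.
# $1 = NCPU (cores reserved per MPI process), $2 = number of MPI processes
mpi_flags() {
    case "$1" in
        1) echo "--bind-to-core" ;;
        2) echo "--bind-to-core --bycore --cpus-per-proc 2 -np $2" ;;
        *) echo "unsupported NCPU: $1" >&2; return 1 ;;
    esac
}

# Intended use inside a Grid Engine job script (word splitting is deliberate):
#   mpirun $(mpi_flags 2 48) ./hpcc
```

With NCPU=2 the explicit `-np 48` matters, since the job holds 96 slots but only 48 processes should be started.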
HPCC benchmark results for two runs:
| result | NCPU=1 | NCPU=2 |
|---|---|---|
| HPL_Tflops | 0.0769491 | 1.54221 |
| StarDGEMM_Gflops | 1.93686 | 14.6954 |
| SingleDGEMM_Gflops | 11.5042 | 15.6919 |
| MPIRandomAccess_LCG_GUPs | 0.0195047 | 0.00352421 |
| MPIRandomAccess_GUPs | 0.0194593 | 0.00410853 |
| StarRandomAccess_LCG_GUPs | 0.0113424 | 0.0302748 |
| SingleRandomAccess_LCG_GUPs | 0.0448261 | 0.0568664 |
| StarRandomAccess_GUPs | 0.0113898 | 0.0288637 |
| SingleRandomAccess_GUPs | 0.0521811 | 0.053262 |
| StarFFT_Gflops | 0.557555 | 1.14746 |
| SingleFFT_Gflops | 1.2178 | 1.45413 |
| MPIFFT_Gflops | 5.31624 | 34.3552 |
runs: N = 72000
Packages: `acml/5.3.0-open64-fma4`, `open64/4.5`, and `openmpi/1.4.4-open64` or `openmpi/1.6.1-open64`
N = 72000, NB = 100, P = 12, Q = 16
nproc = 2x192 (384 slots, with each of the 192 MPI workers bound to a Bulldozer core pair)
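The slot count follows directly from the process grid: P x Q MPI workers, times the number of cores reserved per worker. A minimal sketch of that arithmetic, with `slots_needed` as a hypothetical helper name:

```shell
# slots_needed: hypothetical helper.
# $1 = P (grid rows), $2 = Q (grid columns), $3 = cores reserved per MPI worker
slots_needed() {
    echo $(( $1 * $2 * $3 ))
}

slots_needed 12 16 2   # prints 384 (192 workers, one core pair each)
slots_needed 6 8 1     # prints 48  (the N = 30000 runs with NCPU=1)
```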
The two runs differ mainly in the use of QLogic PSM endpoints.
| Result | No PSM (v1.4.4) | PSM (v1.6.1) |
|---|---|---|
| HPL_Tflops | 1.68496 | 2.08056 |
| StarDGEMM_Gflops | 14.6933 | 14.8339 |
| SingleDGEMM_Gflops | 15.642 | 15.536 |
| PTRANS_GBs | 9.25899 | 18.4793 |
| StarFFT_Gflops | 1.19982 | 1.25452 |
| StarSTREAM_Triad | 3.62601 | 3.65631 |
| SingleFFT_Gflops | 1.44111 | 1.44416 |
| MPIFFT_Gflops | 7.67835 | 77.603 |
| RandomlyOrderedRingLatency_usec | 65.8478 | 2.44898 |
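To compare the two columns, it can help to reduce a pair of values to a ratio. The helper below is a sketch that assumes `awk` is available; `ratio` is a hypothetical name, and the sample values are taken from the table above.

```shell
# ratio: hypothetical helper that prints $1 / $2 with two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (b == 0) { exit 1 }      # refuse to divide by zero
        printf "%.2f\n", a / b
    }'
}

ratio 77.603 7.67835    # MPIFFT_Gflops, PSM over non-PSM: prints 10.11
ratio 65.8478 2.44898   # ring latency, non-PSM over PSM: prints 26.89
```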
more
N = 72000, NB = 100, P = 12, Q = 16, NP = 384
Packages added to the environment, first set: `acml/5.2.0-open64-fma4`, `open64/4.5`, `openmpi/1.4.4-open64`
Packages added to the environment, second set: `acml/5.3.0-open64-fma4`, `open64/4.5`, `openmpi/1.6.1-open64`
| Result | 139765 | 145105 |
|---|---|---|
| HPL_Tflops | 1.54221 | 0.364243 |
| StarDGEMM_Gflops | 14.6954 | 13.6194 |
| SingleDGEMM_Gflops | 15.6919 | 15.453 |
| PTRANS_GBs | 1.14913 | 1.07982 |
| MPIRandomAccess_GUPs | 0.00410853 | 0.00679052 |
| StarSTREAM_Triad | 3.39698 | 2.83863 |
| StarFFT_Gflops | 1.14746 | 0.737805 |
| SingleFFT_Gflops | 1.45413 | 1.3756 |
| MPIFFT_Gflops | 34.3552 | 32.3555 |
| RandomlyOrderedRingLatency_usec | 77.9332 | 76.9595 |