| software:scalapack:linsolve [2017-12-02 15:23] – created sraskar | software:scalapack:linsolve [2021-04-27 16:21] (current) – external edit 127.0.0.1 | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| + | ====== ScaLAPACK linsolve benchmark ====== | ||
| + | |||
| + | ===== Fortran 90 source code ===== | ||
| + | |||
| + | We base this benchmark on the '' | ||
| + | |||
| + | We get the program with | ||
| + | <code bash> | ||
| + | if [ ! -f " | ||
| + | wget http:// | ||
| + | fi | ||
| + | </ | ||
| + | |||
| + | This program reads one line to start the benchmark. The input must contain 5 numbers: | ||
| + | * N: order of linear system | ||
| + | * NPROC_ROWS: number of rows in process grid | ||
| + | * NPROC_COLS: number of columns in process grid | ||
| + | * ROW_BLOCK_SIZE: | ||
| + | * COL_BLOCK_SIZE: | ||
| + | |||
| + | Where '' | ||
| + | |||
| + | For this benchmark we will set '' | ||
| + | <code bash> | ||
| + | let N=3000 | ||
| + | let ROW_BLOCK_SIZE=500 | ||
| + | let COL_BLOCK_SIZE=500 | ||
| + | let NPROC_ROWS=$N/ | ||
| + | let NPROC_COLS=$N/ | ||
| + | echo "$N $NPROC_ROWS $NPROC_COLS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE" | ||
| + | </ | ||
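As a hedged sketch of how the input line follows from the matrix order and the block size, the function below assumes square blocks and a block size that divides N exactly, so the process grid has N/BLOCK rows and N/BLOCK columns. This matches the result headings, where N=4000 with BLOCK=1000 gives NPROCS=16. The name `make_input` is hypothetical and not part of the benchmark.

```shell
# Hypothetical helper: print the five-number input line for linsolve.
# Assumes square blocks and that the block size divides N exactly.
make_input() {
    n=$1
    block=$2
    if [ $(( n % block )) -ne 0 ]; then
        echo "block size $block does not divide N=$n" >&2
        return 1
    fi
    nproc_rows=$(( n / block ))
    nproc_cols=$(( n / block ))
    # Order: N NPROC_ROWS NPROC_COLS ROW_BLOCK_SIZE COL_BLOCK_SIZE
    echo "$n $nproc_rows $nproc_cols $block $block"
}

make_input 3000 500    # prints: 3000 6 6 500 500
```

The number of MPI processes requested from the queue is then the product of the two grid dimensions, 36 in this example.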
| + | |||
| + | <note tip> | ||
| + | To allow larger blocks you could extend the two MAX parameters in the '' | ||
| + | |||
| + | MAX_VECTOR_SIZE from 1000 to 2000 | ||
| + | MAX_MATRIX_SIZE from 250000 to 1000000 | ||
| + | | ||
| + | To accommodate these larger sizes some of the FORMAT statements should have I4 instead of I2 and I3. | ||
| + | </ | ||
| + | |||
| + | ===== Compiling ===== | ||
| + | |||
| + | First set the variables | ||
| + | |||
| + | * $packages set to VALET packages | ||
| + | * $libs set to libraries | ||
| + | * $f90flags set to compiler flags | ||
| + | |||
| + | Since this test is completely contained in one Fortran 90 program, you can compile, link and load with one command. | ||
| + | |||
| + | < | ||
| + | vpkg_rollback all | ||
| + | vpkg_devrequire $packages | ||
| + | |||
| + | mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs | ||
| + | </ | ||
| + | |||
| + | Some versions of the '' | ||
| + | |||
| + | |||
| + | | ||
| + | |||
| + | ===== Grid engine script file ===== | ||
| + | |||
| + | You must run the '' | ||
| + | a script, which we will copy | ||
| + | from ''/ | ||
| + | |||
| + | * $MY_EXEC: '' | ||
| + | * NPROC: '' | ||
| + | * vpkg_require line includes the VALET packages for the benchmark. | ||
| + | |||
| + | For example, with the '' | ||
| + | <code bash> | ||
| + | let NPROC=$NPROC_ROWS*$NPROC_COLS | ||
| + | if [ ! -f " | ||
| + | sed -e ' | ||
| + | / | ||
| + | echo "new copy of template in template.qs" | ||
| + | fi | ||
| + | sed " | ||
| + | </ | ||
| + | |||
| + | The file '' | ||
| + | Also '' | ||
| + | |||
| + | <note tip> | ||
| + | There is only one executable, '' | ||
| + | </ | ||
| + | |||
| + | ===== Submitting ===== | ||
| + | |||
| + | There is only one linear system solve, and it should take just a few seconds. | ||
| + | <code bash> | ||
| + | qsub -N $name$N -l standby=1 -l h_rt=04: | ||
| + | </ | ||
| + | |||
| + | ===== Tests ===== | ||
| + | |||
| + | ==== gcc ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=gcc | ||
| + | packages=scalapack/ | ||
| + | libs=" | ||
| + | f90flags='' | ||
| + | </ | ||
| + | |||
| + | |||
| + | ==== gcc and atlas ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=gcc_atlas | ||
| + | packages=' | ||
| + | libs=" | ||
| + | f90flags='' | ||
| + | </ | ||
| + | |||
| + | The documentation in ''/ | ||
| + | |||
| + | Also from the same documentation: | ||
| + | <code text> | ||
| + | ATLAS does not provide a full LAPACK library. | ||
| + | </ | ||
| + | |||
| + | This means the order the VALET packages are added is important. | ||
| + | |||
| + | But this may not be optimal: | ||
| + | < | ||
| + | Just linking in ATLAS' | ||
| + | performance, | ||
| + | of ATLAS' | ||
| + | </ | ||
| + | |||
| + | With these variables set and '' | ||
| + | <code text> | ||
| + | packages=' | ||
| + | </ | ||
| + | we get '' | ||
| + | |||
| + | <code text> | ||
| + | ... | ||
| + | / | ||
| + | xerbla.f: | ||
| + | ... | ||
| + | </ | ||
| + | |||
| + | Explanation: | ||
| + | |||
| + | <code bash> | ||
| + | find /usr -name libg2c.a | ||
| + | </ | ||
| + | <code text> | ||
| + | find: `/ | ||
| + | / | ||
| + | / | ||
| + | </ | ||
| + | To remove these errors, change: | ||
| + | <code bash> | ||
| + | libs=" | ||
| + | </ | ||
| + | |||
| + | New '' | ||
| + | <code text> | ||
| + | ... | ||
| + | | ||
| + | ... | ||
| + | </ | ||
| + | |||
| + | Explanation: | ||
| + | |||
| + | <code bash> | ||
| + | nm -g / | ||
| + | </ | ||
| + | <code bash> | ||
| + | nm -g / | ||
| + | </ | ||
| + | <code text> | ||
| + | U slarnv_ | ||
| + | 0000000000000000 T slarnv_ | ||
| + | U slarnv_ | ||
| + | U slarnv_ | ||
| + | </ | ||
| + | |||
| + | No output from first '' | ||
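The `nm` listings above can be read mechanically: a `T` marks a symbol the library defines, while a `U` marks one it only references. As a minimal sketch, the hypothetical helper `defines_symbol` below reads `nm -g` output on standard input and succeeds only if the named symbol is defined. The library path in the usage comment is a placeholder, not a path from this page.

```shell
# Hypothetical helper: succeed if `nm -g` output on stdin lists the symbol
# as defined in the text section (type T) rather than undefined (type U).
defines_symbol() {
    grep -Eq "[[:space:]]T[[:space:]]+$1\$"
}

# Usage (placeholder path): nm -g /path/to/liblapack.a | defines_symbol slarnv_
printf '                 U slarnv_\n0000000000000000 T slarnv_\n' |
    defines_symbol slarnv_ && echo "slarnv_ is defined"
```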
| + | |||
| + | You can copy the full atlas directory into your working directory and then follow the directions | ||
| + | in ''/ | ||
| + | <code text> | ||
| + | **** GETTING A FULL LAPACK LIB **** | ||
| + | </ | ||
| + | |||
| + | We call this library '' | ||
| + | |||
| + | ==== gcc and myatlas ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=gcc_myatlas | ||
| + | packages=' | ||
| + | libs=" | ||
| + | f90flags='' | ||
| + | </ | ||
| + | |||
| + | This requires a copy of atlas in your directory, '' | ||
| + | You need to build your own copy of '' | ||
| + | |||
| + | Assuming | ||
| + | you do not have a '' | ||
| + | <code bash> | ||
| + | cp -a / | ||
| + | ar x lib/ | ||
| + | cp / | ||
| + | ar r lib/ | ||
| + | rm *.o | ||
| + | cp / | ||
| + | </ | ||
| + | |||
| + | Now you have a '' | ||
| + | |||
| + | ==== gcc and myptatlas ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=gcc_myptatlas | ||
| + | packages=' | ||
| + | libs=" | ||
| + | f90flags='' | ||
| + | </ | ||
| + | |||
| + | Parallel threads will dynamically use all the cores available at compile time (24), but only if the problem size indicates they will help. | ||
| + | |||
| + | ==== pgi and acml ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=pgi_acml | ||
| + | packages=scalapack/ | ||
| + | libs=" | ||
| + | f90flags='' | ||
| + | </ | ||
| + | |||
| + | ==== intel and mkl ==== | ||
| + | |||
| + | <code bash> | ||
| + | name=intel_mkl | ||
| + | packages=openmpi/ | ||
| + | libs=" | ||
| + | f90flags=" | ||
| + | </ | ||
| + | |||
| + | ===== Results N=4000 ===== | ||
| + | |||
| + | ==== BLOCK=1000, NPROCS=16 ==== | ||
| + | |||
| + | Each test is repeated three times. | ||
| + | ^ File name ^ Time ^ | ||
| + | | gcc4000.o91943 | Elapsed time = 0.613728D+05 milliseconds | | ||
| + | | gcc4000.o92019 | Elapsed time = 0.862935D+05 milliseconds | | ||
| + | | gcc4000.o92030 | Elapsed time = 0.826695D+05 milliseconds | | ||
| + | | gcc_atlas4000.o91945 | Elapsed time = 0.386161D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92023 | Elapsed time = 0.433195D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92035 | Elapsed time = 0.424980D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92009 | Elapsed time = 0.448106D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92026 | Elapsed time = 0.461706D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92032 | Elapsed time = 0.441593D+04 milliseconds | | ||
| + | | intel_mkl4000.o91611 | Elapsed time = 0.222194D+05 milliseconds | | ||
| + | | intel_mkl4000.o92016 | Elapsed time = 0.215223D+05 milliseconds | | ||
| + | | intel_mkl4000.o92039 | Elapsed time = 0.214088D+05 milliseconds | | ||
| + | | pgi_acml4000.o91466 | ||
| + | | pgi_acml4000.o92017 | ||
| + | | pgi_acml4000.o92040 | ||
| + | |||
| + | ==== BLOCK=800, NPROCS=25 ==== | ||
| + | |||
| + | Each test is repeated three times. | ||
| + | ^ File name ^ Time ^ | ||
| + | | gcc4000.o92335 | Elapsed time = 0.638246D+05 milliseconds | | ||
| + | | gcc4000.o92386 | Elapsed time = 0.633060D+05 milliseconds | | ||
| + | | gcc4000.o92412 | Elapsed time = 0.629561D+05 milliseconds | | ||
| + | | gcc_atlas4000.o92336 | Elapsed time = 0.314615D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92389 | Elapsed time = 0.358208D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92413 | Elapsed time = 0.334147D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92337 | Elapsed time = 0.363176D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92390 | Elapsed time = 0.306922D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92414 | Elapsed time = 0.333779D+04 milliseconds | | ||
| + | | intel_mkl4000.o92339 | Elapsed time = 0.433877D+05 milliseconds | | ||
| + | | intel_mkl4000.o92393 | Elapsed time = 0.400862D+05 milliseconds | | ||
| + | | intel_mkl4000.o92417 | Elapsed time = 0.409855D+05 milliseconds | | ||
| + | | pgi_acml4000.o92338 | Elapsed time = 0.234248D+04 milliseconds | | ||
| + | | pgi_acml4000.o92392 | Elapsed time = 0.276856D+04 milliseconds | | ||
| + | | pgi_acml4000.o92415 | Elapsed time = 0.211567D+04 milliseconds | | ||
| + | ||
| + | ==== BLOCK=500, NPROCS=64 ==== | ||
| + | |||
| + | Each test is repeated three times. | ||
| + | ^ File name ^ Time ^ | ||
| + | | gcc4000.o92123 | Elapsed time = 0.284893D+05 milliseconds | | ||
| + | | gcc4000.o92144 | Elapsed time = 0.278744D+05 milliseconds | | ||
| + | | gcc4000.o92150 | Elapsed time = 0.289137D+05 milliseconds | | ||
| + | | gcc_atlas4000.o92130 | Elapsed time = 0.296471D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92142 | Elapsed time = 0.264463D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92148 | Elapsed time = 0.269103D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92133 | Elapsed time = 0.280457D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92138 | Elapsed time = 0.312135D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92153 | Elapsed time = 0.286337D+04 milliseconds | | ||
| + | | intel_mkl4000.o92134 | Elapsed time = 0.436288D+05 milliseconds | | ||
| + | | intel_mkl4000.o92140 | Elapsed time = 0.413780D+05 milliseconds | | ||
| + | | intel_mkl4000.o92152 | Elapsed time = 0.401095D+05 milliseconds | | ||
| + | | pgi_acml4000.o92137 | Elapsed time = 0.234475D+04 milliseconds | | ||
| + | | pgi_acml4000.o92145 | Elapsed time = 0.214514D+04 milliseconds | | ||
| + | | pgi_acml4000.o92149 | Elapsed time = 0.293480D+04 milliseconds | | ||
| + | |||
| + | ==== BLOCK=250, NPROCS=256 ==== | ||
| + | |||
| + | Each test is repeated three times. | ||
| + | ^ File name ^ Time ^ | ||
| + | | gcc4000.o92164 | Elapsed time = 0.148302D+05 milliseconds | | ||
| + | | gcc4000.o92168 | Elapsed time = 0.144862D+05 milliseconds | | ||
| + | | gcc4000.o92317 | Elapsed time = 0.160144D+05 milliseconds | | ||
| + | | gcc_atlas4000.o92167 | Elapsed time = 0.785104D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92171 | Elapsed time = 0.749285D+04 milliseconds | | ||
| + | | gcc_atlas4000.o92318 | Elapsed time = 0.798376D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92165 | Elapsed time = 0.797618D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92222 | Elapsed time = 0.792745D+04 milliseconds | | ||
| + | | gcc_myatlas4000.o92320 | Elapsed time = 0.720193D+04 milliseconds | | ||
| + | | intel_mkl4000.o92162 | Elapsed time = 0.636915D+05 milliseconds | | ||
| + | | intel_mkl4000.o92169 | Elapsed time = 0.733785D+05 milliseconds | | ||
| + | | intel_mkl4000.o92324 | Elapsed time = 0.653791D+05 milliseconds | | ||
| + | | pgi_acml4000.o92161 | Elapsed time = 0.740457D+04 milliseconds | | ||
| + | | pgi_acml4000.o92170 | Elapsed time = 0.733668D+04 milliseconds | | ||
| + | | pgi_acml4000.o92322 | Elapsed time = 0.769606D+04 milliseconds | | ||
| + | |||
| + | ===== Summary ===== | ||
| + | ==== 4000 x 4000 matrix ==== | ||
| + | === Time to solve linear system === | ||
| + | |||
| + | A randomly generated matrix is solved using ScaLAPACK with different block sizes. | ||
| + | The times are the average elapsed time in seconds, as reported by '' | ||
| + | ^ Test ^ N=4000 | ||
| + | ^ name ^ np=16 ^ np=25 ^ np=64 ^ np=256 ^ | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | |||
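The averages in seconds can be recomputed from the per-run lines in the results tables. The sketch below is one possible way to do it, assuming each job output file contains a line of the form `Elapsed time = 0.613728D+05 milliseconds`, as the tables show. The helper name `avg_seconds` is hypothetical.

```shell
# Hypothetical helper: average the "Elapsed time" lines on stdin, in seconds.
# Fortran prints the exponent with D (0.613728D+05); awk expects E.
avg_seconds() {
    awk '/Elapsed time =/ {
             t = $4
             sub(/D/, "E", t)      # 0.613728D+05 -> 0.613728E+05
             sum += t + 0          # value is in milliseconds
             n++
         }
         END { if (n > 0) printf "%.2f\n", sum / n / 1000 }'
}

# The three gcc runs with BLOCK=1000, NPROCS=16 from the results table:
printf '%s\n' \
    'Elapsed time = 0.613728D+05 milliseconds' \
    'Elapsed time = 0.862935D+05 milliseconds' \
    'Elapsed time = 0.826695D+05 milliseconds' | avg_seconds   # prints: 76.78
```

With real job output the same function could be fed by something like `cat gcc4000.o* | avg_seconds`, provided only the three runs of one configuration match the pattern.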
| + | |||
| + | There is not much difference between '' | ||
| + | |||
| + | === Speedup === | ||
| + | |||
| + | The speedup for '' | ||
| + | |||
| + | {{: | ||
| + | |||
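As a small worked example, and assuming speedup here means the reference time (plain gcc) divided by the time of the test in question, the ratio can be computed as below. The helper name `speedup` is hypothetical. The two inputs are the np=16 averages for gcc and gcc_atlas, worked out from the results tables above and rounded to two decimals.

```shell
# Hypothetical helper: speedup = reference time / test time (same units).
speedup() {
    awk -v ref="$1" -v t="$2" 'BEGIN { printf "%.1f\n", ref / t }'
}

# gcc average 76.78 s versus gcc_atlas average 4.15 s (N=4000, np=16)
speedup 76.78 4.15    # prints: 18.5
```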
| + | |||
| + | ==== 16000 x 16000 matrix ==== | ||
| + | === Time to solve linear system === | ||
| + | |||
| + | A randomly generated matrix is solved using ScaLAPACK with different block sizes. | ||
| + | The times are the average elapsed time in seconds, as reported by '' | ||
| + | ^ Test ^ N=16000 | ||
| + | ^ name ^ np=16 ^ np=64 ^ np=256 ^ | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | | [[# | ||
| + | |||
| + | === Speedup === | ||
| + | |||
| + | Speedup for ATLAS, MKL and ACML compared to the reference GCC with no optimized library. | ||
| + | |||
| + | {{: | ||
| + | |||
| + | |||
| + | === Time plot === | ||
| + | |||
| + | Elapsed time for ATLAS, MKL and ACML. | ||
| + | |||
| + | {{: | ||
| + | |||