software:scalapack:linsolve [2017-12-02 15:23] – created sraskar | software:scalapack:linsolve [2021-04-27 16:21] (current) – external edit 127.0.0.1 | ||
---|---|---|---|
+ | ====== Scalapack linsolve benchmark ====== | ||
+ | |||
+ | ===== Fortran 90 source code ===== | ||
+ | |||
+ | We base this benchmark on the '' | ||
+ | |||
+ | We get the program with | ||
+ | <code bash> | ||
+ | if [ ! -f " | ||
+ | wget http:// | ||
+ | fi | ||
+ | </ | ||
+ | |||
+ | This program reads one line to start the benchmark. The input must contain 5 numbers: | ||
+ | * N: order of linear system | ||
+ | * NPROC_ROWS: number of rows in process grid | ||
+ | * NPROC_COLS: number of columns in process grid | ||
+ | * ROW_BLOCK_SIZE: | ||
+ | * COL_BLOCK_SIZE: | ||
+ | |||
+ | Where '' | ||
+ | |||
+ | For this benchmark we will set '' | ||
+ | <code bash> | ||
+ | let N=3000 | ||
+ | let ROW_BLOCK_SIZE=500 | ||
+ | let COL_BLOCK_SIZE=500 | ||
+ | let NPROC_ROWS=$N/ | ||
+ | let NPROC_COLS=$N/ | ||
+ | echo "$N $NPROC_ROWS $NPROC_COLS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE" | ||
+ | </ | ||
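As a rough sketch of the arithmetic above, the grid dimensions and the five-number input line can be derived from N and the block sizes. The divisions by the block sizes are an assumption (the original `let` lines are cut off here), chosen because they match the BLOCK/NPROCS pairs in the Results headings. The divisibility check and the `INPUT_LINE` variable are additions for illustration only.

```shell
#!/bin/bash
# Sketch: derive the process grid from N and the block sizes.
# Assumption: grid dimension = N / block size, so one block per process.
N=3000
ROW_BLOCK_SIZE=500
COL_BLOCK_SIZE=500

if (( N % ROW_BLOCK_SIZE == 0 && N % COL_BLOCK_SIZE == 0 )); then
    NPROC_ROWS=$(( N / ROW_BLOCK_SIZE ))
    NPROC_COLS=$(( N / COL_BLOCK_SIZE ))
    NPROC=$(( NPROC_ROWS * NPROC_COLS ))   # total MPI processes needed
    # Order expected by the program: N, grid rows, grid columns, block sizes.
    INPUT_LINE="$N $NPROC_ROWS $NPROC_COLS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE"
    echo "$INPUT_LINE"
else
    echo "block sizes must divide N evenly" >&2
fi
```

With the values shown, this gives a 6 x 6 grid, so 36 processes are needed.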
+ | |||
+ | <note tip> | ||
+ | To allow larger blocks you could extend the two MAX parameters in the '' | ||
+ | |||
+ | MAX_VECTOR_SIZE from 1000 to 2000 | ||
+ | MAX_MATRIX_SIZE from 250000 to 1000000 | ||
+ | | ||
+ | To accommodate these larger sizes some of the FORMAT statements should have I4 instead of I2 and I3. | ||
+ | </ | ||
+ | |||
+ | ===== Compiling ===== | ||
+ | |||
+ | First set the variables | ||
+ | |||
+ | * $packages set to VALET packages | ||
+ | * $libs set to libraries | ||
+ | * $f90flags set to compiler flags | ||
+ | |||
+ | Since this test is completely contained in one Fortran 90 program, you can compile, link and load with one command. | ||
+ | |||
+ | < | ||
+ | vpkg_rollback all | ||
+ | vpkg_devrequire $packages | ||
+ | |||
+ | mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs | ||
+ | </ | ||
+ | |||
+ | Some versions of the '' | ||
+ | |||
+ | |||
+ | | ||
+ | |||
+ | ===== Grid engine script file ===== | ||
+ | |||
+ | You must run the '' | ||
+ | a script, which we will copy | ||
+ | from ''/ | ||
+ | |||
+ | * $MY_EXEC: '' | ||
+ | * NPROC: '' | ||
+ | * vpkg_require line includes the Valet packages for the benchmark. | ||
+ | |||
+ | For example, with the '' | ||
+ | <code bash> | ||
+ | let NPROC=$NPROC_ROWS*$NPROC_COLS | ||
+ | if [ ! -f " | ||
+ | sed -e ' | ||
+ | / | ||
+ | echo "new copy of template in template.qs" | ||
+ | fi | ||
+ | sed " | ||
+ | </ | ||
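The substitution step can be pictured with a tiny stand-in template. Everything in `TEMPLATE` below is hypothetical (the `@NPROC@` and `@EXEC@` markers, the `openmpi` parallel environment name). The real template on the cluster is longer and may use different markers.

```shell
#!/bin/bash
# Hypothetical two-line stand-in for a Grid Engine template file.
# The @...@ markers and the "openmpi" PE name are assumptions for this sketch.
TEMPLATE='#$ -pe openmpi @NPROC@
MY_EXEC="@EXEC@"'

NPROC=36   # NPROC_ROWS * NPROC_COLS for a 6 x 6 grid

# Fill in the slot count and the executable name with sed.
FILLED=$(printf '%s\n' "$TEMPLATE" |
    sed -e "s/@NPROC@/$NPROC/" -e 's|@EXEC@|./solve|')
printf '%s\n' "$FILLED"
```

Distinct markers such as `@EXEC@` avoid accidentally rewriting part of the variable name `MY_EXEC` itself.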
+ | |||
+ | The file '' | ||
+ | Also '' | ||
+ | |||
+ | <note tip> | ||
+ | There is only one executable, '' | ||
+ | </ | ||
+ | |||
+ | ===== Submitting ===== | ||
+ | |||
+ | There is only one linear system solve, and it should take just a few seconds. | ||
+ | <code bash> | ||
+ | qsub -N $name$N -l standby=1 -l h_rt=04: | ||
+ | </ | ||
+ | |||
+ | ===== Tests ===== | ||
+ | |||
+ | ==== gcc ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=gcc | ||
+ | packages=scalapack/ | ||
+ | libs=" | ||
+ | f90flags='' | ||
+ | </ | ||
+ | |||
+ | |||
+ | ==== gcc and atlas ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=gcc_atlas | ||
+ | packages=' | ||
+ | libs=" | ||
+ | f90flags='' | ||
+ | </ | ||
+ | |||
+ | The documentation in ''/ | ||
+ | |||
+ | Also from the same documentation: | ||
+ | <code text> | ||
+ | ATLAS does not provide a full LAPACK library. | ||
+ | </ | ||
+ | |||
+ | This means the order in which the VALET packages are added is important. | ||
+ | |||
+ | But this may not be optimal: | ||
+ | < | ||
+ | Just linking in ATLAS' | ||
+ | performance, | ||
+ | of ATLAS' | ||
+ | </ | ||
+ | |||
+ | With these variables set and '' | ||
+ | <code text> | ||
+ | packages=' | ||
+ | </ | ||
+ | we get '' | ||
+ | |||
+ | <code text> | ||
+ | ... | ||
+ | / | ||
+ | xerbla.f: | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | Explanation: | ||
+ | |||
+ | <code bash> | ||
+ | find /usr -name libg2c.a | ||
+ | </ | ||
+ | <code text> | ||
+ | find: `/ | ||
+ | / | ||
+ | / | ||
+ | </ | ||
+ | To remove these errors, change: | ||
+ | <code bash> | ||
+ | libs=" | ||
+ | </ | ||
+ | |||
+ | New '' | ||
+ | <code text> | ||
+ | ... | ||
+ | | ||
+ | ... | ||
+ | </ | ||
+ | |||
+ | Explanation: | ||
+ | |||
+ | <code bash> | ||
+ | nm -g / | ||
+ | </ | ||
+ | <code bash> | ||
+ | nm -g / | ||
+ | </ | ||
+ | <code text> | ||
+ | U slarnv_ | ||
+ | 0000000000000000 T slarnv_ | ||
+ | U slarnv_ | ||
+ | U slarnv_ | ||
+ | </ | ||
+ | |||
+ | No output from first '' | ||
+ | |||
+ | You can copy the full atlas directory into your working directory and then follow the directions | ||
+ | in ''/ | ||
+ | <code text> | ||
+ | **** GETTING A FULL LAPACK LIB **** | ||
+ | </ | ||
+ | |||
+ | We call this library '' | ||
+ | |||
+ | ==== gcc and myatlas ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=gcc_myatlas | ||
+ | packages=' | ||
+ | libs=" | ||
+ | f90flags='' | ||
+ | </ | ||
+ | |||
+ | This requires a copy of atlas in your directory, '' | ||
+ | You need to build your own copy of '' | ||
+ | |||
+ | Assuming | ||
+ | you do not have a '' | ||
+ | <code bash> | ||
+ | cp -a / | ||
+ | ar x lib/ | ||
+ | cp / | ||
+ | ar r lib/ | ||
+ | rm *.o | ||
+ | cp / | ||
+ | </ | ||
+ | |||
+ | Now you have a '' | ||
+ | |||
+ | ==== gcc and myptatlas ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=gcc_myptatlas | ||
+ | packages=' | ||
+ | libs=" | ||
+ | f90flags='' | ||
+ | </ | ||
+ | |||
+ | Parallel threads will dynamically use all the cores available at compile time (24), but only if the problem size indicates they will help. | ||
+ | |||
+ | ==== pgi and acml ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=pgi_acml | ||
+ | packages=scalapack/ | ||
+ | libs=" | ||
+ | f90flags='' | ||
+ | </ | ||
+ | |||
+ | ==== intel and mkl ==== | ||
+ | |||
+ | <code bash> | ||
+ | name=intel_mkl | ||
+ | packages=openmpi/ | ||
+ | libs=" | ||
+ | f90flags=" | ||
+ | </ | ||
+ | |||
+ | ===== Results N=4000 ===== | ||
+ | |||
+ | ==== BLOCK=1000, NPROCS=16 ==== | ||
+ | |||
+ | Each test is repeated three times. | ||
+ | ^ File name ^ Time ^ | ||
+ | | gcc4000.o91943 | Elapsed time = 0.613728D+05 milliseconds | | ||
+ | | gcc4000.o92019 | Elapsed time = 0.862935D+05 milliseconds | | ||
+ | | gcc4000.o92030 | Elapsed time = 0.826695D+05 milliseconds | | ||
+ | | gcc_atlas4000.o91945 | Elapsed time = 0.386161D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92023 | Elapsed time = 0.433195D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92035 | Elapsed time = 0.424980D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92009 | Elapsed time = 0.448106D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92026 | Elapsed time = 0.461706D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92032 | Elapsed time = 0.441593D+04 milliseconds | | ||
+ | | intel_mkl4000.o91611 | Elapsed time = 0.222194D+05 milliseconds | | ||
+ | | intel_mkl4000.o92016 | Elapsed time = 0.215223D+05 milliseconds | | ||
+ | | intel_mkl4000.o92039 | Elapsed time = 0.214088D+05 milliseconds | | ||
+ | | pgi_acml4000.o91466 | ||
+ | | pgi_acml4000.o92017 | ||
+ | | pgi_acml4000.o92040 | ||
+ | |||
+ | ==== BLOCK=800, NPROCS=25 ==== | ||
+ | |||
+ | Each test is repeated three times. | ||
+ | ^ File name ^ Time ^ | ||
+ | | gcc4000.o92335 | Elapsed time = 0.638246D+05 milliseconds | | ||
+ | | gcc4000.o92386 | Elapsed time = 0.633060D+05 milliseconds | | ||
+ | | gcc4000.o92412 | Elapsed time = 0.629561D+05 milliseconds | | ||
+ | | gcc_atlas4000.o92336 | Elapsed time = 0.314615D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92389 | Elapsed time = 0.358208D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92413 | Elapsed time = 0.334147D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92337 | Elapsed time = 0.363176D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92390 | Elapsed time = 0.306922D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92414 | Elapsed time = 0.333779D+04 milliseconds | | ||
+ | | intel_mkl4000.o92339 | Elapsed time = 0.433877D+05 milliseconds | | ||
+ | | intel_mkl4000.o92393 | Elapsed time = 0.400862D+05 milliseconds | | ||
+ | | intel_mkl4000.o92417 | Elapsed time = 0.409855D+05 milliseconds | | ||
+ | | pgi_acml4000.o92338 | Elapsed time = 0.234248D+04 milliseconds | | ||
+ | | pgi_acml4000.o92392 | Elapsed time = 0.276856D+04 milliseconds | | ||
+ | | pgi_acml4000.o92415 | Elapsed time = 0.211567D+04 milliseconds | | ||
+ | ==== BLOCK=500, NPROCS=64 ==== | ||
+ | |||
+ | Each test is repeated three times. | ||
+ | ^ File name ^ Time ^ | ||
+ | | gcc4000.o92123 | Elapsed time = 0.284893D+05 milliseconds | | ||
+ | | gcc4000.o92144 | Elapsed time = 0.278744D+05 milliseconds | | ||
+ | | gcc4000.o92150 | Elapsed time = 0.289137D+05 milliseconds | | ||
+ | | gcc_atlas4000.o92130 | Elapsed time = 0.296471D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92142 | Elapsed time = 0.264463D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92148 | Elapsed time = 0.269103D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92133 | Elapsed time = 0.280457D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92138 | Elapsed time = 0.312135D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92153 | Elapsed time = 0.286337D+04 milliseconds | | ||
+ | | intel_mkl4000.o92134 | Elapsed time = 0.436288D+05 milliseconds | | ||
+ | | intel_mkl4000.o92140 | Elapsed time = 0.413780D+05 milliseconds | | ||
+ | | intel_mkl4000.o92152 | Elapsed time = 0.401095D+05 milliseconds | | ||
+ | | pgi_acml4000.o92137 | Elapsed time = 0.234475D+04 milliseconds | | ||
+ | | pgi_acml4000.o92145 | Elapsed time = 0.214514D+04 milliseconds | | ||
+ | | pgi_acml4000.o92149 | Elapsed time = 0.293480D+04 milliseconds | | ||
+ | |||
+ | ==== BLOCK=250, NPROCS=256 ==== | ||
+ | |||
+ | Each test is repeated three times. | ||
+ | ^ File name ^ Time ^ | ||
+ | | gcc4000.o92164 | Elapsed time = 0.148302D+05 milliseconds | | ||
+ | | gcc4000.o92168 | Elapsed time = 0.144862D+05 milliseconds | | ||
+ | | gcc4000.o92317 | Elapsed time = 0.160144D+05 milliseconds | | ||
+ | | gcc_atlas4000.o92167 | Elapsed time = 0.785104D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92171 | Elapsed time = 0.749285D+04 milliseconds | | ||
+ | | gcc_atlas4000.o92318 | Elapsed time = 0.798376D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92165 | Elapsed time = 0.797618D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92222 | Elapsed time = 0.792745D+04 milliseconds | | ||
+ | | gcc_myatlas4000.o92320 | Elapsed time = 0.720193D+04 milliseconds | | ||
+ | | intel_mkl4000.o92162 | Elapsed time = 0.636915D+05 milliseconds | | ||
+ | | intel_mkl4000.o92169 | Elapsed time = 0.733785D+05 milliseconds | | ||
+ | | intel_mkl4000.o92324 | Elapsed time = 0.653791D+05 milliseconds | | ||
+ | | pgi_acml4000.o92161 | Elapsed time = 0.740457D+04 milliseconds | | ||
+ | | pgi_acml4000.o92170 | Elapsed time = 0.733668D+04 milliseconds | | ||
+ | | pgi_acml4000.o92322 | Elapsed time = 0.769606D+04 milliseconds | | ||
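The per-run times above are printed in Fortran D-exponent notation and in milliseconds, while the summary reports averages in seconds. A minimal sketch of that conversion follows, assuming each job output file contains a line of the form shown in the tables. The `avg_seconds` helper is hypothetical and is not part of the benchmark.

```shell
#!/bin/bash
# Sketch: average "Elapsed time" lines and convert milliseconds to seconds.
avg_seconds() {
    awk '/Elapsed time/ {
             t = $4              # e.g. 0.148302D+05
             sub(/D/, "E", t)    # Fortran D exponent -> E exponent
             sum += t; n++
         }
         END { if (n > 0) printf "%.2f\n", sum / n / 1000 }'
}

# The three gcc runs from the BLOCK=250 table, fed in as sample input.
AVG=$(printf '%s\n' \
    "Elapsed time = 0.148302D+05 milliseconds" \
    "Elapsed time = 0.144862D+05 milliseconds" \
    "Elapsed time = 0.160144D+05 milliseconds" | avg_seconds)
echo "$AVG"
```

On real output the same helper could be fed with something like `cat gcc4000.o* | avg_seconds`, provided only the runs for one block size are in the directory.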
+ | |||
+ | ===== Summary ===== | ||
+ | ==== 4000 x 4000 matrix ==== | ||
+ | === Time to solve linear system === | ||
+ | |||
+ | A randomly generated matrix is solved using ScaLAPACK with different block sizes. | ||
+ | The times are the average elapsed time in seconds, as reported by '' | ||
+ | ^ Test ^ N=4000 | ||
+ | ^ name ^ np=16 ^ np=25 ^ np=64 ^ np=256 ^ | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | |||
+ | |||
+ | There is not much difference between '' | ||
+ | |||
+ | === Speedup === | ||
+ | |||
+ | The speedup for '' | ||
+ | |||
+ | {{: | ||
+ | |||
+ | |||
+ | ==== 16000 x 16000 matrix ==== | ||
+ | === Time to solve linear system === | ||
+ | |||
+ | A randomly generated matrix is solved using ScaLAPACK with different block sizes. | ||
+ | The times are the average elapsed time in seconds, as reported by '' | ||
+ | ^ Test ^ N=16000 | ||
+ | ^ name^ np=16^ np=64^ np=256 ^ | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | | [[# | ||
+ | |||
+ | === Speedup === | ||
+ | |||
+ | Speedup for ATLAS, MKL and ACML compared to the reference GCC with no optimized library. | ||
+ | |||
+ | {{: | ||
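For reference, speedup here is taken to mean the reference time divided by the optimized time, which is an assumption based on the sentence above. A small sketch with a hypothetical `speedup` helper, using averages computed from the N=4000, BLOCK=500 tables as sample values:

```shell
#!/bin/bash
# Sketch: speedup = reference time / optimized time (same units for both).
speedup() {
    awk -v ref="$1" -v opt="$2" 'BEGIN { printf "%.2f\n", ref / opt }'
}

# Averages of three runs each, in milliseconds (N=4000, BLOCK=500):
# gcc = 28425.8, gcc_atlas = 2766.79
S=$(speedup 28425.8 2766.79)
echo "$S"
```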
+ | |||
+ | |||
+ | === Time plot === | ||
+ | |||
+ | Elapsed time for ATLAS, MKL and ACML. | ||
+ | |||
+ | {{: | ||
+ | |||