software:matlab:shared-node

software:matlab:shared-node [2018-12-03 11:19] – [Computational models for running Matlab on a shared cluster] anita
software:matlab:shared-node [2019-08-27 17:00] (current) – [Computational models for running Matlab on a shared cluster] anita
 +====== Computational models for running Matlab on a shared cluster ======
 +By default, Matlab uses multiple computational threads.  From the MATLAB R2011b documentation:
  
 +<code>
 +matlab -singleCompThread limits MATLAB to a single computational thread. 
 +By default, MATLAB makes use of the multithreading capabilities of the 
 +computer on which it is running.
 +</code>
 +
 +The default, multiple computational threads, is never a good option when you are sharing a node.
 +Either use the ''-singleCompThread'' option when you start MATLAB, or request exclusive access to the node through the job scheduler on that cluster: the ''-l exclusive=1'' option for Grid Engine or ''#SBATCH --exclusive'' for Slurm.
 +
 +<note important>
 +Using a node with exclusive access does not mean MATLAB will use all the cores and memory.  You should
 +monitor the job to see its actual memory and core requirements.  To take advantage of the multiple cores you must use the
 +built-in matrix functions.  You should see your CPU utilization rise above 100% while the matrix functions
 +are being executed.
 +</note>
 +
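 +The choice between the two start-up modes can be made inside the batch script.  The following is a minimal sketch, not a site-supported script: the function name ''matlab_command'' is hypothetical, ''NSLOTS'' is the slot count Grid Engine sets inside a job, and the command is only printed so the sketch runs without MATLAB.
 +
```shell
# Sketch: pick MATLAB start-up options from the number of granted slots.
# Usage: matlab_command [SLOTS]
# SLOTS defaults to $NSLOTS (set by Grid Engine inside a job), else 1.
matlab_command() {
    slots="${1:-${NSLOTS:-1}}"
    opts="-nodisplay -nojvm"
    if [ "$slots" -eq 1 ]; then
        # One slot: keep MATLAB on a single computational thread.
        opts="$opts -singleCompThread"
    fi
    # Print the command instead of running it (hypothetical script name).
    printf "matlab %s -r 'script'\n" "$opts"
}

matlab_command 1
matlab_command 4
```
 +
 +For more than one slot this sketch leaves the thread count to MATLAB and to the core binding applied by the scheduler.
 +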
 +Matlab can, with the distributed computing toolbox, create a parallel pool of workers to be dispatched
 +in parallel.
 +
 +===== Multiple computational threads on one node =====
 +Matlab makes use
 +of the multithreading capabilities of the computer on which it is running.  Matlab uses MKL as its BLAS and LAPACK backend.  The versions can be determined with the Matlab commands:
 +<code>
 +version -blas
 +version -lapack
 +</code> 
 +To make full use of the MKL computational threads you need to use the built-in matrix functions.  The work needed to execute a built-in function will be distributed to multiple cores using MKL threads, which are compatible with OpenMP threads.  All
 +the cores share the same memory, so this is also called the shared memory model for parallel computing.  A simple model of how
 +the total Matlab job performs on a 20-core node, where ''p'' is the fraction of the wall clock time spent in multithreaded code, is
 +
 +<code>
 +  CPU = (p*20 + (1-p))*WALL
 +</code>
 +
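 +The model can be inverted to estimate the parallel fraction from accounting data: dividing by WALL gives CPU/WALL = 1 + 19*p, so p = (CPU/WALL - 1)/19.  The sketch below assumes that reading of the formula; the function name ''parallel_fraction'' is hypothetical, and the sample numbers are the ''cpu'' and ''ru_wallclock'' values from the exclusive-access job reported on this page.
 +
```shell
# parallel_fraction CPU WALL [CORES]
# Solves CPU = (p*CORES + (1-p))*WALL for p; CORES defaults to 20.
parallel_fraction() {
    awk -v cpu="$1" -v wall="$2" -v n="${3:-20}" \
        'BEGIN { printf "%.3f\n", (cpu / wall - 1) / (n - 1) }'
}

# cpu and ru_wallclock (seconds) of job 425422
parallel_fraction 53089.736 8037.427
```
 +
 +For that job the estimate is about 0.295, that is, under this simple model roughly 30% of the wall clock time was spent with all 20 cores busy.
 +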
 +
 +The actual number of computational threads is not explicitly stated in the Unix documentation.  For Windows, the documentation specifies that Matlab will use all the cores on the machine.  This is clearly not appropriate for Unix clusters.  Observations on Mills show that Matlab may use all the cores, but on average uses far fewer.  To use more than one core the Matlab job must be written to use the standard high performance libraries (MKL) linked into the Matlab executable.  This works 
 +well, but is not optimized for the Mills processors or threading libraries.
 +
 +==== Test batch jobs using GridEngine ====
 +
 +Several copies of the same MATLAB script were submitted to run simultaneously.  The only variation was in the batch script directives.
 +
 +
 +=== Batch job with exclusive access (only job on node) ===
 +
 +Part of batch script file:<code>
 +$ tail -4 batche.qs
 +#$ -l exclusive=1
 +
 +vpkg_require matlab/r2014b
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o425422
 +[CGROUPS] No /cgroup/memory/UGE/425422.1 exists for this job
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting none bytes (vmem none bytes) on n171 (master)
 +[CGROUPS]   with 20 cores = 
 +[CGROUPS] done.
 +</code>
 +
 +
 +Memory and timing results:<code>
 +$ qacct -h n171 -j 425422 | egrep '(start|maxvmem|maxrss|cpu|wallclock|failed)'
 +start_time   02/16/2016 13:52:16.213
 +failed          
 +ru_wallclock 8037.427     
 +ru_maxrss    658584              
 +cpu          53089.736    
 +maxvmem      2.882G
 +maxrss       644.949M
 +</code>
 +
 +=== Batch job with 5 slots 370 MB per core (1.85 GB total) ===
 +
 +Part of batch script file:<code>
 +$ tail -6 batch5.qs
 +#$ -pe threads 5
 +#$ -l mem_total=1.9G
 +#$ -l m_mem_free=370M
 +
 +vpkg_require matlab/r2015a
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o428562
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting 388050944 bytes (vmem none bytes) on n139 (master)
 +[CGROUPS]   with 5 cores = 
 +[CGROUPS] done.
 +</code>
 +
 +Memory and timing results:<code>
 +$ qacct -h n139 -j 428562 | egrep '(start|maxvmem|maxrss|cpu|wallclock|failed)'
 +start_time   02/17/2016 18:22:54.254
 +failed          
 +ru_wallclock 5.297        
 +ru_maxrss    165232              
 +cpu          3.090        
 +maxvmem      1017.906M
 +maxrss       155.109M
 +</code>
 +=== Batch job with 4 slots 1 GB per core (4 GB total) ===
 +
 +Part of batch script file:<code>
 +$ cat batch.qs
 +#$ -pe threads 4
 +#$ -l m_mem_free=1G
 +
 +vpkg_require matlab/r2014b
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o418695
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting 1073741824 bytes (vmem none bytes) on n036 (master)
 +[CGROUPS]   with 4 cores = 0 2 4 6
 +[CGROUPS] done.
 +</code>
 +This is sharing the node with the previous job on cores 5-8.
 +
 +Memory and timing results:<code>
 +$ qacct -h n036 -j 418695 | egrep '(maxvmem|maxrss|cpu|wallclock|failed)'
 +failed          
 +ru_wallclock 826.759      
 +ru_maxrss    595188              
 +cpu          1629.194     
 +maxvmem      1.801G
 +maxrss       583.039M
 +</code>
 +
 +=== Batch job with 3 slots 1 GB per core (3 GB total) ===
 +
 +Part of batch script file:<code>
 +$ cat batch.qs
 +#$ -pe threads 3
 +#$ -l m_mem_free=1G
 +
 +vpkg_require matlab/r2014b
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o408597
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting 3221225472 bytes (vmem 9223372036854775807 bytes) on n039 (master)
 +[CGROUPS]   with 3 cores = 0-2
 +[CGROUPS] done.
 +</code>
 +
 +Memory and timing results:<code>
 +$ qacct -h n039 -j 408597 | egrep '(maxvmem|maxrss|cpu|wallclock)'
 +ru_wallclock 13877.991    
 +ru_maxrss    2089812             
 +cpu          90776.109    
 +maxvmem      4.180G
 +maxrss       0.000
 +</code>
 +=== Batch job with 2 slots 3.1 GB per core (6.2 GB total) ===
 +
 +3.1 GB per core on a 20 core node is 62 GB, which allows a full node of 20 slots (ten 2-slot jobs) to fit with 2 GB to spare for system overhead.
 +
 +Part of batch script file:<code>
 +$ cat batch.qs
 +#$ -pe threads 2
 +#$ -l m_mem_free=3.1G
 +
 +vpkg_require matlab/r2014b
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o408598
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting 6657200128 bytes (vmem 9223372036854775807 bytes) on n039 (master)
 +[CGROUPS]   with 2 cores = 3-4
 +[CGROUPS] done.
 +</code>
 +This job shares node n039 with the previous job (cores 0-2) and runs on cores 3-4.
 +
 +Memory and timing results:<code>
 +$ qacct -h n039 -j 408598 | egrep '(maxvmem|maxrss|cpu|wallclock)'
 +ru_wallclock 13904.972    
 +ru_maxrss    2152212             
 +cpu          92110.859    
 +maxvmem      4.208G
 +maxrss       0.000
 +</code>
 +
 +=== Batch job with 1 slot 3.1 GB per core (3.1 GB total) ===
 +
 +3.1 GB per core on a 20 core node is 62 GB, which allows 20 jobs to fit with 2 GB to spare for system overhead.
 +
 +Part of batch script file:<code>
 +$ cat batch.qs
 +#$ -l m_mem_free=3.1G
 +
 +vpkg_require matlab/r2014b
 +matlab -nodisplay -nojvm -r 'script'
 +</code>
 +
 +CGROUP report from batch output file:<code>
 +$ grep CGROUPS *.o408599
 +[CGROUPS] UD Grid Engine cgroup setup commencing
 +[CGROUPS] Setting 3328602112 bytes (vmem 9223372036854775807 bytes) on n036 (master)
 +[CGROUPS]   with 1 core = 0
 +[CGROUPS] done.
 +</code>
 +
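 +The byte limits in the CGROUP reports can be checked against the memory requested in the batch scripts.  A small sketch, assuming the scheduler treats 1G as 2^30 bytes; the function name ''bytes_to_gib'' is hypothetical.
 +
```shell
# bytes_to_gib BYTES
# Converts a cgroup byte limit to GiB (2^30 bytes), two decimals.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

bytes_to_gib 3221225472   # limit reported for the 3-slot job
bytes_to_gib 6657200128   # limit reported for the 2-slot job
bytes_to_gib 3328602112   # limit reported for the 1-slot job
```
 +
 +These three limits come out as 3.00, 6.20 and 3.10 GiB, matching slots times ''m_mem_free'' for those jobs.
 +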
 +Memory and timing results:<code>
 +$ qacct -h n036 -j 408599 | egrep '(maxvmem|maxrss|cpu|wallclock)'
 +ru_wallclock 8607.872     
 +ru_maxrss    1935860             
 +cpu          51805.427    
 +maxvmem      4.036G
 +maxrss       0.000
 +</code>
 +
 +==== Table ====
 +
 +^ ^^ requested ^^ used memory and time ^^^
 +^ jobid ^ host ^ cores ^ memory ^ maxvmem ^ cpu (s) ^ wallclock (s) ^
 +| 408594 | n038 | all 20 | all <64GB | 4.155G | 51321.533 | 8613.132 |
 +| 408595 | n037 | 5 | 5G | 4.043G | 86578.676 | 13051.171 |
 +| 408596 | n037 | 4 | 4G | 4.301G | 86330.547 | 13067.863 |
 +| 408597 | n039 | 3 | 3G | 4.180G | 90776.109 | 13877.991 |
 +| 408598 | n039 | 2 | 6.2G | 4.208G | 92110.859 | 13904.972 |
 +| 408599 | n036 | default 1 | 3.1G | 4.036G | 51805.427 | 8607.872 |
 +
 +==== Table new spread over nodes ==== 
 +
 +^ ^^ requested ^^ used memory and time ^^^
 +^ jobid ^ host ^ cores ^ memory ^ maxvmem ^ cpu (s) ^ wallclock (s) ^
 +| 418705 | n172 | all 20 | all <64GB | 2.904G | 5553.820 | 1089.789 |
 +| 418704 | n039 | 5 | 5G | 1.874G | 1778.309 | 804.490 |
 +| 418695 | n036 | 4 | 4G | 1.801G | 1629.194 | 826.759 |
 +| 418693 | n037 | 3 | 3G | 1.735G | 1475.837 | 863.386 |
 +| 418691 | n040 | 2 | 6.2G | 1.662G | 1334.752 | 944.711 |
 +| 418690 | n038 | default 1 | 1G | 1.536G | 1164.087 | 1173.832 |
 +
 +
 +==== Table new same node ==== 
 +
 +^ ^^ requested ^^ used memory and time ^^^^
 +^ jobid ^ host ^ cores ^ memory ^ maxvmem ^ maxrss ^ cpu (s) ^ wallclock (s) ^
 +| 418768 | n172 | all 20 | all <64GB | 3.805G | 1.633G | 5246.490 | 882.568 |
 +| 418773 | n036 | 5 | 5G | 1.852G | 578.457M | 1953.868 | 930.284 |
 +| 418772 | n036 | 4 | 4G | 1.779G | 579.109M | 1800.191 | 949.475 |
 +| 418771 | n036 | 3 | 3G | 1.709G | 570.246M | 1660.543 | 996.545 |
 +| 418770 | n036 | 2 | 6.2G | 1.640G | 557.363M | 1543.664 | 1106.315 |
 +| 418769 | n036 | default 1 | 1G | 1.514G | 564.840M | 1356.694 | 1356.256 |
 +
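 +Dividing ''cpu'' by ''wallclock'' gives the average number of cores a job kept busy.  The sketch below applies this to three rows of the same-node table; the function name ''cores_used'' is hypothetical and the input columns are jobid, cpu and wallclock copied from the table.
 +
```shell
# cores_used CPU WALL
# Average number of busy cores over the run (CPU seconds / wall seconds).
cores_used() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.2f\n", cpu / wall }'
}

# jobid, cpu (s), wallclock (s) from the same-node table
while read -r jobid cpu wall; do
    printf '%s %s\n' "$jobid" "$(cores_used "$cpu" "$wall")"
done <<'EOF'
418768 5246.490 882.568
418772 1800.191 949.475
418769 1356.694 1356.256
EOF
```
 +
 +In these runs the exclusive 20-core job averaged about 5.9 busy cores, the 4-slot job about 1.9, and the single-slot job 1.0.
 +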
 +==== Graphs ==== 
 +
 +As the number of cores increases, both the CPU time and the memory usage increase linearly.  The increased memory is easy to explain by the need for //private memory//, memory that is not shared between threads.  Sometimes parallel algorithms achieve a faster wall clock time by recalculating some values, and thus the total CPU time increases.
 +
 +{{:clusters:matlab:maxeigcpu.png?nolink&640|}}
 +
 +{{:clusters:matlab:maxeigmem.png?640|}}
 +
 +Both CPU time and memory are costs of running your algorithm, since they limit the number of other users that can use the node.
 +To chart both, consider a simple cost of CPU*Memory in GB hours.  Thus we have two objectives:
 +
 +  * Reduce the run time
 +  * Reduce the cost
 +
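 +The cost measure can be computed directly from the accounting values.  This is a sketch under two assumptions that the page does not state: memory is taken as ''maxvmem'' in GB, and CPU time is converted from seconds to hours.  The function name ''cost_gb_hours'' is hypothetical.
 +
```shell
# cost_gb_hours CPU_SECONDS MEM_GB
# Cost = CPU time in hours * memory in GB (assumed: maxvmem).
cost_gb_hours() {
    awk -v cpu="$1" -v mem="$2" 'BEGIN { printf "%.3f\n", cpu / 3600 * mem }'
}

cost_gb_hours 1356.694 1.514   # single-slot job 418769
cost_gb_hours 5246.490 3.805   # exclusive 20-core job 418768
```
 +
 +Under these assumptions the single-slot job costs about 0.57 GB hours and the exclusive job about 5.5 GB hours.
 +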
 +
 +{{:clusters:matlab:maxeigcost.png?640|}}
 +
 +The two extremes on the Pareto optimization curve are both good choices.  Using all the cores on the node gives the fastest run time, and using one core is the least costly (so you can run 20 jobs simultaneously).  The 4 core job is a good compromise.
 +
 +==== Commands while running ====
 +
 +<code>
 +$ n=n182
 +</code>
 +
 +**''ps''** command
 +
 +<code>
 +$ ssh $n ps -eo pid,ruser,pcpu,pmem,thcount,stime,time,command | egrep '(COMMAND|matlab)'
 +   PID RUSER    %CPU %MEM THCNT STIME     TIME COMMAND
 + 96970 traine    182  0.8    10 13:52 05:51:25 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 + 96971 traine    160  0.8     9 13:52 05:09:03 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 + 96972 traine    119  0.8     7 13:52 03:50:15 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 + 96974 traine    141  0.8     8 13:52 04:33:14 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 + 97005 traine   99.5  0.8     5 13:52 03:11:43 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 + 97130 traine   99.4  0.8     5 13:52 03:11:27 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -singleCompThread -r script -nojvm
 +</code>
 +
 +**''ps''** command to get threads for one PID
 +
 +<code>
 +$ ssh $n ps -eLf | egrep '(PID|96970)' | grep -v ' 0  '
 +UID         PID   PPID    LWP  C NLWP STIME TTY          TIME CMD
 +traine    96970  96222  97281 95   10 13:52 ?        03:04:20 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96970  96222  97314 21   10 13:52 ?        00:41:58 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96970  96222  97315 21   10 13:52 ?        00:41:43 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96970  96222  97316 21   10 13:52 ?        00:40:54 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96970  96222  97317 22   10 13:52 ?        00:43:43 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +</code>
 +
 +<code>
 +$ ssh $n ps -eLf | egrep '(PID|96971)' | grep -v ' 0  '
 +UID         PID   PPID    LWP  C NLWP STIME TTY          TIME CMD
 +traine    96971  96223  97283 95    9 13:52 ?        03:04:30 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96971  96223  97310 21    9 13:52 ?        00:42:07 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96971  96223  97311 21    9 13:52 ?        00:41:39 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96971  96223  97312 21    9 13:52 ?        00:42:18 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +</code>
 +
 +<code>
 +$ ssh $n ps -eLf | egrep '(PID|96972)' | grep -v ' 0  '
 +UID         PID   PPID    LWP  C NLWP STIME TTY          TIME CMD
 +traine    96972  96278  97284 97    7 13:52 ?        03:09:31 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +traine    96972  96278  97308 21    7 13:52 ?        00:41:50 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +</code>
 +
 +<code>
 +$ ssh $n ps -eLf | egrep '(PID|97005)' | grep -v ' 0  '
 +UID         PID   PPID    LWP  C NLWP STIME TTY          TIME CMD
 +traine    97005  96342  97275 99    5 13:52 ?        03:13:49 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -r script -nojvm
 +</code>
 +
 +<code>
 +$ ssh $n ps -eLf | egrep '(PID|97130)' | grep -v ' 0  '
 +UID         PID   PPID    LWP  C NLWP STIME TTY          TIME CMD
 +traine    97130  96443  97282 99    5 13:52 ?        03:13:53 /home/software/matlab/r2014b/bin/glnxa64/MATLAB -nodisplay -singleCompThread -r script -nojvm
 +</code>
 +**''top''** command 
 +<code>
 +$ ssh $n top -H -b -n 1 | egrep '(COMMAND|MATLAB)' | grep -v ' 0'
 +   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND           
 + 97281 traine    20   0 1785m 577m  73m R 101.2  0.9 185:50.23 MATLAB           
 + 97276 traine    20   0 1646m 572m  73m R 101.2  0.9 185:11.74 MATLAB           
 + 97275 traine    20   0 1452m 562m  73m R 101.2  0.9 194:31.12 MATLAB           
 + 97284 traine    20   0 1572m 562m  73m R 99.2  0.9 190:36.63 MATLAB            
 + 97282 traine    20   0 1452m 562m  73m R 99.2  0.9 194:14.58 MATLAB            
 + 97283 traine    20   0 1716m 575m  73m R 85.6  0.9 185:40.60 MATLAB            
 + 97316 traine    20   0 1785m 577m  73m S 62.3  0.9  41:28.48 MATLAB            
 + 97317 traine    20   0 1785m 577m  73m S 62.3  0.9  44:10.42 MATLAB            
 + 97315 traine    20   0 1785m 577m  73m S 60.3  0.9  42:15.26 MATLAB            
 + 97314 traine    20   0 1785m 577m  73m S 58.4  0.9  42:25.24 MATLAB            
 + 97311 traine    20   0 1716m 575m  73m S 33.1  0.9  42:02.23 MATLAB            
 + 97310 traine    20   0 1716m 575m  73m S 17.5  0.9  42:29.42 MATLAB            
 + 97312 traine    20   0 1716m 575m  73m S 17.5  0.9  42:34.32 MATLAB            
 + 97308 traine    20   0 1572m 562m  73m R  9.7  0.9  41:57.47 MATLAB
 +</code>
 +**''mpstat''** command 
 +<code>         
 +$ ssh $n mpstat -P ALL 1 2
 +Linux 2.6.32-504.30.3.el6.x86_64 (n182) 02/16/2016 _x86_64_ (20 CPU)
 +
 +05:08:25 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
 +05:08:26 PM  all   48.50    0.00    0.50    0.00    0.00    0.05    0.00    0.00   50.95
 +05:08:26 PM    0   99.00    0.00    0.00    0.00    0.00    1.00    0.00    0.00    0.00
 +05:08:26 PM    1   14.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   86.00
 +05:08:26 PM    2   54.55    0.00    0.00    0.00    0.00    0.00    0.00    0.00   45.45
 +05:08:26 PM    3   50.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00   48.00
 +05:08:26 PM    4   53.47    0.00    0.99    0.00    0.00    0.00    0.00    0.00   45.54
 +05:08:26 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:26 PM    6   44.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   56.00
 +05:08:26 PM    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:26 PM    8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:26 PM    9   53.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   46.00
 +05:08:26 PM   10   74.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   26.00
 +05:08:26 PM   11   52.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   47.00
 +05:08:26 PM   12    9.00    0.00    4.00    0.00    0.00    0.00    0.00    0.00   87.00
 +05:08:26 PM   13   53.54    0.00    1.01    0.00    0.00    0.00    0.00    0.00   45.45
 +05:08:26 PM   14    0.99    0.00    0.99    0.00    0.00    0.00    0.00    0.00   98.02
 +05:08:26 PM   15   11.88    0.00    0.99    0.00    0.00    0.00    0.00    0.00   87.13
 +05:08:26 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:26 PM   17    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
 +05:08:26 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:26 PM   19   99.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00
 +
 +05:08:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
 +05:08:27 PM  all   49.50    0.00    0.55    0.00    0.00    0.00    0.00    0.00   49.95
 +05:08:27 PM    0  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:27 PM    1   12.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   88.00
 +05:08:27 PM    2   58.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00   41.00
 +05:08:27 PM    3   31.68    0.00    0.99    0.00    0.00    0.00    0.00    0.00   67.33
 +05:08:27 PM    4   63.64    0.00    0.00    0.00    0.00    0.00    0.00    0.00   36.36
 +05:08:27 PM    5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:27 PM    6   26.26    0.00    0.00    0.00    0.00    0.00    0.00    0.00   73.74
 +05:08:27 PM    7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:27 PM    8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:27 PM    9   57.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00   41.00
 +05:08:27 PM   10  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +05:08:27 PM   11   60.40    0.00    1.98    0.00    0.00    0.00    0.00    0.00   37.62
 +05:08:27 PM   12   11.00    0.00    3.00    0.00    0.00    0.00    0.00    0.00   86.00
 +05:08:27 PM   13   57.43    0.00    0.99    0.00    0.00    0.00    0.00    0.00   41.58
 +05:08:27 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:27 PM   15   12.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   88.00
 +05:08:27 PM   16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:27 PM   17    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:27 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +05:08:27 PM   19  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +
 +Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
 +Average:     all   49.00    0.00    0.53    0.00    0.00    0.03    0.00    0.00   50.45
 +Average:       0   99.50    0.00    0.00    0.00    0.00    0.50    0.00    0.00    0.00
 +Average:       1   13.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   87.00
 +Average:       2   56.28    0.00    0.50    0.00    0.00    0.00    0.00    0.00   43.22
 +Average:       3   40.80    0.00    1.49    0.00    0.00    0.00    0.00    0.00   57.71
 +Average:       4   58.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   41.00
 +Average:       5  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +Average:       6   35.18    0.00    0.00    0.00    0.00    0.00    0.00    0.00   64.82
 +Average:       7  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +Average:       8  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
 +Average:       9   55.00    0.00    1.50    0.00    0.00    0.00    0.00    0.00   43.50
 +Average:      10   87.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   13.00
 +Average:      11   56.22    0.00    1.49    0.00    0.00    0.00    0.00    0.00   42.29
 +Average:      12   10.00    0.00    3.50    0.00    0.00    0.00    0.00    0.00   86.50
 +Average:      13   55.50    0.00    1.00    0.00    0.00    0.00    0.00    0.00   43.50
 +Average:      14    0.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.00
 +Average:      15   11.94    0.00    0.50    0.00    0.00    0.00    0.00    0.00   87.56
 +Average:      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +Average:      17    0.50    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.50
 +Average:      18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
 +Average:      19   99.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00
 +</code>
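 +The ''Average:'' rows of the ''mpstat'' output can be summarized as a count of busy cores.  A minimal sketch, assuming the column layout shown above with ''%idle'' as the last column; the function name ''busy_cores'' and the 50% threshold are illustrative choices.
 +
```shell
# busy_cores [IDLE_THRESHOLD]
# Reads `mpstat -P ALL` output on stdin and counts the CPUs whose
# average %idle (last column) is below the threshold (default 50).
busy_cores() {
    awk -v thr="${1:-50}" '
        $1 == "Average:" && $2 ~ /^[0-9]+$/ && ($NF + 0) < (thr + 0) { n++ }
        END { print n + 0 }'
}

# Three per-CPU rows and the "all" row (which is skipped) from the output above
busy_cores 50 <<'EOF'
Average:     all   49.00    0.00    0.53    0.00    0.00    0.03    0.00    0.00   50.45
Average:      13   55.50    0.00    1.00    0.00    0.00    0.00    0.00    0.00   43.50
Average:      16    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      19   99.50    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00
EOF
```
 +
 +On a live node the same function could be fed with ''ssh $n mpstat -P ALL 1 2 | busy_cores''.
 +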
 +**''qhost''** command 
 +<code>
 +$ qhost -h $n
 +HOSTNAME                ARCH         NCPU NSOC NCOR NTHR NLOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
 +----------------------------------------------------------------------------------------------
 +global                  -                  -    -    -                             -
 +n182                    lx-amd64       20    2   20   20  0.38   62.8G    5.1G    2.0G   11.5M
 +
 +</code>
 +
 +
 +
 +
 +
 +
 +
 +===== Multiple distributed workers =====
 +===== Single computational threads =====
 +
 +
 +
 +===== Monitoring Tools =====
 +
 +There are several tools you can use to monitor the computational threads on your node.  In this example n093 is running several MATLAB jobs.
 +  * Ganglia (real time)  ''http://mills.hpc.udel.edu/ganglia/?c=mills.hpc&h=n093''
 +  * top
 +  * ps
 +==== Using top ====
 +
 +<code>
 +[dnairn@mills dnairn]$ ssh n093 top -b -n 1 | egrep '(COMMAND|MATLAB)'
 +  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 + 8209 matusera  20   0 12.7g 6.2g  62m S 1103.5  9.9 202:33.96 MATLAB           
 + 2622 matusera  20   0 6917m 256m  62m S  0.0  0.4   9783:37 MATLAB             
 + 4386 matusera  20   0 6928m 231m  62m S  0.0  0.4   2850:19 MATLAB             
 +14939 matusera  20   0 6926m 230m  62m S  0.0  0.4  20139:22 MATLAB             
 +16308 matusera  20   0 6930m 242m  62m S  0.0  0.4  24928:39 MATLAB
 +</code>
 +
 +==== Using ps command ====
 +
 +<code>
 +[dnairn@mills dnairn]$ ssh n093 ps -eo pid,ruser,pcpu,pmem,thcount,stime,time,command | egrep '(COMMAND|matlab)'
 +  PID RUSER    %CPU %MEM THCNT STIME  TIME       COMMAND
 + 2622 matusera 21.1  0.3    90 Jul29 6-19:03:37  /home/software/matlab/R2011b/bin/glnxa64/MATLAB
 + 4386 matusera  4.7  0.3    90 Jul19 1-23:30:19  /home/software/matlab/R2011b/bin/glnxa64/MATLAB
 + 8209 matusera 1019  6.9    90 13:18 02:34:48    /home/software/matlab/R2011b/bin/glnxa64/MATLAB
 +14939 matusera 27.3  0.3    90 Jul10 13-23:39:21 /home/software/matlab/R2011b/bin/glnxa64/MATLAB
 +16308 matusera 46.6  0.3    90 Jul24 17-07:28:38 /home/software/matlab/R2011b/bin/glnxa64/MATLAB
 +</code>
 +
 +Description of the custom column values from ps man page:
 +<code>
 +pid        PID      process ID number of the process.
 +</code>
 +<code>
 +ruser      RUSER    real user ID. This will be the textual user ID, if it can be obtained and the field
 +                    width permits, or a decimal representation otherwise.
 +</code>
 +<code>
 +%cpu       %CPU     cpu utilization of the process in "##.#" format. Currently, it is the CPU time used
 +                    divided by the time the process has been running (cputime/realtime ratio), expressed
 +                    as a percentage. It will not add up to 100% unless you are lucky. (alias pcpu).
 +</code>
 +<code>                   
 +%mem       %MEM     ratio of the process’s resident set size  to the physical memory on the machine,
 +                    expressed as a percentage. (alias pmem).
 +</code>
 +<code>
 +thcount    THCNT    see nlwp. (alias nlwp). number of kernel threads owned by the process.
 +</code>
 +<code>
 +bsdstart   START    time the command started. If the process was started less than 24 hours ago, the
 +                    output format is " HH:MM", else it is "mmm dd" (where mmm is the three letters of the
 +                    month). See also lstart, start, start_time, and stime.
 +</code>
 +<code>
 +time       TIME     cumulative CPU time, "[dd-]hh:mm:ss" format. (alias cputime).
 +</code>
 +<code>
 +args       COMMAND  command with all its arguments as a string. Modifications to the arguments may be
 +                    shown. The output in this column may contain spaces. A process marked <defunct> is
 +                    partly dead, waiting to be fully destroyed by its parent. Sometimes the process args
 +                    will be unavailable; when this happens, ps will instead print the executable name in
 +                    brackets. (alias cmd, command). See also the comm format keyword, the -f option, and
 +                    the c option.
 +                    
 +                    When specified last, this column will extend to the edge of the display. If ps can not
 +                    determine display width, as when output is redirected (piped) into a file or another
 +                    command, the output width is undefined. (it may be 80, unlimited, determined by the
 +                    TERM variable, and so on) The COLUMNS environment variable or --cols option may be
 +                    used to exactly determine the width in this case. The w or -w option may be also be
 +                    used to adjust width.
 +</code>
 +==== ps for threads ====
 +
 +Select the threads of PID 12035 that show some activity, that is, those with C not equal to 0.
 +<code>
 +[dnairn@mills dnairn]$ ssh n093 ps -eLf | egrep '(PID|12035)' | grep -v ' 0  ' 
 +UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
 +matusera 12035 11918 12082 98   90 16:39 pts/2    00:43:21 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12132 67   90 16:39 pts/2    00:29:49 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12133 67   90 16:39 pts/2    00:29:42 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12134 67   90 16:39 pts/2    00:29:43 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12135 67   90 16:39 pts/2    00:29:34 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12136 67   90 16:39 pts/2    00:29:47 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12137 67   90 16:39 pts/2    00:29:50 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12138 67   90 16:39 pts/2    00:29:48 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12139 67   90 16:39 pts/2    00:29:45 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12140 67   90 16:39 pts/2    00:29:40 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12141 67   90 16:39 pts/2    00:29:33 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +matusera 12035 11918 12142 67   90 16:39 pts/2    00:29:32 /home/software/matlab/R2011b/bin/glnxa64/MATLAB -nosplash -nodesktop
 +</code>
 +
 +Twelve of the 90 threads are doing computation.  These are the computational threads.
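 +
 +The ''grep -v'' filter depends on the exact spacing of the ''ps'' output.  A sketch of a column-based alternative follows, assuming the ''ps -eLf'' layout shown above (PID in column 2, LWP in column 4, C in column 5); the function name ''active_threads'' is hypothetical, and the last sample row is a made-up idle thread added for illustration.
 +
```shell
# active_threads PID
# Reads `ps -eLf` output on stdin and counts the threads (LWPs)
# of the given PID whose C column is greater than 0.
active_threads() {
    awk -v pid="$1" '
        $2 == pid && $4 ~ /^[0-9]+$/ && ($5 + 0) > 0 { n++ }
        END { print n + 0 }'
}

# Two rows taken from the listing above plus one hypothetical idle thread
active_threads 12035 <<'EOF'
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
matusera 12035 11918 12082 98   90 16:39 pts/2    00:43:21 MATLAB -nosplash -nodesktop
matusera 12035 11918 12132 67   90 16:39 pts/2    00:29:49 MATLAB -nosplash -nodesktop
matusera 12035 11918 12999  0   90 16:39 pts/2    00:00:00 MATLAB -nosplash -nodesktop
EOF
```
 +
 +On the cluster the input would come from ''ssh n093 ps -eLf | active_threads 12035''.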