technical:slurm:darwin:auto_tmpdir

technical:slurm:darwin:auto_tmpdir [2020-03-12 11:44] – external edit 127.0.0.1technical:slurm:darwin:auto_tmpdir [2021-01-06 12:37] (current) frey
 +====== The auto_tmpdir Plugin for Per-job Temporary Directories ======
  
 +Many jobs that run on a cluster create //temporary files// of some kind:
 +
 +  * Open MPI stores information regarding the topology and configuration of the worker processes in ''$TMPDIR''
 +  * VALET stores environment snapshots (on each ''vpkg_require'') in ''$TMPDIR''
 +  * The PSM2 library (for accelerated MPI over Intel OPA) and VADER plugin create shared memory segments in ''/dev/shm''
 +  * Some programs are hard-coded to use the ''/tmp'' directory directly (not ''$TMPDIR'')
 +
 +It is helpful to have Slurm automatically manage the lifetime of temporary files associated with jobs running on the cluster.
 +
 +==== Bind mounts ====
 +
 +The Linux virtual file system (VFS) is a hierarchy of directories and files, each referenced by a //path//.  That //namespace// is a single tree, unlike Windows, where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.).  In Linux, the directories and files on each storage component are instead //mounted// at some position within the single VFS tree.  On DARWIN, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes.  Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all of which is transparent to the user who is accessing the file system.
 +
 +A //bind mount// in Linux takes an existing directory in the VFS tree and mounts it at another path.  This seems like a solution to the problem above:  if the per-job temporary directory (e.g. ''/tmp/job-8451'') were bind-mounted at ''/tmp'', then programs that use ''/tmp'' rather than ''$TMPDIR'' would not leave files in ''/tmp'' when they crash; the files would be in ''/tmp/job-8451'' instead.  By the same token, bind-mounting ''/dev/shm/job-8451'' at ''/dev/shm'' would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.
 +
 +Unfortunately, once ''/tmp/job-8451'' is bind-mounted as ''/tmp'', every program on the node (including Slurm itself) will store its temporary files in ''/tmp/job-8451''.  If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at ''/tmp'':
 +
 +^Position^Job^What's actually mounted^
 +|1|8451|''/tmp/job-8451''|
 +|2|8456|''/tmp/job-8451/job-8456''|
 +|3|8460|''/tmp/job-8451/job-8456/job-8460''|
 +
 +Once job 8456 has altered what's mounted at ''/tmp'', job 8451 will no longer see the temporary files it created in ''/tmp/job-8451'' and the program will likely crash.  The same problem would exist for Slurm and OS programs that were using files in ''/tmp'' prior to job 8451's executing.
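 +
The nesting in the table above can be reproduced with a small model. This is only an illustrative sketch in Python, not part of the plugin: the function name ''effective_mounts'' is hypothetical, and it assumes every job asks for ''/tmp/job-«job-id»'' while sharing one mount namespace with all earlier jobs.

```python
import posixpath


def effective_mounts(job_ids, mountpoint="/tmp"):
    """Model bind mounts made by successive jobs in ONE shared mount namespace.

    Each job asks for <mountpoint>/job-<id>, but that path is resolved
    through whatever is already mounted at <mountpoint>, so the real
    backing directory nests one level deeper with every job.
    """
    backing = mountpoint  # the directory really behind the mountpoint right now
    history = []
    for job_id in job_ids:
        backing = posixpath.join(backing, f"job-{job_id}")
        history.append((job_id, backing))
    return history


for position, (job_id, backing) in enumerate(effective_mounts([8451, 8456, 8460]), 1):
    print(position, job_id, backing)
```

Running it prints the three rows of the table: each new job's directory ends up inside the directory of the job that started before it.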
 +
 +==== Namespaces to the rescue ====
 +
 +For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node.  Linux //mount namespaces// are exactly that:
 +
 +  * every program that executes starts with its parent's VFS tree
 +  * if the program has appropriate privileges, it can clone that initial VFS tree
 +  * storage components subsequently mounted/unmounted by the program only affect its own VFS tree
 +
 +For Slurm jobs this equates to:
 +
 +  - When the job starts, the plugin creates a per-job temporary directory (''/tmp/job-8451'')
 +  - The plugin clones the VFS tree (now has a private mount namespace)
 +  - The plugin bind-mounts ''/tmp/job-8451'' as ''/tmp'' (''/tmp/job-8451'' no longer visible to this program)
 +  - When the job ends, the plugin unmounts ''/tmp'' (''/tmp/job-8451'' is again visible to this program)
 +  - The plugin removes ''/tmp/job-8451''
 +
 +With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained.
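 +
The five steps listed above can be sketched with the underlying Linux system calls. This is a minimal illustration in Python via ''ctypes'', not the plugin's actual code: the helper names (''job_tmpdir'', ''enter_private_tmp'', ''leave_private_tmp'') are hypothetical, ownership and permissions of the per-job directory are ignored, and the two mount-related functions must run as root on Linux. Marking ''/'' as private after ''unshare'' is an assumption made here so the bind mount does not propagate back to the node's main namespace.

```python
import ctypes
import os
import shutil

# Constants from <sched.h> and <sys/mount.h> on Linux.
CLONE_NEWNS = 0x00020000  # unshare(): give this process its own mount namespace
MS_BIND = 4096            # mount(): bind an existing directory at another path
MS_REC = 16384            # mount(): apply recursively to the whole subtree
MS_PRIVATE = 1 << 18      # mount(): do not propagate mount events


def job_tmpdir(job_id, base="/tmp"):
    """Path of the per-job directory, e.g. /tmp/job-8451."""
    return os.path.join(base, f"job-{job_id}")


def _libc():
    libc = ctypes.CDLL(None, use_errno=True)
    libc.unshare.argtypes = [ctypes.c_int]
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    libc.umount.argtypes = [ctypes.c_char_p]
    return libc


def _check(result, what):
    if result != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def enter_private_tmp(job_id, mountpoint="/tmp"):
    """Steps 1-3: create the directory, clone the VFS tree, bind-mount."""
    libc = _libc()
    source = job_tmpdir(job_id, mountpoint)
    os.makedirs(source, exist_ok=True)
    _check(libc.unshare(CLONE_NEWNS), "unshare")
    # Keep later mounts from propagating out of the new namespace.
    _check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None), "make private")
    _check(libc.mount(source.encode(), mountpoint.encode(), None, MS_BIND, None), "bind mount")
    return source


def leave_private_tmp(job_id, mountpoint="/tmp"):
    """Steps 4-5: unmount, then remove the per-job directory."""
    _check(_libc().umount(mountpoint.encode()), "umount")
    shutil.rmtree(job_tmpdir(job_id, mountpoint), ignore_errors=True)
```

A process that calls ''enter_private_tmp(8451)'' and then launches the job's program passes its private namespace on to that program, while every other process on the node continues to see the original ''/tmp''.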
 +
 +===== The auto_tmpdir plugin =====
 +
 +The **auto_tmpdir** Slurm plugin creates the following paths and bind-mounts them:
 +
 +^Directory created^Bind mountpoint^
 +|''/tmp/job-«job-id»''| |
 +|''/tmp/job-«job-id»/tmp''|''/tmp''|
 +|''/tmp/job-«job-id»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
 +
 +==== Shared tmpdir ====
 +
 +In some cases the user may want the ''/tmp'' directory for the job to be shared by all nodes participating in the job, e.g. somewhere on ''/lustre''.  The **auto_tmpdir** plugin adds a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this:
 +
 +^Directory created^Bind mountpoint^
 +|''/lustre/slurm/job-«job-id»''| |
 +|''/lustre/slurm/job-«job-id»/tmp''|''/tmp''|
 +|''/lustre/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
 +
 +A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''):
 +
 +^Directory created^Bind mountpoint^
 +|''/lustre/slurm/job-«job-id»''| |
 +|''/lustre/slurm/job-«job-id»/«hostname»''| |
 +|''/lustre/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
 +|''/lustre/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
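 +
The three layouts tabulated above (node-local, shared, and shared per-node) follow one pattern, which the following Python sketch summarizes. It is an illustration only: the function ''tmpdir_layout'', its ''mode'' values, and the example hostname ''r1n00'' are hypothetical, while the directory names are taken from the tables.

```python
import posixpath


def tmpdir_layout(job_id, mode="local", hostname=None, shared_base="/lustre/slurm"):
    """Map each bind mountpoint to the directory mounted there.

    mode: "local"    -> no --use-shared-tmpdir flag
          "shared"   -> --use-shared-tmpdir
          "per-node" -> --use-shared-tmpdir=per-node (needs hostname)
    """
    job = f"job-{job_id}"
    if mode == "local":
        root = posixpath.join("/tmp", job)
    elif mode == "shared":
        root = posixpath.join(shared_base, job)
    elif mode == "per-node":
        if not hostname:
            raise ValueError("per-node mode needs a hostname")
        root = posixpath.join(shared_base, job, hostname)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "/tmp": posixpath.join(root, "tmp"),
        "/var/tmp": posixpath.join(root, "var_tmp"),
        # /dev/shm is always backed by node-local shared memory.
        "/dev/shm": posixpath.join("/dev/shm", job),
    }


for mountpoint, directory in tmpdir_layout(8451, "per-node", hostname="r1n00").items():
    print(mountpoint, "<-", directory)
```

Only the directories behind ''/tmp'' and ''/var/tmp'' move to the shared file system; ''/dev/shm'' stays on the node in all three modes.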
 +
 +When using the ''--use-shared-tmpdir'' flag, the plugin can also be asked to //not// remove the directories when the job exits by including the ''--no-rm-tmpdir'' flag.
 +
 +<WRAP center round important 60%>
 +The ''--no-rm-tmpdir'' flag should be used very cautiously, since leaving files behind on ''/lustre/scratch'' will consume capacity on that file system.  A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage.  Once that behavior is debugged and goes into production the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags.
 +</WRAP>
 +
 +===== Source code =====
 +
 +The source code for the **auto_tmpdir** plugin is publicly available on [[https://github.com/jtfrey/auto_tmpdir/|GitHub]].