| |
technical:slurm:darwin:auto_tmpdir [2020-03-12 11:44] – external edit 127.0.0.1 | technical:slurm:darwin:auto_tmpdir [2021-01-06 12:37] (current) – frey |
---|
| ====== The auto_tmpdir Plugin for Per-job Temporary Directories ====== |
| |
| Many jobs that run on a cluster create //temporary files// of some kind: |
| |
| * Open MPI stores information regarding the topology and configuration of the worker processes in ''$TMPDIR'' |
| * VALET stores environment snapshots (on each ''vpkg_require'') in ''$TMPDIR'' |
| * The PSM2 library (for accelerated MPI over Intel OPA) and the Open MPI VADER plugin create shared memory segments in ''/dev/shm'' |
| * Some programs are hard-coded to use the ''/tmp'' directory directly (not ''$TMPDIR'') |
| |
| It is helpful to have Slurm automatically manage the lifetime of temporary files associated with jobs running on the cluster. |
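| |
| The difference between a program that honors ''$TMPDIR'' and one that hard-codes ''/tmp'' can be seen in a few lines of Python. This is a minimal sketch: the ''job-demo-'' directory merely stands in for a per-job directory, and the file names are hypothetical. |
| |
```python
import os
import tempfile

# Stand-in for a per-job directory; a scheduler would normally set TMPDIR itself.
job_tmp = tempfile.mkdtemp(prefix="job-demo-")
os.environ["TMPDIR"] = job_tmp
tempfile.tempdir = None  # drop the cached default so TMPDIR is consulted again

# A well-behaved program: the tempfile module consults TMPDIR.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    respectful_path = handle.name

# A hard-coded program: this path ignores TMPDIR entirely.
hardcoded_path = os.path.join("/tmp", "hardcoded-demo.dat")

print("honors TMPDIR:", os.path.dirname(respectful_path) == job_tmp)
print("hard-coded path inside job dir:", hardcoded_path.startswith(job_tmp))

# Clean up the demonstration files.
os.remove(respectful_path)
os.rmdir(job_tmp)
```
| |
| Only the first kind of program is contained by setting ''$TMPDIR''; the second kind needs a different mechanism. |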
| |
| ==== Bind mounts ==== |
| |
| The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by //path//. That //namespace// is a single tree, unlike Windows, where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, the directories and files on each storage component are instead //mounted// at some position within the single VFS tree. On DARWIN, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes. Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all of which is transparent to the user who is accessing the file system. |
| |
| A //bind mount// in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. ''/tmp/job-8451'') were bind-mounted at ''/tmp'', then programs that use ''/tmp'' rather than ''$TMPDIR'' would not leave files in ''/tmp'' when they crash; the files would be in ''/tmp/job-8451'' instead. By the same token, bind-mounting ''/dev/shm/job-8451'' at ''/dev/shm'' would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs. |
| |
| Unfortunately, once ''/tmp/job-8451'' is bind-mounted at ''/tmp'', every program on the node (including Slurm itself) will store its temporary files in ''/tmp/job-8451''. If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at ''/tmp'': |
| |
| ^Position^Job^What's actually mounted^ |
| |1|8451|''/tmp/job-8451''| |
| |2|8456|''/tmp/job-8451/job-8456''| |
| |3|8460|''/tmp/job-8451/job-8456/job-8460''| |
| |
| Once job 8456 has altered what's mounted at ''/tmp'', job 8451 will no longer see the temporary files it created in ''/tmp/job-8451'' and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in ''/tmp'' before job 8451 started. |
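| |
| The stacking effect in the table can be reproduced with a toy model. The sketch below is not how the kernel or Slurm tracks mounts: it assumes a dictionary (''shared_binds'') stands in for the node's single mount table, and the helpers ''resolve'' and ''start_job'' are hypothetical names used only for illustration. |
| |
```python
# Toy model only: a dict stands in for the node's single, shared mount table.
# The names shared_binds, resolve, and start_job are illustrative, not real APIs.

def resolve(binds, path):
    """Map a path as seen by programs to the directory that actually backs it."""
    for mountpoint, backing_dir in binds.items():
        if path == mountpoint or path.startswith(mountpoint + "/"):
            return backing_dir + path[len(mountpoint):]
    return path


def start_job(binds, job_id):
    """Create the job's directory under the visible /tmp, then bind-mount it at /tmp."""
    backing_dir = resolve(binds, f"/tmp/job-{job_id}")
    binds["/tmp"] = backing_dir
    return backing_dir


shared_binds = {}  # one table for every program on the node
mounted = [start_job(shared_binds, job_id) for job_id in (8451, 8456, 8460)]
for position, backing_dir in enumerate(mounted, start=1):
    print(position, backing_dir)
```
| |
| The three printed lines match the rows of the table: each job's directory is created inside whatever the previous job mounted at ''/tmp''. |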
| |
| ==== Namespaces to the rescue ==== |
| |
| For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node. Linux //mount namespaces// are exactly that: |
| |
| * every program that executes starts with its parent's VFS tree |
| * if the program has appropriate privileges, it can clone that initial VFS tree |
| * storage components subsequently mounted/unmounted by the program only affect its own VFS tree |
| |
| For Slurm jobs this equates to: |
| |
| - When the job starts, the plugin creates a per-job temporary directory (''/tmp/job-8451'') |
| - The plugin clones the VFS tree (now has a private mount namespace) |
| - The plugin bind-mounts ''/tmp/job-8451'' as ''/tmp'' (''/tmp/job-8451'' no longer visible to this program) |
| - When the job ends, the plugin unmounts ''/tmp'' (''/tmp/job-8451'' is again visible to this program) |
| - The plugin removes ''/tmp/job-8451'' |
| |
| With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained. |
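| |
| The five steps can be sketched with a toy model in which each job works on a private copy of the mount table. This is an illustration only, assuming a dictionary stands in for a mount namespace; the ''ToyJob'' class and its methods are hypothetical names. On a real Linux system the clone and bind steps correspond to the ''unshare(2)'' system call with ''CLONE_NEWNS'' and the ''mount(2)'' system call with ''MS_BIND''. |
| |
```python
# Toy model only: each dict stands in for one mount namespace.
# ToyJob and its methods are illustrative names, not part of Slurm or the plugin.

class ToyJob:
    def __init__(self, job_id, host_binds, host_dirs):
        self.job_dir = f"/tmp/job-{job_id}"
        self.host_dirs = host_dirs
        self.host_dirs.add(self.job_dir)       # step 1: create the per-job directory
        self.binds = dict(host_binds)          # step 2: clone the mount table
        self.binds["/tmp"] = self.job_dir      # step 3: bind-mount it at /tmp

    def resolve(self, path):
        """Map a path seen inside the job to the directory that backs it."""
        for mountpoint, backing_dir in self.binds.items():
            if path == mountpoint or path.startswith(mountpoint + "/"):
                return backing_dir + path[len(mountpoint):]
        return path

    def finish(self):
        del self.binds["/tmp"]                 # step 4: unmount /tmp
        self.host_dirs.discard(self.job_dir)   # step 5: remove the per-job directory


host_binds = {}      # the node's own mount table, never modified by jobs
host_dirs = set()    # directories that exist in the node's real /tmp
job_a = ToyJob(8451, host_binds, host_dirs)
job_b = ToyJob(8456, host_binds, host_dirs)

print(job_a.resolve("/tmp/scratch.dat"))   # backed by /tmp/job-8451
print(job_b.resolve("/tmp/scratch.dat"))   # backed by /tmp/job-8456

job_a.finish()
job_b.finish()
```
| |
| Because each job changes only its own copy, the second job's directory is no longer nested inside the first, and the node's own table stays untouched. |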
| |
| ===== The auto_tmpdir plugin ===== |
| |
| The **auto_tmpdir** Slurm plugin creates the following paths and bind-mounts them: |
| |
| ^Directory created^Bind mountpoint^ |
| |''/tmp/job-«job-id»''| | |
| |''/tmp/job-«job-id»/tmp''|''/tmp''| |
| |''/tmp/job-«job-id»/var_tmp''|''/var/tmp''| |
| |''/dev/shm/job-«job-id»''|''/dev/shm''| |
| |
| ==== Shared tmpdir ==== |
| |
| In some cases the user may want the job's ''/tmp'' directory to be shared by all nodes participating in the job, e.g. located somewhere on ''/lustre''. The **auto_tmpdir** plugin adds a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this: |
| |
| ^Directory created^Bind mountpoint^ |
| |''/lustre/slurm/job-«job-id»''| | |
| |''/lustre/slurm/job-«job-id»/tmp''|''/tmp''| |
| |''/lustre/slurm/job-«job-id»/var_tmp''|''/var/tmp''| |
| |''/dev/shm/job-«job-id»''|''/dev/shm''| |
| |
| A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''): |
| |
| ^Directory created^Bind mountpoint^ |
| |''/lustre/slurm/job-«job-id»''| | |
| |''/lustre/slurm/job-«job-id»/«hostname»''| | |
| |''/lustre/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''| |
| |''/lustre/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''| |
| |''/dev/shm/job-«job-id»''|''/dev/shm''| |
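| |
| The three layouts can be summarized in one small function. This is a sketch that reproduces the tables on this page, not the plugin's actual code; the function ''tmpdir_layout'', its parameters, and the hostname ''r00n00'' are hypothetical. |
| |
```python
# Illustrative helper: reproduces the directory tables, not the plugin's code.
# tmpdir_layout, its parameters, and the example hostname are hypothetical.

def tmpdir_layout(job_id, shared=False, per_node=False, hostname=None):
    """Return (directory, bind mountpoint or None) pairs for one node of a job."""
    base = f"/lustre/slurm/job-{job_id}" if shared else f"/tmp/job-{job_id}"
    layout = [(base, None)]
    if shared and per_node:
        if hostname is None:
            raise ValueError("per-node layout needs a hostname")
        base = f"{base}/{hostname}"
        layout.append((base, None))
    layout.append((f"{base}/tmp", "/tmp"))
    layout.append((f"{base}/var_tmp", "/var/tmp"))
    # The tables list the same /dev/shm entry in every mode.
    layout.append((f"/dev/shm/job-{job_id}", "/dev/shm"))
    return layout


for directory, mountpoint in tmpdir_layout(8451, shared=True, per_node=True, hostname="r00n00"):
    print(directory, "->", mountpoint or "(not mounted)")
```
| |
| Calling ''tmpdir_layout(8451)'' with no other arguments yields the node-local layout, and ''shared=True'' alone yields the layout for the plain ''--use-shared-tmpdir'' flag. |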
| |
| When using the ''--use-shared-tmpdir'' flag, the plugin can also be asked to //not// remove the directories when the job exits by including the ''--no-rm-tmpdir'' flag. |
| |
| <WRAP center round important 60%> |
| The ''--no-rm-tmpdir'' flag should be used very cautiously, since leaving files behind on ''/lustre/scratch'' will consume capacity on that file system. A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and the job goes into production, the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags. |
| </WRAP> |
| |
| ===== Source code ===== |
| |
| The source code for the **auto_tmpdir** plugin is publicly available on [[https://github.com/jtfrey/auto_tmpdir/|GitHub]]. |