====== 2019 Farber Job Stall ======

This document summarizes a performance issue reported after the annual cluster maintenance was performed in January 2019. It outlines proposed actions to mitigate the issue.

===== The Issue =====

In March 2019 a workgroup reported ongoing issues with jobs not completing within wall time limits that had been sufficient prior to the January annual maintenance.

  * LAMMPS has a built-in timer facility that causes the program to exit cleanly once a given amount of wall time has elapsed; the timer was not triggering an exit on the failed jobs, though (see the sketch after this list).
  * Ganglia monitoring showed that the InfiniBand network was in use throughout the jobs, successful or failed, at similar transmission and reception rates.
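
For reference, the timer facility in question is normally enabled by a ''timer timeout'' command in the LAMMPS input script. The sketch below is an illustration only of how such a limit can be set via LAMMPS's Python module; the time limit and the input file name are placeholders, not the workgroup's actual inputs.

<code python>
#!/usr/bin/env python3
"""Illustration only: a LAMMPS run that imposes its own wall-time limit."""
from lammps import lammps   # LAMMPS Python module

lmp = lammps()

# Ask LAMMPS to wind down cleanly once 1 hour of wall time has elapsed,
# checking the clock every 100 timesteps.
lmp.command("timer timeout 1:00:00 every 100")

# Placeholder input deck; the workgroup's actual scripts are not shown here.
lmp.file("in.production")
</code>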

When a failed job was resubmitted, it would run successfully in the allotted time, so the job did not simply require more processing time than expected.

When additional data was gathered from the workgroup, it was found that GROMACS jobs were also experiencing occasional random failures.

===== Updated Kernel and Microcode =====

The OS update applied to Farber as part of January's maintenance included an updated kernel, but the matching processor microcode update was not part of the compute node boot image, so the compute nodes continued to run with the older microcode.
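
Which microcode revision a node is actually running can be checked directly. The short sketch below (not part of the original procedure) reads the per-CPU ''microcode'' field from ''/proc/cpuinfo''; the reported revision can be compared across nodes or against the revision shipped in the updated boot image.

<code python>
#!/usr/bin/env python3
"""Minimal sketch: report the CPU microcode revision(s) on this node."""
from collections import Counter

revisions = Counter()
with open("/proc/cpuinfo") as cpuinfo:
    for line in cpuinfo:
        # Lines of interest look like:  "microcode   : 0x42e"
        if line.startswith("microcode"):
            revisions[line.split(":", 1)[1].strip()] += 1

for rev, ncpus in revisions.items():
    print(f"microcode revision {rev} on {ncpus} logical CPU(s)")
</code>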

The compute node boot image had the microcode update added to it, and 12 nodes owned by IT were rebooted. These nodes were made available to:

  * Three users from the workgroup that reported the issues
  * Standby queues

Over the course of two weeks, those users funneled a series of jobs that had been experiencing random failures through the 12 nodes.

Though no direct evidence (traces, monitoring) could be gathered to conclusively prove that the missing microcode update was to blame, the empirical evidence from this testing seems clear enough: the random failures did not recur on the updated nodes.

===== Mitigation =====

All compute nodes in Farber will need to be rebooted in order to apply the microcode update to the processors. The reboots will be staged, node by node, so that running jobs are not interrupted:

  * All queues on all nodes will be disabled.
  * Once all jobs running on a node have completed, the node will be rebooted.
  * Once the node is online again, its queues will be restored to their previous state and jobs can again run on it.

This staged process will commence at 9:00 AM on **April 29, 2019**.
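
The per-node procedure above lends itself to scripting. The following is a rough sketch only, assuming a Grid Engine scheduler (queue instances addressed as ''*@host''), ''qmod''/''qhost'' access, and SSH access to the nodes; host names, the queue wildcard, and the polling interval are placeholders, and this is not necessarily the script IT will actually run.

<code python>
#!/usr/bin/env python3
"""Rough sketch of the staged reboot described above (illustration only)."""
import subprocess
import time

NODES = ["n000", "n001"]  # placeholder compute node names


def run(cmd):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def jobs_still_running(node):
    """Rough heuristic: with -j, qhost prints indented job lines under a
    host entry while jobs are running on it."""
    out = run(["qhost", "-j", "-h", node])
    return any(line.startswith(" ") and line.strip() for line in out.splitlines()[3:])


for node in NODES:
    # 1. Disable every queue instance on the node so no new jobs start there.
    run(["qmod", "-d", f"*@{node}"])

    # 2. Wait for the jobs already running on the node to finish.
    while jobs_still_running(node):
        time.sleep(300)

    # 3. Reboot the node so it picks up the updated boot image and microcode.
    subprocess.run(["ssh", node, "reboot"], check=False)

    # 4. Re-enable the node's queues.  A production version would first wait
    #    for the node to come back online and rejoin the scheduler.
    run(["qmod", "-e", f"*@{node}"])
</code>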

<note important>