====== 2019 Farber Job Stall ======

This document summarizes a performance issue reported after the annual cluster maintenance was performed in January 2019. It outlines proposed actions to mitigate the issue.

===== The Issue =====

In March 2019 a workgroup reported ongoing issues with jobs not completing within wall time limits that had been sufficient prior to the January annual maintenance.

  * LAMMPS has a built-in timer facility that causes the program to exit cleanly once a given amount of wall time has elapsed; the timer was not triggering an exit on the failed jobs, though (see the sketch after this list).
  * Ganglia monitoring showed that the InfiniBand network was in use throughout the jobs, successful or failed, at similar transmission and reception rates.
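
For reference, the timer facility in question is normally enabled by a ''timer timeout'' command in the LAMMPS input script. The sketch below is an illustration only of how such a limit can be set via LAMMPS's Python module; the time limit and the input file name are placeholders, not the workgroup's actual inputs.

<code python>
#!/usr/bin/env python3
"""Illustration only: a LAMMPS run that imposes its own wall-time limit."""
from lammps import lammps   # LAMMPS Python module

lmp = lammps()

# Ask LAMMPS to wind down cleanly once 1 hour of wall time has elapsed,
# checking the clock every 100 timesteps.
lmp.command("timer timeout 1:00:00 every 100")

# Placeholder input deck; the workgroup's actual scripts are not shown here.
lmp.file("in.production")
</code>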

When a failed job was resubmitted, it would run successfully in the allotted time, so the job did not simply require more processing time than expected.

When additional data was gathered from the workgroup, it was found that GROMACS jobs were also experiencing occasional random failures.

===== Updated Kernel and Microcode =====

The OS update applied to Farber as part of January's maintenance included an updated kernel, but the matching processor microcode update was not part of the compute node boot image, so the compute nodes continued to run with the older microcode.
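
Which microcode revision a node is actually running can be checked directly. The short sketch below (not part of the original procedure) reads the per-CPU ''microcode'' field from ''/proc/cpuinfo''; the reported revision can be compared across nodes or against the revision shipped in the updated boot image.

<code python>
#!/usr/bin/env python3
"""Minimal sketch: report the CPU microcode revision(s) on this node."""
from collections import Counter

revisions = Counter()
with open("/proc/cpuinfo") as cpuinfo:
    for line in cpuinfo:
        # Lines of interest look like:  "microcode   : 0x42e"
        if line.startswith("microcode"):
            revisions[line.split(":", 1)[1].strip()] += 1

for rev, ncpus in revisions.items():
    print(f"microcode revision {rev} on {ncpus} logical CPU(s)")
</code>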

The compute node boot image had the microcode update added to it, and 12 nodes owned by IT were rebooted. These nodes were made available to:

  * Three users from the workgroup that reported the issues
  * Standby queues

Over the course of two weeks, those users funneled a series of jobs that had been experiencing random failures through the 12 nodes.

Though no direct evidence (traces, monitoring) could be gathered to conclusively prove that the missing microcode update was to blame, the empirical evidence from this testing seems clear enough: the random failures did not recur on the updated nodes.

===== Mitigation =====

All compute nodes in Farber will need to be rebooted in order to apply the microcode update to the processors. The reboots will be staged, node by node, so that running jobs are not interrupted:

  * All queues on all nodes will be disabled.
  * Once all jobs running on a node have completed, the node will be rebooted.
  * Once the node is online again, its queues will be restored to their previous state and jobs can again run on it.

This staged process will commence at 9:00 AM on **April 29, 2019**.
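
The per-node procedure above lends itself to scripting. The following is a rough sketch only, assuming a Grid Engine scheduler (queue instances addressed as ''*@host''), ''qmod''/''qhost'' access, and SSH access to the nodes; host names, the queue wildcard, and the polling interval are placeholders, and this is not necessarily the script IT will actually run.

<code python>
#!/usr/bin/env python3
"""Rough sketch of the staged reboot described above (illustration only)."""
import subprocess
import time

NODES = ["n000", "n001"]  # placeholder compute node names


def run(cmd):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def jobs_still_running(node):
    """Rough heuristic: with -j, qhost prints indented job lines under a
    host entry while jobs are running on it."""
    out = run(["qhost", "-j", "-h", node])
    return any(line.startswith(" ") and line.strip() for line in out.splitlines()[3:])


for node in NODES:
    # 1. Disable every queue instance on the node so no new jobs start there.
    run(["qmod", "-d", f"*@{node}"])

    # 2. Wait for the jobs already running on the node to finish.
    while jobs_still_running(node):
        time.sleep(300)

    # 3. Reboot the node so it picks up the updated boot image and microcode.
    subprocess.run(["ssh", node, "reboot"], check=False)

    # 4. Re-enable the node's queues.  A production version would first wait
    #    for the node to come back online and rejoin the scheduler.
    run(["qmod", "-e", f"*@{node}"])
</code>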

<note important>