Several variants of Grid Engine claim to support integration with Linux CGroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were deeply disappointed to eventually discover for ourselves that the marketing hype was true only in a very rudimentary sense. Some of the issues we uncovered were:
- CGroup memory limits are derived from the per-slot memory request, m_mem_free. As implemented, the multiplication is applied twice and results in incorrect resource limits. E.g. when requesting m_mem_free=2G for a 20-core threaded job, the memory.limit_in_bytes applied by sge_execd to the job's shepherd was 80 GB rather than the expected 40 GB (2G × 20 slots); see the worked example below.
- Tasks dispatched via qrsh -inherit to a node never have CGroup limits applied to them.
- If a CGroup already exists for the job before the qrsh -inherit occurs (e.g. created in a prolog script), the sge_execd never adds the shepherd or its child processes to that CGroup.
- Core selections for binding are made by the qmaster; the sge_execd native sensors do not provide feedback w.r.t. what cores are available/unavailable. Cores consumed outside of Grid Engine's view therefore remain eligible, and the qmaster would still attempt to use them; when the qmaster selected cores for a job that could not actually be bound, the failed binding indicated to the sge_execd that the job ended in error¹.
- An m_mem_free value which is larger than the requested h_vmem value for the job is ignored and the h_vmem limit gets used for both. This is contrary to documentation².

The magnitude of the flaws and inconsistencies – and the fact that on our HPC cluster multi-node jobs form a significant percentage of the workload – meant that we could not make use of the native CGroup integration in this Grid Engine product, even though it was a primary reason for choosing the product.
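To make the first flaw concrete, the expected versus applied limits for the 2G, 20-slot example work out as follows. This is a minimal sketch reconstructing the arithmetic we observed, not the vendor's actual code:

```python
# Sketch of the observed m_mem_free miscalculation (reconstructed from the
# limit we saw applied to the job's CGroup; not the vendor's implementation).
GiB = 1024 ** 3

m_mem_free = 2 * GiB    # per-slot request: qsub -l m_mem_free=2G
slots = 20              # 20-core threaded (single-node) job

expected = m_mem_free * slots   # 40 GiB: per-slot request scaled by slot count
observed = expected * 2         # 80 GiB: double that value was what sge_execd
                                # wrote to the shepherd's memory.limit_in_bytes

print(f"expected: {expected // GiB} GiB, observed: {observed // GiB} GiB")
```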
GECO had problems of its own. Job state queries (e.g. via qstat) are serviced by a set of read threads in the queue master. The read threads return the last-coalesced snapshot of the cluster's job state; updates of this information are controlled by a locking mechanism that surrounds the periodic scheduling run, etc. This means that during a lengthy scheduling run, qstat won't actually return the real state information for a job that has just started running. We had to introduce some complicated logic as well as some sleep() calls in order to fetch accurate job information. Even that wasn't enough, as we later found qstat returning only partially-accurate job information for large array jobs. A fix to this flaw would have been necessary, but no solution other than more complexity and more sleep() usage presented itself.

The gecod program could not seem to reliably read cpuset.cpus from the CGroup subgroups it created. A processor binding would be produced and successfully written to e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus. When gecod scheduled the next job it would read /cgroup/cpuset/GECO/132.4/cpuset.cpus in an attempt to determine what processors were available to the new job being scheduled. However, when the file was opened and read, no data was present. We added sleep() delays around these reads, but to no consistently-reliable result (a sketch of this read-and-retry workaround follows below).
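For illustration, the read-and-retry workaround amounted to something like the following sketch. This is hypothetical code, not gecod's actual implementation, and as noted above even this approach was not consistently reliable:

```python
import time
from pathlib import Path

# Hypothetical sketch of retrying a cpuset.cpus read that sometimes came back
# empty; gecod's real logic was more involved and still proved unreliable.
CGROUP_ROOT = Path("/cgroup/cpuset/GECO")

def read_cpuset_cpus(job_task: str, retries: int = 5, delay: float = 0.5) -> str:
    """Return the processor binding (e.g. "0-3,8-11") recorded for a job.task
    subgroup such as 132.4, retrying with a short sleep() if no data is read."""
    path = CGROUP_ROOT / job_task / "cpuset.cpus"
    for _ in range(retries):
        data = path.read_text().strip()
        if data:
            return data
        time.sleep(delay)   # give the kernel a moment, then try again
    raise RuntimeError(f"no data readable from {path}")
```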
Another problem involved GECO's handling of ssh access to compute nodes. When an sshd was started, the user id and the environment for the process were checked to determine whether the sshd should be killed or allowed to execute (a simplified sketch of this check appears at the end of this section). If the sshd was owned by root then by default nothing was done to impede or alter its startup. Unfortunately, this meant that qlogin sessions – which start an sshd under the sge-shepherd process for the job – were never getting quarantined properly. A fix for this would have been necessary for GECO to be fully functional.

The sleep() blocks introduced to address issues with stale or unavailable data increased the time necessary for job quarantine to the point where, in many cases (during long scheduling runs), Grid Engine reached its threshold for awaiting sge-shepherd startup and would simply mark the jobs failed.

So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, when scaled up to our actual workload it failed utterly: the complexity of adapting to that scale under the conditions and software involved proved insurmountable.
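For completeness, the uid/environment test described above for newly started sshd processes can be sketched roughly as follows. This is a hypothetical simplification, not GECO's actual code, but it shows why a root-owned sshd (such as the one a qlogin session starts under sge-shepherd) escaped quarantine:

```python
from pathlib import Path

# Hypothetical simplification of the sshd quarantine decision described above.
def should_quarantine_sshd(pid: int) -> bool:
    """Decide whether a newly started sshd should be quarantined into a job's
    CGroups, based on its owner and environment (read from /proc)."""
    status = Path(f"/proc/{pid}/status").read_text()
    uid_line = next(line for line in status.splitlines() if line.startswith("Uid:"))
    real_uid = int(uid_line.split()[1])

    if real_uid == 0:
        # Root-owned sshd: left alone by default, which is exactly how the
        # sshd spawned for a qlogin session (under sge-shepherd) slipped through.
        return False

    # Otherwise, only quarantine processes that can be tied to a Grid Engine
    # job, e.g. via the standard JOB_ID environment variable.
    environ = Path(f"/proc/{pid}/environ").read_bytes().split(b"\0")
    env = dict(item.split(b"=", 1) for item in environ if b"=" in item)
    return b"JOB_ID" in env
```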