====== June 2019 Caviness Expansion: Node Specifications ======

In Generation 1 of Caviness, two types of node were offered to stakeholders: standard compute nodes and GPU nodes. The June 2019 expansion (Generation 2) broadens that selection to the node types described below.
===== Standard features =====
All nodes will use Intel //Cascade Lake// scalable processors, which are two generations newer than the //Broadwell// processors used in Generation 1. Every node includes:

  * (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
  * (1) 960 GB SSD local scratch disk
  * (1) port, 100 Gbps Intel Omni-Path network
==== System memory ====

Each Intel Cascade Lake processor has six memory channels versus a Broadwell processor's four, so the baseline configuration populates one DIMM per channel across both processors:

  * 192 GB — 12 x 16 GB DDR4-2666MHz (registered ECC)
Nodes can have system memory upgraded at the time of purchase:

  * 384 GB — 12 x 32 GB DDR4-2666MHz (registered ECC)
  * 768 GB — 12 x 64 GB DDR4-2666MHz (registered ECC)
  * 1024 GB — 16 x 64 GB DDR4-2666MHz (registered ECC)

While the 1024 GB (1 TB) option does use more than six memory DIMMs per CPU, the vendor cited workload performance tests that demonstrated no significant performance impact.
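
As a rough illustration of the arithmetic behind these options, the short Python sketch below (purely illustrative; it is not part of any Caviness tooling) computes the total capacity and DIMM placement for each configuration, showing why only the 1 TB option places more than one DIMM on some memory channels.

<code python>
# Illustrative sketch: check the memory configurations listed above.
# Assumes 2 CPUs per node and 6 memory channels per Cascade Lake CPU,
# as described in this section.

CPUS_PER_NODE = 2
CHANNELS_PER_CPU = 6

# (DIMM count, DIMM size in GB) for each configuration offered
configs = {
    "192 GB base":     (12, 16),
    "384 GB upgrade":  (12, 32),
    "768 GB upgrade":  (12, 64),
    "1024 GB upgrade": (16, 64),
}

for name, (dimms, size_gb) in configs.items():
    total_gb = dimms * size_gb
    dimms_per_cpu = dimms / CPUS_PER_NODE
    dimms_per_channel = dimms_per_cpu / CHANNELS_PER_CPU
    print(f"{name}: {total_gb} GB total, "
          f"{dimms_per_cpu:.0f} DIMMs per CPU, "
          f"{dimms_per_channel:.2f} DIMMs per channel")

# The 192/384/768 GB options populate exactly one DIMM per channel;
# the 1024 GB option places 8 DIMMs on each CPU's 6 channels.
</code>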
+ | |||
+ | ===== GPU nodes ===== | ||
+ | |||
+ | Generation 1 made available GPU nodes containing one nVidia P100 (// | ||
+ | |||
+ | - High-end GPGPU coprocessors are expensive and have features that some workloads will never utilize; would it be possible to include a less-expensive, | ||
+ | - Would it be possible to have more than two high-end GPGPU coprocessors in a node? | ||
+ | - Would it be possible to use the nVidia NVLINK GPGPU interconnect to maximise performance? | ||
+ | |||
+ | For point (1), the nVidia T4 is a lower-power, | ||
+ | |||
+ | * 2560 CUDA cores | ||
+ | * 320 Turing tensor cores | ||
+ | * 16 GB GDDR6 ECC memory | ||
+ | * 32 GB/s inter-GPGPU bandwidth (PCIe interface) | ||
+ | |||
For other workloads, the nVidia V100 (//Volta//) GPGPU is a generation newer than the P100 used in Generation 1. Each V100 features:

  * 5120 CUDA cores
  * 640 Volta tensor cores
  * 32 GB HBM2 ECC memory
  * 32 GB/s inter-GPGPU bandwidth (PCIe interface)

To address point (3), the final metric for each V100 is augmented when the SXM2 (NVLink) interface is used:

  * 300 GB/s inter-GPGPU bandwidth (SXM2 interface)

These data led to the design of three GPU node variants for inclusion in Caviness Generation 2.
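
The per-node totals quoted in the variant descriptions below follow directly from the per-device figures above. The illustrative Python sketch below (the variant names and GPU counts are taken from the sections that follow) performs that multiplication as a sanity check.

<code python>
# Illustrative sketch: per-node GPGPU totals for the three Generation 2
# GPU node variants, computed from the per-device figures listed above.

# Per-device specifications (from the lists above)
T4   = {"cuda_cores": 2560, "tensor_cores": 320, "mem_gb": 16}
V100 = {"cuda_cores": 5120, "tensor_cores": 640, "mem_gb": 32}

# GPGPU complement per node for each variant (from the sections below)
variants = {
    "low-end (1x T4)":            (T4,   1),
    "all-purpose (2x V100 PCIe)": (V100, 2),
    "high-end (4x V100 SXM2)":    (V100, 4),
}

for name, (gpu, count) in variants.items():
    print(f"{name}: {count * gpu['cuda_cores']} CUDA cores, "
          f"{count * gpu['tensor_cores']} tensor cores, "
          f"{count * gpu['mem_gb']} GB GPU memory")
</code>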
+ | |||
+ | ==== Low-end GPU ==== | ||
+ | |||
+ | The low-end GPU design is suited for workloads that are not necessarily GPU-intensive. | ||
+ | |||
+ | * (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz) | ||
+ | * (1) 960 GB SSD local scratch disk | ||
+ | * (1) port, 100 Gbps Intel Omni-path network | ||
+ | * (1) nVidia T4 | ||
+ | * 2560 CUDA cores | ||
+ | * 320 Turing tensor cores | ||
+ | |||
+ | Since the nVidia T4 does include tensor cores, this node may also efficiently handle some A.I. inference workloads. | ||
+ | |||
==== All-purpose GPU ====

Akin to the GPU node design in Generation 1, the all-purpose GPU design features one nVidia V100 per CPU. While any GPGPU workload is permissible on this node type, inter-GPGPU performance is not maximized:

  * (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
  * (1) 960 GB SSD local scratch disk
  * (1) port, 100 Gbps Intel Omni-Path network
  * (2) nVidia V100 (PCIe interface)
    * 10240 CUDA cores
    * 1280 Volta tensor cores

This node type is optimal for workgroups with mixed GPGPU workloads: any GPGPU job can run on it, though inter-GPGPU transfers are limited to PCIe bandwidth.
==== High-end GPU ====

Maximizing both the number of GPGPU coprocessors and the inter-GPGPU performance, the high-end GPU design addresses points (2) and (3) with four NVLink-connected V100s per node:

  * (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz)
  * (1) 960 GB SSD local scratch disk
  * (1) port, 100 Gbps Intel Omni-Path network
  * (4) nVidia V100 (SXM2 interface)
    * 20480 CUDA cores
    * 2560 Volta tensor cores

The high-end option is meant to address stakeholder workloads that are very GPU-intensive.
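
As a back-of-the-envelope illustration of what the SXM2 interconnect offers such workloads, the Python sketch below uses only the peak bandwidth figures quoted earlier on this page to estimate a bulk GPGPU-to-GPGPU transfer over each interface; real transfers will also incur latency and protocol overhead.

<code python>
# Back-of-the-envelope sketch: time to move data between two GPGPUs
# at the peak bandwidths quoted above (ignores latency and protocol
# overhead, so real transfers will be slower).

PCIE_GBPS = 32    # GB/s, PCIe interface (all-purpose node)
SXM2_GBPS = 300   # GB/s, SXM2/NVLink interface (high-end node)

payload_gb = 32   # e.g. the full 32 GB HBM2 contents of one V100

print(f"PCIe: {payload_gb / PCIE_GBPS:.3f} s")    # ~1.000 s
print(f"SXM2: {payload_gb / SXM2_GBPS:.3f} s")    # ~0.107 s
print(f"Speedup: {SXM2_GBPS / PCIE_GBPS:.1f}x")   # ~9.4x
</code>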
+ | |||
+ | ===== Enhanced local scratch ===== | ||
+ | |||
+ | On Caviness the Lustre file system is typically leveraged when scratch storage in excess of the 960 GB provided by local SSD is necessary. | ||
+ | |||
+ | Generation 1 featured two compute nodes with dual 3.2 TB NVMe storage devices. | ||
+ | |||
+ | The Generation 2 design increases the capacity of the fast local scratch significantly: | ||
+ | |||
+ | * (2) Intel Xeon Gold 6230 (20 cores each, 2.10/3.70 GHz) | ||
+ | * (1) port, 100 Gbps Intel Omni-path network | ||
+ | * (8) 4 TB NVMe | ||
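
For a sense of scale, the illustrative Python sketch below compares the fast local scratch capacities mentioned in this section.

<code python>
# Illustrative arithmetic: fast local scratch per node, by design.

scratch_tb = {
    "standard node (1 x 960 GB SSD)":      1 * 0.96,
    "Generation 1 NVMe node (2 x 3.2 TB)": 2 * 3.2,
    "Generation 2 NVMe node (8 x 4 TB)":   8 * 4.0,
}

for name, capacity in scratch_tb.items():
    print(f"{name}: {capacity:g} TB")

# The Generation 2 design offers 32 TB of NVMe scratch per node,
# five times the 6.4 TB of the Generation 1 NVMe nodes.
</code>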
+ | |||
+ | |||