====== Caviness 2021 Lustre Expansion ======

Throughout 2020 and into early 2021, usage of the Lustre file system on the Caviness cluster has maintained a level around 80% of total capacity.

The capacity of a Lustre file system embodies two separate metrics (//storage classes//):

  * The total metadata entries (//inodes//) provided by metadata target (MDT) devices
  * The total object storage (e.g. //bytes// or //blocks//) provided by object storage target (OST) devices

Having extremely large OST capacity combined with insufficient MDT capacity leads to an inability to create additional files despite there being many bytes of object storage available.

On Caviness, the existing MDT and OST capacities are being consumed at nearly the same rate. As of February 23, 2021:
  * OST usage is at 83%
  * MDT usage is at 77%
This is actually good news: it implies a fair balance between the two storage classes under the usage profile of all Caviness users.

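These percentages correspond to the byte usage reported for the OSTs and the inode usage reported for the MDT. As an illustration (the mount point below is a placeholder, not the actual Caviness path), both storage classes can be checked from any client with ''lfs df'':
<code bash>
# Object (byte/block) usage per OST -- the 83% figure
lfs df -h /path/to/lustre

# Metadata (inode) usage per MDT -- the 77% figure
lfs df -i /path/to/lustre
</code>
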
===== Additional Capacity =====

Part of the Generation 2 addition to the Caviness cluster was:
  * (2) OST pools, each 12 x 12 TB HDDs
  * (1) MDT pool, 12 x 1.6 TB SSDs
The previous components of the Lustre file system were:
  * (4) OSTs, each 65 TB in size
  * (1) MDT, 4 TB in size

Bringing the new capacity online will require downtime, primarily because the existing MDT and OST usage levels are so high. Every directory currently present on the Lustre filesystem **only** makes use of the existing MDT (mdt0).
+ | |||
+ | ==== MDT Configuration ==== | ||
+ | |||
+ | The filename hashing presents a major issue when growing a Lustre file system' | ||
+ | |||
+ | Given this information, | ||
+ | * Each MDT should be of approximately the same capacity to promote balanced growth | ||
+ | * The filename hash function must be well-designed (to provide a balanced distributions of hash values) | ||
+ | The second requirement is outside our ability to control (hopefully the Lustre developers did a good job). The first requirement is by definition not met on Caviness since mdt0 is close to full and the new MDT(s) will be empty. | ||
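
The behavior can be demonstrated on any Lustre file system with more than one MDT. The paths below are illustrative only, and creating striped directories typically requires administrative privileges:
<code bash>
# Create a directory whose metadata is striped across 2 MDTs
lfs mkdir -c 2 /path/to/lustre/striped_dir

# Show the directory's metadata striping layout
lfs getdirstripe /path/to/lustre/striped_dir

# Create some files, then report which MDT holds a given entry;
# the assignment follows from the hash of the filename
touch /path/to/lustre/striped_dir/file-{1..8}
lfs getstripe -m /path/to/lustre/striped_dir/file-1
</code>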

The Generation 1 Lustre metadata was configured with a single MDT serviced by a pair of MDS nodes:

//(Figure: Generation 1 MDT storage layout)//

The three disks in light blue are //parity data// (for redundancy) and the one disk in light orange is a hot spare.

With the hardware added to Generation 2 and the balanced design tenets outlined above, the 16 TB of new metadata storage will be organized as:

//(Figure: Generation 2 MDT storage layout)//

The 12 disks are again SSDs, this time with quadruple the capacity of those in Generation 1. The disks are split into three pools, with each pool being a mirror of two disks.

==== OST Configuration ====

Just as with the new MDT versus the old, the two new OST pools utilize storage media of larger capacity than in Generation 1. Though metadata are fixed-size entities, objects are of arbitrary size (anywhere up to the full capacity of an OST) and thus an arbitrary number of them fit on each OST. When a new file is created, the Lustre metadata subsystem chooses an OST (or set of OSTs if the file is striped) on which the file will be placed.

Since the metadata subsystem can allocate around OSTs that have reached full capacity, it is not quite as critical for the OSTs to be balanced in their usage.

Thus, the two new OSTs in Generation 2 are set up as two pools, each comprising nine data and two parity HDDs (RAIDZ2); an SSD read cache (L2ARC); and a single hot-spare HDD.
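
If these pools are built as ZFS zpools (an assumption here; RAIDZ2, L2ARC, and hot spares are ZFS concepts, but the exact backend and device layout on Caviness are not shown on this page), each OST pool would be assembled along these lines:
<code bash>
# One OST pool: 11 HDDs in a RAIDZ2 vdev (9 data + 2 parity),
# one SSD as L2ARC read cache, one HDD as a hot spare.
# All device names are placeholders.
zpool create ost2pool \
    raidz2 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk \
    cache sdl \
    spare sdm
</code>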

===== Rebalancing Lustre Storage =====

Once the new MDTs and OSTs are brought online as part of the existing Lustre filesystem, an imbalance will exist: the original MDT and OSTs will remain nearly full while the new devices sit empty.

Striping of metadata does not happen automatically: a directory must be created explicitly with a striped layout, and every existing directory will continue to use only mdt0.

A rebalancing of the filesystem will be effected by the following once the new MDTs and OSTs are online (a command-level sketch follows the list):
  - A new directory with metadata striped across all MDTs will be created (''/…'')
  - Existing directories will be copied into that striped directory
  - Once all content has been transferred, the original directories will be removed
  - Finally, the copies will be moved from the striped directory back to their original paths
With the metadata of the new copies being striped across all MDTs, and the Lustre metadata subsystem spreading the copies across the new and old OSTs, the net effect will be to rebalance MDT and OST usage across all devices.
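
In command terms, the procedure would look roughly like the sketch below. The directory names are placeholders and the copy could equally be performed with ''rsync'' or a parallel copy tool; this is not the exact script that will be run on Caviness.
<code bash>
# 1. Create a directory striped across all MDTs
#    (stripe count set to the total number of MDTs; 2 in this example)
lfs mkdir -c 2 /path/to/lustre/.rebalance

# 2. Copy an existing top-level directory into the striped directory;
#    its new metadata spreads over all MDTs, its new objects over all OSTs
cp -a /path/to/lustre/somedir /path/to/lustre/.rebalance/somedir

# 3. Remove the original once the copy has been verified
rm -rf /path/to/lustre/somedir

# 4. Move the copy back into place at the original path
mv /path/to/lustre/.rebalance/somedir /path/to/lustre/somedir
</code>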

===== Testing =====

All aspects of this workflow were tested using VirtualBox on a Mac laptop. Four VMs (mds0, mds1, oss0, and oss1) were created running CentOS 7 with the Lustre 2.10.3 server modules.

The four VMs each had a virtual NIC configured on a named internal network, and LNet was configured over TCP on each node:
<code bash>
[mds0 ~]$ modprobe lnet
[mds0 ~]$ lnetctl net configure --all
</code>
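That LNet picked up the virtual NIC can be confirmed on each node (a quick sanity check, not one of the steps recorded above):
<code bash>
[mds0 ~]$ lnetctl net show
</code>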

The following VDIs were created:
  * 50 GB - mgt
  * 250 GB - mdt0, mdt1
  * 1000 GB - ost0, ost1
The mgt and mdt0 VDIs were attached to mds0 and formatted:
<code bash>
[mds0 ~]$ mkfs.lustre --mgs --reformat \
      --servicenode=mds0@tcp --servicenode=mds1@tcp \
      --backfstype=ldiskfs \
      /dev/sdb
[mds0 ~]$ mkfs.lustre --mdt --reformat --index=0 \
      --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
      --servicenode=mds0@tcp --servicenode=mds1@tcp \
      --backfstype=ldiskfs --fsname=demo \
      /dev/sdc
</code>
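The parameters written to each target can be reviewed at any time without altering the device; for example:
<code bash>
[mds0 ~]$ tunefs.lustre --dryrun /dev/sdc
</code>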
The ost0 VDI was attached to oss0 and formatted:
<code bash>
[oss0 ~]$ mkfs.lustre --ost --reformat --index=0 \
      --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
      --servicenode=oss0@tcp --servicenode=oss1@tcp \
      --backfstype=ldiskfs --fsname=demo \
      /dev/sdb
</code>
The mgt and mdt0 were brought online:
<code bash>
[mds0 ~]$ mkdir -p /lustre/mgt /lustre/mdt0
[mds0 ~]$ mount -t lustre /dev/sdb /lustre/mgt
[mds0 ~]$ mount -t lustre /dev/sdc /lustre/mdt0
</code>
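Once mounted, the MGS and MDT show up in the node's local Lustre device list (an optional verification):
<code bash>
[mds0 ~]$ lctl dl
</code>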
Finally, ost0 was brought online:
<code bash>
[oss0 ~]$ mkdir -p /lustre/ost0
[oss0 ~]$ mount -t lustre /dev/sdb /lustre/ost0
</code>

==== Client Setup ====

Another VM was created with the same version of CentOS 7 and the Lustre 2.10.3 client modules.

The "demo" filesystem was then mounted on the client:
<code bash>
[client ~]$ mkdir /demo
[client ~]$ mount -t lustre mds0@tcp:/demo /demo
</code>

At this point, some tests were performed in order to fill the metadata to approximately 70% of capacity.
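
The exact tests are not recorded here; one straightforward way to drive metadata usage toward a target level is to create large numbers of empty files and watch the per-MDT inode consumption, e.g.:
<code bash>
[client ~]$ mkdir /demo/filler
[client ~]$ for i in $(seq 1 100000); do touch /demo/filler/file-$i; done
[client ~]$ lfs df -i /demo
</code>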

==== Addition of MDT ====

The new MDT was formatted and brought online:
<code bash>
[mds1 ~]$ mkfs.lustre --mdt --reformat --index=1 \
      --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
      --servicenode=mds1@tcp --servicenode=mds0@tcp \
      --backfstype=ldiskfs --fsname=demo \
      /dev/sdb
[mds1 ~]$ mkdir -p /lustre/mgt /lustre/mdt1
[mds1 ~]$ mount -t lustre /dev/sdb /lustre/mdt1
</code>

After a few moments, the client VM received the updated file system configuration and mounted the new MDT. MDT usage and capacity changed accordingly.
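
From the client, the new MDT can be seen alongside mdt0 (an optional check):
<code bash>
[client ~]$ lfs mdts /demo
[client ~]$ lfs df -i /demo
</code>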

Further testing was performed to confirm that:
  * by default, all metadata additions were against mdt0
  * creating a new directory with metadata striping over mdt0 and mdt1 initially allowed a balanced creation of new files across both MDTs
  * once mdt0 was filled to capacity, creation of new files whose names hashed and mapped to mdt0 failed; names that hashed and mapped to mdt1 succeeded

==== Addition of OST ====

The new OST was formatted and brought online:
<code bash>
[oss1 ~]$ mkfs.lustre --ost --reformat --index=1 \
      --mgsnode=mds0@tcp --mgsnode=mds1@tcp \
      --servicenode=oss1@tcp --servicenode=oss0@tcp \
      --backfstype=ldiskfs --fsname=demo \
      /dev/sdb
[oss1 ~]$ mkdir -p /lustre/ost1
[oss1 ~]$ mount -t lustre /dev/sdb /lustre/ost1
</code>

After a few moments, the client VM received the updated file system configuration and mounted the new OST. OST usage and capacity changed accordingly.
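
As with the MDT, the client-side view can be confirmed (an optional check):
<code bash>
[client ~]$ lfs osts /demo
[client ~]$ lfs df -h /demo
</code>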