| technical:generic:openmpi-4-ucx-issue [2024-12-05 10:37] – frey | technical:generic:openmpi-4-ucx-issue [2024-12-05 10:47] (current) – [How is it fixed] frey | ||
|---|---|---|---|
| + | ====== Use of UCX PML in Open MPI 4.x on DARWIN ====== | ||
| + | This document explores a bug in releases of Open MPI 4.x prior to 4.1.6. | ||
| + | |||
| + | ===== What is UCX ===== | ||
| + | |||
| + | UCX is a communications library that provides an abstract transport interface on top of multiple protocols and devices. | ||
| + | |||
| + | The Open MPI libraries built and maintained by IT RCI on DARWIN include several components that make use of UCX for accelerated data movement: | ||
| + | |||
| + | * the UCX Point-to-point Management Layer (PML) component | ||
| + | * the UCX One-Sided Communications (OSC) component | ||
| + | |||
| + | By default, the Open MPI libraries are configured to use the UCX PML. | ||
| + | |||
| + | ===== What is the issue ===== | ||
| + | |||
| + | While testing a workgroup' | ||
| + | |||
| + | <code fortran> | ||
| + | Integer (kind=int64) :: n_threshold | ||
| + | : | ||
| + | If (rank == 0) then | ||
| + | Call GetEnvVarInteger8(' | ||
| + | End If | ||
| + | Call MPI_Bcast(n_threshold, | ||
| + | Write(*,*) rank, n_threshold | ||
| + | </code> | ||
| + | |||
| + | produced the following output: | ||
| + | |||
| + | <code> | ||
| + | | ||
| + | 1 1133871376384 | ||
| + | 2 1133871376384 | ||
| + | : | ||
| + | </code> | ||
| + | |||
| + | At first it looked like the data were NOT broadcast to the other ranks, but '' | ||
| + | |||
| + | |10240| '' | ||
| + | |1133871376384| '' | ||
| + | |||
| + | The lowest 32 bits of the 64-bit variable //have// received the lowest 32 bits of the '' | ||
| + | |||
| + | * Changing this code to send the variable type-cast as an array of TWO 32-bit integers (totaling 64 bits of data) succeeded | ||
| + | * Changing this code to make '' | ||
| + | * just the first element of the array (ONE 64-bit integer) failed | ||
| + | * both elements of the array (TWO 64-bit integers) succeeded | ||
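The hexadecimal comparison above can be checked with a few lines of C. This is a minimal sketch: the helper names `low32` and `high32` are hypothetical, and only the two decimal values are taken from the text.

```c
#include <stdint.h>

/* Lowest 32 bits of a 64-bit value. */
uint32_t low32(uint64_t v)
{
    return (uint32_t) (v & 0xFFFFFFFFu);
}

/* Highest 32 bits of a 64-bit value. */
uint32_t high32(uint64_t v)
{
    return (uint32_t) (v >> 32);
}
```

With these helpers, `low32(1133871376384)` equals 10240 while `high32(1133871376384)` is nonzero, which is consistent with only the lowest 32 bits of the variable having been received.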
| + | |||
| + | In fact, a test program also showed that sending ONE double-precision floating-point value (type '' | ||
| + | |||
| + | Eventually, debugging demonstrated that the UCX PML, when registering a UCX-native datatype to be associated with an MPI-native datatype, was producing an incorrect byte size for any 8-byte type. In the Open MPI 4.x code, an optimization had been added that uses a bit shift instead of a multiplication when the datatype size is a power of 2 (1, 2, 4, 8, 16, etc.). | ||
| + | |||
| + | <code c> | ||
| + | pml_datatype-> | ||
| + | </code> | ||
| + | |||
| + | Mathematically, that expression is exact, but floating-point arithmetic is not always exact. | ||
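To see why a slightly inexact result matters, the sketch below shows the two pieces involved: a byte length computed as `count << shift`, and a shift obtained by truncating a floating-point estimate of a base-2 logarithm. The function names are hypothetical and the code is not taken from Open MPI. It only illustrates how an estimate that lands just under 3.0 truncates to 2, which halves the byte length of an 8-byte element.

```c
#include <float.h>
#include <stddef.h>

/* Byte length of `count` elements whose size is 2^shift bytes. */
size_t byte_length(size_t count, int shift)
{
    return count << shift;
}

/* A C cast from double to int discards the fractional part. */
int truncated_shift(double log2_estimate)
{
    return (int) log2_estimate;
}

/* Largest double strictly below 3.0; stands in for an inexact log2(8).
 * This value is an assumption for illustration, not a measured result. */
double just_below_three(void)
{
    return 3.0 - 2.0 * DBL_EPSILON;
}
```

Here `byte_length(1, truncated_shift(3.0))` gives 8 bytes, whereas `byte_length(1, truncated_shift(just_below_three()))` gives 4 bytes, i.e. only 32 bits of a 64-bit value.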
| + | |||
| + | ==== How is it fixed ==== | ||
| + | |||
| + | Rather than using floating-point mathematics, | ||
| + | |||
| + | <code c> | ||
| + | int ctz(unsigned int v) | ||
| + | { | ||
| + | int l; | ||
| + | | ||
| + | if ( v == 0 ) return 8 * sizeof(v); | ||
| + | l = 0; | ||
| + | while ( (v & 1) == 0 ) l++, v >>= 1; | ||
| + | return l; | ||
| + | } | ||
| + | </code> | ||
| + | |||
| + | Adding the precondition that '' | ||
| + | |||
| + | <code c> | ||
| + | int ctz(unsigned int v) | ||
| + | { | ||
| + | int l = 0; | ||
| + | | ||
| + | while ( (v & 1) == 0 ) l++, v >>= 1; | ||
| + | return l; | ||
| + | } | ||
| + | </code> | ||
| + | |||
| + | and if '' | ||
| + | |||
| + | <code c> | ||
| + | int ctz(unsigned int v) | ||
| + | { | ||
| + | int l = -1; | ||
| + | | ||
| + | do { l++, v >>= 1; } while (v); | ||
| + | return l; | ||
| + | } | ||
| + | </code> | ||
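The three variants above can be compared side by side. In this sketch they are given distinct, hypothetical names (`ctz_general`, `ctz_nonzero`, `ctz_pow2`) so they can coexist in one file. The preconditions in the comments are inferred from the code itself: the second variant would loop forever on zero, and the third returns the position of the highest set bit, which equals the trailing-zero count only when exactly one bit is set.

```c
/* General form: handles v == 0 by returning the bit width of the type. */
int ctz_general(unsigned int v)
{
    int l = 0;

    if (v == 0) return (int) (8 * sizeof(v));
    while ((v & 1) == 0) { l++; v >>= 1; }
    return l;
}

/* Assumes v != 0; the loop would never terminate for v == 0. */
int ctz_nonzero(unsigned int v)
{
    int l = 0;

    while ((v & 1) == 0) { l++; v >>= 1; }
    return l;
}

/* Assumes v is a nonzero power of 2: returns the index of its single set bit. */
int ctz_pow2(unsigned int v)
{
    int l = -1;

    do { l++; v >>= 1; } while (v);
    return l;
}
```

For an input of 8 all three return 3. For an input that is not a power of 2, such as 12, `ctz_general` returns 2 but `ctz_pow2` returns 3, so the last form is only a valid shortcut under its precondition.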
| + | |||
| + | The GCC compiler implements a '' | ||
| + | |||
| + | <code c> | ||
| + | #if OPAL_C_HAVE_BUILTIN_CTZ | ||
| + | pml_datatype-> | ||
| + | #else | ||
| + | size_t | ||
| + | pml_datatype-> | ||
| + | while ( lsize ) pml_datatype-> | ||
| + | #endif | ||
| + | </code> | ||
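The same builtin-with-fallback pattern can be tried outside the Open MPI source tree. The sketch below is a standalone illustration, not the actual patch: the function name `size_to_shift` is hypothetical, and `__GNUC__` is used as a stand-in for `OPAL_C_HAVE_BUILTIN_CTZ`, which is only meaningful inside an Open MPI build. It assumes the size is a nonzero power of 2 that fits in an `unsigned int`.

```c
#include <stddef.h>

/* Returns shift such that ((size_t) 1 << shift) == size.
 * Assumes size is a nonzero power of 2 that fits in an unsigned int. */
int size_to_shift(size_t size)
{
#if defined(__GNUC__)
    /* GCC builtin; its result is undefined for an argument of 0. */
    return __builtin_ctz((unsigned int) size);
#else
    /* Portable integer-only fallback: count how often size can be halved. */
    int shift = 0;

    while (size >>= 1) shift++;
    return shift;
#endif
}
```

Both branches use integer arithmetic only, so `size_to_shift(8)` is 3 on every platform and no floating-point rounding is involved.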
| + | |||
| + | This issue has been noted by the Open MPI developers. | ||
| + | |||
| + | ===== Changes on DARWIN ===== | ||
| + | |||
| + | The source code for all versions and variants of Open MPI in the 4.x release sequence has been patched according to the information above. | ||
| + | |||
| + | DARWIN users who experienced problems with their MPI code are encouraged to try to determine if the send/ | ||