This page quantifies different aspects of the computational performance of Inciter.
Note that the vertical axis above measures wall-clock time in seconds, thus the lower the value is faster the run was. Since the
DG method stores the unknowns at cell centers, and there are a lot more cells than points of the unstructured computational mesh (specifically this mesh had
nelem=794,029,446), the DG method operates on a singificantly larger data (also requiring more memory) and thus solves the same discrete problem with larger fidelity compared to the
ALECG (node-centered) schemes. To offer a comparison between the solvers based on the same fidelity, the figure below depicts the same data but normalizing the wall-clock time by the number of degrees of freedom (NDOF). NDOF equals the total number of grid cells for
DG, and the total number of nodes for the
The figures below demonstrate typical effects of overdecomposition, partitioning the computational domain into more work units than the number of available processors. Overdecomposition helps load balancing, since the runtime system can work with finer-grained work units (compared to a single mesh partition per processing element) that are quicker to migrate and redistribute to homogenize computational load. Another positive effect of overdecomposition is discussed below where smaller work units fit better into CPU caches yielding improved performance.
The leftmost side of the figures corresponds to the case where the number of work units (chares) equal the number of CPUs – this is labelled as "classic MPI", as this is how distributed-memory-parallel codes are traditionally used with the MPI (message passing) paradigm. As the problem is decomposed into more partitions, the chunks become smaller but require more communication as the boundary/domain element ratio increases. Smaller chunks, however, are faster to migrate to other CPUs if needed and fit better into local processor cache. (Note that migration was not enabled for these examples.) As a result the problem can be computed a lot faster, in this case, approximately 50 times(!) faster. Though finding such sweet spots require experimentation and certainly depends on the problem, problem size, and hardware configuration, the interesting point is that such a large performance gain is possible simply by allowing overdecomposition without the use of multiple software abstractions, e.g., MPI + threading. All of this code is written using a single and high-level parallel computing abstraction: Charm++ without explicit message passing code.
To demonstrate load balancing in parallel, we have run a Sedov problem modified to produce load imbalance. The Sedov problem is an idealized blast originating from a point, leading to a spherically spreading shock wave.
To induce load imbalance we added some extra computational load to those computational cells whose fluid density exceeds the value of 1.5. This mimics, e.g., combustion, non-trivial material equations of state, etc., and induces realistic load imbalance, characteristic of multi-physics simulations.
Spatial distributions of the extra load, corresponding to the fluid density exceeding the value 1.5, during time evolution of the Sedov solution: (left) shortly after the onset of load imbalance, (right) at end of the simulation.
One can imagine that as the computational domain decomposed into many partitions, residing on multiple compute nodes, the parallel load gets out of balance and work units must all wait for the slowest one.
Depicted below are simulation times for, running the problem using a 3 million-cell mesh on 10 computes nodes of a distributed-memory cluster with 36 CPUs/node. One can see that the density peak exceeds the value of 1.5 around the 130th time step, inducing load imbalance and this persists until the end of the simulation To balance the load we have used different strategies, available in Charm++, simply by turning on load balancing on the command line.
It is clear from the figure that multiple load balancing strategies can successfully and effectively homogenize the uneven, dynamically generated, a priori unknown computational load. The common requirement for effective load balancing is fine-grained work units, ensured by increasing the degree of overdecomposition (virtualization) from 1 to 100x, which yields 100x the mesh partitions compared to the number of CPUs the problem is running on. As expected, increasing the number of partitions beyond the number of CPUs has a cost due to increased communication cost, c.f., black and red lines. Overall however, load balancing yields over an order of magnitude speedup over the non-balanced case.