# Inciter performance

This page quantifies different aspects of the computational performance of Inciter.

## Strong scaling

The following figures quantify strong scaling of the different solvers in Inciter.

Note that the vertical axis above measures wall-clock time in seconds, thus the lower the value is faster the run was. Since the `DG`

method stores the unknowns at cell centers, and there are a lot more cells than points of the unstructured computational mesh (specifically this mesh had `npoin=133,375,577`

and `nelem=794,029,446`

), the DG method operates on a singificantly larger data (also requiring more memory) and thus solves the same discrete problem with larger fidelity compared to the `DiagCG`

and `ALECG`

(node-centered) schemes. To offer a comparison between the solvers based on the same fidelity, the figure below depicts the same data but normalizing the wall-clock time by the number of degrees of freedom (NDOF). NDOF equals the total number of grid cells for `DG`

, and the total number of nodes for the `DiagCG`

and `ALECG`

solvers.

## Weak scaling

The following figure quantifies the weak scaling scaling of DiagCG.

## Effects of overdecomposition

The figures below demonstrate typical effects of overdecomposition, partitioning the computational domain into *more* work units than the number of available processors. Overdecomposition helps load balancing, since the runtime system can work with finer-grained work units (compared to a single mesh partition per processing element) that are quicker to migrate and redistribute to homogenize computational load. Another positive effect of overdecomposition is discussed below where smaller work units fit better into CPU caches yielding improved performance.

The leftmost side of the figures corresponds to the case where the number of work units (*chares*) equal the number of CPUs – this is labelled as "classic MPI", as this is how distributed-memory-parallel codes are traditionally used with the MPI (message passing) paradigm. As the problem is decomposed into more partitions, the chunks become smaller but require more communication as the boundary/domain element ratio increases. Smaller chunks, however, are faster to migrate to other CPUs if needed and fit better into local processor cache. (Note that migration was not enabled for these examples.) As a result the problem can be computed a lot faster, in this case, approximately **50 times(!) faster**. Though finding such sweet spots require experimentation and certainly depends on the problem, problem size, and hardware configuration, the interesting point is that such a large performance gain is possible simply by allowing overdecomposition without the use of multiple software abstractions, e.g., MPI + threading. All of this code is written using a single and high-level parallel computing abstraction: Charm++ *without* explicit message passing code.

*time*utility, including setup and I/O, of integrating the coupled governing equations of mass, momentum, and energy for an ideal gas, using a continuous Galerkin finite element method. The times are normalized and compared to the leftmost (

*classic MPI*) data. As expected, using just a few more partitions per CPU results in a performance degradation as more communication is required. However, further increasing the degree of overdecomposition to about 5 times the number of CPUs yields an excellent speedup of over

**10x(!)**due to better cache utilization and overlap of computation and communication.

**50 times(!)**compared to the original run without overdecomposition.

## Load balancing

To demonstrate load balancing in parallel, we have run a Sedov problem modified to produce load imbalance. The Sedov problem is an idealized blast originating from a point, leading to a spherically spreading shock wave.

To induce load imbalance we added some extra computational load to those computational cells whose fluid density exceeds the value of 1.5. This mimics, e.g., combustion, non-trivial material equations of state, etc., and induces realistic load imbalance, characteristic of multi-physics simulations.

Spatial distributions of the extra load, corresponding to the fluid density exceeding the value 1.5, during time evolution of the Sedov solution: (left) shortly after the onset of load imbalance, (right) at end of the simulation.

One can imagine that as the computational domain decomposed into many partitions, residing on multiple compute nodes, the parallel load gets out of balance and work units must all wait for the slowest one.

Depicted below are simulation times for, running the problem using a 3 million-cell mesh on 10 computes nodes of a distributed-memory cluster with 36 CPUs/node. One can see that the density peak exceeds the value of 1.5 around the 130th time step, inducing load imbalance and this persists until the end of the simulation To balance the load we have used different strategies, available in Charm++, simply by turning on load balancing on the command line.

It is clear from the figure that multiple load balancing strategies can successfully and effectively homogenize the uneven, dynamically generated, a priori unknown computational load. The common requirement for effective load balancing is fine-grained work units, ensured by increasing the degree of overdecomposition (virtualization) from 1 to 100x, which yields 100x the mesh partitions compared to the number of CPUs the problem is running on. As expected, increasing the number of partitions beyond the number of CPUs has a cost due to increased communication cost, c.f., black and red lines. Overall however, load balancing yields over an order of magnitude speedup over the non-balanced case.