Inciter software design
- 1. Startup and migration of read-only global-scope data
- 2. Important classes
- 3. Setup
- 4. Time stepping
This page discusses the high-level software design and some implementation aspects of Inciter. The discussion roughly follows as execution happens in time, from program start to finish. On any of the Charm++ concepts below, consult the Charm++ manual.
As all other executables in Quinoa, Inciter uses the Charm++ runtime system. Runtime execution starts in the Charm++
mainmodule constructor, defined in Main/Inciter.C as
Main. Starting down the constructor's initializer list, the command line is parsed, followed by creating a driver, InciterDriver, which parses the input file.
After the main chare constructor has finished, the runtime system initializes global scope data and migrates it to all other processing elements (PEs), which from that point is considered read-only.
This global-scope data are defined at the beginning of Main/Inciter.C in the same order as they appear in the companion Charm++ interface file Main/
Global_scope data is always prefixed by
g_. Inciter's global-scope data contains two input decks: the defaults (g_inputdeck_defaults) and g_inputdeck which contains all data parsed during command line and input file parsing.
Two global scope vectors in inciter are
g_dgpde. These two vectors hold base classes for partial differential equations (PDE) that can be specialized to different types of derived PDEs that use continuous Galerkin finite element (FE) discretizations (CGPDE) and discontinuous Galerkin FE discretizations (DGPDE). These two vectors are global scope because they hold pointers to derived classes that enable runtime polymorphism. Runtime polymorphism enables code reuse and helps client code stay uniform and modular. Such polymorphic use of pointers (
std::unique_ptr) are one of the rare uses of pointers throughout the code.
Up to this point execution is serial, since there is only a single main Charm++ chare.
InciterDriver's constructor then fires up a single Charm++ chare instance of
Transporter, called from Main::
Transporter, defined in Inciter/Transporter.C, is the main driver class of Inciter that is a Charm++ chare from which all execution happens, e.g., via broadcasts, and to which all execution ends up in, leading to
Transporter::, which eventually calls back to
CkExit(), signaling the runtime system to exit.
Here are the important classes that interoperate within inciter:
Transporter(single chare, driver)
Partitioner(chare nodegroup, mesh partitioner)
MeshWriter(chare group, mesh writer, performing file output in parallel)
Refiner(chare array, mesh refiner)
Sorter(chare array, performs mesh reordering)
Discretization(chare array, generic PDE solver base class)
DiagCG(chare array, PDE solver child class, specialized to a continuous Galerkin finite element discretization scheme with a lumped-mass left-hand side and flux-corrected transport combined with Lax-Wendroff-like explicit time stepping scheme that is second order accurate in space and time),
DG(chare array, PDE solver child class, specialized to a discontinuous Galerkin finite element discretization scheme with a diagonal left-hand side with explicit Runge-Kutta time stepping),
- ... (more hydro schemes to be added in the future)
DistFCT(chare array, performs distributed-memory flux-corrected transport, if used by a specific discretization scheme, e.g.,
Chare above means a single Charm++ chare. There is a single instance of this class. By design,
Transporter is a single chare and is used as a driver that creates objects, used as a target of global parallel reductions, and thus global synchronization points.
Transporter also does most of the printouts to screen, collects statistics, and is the end-point of execution in
Group above means a Charm++ chare group. A group is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a group per PE which does not migrate.
MeshWriter is a group because it calls the MPI-only library, ExodusII, for outputing mesh and mesh-based solution field data to files. Note that while
MeshWriter is a group, the way we use it makes it similar to a nodegroup but with group semantics: we call its Charm++ entry method
MeshWriter::write() from multiple array elements (from Discretization) targeting ony the first PE of each logical node. This yields "serializing" every call per node, required for the underlying non-thread-safe NetCDF/HDF5 library calls, made by ExodusII. This way writing large solution data works in both non-SMP and SMP mode correctly and efficiently, since parallel output load is configurable separately from the number of work-units for computation.
Nodegroup above means a Charm++ chare nodegroup. A nodegroup is a processor-aware chare collection, which means that there is guaranteed to be a single instance of a nodegroup per logical (e.g., compute) node which does not migrate.
Partitioner is a nodegroup because it calls MPI-only libraries.
Array above means a Charm++ chare array, whose elements can migrate (if enabled) and thus they actively participate in automatic load balancing. With nonzero overdecomposition, there may be more array elements (workers) than the number of available PEs. Arrays do the bulk of the heavy lifting in a calculation, i.e., computing right-hand sides for PDE operators, and hold the bulk of unknown/solution arrays. The degree of overdecomposition can be specified by the
-u command line argument. This argument accepts a real value between 0.0 and 1.0, inclusive. 0.0 means no overdecomposition, which corresponds to partitioning the mesh into a number of pieces equalling the number of PEs available. (
-u 0.0 yields an execution style that is most similar to how MPI codes are traditionally used, which is the default.) Nonzero overdecomposition yields larger number of mesh partitions than the available PEs. The extreme of
1.0 represents the largest degree of overdecomposition, which also results in the smallest work units. For a discussion on the effects of overdecomposition, see the page on Inciter performance.
Discretization and its specialized children,
Refiner, and, if used/created,
DistFCT are bound arrays. This means that the runtime system migrates its corresponding array elements together. Bound arrays facilitate modularization among workers that migrate. Since array elements that are bound always appear together on a given PE, even after migration, they can also be thought of as part of the same class, because they can access data from each other (but still respecting the C++ rules of
private, etc). However, dividing functionality into classes, as always, helps readibility and makes reasoning about code easier.
Inciter's setup starts with
Transporter's constructor. After some printouts on configuration, some of the important classes are created, introduced above. In some cases, we simply call Charm++'s
ckNew() without arguments. This means an empty Charm++ chare array is created, we get hold of its proxy, but no constructors are run yet. An example for this is
Sorter. We do empty array creation for two main reasons: (1) we don't yet have all data that is needed to create the given class (more preparation is required but we need its proxy already), and/or (2) we need to pass specific data to each array (or group) element's constructor, which can only be done via dynamic insertion. (Passing it via Charm++'s
ckNew() would pass the same data to all elements to be created in a broadcast fashion, and this is not what we want most of the time.) For example,
Sorter needs part of the mesh connectivity, coordinates, etc., for which we need to have the mesh partitioned first. This also allows nicely adhering to the RAII idiom of the object oriented paradigm, which helps writing correct code.
In many cases, we pass a number of callbacks to chare arrays and groups when their Charm++ proxy is created with
ckNew(). These callbacks are of type
CkCallback. This is one technique that we use as a kind of type erasure so that these classes can interoperate with each other as well as with their host, Transporter. This helps resolving cyclic include dependencies. These callbacks mostly denote reduction targets to
Transporter, that are necessary synchronization points during setup. A
CkCallback is similar to std::function or a C-style function pointer, but can also store a callback to a Charm++ chare object that happens to reside across the network, i.e., an entry method call via a proxy.
Transporter's constructor we create a number of Charm++ chare arrays, most of them are empty to start with. An exception is
Partitioner, whose constructor starts by reading a different chunk of the mesh in blocks as they appear in the file. We read a chunk of the mesh and associated node coordinates for all mesh cells that are in the chunk read.
Partitioner's constructor also reads/computes the triangle element connectivity associated to side sets as well as the node lists associated to side sets. Note that only the portion of all this data is read in on a nodegroup that belongs to a given chunk of the mesh. At this point this results in a simple partitioning which almost certainly not ideal for computing equations later, because of a large surface-area-to-volume ratio of these partitions of the computational domain, since we cannot assume that the ordering in the mesh file also corresponds to close physical-space proximity.
Partitioner's constructor finishes with a global reduction to
Transporter::, which sums the number of elements and the number of nodes (points) in the complete problem (read in in a distributed fashion).
After reading the mesh in a distributed fashion, we now know the total number of cells and points and
Transporter:: computes the total number of worker chares that will be used to partition the problem into. If there is a nonzero overdecomposition configured by the command line argument
-u, the number of chares (partitions) may be more than the available PEs. This is computed by tk::
Partitioner:: sets up the necessary data for calling an external mesh partitioner. Currently, various coordinate-based partitioners are hooked up from Zoltan2. This works in distributed-memory parallel fashion, and calls MPI under the hood (inside the library call). Note that the number of desired mesh partitions equals the number of Charm++ (worker) chares we want, which can be larger than the number of PEs (or compute nodes). The output of mesh partitioning is a map that assigns a chare id to each mesh cell. Next is to categorize (or group) all cells (their connectivity and node coordinates) together that are assigned to each chare by the partitioner and send them to their (owner) chare. This is started out in
Partitioner::, and communicated, in a point-to-point fashion, to other Partitioner chares by calling the Partitioner::
contribute call to
Transporter::. When all contribute calls have arrived, communication of the different parts of the mesh has finished. This is followed by optional initial mesh refinement.
Transporter:: issues a broadcast to
Partitioner::, which uses dynamic array insertion to create all
Refiner chare array elements. Note that there may be multiple
Refiner objects created on each
Partitioner PE (or compute node in SMP mode). Each
Refiner constructor gets a chunk of the mesh (which now are smaller chunks than the chunks that were originally read from file), corresponding to the total number of chares. Not only the mesh connectivity and node coordinates, but also the boundary face connectivity and boundary node lists associated to multiple side sets are also passed to Refiner, and only those portions of these data structures that belong to the particular
Refiner chare. (Boundary face connectivity is used for cell-based discretization schemes, such as
DG, while boundary node lists are used for nodal discretizations, such as
DiagCG to set boundary conditions.)
Refiner's constructor evaluates user configuration and decides if initial mesh refinement is to be performed or not. If not, execution is simply skipped ahead to Refiner::
There are two ways Refiner is used:
- Mesh refinement before time stepping (t<0),
- Mesh refinement during time stepping (t>0),
Initial mesh refinement consists of potentially multiple steps of different refinement types, e.g.,
coordref, which respectively stand for uniform refinement, i.e., split each tetrahedron into 8 new ones, initial-conditions-based, i.e., non-uniform refinement based on estimating the error in the initial conditions on a given initial mesh, and coordinate-based, which allows specifying planes and simple extents in 3 dimensional space and allows tagging edges for refinement between two extremes. Additionally the user can also tag edges manually, creating a list of pairs of global node ids.
Sorter array chares but also reports back to Transporter::
Sorter takes a chunk of the mesh, together with its physical boundary face and node data structures, and sets up a symmetric node communication map, Sorter::
Sorter performs optional PE-locality mesh node reordering in parallel. This reordering is optional and is off by default. The reordering assigns new global mesh node IDs that roughly increase with chare ID.
Whether node reordering is to be performed or not is evaluated in Sorter::
The worker Charm++ chare array elements, that store the mesh and associated data structures and compute PDE operators during time stepping, are organized into a single-level base-child relationship for code reuse and runtime polymorphism. However, both the base-child relationship and runtime polymorphism are slightly differently implemented compared to what would be familiar with inheritance in standard object-oriented programming (OOP).
There is a single base class,
Discretization, that encapsulates data and member functions that are generic to all mesh-based discretization schemes. It stores the mesh, connectivity, node coordinates, etc. As
Discretization is a chare array, its elements are distributed across the whole problem on all available compute nodes and PEs, and can also migrate for load balancing. In the OOP sense, derived (or child) classes are the classes that implement a particular discretization scheme, e.g., DiagCG or DG.
Discretization chares. This is done via the usual Charm++ dynamic array insertion passing the mesh chunk and associated data structures to their constructors. Eventually, this is followed by creating the child workers in Sorter::
Execution from the
Discretization constructor follows to Transporter::
Discretization::, which computes nodal volumes as well as the total volume of the complete problem, which ends up in
Transporter::. Volume calculations are followed by computing various statistics on the mesh cell sizes, including min/max edge lengths, and histograms, which are useful diagnostics to estimate load imbalance.
The various mesh statistics aggregate and arrive independently in reduction target member functions of
pdfstat(). When all of these are complete (see src/
After the child constructors have started, one way or another, they are expected to call Transporter::
setup(). All specific discretization schemes, "deriving from"
Discretization, are expected to define certain member functions, such as
setup(), which do scheme-specific setup and is eventually expected to call
dt(), which starts time stepping (by computing the size of the next time step).
dt() is also the member function that is called again and again starting a new time step during time stepping.
After all the above, time stepping starts. Time stepping is and may be done very differently by the different types of discretizations but they are all expected to define a few member functions that are common so they can interoperate with their host,
Transporter. These are Charm++ entry methods and must also have the same function signature (as in a derived class in OOP). Since the set of these function may change, the best way to find out what is required is to compare the Charm++ interface (
.ci) files for the specific discretization schemes, e.g., DiagCG, DG, etc., and look for entry methods that are defined by all child schemes.