A time-dependent data group with step and time having constant increments, i.e., if the sampling occurs at constant rate and the equations of motion are integrated at a fixed time step, may use the following structure:
<data_group>
 \-- step
 |    +-- (offset)
 \-- time
 |    +-- (offset)
 |    +-- (unit)
 \-- value [variable][...]
      +-- (unit)
The datasets step and time may possess an optional attribute offset specifying the absolute step and time corresponding to the sample at index 0. If the attribute is absent, the respective offset equals 0.
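As a minimal sketch of how a reader might reconstruct absolute steps and times: assuming that step and time each hold a single constant increment (an interpretation of the structure above, not something the text states explicitly), the sample at index i corresponds to offset + i × increment. The function name absolute_axis and all numbers are hypothetical.

```python
def absolute_axis(n_samples, increment, offset=0):
    """Return the absolute step (or time) of each sample index.

    Assumes a constant increment between samples (an assumption of this
    sketch); `offset` is the value at index 0 and defaults to 0, which
    mirrors the rule for an absent offset attribute.
    """
    return [offset + i * increment for i in range(n_samples)]


# Hypothetical numbers: sampling every 100 steps, starting at step 5000,
# with an integration time step of 0.002 (in the unit given by `unit`).
steps = absolute_axis(4, increment=100, offset=5000)
times = absolute_axis(4, increment=100 * 0.002, offset=5000 * 0.002)

# Without an offset attribute, the sample at index 0 sits at step 0.
steps_without_offset = absolute_axis(3, increment=100)
```

Only the increment and the optional offset would need to be read from the file in this interpretation; the per-sample steps and times follow from the index.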
Time-averaged data are stored for some applications. For example, the potential energy may be computed every 200 simulation steps while only the average of 50 such computations is stored (every \(10^4\) steps).
Additional statistical information along with the mean value is stored by extending the triple value, step, time. The structure of such a data group is:
data_group
 \-- value [var][...]
 |    +-- (unit)
 \-- error [var][...]
 \-- count [var]
 \-- step [var]
 \-- time [var]
      +-- (unit)
The value dataset is as before, but stores the arithmetic mean of the data sampled since the last output to this group.
The error dataset stores the statistical error of the mean value, given by \(\sqrt{\sigma^2/(N-1)}\) with the variance \(\sigma^2\) and the number of sampled data points \(N\). The error is 0 in the case of \(N=1\). The dimensions of the dataset must agree with those of value, and its (optional) unit is inferred from value.
The count dataset is of integer type and stores the number \(N\) of sampled data points used. The dataset has a single, variable dimension, which must agree with the first dimension of value.
Note that the statistical variance and the standard deviation are easily obtained by combining the datasets error and count and need not be stored explicitly.
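The relation between value, error and count can be illustrated with a short sketch. It assumes that \(\sigma^2\) denotes the 1/N variance of the samples, so that \(\sqrt{\sigma^2/(N-1)}\) is the usual standard error of the mean; the text does not state which variance convention is meant. The function names summarize and variance_from and the sample numbers are hypothetical.

```python
import math


def summarize(samples):
    """Return (value, error, count) for one output to the data group.

    Assumes sigma^2 is the 1/N variance of the samples (an assumption of
    this sketch), so that sqrt(sigma^2 / (N - 1)) is the standard error
    of the mean.
    """
    count = len(samples)
    mean = sum(samples) / count
    if count == 1:
        return mean, 0.0, count  # the error is defined as 0 for N = 1
    variance = sum((x - mean) ** 2 for x in samples) / count
    return mean, math.sqrt(variance / (count - 1)), count


def variance_from(error, count):
    """Recover sigma^2 from the stored error and count."""
    return error ** 2 * (count - 1)


# Hypothetical samples accumulated since the last output to the group.
value, error, count = summarize([1.0, 2.0, 3.0, 4.0])
sigma2 = variance_from(error, count)
std_dev = math.sqrt(sigma2)
```

Inverting the error formula gives \(\sigma^2 = \mathrm{error}^2 (N-1)\), which is why the variance and standard deviation need not be stored separately.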
Simulation box information
Some information on the simulation box geometry could be included. For now, the box size is included in the observables group. Symmetry groups could be included in the future.
Topology
Topology needs to be stored for rigid bodies, elastic networks or proteins. The topology may be a connectivity table, contain bond lengths, …
Scalar and vector fields
Scalar and vector fields may be used to store coarse-grained or cell-based physical quantities.
The “density” dataset has dimensions [variable][Nx][Ny][Nz], where the variable dimension allows successive time steps to be accumulated and Nx, Ny and Nz are the numbers of data points in each dimension. This dataset possesses the attributes “x0” and “dx”, both of dimension “D” (the dimensionality of the system). “x0” stores the center of the 0-th cell (the [0,0,0] cell) and “dx” stores the cell spacing. The notation from “x” to “z” is given as an example; other ranks may be used for other dimensionalities.
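The role of “x0” and “dx” can be sketched as follows. The helper names cell_center and cell_index are hypothetical, and the sketch assumes a regular grid without bounds checking or periodic wrapping, neither of which is specified in the text.

```python
import math


def cell_center(index, x0, dx):
    """Center of the cell with integer index (i, j, k, ...)."""
    return [x0_d + i_d * dx_d for i_d, x0_d, dx_d in zip(index, x0, dx)]


def cell_index(position, x0, dx):
    """Index of the cell containing `position`.

    Because x0 is the center of cell 0, that cell spans
    [x0 - dx/2, x0 + dx/2) along each axis, hence the shift by 0.5.
    """
    return [
        math.floor((p - x0_d) / dx_d + 0.5)
        for p, x0_d, dx_d in zip(position, x0, dx)
    ]


# Hypothetical attribute values for a three-dimensional system (D = 3).
x0 = [0.5, 0.5, 0.5]
dx = [1.0, 1.0, 1.0]

center = cell_center((2, 0, 1), x0, dx)
index = cell_index([2.7, 0.1, 1.0], x0, dx)
```

Both helpers iterate over however many components “x0” and “dx” hold, so the same code applies to other dimensionalities.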
The “velocity_field” dataset has dimensions [variable][Nx][Ny][Nz][D], where “D” is the dimensionality of the system. It stores a cell-based velocity field. The remark on the “x”, “y” and “z” notation made for the “density” dataset applies here as well.
Tracking history of authors and creator programs
It would be desirable to track authors and creator programs. This
could be achieved by replacing the respective attributes in /h5md
by datasets of variable dimension. The object tracking of these
datasets may then be matched (approximately) with the
creation/modification times of other datasets.
Parallel issues
Although not a specification in itself, one advantage of using HDF5 is the Parallel-HDF5 extension for MPI environments. Files written by parallel programs should be identical to files written by serial programs.
An issue remains, however: as particles move in space, they may be assigned to different CPUs over time. A proposed solution to this problem is to send a copy of every particle to its original CPU and to write the particles from there using collective IO calls. Particles for which the ordering is not important (for instance, solvent particles that may be required for checkpointing only) could be written from their current CPU without recreating the original order.
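The reordering idea can be illustrated without MPI. The sketch below is a pure-Python stand-in in which each "rank" is a list; it assumes that particles carry an integer id and that the original decomposition assigned contiguous blocks of ids to ranks. Both assumptions, and the names owner_of and restore_order, are hypothetical. A real implementation would replace the loop with an MPI all-to-all exchange followed by a collective Parallel-HDF5 write.

```python
def owner_of(particle_id, n_particles, n_ranks):
    """Rank that originally owned a particle, assuming contiguous id blocks."""
    block = -(-n_particles // n_ranks)  # ceiling division
    return particle_id // block


def restore_order(current, n_particles):
    """Route particles back to their original rank and sort them by id.

    `current[r]` is the list of (particle_id, position) pairs held by
    rank r at output time. The result has the same shape, but each rank
    holds the contiguous, id-ordered slab it would write collectively.
    """
    n_ranks = len(current)
    slabs = [[] for _ in range(n_ranks)]
    for held in current:  # stands in for an all-to-all exchange
        for particle_id, position in held:
            owner = owner_of(particle_id, n_particles, n_ranks)
            slabs[owner].append((particle_id, position))
    return [sorted(slab) for slab in slabs]


# Hypothetical state: 6 particles on 2 ranks after some have migrated.
current = [
    [(4, 0.1), (0, 0.7), (2, 0.3)],
    [(1, 0.9), (5, 0.2), (3, 0.6)],
]
slabs = restore_order(current, n_particles=6)
```

After the exchange, rank r holds a contiguous range of ids, so the file contents do not depend on where the particles happened to reside when the output was triggered.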