H5MD - proposal 100: Storage of time information
------------------------------------------------

**status:** Accepted in H5MD 1.1

### Objective

This proposal aims at enhancing the storage of time information. Some reasonable use cases are not covered in H5MD 1.0, such as equal time steps.

### Motivations

A few excerpts from the h5md-user mailing list.

http://article.gmane.org/gmane.science.simulation.h5md.user/640

> Now that I have started to use H5MD seriously, I also start to notice problems with it. One of them is the obligatory presence of a "time" dataset. I want to store a Monte-Carlo trajectory which consists of a sequence of configurations, but without any associated time values. If I want to respect the H5MD specification, I have to make up numbers, which is not a good habit to take. Is there any reason why "time" was made obligatory?

http://article.gmane.org/gmane.science.simulation.h5md.user/641

> ...and adding to that, can we also make the "step" optional? Weird as this may sound, we would also have to invent step numbers.

http://article.gmane.org/gmane.science.simulation.h5md.user/651

> > In a more general idea about step/time, I have an idea since a long time. I didn't want it for H5MD 1.0 to avoid any confusion. But storing step and time when step is simply step[i] = STEP_SIZE*i and time[i] = STEP_SIZE*DT*i is a bit of a waste. We could define a proper setup for regularly sampled data, for which step[0], STEP_SIZE, time[0] and DT should be given.
>
> Good idea, and not just to avoid wasting space. It would also contain the message to the reader "this is regularly sampled data". For some analyses this makes a big difference. For example, computing time correlation functions of regularly sampled data is straightforward and efficient, whereas it is cumbersome, slow, and imprecise for irregular time series. Right now, the only way to check if a time series is regular is to check all the time labels.
> However, these are floats and thus subject to round-off error. I'll bet that in practice, analysis software will simply assume the time series to be equally spaced and not bother to check. I'll also bet that sooner or later this will lead to wrong results being published.

See also http://nongnu.org/h5md/discussion.html#extensions-storage-of-time-dependent-data

### Relax datatype of `time`

Whereas the `Integer` character of `step` plays a role in the identification of time frames, `time` could be relaxed to "`Integer` or `Real`".

### Optional use of `time`

As, e.g., Monte-Carlo simulations may not possess a well-defined time, it is proposed that only `step` is mandatory in a time-dependent H5MD element.

### Linearly spaced `step` and `time`

When the increments of `step` and/or `time` are constant, the interpretation `step[i]=step0+i*delta_step` and `time[i]=time0+i*delta_time` holds. This change would remove the need to store unneeded data and would also facilitate the analysis, as many algorithms work only with evenly spaced data.

The content of a time-dependent H5MD element needs an update to allow for the absence of `step` and `time`.

#### Proposition to use scalar datasets.

The structure of a time-dependent H5MD element is:

     \-- step: Integer[]
     \-- (step0): Integer[]
     \-- (time): Float[]
     \-- (time0): Float[]
     \-- value: [variable][...]

This structure closely matches the existing one. The use of scalar datasets makes it possible (i) to keep `step` (etc.) as an HDF5 dataset rather than an attribute and (ii) to distinguish this layout clearly from the current structure. While not a requirement, the use of compact datasets is encouraged here.

#### Proposition to use attributes

The structure of a time-dependent H5MD element is:

    [variable][...]
     +-- step: Integer[]
     +-- (step0): Integer[]
     +-- (time): Float[]
     +-- (time0): Float[]

#### Proposition to mix scalar datasets and attributes

This structure closely matches the existing one.
The use of scalar datasets allows us (i) to keep `step` and `time` as HDF5 datasets and (ii) to distinguish this layout clearly from the current structure, i.e., the distinction is made _after_ reading the shape of the dataset. (iii) Using HDF5 attributes for the offset allows for a single generic identifier `offset` and avoids cluttering the HDF5 group that forms the H5MD element.

     \-- step: Integer[]
          +-- (offset): Integer[]
     \-- (time): Float[]
          +-- (offset): Float[]
     \-- value: [variable][...]
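
As an illustration of the mixed layout, here is a minimal Python sketch of how a reader might interpret `step` or `time` once loaded into memory. The `TimeAxis` class and its `values` method are hypothetical names introduced only for this example, and no HDF5 I/O is shown. The sketch assumes that a scalar value means "constant increment", that a sequence means "explicit per-frame values", and that a missing `offset` is treated as zero; that default is an assumption of the sketch, not something stated in this proposal.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union

Number = Union[int, float]


@dataclass
class TimeAxis:
    """Hypothetical in-memory view of `step` or `time` in an H5MD element.

    data   -- a scalar (the constant increment) or one value per stored frame
    offset -- value of the first frame; only used in the scalar case
              (assumed to default to 0 for this sketch)
    """
    data: Union[Number, Sequence[Number]]
    offset: Number = 0

    @property
    def is_linear(self) -> bool:
        # The distinction is made on the shape: a scalar means fixed spacing.
        return isinstance(self.data, (int, float))

    def values(self, n_frames: int) -> List[Number]:
        """Return the per-frame values for the first `n_frames` frames."""
        if self.is_linear:
            # x[i] = offset + i * increment, computed directly for each i
            # rather than accumulated, to avoid compounding round-off.
            return [self.offset + i * self.data for i in range(n_frames)]
        if len(self.data) != n_frames:
            raise ValueError("explicit data must have one entry per frame")
        return list(self.data)


# Fixed spacing: only the increment and the offset are stored.
step = TimeAxis(data=10, offset=100)
time = TimeAxis(data=0.5, offset=0.0)
print(step.values(4))   # [100, 110, 120, 130]
print(time.values(3))   # [0.0, 0.5, 1.0]

# Explicit storage: one value per frame, as in H5MD 1.0.
irregular = TimeAxis(data=[0, 7, 9])
print(irregular.is_linear)  # False
```

With this reading, an analysis tool can decide whether a series is regularly sampled by looking at `is_linear` alone, instead of comparing floating-point time labels.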