Quantum physics, case study

A documentation case from the field of quantum physics. Courtesy of Assoc. Prof. Juha Muhonen.

Data type 

The data is empirical and is collected from a number of instruments. It consists mostly of numerical tables with specific units and dimensions.

Challenges in data processing 

The main challenge is keeping the dimensions and units together with the data. If the data is saved only in Excel or plain-text table format, the units and dimensions do not travel with it automatically, so it can be difficult to work out in retrospect which unit was used. For the same reason, linking the data to the correct sample and to the state of the test equipment at the time of measurement can be challenging. In addition, the measurements sometimes generate large data files, so for usability it is essential that the data is stored in a binary format. This keeps the data easy to download and read while remaining interoperable with many different numerical software packages.

Solutions 

The open-source Hierarchical Data Format, version 5 (HDF5), was used. HDF5 is a standard format and thus interoperable with the required software. A particular advantage of the format is that it allows the data to be combined with manually generated metadata (e.g. the state of the test equipment at the time of measurement, which sample is involved, and other similar experimenter's notes). The data and metadata are thus located in one and the same file, and they always move together.


Implementation of documentation  

1. HDF5 data file

Example of data file structure: [figure not available]

The example is written in Python, which is just one option. What is essential is that when numerical data is processed (in this example, frequency), the unit metadata descriptor, hertz, is documented and saved together with the numbers. The example also processes a second numerical quantity, wavelength. Correspondingly, the wavelength entry is accompanied by a metadata descriptor, nanometer. Frequency and wavelength are the variable coordinates that are changed and adjusted in the measurements to measure the desired phenomenon.

The first measured variable in the example is 1) spectral data in a 2D table that depends on the wavelength and frequency. Its metadata has been enriched with a note on the resolution bandwidth (a specific parameter of the measurement). In addition to the spectrum, the file contains a second variable, namely 2) a voltage that depends on the wavelength. The voltage data is in a 1D table and is measured at the same time as the spectrum. The Attributes section of the file records general notes, such as who did the experiment, temperature, power, and similar contextual information that helps you retrace your own steps if necessary and so ensures reproducibility.
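The structure described above could be written along the following lines. This is only a minimal sketch, not the original example: it assumes the h5py and NumPy libraries, and the dataset names, attribute names, numerical values, and the spectrum unit are hypothetical placeholders chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def write_measurement(path):
    """Write coordinates, measured variables, and metadata into one HDF5 file."""
    # Coordinates adjusted during the measurement (placeholder values).
    frequency = np.linspace(1.0e6, 2.0e6, 5)
    wavelength = np.linspace(1549.0, 1551.0, 3)

    # Measured variables (zeros as placeholders, not real data).
    spectrum = np.zeros((wavelength.size, frequency.size))  # 2D table
    voltage = np.zeros(wavelength.size)                      # 1D table

    with h5py.File(path, "w") as f:
        # Each coordinate is stored together with its unit.
        freq = f.create_dataset("frequency", data=frequency)
        freq.attrs["units"] = "Hz"
        wl = f.create_dataset("wavelength", data=wavelength)
        wl.attrs["units"] = "nm"

        # Spectrum: depends on wavelength (axis 0) and frequency (axis 1).
        spec = f.create_dataset("spectrum", data=spectrum)
        spec.attrs["units"] = "dBm"  # assumed unit, not given in the text
        spec.attrs["resolution_bandwidth_Hz"] = 1.0e3  # hypothetical value
        spec.dims[0].label = "wavelength"
        spec.dims[1].label = "frequency"

        # Voltage: depends on wavelength only.
        volt = f.create_dataset("voltage", data=voltage)
        volt.attrs["units"] = "V"
        volt.dims[0].label = "wavelength"

        # General notes stored as file-level attributes (hypothetical values).
        f.attrs["experimenter"] = "N.N."
        f.attrs["temperature_K"] = 4.0
        f.attrs["power_mW"] = 1.0


# Usage: write the example into a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "measurement_example.h5")
write_measurement(path)
```

Because the units and notes are stored as attributes of the datasets and of the file itself, they cannot be separated from the numbers when the file is copied or shared.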

Advantages of the model: When all the data is in one file, the metadata travels together with the numeric data. An added benefit is that once a file is read into any numerical software, the variables can be referred to by these clear, self-chosen names (frequency, wavelength, spectrum, etc.). The names are transferred to the software automatically, exactly as they were stored. Thus, there is no need to recall whether a particular column held a frequency or a wavelength, and so on.
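As an illustration of how the stored names carry over when a file is read, the sketch below loads every top-level dataset under its own name together with its unit. It assumes the h5py and NumPy libraries and a "units" attribute on each dataset; the helper name load_named_variables and the demo values are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def load_named_variables(path):
    """Return {name: (values, units)} for every top-level dataset in the file."""
    variables = {}
    with h5py.File(path, "r") as f:
        for name, item in f.items():
            if isinstance(item, h5py.Dataset):
                units = item.attrs.get("units", "")
                variables[name] = (item[()], units)
    return variables


# Build a small demo file so the example is self-contained (placeholder values).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("frequency", data=np.array([1.0e6, 2.0e6])).attrs["units"] = "Hz"
    f.create_dataset("wavelength", data=np.array([1549.0, 1550.0, 1551.0])).attrs["units"] = "nm"

# The variables come back labelled with the names and units they were saved with.
variables = load_named_variables(demo_path)
for name, (values, units) in variables.items():
    print(f"{name}: shape={values.shape}, units={units}")
```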

2. Data notes

In addition to the data file itself, experimental notes are needed. The notes document contextual information, e.g. why the data was measured, and give a more detailed description of the experimental setup. These notes are maintained in a OneNote Laboratory Diary, which is stored on the University's SharePoint and shared among those working in the laboratory. The diary has separate sections for different experiments. The tab for each experiment has diary pages that record the experiment ID, a clearly marked path to the S-station or Nextcloud project folder where the data for the individual day is stored, and a free-text description of what has been done.

With regard to the information to be recorded, the team has agreed that the description should always explain how problems encountered during the experiment were resolved, so that they do not have to be solved again. In addition, all parameters that must be known for the data to be trusted are always recorded with a view to future research publications.

Advantages of using OneNote: OneNote is, for example, a more flexible platform than a wiki, and it can be used from several devices simultaneously as well as in a browser. This avoids the risk of data from different experiments and descriptive notes being lost in the personal files of individual researchers. There is no need for separate notebooks in addition to OneNote, which makes it easier for everyone to work on their own. OneNote also has a keyword feature that allows you to inventory the data, and an easy-to-use search feature. However, a hashtag functionality would make searching the diary even more streamlined and would increase the usability of the diaries as the data accumulates.

3. Internal team guidelines for compiling metadata

In addition, consider creating demo instructions that are stored in Jupyter or whichever software you use. The demo instructions indicate what metadata is encoded for each item stored under each data field. The demo explains step by step how to enter the data, encode it, and add the metadata. Add notes on, for example, your data naming conventions, so that the convention remains uniform.
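One possible way to support such guidelines is a small check, kept alongside the demo instructions, that lists the datasets lacking an agreed metadata field. The sketch below is only an illustration under stated assumptions: it uses the h5py and NumPy libraries, treats a "units" attribute as the team convention, and the function name find_datasets_missing_units is hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np


def find_datasets_missing_units(path, required_attr="units"):
    """Return the sorted names of all datasets that lack the required attribute."""
    missing = []

    def visit(name, obj):
        # Called for every group and dataset in the file; only datasets are checked.
        if isinstance(obj, h5py.Dataset) and required_attr not in obj.attrs:
            missing.append(name)

    with h5py.File(path, "r") as f:
        f.visititems(visit)
    return sorted(missing)


# Demo file: one dataset follows the convention, one does not (placeholder values).
check_path = os.path.join(tempfile.mkdtemp(), "check.h5")
with h5py.File(check_path, "w") as f:
    f.create_dataset("frequency", data=np.array([1.0e6, 2.0e6])).attrs["units"] = "Hz"
    f.create_dataset("voltage", data=np.array([0.0, 0.0]))  # units forgotten

missing = find_datasets_missing_units(check_path)
print("Datasets missing units:", missing)
```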

Advantage of the demo guide: The guide facilitates the sharing of information within the research group and works as so-called "meta-metadata": a description of how the metadata is compiled so that it moves together with the data file and contains all the necessary information.