Tutorial: Parameter File

Our data

In the “HelloWorld” example, the Record that was synchronized with the server was created “manually” using the Python client. Now we want to have a look at how the Crawler can be told to do this for us.

The Crawler needs instructions on what kind of Records it should create given the data that it sees. This is done using so-called “CFood” YAML files.

Let’s once again start with something simple. A common scenario is that we want to insert the contents of a parameter file. Suppose the parameter file is named params_2022-02-02.json and looks like the following:

params_2022-02-02.json
{
  "frequency": 0.5,
  "resolution": 0.01
}
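To see what the crawler will have to work with, we can parse the same content with Python's standard json module (the file content is inlined here for illustration):

```python
import json

# The same content as params_2022-02-02.json, inlined for illustration.
content = """
{
  "frequency": 0.5,
  "resolution": 0.01
}
"""

params = json.loads(content)
print(params["frequency"])   # 0.5
print(params["resolution"])  # 0.01
```

The result is a plain dictionary with two float values, which is exactly the structure the CFood below will match against.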

Suppose these are two Properties of an Experiment and the date in the file name is the date of the Experiment. Thus, the data model could be described in a model.yml like this:

model.yml
Experiment:
  recommended_properties:
    frequency:
      datatype: DOUBLE
    resolution:
      datatype: DOUBLE
    date:
      datatype: DATETIME

We will identify experiments solely using the date, so the identifiable.yml is:

identifiable.yml
Experiment:
  - date

Getting started with the CFood

CFoods (Crawler configurations) can be stored in YAML files. The following section in a cfood.yml tells the crawler that the key-value pair frequency: 0.5 shall be used to set the Property “frequency” of an “Experiment” Record:

...
my_frequency:  # just the name of this section
  type: FloatElement  # it is a float value
  match_name: ^frequency$  # regular expression: Match the 'frequency' key from the data json
  match_value: ^(?P<value>.*)$  # regular expression: We match any value of that key
  records:
    Experiment:
      frequency: $value
...

The first part of this section defines which kind of data element shall be handled (here: a key-value pair with the key “frequency” and a float value), and then we use this to set the “frequency” Property.

How is the value actually assigned? Let’s look at what the regular expressions do:

  • ^frequency$ ensures that the key is exactly “frequency”. “^” matches the beginning of the string and “$” matches the end.

  • ^(?P<value>.*)$ creates a named match group with the name “value”; the pattern of this group is “.*”. The dot matches any character and the star means that it can occur zero or more times. Thus, this regular expression matches anything and puts the match into a group named value.

We can use the groups from the regular expressions that are used for matching. In our example, we use the “value” group to assign the “frequency” value to the “Experiment”.
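These two patterns can be tried out directly with Python's re module (the test strings here are just examples, not part of the crawler itself):

```python
import re

# The key must be exactly "frequency" -- anchored at both ends.
assert re.match(r"^frequency$", "frequency")
assert re.match(r"^frequency$", "frequency_max") is None  # "$" rejects a longer key

# The value pattern captures the whole string into the named group "value".
m = re.match(r"^(?P<value>.*)$", "0.5")
print(m.group("value"))  # "0.5"
```

The named group is what makes the variable $value available in the records section of the CFood.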

A fully grown CFood

Since we will not pass this key-value pair on its own to the crawler, we need to embed it into its context. The full CFood file cfood.yml for this example might look like the following:

cfood.yml
---
metadata:
  crawler-version: 0.5.0
---
directory: # corresponds to the directory given to the crawler
  type: Directory
  match: .* # we do not care how it is named here
  subtree:
    parameterfile:  # corresponds to our parameter file
      type: JSONFile
      match: params_(?P<date>\d+-\d+-\d+)\.json # extract the date from the parameter file
      records:
        Experiment: # one Experiment is associated with the file
          date: $date # the date is taken from the file name
      subtree:
        dict:  # the JSON contains a dictionary
          type: Dict
          match: .* # the dictionary does not have a meaningful name
          subtree:
            my_frequency: # here we parse the frequency...
              type: FloatElement
              match_name: frequency
              match_value: (?P<val>.*)
              records:
                Experiment:
                  frequency: $val
            resolution: # ... and here the resolution
              type: FloatElement
              match_name: resolution
              match_value: (?P<val>.*)
              records:
                Experiment:
                  resolution: $val

You do not need to understand every aspect of this right now; we will cover it later in greater depth. You might think: “Oh, this is lengthy.” Well, yes, BUT this is a very generic approach that allows data integration from ANY hierarchical data structure (directory trees, JSON, YAML, HDF5, DICOM, … and combinations of those!), and as you will see in later chapters, there are ways to write this in a more condensed way!

For now, we want to see it running!

The crawler can now be run with the following command (assuming that the CFood file is in the current working directory):

caosdb-crawler -s update -i identifiable.yml cfood.yml .
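To build some intuition for what happens when the crawler runs, here is a stand-alone Python sketch of the matching steps the CFood describes. This is only an illustration with plain dictionaries, not the crawler's actual implementation:

```python
import json
import re

# Step 1: match the file name and extract the date, as in the
# "match" line of the JSONFile node in cfood.yml.
filename = "params_2022-02-02.json"
m = re.match(r"params_(?P<date>\d+-\d+-\d+)\.json", filename)
experiment = {"date": m.group("date")}

# Step 2: parse the JSON content (inlined here) and pick up the two
# float values, as the FloatElement nodes do.
content = '{"frequency": 0.5, "resolution": 0.01}'
for key, value in json.loads(content).items():
    if key in ("frequency", "resolution"):
        experiment[key] = value

print(experiment)
# {'date': '2022-02-02', 'frequency': 0.5, 'resolution': 0.01}
```

The crawler then uses the identifiable definition (here: the date) to decide whether such an Experiment Record already exists on the server and should be updated, or whether a new one must be inserted.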