Tutorial: Parameter File
Our data
In the “HelloWorld” example, the Record that was synchronized with the server was created “manually” using the Python client. Now we want to look at how the Crawler can be told to do this for us.
The Crawler needs instructions on what kind of Records it should create given the data that it sees. These instructions are given in so-called “CFood” YAML files.
Let’s once again start with something simple. A common scenario is that we want to insert the contents of a parameter file. Suppose the parameter file is named params_2022-02-02.json and looks like the following:
    {
      "frequency": 0.5,
      "resolution": 0.01
    }
Suppose these are two Properties of an Experiment and the date in the file name is the date of the Experiment. Thus, the data model could be described in a model.yml like this:
    Experiment:
      recommended_properties:
        frequency:
          datatype: DOUBLE
        resolution:
          datatype: DOUBLE
        date:
          datatype: DATETIME
We will identify experiments solely using the date, so the identifiables.yml is:
Experiment:
- date
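To illustrate what “identifying by date” implies, here is a minimal plain-Python sketch. It assumes that identifiables are used to decide whether a crawled Record already exists on the server; the names `IDENTIFYING_PROPERTIES`, `identity` and `plan_sync` are hypothetical and not part of the crawler API, and Records are represented as simple dictionaries.

```python
# Hypothetical illustration, not crawler code: Records are plain dicts.
IDENTIFYING_PROPERTIES = ("date",)  # mirrors the "Experiment: - date" entry


def identity(record):
    """Return the tuple of identifying values for a record."""
    return tuple(record[name] for name in IDENTIFYING_PROPERTIES)


def plan_sync(existing, crawled):
    """Split crawled records into those to insert and those to update."""
    known = {identity(record): record for record in existing}
    to_insert, to_update = [], []
    for record in crawled:
        old = known.get(identity(record))
        if old is None:
            to_insert.append(record)   # no record with this date yet
        elif old != record:
            to_update.append(record)   # same date, different content
    return to_insert, to_update


existing = [{"date": "2022-02-02", "frequency": 0.5, "resolution": 0.01}]
crawled = [
    {"date": "2022-02-02", "frequency": 0.7, "resolution": 0.01},
    {"date": "2022-02-03", "frequency": 0.5, "resolution": 0.01},
]
to_insert, to_update = plan_sync(existing, crawled)
print(to_insert)
print(to_update)
```

In this sketch, two Experiments with the same date are treated as the same Experiment, which is exactly the consequence of listing only the date as identifying Property.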
Getting started with the CFood
CFoods (Crawler configurations) can be stored in YAML files. The following section in a cfood.yml tells the crawler that the key-value pair frequency: 0.5 shall be used to set the Property “frequency” of an “Experiment” Record:
    ...
    my_frequency: # just the name of this section
      type: FloatElement # it is a float value
      match_name: ^frequency$ # regular expression: Match the 'frequency' key from the data json
      match_value: ^(?P<value>.*)$ # regular expression: We match any value of that key
      records:
        Experiment:
          frequency: $value
    ...
The first part of this section defines which kind of data element shall be handled (here: a key-value pair with key “frequency” and a float value) and then we use this to set the “frequency” Property.
How is the value actually assigned? Let’s look at what the regular expressions do:
^frequency$ ensures that the key is exactly “frequency”. “^” matches the beginning of the string and “$” matches the end.
^(?P<value>.*)$ creates a named match group with the name “value”, and the pattern of this group is “.*”. The dot matches any character and the star means that it can occur zero, one or multiple times. Thus, this regular expression matches anything and puts it in a group with the name value.
We can use the groups from the regular expressions that are used for matching. In our example, we use the “value” group to assign the “frequency” value to the “Experiment”.
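The two regular expressions can be tried out with Python's standard `re` module. This is only a sketch of the matching idea, not the crawler's implementation: the helper `match_pair` is hypothetical, and it assumes that the value is matched against its string representation.

```python
import re

# The two patterns from the CFood section above.
name_pattern = re.compile(r"^frequency$")
value_pattern = re.compile(r"^(?P<value>.*)$")


def match_pair(key, value):
    """Return the named groups if both key and value match, else None."""
    if name_pattern.match(key) is None:
        return None
    # Assumption: the value is converted to a string before matching.
    match = value_pattern.match(str(value))
    if match is None:
        return None
    return match.groupdict()


print(match_pair("frequency", 0.5))    # {'value': '0.5'}
print(match_pair("resolution", 0.01))  # None, the key does not match
```

The dictionary returned for the matching pair contains the group “value”, which is what `$value` refers to in the `records` part of the CFood.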
Note
For more information on the cfood.yml specification, read on in the chapter Converters.
A fully grown CFood
Since we will not pass this key-value pair on its own to the crawler, we need to embed it into its context. The full CFood file cfood.yml for this example might look like the following:
    ---
    metadata:
      crawler-version: 0.5.0
    ---
    directory: # corresponds to the directory given to the crawler
      type: Directory
      match: .* # we do not care how it is named here
      subtree:
        parameterfile: # corresponds to our parameter file
          type: JSONFile
          match: params_(?P<date>\d+-\d+-\d+)\.json # extract the date from the parameter file
          records:
            Experiment: # one Experiment is associated with the file
              date: $date # the date is taken from the file name
          subtree:
            dict: # the JSON contains a dictionary
              type: Dict
              match: .* # the dictionary does not have a meaningful name
              subtree:
                my_frequency: # here we parse the frequency...
                  type: FloatElement
                  match_name: frequency
                  match_value: (?P<val>.*)
                  records:
                    Experiment:
                      frequency: $val
                resolution: # ... and here the resolution
                  type: FloatElement
                  match_name: resolution
                  match_value: (?P<val>.*)
                  records:
                    Experiment:
                      resolution: $val
You do not need to understand every aspect of this right now; we will cover it in greater depth later. You might think: “This is lengthy.” Well, yes, but this is a very generic approach that allows data integration from any hierarchical data structure (directory trees, JSON, YAML, HDF5, DICOM, … and combinations of those), and as you will see in later chapters, there are ways to write this in a more condensed way.
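To make the hierarchy of the CFood more tangible, the following plain-Python sketch imitates the same traversal by hand: directory, then parameter file, then dictionary, then the two float elements. It is not how the crawler is implemented; the function `collect_experiments` and the dictionary representation of an Experiment are hypothetical, and using `fullmatch` for the file name is an assumption made for this illustration.

```python
import json
import re
import tempfile
from pathlib import Path

# Same file name pattern as in the CFood; the group "date" plays the role of $date.
FILE_PATTERN = re.compile(r"params_(?P<date>\d+-\d+-\d+)\.json")
FLOAT_KEYS = ("frequency", "resolution")


def collect_experiments(directory):
    """Build one Experiment dict per matching parameter file in a directory."""
    experiments = []
    for path in sorted(Path(directory).iterdir()):
        match = FILE_PATTERN.fullmatch(path.name)
        if match is None:
            continue  # not a parameter file, nothing to create
        experiment = {"date": match.group("date")}
        with path.open() as f:
            content = json.load(f)  # the JSON contains a dictionary
        for key in FLOAT_KEYS:
            value = content.get(key)
            if isinstance(value, float):  # only float elements are used
                experiment[key] = value
        experiments.append(experiment)
    return experiments


# Try it on a temporary directory that contains the example parameter file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "params_2022-02-02.json").write_text(
        '{"frequency": 0.5, "resolution": 0.01}'
    )
    (Path(tmp) / "notes.txt").write_text("ignored")
    result = collect_experiments(tmp)

print(result)
```

Each nesting level of the CFood corresponds to one step of this function. The CFood expresses the same steps declaratively, so that no such code has to be written for each new data layout.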
For now, we want to see it running!
The crawler can now be run with the following command (assuming that the CFood file is in the current working directory):
caosdb-crawler -s update -i identifiables.yml cfood.yml .
Note
caosdb-crawler currently only works with CFoods that have a directory as top-level element.