Tutorial: Parameter File
========================

Our data
--------

In the "HelloWorld" example, the Record that was synchronized with the server
was created "manually" using the Python client. Now we want to have a look at
how the crawler can be told to do this for us. The crawler needs instructions
on what kind of Records it should create given the data that it sees. This is
done using so-called "CFood" YAML files.

Let's once again start with something simple. A common scenario is that we
want to insert the contents of a parameter file. Suppose the parameter file is
named ``params_2022-02-02.json`` and looks like the following:

.. code-block:: json
   :caption: params_2022-02-02.json

   {
     "frequency": 0.5,
     "resolution": 0.01
   }

Suppose these are two Properties of an Experiment and the date in the file
name is the date of the Experiment. Thus, the data model could be described in
a ``model.yml`` like this:

.. code-block:: yaml
   :caption: model.yml

   Experiment:
     recommended_properties:
       frequency:
         datatype: DOUBLE
       resolution:
         datatype: DOUBLE
       date:
         datatype: DATETIME

We will identify Experiments solely by their date, so the ``identifiable.yml``
is:

.. code-block:: yaml
   :caption: identifiable.yml

   Experiment:
     - date

Getting started with the CFood
------------------------------

CFoods (crawler configurations) can be stored in YAML files. The following
section in a ``cfood.yml`` tells the crawler that the key-value pair
``frequency: 0.5`` shall be used to set the Property "frequency" of an
"Experiment" Record:

.. code:: yaml

   ...
   my_frequency:                  # just the name of this section
     type: FloatElement           # it is a float value
     match_name: ^frequency$      # regular expression: match the 'frequency' key from the JSON data
     match_value: ^(?P<value>.*)$ # regular expression: we match any value of that key
     records:
       Experiment:
         frequency: $value
   ...
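The matching performed by this section can be tried out in isolation with
Python's ``re`` module. This is a standalone sketch for illustration only,
not the crawler's actual code; the key and value stand in for the data
element the crawler encounters:

```python
import re

# The same patterns as in the CFood section above.
match_name = re.compile(r"^frequency$")
match_value = re.compile(r"^(?P<value>.*)$")

key, value = "frequency", "0.5"

if match_name.match(key):
    m = match_value.match(value)
    # The named group "value" is what $value refers to in the CFood.
    print(m.group("value"))  # prints: 0.5
```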
The first part of this section defines which kind of data element shall be
handled (here: a key-value pair with key "frequency" and a float value), and
then we use this to set the "frequency" Property.

How does the value actually get assigned? Let's look at what the regular
expressions do:

- ``^frequency$`` ensures that the key is exactly "frequency". "^" matches
  the beginning of the string and "$" matches the end.
- ``^(?P<value>.*)$`` creates a *named match group* with the name "value";
  the pattern of this group is ".*". The dot matches any character and the
  star means that it can occur zero, one, or multiple times. Thus, this
  regular expression matches anything and puts it in a group with the name
  ``value``.

We can use the groups from the regular expressions that are used for
matching. In our example, we use the "value" group to assign the "frequency"
value to the "Experiment".

A fully grown CFood
-------------------

Since we will not pass this key-value pair on its own to the crawler, we need
to embed it into its context. The full CFood file ``cfood.yml`` for this
example might look like the following:

.. code-block:: yaml
   :caption: cfood.yml

   ---
   metadata:
     crawler-version: 0.5.0
   ---
   directory:                # corresponds to the directory given to the crawler
     type: Directory
     match: .*               # we do not care how it is named here
     subtree:
       parameterfile:        # corresponds to our parameter file
         type: JSONFile
         match: params_(?P<date>\d+-\d+-\d+)\.json  # extract the date from the parameter file
         records:
           Experiment:       # one Experiment is associated with the file
             date: $date     # the date is taken from the file name
         subtree:
           dict:             # the JSON contains a dictionary
             type: Dict
             match: .*       # the dictionary does not have a meaningful name
             subtree:
               my_frequency: # here we parse the frequency ...
                 type: FloatElement
                 match_name: frequency
                 match_value: (?P<val>.*)
                 records:
                   Experiment:
                     frequency: $val
               resolution:   # ... and here the resolution
                 type: FloatElement
                 match_name: resolution
                 match_value: (?P<val>.*)
                 records:
                   Experiment:
                     resolution: $val

You do not need to understand every aspect of this right now; we will cover
it later in greater depth. You might think: "Ohh, this is lengthy." Well,
yes, BUT this is a very generic approach that allows data integration from
ANY hierarchical data structure (directory trees, JSON, YAML, HDF5, DICOM,
... and combinations of those!), and as you will see in later chapters,
there are ways to write this in a more condensed form.

For now, we want to see it running!

The crawler can now be run with the following command (assuming that the
CFood file is in the current working directory):

.. code:: sh

   caosdb-crawler -s update -i identifiable.yml cfood.yml .
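Before running the crawler, the file-name pattern from the ``JSONFile``
section can likewise be checked in isolation with Python's ``re`` module.
Again, this is only an illustrative sketch of what the named group captures;
the real matching is done internally by the crawler:

```python
import re

# The file-name pattern from the CFood; the named group "date"
# is what $date refers to in the Experiment record.
pattern = re.compile(r"params_(?P<date>\d+-\d+-\d+)\.json")

m = pattern.match("params_2022-02-02.json")
print(m.group("date"))  # prints: 2022-02-02
```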