CFood-Definition

The crawler specification is called CFood-definition. It is stored in a YAML file or, more precisely, in one or two YAML documents within a single YAML file.

The specification consists of three separate parts:

  1. Metadata and macro definitions

  2. Custom converter registrations

  3. The converter tree specification

In the simplest case, there is just one YAML file containing a single document that includes at least the converter tree specification (see example 1). The custom converter registrations may also be included in this single document (for historical reasons, see example 2), but it is recommended to place them in a separate document together with the metadata and macro definitions (see below).

If metadata and macro definitions are provided, they must be placed in a separate document that precedes the converter tree specification.

It is highly recommended to specify the version of the CaosDB crawler for which the cfood is written in the metadata section, see below.

Examples

A single document with a converter tree specification:

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

A single document with a converter tree specification, but also including a custom converters section:

Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

A YAML multi-document that defines metadata and some macros in the first document and declares two custom converters in the second document (not recommended, see the recommended version below). Note that separate YAML documents are delimited using the --- syntax:

---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
---
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)

The recommended way of defining metadata, custom converters, macros and the main cfood specification is shown in the following code example:

---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
  Converters:
    CustomConverter_1:
      package: mypackage.converters
      converter: CustomConverter1
    CustomConverter_2:
      package: mypackage.converters
      converter: CustomConverter2
---
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
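The two-document layout can be illustrated with a small stdlib-only Python sketch. It is not how the crawler parses a CFood; the function name split_cfood_documents is hypothetical, and the splitting is deliberately crude (every line that is exactly --- is treated as a document separator, which a real YAML parser handles far more carefully).

```python
def split_cfood_documents(text):
    """Split CFood text into (definitions, converter_tree) text chunks.

    Crude sketch: only a line that is exactly '---' counts as a separator.
    Returns None for the definitions part if there is a single document.
    """
    documents, current = [], []
    # The trailing sentinel flushes the last document.
    for line in text.splitlines() + ["---"]:
        if line.rstrip() == "---":
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)

    if len(documents) == 1:
        return None, documents[0]
    if len(documents) == 2:
        return documents[0], documents[1]
    raise ValueError(f"expected one or two documents, found {len(documents)}")


example = """\
---
metadata:
  crawler-version: 0.2.1
---
extroot:
  type: Directory
  match: ^extroot$
"""

definitions, tree = split_cfood_documents(example)
```

Here definitions holds the text of the metadata document and tree holds the converter tree specification; for a single-document file, definitions is None.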

List Mode

Property values can start with one of two special characters in order to automatically create lists or multi properties instead of single values:

Experiment1:
    Measurement: +Measurement #  Element in List (list is cleared before run)
                 *Measurement #  Multi Property (properties are removed before run)
                 Measurement  #  Overwrite

Values and units

Property values can be specified as simple strings (as above) or as dictionaries that may also specify the collection mode. Strings starting with a “$” will be replaced by the corresponding variable, if there is one. See the tutorials chapter of this documentation for more elaborate examples of how exactly the variable replacement works. A simple example could look like the following:

ValueElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "(?P<value>.*)"  # Anything in here is stored in the variable "value"
  records:
    MyRecord:
      MyProp: $value  # will be replaced by whatever is stored in the "value" variable set above.
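The idea of capturing a named group and substituting it for a “$” placeholder can be sketched in plain Python. This is only an illustration under the assumption that the replacement behaves like Python's string.Template; the helper name resolve_property is hypothetical and the crawler's own implementation may differ.

```python
import re
from string import Template


def resolve_property(match_value, element_value, property_template):
    """Match element_value and fill named groups into property_template.

    Returns None if the pattern does not match. Placeholders without a
    corresponding variable are left untouched.
    """
    match = re.match(match_value, element_value)
    if match is None:
        return None
    variables = match.groupdict()  # e.g. {"value": "hello"}
    return Template(property_template).safe_substitute(variables)


resolved = resolve_property(r"(?P<value>.*)", "hello", "$value")
unresolved = resolve_property(r"(?P<value>.*)", "hello", "$other")
```

In this sketch, resolved contains the captured text, while unresolved keeps the placeholder because no variable named other exists.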

If not given explicitly, the collection mode will be determined from the first character of the property value as explained above, and the following three definitions are all equivalent:

MyProp: +$value
MyProp:
  value: +$value

and

MyProp:
  value: $value
  collection_mode: list
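The equivalence of these forms can be sketched as a small normalization step in Python. The function name normalize_property is hypothetical, and the mode names "multiproperty" and "overwrite" are assumptions for illustration; only "list" appears in the example above.

```python
# Prefix characters from the "List Mode" section mapped to collection modes.
# "multiproperty" and "overwrite" are assumed names, not confirmed here.
PREFIX_MODES = {"+": "list", "*": "multiproperty"}


def normalize_property(definition):
    """Return (value, collection_mode) for a string or dict definition."""
    if isinstance(definition, dict):
        value = definition["value"]
        mode = definition.get("collection_mode")
    else:
        value, mode = definition, None

    if mode is None:
        # No explicit mode: look at the first character of the value.
        if isinstance(value, str) and value[:1] in PREFIX_MODES:
            mode = PREFIX_MODES[value[0]]
            value = value[1:]
        else:
            mode = "overwrite"
    return value, mode


forms = [
    "+$value",
    {"value": "+$value"},
    {"value": "$value", "collection_mode": "list"},
]
normalized = [normalize_property(form) for form in forms]
```

All three entries of normalized are the same pair of value and collection mode, which mirrors the statement that the three definitions are equivalent.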

Units of numeric values can be set by providing a property value not as a single string, but as a dictionary with a value and a unit key. Within a converter definition this could look like the following:

ValueWithUnitElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "^(?P<number>\\d+\\.?\\d*)\\s+(?P<unit>.+)"  # Extract value and unit from a string which
                                                            # has a number followed by at least one whitespace
                                                            # character followed by a unit.
  records:
    MyRecord:
      MyProp:
        value: $number
        unit: $unit
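What this regular expression extracts can be checked with Python's re module. The sketch below uses the same pattern written as a Python raw string; the helper name split_value_and_unit is hypothetical and the sample inputs are made up for illustration.

```python
import re

# Same pattern as in the converter definition, written as a Python raw string.
PATTERN = re.compile(r"^(?P<number>\d+\.?\d*)\s+(?P<unit>.+)")


def split_value_and_unit(text):
    """Return {"value": ..., "unit": ...} or None if text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return {"value": match.group("number"), "unit": match.group("unit")}


with_unit = split_value_and_unit("3.5 kg")
no_whitespace = split_value_and_unit("3.5kg")
```

Note that the pattern requires at least one whitespace character between the number and the unit, so a string such as "3.5kg" does not match in this sketch.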

File Entities

In order to use File Entities, you must set the appropriate role: File. Additionally, the path and file keys have to be given, with values that set the remote and local paths, respectively. You can use the variable <converter name>_path that is automatically created by converters that deal with file system related StructureElements. The file object itself is stored in a variable with the same name (as is the case for other Records).

somefile:
  type: SimpleFile
  match: ^params.*$  # match any file that starts with "params"
  records:
    fileEntity:
      role: File           # necessary to create a File Entity
      path: somefile.path  # defines the path in CaosDB
      file: somefile.path  # path where the file is found locally
    SomeRecord:
      ParameterFile: $fileEntity  # creates a reference to the file

Transform Functions

You can use transform functions to alter variable values that the crawler consumes (e.g. a string that was matched with a regular expression). See Converter Documentation.

You can define your own transform functions by adding them the same way you add custom converters:

Transformers:
  transform_foo:
    package: some.package
    function: some_foo
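A function registered this way might look like the following Python sketch. The signature is an assumption: it supposes that a transform function receives the input value and a dictionary of parameters and returns the new value. The parameter name upper is hypothetical. Consult the Converter Documentation for the interface the crawler actually expects.

```python
def some_foo(in_value, in_parameters):
    """Hypothetical transform: strip whitespace, optionally upper-case.

    Assumed interface: in_value is the variable value to transform and
    in_parameters is a dict of options given in the CFood.
    """
    result = str(in_value).strip()
    if in_parameters.get("upper", False):  # "upper" is a made-up option
        result = result.upper()
    return result


plain = some_foo("  sample_a  ", {})
shouted = some_foo("  sample_a  ", {"upper": True})
```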

Automatically generated keys

Some variable names are automatically generated and can be accessed using the $<variable name> syntax. These include:

  • <converter name>: access the path of converter names to the current converter

  • <converter name>.path: the file system path to the structure element (file system related converters only; you need curly brackets to use them: ${<converter name>.path})

  • <Record key>: all entities that are created in the records section are available under the same key
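Why curly brackets are needed for names that contain a dot can be illustrated with Python's string.Template. This is a sketch under the assumption that substitution works in a Template-like way; the class name DottedTemplate and the variable values are made up for illustration and are not taken from the crawler.

```python
from string import Template


class DottedTemplate(Template):
    # Allow dots inside ${...}; a bare $name keeps the default identifier
    # rule, so it stops at the first dot.
    braceidpattern = r"(?a:[_a-z][_.a-z0-9]*)"


# Made-up variable store for a converter called "somefile".
variables = {
    "somefile": "extroot.somefile",
    "somefile.path": "/data/params.txt",
}

# With curly brackets, the whole dotted name is looked up.
braced = DottedTemplate("${somefile.path}").safe_substitute(variables)

# Without them, only "somefile" is replaced and ".path" stays literal text.
bare = DottedTemplate("$somefile.path").safe_substitute(variables)
```

In this sketch, braced resolves to the file system path, while bare resolves the converter variable and appends the literal text ".path".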