CFood-Definition
The crawler specification is called the CFood-definition. It is stored in a YAML file or, more precisely, in one or two YAML documents inside a single YAML file.
The specification consists of three separate parts:
Metadata and macro definitions
Custom converter registrations
The converter tree specification
In the simplest case, there is just one YAML file with a single document that includes at least the converter tree specification (see example 1). The custom converter part may also be included in this single document (for historical reasons, see example 2), but it is recommended to put it in a separate document together with the metadata and macro definitions (see below).
If metadata and macro definitions are provided, there must be a second document, preceding the converter tree specification, that includes these definitions.
It is highly recommended to specify the version of the CaosDB crawler for which the cfood is written in the metadata section, see below.
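The one-or-two-document layout described above can be illustrated with a small Python sketch. It naively splits a CFood text at lines consisting of --- and reports which part would be the header document and which the converter tree. The function names are hypothetical and not part of the crawler; a real implementation would use a YAML parser rather than line matching.

```python
from typing import Dict, List, Optional


def split_yaml_documents(text: str) -> List[str]:
    """Naively split YAML text into documents at '---' marker lines."""
    documents: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if line.rstrip() == "---":
            # A marker closes the document collected so far (if non-empty).
            if any(part.strip() for part in current):
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if any(part.strip() for part in current):
        documents.append("\n".join(current))
    return documents


def classify_cfood(text: str) -> Dict[str, Optional[str]]:
    """Return the optional header document and the converter tree document."""
    docs = split_yaml_documents(text)
    if len(docs) == 1:
        return {"header": None, "tree": docs[0]}
    if len(docs) == 2:
        # The header (metadata, macros) precedes the converter tree.
        return {"header": docs[0], "tree": docs[1]}
    raise ValueError("expected one or two YAML documents, got %d" % len(docs))
```

With a single document the header is None; with two documents the first one is treated as the header.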
Examples
A single document with a converter tree specification:
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
A single document with a converter tree specification, but also including a custom converters section:
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
A YAML multi-document, defining metadata and some macros in the first document and declaring two custom converters in the second document (not recommended, see the recommended version below). Please note that two separate YAML documents can be defined using the --- syntax:
---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
---
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2

extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
The recommended way of defining metadata, custom converters, macros and the main cfood specification is shown in the following code example:
---
metadata:
  name: Datascience CFood
  description: CFood for data from the local data science work group
  crawler-version: 0.2.1
  macros:
  - !defmacro
    name: SimulationDatasetFile
    params:
      match: null
      recordtype: null
      nodename: null
    definition:
      # (...)
Converters:
  CustomConverter_1:
    package: mypackage.converters
    converter: CustomConverter1
  CustomConverter_2:
    package: mypackage.converters
    converter: CustomConverter2
---
extroot:
  type: Directory
  match: ^extroot$
  subtree:
    DataAnalysis:
      type: Directory
      match: DataAnalysis
      # (...)
List Mode
Property values can start with one of two special characters (+ or *) in order to automatically create lists or multi properties instead of single values. The three alternatives are:
Experiment1:
    Measurement: +Measurement  # Element in List (list is cleared before run)
                 *Measurement  # Multi Property (properties are removed before run)
                 Measurement   # Overwrite
Values and units
Property values can be specified as simple strings (as above) or as dictionaries that may also specify the collection mode. Strings starting with a “$” will be replaced by the corresponding variable, if there is one. See the tutorials chapter of this documentation for more elaborate examples of how exactly the variable replacement works. A simple example could look like the following.
ValueElt:
  type: TextElement
  match_name: ^my_prop$
  match_value: "(?P<value>.*)"  # Anything in here is stored in the variable "value"
  records:
    MyRecord:
      MyProp: $value  # will be replaced by whatever is stored in the "value" variable set above.
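The idea behind this replacement can be sketched in a few lines of Python. This is not the crawler's implementation: the function name is hypothetical, and string.Template from the standard library is used as a stand-in for the crawler's own variable handling. Named groups from the match become variables, which are then substituted into the property value.

```python
import re
from string import Template


def resolve_property(match_value: str, element_text: str, prop_template: str) -> str:
    """Match element_text, then substitute the named groups into prop_template."""
    match = re.match(match_value, element_text)
    if match is None:
        raise ValueError("element text does not match the pattern")
    # Named groups such as (?P<value>...) become variables.
    variables = match.groupdict()
    # safe_substitute leaves unknown $names untouched instead of raising.
    return Template(prop_template).safe_substitute(variables)
```

For example, resolve_property(r"(?P<value>.*)", "hello", "$value") yields the string that was captured in the group named value.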
If not given explicitly, the collection mode will be determined from the first character of the property value as explained above, and the following three definitions are all equivalent:
MyProp: +$value

MyProp:
  value: +$value

and

MyProp:
  value: $value
  collection_mode: list
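Why these three forms are equivalent can be seen in a minimal Python sketch of the normalization step. The function name is hypothetical, and the mode names "multiproperty" and "single" are assumptions made for illustration; only "list" appears in the text above.

```python
# Assumed mapping from prefix character to collection mode.
PREFIX_MODES = {"+": "list", "*": "multiproperty"}


def normalize_property(raw):
    """Return (value, collection_mode) for a string or dictionary definition."""
    if isinstance(raw, dict):
        value = raw["value"]
        mode = raw.get("collection_mode")
    else:
        value, mode = raw, None
    if mode is None:
        # No explicit mode: derive it from the first character of the value.
        if isinstance(value, str) and value[:1] in PREFIX_MODES:
            mode = PREFIX_MODES[value[0]]
            value = value[1:]
        else:
            mode = "single"
    return value, mode
```

All three definitions above normalize to the same pair of value and collection mode.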
Units of numeric values can be set by providing a property value not as a single string, but as a dictionary with a value and a unit key. Within a converter definition this could look like the following.
ValueWithUnitElt:
  type: TextElement
  match_name: ^my_prop$
  # Extract value and unit from a string which has a number followed by
  # at least one whitespace character followed by a unit.
  match_value: "^(?P<number>\\d+\\.?\\d*)\\s+(?P<unit>.+)"
  records:
    MyRecord:
      MyProp:
        value: $number
        unit: $unit
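To check what such a pattern captures, it can be tried out directly with Python's re module. The sketch below uses the pattern as it reads after YAML unescaping; the helper name split_value_and_unit is hypothetical.

```python
import re

# The pattern from the example, after YAML has resolved the escaped backslashes.
PATTERN = re.compile(r"^(?P<number>\d+\.?\d*)\s+(?P<unit>.+)")


def split_value_and_unit(text: str):
    """Return a dict with 'value' and 'unit', or None if the text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return {"value": match.group("number"), "unit": match.group("unit")}
```

A string such as "3.5 kg" is split into the number and the unit, while a string without whitespace between the two, or without a leading number, does not match.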
File Entities
In order to use File Entities, you must set the appropriate role: File. Additionally, the path and file keys have to be given, with values that set the paths remotely and locally, respectively. You can use the variable <converter name>_path that is automatically created by converters that deal with file system related StructureElements. The file object itself is stored in a variable with the same name (as is the case for other Records).
somefile:
  type: SimpleFile
  match: ^params.*$  # match any file that starts with "params"
  records:
    fileEntity:
      role: File           # necessary to create a File Entity
      path: somefile.path  # defines the path in CaosDB
      file: somefile.path  # path where the file is found locally
    SomeRecord:
      ParameterFile: $fileEntity  # creates a reference to the file
Transform Functions
You can use transform functions to alter variable values that the crawler consumes (e.g. a string that was matched with a regular expression). See Converter Documentation.
You can define your own transform functions by adding them the same way you add custom converters:
Transformers:
  transform_foo:
    package: some.package
    function: some_foo
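The function referenced here would live in the Python package some.package. The sketch below assumes that a transform function receives the incoming value together with a dictionary of parameters and returns the transformed value. This signature and the lower parameter are assumptions made for illustration; consult the Converter Documentation for the actual interface.

```python
from typing import Any, Optional


def some_foo(in_value: Any, in_parameters: Optional[dict] = None) -> str:
    """Hypothetical transform: strip whitespace and optionally lower-case."""
    parameters = in_parameters or {}
    result = str(in_value).strip()
    # "lower" is an illustrative parameter name, not a crawler convention.
    if parameters.get("lower", False):
        result = result.lower()
    return result
```

Keeping the function free of side effects makes it easy to test outside the crawler.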
Automatically generated keys
Some variable names are automatically generated and can be used with the $<variable name> syntax. Those include:
<converter name>: access the path of converter names to the current converter
<converter name>.path: the file system path to the structure element (file system related converters only; you need curly brackets to use them: ${<converter name>.path})
<Record key>: all entities that are created in the records section are available under the same key