Converters treat StructureElements and thereby create the StructureElements that are the children of the treated StructureElement. Converters therefore create the tree of structure elements. The definition of a Converter also states which Converters shall be used to treat the generated child StructureElements. The definition is therefore a tree itself.

Each StructureElement in the tree has a set of data values, i.e., a dictionary of key-value pairs. Some of those values are set due to the kind of StructureElement. For example, a file could have the file name as such a key-value pair: ‘filename’: <sth>. Converters may define additional functions that create further values. For example, a regular expression could be used to get a date from a file name.
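As a minimal sketch of the last point, the snippet below extracts a date from a file name with a regular expression and merges it into the element's key-value pairs. The file name, the pattern, and the names `date` and `label` are illustrative assumptions, not part of the crawler.

```python
import re

# Hypothetical file name; the pattern assumes an ISO-like date prefix.
filename = "2021-03-15_measurement.dat"

match = re.match(r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<label>.*)\.dat$", filename)

# One value is set by the kind of element (the file name), the others are
# extracted by the regular expression.
values = {"filename": filename}
if match is not None:
    values.update(match.groupdict())

print(values)
```

Named groups are convenient here because every group name becomes a key directly.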

A converter is defined via a yml file or part of it. The definition states what kind of StructureElement it treats (typically one). Also, it defines how children of the current StructureElement are created and what Converters shall be used to treat those.

The yaml definition looks like the following:

TODO: outdated, see cfood-schema.yml

<NodeName>:
    type: <ConverterName>
    match: ".*"
    records:
        Experiment1:
            parents:
            - Experiment
            - Blablabla
            date: $DATUM
            (...)
        Experiment2:
            parents:
            - Experiment
    subtree:
        (...)

The <NodeName> is a description of what it represents (e.g. ‘experiment-folder’) and is used as identifier.

<type> selects the converter that is going to be matched against the current structure element. If the structure element matches (this is a combination of a typecheck and a detailed match, see Converter for details) the converter is used to generate records (see create_records()) and to possibly process a subtree, as defined by the function caoscrawler.converters.create_children().

records is a dict of definitions that define the semantic structure (see details below).

Subtree contains a list of Converter definitions that look like the one described here.
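The recursive nature of such a definition can be sketched with a toy model. The code below does not use the caoscrawler classes: structure elements are plain (name, children) tuples, the definition is a nested dict with only the keys match and subtree, and all node names are hypothetical.

```python
import re

# Toy structure elements: (name, list of children).
tree = ("data", [("2021_experiment", [("readme.md", [])]), ("notes.txt", [])])

# Toy converter definition, nested like the yaml structure described above.
definition = {
    "DataDir": {
        "match": r"data",
        "subtree": {
            "ExperimentDir": {
                "match": r"(?P<year>\d{4})_experiment",
                "subtree": {"Readme": {"match": r"readme\.md", "subtree": {}}},
            }
        },
    }
}


def crawl(element, converters, path=()):
    """Return (path, node name, matched variables) for every matching pair."""
    name, children = element
    hits = []
    for node_name, conv in converters.items():
        m = re.fullmatch(conv["match"], name)
        if m is None:
            continue
        hits.append(("/".join(path + (name,)), node_name, m.groupdict()))
        # Children are only treated by the converters listed in the subtree.
        for child in children:
            hits.extend(crawl(child, conv.get("subtree", {}), path + (name,)))
    return hits


hits = crawl(tree, definition)
for hit in hits:
    print(hit)
```

In this sketch, notes.txt produces no hit because no converter in the subtree of DataDir matches it.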

Standard Converters

Directory Converter

The Directory Converter creates StructureElements for each File and Directory inside the current Directory. You can match a regular expression against the directory name using the ‘match’ key.
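A rough sketch of this behavior, using only the standard library: the helper `directory_children` is hypothetical and merely illustrates matching the directory name and listing one entry per contained file or directory. It assumes the regular expression is applied with `re.match`.

```python
import re
import tempfile
from pathlib import Path


def directory_children(directory: Path, match: str):
    """Return (name, kind) pairs for all entries if the name matches, else None."""
    if re.match(match, directory.name) is None:
        return None
    return sorted(
        (child.name, "Directory" if child.is_dir() else "File")
        for child in directory.iterdir()
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "ExperimentalData"
    (root / "2021_project").mkdir(parents=True)
    (root / "README.md").write_text("# notes\n")
    children = directory_children(root, r"^Experimental.*$")
    unmatched = directory_children(root, r"^Simulation.*$")

print(children)
```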

Simple File Converter

The Simple File Converter does not create any children and is usually used if a file shall be used as it is and be inserted and referenced by other entities.

Markdown File Converter

Reads a YAML header from Markdown files (if such a header exists) and creates children elements according to the structure of the header.

DictElement Converter

Creates a child StructureElement for each key in the dictionary.

Typical Subtree converters

The following StructureElements are typically created:

  • BooleanElement

  • FloatElement

  • TextElement

  • IntegerElement

  • ListElement

  • DictElement
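The following sketch shows how the values of a dictionary might map to these element kinds. It assumes that the kind simply follows the Python type of each value, which this document does not state explicitly; `element_kind` is a hypothetical helper and returns names only, not crawler objects.

```python
def element_kind(value):
    """Return the name of the element kind assumed for a Python value."""
    # bool has to be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "BooleanElement"
    if isinstance(value, int):
        return "IntegerElement"
    if isinstance(value, float):
        return "FloatElement"
    if isinstance(value, str):
        return "TextElement"
    if isinstance(value, list):
        return "ListElement"
    if isinstance(value, dict):
        return "DictElement"
    raise TypeError(f"no element kind for {type(value).__name__}")


header = {"title": "Test run", "repetitions": 3, "valid": True,
          "temperature": 21.5, "tags": ["a", "b"], "responsible": {"name": "X"}}

# One child per key of the dictionary.
kinds = {key: element_kind(value) for key, value in header.items()}
print(kinds)
```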

Scalar Value Converters

BooleanElementConverter, FloatElementConverter, TextElementConverter, and IntegerElementConverter behave very similarly.

These converters expect match_name and match_value in their definition, which allow matching the key and the value, respectively.

Note that there are defaults for accepting other types. For example, FloatElementConverter also accepts IntegerElements. The default behavior can be adjusted with the fields accept_text, accept_int, accept_float, and accept_bool.

The following denotes what kind of StructureElements are accepted by default (they are defined in src/caoscrawler/):

  • DictBooleanElementConverter: bool, int

  • DictFloatElementConverter: int, float

  • DictTextElementConverter: text, bool, int, float

  • DictIntegerElementConverter: int

  • DictListElementConverter: list

  • DictDictElementConverter: dict
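How the accept_* fields could interact with these defaults is sketched below for the scalar converters. The table is copied from the list above; the function `accepted_kinds` and the exact override semantics (True adds a kind, False removes it) are assumptions made for illustration.

```python
# Defaults for the scalar converters, as listed above.
DEFAULTS = {
    "DictBooleanElementConverter": {"bool", "int"},
    "DictFloatElementConverter": {"int", "float"},
    "DictTextElementConverter": {"text", "bool", "int", "float"},
    "DictIntegerElementConverter": {"int"},
}


def accepted_kinds(converter: str, definition: dict) -> set:
    """Combine the defaults with accept_* fields from a converter definition."""
    kinds = set(DEFAULTS[converter])
    for kind in ("text", "int", "float", "bool"):
        flag = definition.get("accept_" + kind)
        if flag is True:
            kinds.add(kind)
        elif flag is False:
            kinds.discard(kind)
    return kinds


default_float = accepted_kinds("DictFloatElementConverter", {})
strict_float = accepted_kinds("DictFloatElementConverter", {"accept_int": False})
print(sorted(default_float), sorted(strict_float))
```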


A specialized Dict Converter for yaml files: YAML files are opened and their contents are converted into dictionaries, which can be further converted using the typical subtree converters of the dict converter.

WARNING: Currently unfinished implementation.



A generic converter (abstract) for files containing tables. Currently, there are two specialized implementations for xlsx-files and csv-files.

All table converters generate a subtree that can be converted with DictDictElementConverters: For each row in the table a DictDictElement (structure element) is generated. The key of the element is the row number. The value of the element is a dict containing the mapping of column names to values of the respective cell.
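The shape of this subtree can be illustrated with the standard csv module. This is only a sketch: the table content is made up, the keys are assumed to be zero-based row numbers stored as strings, and csv.DictReader keeps every cell as text, whereas the actual converters may convert cell types.

```python
import csv
import io

# Hypothetical table content.
csv_text = "sample,measurement\nA,1.5\nB,2.25\n"

reader = csv.DictReader(io.StringIO(csv_text))

# One entry per row: key = row number, value = mapping of column name to cell.
rows = {str(index): dict(row) for index, row in enumerate(reader)}

for key, value in rows.items():
    print(key, value)
```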


subtree:
  TABLE:
    type: CSVTableConverter
    match: ^test_table.csv$
    records:
      (...)  # Records edited for the whole table file
    subtree:
      ROW:
        type: DictDictElement
        match_name: .*
        match_value: .*
        records:
          (...)  # Records edited for each row
        subtree:
          COLUMN:
            type: DictFloatElement
            match_name: measurement  # Name of the column in the table file
            match_value: (?P<column_value>.*)
            records:
              (...)  # Records edited for each cell



Custom Converters

It was previously mentioned that it is possible to create custom converters. These custom converters can be used to integrate arbitrary data extraction and ETL capabilities into the caosdb-crawler and make these extensions available to any yaml specification.

The basic syntax for adding a custom converter to a yaml cfood definition file is:

Converters:
  <NameOfTheConverterInYamlFile>:
    package: <python>.<module>.<name>
    converter: <PythonClassName>

The Converters section can be put into either the first or the second document of the cfood yaml file. It can also be part of a single-document yaml cfood file. Please refer to the cfood documentation for more details.


  • <NameOfTheConverterInYamlFile>: This is the name of the converter as it is going to be used in the present yaml file.

  • <python>.<module>.<name>: The name of the module where the converter class resides.

  • <PythonClassName>: Within this specified module there must be a class inheriting from base class caoscrawler.converters.Converter.

The following methods are abstract and need to be overridden by your custom converter to make it work:

  • create_children()

  • match()

  • typecheck()
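What "abstract" means in practice can be sketched with Python's abc module. `ToyConverter` below is a hypothetical stand-in for the real base class with simplified signatures. It assumes that the abstract methods are enforced in the usual abc way, so a subclass that leaves one of them out cannot be instantiated.

```python
from abc import ABC, abstractmethod


class ToyConverter(ABC):
    """Hypothetical stand-in for the base class; signatures are simplified."""

    @abstractmethod
    def create_children(self, values, element):
        ...

    @abstractmethod
    def match(self, element):
        ...

    @abstractmethod
    def typecheck(self, element):
        ...


class Incomplete(ToyConverter):
    # Only one of the three abstract methods is overridden.
    def match(self, element):
        return {}


class Complete(ToyConverter):
    def create_children(self, values, element):
        return []

    def match(self, element):
        return {} if isinstance(element, str) else None

    def typecheck(self, element):
        return isinstance(element, str)


try:
    Incomplete()
    instantiable = True
except TypeError:
    instantiable = False

print(instantiable, Complete().typecheck("some text"))
```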


In the following, we will explain the process of adding a custom converter to a yaml file using a SourceResolver that is able to attach a source element to another entity.

Note: This example might become a standard crawler soon, as part of the scifolder specification. See for details. In this documentation example we will, therefore, add it to a package called “scifolder”.

First we will create our package and module structure, which might be:

      converters/
        sources.py  # <- the actual file containing
                    #    the converter class

Now we need to create a class called “SourceResolver” in the file “sources.py”. In this more advanced example, we will not inherit our converter directly from Converter, but use TextElementConverter. The latter already implements match() and typecheck(), so we only have to provide an implementation of create_children(). Furthermore, we will customize the method create_records(), which allows us to specify a more complex record generation procedure than the standard implementation provides. One specific limitation of the standard implementation is that only a fixed number of records can be generated by the yaml definition. So for any application that requires an arbitrary number of records to be created, as is the case here, a customized implementation of create_records() is recommended. In this context it is recommended to make use of the function caoscrawler.converters.create_records(), which implements the creation of record objects from Python dictionaries of the same structure that would be given using a yaml definition (see next section below).

import re
from caoscrawler.stores import GeneralStore, RecordStore
from caoscrawler.converters import TextElementConverter, create_records
from caoscrawler.structure_elements import StructureElement, TextElement

class SourceResolver(TextElementConverter):
  """
  This resolver uses a source list element (e.g. from the markdown readme file)
  to link sources correctly.
  """

  def __init__(self, definition: dict, name: str,
               converter_registry: dict):
      """Initialize a new SourceResolver converter."""
      super().__init__(definition, name, converter_registry)

  def create_children(self, generalStore: GeneralStore,
                            element: StructureElement):

      # The source resolver does not create children:

      return []

  def create_records(self, values: GeneralStore,
                     records: RecordStore,
                     element: StructureElement,
                     file_path_prefix: str):
      if not isinstance(element, TextElement):
          raise RuntimeError()

      # This function must return a list containing tuples, each one for a modified
      # property: (name_of_entity, name_of_property)
      keys_modified = []

      # This is the name of the entity where the source is going to be attached:
      attach_to_scientific_activity = self.definition["scientific_activity"]
      rec = records[attach_to_scientific_activity]

      # The "source" is a path to a source project, so it should have the form:
      # /<Category>/<project>/<scientific_activity>/
      # obtain these information from the structure element:
      val = element.value
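      # NOTE: the two groups after "category" are an assumption that follows the
      # path form given above; adapt them to the actual layout.
      regexp = (r'/(?P<category>(SimulationData)|(ExperimentalData)|(DataAnalysis))'
                r'/(?P<project>[^/]+)/(?P<scientific_activity>[^/]+)/')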

      res = re.match(regexp, val)
      if res is None:
          raise RuntimeError("Source cannot be parsed correctly.")

      # Mapping of categories on the file system to corresponding record types in CaosDB:
      cat_map = {
          "SimulationData": "Simulation",
          "ExperimentalData": "Experiment",
          "DataAnalysis": "DataAnalysis"}
      linkrt = cat_map[res.group("category")]

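      keys_modified.extend(create_records(values, records, {
          "Project": {
              # (...) properties of the project, taken from the match result
          },
          linkrt: {
              # (...) properties of the linked record, taken from the match result
              "project": "$Project"
          },
          attach_to_scientific_activity: {
              "sources": "+$" + linkrt
          }}, file_path_prefix))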

      # Process the records section of the yaml definition:
      keys_modified.extend(
          super().create_records(values, records, element, file_path_prefix))

      # The create_records function must return the modified keys to make it compatible
      # to the crawler functions:
      return keys_modified

If the recommended (python) package structure is used, the package containing the converter definition can just be installed using pip install . or pip install -e . from the scifolder_package directory.

The following yaml block will register the converter in a yaml file:

Converters:
  SourceResolver:
    package: scifolder.converters.sources
    converter: SourceResolver

Using the create_records API function

The function caoscrawler.converters.create_records() was already mentioned above and it is the recommended way to create new records from custom converters. Let’s have a look at the function signature:

def create_records(values: GeneralStore,  # <- pass the current variables store here
                   records: RecordStore,  # <- pass the current store of CaosDB records here
                   def_records: dict):    # <- This is the actual definition of new records!

def_records is the actual definition of new records according to the yaml cfood specification (work in progress, in the docs). Essentially you can do everything here that you could do in the yaml document as well, but using Python source code.

Let’s have a look at a few examples:

DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  records:
    Experiment:
      identifier: $dir_name

This block will just create a new record with parent Experiment and one property identifier with a value derived from the matching regular expression.

Let’s formulate that using create_records:

dir_name = "directory name"

record_def = {
  "Experiment": {
    "identifier": dir_name
  }
}

keys_modified = create_records(values, records,
                               record_def)

The dir_name is set explicitly here; everything else is identical to the yaml statements.

The role of keys_modified

You have probably noticed already that caoscrawler.converters.create_records() returns keys_modified, which is a list of tuples. Each element of keys_modified has two elements:

  • Element 0 is the name of the record that is modified (as used in the record store records).

  • Element 1 is the name of the property that is modified.

It is important that the correct list of modified keys is returned by create_records() to make the crawler process work.
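The structure of keys_modified can be illustrated without the crawler. `toy_create_records` below is a hypothetical stand-in, not the library function: it stores plain values in a dict-based record store and does not resolve variables such as $Project. It only shows which (record name, property name) tuples would be reported.

```python
def toy_create_records(records: dict, def_records: dict) -> list:
    """Toy stand-in: set values and report the modified (record, property) pairs."""
    keys_modified = []
    for record_name, properties in def_records.items():
        record = records.setdefault(record_name, {})
        for prop, value in properties.items():
            record[prop] = value
            keys_modified.append((record_name, prop))
    return keys_modified


records = {}
record_def = {
    "Project": {"identifier": "project_name"},
    "Experiment": {"identifier": "directory name", "Project": "$Project"},
}

keys_modified = toy_create_records(records, record_def)
print(keys_modified)
```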

So, a sketch of a typical implementation within a custom converter could look like this:

def create_records(self, values: GeneralStore,
                     records: RecordStore,
                     element: StructureElement,
                     file_path_prefix: str):

  # Modify some records:
  record_def = {
    # ...
  }

  keys_modified = create_records(values, records,
                                 record_def)

  # You can of course do it multiple times:
  keys_modified.extend(create_records(values, records,
                                      record_def))

  # You can also process the records section of the yaml definition:
  keys_modified.extend(
       super().create_records(values, records, element, file_path_prefix))
  # This essentially allows users of your converter to customize the creation of records
  # by providing a custom "records" section additionally to the modifications provided
  # in this implementation of the Converter.

  # Important: Return the list of modified keys!
  return keys_modified

More complex example

Let’s have a look at a more complex example, defining multiple records:

DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  records:
    Project:
      identifier: project_name
    Experiment:
      identifier: $dir_name
      Project: $Project
    ProjectGroup:
      projects: +$Project

This block will create two new Records:

  • A project with a constant identifier

  • An experiment with an identifier derived from a regular expression and with a reference to the new project.

Furthermore a Record ProjectGroup will be edited (its initial definition is not given in the yaml block): The project that was just created will be added as a list element to the property projects.

Let’s formulate that using create_records (again, dir_name is constant here):

dir_name = "directory name"

record_def = {
  "Project": {
    "identifier": "project_name",
  },
  "Experiment": {
    "identifier": dir_name,
    "Project": "$Project",
  },
  "ProjectGroup": {
    "projects": "+$Project",
  }
}

keys_modified = create_records(values, records,
                               record_def)


You can add the key debug_match to the definition of a Converter in order to create debugging output for the match step. The following snippet illustrates this:

DirConverter:
  type: Directory
  match: (?P<dir_name>.*)
  debug_match: True
  records:
    Project:
      identifier: project_name

Whenever this Converter tries to match a StructureElement, it logs what was matched against what and what the result was.
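The effect can be pictured with a small sketch based on the standard logging module. The function `match_with_debug`, the logger name, and the message format are assumptions for illustration and are not taken from the crawler's implementation.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("converter_debug_sketch")


def match_with_debug(definition: dict, name: str):
    """Match the name and, if debug_match is set, log pattern, name and result."""
    m = re.match(definition["match"], name)
    result = None if m is None else m.groupdict()
    if definition.get("debug_match", False):
        logger.info("match %r against %r -> %r", definition["match"], name, result)
    return result


definition = {"type": "Directory", "match": r"(?P<dir_name>.*)", "debug_match": True}
result = match_with_debug(definition, "2021_experiment")
```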