CaosDB Crawler

The CaosDB crawler is a tool for the automated insertion or update of entities in CaosDB. Typically, a file structure is crawled, but other things can be crawled as well, for example tables or HDF5 files.

Introduction

In simple terms, the crawler is a program that scans a directory structure, identifies files that will be treated, and generates corresponding Entities in CaosDB, possibly filling in metadata. During this process the crawler can also open files and derive content from within, for example reading CSV tables and processing individual rows of these tables.

As shown in the figure, the general principle of the crawler framework is the following:

The crawler walks through the file structure and matches file names (using regular expressions).
For each matched file, fingerprints (so-called Identifiables) are created.
Then the crawler checks if the Records corresponding to the Identifiables exist already and if they are up-to-date. If not, the Records are created or updated according to the file contents.
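The first of these steps can be illustrated with a minimal sketch that uses only the Python standard library. It is not the crawler's actual implementation: the function name find_matching_files, the throwaway directory tree and the pattern are assumptions made for illustration.

```python
import os
import re
import tempfile


def find_matching_files(root, pattern):
    """Walk ``root`` and return the paths whose full name matches ``pattern``."""
    regex = re.compile(pattern)
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            # Use forward slashes so the same pattern works on every platform.
            path = os.path.join(dirpath, filename).replace(os.sep, "/")
            if regex.match(path):
                matches.append(path)
    return sorted(matches)


# Usage: build a tiny throwaway tree (hypothetical file names) and scan it.
with tempfile.TemporaryDirectory() as root:
    for relative in ("whale/2020-01-01/README.md", "whale/2020-01-01/notes.txt"):
        target = os.path.join(root, relative)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        open(target, "w").close()
    found = find_matching_files(root, r".*/\d{4}-\d{2}-\d{2}/README\.md$")
    relative_found = [os.path.relpath(p, root).replace(os.sep, "/") for p in found]
```

Only the README.md file inside the date-named folder is selected; notes.txt is left untouched, which corresponds to a file that no rule of the crawler consumes.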

Technically, this behaviour can be adjusted to your needs using so-called CFood (pun intended! :-)) Python classes. More details on the different components of the CaosDB Crawler can be found in the developers’ information below.

If you are happy with our suggestion of a standard crawler, feel free to use it. The standard crawler lives in the submodule caosadvancedtools.scifolder.

Usage

Typically, you can invoke the crawler in two ways: via the web interface and directly as a Python script.

In both cases, if the crawler has a problem with some file (e.g. columns in a table (tsv, xls, …) are named incorrectly), the problem will be indicated by a warning that is returned. You can fix the problem and run the crawler again. This does not cause any problems, since the crawler can recognize what has already been processed (see the description of fingerprints in the Introduction).

Warning

Pay attention when you change a property that is used for the fingerprint: the crawler will not be able to match a previous version with the changed one, since the fingerprint is different. This often means that entities in the database need to be changed or removed. As a rule of thumb, you should be reasonably sure that properties that are used as fingerprints will not change after the crawler has run for the first time. This prevents complications.

Invocation from the Web Interface

If enabled, the crawler can be called using a menu entry in the web interface. This opens a form where the path of the directory to be crawled needs to be given. After execution, information about what was done and which problems might exist is printed in the web interface. Note that some changes might be pending authorization (if indicated in the messages).

Invocation as Python Script

The crawler can be executed directly via a Python script (usually called crawl.py). The script prints the progress and reports potential problems. The exact behavior depends on your setup. However, you can have a look at the example in the tests.

Note

The crawler depends on the LinkAhead Python client, so make sure to install pylinkahead.

Call python3 crawl.py --help to see what parameters can be provided. Typically, an invocation looks like:

python3 crawl.py /someplace/

Note

To try out the above-mentioned example crawler from the integration tests, make sure that the extroot directory in the integrationtests folder is used as CaosDB’s extroot directory, and call the crawler indirectly via ./test.sh.

In this case /someplace/ identifies the path to be crawled within CaosDB’s file system. You can browse the CaosDB file system by opening the WebUI of your CaosDB instance and clicking on “File System”.

In the backend, crawl.py starts a CQL query FIND File WHICH IS STORED AT /someplace/** and crawls the resulting files according to your customized CFoods.

Crawling may consist of two distinct steps:

1. Insertion of files (use function loadFiles)
2. The actual crawling (use program crawl.py)

However, the first step may be included in crawl.py. Otherwise, you can only crawl files that were previously inserted by loadFiles.

loadFiles

After installation of the caosadvancedtools you can simply call the function loadFiles contained in the package:

python3 -m caosadvancedtools.loadFiles  /opt/caosdb/mnt/extroot

/opt/caosdb/mnt/extroot is the root of the file system to be crawled as seen by the CaosDB server (the actual path may vary; this is the one used in the LinkAhead distribution of CaosDB). In this case the root file system as seen from within the CaosDB docker process is used.

You can provide a .caosdbignore file as a command-line option to the above loadFiles command. The syntax of that file is the same as for gitignore files. Note that you can have additional .caosdbignore files at lower levels, which are appended to the current ignore file and affect the respective subtree.

Extending the Crawlers

In most use cases the crawler needs to be tailored to specific needs. This section explains how this can be done.

The behavior and rules of the crawler are defined in logical units called CFoods. In order to extend the crawler you need to extend an existing CFood or create a new one.

Note

A crawler always needs a corresponding data model to exist on the server. The following does not cover this aspect. Please refer, for example, to the documentation of the YAML Interface.

CFood – Introduction

A CFood is a Python class that inherits from the abstract base class AbstractCFood. It should be independent of other data and define the following methods:

get_re()

This static method is required for classes which inherit from AbstractFileCFood. It returns a regular expression to identify files that can be consumed by this CFood. For other CFood implementations, overload the match_item() method.

create_identifiables()

This method defines (and inserts if necessary) the identifiables which may be updated at a later stage. After calling this method, the AbstractCFood.identifiables Container contains those Records which will be updated at a later time.

update_identifiables()

This method updates the stored identifiables as necessary. All Entities which need to be updated on the Server must be in AbstractCFood.to_be_updated after this call.

As hinted above, the main feature of an identifiable is its fingerprinting ability: it has sufficient properties to identify an existing Record in CaosDB so that the CFood can decide which Records should be updated by the Crawler instead of inserting a new one. Obviously, this allows the Crawler to run twice on the same file structure without duplicating the data in CaosDB.

An identifiable is a Python Record object with the features to identify the correct Record in CaosDB. This object is used to create a query in order to determine whether the Record exists. If the Record does not exist, the identifiable is used to insert the Record. Thus, after this step the Crawler guarantees that a Record with the features of the identifiable exists in CaosDB (either previously existing or newly created).

An example: An experiment might be uniquely identified by the date when it was conducted and a number. The identifiable might then look as follows:

<Record>
  <Parent name="Experiment"/>
  <Property name="date">2020-04-19</Property>
  <Property name="Exp-No">9</Property>
</Record>

CFoods and the Crawler

In short, the Crawler interacts with the available CFoods in the following way:

The Crawler iterates over the available objects (for example files), and for each object o:
1. The Crawler tests which of the available CFoods can consume the object o, using the CFoods’ match_item() class method.
2. If a CFood matches the object, that CFood is instantiated with the object o and stored for later, like cfoods.append(CFood(o)).
The Crawler then iterates over the stored CFood instances, and for each instance cfood:
1. cfood.create_identifiables() is called to create the identifiables, as described above.
2. All the identifiables in cfood.identifiables are searched for in the CaosDB instance, and inserted if they do not exist.
3. cfood.update_identifiables() is called to update the identifiables if their content needs to change, as described above.
4. All the identifiables in cfood.to_be_updated are synced to the CaosDB instance.

The following sketch aims to visualize this procedure.

_images/crawler_flow_sketch.svg — Sketch of how the Crawler uses the CFoods to process objects. Of the four identifiables (fingerprints) on the right, only the second does not exist yet and is thus inserted in the second step. Only the identifiables number 2 and 4 have new or changed content, so only these are synced to CaosDB in the last step.

Note

Practical hint: After the call to create_identifiables(), the Crawler guarantees that an Experiment with those properties exists in CaosDB. In the call to update_identifiables(), further properties might be added to this Record, e.g. references to data files that were recorded in that experiment or to the person that did the experiment.
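The interaction between the Crawler and the CFoods can also be sketched in plain Python. The classes InMemoryStore and ToyCFood and the function crawl are hypothetical stand-ins that only mimic the order of the steps; they are not part of caosadvancedtools, and a dictionary takes the place of the CaosDB server.

```python
class InMemoryStore:
    """Hypothetical stand-in for the CaosDB server: maps fingerprints to records."""

    def __init__(self):
        self.records = {}
        self.inserts = 0
        self.updates = 0

    def retrieve_or_insert(self, fingerprint):
        # Insert an empty record only if the fingerprint is unknown.
        if fingerprint not in self.records:
            self.records[fingerprint] = {}
            self.inserts += 1
        return self.records[fingerprint]

    def sync(self, fingerprint, properties):
        self.records[fingerprint].update(properties)
        self.updates += 1


class ToyCFood:
    """Hypothetical CFood that consumes objects of the form '<date>/<species>'."""

    def __init__(self, item):
        self.item = item
        self.identifiables = []
        self.to_be_updated = []
        self.records = {}

    @classmethod
    def match_item(cls, item):
        return item.count("/") == 1

    def create_identifiables(self):
        date, _species = self.item.split("/")
        self.identifiables.append(("Experiment", date))

    def update_identifiables(self):
        _date, species = self.item.split("/")
        for fingerprint, record in self.records.items():
            # Schedule an update only if the content actually changes.
            if record.get("species") != species:
                self.to_be_updated.append((fingerprint, {"species": species}))


def crawl(items, cfood_types, store):
    # First loop: instantiate every CFood that matches an object.
    cfoods = [cfood_type(item) for item in items
              for cfood_type in cfood_types if cfood_type.match_item(item)]
    # Second loop: create, look up or insert, update, sync.
    for cfood in cfoods:
        cfood.create_identifiables()
        cfood.records = {fp: store.retrieve_or_insert(fp)
                         for fp in cfood.identifiables}
        cfood.update_identifiables()
        for fingerprint, properties in cfood.to_be_updated:
            store.sync(fingerprint, properties)


store = InMemoryStore()
objects = ["2020-04-19/whale", "not-an-experiment"]
crawl(objects, [ToyCFood], store)
crawl(objects, [ToyCFood], store)  # second run: nothing to insert or update
```

Running the toy crawler twice on the same objects leaves exactly one record in the store, which is the effect that the fingerprints are meant to achieve.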

CFood – An example

Let’s look at the following example:

>>> # Example CFood
>>> from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property
>>> import linkahead as db
>>>
>>> class ExampleCFood(AbstractFileCFood):
...     @staticmethod
...     def get_re():
...         return (r".*/(?P<species>[^/]+)/"
...                 r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
...
...     def create_identifiables(self):
...         self.experiment = db.Record()
...         self.experiment.add_parent(name="Experiment")
...         self.experiment.add_property(
...             name="date",
...             value=self.match.group('date'))
...         self.identifiables.append(self.experiment)
...
...     def update_identifiables(self):
...         assure_has_property(
...             self.experiment,
...             "species",
...             self.match.group('species'))

>>> # check whether the definition is valid
>>> cf = ExampleCFood('')

Every child of AbstractFileCFood (AbstractFileCFood is for crawling files…, and yes, you can crawl other stuff as well) needs to implement the functions get_re, create_identifiables and update_identifiables. The function get_re() defines which files shall be treated with this CFood. The function needs to return a string with a regular expression. Here, the expression matches any “README.md” file that is located below two folder levels like: /any/path/whale/2020-01-01/README.md. Note that the groups defined in the regular expression (date and species) can later be used via self.match.group('name').
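The effect of the named groups can be checked with the standard re module alone. This is a small sketch that assumes the pattern is applied to the full path with re.match; the second path is a hypothetical counterexample.

```python
import re

# The pattern returned by ExampleCFood.get_re() above.
pattern = (r".*/(?P<species>[^/]+)/"
           r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")

match = re.match(pattern, "/any/path/whale/2020-01-01/README.md")
species = match.group("species")  # "whale"
date = match.group("date")        # "2020-01-01"

# A path without a date-named folder is not consumed by this CFood.
no_match = re.match(pattern, "/any/path/whale/README.md")
```

The folder directly above README.md must look like a date, and the folder above that one supplies the species.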

create_identifiables() defines the identifiables that are needed and update_identifiables() applies additional changes. Here, an Experiment Record is identified using solely the date. This implies that there must NOT exist two Experiment Records with the same date. If this might occur, an additional property needs to be added to the identifiable. The identifiables have to be added to the self.identifiables list.

After the correct Record has been identified (or created if none existed) an additional property is added that describes the species.

Your CFood needs to be passed to the crawler instance in the crawl.py file that you use for crawling, for example like this:

c = FileCrawler(files=files, cfood_types=[ExampleCFood])

CFood – Advanced

CFoods have some additional features in order to cope with complex scenarios. For example, what if multiple files together are needed to create some Record? Multiple data files recorded in an experiment could be one example. CFoods may define the collect_information() function. In this function additional information can be collected by accessing files or querying the database. One particular use case is to add file paths to the attached_filenames property. By default, all files that are located at those paths are also treated by this CFood. This also means that the crawler does not list those files as “untreated”.

One special case is the existence of multiple, very similar files. Imagine that you want to treat a range of calibration images with a CFood. You can write a regular expression to match all the files, but it might be hard to match one particular file. In this case, you should use the CMeal mix-in. This ensures that the first match creates a CFood and all following ones are attached to the same instance. For further information, please consult the API documentation.

As the crawler may run in different environments, the way files are accessed may differ. This can be defined using the File Guide.

In the crawl.py file, you should set this appropriately:

>>> from caosadvancedtools.cfood import fileguide
>>> import os

>>> fileguide.access = lambda path: "/main/data/" + path

This prefixes all paths that are used in CaosDB with “/main/data/”. In CFoods, files can then be accessed using the fileguide as follows:

with open(fileguide.access("/some/path")):
    # do stuff
    pass

Changing data in CaosDB

As described above, a Record matching the identifiable will be inserted if no such Record existed before. This is typically unproblematic. However, what if existing Records need to be modified? Many manipulations have the potential to overwrite changes made in CaosDB. Thus, unless the data being crawled is the single source of truth for the information in CaosDB (and changes to the respective data in CaosDB should thus not be possible), changes have to be made with some care.

Use the assure_has_xyz functions defined in the cfood module to add a given property only if it does not exist yet. Use the assure_xyz_is functions to force the value of a property (see remarks above).

To further assure that changes are correct, the crawler comes with an authorization mechanism. When running the crawler with the crawl function, a security level can be given.

>>> from caosadvancedtools.crawler import FileCrawler
>>> from caosadvancedtools.guard import RETRIEVE, INSERT, UPDATE
>>> files = [] # put files to be crawled in this list
>>> c = FileCrawler(
...     files=files,
...     cfood_types=[ExampleCFood],
...     interactive=False) # the crawler runs without asking intermediate questions
>>> c.crawl(security_level=INSERT)

This assures that every manipulation of data in CaosDB that is done via the functions provided by the guard class is checked against the provided security level:

RETRIEVE: only allows retrieving data from CaosDB. No manipulation is allowed.
INSERT: only allows inserting new entities and manipulating those newly inserted ones.
UPDATE: allows all manipulations.

This implies that all data manipulation of the crawler should use the functions that are checked by the guard. When writing a CFood you should stick to the above mentioned assure_has_xyz and assure_xyz_is functions which use the respective data manipulation functions.

If you provide the to_be_updated member variable of CFoods to those assure... functions, the crawler provides another convenient feature: when an update is prevented due to the security level, the update is saved and can subsequently be authorized. If the crawler runs on the CaosDB server, it will try to send a mail that allows you to authorize the change. If it runs as a local script, it will notify you that there are unauthorized changes and provide a code with which the crawler can be started to authorize the change.
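To make the three security levels more concrete, here is a minimal sketch of such a check. The names Level and ToyGuard are hypothetical and do not reproduce caosadvancedtools.guard; the sketch only mirrors the rules listed above and keeps prevented updates in a list instead of sending a mail or printing an authorization code.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical security levels, ordered from most to least restrictive."""
    RETRIEVE = 0
    INSERT = 1
    UPDATE = 2


class ToyGuard:
    """Hypothetical guard that checks every manipulation against a level."""

    def __init__(self, level):
        self.level = level
        self.newly_inserted = set()
        self.pending = []

    def insert(self, store, key, properties):
        if self.level < Level.INSERT:
            raise PermissionError("inserts are not allowed at this level")
        store[key] = dict(properties)
        self.newly_inserted.add(key)

    def update(self, store, key, properties):
        allowed = (self.level >= Level.UPDATE
                   or (self.level >= Level.INSERT and key in self.newly_inserted))
        if not allowed:
            # Keep the prevented change so that it can be authorized later.
            self.pending.append((key, dict(properties)))
            return False
        store[key].update(properties)
        return True


store = {"old": {"species": "whale"}}
guard = ToyGuard(Level.INSERT)
guard.insert(store, "new", {"species": "seal"})
guard.update(store, "new", {"date": "2020-04-19"})  # allowed: inserted in this run
guard.update(store, "old", {"species": "orca"})     # prevented, kept as pending
```

With the INSERT level, the record created in this run can still be modified, while the change to the pre-existing record is only recorded as pending.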

Real World Example

A crawler implementation exists that can crawl a file structure that adheres to the rules defined in this Data publication. The project is of moderate size and shows how a set of CFoods can be defined to deal with a complex file structure.

You can find detailed information on how files need to be structured here and the source code of the CFoods here.

Sources

Source of the fingerprint picture: https://svgsilh.com/image/1298040.html