CaosDB Crawler
The CaosDB crawler is a tool for the automated insertion or update of entities in CaosDB. Typically, a file structure is crawled, but other things can be crawled as well, for example tables or HDF5 files.
Introduction
In simple terms, the crawler is a program that scans a directory structure, identifies files that will be treated, and generates corresponding Entities in CaosDB, possibly filling meta data. During this process the crawler can also open files and derive content from within, for example reading CSV tables and processing individual rows of these tables.
As shown in the figure, the general principle of the crawler framework is the following:
1. The crawler walks through the file structure and matches file names (using regular expressions).
2. For each matched file, finger prints (so-called Identifiables) are created.
3. Then the crawler checks if the Records corresponding to the Identifiables exist already and if they are up-to-date. If not, the Records are created or updated according to the file contents.
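The three steps above can be illustrated with a small, self-contained sketch. It does not use the crawler's API: the in-memory `database` dictionary, the `fingerprint()` helper, and the file pattern are assumptions made purely for illustration.

```python
import re

# Hypothetical pattern: one sub-folder per date, each containing a README.md.
PATTERN = re.compile(r".*/(?P<date>\d{4}-\d{2}-\d{2})/README\.md$")


def fingerprint(match):
    """Build a hashable stand-in for an Identifiable from the regex groups."""
    return ("Experiment", match.group("date"))


def crawl(paths, contents, database):
    """Insert or update one record per matched file; return what was done."""
    actions = []
    for path in paths:
        match = PATTERN.match(path)
        if match is None:
            continue  # this file is not treated
        key = fingerprint(match)
        new_record = {"description": contents[path]}
        if key not in database:
            database[key] = new_record
            actions.append(("insert", key))
        elif database[key] != new_record:
            database[key] = new_record
            actions.append(("update", key))
    return actions


paths = ["/data/2020-04-19/README.md", "/data/notes.txt"]
contents = {"/data/2020-04-19/README.md": "first run"}
database = {}
first = crawl(paths, contents, database)
second = crawl(paths, contents, database)  # nothing changed, nothing to do
```

Running the toy crawl a second time on unchanged files performs no action, which is the property that the finger prints are meant to provide.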
Technically, this behaviour can be adjusted to your needs using so called CFood (pun intended! :-)) Python classes. More details on the different components of the CaosDB Crawler can be found in the developers’ information below.
If you are happy with our suggestion of a standard crawler, feel free to use it. The standard crawler lives in the submodule caosadvancedtools.scifolder.
Usage
Typically, you can invoke the crawler in two ways: via the web interface and directly as a Python script.
In both cases, if the crawler has a problem with some file (e.g. columns in a table (tsv, xls, …) are named incorrectly), the problem will be indicated by a warning that is returned. You can fix the problem and run the crawler again. This does not cause any problems, since the crawler can recognize what has already been processed (see the description of finger prints in the Introduction).
Warning
Pay attention when you change a property that is used for the finger print: The crawler will not be able to identify a previous version with the changed one since the finger print is different. This often means that entities in the data base need to be changed or removed. As a rule of thumb, you should be pretty sure that properties that are used as finger prints will not change after the crawler has run for the first time. This prevents complications.
Invocation from the Web Interface
If enabled, the crawler can be called using a menu entry in the web interface. This will open a form where the path of the directory that shall be crawled needs to be given. After the execution, information about what was done and which problems might exist is printed in the web interface. Note that some changes might be pending authorization (if indicated in the messages).
Invocation as Python Script
The crawler can be executed directly via a Python script (usually called crawl.py). The script prints the progress and reports potential problems. The exact behavior depends on your setup. However, you can have a look at the example in the tests.
Note
The crawler depends on the LinkAhead Python client, so make sure to install pylinkahead.
Call python3 crawl.py --help
to see what parameters can be provided.
Typically, an invocation looks like:
python3 crawl.py /someplace/
Note
For trying out the above mentioned example crawler from the integration tests, make sure that the extroot directory in the integrationtests folder is used as CaosDB’s extroot directory, and call the crawler indirectly via ./test.sh.
In this case /someplace/ identifies the path to be crawled within CaosDB’s file system. You can browse the CaosDB file system by opening the WebUI of your CaosDB instance and clicking on “File System”.
In the backend, crawl.py starts a CQL query FIND File WHICH IS STORED AT /someplace/** and crawls the resulting files according to your customized CFoods.
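As a rough illustration of how such a query string relates to the path argument, consider the following sketch. The helper `build_file_query()` is hypothetical and not part of crawl.py; it only assembles the string and does not contact a server.

```python
def build_file_query(path):
    """Return the query for all File entities stored below `path`."""
    # Normalize to exactly one trailing slash before the "**" wildcard.
    return "FIND File WHICH IS STORED AT " + path.rstrip("/") + "/**"


query = build_file_query("/someplace/")
# Assumption: with a configured connection, the string could then be sent
# to the server with the Python client's query function.
```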
Crawling may consist of two distinct steps:
1. Insertion of files (use function loadFiles)
2. The actual crawling (use program crawl.py)
However, the first step may be included in crawl.py. Otherwise, you can only crawl files that were previously inserted by loadFiles.
loadFiles
After installation of the caosadvancedtools
you can simply call the
function loadFiles
contained in the package:
python3 -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot
/opt/caosdb/mnt/extroot is the root of the file system to be crawled as seen by the CaosDB server. The actual path may vary; this is the one used in the LinkAhead distribution of CaosDB. In this case the root file system as seen from within the CaosDB docker process is used.
You can provide a .caosdbignore file as a command-line option to the above loadFiles command. The syntax of that file is the same as for gitignore files. Note that you can have additional .caosdbignore files at lower levels, which are appended to the current ignore file and have an effect on the respective subtree.
Extending the Crawlers
In most use cases the crawler needs to be tailored to the specific needs. This section explains how this can be done.
The behavior and rules of the crawler are defined in logical units called CFoods. In order to extend the crawler you need to extend an existing CFood or create a new one.
Note
A crawler always needs a corresponding data model to exist in the server. The following does not cover this aspect. Please refer, for example, to the documentation of the YAML Interface.
CFood – Introduction
A CFood is a Python class that inherits from the abstract base class AbstractCFood. It should be independent of other data and define the following methods:
- get_re(): This static method is required for classes which inherit from AbstractFileCFood. It returns a regular expression to identify files that can be consumed by this CFood. For other CFood implementations, overload the match_item() method.
- create_identifiables(): This method defines (and inserts if necessary) the identifiables which may be updated at a later stage. After calling this method, the AbstractCFood.identifiables Container contains those Records which will be updated at a later time.
- update_identifiables(): This method updates the stored identifiables as necessary. All Entities which need to be updated on the Server must be in AbstractCFood.to_be_updated after this call.
As hinted above, the main feature of an identifiable
is its fingerprinting ability: it has
sufficient properties to identify an existing Record in CaosDB so that the CFood can decide which
Records should be updated by the Crawler instead of inserting a new one. Obviously, this allows the
Crawler to run twice on the same file structure without duplicating the data in CaosDB.
An identifiable
is a Python Record
object with the features to
identify the correct Record in CaosDB. This object is used to create a query in order to determine
whether the Record exists. If the Record does not exist, the identifiable
is used to insert the
Record. Thus, after this step the Crawler guarantees that a Record with the features of the
identifiable
exists in CaosDB (either previously existing or newly created).
An example: An experiment might be uniquely identified by the date when it was conducted and a
number. The identifiable
might then look as follows:
<Record>
  <Parent name="Experiment"/>
  <Property name="date">2020-04-19</Property>
  <Property name="Exp-No">9</Property>
</Record>
CFoods and the Crawler
In short, the Crawler interacts with the available CFoods in the following way:
1. The Crawler iterates over the available objects (for example files), and for each object o:
   - The Crawler tests which of the available CFoods can consume the object o, using the CFoods’ match_item() class method.
   - If the CFood matches against the object, an instance of that CFood is instantiated with that object o and stored for later, like cfoods.append(CFood(o)).
2. The Crawler then iterates over the stored CFood instances, and for each instance cfood does:
   - cfood.create_identifiables(): As described above, create identifiables.
   - All the identifiables in cfood.identifiables are searched for existence in the CaosDB instance, and inserted if they do not exist.
   - cfood.update_identifiables(): As described above, update the identifiables if their content needs to change.
   - All the identifiables in cfood.to_be_updated are synced to the CaosDB instance.
The following sketch aims to visualize this procedure.
Note
Practical hint: After the call to
create_identifiables()
, the Crawler
guarantees that an Experiment
with those properties exists in CaosDB. In the call to
update_identifiables()
, further properties
might be added to this Record, e.g. references to data files that were recorded in that
experiment or to the person that did the experiment.
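The interaction between the Crawler and the CFoods described in this section can be approximated in plain Python. The sketch below is a toy model and not the crawler's implementation: `ToyCFood`, `toy_crawl()`, and the dictionary that stands in for the CaosDB server are hypothetical names introduced for illustration.

```python
import re


class ToyCFood:
    """Minimal stand-in for a CFood; not the AbstractCFood API."""

    pattern = re.compile(
        r".*/(?P<species>[^/]+)/(?P<date>\d{4}-\d{2}-\d{2})/README\.md$")

    def __init__(self, item):
        self.item = item
        self.match = self.pattern.match(item)
        self.identifiables = []
        self.to_be_updated = []

    @classmethod
    def match_item(cls, item):
        return cls.pattern.match(item) is not None

    def create_identifiables(self):
        # The date alone serves as the finger print of an Experiment.
        self.experiment = {"parent": "Experiment",
                           "date": self.match.group("date")}
        self.identifiables.append(self.experiment)

    def update_identifiables(self):
        self.experiment["species"] = self.match.group("species")
        self.to_be_updated.append(self.experiment)


def toy_crawl(items, cfood_types, server):
    # Step 1: create a CFood instance for every object a CFood can consume.
    cfoods = []
    for o in items:
        for cfood_type in cfood_types:
            if cfood_type.match_item(o):
                cfoods.append(cfood_type(o))
    # Step 2: ensure the identifiables exist, then apply the updates.
    for cfood in cfoods:
        cfood.create_identifiables()
        for ident in cfood.identifiables:
            key = (ident["parent"], ident["date"])
            server.setdefault(key, dict(ident))  # insert only if missing
        cfood.update_identifiables()
        for record in cfood.to_be_updated:
            key = (record["parent"], record["date"])
            server[key].update(record)  # sync the changed content
    return cfoods


server = {}
items = ["/x/whale/2020-01-01/README.md", "/x/other.txt"]
toy_crawl(items, [ToyCFood], server)
toy_crawl(items, [ToyCFood], server)  # second run does not duplicate data
```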
CFood – An example
Let’s look at the following Example:
>>> # Example CFood
>>> from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property
>>> import linkahead as db
>>>
>>> class ExampleCFood(AbstractFileCFood):
...     @staticmethod
...     def get_re():
...         return (r".*/(?P<species>[^/]+)/"
...                 r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
...
...     def create_identifiables(self):
...         self.experiment = db.Record()
...         self.experiment.add_parent(name="Experiment")
...         self.experiment.add_property(
...             name="date",
...             value=self.match.group('date'))
...         self.identifiables.append(self.experiment)
...
...     def update_identifiables(self):
...         assure_has_property(
...             self.experiment,
...             "species",
...             self.match.group('species'))
>>> # check whether the definition is valid
>>> cf = ExampleCFood('')
Every child of AbstractFileCFood (AbstractFileCFood is for crawling files…, and yes, you can crawl other stuff as well) needs to implement the functions get_re, create_identifiables, and update_identifiables. The function get_re() defines which files shall be treated with this CFood. The function needs to return a string with a regular expression. Here, the expression matches any “README.md” file that is located below two folder levels like: /any/path/whale/2020-01-01/README.md. Note that the groups defined in the regular expressions (date and species) can be later used via self.match.group('name').
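The behavior of the regular expression can be checked in isolation with the standard library. The following sketch assumes that the expression is applied to the full file path with `re.match`; the variable names are chosen for illustration.

```python
import re

# Same expression as returned by get_re() in the example above.
pattern = (r".*/(?P<species>[^/]+)/"
           r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")

match = re.match(pattern, "/any/path/whale/2020-01-01/README.md")
species = match.group("species")
date = match.group("date")

# A README.md that is not located inside a date folder is not matched.
no_match = re.match(pattern, "/any/path/whale/README.md")
```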
create_identifiables()
defines the identifiables
that are needed and
update_identifiables()
applies additional changes. Here, an Experiment
Record is identified
using solely the date. This implies that there must NOT exist two
Experiment
Records with the same date. If this might occur, an
additional property needs to be added to the identifiable. The
identifiables
have to be added to the self.identifiables
list.
After the correct Record has been identified (or created if none existed) an additional property is added that describes the species.
Your CFood needs to be passed to the crawler instance in the
crawl.py
file that you use for crawling. For example like this:
c = FileCrawler(files=files, cfood_types=[ExampleCFood])
CFood – Advanced
CFoods have some additional features in order to cope with complex
scenarios. For example, what if multiple files are together needed to
create some Record? Multiple data files recorded in an experiment could
be one example. CFoods may define the
collect_information()
function. In this function additional information can be collected by
accessing files or querying the database. One particular use case is to
add file paths to the attached_filenames
property. By default, all
files that are located at those paths are also treated by this CFood.
This also means that the crawler does not list those files as
“untreated”.
One special case is the existence of multiple, very similar files. Imagine that you want to treat a range of calibration images with a CFood. You can write a regular expression to match all the files, but it might be hard to match one particular file. In this case, you should use the CMeal mix-in. This ensures that the first match creates a CFood and all following ones are attached to the same instance. For further information, please consult the API documentation.
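The idea of attaching all similar files to the instance created by the first match can be illustrated without the CMeal API. In this sketch the file pattern, the `series` group, and the `group_files()` helper are hypothetical; the actual mix-in may work differently, so consult the API documentation before relying on it.

```python
import re

# Hypothetical layout: calibration images are numbered within a series folder.
PATTERN = re.compile(r".*/(?P<series>[^/]+)/calib_\d+\.png$")


def group_files(paths):
    """Collect all matching files of one series under a single entry."""
    groups = {}
    for path in paths:
        match = PATTERN.match(path)
        if match is None:
            continue
        # The first match creates the entry, later matches are attached.
        groups.setdefault(match.group("series"), []).append(path)
    return groups


groups = group_files([
    "/data/cam1/calib_001.png",
    "/data/cam1/calib_002.png",
    "/data/cam2/calib_001.png",
    "/data/cam1/notes.txt",
])
```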
As the crawler may run in different environments, the way files can be accessed may differ. This can be defined using the File Guide. In the crawl.py file, you should set this appropriately:
>>> from caosadvancedtools.cfood import fileguide
>>> import os
>>> fileguide.access = lambda path: "/main/data/" + path
This prefixes all paths that are used in CaosDB with “/main/data/”. In CFoods, files can then be accessed using the fileguide as follows:
with open(fileguide.access("/some/path")):
    # do stuff
    pass
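Note that plain string concatenation yields a doubled slash when the CaosDB path itself starts with "/" ("/main/data/" + "/some/path" gives "/main/data//some/path"). Where that is undesirable, a slightly more defensive access function could be used. The following is a sketch under the assumption that POSIX-style paths are used; the function name and the `prefix` parameter are illustrative and not part of the fileguide.

```python
import posixpath


def access(path, prefix="/main/data"):
    """Map a path as used in CaosDB to a path on the crawling machine."""
    # Strip the leading slash so that join() does not discard the prefix.
    return posixpath.join(prefix, path.lstrip("/"))


local_path = access("/some/path")
```

Such a function could presumably be assigned to fileguide.access in the same way as the lambda above.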
Changing data in CaosDB
As described above, a Record matching the identifiable will be inserted if no such Record existed before. This is typically unproblematic. However, what if existing Records need to be modified? Many manipulations have the potential of overwriting changes made in CaosDB. Thus, unless the data being crawled is the single source of truth for the information in CaosDB (and changes to the respective data in CaosDB should thus not be possible), changes have to be made with some care.
Use the functions assure_has_xyz defined in the cfood module to add a given property only if it does not exist yet. Use the functions assure_xyz_is to force the value of a property (see remarks above).
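The difference between the two families of functions can be shown with a toy model in which a Record is a plain dictionary of value lists. This is not the implementation from the cfood module; `assure_has()` and `assure_is()` are hypothetical helpers that only mimic the described semantics.

```python
def assure_has(properties, name, value):
    """Add `value` to the property `name` only if it is not yet present."""
    values = properties.setdefault(name, [])
    if value in values:
        return False  # nothing to do, existing data is left untouched
    values.append(value)
    return True


def assure_is(properties, name, value):
    """Force the property `name` to hold exactly `value`."""
    if properties.get(name) == [value]:
        return False
    properties[name] = [value]
    return True


record = {"species": ["whale"]}
added = assure_has(record, "species", "whale")        # already there
added_new = assure_has(record, "species", "dolphin")  # appended
forced = assure_is(record, "species", "whale")        # overwrites the list
```

The first variant never discards existing values, whereas the second one may overwrite changes that were made in the database, which is why it should be used with care.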
To further assure that changes are correct, the crawler comes with an
authorization mechanism. When running the crawler with the crawl
function, a security level can be given.
>>> from caosadvancedtools.crawler import FileCrawler
>>> from caosadvancedtools.guard import RETRIEVE, INSERT, UPDATE
>>> files = [] # put files to be crawled in this list
>>> c = FileCrawler(
... files=files,
... cfood_types=[ExampleCFood],
... interactive=False) # the crawler runs without asking intermediate questions
>>> c.crawl(security_level=INSERT)
This assures that every manipulation of data in CaosDB that is done via the functions provided by the guard class is checked against the provided security level:
- RETRIEVE: allows only to retrieve data from CaosDB. No manipulation is allowed.
- INSERT: allows only to insert new entities and the manipulation of those newly inserted ones.
- UPDATE: allows all manipulations.
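How such a level check might work can be sketched as follows. This is a toy model and not the code of the guard module: the numeric values of the levels, the `ToyGuard` class, and its method names are assumptions made for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    # Assumed ordering: each level includes the permissions of the lower ones.
    RETRIEVE = 0
    INSERT = 1
    UPDATE = 2


class ToyGuard:
    def __init__(self, level=Level.RETRIEVE):
        self.level = level
        self.inserted = set()  # ids of entities inserted during this run

    def allows_insert(self):
        return self.level >= Level.INSERT

    def allows_update(self, entity_id):
        if self.level >= Level.UPDATE:
            return True
        # INSERT level: only newly inserted entities may be manipulated.
        return self.level >= Level.INSERT and entity_id in self.inserted


guard = ToyGuard(Level.INSERT)
guard.inserted.add("new-entity")
can_update_new = guard.allows_update("new-entity")
can_update_old = guard.allows_update("old-entity")
can_insert_at_retrieve = ToyGuard(Level.RETRIEVE).allows_insert()
```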
This implies that all data manipulation of the crawler should use the
functions that are checked by the guard. When writing a CFood you should
stick to the above mentioned assure_has_xyz
and assure_xyz_is
functions which use the respective data manipulation functions.
If you provide the to_be_updated member variable of CFoods to those assure... functions, the crawler provides another convenient feature: When an update is prevented due to the security level, the update is saved and can subsequently be authorized. If the crawler runs on the CaosDB server, it will try to send a mail that allows you to authorize the change. If it runs as a local script, it will notify you that there are unauthorized changes and provide a code with which the crawler can be started to authorize the change.
Real World Example
A crawler implementation exists that can crawl a file structure that adheres to the rules defined in this Data publication. The project is of moderate size and shows how a set of CFoods can be defined to deal with a complex file structure.
You can find detailed information on how files need to be structured here and the source code of the CFoods here.
Sources
Source of the fingerprint picture: https://svgsilh.com/image/1298040.html