==============
CaosDB Crawler
==============

The `CaosDB crawler `__ is a tool for the automated insertion or update of
entities in CaosDB. Typically, a file structure is crawled, but other
things, for example tables or HDF5 files, can be crawled as well.

Introduction
============

In simple terms, the crawler is a program that scans a directory structure,
identifies files that will be treated, and generates corresponding Entities
in CaosDB, possibly filling in metadata. During this process the crawler can
also open files and derive content from them, for example by reading CSV
tables and processing their individual rows.

.. image:: images/crawler_fingerprint.*

As shown in the figure, the general principle of the crawler framework is
the following:

- The crawler walks through the file structure and matches file names
  (using regular expressions).
- For each matched file, fingerprints (so-called ``Identifiables``) are
  created.
- Then the crawler checks whether the Records corresponding to the
  ``Identifiables`` already exist and whether they are up-to-date. If not,
  the Records are created or updated according to the file contents.

Technically, this behaviour can be adjusted to your needs using so-called
CFood (pun intended! :-)) Python classes. More details on the different
components of the CaosDB Crawler can be found in the `developers’
information <#extending-the-crawlers>`__ below.

If you are happy with our suggested standard crawler, feel free to use it.
The standard crawler lives in the submodule ``caosadvancedtools.scifolder``.

Usage
=====

Typically, you can invoke the crawler in two ways: via the web interface and
directly as a Python script. In both cases, if the crawler has a problem
with some file (e.g. columns in a table (tsv, xls, …) are named
incorrectly), the problem will be indicated by a warning that is returned.
You can fix the problem and run the crawler again. This does not cause any
problems, since the crawler can recognize what has already been processed
(see the description of fingerprints in the `Introduction
<#Introduction>`__).

.. warning:: Pay attention when you change a property that is used for the
   fingerprint: The crawler will not be able to match the changed version
   with a previously created Record, since the fingerprint differs. This
   often means that entities in the database need to be changed or removed.
   As a rule of thumb, you should be pretty sure that properties that are
   used as fingerprints will not change after the crawler has run for the
   first time. This prevents complications.

Invocation from the Web Interface
---------------------------------

If enabled, the crawler can be called using a menu entry in the web
interface. This will open a form where the path of the directory that shall
be crawled needs to be given. After the execution, information about what
was done and which problems might exist is shown in the web interface. Note
that some changes might be pending authorization (if indicated in the
messages).

Invocation as Python Script
---------------------------

The crawler can be executed directly via a Python script (usually called
``crawl.py``). The script prints its progress and reports potential
problems. The exact behavior depends on your setup. However, you can have a
look at the example in the `tests `__.

.. Note:: The crawler depends on the LinkAhead Python client, so make sure
   to install :doc:`pylinkahead `.

Call ``python3 crawl.py --help`` to see what parameters can be provided.

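For orientation, the core of such a script might look roughly like the
following sketch. It is only a sketch, not the canonical script: the module
``mycfoods`` and the imported ``ExampleCFood`` class are placeholders for
your own CFood definitions, and the ``FileCrawler`` arguments follow the
examples shown further down in this document.

.. code:: python

   #!/usr/bin/env python3
   # Sketch of a minimal crawl.py -- adapt to your own setup.
   import sys

   import caosdb as db
   from caosadvancedtools.crawler import FileCrawler
   from caosadvancedtools.guard import INSERT
   from mycfoods import ExampleCFood  # hypothetical module with your CFoods

   if __name__ == "__main__":
       # Path within CaosDB's file system, e.g. "/someplace/".
       path = sys.argv[1]

       # Collect the File entities stored below the given path
       # (cf. the CQL query mentioned below).
       files = db.execute_query("FIND File WHICH IS STORED AT " + path + "**")

       crawler = FileCrawler(files=files, cfood_types=[ExampleCFood],
                             interactive=False)
       crawler.crawl(security_level=INSERT)
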
Typically, an invocation looks like:

.. code:: sh

   python3 crawl.py /someplace/

.. Note:: For trying out the above-mentioned example crawler from the
   integration tests, make sure that the ``extroot`` directory in the
   ``integrationtests`` folder is used as CaosDB's extroot directory, and
   call the crawler indirectly via ``./test.sh``.

In this case, ``/someplace/`` identifies the path to be crawled **within
CaosDB's file system**. You can browse the CaosDB file system by opening the
WebUI of your CaosDB instance and clicking on “File System”.

In the backend, ``crawl.py`` starts a CQL query ``FIND File WHICH IS STORED
AT /someplace/**`` and crawls the resulting files according to your
customized ``CFoods``.

Crawling may consist of two distinct steps:

1. Insertion of files (using the function ``loadFiles``)
2. The actual crawling (using the ``crawl.py`` script)

However, the first step may be included in ``crawl.py``. Otherwise, you can
only crawl files that were previously inserted by ``loadFiles``.

loadFiles
~~~~~~~~~

After installation of ``caosadvancedtools`` you can simply call the function
``loadFiles`` contained in the package::

   python3 -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot

``/opt/caosdb/mnt/extroot`` is the root of the file system to be crawled as
seen by the CaosDB server (the actual path may vary; this is the path used
in the LinkAhead distribution of CaosDB). In this case the root of the file
system as seen from within the CaosDB Docker container is used.

You can provide a ``.caosdbignore`` file as a command line option to the
above ``loadFiles`` command. The syntax of that file is the same as for
`gitignore `_ files. Note that you can have additional ``.caosdbignore``
files at lower levels; these are appended to the current ignore file and
take effect for the respective subtree.

Extending the Crawlers
======================

In most use cases the crawler needs to be tailored to specific needs. This
section explains how this can be done. The behavior and rules of the crawler
are defined in logical units called CFoods. In order to extend the crawler
you need to extend an existing CFood or create a new one.

.. Note:: A crawler always needs a corresponding data model to exist on the
   server. The following does not cover this aspect. Please refer, for
   example, to the documentation of the YAML Interface.

.. _c-food-introduction:

CFood -- Introduction
---------------------

A ``CFood`` is a Python class that inherits from the abstract base class
:py:class:`~caosadvancedtools.cfood.AbstractCFood`. It should be independent
of other data and define the following methods:

1. :py:meth:`~caosadvancedtools.cfood.AbstractFileCFood.get_re`
   This *static* method is required for classes which inherit from
   :py:class:`~caosadvancedtools.cfood.AbstractFileCFood`. It returns a
   regular expression to identify files that can be consumed by this CFood.
   For other CFood implementations, overload the
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.match_item` method.

2. :py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`
   This method defines (and inserts if necessary) the identifiables which
   may be updated at a later stage. After calling this method, the
   ``AbstractCFood.identifiables`` Container contains those Records which
   will be updated at a later time.

3. :py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`
   This method updates the stored identifiables as necessary. All Entities
   which need to be updated on the server must be in
   ``AbstractCFood.to_be_updated`` after this call.

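Taken together, a file-consuming CFood has roughly the following shape. This
skeleton is only a structural sketch: the class name ``MyCFood``, the
regular expression, the ``Measurement`` RecordType and the ``number``
property are placeholders. A complete, working example is discussed further
below.

.. code:: python

   # Structural sketch of a CFood; all names are placeholders.
   import caosdb as db
   from caosadvancedtools.cfood import AbstractFileCFood


   class MyCFood(AbstractFileCFood):
       @staticmethod
       def get_re():
           # 1. Which files does this CFood consume?
           return r".*/measurements/(?P<number>\d+)\.csv"

       def create_identifiables(self):
           # 2. Define the fingerprint Record(s) and register them.
           self.measurement = db.Record()
           self.measurement.add_parent(name="Measurement")
           self.measurement.add_property(name="number",
                                         value=self.match.group("number"))
           self.identifiables.append(self.measurement)

       def update_identifiables(self):
           # 3. Add or change further properties of the identified Record(s),
           #    e.g. with the assure_has_property helper described below.
           pass
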
As hinted above, the main feature of an ``identifiable`` is its
fingerprinting ability: it has sufficient properties to identify an existing
Record in CaosDB, so that the CFood can decide which Records should be
updated by the Crawler instead of inserting new ones. Obviously, this allows
the Crawler to run twice on the same file structure without duplicating the
data in CaosDB.

An ``identifiable`` is a Python :py:class:`~caosdb.common.models.Record`
object with the features needed to identify the correct Record in CaosDB.
This object is used to create a query in order to determine whether the
Record exists. If the Record does not exist, the ``identifiable`` is used to
insert the Record. Thus, after this step the Crawler guarantees that a
Record with the features of the ``identifiable`` exists in CaosDB (either
previously existing or newly created).

An example: An experiment might be uniquely identified by the date when it
was conducted and a number. The ``identifiable`` might then look as follows:

.. code:: xml

   <Record>
     <Parent name="Experiment"/>
     <Property name="date">2020-04-19</Property>
     <Property name="number">9</Property>
   </Record>

CFoods and the Crawler
----------------------

In short, the Crawler interacts with the available CFoods in the following
way:

#. The Crawler iterates over the available objects (for example files), and
   for each object ``o``:

   #. The Crawler tests which of the available CFoods can consume the object
      ``o``, using the CFoods'
      :py:meth:`~caosadvancedtools.cfood.AbstractCFood.match_item` class
      method.
   #. If a CFood matches the object, an instance of that CFood is created
      with the object ``o`` and stored for later, like
      ``cfoods.append(CFood(o))``.

#. The Crawler then iterates over the stored CFood instances, and for each
   instance ``cfood`` does:

   #. ``cfood.create_identifiables()``: As described
      :ref:`above <c-food-introduction>`, create the identifiables.
   #. All the identifiables in ``cfood.identifiables`` are searched for
      existence in the CaosDB instance and inserted if they do not exist.
   #. ``cfood.update_identifiables()``: As described
      :ref:`above <c-food-introduction>`, update the identifiables if their
      content needs to change.
   #. All the identifiables in ``cfood.to_be_updated`` are synced to the
      CaosDB instance.

The following sketch aims to visualize this procedure.

.. figure:: images/crawler_flow_sketch.*

   Sketch of how the Crawler uses the CFoods to process objects. Of the four
   identifiables (fingerprints) on the right, only the second does not exist
   yet and is thus inserted in the second step. Only identifiables number 2
   and 4 have new or changed content, so only these are synced to CaosDB in
   the last step.

.. note:: **Practical hint:** After the call to
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`,
   the Crawler guarantees that an ``Experiment`` with those properties
   exists in CaosDB. In the call to
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`,
   further properties might be added to this Record, e.g. references to data
   files that were recorded in that experiment or to the person who did the
   experiment.

CFood -- An example
-------------------

Let’s look at the following example:

.. code:: python

   >>> # Example CFood
   >>> from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property
   >>> import caosdb as db
   >>>
   >>> class ExampleCFood(AbstractFileCFood):
   ...     @staticmethod
   ...     def get_re():
   ...         return (r".*/(?P<species>[^/]+)/"
   ...                 r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
   ...
   ...     def create_identifiables(self):
   ...         self.experiment = db.Record()
   ...         self.experiment.add_parent(name="Experiment")
   ...         self.experiment.add_property(
   ...             name="date",
   ...             value=self.match.group('date'))
   ...         self.identifiables.append(self.experiment)
   ...
   ...     def update_identifiables(self):
   ...         assure_has_property(
   ...             self.experiment,
   ...             "species",
   ...             self.match.group('species'))

   >>> # check whether the definition is valid
   >>> cf = ExampleCFood('')

Every child of ``AbstractFileCFood`` (``AbstractFileCFood`` is for crawling
files; and yes, you can crawl other things as well) needs to implement the
functions ``get_re``, ``create_identifiables`` and ``update_identifiables``.

The function :py:meth:`~caosadvancedtools.cfood.AbstractFileCFood.get_re`
defines which files shall be treated with this CFood. The function needs to
return a string with a regular expression. Here, the expression matches any
"README.md" file that is located below two folder levels, like
``/any/path/whale/2020-01-01/README.md``. Note that the groups defined in
the regular expression (``date`` and ``species``) can later be used via
``self.match.group('name')``.

:py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`
defines the ``identifiables`` that are needed, and
:py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`
applies additional changes. Here, an ``Experiment`` Record is identified
using solely the date. This implies that there must NOT exist two
``Experiment`` Records with the same date. If this might occur, an
additional property needs to be added to the identifiable. The
``identifiables`` have to be added to the ``self.identifiables`` list. After
the correct Record has been identified (or created if none existed), an
additional property describing the species is added.

Your CFood needs to be passed to the crawler instance in the ``crawl.py``
file that you use for crawling, for example like this:

.. code:: python

   c = FileCrawler(files=files, cfood_types=[ExampleCFood])

CFood -- Advanced
-----------------

CFoods have some additional features in order to cope with complex
scenarios. For example, what if multiple files together are needed to create
some Record? Multiple data files recorded in an experiment could be one
example.

CFoods may define the :py:meth:`~.AbstractCFood.collect_information`
function. In this function additional information can be collected by
accessing files or querying the database. One particular use case is to add
file paths to the ``attached_filenames`` property. By default, all files
that are located at those paths are also treated by this CFood. This also
means that the crawler does not list those files as “untreated”.

One special case is the existence of multiple, very similar files. Imagine
that you want to treat a range of calibration images with a CFood. You can
write a regular expression to match all the files, but it might be hard to
match one file in particular. In this case, you should use the
:py:class:`~.CMeal` mix-in. This assures that the first match creates a
CFood instance and all following matches are attached to the same instance.
For further information, please consult the
:py:obj:`API documentation <.CMeal>`.

As the crawler may run in different environments, the way files can be
accessed may differ. This can be defined using the
:py:obj:`File Guide <.cfood.FileGuide>`. In the ``crawl.py`` file, you
should set this appropriately:

.. code:: python

   >>> from caosadvancedtools.cfood import fileguide
   >>> import os
   >>> fileguide.access = lambda path: "/main/data/" + path

This prefixes all paths that are used in CaosDB with “/main/data/”. In
CFoods, files can then be accessed using the fileguide as follows:

.. code:: python

   with open(fileguide.access("/some/path")):
       # do stuff
       pass

Changing data in CaosDB
-----------------------

As described above, a Record matching the identifiable will be inserted if
no such Record existed before. This is typically unproblematic. However,
what if existing Records need to be modified? Many manipulations have the
potential of overwriting changes made in CaosDB. Thus, unless the data being
crawled is the single source of truth for the information in CaosDB (and
changes to the respective data in CaosDB should thus not be possible),
changes have to be made with some care.

Use the ``assure_has_xyz`` functions defined in the
:py:mod:`cfood module <.cfood>` to add a given property only if it does not
exist yet, and use the ``assure_xyz_is`` functions to force the value of a
property (see the remarks above).

To further assure that changes are correct, the crawler comes with an
authorization mechanism. When running the crawler with the ``crawl``
function, a security level can be given:

.. code:: python

   >>> from caosadvancedtools.crawler import FileCrawler
   >>> from caosadvancedtools.guard import RETRIEVE, INSERT, UPDATE
   >>> files = []  # put files to be crawled in this list
   >>> c = FileCrawler(
   ...     files=files,
   ...     cfood_types=[ExampleCFood],
   ...     interactive=False)  # the crawler runs without asking intermediate questions
   >>> c.crawl(security_level=INSERT)

This assures that every manipulation of data in CaosDB that is done via the
functions provided by the :py:mod:`~caosadvancedtools.guard` module is
checked against the provided security level:

- ``RETRIEVE``: only allows data to be retrieved from CaosDB; no
  manipulation is allowed
- ``INSERT``: only allows new entities to be inserted and those newly
  inserted entities to be manipulated
- ``UPDATE``: allows all manipulations

This implies that all data manipulation done by the crawler should use
functions that are checked by the guard. When writing a CFood, you should
stick to the above-mentioned ``assure_has_xyz`` and ``assure_xyz_is``
functions, which use the respective data manipulation functions.

If you provide the ``to_be_updated`` member variable of CFoods to those
``assure...`` functions, the crawler provides another convenient feature:
when an update is prevented due to the security level, the update is saved
and can subsequently be authorized. If the crawler runs on the CaosDB
server, it will try to send a mail that allows the change to be authorized.
If it runs as a local script, it will notify you that there are unauthorized
changes and provide a code with which the crawler can be started again to
authorize the change.

Real World Example
==================

A crawler implementation exists that can crawl a file structure that adheres
to the rules defined in this `Data publication `__. The project is of
moderate size and shows how a set of CFoods can be defined to deal with a
complex file structure. You can find detailed information on how files need
to be structured `here `__ and the source code of the CFoods `here `__.

Sources
=======

Source of the fingerprint picture: https://svgsilh.com/image/1298040.html