==============
CaosDB Crawler
==============

The `CaosDB crawler `__ is a tool for the automated insertion or update of
entities in CaosDB. Typically, a file structure is crawled, but other
things, for example tables or HDF5 files, can be crawled as well.

Introduction
============

In simple terms, the crawler is a program that scans a directory structure,
identifies files that will be treated, and generates corresponding Entities
in CaosDB, possibly filling in metadata. During this process the crawler can
also open files and derive content from them, for example by reading CSV
tables and processing their individual rows.

.. image:: images/crawler_fingerprint.*

As shown in the figure, the general principle of the crawler framework is
the following:

- The crawler walks through the file structure and matches file names
  (using regular expressions).
- For each matched file, fingerprints (so-called ``Identifiables``) are
  created.
- Then the crawler checks whether the Records corresponding to the
  ``Identifiables`` already exist and whether they are up-to-date. If not,
  the Records are created or updated according to the file contents.

Technically, this behaviour can be adjusted to your needs using so-called
CFood (pun intended! :-)) Python classes. More details on the different
components of the CaosDB Crawler can be found in the `developers’
information <#extending-the-crawlers>`__ below.

If you are happy with our suggested standard crawler, feel free to use it.
The standard crawler lives in the submodule ``caosadvancedtools.scifolder``.

Usage
=====

Typically, you can invoke the crawler in two ways: via the web interface and
directly as a Python script. In both cases, if the crawler has a problem
with some file (e.g. columns in a table (tsv, xls, …) are named
incorrectly), the problem will be indicated by a warning that is returned.
You can fix the problem and run the crawler again. This does not cause any
problems, since the crawler can recognize what has already been processed
(see the description of fingerprints in the `Introduction
<#Introduction>`__).

.. warning:: Pay attention when you change a property that is used for the
   fingerprint: The crawler will not be able to match the changed version
   with a previously created Record, since the fingerprint differs. This
   often means that entities in the database need to be changed or removed.
   As a rule of thumb, you should be pretty sure that properties that are
   used as fingerprints will not change after the crawler has run for the
   first time. This prevents complications.

Invocation from the Web Interface
---------------------------------

If enabled, the crawler can be called using a menu entry in the web
interface. This will open a form where the path of the directory that shall
be crawled needs to be given. After the execution, information about what
was done and which problems might exist is shown in the web interface. Note
that some changes might be pending authorization (if indicated in the
messages).

Invocation as Python Script
---------------------------

The crawler can be executed directly via a Python script (usually called
``crawl.py``). The script prints its progress and reports potential
problems. The exact behavior depends on your setup. However, you can have a
look at the example in the `tests `__.

.. Note:: The crawler depends on the LinkAhead Python client, so make sure
   to install :doc:`pylinkahead `.

Call ``python3 crawl.py --help`` to see what parameters can be provided.

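For orientation, the core of such a script might look roughly like the
following sketch. It is only a sketch, not the canonical script: the module
``mycfoods`` and the imported ``ExampleCFood`` class are placeholders for
your own CFood definitions, and the ``FileCrawler`` arguments follow the
examples shown further down in this document.

.. code:: python

   #!/usr/bin/env python3
   # Sketch of a minimal crawl.py -- adapt to your own setup.
   import sys

   import caosdb as db
   from caosadvancedtools.crawler import FileCrawler
   from caosadvancedtools.guard import INSERT
   from mycfoods import ExampleCFood  # hypothetical module with your CFoods

   if __name__ == "__main__":
       # Path within CaosDB's file system, e.g. "/someplace/".
       path = sys.argv[1]

       # Collect the File entities stored below the given path
       # (cf. the CQL query mentioned below).
       files = db.execute_query("FIND File WHICH IS STORED AT " + path + "**")

       crawler = FileCrawler(files=files, cfood_types=[ExampleCFood],
                             interactive=False)
       crawler.crawl(security_level=INSERT)
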
Typically, an invocation looks like:

.. code:: sh

   python3 crawl.py /someplace/

.. Note:: For trying out the above-mentioned example crawler from the
   integration tests, make sure that the ``extroot`` directory in the
   ``integrationtests`` folder is used as CaosDB's extroot directory, and
   call the crawler indirectly via ``./test.sh``.

In this case, ``/someplace/`` identifies the path to be crawled **within
CaosDB's file system**. You can browse the CaosDB file system by opening the
WebUI of your CaosDB instance and clicking on “File System”.

In the backend, ``crawl.py`` starts a CQL query ``FIND File WHICH IS STORED
AT /someplace/**`` and crawls the resulting files according to your
customized ``CFoods``.

Crawling may consist of two distinct steps:

1. Insertion of files (using the function ``loadFiles``)
2. The actual crawling (using the ``crawl.py`` script)

However, the first step may be included in ``crawl.py``. Otherwise, you can
only crawl files that were previously inserted by ``loadFiles``.

loadFiles
~~~~~~~~~

After installation of ``caosadvancedtools`` you can simply call the function
``loadFiles`` contained in the package::

   python3 -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot

``/opt/caosdb/mnt/extroot`` is the root of the file system to be crawled as
seen by the CaosDB server (the actual path may vary; this is the path used
in the LinkAhead distribution of CaosDB). In this case the root of the file
system as seen from within the CaosDB Docker container is used.

You can provide a ``.caosdbignore`` file as a command line option to the
above ``loadFiles`` command. The syntax of that file is the same as for
`gitignore `_ files. Note that you can have additional ``.caosdbignore``
files at lower levels; these are appended to the current ignore file and
take effect for the respective subtree.

Extending the Crawlers
======================

In most use cases the crawler needs to be tailored to specific needs. This
section explains how this can be done. The behavior and rules of the crawler
are defined in logical units called CFoods. In order to extend the crawler
you need to extend an existing CFood or create a new one.

.. Note:: A crawler always needs a corresponding data model to exist on the
   server. The following does not cover this aspect. Please refer, for
   example, to the documentation of the YAML Interface.

.. _c-food-introduction:

CFood -- Introduction
---------------------

A ``CFood`` is a Python class that inherits from the abstract base class
:py:class:`~caosadvancedtools.cfood.AbstractCFood`. It should be independent
of other data and define the following methods:

1. :py:meth:`~caosadvancedtools.cfood.AbstractFileCFood.get_re`
   This *static* method is required for classes which inherit from
   :py:class:`~caosadvancedtools.cfood.AbstractFileCFood`. It returns a
   regular expression to identify files that can be consumed by this CFood.
   For other CFood implementations, overload the
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.match_item` method.

2. :py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`
   This method defines (and inserts if necessary) the identifiables which
   may be updated at a later stage. After calling this method, the
   ``AbstractCFood.identifiables`` Container contains those Records which
   will be updated at a later time.

3. :py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`
   This method updates the stored identifiables as necessary. All Entities
   which need to be updated on the server must be in
   ``AbstractCFood.to_be_updated`` after this call.

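Taken together, a file-consuming CFood has roughly the following shape. This
skeleton is only a structural sketch: the class name ``MyCFood``, the
regular expression, the ``Measurement`` RecordType and the ``number``
property are placeholders. A complete, working example is discussed further
below.

.. code:: python

   # Structural sketch of a CFood; all names are placeholders.
   import caosdb as db
   from caosadvancedtools.cfood import AbstractFileCFood


   class MyCFood(AbstractFileCFood):
       @staticmethod
       def get_re():
           # 1. Which files does this CFood consume?
           return r".*/measurements/(?P<number>\d+)\.csv"

       def create_identifiables(self):
           # 2. Define the fingerprint Record(s) and register them.
           self.measurement = db.Record()
           self.measurement.add_parent(name="Measurement")
           self.measurement.add_property(name="number",
                                         value=self.match.group("number"))
           self.identifiables.append(self.measurement)

       def update_identifiables(self):
           # 3. Add or change further properties of the identified Record(s),
           #    e.g. with the assure_has_property helper described below.
           pass
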
As hinted above, the main feature of an ``identifiable`` is its
fingerprinting ability: it has sufficient properties to identify an existing
Record in CaosDB, so that the CFood can decide which Records should be
updated by the Crawler instead of inserting new ones. Obviously, this allows
the Crawler to run twice on the same file structure without duplicating the
data in CaosDB.

An ``identifiable`` is a Python :py:class:`~caosdb.common.models.Record`
object with the features needed to identify the correct Record in CaosDB.
This object is used to create a query in order to determine whether the
Record exists. If the Record does not exist, the ``identifiable`` is used to
insert the Record. Thus, after this step the Crawler guarantees that a
Record with the features of the ``identifiable`` exists in CaosDB (either
previously existing or newly created).

An example: An experiment might be uniquely identified by the date when it
was conducted and a number. The ``identifiable`` might then look as follows:

.. code:: xml

   <Record>
     <Parent name="Experiment"/>
     <Property name="date">2020-04-19</Property>
     <Property name="number">9</Property>
   </Record>

CFoods and the Crawler
----------------------

In short, the Crawler interacts with the available CFoods in the following
way:

#. The Crawler iterates over the available objects (for example files), and
   for each object ``o``:

   #. The Crawler tests which of the available CFoods can consume the object
      ``o``, using the CFoods'
      :py:meth:`~caosadvancedtools.cfood.AbstractCFood.match_item` class
      method.
   #. If a CFood matches the object, an instance of that CFood is created
      with the object ``o`` and stored for later, like
      ``cfoods.append(CFood(o))``.

#. The Crawler then iterates over the stored CFood instances, and for each
   instance ``cfood`` does:

   #. ``cfood.create_identifiables()``: As described
      :ref:`above <c-food-introduction>`, create the identifiables.
   #. All the identifiables in ``cfood.identifiables`` are searched for
      existence in the CaosDB instance and inserted if they do not exist.
   #. ``cfood.update_identifiables()``: As described
      :ref:`above <c-food-introduction>`, update the identifiables if their
      content needs to change.
   #. All the identifiables in ``cfood.to_be_updated`` are synced to the
      CaosDB instance.

The following sketch aims to visualize this procedure.

.. figure:: images/crawler_flow_sketch.*

   Sketch of how the Crawler uses the CFoods to process objects. Of the four
   identifiables (fingerprints) on the right, only the second does not exist
   yet and is thus inserted in the second step. Only identifiables number 2
   and 4 have new or changed content, so only these are synced to CaosDB in
   the last step.

.. note:: **Practical hint:** After the call to
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`,
   the Crawler guarantees that an ``Experiment`` with those properties
   exists in CaosDB. In the call to
   :py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`,
   further properties might be added to this Record, e.g. references to data
   files that were recorded in that experiment or to the person who did the
   experiment.

CFood -- An example
-------------------

Let’s look at the following example:

.. code:: python

   >>> # Example CFood
   >>> from caosadvancedtools.cfood import AbstractFileCFood, assure_has_property
   >>> import caosdb as db
   >>>
   >>> class ExampleCFood(AbstractFileCFood):
   ...     @staticmethod
   ...     def get_re():
   ...         return (r".*/(?P<species>[^/]+)/"
   ...                 r"(?P<date>\d{4}-\d{2}-\d{2})/README.md")
   ...
   ...     def create_identifiables(self):
   ...         self.experiment = db.Record()
   ...         self.experiment.add_parent(name="Experiment")
   ...         self.experiment.add_property(
   ...             name="date",
   ...             value=self.match.group('date'))
   ...         self.identifiables.append(self.experiment)
   ...
   ...     def update_identifiables(self):
   ...         assure_has_property(
   ...             self.experiment,
   ...             "species",
   ...             self.match.group('species'))

   >>> # check whether the definition is valid
   >>> cf = ExampleCFood('')

Every child of ``AbstractFileCFood`` (``AbstractFileCFood`` is for crawling
files; and yes, you can crawl other things as well) needs to implement the
functions ``get_re``, ``create_identifiables`` and ``update_identifiables``.

The function :py:meth:`~caosadvancedtools.cfood.AbstractFileCFood.get_re`
defines which files shall be treated with this CFood. The function needs to
return a string with a regular expression. Here, the expression matches any
"README.md" file that is located below two folder levels, like
``/any/path/whale/2020-01-01/README.md``. Note that the groups defined in
the regular expression (``date`` and ``species``) can later be used via
``self.match.group('name')``.

:py:meth:`~caosadvancedtools.cfood.AbstractCFood.create_identifiables`
defines the ``identifiables`` that are needed, and
:py:meth:`~caosadvancedtools.cfood.AbstractCFood.update_identifiables`
applies additional changes. Here, an ``Experiment`` Record is identified
using solely the date. This implies that there must NOT exist two
``Experiment`` Records with the same date. If this might occur, an
additional property needs to be added to the identifiable. The
``identifiables`` have to be added to the ``self.identifiables`` list. After
the correct Record has been identified (or created if none existed), an
additional property describing the species is added.

Your CFood needs to be passed to the crawler instance in the ``crawl.py``
file that you use for crawling, for example like this:

.. code:: python

   c = FileCrawler(files=files, cfood_types=[ExampleCFood])

CFood -- Advanced
-----------------

CFoods have some additional features in order to cope with complex
scenarios. For example, what if multiple files together are needed to create
some Record? Multiple data files recorded in an experiment could be one
example.

CFoods may define the :py:meth:`~.AbstractCFood.collect_information`
function. In this function additional information can be collected by
accessing files or querying the database. One particular use case is to add
file paths to the ``attached_filenames`` property. By default, all files
that are located at those paths are also treated by this CFood. This also
means that the crawler does not list those files as “untreated”.

One special case is the existence of multiple, very similar files. Imagine
that you want to treat a range of calibration images with a CFood. You can
write a regular expression to match all the files, but it might be hard to
match one file in particular. In this case, you should use the
:py:class:`~.CMeal` mix-in. This assures that the first match creates a
CFood instance and all following matches are attached to the same instance.
For further information, please consult the
:py:obj:`API documentation <.CMeal>`.

As the crawler may run in different environments, the way files can be
accessed may differ. This can be defined using the
:py:obj:`File Guide <.cfood.FileGuide>`. In the ``crawl.py`` file, you
should set this appropriately:

.. code:: python

   >>> from caosadvancedtools.cfood import fileguide
   >>> import os
   >>> fileguide.access = lambda path: "/main/data/" + path

This prefixes all paths that are used in CaosDB with “/main/data/”. In
CFoods, files can then be accessed using the fileguide as follows:

.. code:: python

   with open(fileguide.access("/some/path")):
       # do stuff
       pass

Changing data in CaosDB
-----------------------

As described above, a Record matching the identifiable will be inserted if
no such Record existed before. This is typically unproblematic. However,
what if existing Records need to be modified? Many manipulations have the
potential of overwriting changes made in CaosDB. Thus, unless the data being
crawled is the single source of truth for the information in CaosDB (and
changes to the respective data in CaosDB should thus not be possible),
changes have to be made with some care.

Use the ``assure_has_xyz`` functions defined in the
:py:mod:`cfood module <.cfood>` to add a given property only if it does not
exist yet, and use the ``assure_xyz_is`` functions to force the value of a
property (see the remarks above).

To further assure that changes are correct, the crawler comes with an
authorization mechanism. When running the crawler with the ``crawl``
function, a security level can be given:

.. code:: python

   >>> from caosadvancedtools.crawler import FileCrawler
   >>> from caosadvancedtools.guard import RETRIEVE, INSERT, UPDATE
   >>> files = []  # put files to be crawled in this list
   >>> c = FileCrawler(
   ...     files=files,
   ...     cfood_types=[ExampleCFood],
   ...     interactive=False)  # the crawler runs without asking intermediate questions
   >>> c.crawl(security_level=INSERT)

This assures that every manipulation of data in CaosDB that is done via the
functions provided by the :py:mod:`~caosadvancedtools.guard` module is
checked against the provided security level:

- ``RETRIEVE``: only allows data to be retrieved from CaosDB; no
  manipulation is allowed
- ``INSERT``: only allows new entities to be inserted and those newly
  inserted entities to be manipulated
- ``UPDATE``: allows all manipulations

This implies that all data manipulation done by the crawler should use
functions that are checked by the guard. When writing a CFood, you should
stick to the above-mentioned ``assure_has_xyz`` and ``assure_xyz_is``
functions, which use the respective data manipulation functions.

If you provide the ``to_be_updated`` member variable of CFoods to those
``assure...`` functions, the crawler provides another convenient feature:
when an update is prevented due to the security level, the update is saved
and can subsequently be authorized. If the crawler runs on the CaosDB
server, it will try to send a mail that allows the change to be authorized.
If it runs as a local script, it will notify you that there are unauthorized
changes and provide a code with which the crawler can be started again to
authorize the change.

Real World Example
==================

A crawler implementation exists that can crawl a file structure that adheres
to the rules defined in this `Data publication `__. The project is of
moderate size and shows how a set of CFoods can be defined to deal with a
complex file structure. You can find detailed information on how files need
to be structured `here `__ and the source code of the CFoods `here `__.

Sources
=======

Source of the fingerprint picture: https://svgsilh.com/image/1298040.html