Scientific Folder Structure
===========================

The SciFolder structure
-----------------------

Let's walk through a more elaborate example of using the CaosDB Crawler,
this time making use of a simple directory structure. We assume
the directory structure to have the following form:

.. code-block:: text

   ExperimentalData/
   
     2022_ProjectA/
     
       2022-02-17_TestDataset/
         file1.dat
         file2.dat
         ...
       ...
     
     2023_ProjectB/
       ...
     ...

This file structure is described in our article "Guidelines for a Standardized Filesystem Layout for
Scientific Data" (https://doi.org/10.3390/data5020043). As a simplified example, we want to write a
crawler that creates "Project" and "Measurement" records in CaosDB and sets some reasonable
properties derived from the file and directory names. Furthermore, we want to link the data files
to the measurement records.

Let's first clarify the terms we are using:

.. code-block:: text

   ExperimentalData/            <--- Category level (level 0)
     2022_ProjectA/             <--- Project level (level 1)
       2022-02-17_TestDataset/  <--- Activity / Measurement level (level 2)
         file1.dat              <--- Files on level 3
         file2.dat
         ...
       ...
     2023_ProjectB/             <--- Project level (level 1)
       ...
     ...

This follows the three-level folder structure described in the article. We use the term
"Activity level" here instead of the terms used in the article because it is more general.
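
To make the naming convention concrete, here is a minimal Python sketch of how properties could be
derived from the directory names. The regular expressions and the property names (``year``,
``name``, ``date``, ``identifier``) are assumptions inferred from the example tree; they are not
definitions taken from the crawler or from the article:

.. code-block:: python

   import re
   from datetime import date

   # Assumed patterns: "<year>_<project name>" and "<ISO date>_<identifier>".
   PROJECT_RE = re.compile(r"^(?P<year>\d{4})_(?P<name>.+)$")
   ACTIVITY_RE = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2})(?:_(?P<identifier>.+))?$")


   def parse_project(dirname):
       """Return year and name of a project directory, or None if it does not match."""
       match = PROJECT_RE.match(dirname)
       if match is None:
           return None
       return {"year": int(match["year"]), "name": match["name"]}


   def parse_activity(dirname):
       """Return date and identifier of an activity directory, or None if it does not match."""
       match = ACTIVITY_RE.match(dirname)
       if match is None:
           return None
       try:
           day = date.fromisoformat(match["date"])
       except ValueError:
           # Looks like a date but is not a valid calendar day, e.g. 2022-13-40.
           return None
       return {"date": day, "identifier": match["identifier"]}

With these helpers, ``parse_project("2022_ProjectA")`` yields the year ``2022`` and the name
``ProjectA``, while a directory name that does not follow the convention yields ``None``.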

A CFood for SciFolder
---------------------

The following YAML CFood matches this directory structure and inserts or updates the corresponding
records. The figure explains the YAML definitions in detail:

.. image:: example_crawler.svg
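
Conceptually, such a CFood lets the crawler descend the tree level by level and build one record per
matched directory. The following plain-Python sketch imitates that traversal without using the
crawler at all. The function name ``collect_records`` and the dictionary keys are hypothetical and
only serve as illustration; the actual mapping is defined by the CFood:

.. code-block:: python

   from pathlib import Path


   def collect_records(category_dir):
       """Map a three-level SciFolder tree to nested dictionaries (illustration only)."""
       projects = []
       # Level 1: project directories such as "2022_ProjectA".
       for project_dir in sorted(p for p in Path(category_dir).iterdir() if p.is_dir()):
           year, _, name = project_dir.name.partition("_")
           measurements = []
           # Level 2: activity directories such as "2022-02-17_TestDataset".
           for activity_dir in sorted(p for p in project_dir.iterdir() if p.is_dir()):
               day, _, identifier = activity_dir.name.partition("_")
               # Level 3: the data files that belong to this measurement.
               files = sorted(f.name for f in activity_dir.iterdir() if f.is_file())
               measurements.append({"date": day, "identifier": identifier, "files": files})
           projects.append({"year": year, "name": name, "measurements": measurements})
       return projects

Calling ``collect_records("ExperimentalData")`` on the example tree would return one entry per
project, each holding its measurements and the names of the linked data files.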


See for yourself
----------------

If you want to try this out for yourself, you will need the following content:

- Data files in a SciFolder structure.
- A data model which describes the data.
- An identifiables definition which describes how data Entities can be identified.
- A CFood definition which the crawler uses to map from the folder structure to entities in CaosDB.

You can download all the necessary files, packed in `scifolder_tutorial.tar.gz
<../_static/assets/scifolder_tutorial.tar.gz>`__. After storing this archive file, unpack it and go
into the ``scifolder`` directory, then follow these steps:

.. role:: shell(code)
   :language: shell

1. Copy the data files folder to the ``extroot`` directory of your LinkAhead installation:

   :shell:`cp -r scifolder_data ../../<your_extroot>/`
2. Load the content of the data folder into CaosDB:

   :shell:`python -m caosadvancedtools.loadFiles /opt/caosdb/mnt/extroot/scifolder_data`

   The path passed to ``loadFiles`` is the one that the CaosDB server sees, which is not necessarily
   the same as the one on your local machine. The prefix ``/opt/caosdb/mnt/extroot/`` is correct for
   all LinkAhead instances. If you are in doubt, please ask your administrator for the correct path.

   For more information on ``loadFiles``, call :shell:`python -m caosadvancedtools.loadFiles --help`.

   .. note::

      If the Records that are created shall reference CaosDB File Entities, you (currently) need
      to make those files accessible in CaosDB in advance. For example, if you have a folder with
      experimental data files and you want those files to be referenced (for example by an
      Experiment Record), load the folder with ``loadFiles`` as shown in this step before running
      the crawler.

3. Teach the server about the data model:

   :shell:`python -m caosadvancedtools.models.parser model.yml --sync`
4. Run the crawler on the local ``scifolder_data`` folder, using the identifiables and CFood
   definition files:

   :shell:`caosdb-crawler -s update -i identifiables.yml scifolder_cfood.yml scifolder_data`
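
If you prefer to script steps 2 to 4, the commands can be assembled in Python, for example to pass
them to ``subprocess.run``. This is only a sketch: the helper ``tutorial_commands`` is hypothetical,
the file names are those used in the steps above, and step 1 is left out because the ``extroot``
location depends on your installation:

.. code-block:: python

   import shlex

   # Server-side prefix quoted in step 2; ask your administrator if in doubt.
   SERVER_EXTROOT = "/opt/caosdb/mnt/extroot/"


   def tutorial_commands(data_dir="scifolder_data", server_prefix=SERVER_EXTROOT):
       """Return the commands of steps 2 to 4 as argument lists."""
       # loadFiles needs the path as the server sees it, not the local one.
       server_path = server_prefix.rstrip("/") + "/" + data_dir
       return [
           ["python", "-m", "caosadvancedtools.loadFiles", server_path],
           ["python", "-m", "caosadvancedtools.models.parser", "model.yml", "--sync"],
           ["caosdb-crawler", "-s", "update", "-i", "identifiables.yml",
            "scifolder_cfood.yml", data_dir],
       ]


   def as_shell_lines(commands):
       """Render argument lists as shell command lines, e.g. for logging."""
       return [shlex.join(cmd) for cmd in commands]

Each argument list can then be executed from the ``scifolder`` directory with
``subprocess.run(cmd, check=True)``, which stops at the first failing step.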