Further converters
More converters, together with cfood definitions and examples, can be found in the LinkAhead Crawler Extensions Subgroup on GitLab. The converters listed below are shipped with the crawler library itself but are not part of the set of standard converters; they may require the library to be installed with additional optional dependencies.
HDF5 Converters
Four converters handle HDF5 files, corresponding to the internal structure
of the format. The H5FileConverter opens the file itself and creates
further structure elements from HDF5 groups, datasets, and the
multi-dimensional arrays they contain. These are in turn treated by the
H5GroupConverter, the H5DatasetConverter, and the
H5NdarrayConverter, respectively. To use these converters, you need to
install the LinkAhead crawler with its optional h5-crawler dependency.
The basic idea when crawling HDF5 files is to treat them very similarly to
dictionaries in which the attributes on root,
group, or dataset level are essentially treated like BooleanElement,
TextElement, FloatElement, and IntegerElement in a dictionary: they
are appended as children and can be accessed via the subtree. The file
itself and the groups within may contain further groups and datasets, which can
have their own attributes, subgroups, and datasets, very much like
DictElements within a dictionary. The main difference from any other
dictionary type is the presence of multi-dimensional arrays within HDF5
datasets. Since LinkAhead doesn’t have any datatype corresponding to these, and
since it isn’t desirable to store these arrays directly within LinkAhead for
reasons of performance and of searchability, we wrap them within a specific
Record as explained below, together with more
metadata and their internal path within the HDF5 file. Users can thus query for
datasets and their arrays according to their metadata within LinkAhead and then
use the internal path information to access the dataset within the file
directly. The type of this record and the property for storing the internal path
need to be reflected in the datamodel. Using the default names, you would need a
datamodel like
H5Ndarray:
  obligatory_properties:
    internal_hdf5_path:
      datatype: TEXT
although the names of both property and record type can be configured within the cfood definition.
A simple example of a cfood definition for HDF5 files can be found in the unit tests and shows how the individual converters are used in order to crawl a simple example file containing groups, subgroups, and datasets, together with their respective attributes.
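The dictionary analogy described above can be made concrete with a small toy model. The sketch below uses only plain Python dictionaries; it does not call h5py or the crawler, and the keys `attrs`, `children`, and `data` as well as the helper `walk` are hypothetical names chosen for illustration. It shows how attributes sit on every level and how each array-bearing dataset ends up with an internal path. The slash-separated path format is an assumption modelled on common HDF5 usage, not a statement about what the crawler stores.

```python
# Toy model only: nested dicts stand in for an HDF5 file.
# No h5py or crawler API is used; all key names are hypothetical.
toy_file = {
    "attrs": {"title": "example", "version": 2},
    "children": {
        "group": {
            "attrs": {"temperature": 21.5},
            "children": {
                "dataset": {"attrs": {"unit": "K"}, "data": [[1, 2], [3, 4]]},
            },
        },
    },
}


def walk(node, path=""):
    """Yield (internal_path, attributes, has_array) for the node and its descendants."""
    yield (path or "/", node.get("attrs", {}), "data" in node)
    for name, child in node.get("children", {}).items():
        yield from walk(child, f"{path}/{name}")


entries = list(walk(toy_file))
# Only nodes that carry an array would be wrapped in a record with a path.
array_paths = [path for path, _attrs, has_array in entries if has_array]
```

In this toy file, `array_paths` contains a single entry for the dataset nested inside the group, while the root and group levels contribute only attributes.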
H5FileConverter
This is an extension of the
SimpleFileConverter
class. It opens the HDF5 file and creates children for any contained
group or dataset. Additionally, the root-level attributes of the HDF5
file are accessible as children.
H5GroupConverter
This is an extension of the
DictElementConverter
class. Children are created for all subgroups and datasets in this
HDF5 group. Additionally, the group-level attributes are accessible as
children.
H5DatasetConverter
This is an extension of the
DictElementConverter
class. Most importantly, it stores the array data of the HDF5 dataset in an
H5NdarrayElement,
which is added to its children along with the dataset attributes.
H5NdarrayConverter
This converter creates a wrapper record for the contained dataset. The name of
this record needs to be specified in the cfood definition of this converter via
the recordname option. The RecordType of this record can be configured with
the array_recordtype_name option and defaults to H5Ndarray. The record can
then be referenced within the cfood by the given recordname. Most
importantly, this record stores the internal path of this array within the HDF5
file in a text property, the name of which can be configured with the
internal_path_property_name option, which defaults to internal_hdf5_path.
ROCrateConverter
The ROCrateConverter unpacks ro-crate files and creates one instance of the
ROCrateEntity structure element for each contained object. Currently, only
zipped ro-crate files are supported. The created ROCrateEntities wrap a
rocrate.model.entity.Entity with a path to the folder the ROCrate data
is saved in. They are appended as children and can then be accessed via the
subtree and treated using the ROCrateEntityConverter.
To use the ROCrateConverter, you need to install the LinkAhead crawler with its
optional rocrate dependency.
ELNFileConverter
As .eln files are zipped ro-crate files, the ELNFileConverter works analogously to the ROCrateConverter and also creates ROCrateEntities for contained objects.
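Because an .eln file is a zip archive, its metadata can be inspected with the Python standard library alone. The following is a minimal sketch, not the converter's implementation: it builds a tiny in-memory archive as a stand-in for a real .eln file and reads the ro-crate-metadata.json it contains. The folder name `example`, the metadata content, and the helper `read_metadata` are hypothetical.

```python
import io
import json
import zipfile

# Build a tiny in-memory archive standing in for an .eln file.
# Folder name and metadata content are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "example/ro-crate-metadata.json",
        json.dumps({"@graph": [{"@id": "ro-crate-metadata.json",
                                "dateCreated": "2024-01-01T00:00:00"}]}),
    )


def read_metadata(fileobj):
    """Return the parsed ro-crate-metadata.json found in a zipped RO-Crate."""
    with zipfile.ZipFile(fileobj) as archive:
        for name in archive.namelist():
            # Accept the metadata file at the root or inside a folder.
            if name.rsplit("/", 1)[-1] == "ro-crate-metadata.json":
                return json.loads(archive.read(name))
    raise FileNotFoundError("no ro-crate-metadata.json in archive")


buffer.seek(0)
metadata = read_metadata(buffer)
entity_ids = [entity["@id"] for entity in metadata["@graph"]]
```

For a file on disk, `read_metadata` could be called with an opened binary file or a path instead of the in-memory buffer.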
ROCrateEntityConverter
The ROCrateEntityConverter unpacks the rocrate.model.entity.Entity wrapped
within a ROCrateEntity, and appends all properties, contained files, and parts
as children. Properties are converted to a basic element matching their value
(BooleanElement, IntegerElement, etc.) and can be matched using
match_properties. Each rocrate.model.file.File is converted to a crawler
File object, which can be matched with SimpleFile. Each subpart of the
ROCrateEntity is in turn converted to a ROCrateEntity, which can again be
treated using this converter.
The match_entity_type keyword can be used to match a ROCrateEntity using its
entity_type. With the match_properties keyword, properties of a ROCrateEntity
can be either matched or extracted, as seen in the cfood example below:
* With match_properties: "@id": ro-crate-metadata.json, the ROCrateEntities
are filtered so that only the metadata JSON files match.
* With match_properties: dateCreated: (?P<dateCreated>.*), the dateCreated
entry of that metadata JSON file is extracted and made accessible through the
dateCreated variable.
* The example could be extended to use any other entry present in the metadata
JSON to filter the results or to insert the extracted information into generated records.
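The filtering and extraction behaviour described in the list above relies on regular expressions with named groups. The sketch below illustrates that mechanism with Python's `re` module only. It is not the crawler's code: the helper `sketch_match_properties` and the sample property values are hypothetical, and the use of `re.fullmatch` is an assumption, since this text does not say how the crawler anchors its patterns.

```python
import re

# Hypothetical property values of two entities.
metadata_entity = {"@id": "ro-crate-metadata.json",
                   "dateCreated": "2024-01-01T00:00:00"}
other_entity = {"@id": "./", "dateCreated": "2024-01-01T00:00:00"}

# Patterns as they appear under match_properties in the cfood.
patterns = {"@id": r"ro-crate-metadata.json",
            "dateCreated": r"(?P<dateCreated>.*)"}


def sketch_match_properties(properties, patterns):
    """Return captured variables if every pattern matches, else None."""
    variables = {}
    for key, pattern in patterns.items():
        if key not in properties:
            return None
        match = re.fullmatch(pattern, str(properties[key]))
        if match is None:
            return None  # one failing property rejects the entity
        variables.update(match.groupdict())  # named groups become variables
    return variables


extracted = sketch_match_properties(metadata_entity, patterns)
rejected = sketch_match_properties(other_entity, patterns)
```

Here `extracted` holds the captured `dateCreated` value, while `rejected` is `None` because the `@id` of the second entity does not match.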
Example cfood
A short cfood that generates records for each .eln file in a directory and for its metadata file could be:
---
metadata:
  crawler-version: 0.9.0
---
Converters:
  ELNFile:
    converter: ELNFileConverter
    package: caoscrawler.converters.rocrate
  ROCrateEntity:
    converter: ROCrateEntityConverter
    package: caoscrawler.converters.rocrate
ParentDirectory:
  type: Directory
  match: (.*)
  subtree:
    ELNFile:
      type: ELNFile
      match: (?P<filename>.*)\.eln
      records:
        ELNExampleRecord:
          filename: $filename
      subtree:
        ROCrateEntity:
          type: ROCrateEntity
          match_properties:
            "@id": ro-crate-metadata.json
            dateCreated: (?P<dateCreated>.*)
          records:
            MDExampleRecord:
              parent: $ELNFile
              filename: ro-crate-metadata.json
              time: $dateCreated