Further converters

More converters, together with cfood definitions and examples, can be found in the LinkAhead Crawler Extensions Subgroup on GitLab. In the following, we list converters that are shipped with the crawler library itself but are not part of the set of standard converters; they may require the library to be installed with additional optional dependencies.

HDF5 Converters

For treating HDF5 files, there are four converters in total, corresponding to the internal structure of HDF5 files: the H5FileConverter opens the file itself and creates further structure elements from HDF5 groups, datasets, and included multi-dimensional arrays, which are in turn treated by the H5GroupConverter, the H5DatasetConverter, and the H5NdarrayConverter, respectively. To use these converters, you need to install the LinkAhead crawler with its optional h5-crawler dependency.
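
Since these converters are not part of the standard set, they typically have to be registered in the cfood definition before they can be used. The following is a minimal sketch of such a registration; the module path given in the package entries is an assumption and may differ between crawler versions, so please check it against your installation and the unit tests:

Converters:
  H5File:
    converter: H5FileConverter
    package: caoscrawler.converters.hdf5_converter
  H5Group:
    converter: H5GroupConverter
    package: caoscrawler.converters.hdf5_converter
  H5Dataset:
    converter: H5DatasetConverter
    package: caoscrawler.converters.hdf5_converter
  H5Ndarray:
    converter: H5NdarrayConverter
    package: caoscrawler.converters.hdf5_converter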

The basic idea when crawling HDF5 files is to treat them very similarly to dictionaries: attributes on root, group, or dataset level are treated essentially like BooleanElement, TextElement, FloatElement, and IntegerElement in a dictionary. They are appended as children and can be accessed via the subtree. The file itself and the groups within may contain further groups and datasets, which can have their own attributes, subgroups, and datasets, much like DictElements within a dictionary.

The main difference from other dictionary types is the presence of multi-dimensional arrays within HDF5 datasets. Since LinkAhead has no datatype corresponding to these, and since it is not desirable to store such arrays directly within LinkAhead for reasons of performance and searchability, we wrap them in a specific Record as explained below, together with further metadata and their internal path within the HDF5 file. Users can thus query for datasets and their arrays according to their metadata within LinkAhead and then use the internal path information to access the dataset within the file directly. The type of this record and the property for storing the internal path need to be reflected in the datamodel. Using the default names, you would need a datamodel like

H5Ndarray:
  obligatory_properties:
    internal_hdf5-path:
      datatype: TEXT

although the names of both property and record type can be configured within the cfood definition.

A simple example of a cfood definition for HDF5 files can be found in the unit tests and shows how the individual converters are used in order to crawl a simple example file containing groups, subgroups, and datasets, together with their respective attributes.
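
For orientation, the following sketch shows the general shape of such a definition, assuming the converters have been registered under the type names H5File, H5Group, H5Dataset, and H5Ndarray as above. All element names, match patterns, and record names below are illustrative placeholders rather than the actual contents of the unit test cfood:

ExampleH5File:
  type: H5File
  match: (.*)\.(h5|hdf5)$
  subtree:
    ExampleGroup:
      type: H5Group
      match_name: .*
      subtree:
        ExampleDataset:
          type: H5Dataset
          match_name: .*
          subtree:
            ExampleArray:
              type: H5Ndarray
              match_name: .*
              recordname: ndarray_record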

H5FileConverter

This is an extension of the SimpleFileConverter class. It opens the HDF5 file and creates children for any contained group or dataset. Additionally, the root-level attributes of the HDF5 file are accessible as children.

H5GroupConverter

This is an extension of the DictElementConverter class. Children are created for all subgroups and datasets in this HDF5 group. Additionally, the group-level attributes are accessible as children.

H5DatasetConverter

This is an extension of the DictElementConverter class. Most importantly, it stores the array data contained in the HDF5 dataset in an H5NdarrayElement, which is added to its children; the dataset attributes are added as children as well.

H5NdarrayConverter

This converter creates a wrapper record for the contained dataset. The name of this record needs to be specified in the cfood definition of this converter via the recordname option; via this name, the record can then be used within the cfood. The RecordType of the record can be configured with the array_recordtype_name option, which defaults to H5Ndarray. Most importantly, the record stores the internal path of the array within the HDF5 file in a text property, whose name can be configured with the internal_path_property_name option, which defaults to internal_hdf5_path.
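
Sticking with the placeholder names from the sketch above, the relevant part of a cfood definition might look like the following; the type name and match pattern are again assumptions, and the last two options simply restate the defaults mentioned above:

ExampleArray:
  type: H5Ndarray
  match_name: .*
  recordname: ndarray_record                        # record accessible as "ndarray_record" within the cfood
  array_recordtype_name: H5Ndarray                  # default RecordType of the wrapper record
  internal_path_property_name: internal_hdf5_path   # default name of the text property holding the internal path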