caoscrawler.converters module

class caoscrawler.converters.BooleanElementConverter(definition: dict, name: str, converter_registry: dict)

Bases: _AbstractScalarValueElementConverter

default_matches = {'accept_bool': True, 'accept_float': False, 'accept_int': True, 'accept_text': False}
class caoscrawler.converters.CSVTableConverter(definition: dict, name: str, converter_registry: dict)

Bases: TableConverter

create_children(generalStore: GeneralStore, element: StructureElement)
get_options()

This method must be overridden by the specific table converter to provide information about the possible options.

class caoscrawler.converters.Converter(definition: dict, name: str, converter_registry: dict)

Bases: object

Converters treat StructureElements contained in the hierarchical structure.

apply_transformers(values: GeneralStore, transformer_functions: dict)

Check if transformers are defined using the “transform” keyword. Then apply the transformers to the variables defined in GeneralStore “values”.

Parameters:
  • values (GeneralStore) – The GeneralStore to store values in.

  • transformer_functions (dict) –

    A dictionary of registered functions that can be used within this transformer block. The keys of the dict are the function keys and the values the callable functions of the form:

    def func(in_value: Any, in_parameters: dict) -> Any:
        pass
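As a minimal sketch of the callable form described above, a transformer and the dictionary that registers it might look like the following. The function name `split_value`, the `marker` parameter and the registry key are hypothetical and chosen only for illustration; they are not part of caoscrawler.

```python
from typing import Any


def split_value(in_value: Any, in_parameters: dict) -> Any:
    """Hypothetical transformer: split a string at a configurable marker."""
    # "marker" is an assumed parameter name used only in this example.
    marker = in_parameters.get("marker", ",")
    return str(in_value).split(marker)


# A dict of the shape described above: function key -> callable.
transformer_functions = {"split_value": split_value}

result = transformer_functions["split_value"]("a;b;c", {"marker": ";"})
print(result)  # ['a', 'b', 'c']
```

Every transformer takes the input value and a parameter dict, so functions can be swapped in the registry without changing how they are called.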

static converter_factory(definition: dict, name: str, converter_registry: dict)

Creates a Converter instance of the appropriate class.

The type key in the definition determines which Converter class is used.

abstract create_children(values: GeneralStore, element: StructureElement)
create_records(values: GeneralStore, records: RecordStore, element: StructureElement)
create_values(values: GeneralStore, element: StructureElement)

Extract information from the structure element and store it as values in the general store.

Parameters:
  • values (GeneralStore) – The GeneralStore to store values in.

  • element (StructureElement) – The StructureElement to extract values from.

static debug_matching(kind=None)
filter_children(children_with_strings: List[Tuple[StructureElement, str]], expr: str, group: str, rule: str)

Filter children according to regexp expr and rule.

abstract match(element: StructureElement) -> dict | None

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

abstract typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

exception caoscrawler.converters.ConverterValidationError(msg)

Bases: Exception

To be raised if contents of an element to be converted are invalid.

class caoscrawler.converters.CrawlerTemplate(template)

Bases: Template

braceidpattern = '(?a:[_a-z][_\\.a-z0-9]*)'
pattern = re.compile('\n            \\$(?:\n              (?P<escaped>\\$)  |   # Escape sequence of two delimiters\n              (?P<named>(?a:[_a-z][_a-z0-9]*))       |   # delimiter and a Python identifier\n          , re.IGNORECASE|re.VERBOSE)
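The effect of the `braceidpattern` attribute can be illustrated with a small subclass of `string.Template`. This is a sketch only: `DottedTemplate` is a hypothetical stand-in that copies the attribute value listed above, and the variable names are made up for the example.

```python
from string import Template


class DottedTemplate(Template):
    """Hypothetical stand-in for CrawlerTemplate, for illustration only."""

    # Same value as the braceidpattern attribute listed above:
    # identifiers inside ${...} may contain dots.
    braceidpattern = r"(?a:[_a-z][_\.a-z0-9]*)"


tmpl = DottedTemplate("File ${dir.name}/$fname")
text = tmpl.substitute({"dir.name": "data", "fname": "run1.csv"})
print(text)  # File data/run1.csv
```

With the default `string.Template`, the placeholder `${dir.name}` would be rejected as invalid because dots are not allowed in identifiers.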
class caoscrawler.converters.DateElementConverter(definition, *args, **kwargs)

Bases: TextElementConverter

Converts dates given in various text formats to Python date objects.

The text to be parsed must be contained in the “date” group. The format string can be supplied under “dateformat” in the Converter definition. Parsing relies on Python's datetime library; see its documentation for how to write the format string.
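To illustrate what such a format string does, here is a short sketch using `datetime.strptime`. The sample text and format are assumptions for the example, and whether the converter calls `strptime` in exactly this way is not stated in this reference.

```python
from datetime import date, datetime

# Assumed inputs for illustration: the text captured by the "date" group
# and a format string as it might be given under "dateformat".
date_text = "2023-04-17"
dateformat = "%Y-%m-%d"

parsed = datetime.strptime(date_text, dateformat).date()
print(parsed)  # 2023-04-17
```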

match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

class caoscrawler.converters.DictBooleanElementConverter(*args, **kwargs)

Bases: BooleanElementConverter

metadata: dict[str, set[str]]
class caoscrawler.converters.DictConverter(*args, **kwargs)

Bases: DictElementConverter

class caoscrawler.converters.DictDictElementConverter(*args, **kwargs)

Bases: DictElementConverter

class caoscrawler.converters.DictElementConverter(definition: dict, name: str, converter_registry: dict)

Bases: Converter

Operates on: caoscrawler.structure_elements.DictElement

Generates: caoscrawler.structure_elements.StructureElement

create_children(generalStore: GeneralStore, element: StructureElement)
match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

class caoscrawler.converters.DictFloatElementConverter(*args, **kwargs)

Bases: FloatElementConverter

class caoscrawler.converters.DictIntegerElementConverter(*args, **kwargs)

Bases: IntegerElementConverter

class caoscrawler.converters.DictListElementConverter(*args, **kwargs)

Bases: ListElementConverter

class caoscrawler.converters.DictTextElementConverter(*args, **kwargs)

Bases: TextElementConverter

class caoscrawler.converters.DirectoryConverter(definition: dict, name: str, converter_registry: dict)

Bases: Converter

create_children(generalStore: GeneralStore, element: StructureElement)
static create_children_from_directory(element: Directory)

Creates a list of files (of type File) and directories (of type Directory) for a given directory. No recursion.

element: A directory (of type Directory) which will be traversed.
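The non-recursive listing described above can be sketched with the standard library. This is not the library's implementation: `list_directory` is a hypothetical helper that returns plain `(kind, name)` tuples instead of File and Directory objects.

```python
import os
import tempfile


def list_directory(path: str) -> list[tuple[str, str]]:
    """Return (kind, name) pairs for the direct children of path."""
    children = []
    with os.scandir(path) as entries:
        # Only direct children are visited; subdirectories are not entered.
        for entry in sorted(entries, key=lambda e: e.name):
            kind = "Directory" if entry.is_dir() else "File"
            children.append((kind, entry.name))
    return children


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "nested.txt"), "w") as fh:
        fh.write("not listed")
    with open(os.path.join(root, "a.txt"), "w") as fh:
        fh.write("listed")
    listing = list_directory(root)

print(listing)  # [('File', 'a.txt'), ('Directory', 'sub')]
```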

create_values(values: GeneralStore, element: StructureElement)

Extract information from the structure element and store it as values in the general store.

Parameters:
  • values (GeneralStore) – The GeneralStore to store values in.

  • element (StructureElement) – The StructureElement to extract values from.

match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

class caoscrawler.converters.FileConverter(*args, **kwargs)

Bases: SimpleFileConverter

class caoscrawler.converters.FloatElementConverter(definition: dict, name: str, converter_registry: dict)

Bases: _AbstractScalarValueElementConverter

default_matches = {'accept_bool': False, 'accept_float': True, 'accept_int': True, 'accept_text': False}
class caoscrawler.converters.IntegerElementConverter(definition: dict, name: str, converter_registry: dict)

Bases: _AbstractScalarValueElementConverter

default_matches = {'accept_bool': False, 'accept_float': False, 'accept_int': True, 'accept_text': False}
class caoscrawler.converters.JSONFileConverter(definition: dict, name: str, converter_registry: dict)

Bases: SimpleFileConverter

create_children(generalStore: GeneralStore, element: StructureElement)
class caoscrawler.converters.ListElementConverter(definition: dict, name: str, converter_registry: dict)

Bases: Converter

create_children(generalStore: GeneralStore, element: StructureElement)
match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

class caoscrawler.converters.MarkdownFileConverter(definition: dict, name: str, converter_registry: dict)

Bases: SimpleFileConverter

Read the YAML header of Markdown files (if such a header exists).
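As a rough sketch of what locating such a header can look like, the helper below extracts the text between a leading `---` line and the next `---` or `...` line. The function name and the delimiter handling are assumptions for illustration; the converter's actual parsing may differ, and the extracted text would still need to be parsed by a YAML library.

```python
from typing import Optional


def extract_yaml_header(text: str) -> Optional[str]:
    """Return the raw header text, or None if the file has no header."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for idx in range(1, len(lines)):
        if lines[idx].strip() in ("---", "..."):
            return "\n".join(lines[1:idx])
    # An opening delimiter without a closing one is treated as "no header".
    return None


with_header = "---\ntitle: Experiment\ndate: 2020-01-01\n---\n# Notes\n"
header = extract_yaml_header(with_header)
print(header)
print(extract_yaml_header("# Notes only\n"))  # None
```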

create_children(generalStore: GeneralStore, element: StructureElement)
class caoscrawler.converters.SimpleFileConverter(definition: dict, name: str, converter_registry: dict)

Bases: Converter

Just a file, ignore the contents.

create_children(generalStore: GeneralStore, element: StructureElement)
create_values(values: GeneralStore, element: StructureElement)

Extract information from the structure element and store it as values in the general store.

Parameters:
  • values (GeneralStore) – The GeneralStore to store values in.

  • element (StructureElement) – The StructureElement to extract values from.

match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

class caoscrawler.converters.TableConverter(definition: dict, name: str, converter_registry: dict)

Bases: Converter

This converter reads tables in different formats line by line and allows matching the corresponding rows.

The subtree generated by the table converter consists of DictElements, each being a row. The corresponding header elements will become the dictionary keys.

The rows can be matched using a DictElementConverter.
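The row-to-dictionary mapping described above can be pictured with the standard `csv` module. This is only an analogy under assumed sample data; it does not use the converter classes and says nothing about how they read tables internally.

```python
import csv
import io

# Assumed sample table: the first line holds the header elements.
table_text = "name,temperature\nsample_a,20.5\nsample_b,21.0\n"

# Each row becomes a dict whose keys are taken from the header.
rows = list(csv.DictReader(io.StringIO(table_text)))
for row in rows:
    print(row)
```

Each printed dictionary corresponds to what one row-level DictElement would carry: header names as keys and the cell contents as values.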

abstract get_options()

This method must be overridden by the specific table converter to provide information about the possible options.

match(element: StructureElement)

This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.

The return value is a dictionary providing possible matched variables from the structure element's information.

typecheck(element: StructureElement)

Check whether the current structure element can be converted using this converter.

class caoscrawler.converters.TextElementConverter(definition, *args, **kwargs)

Bases: _AbstractScalarValueElementConverter

default_matches = {'accept_bool': True, 'accept_float': True, 'accept_int': True, 'accept_text': True}
class caoscrawler.converters.XLSXTableConverter(definition: dict, name: str, converter_registry: dict)

Bases: TableConverter

Operates on: caoscrawler.structure_elements.File

Generates: caoscrawler.structure_elements.DictElement

create_children(generalStore: GeneralStore, element: StructureElement)
get_options()

This method must be overridden by the specific table converter to provide information about the possible options.

class caoscrawler.converters.YAMLFileConverter(definition: dict, name: str, converter_registry: dict)

Bases: SimpleFileConverter

create_children(generalStore: GeneralStore, element: StructureElement)
caoscrawler.converters.convert_basic_element(element: list | dict | bool | int | float | str | None, name=None, msg_prefix='')

Convert basic Python objects to the corresponding StructureElements

caoscrawler.converters.create_path_value(func)

Decorator for create_values functions that adds a value containing the path.

Should be used for StructureElements that are associated with file system objects that have a path, such as File or Directory.

caoscrawler.converters.create_records(values: GeneralStore, records: RecordStore, def_records: dict)
caoscrawler.converters.handle_value(value: dict | str | list, values: GeneralStore)
Determine whether the given value needs to set a property, be added to an existing value (creating a list), or be added as an additional property (multiproperty).

Variable names (starting with a “$”) are replaced by the corresponding value stored in the values GeneralStore.

Parameters:

value

  • if str, the value to be interpreted. E.g. “4”, “hallo” or “$a” etc.

  • if dict, must have keys “value” and “collection_mode”. The returned tuple is directly created from the corresponding values.

  • if list, each element is checked for replacement and the resulting list will be used as (list) value for the property

Returns:

out

  • the final value of the property; variable names contained in values are replaced.

  • the collection mode (can be single, list or multiproperty)

Return type:

tuple

caoscrawler.converters.match_name_and_value(definition, name, value)
Take match definitions from the definition argument and apply regular expressions to name and possibly value.

One of the keys ‘match_name’ and ‘match’ needs to be available in definition; ‘match_value’ is optional.

Returns:

None if match_name or match leads to no match. Otherwise, a dictionary with the matched groups, possibly including matches from using match_value.

Return type:

dict | None
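The matching behaviour described above might be sketched as follows. This is a simplified illustration, not the library's code: the function name `match_name_and_value_sketch`, the use of `re.match`, and returning None when `match_value` fails are all assumptions.

```python
import re
from typing import Any, Optional


def match_name_and_value_sketch(definition: dict, name: str, value: Any) -> Optional[dict]:
    """Simplified illustration of matching a name and, optionally, a value."""
    name_pattern = definition.get("match_name", definition.get("match"))
    if name_pattern is None:
        raise ValueError("definition needs 'match_name' or 'match'")
    m_name = re.match(name_pattern, name)
    if m_name is None:
        return None
    groups = m_name.groupdict()
    if "match_value" in definition:
        m_value = re.match(definition["match_value"], str(value))
        if m_value is None:
            # Assumption: a failed value match also counts as "no match".
            return None
        groups.update(m_value.groupdict())
    return groups


definition = {"match_name": r"temp_(?P<unit>\w+)", "match_value": r"(?P<number>\d+)"}
matched = match_name_and_value_sketch(definition, "temp_K", 293)
print(matched)  # {'unit': 'K', 'number': '293'}
print(match_name_and_value_sketch(definition, "pressure", 1))  # None
```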

caoscrawler.converters.replace_variables(propvalue: Any, values: GeneralStore)

This function replaces variables in property values (and possibly other locations, where the crawler can replace cfood-internal variables).

If propvalue is a single variable name preceded by a ‘$’ (e.g. ‘$var’ or ‘${var}’), then the corresponding value stored in values is returned. In any other case the variable substitution is carried out as defined by string templates and a new string with the replaced variables is returned.
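The two cases described above can be sketched with `string.Template` and a plain dict standing in for the GeneralStore. The function name `replace_variables_sketch`, the regular expression, and the choice to leave unknown variables untouched are assumptions for illustration, not the library's implementation.

```python
import re
from string import Template
from typing import Any

# Matches a string that consists of exactly one variable: $var or ${var}.
SINGLE_VARIABLE = re.compile(
    r"\$(?:\{(?P<braced>[_a-zA-Z][_a-zA-Z0-9]*)\}|(?P<plain>[_a-zA-Z][_a-zA-Z0-9]*))"
)


def replace_variables_sketch(propvalue: str, values: dict) -> Any:
    """Return the stored value for a lone variable, else a substituted string."""
    single = SINGLE_VARIABLE.fullmatch(propvalue)
    if single:
        key = single.group("braced") or single.group("plain")
        if key in values:
            return values[key]  # the stored object, with its original type
    # Assumption: unknown variables are left as they are (safe_substitute).
    return Template(propvalue).safe_substitute(values)


values = {"a": 4, "name": "run"}
print(replace_variables_sketch("$a", values))           # 4
print(replace_variables_sketch("${name}_$a", values))   # run_4
```

Note the difference in return type: the lone variable yields the stored integer, while the mixed string yields a new string.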

caoscrawler.converters.str_to_bool(x)
caoscrawler.converters.validate_against_json_schema(instance, schema_resource: dict | str)

Validate given instance against given schema_resource.

Parameters:
  • instance – Instance to be validated, typically dict but can be list, str, etc.

  • schema_resource – Either a path to the JSON file containing the schema or a dict with the schema.