caoscrawler.converters.converters module
Converters take structure elements and create Records and new structure elements from them.
- class caoscrawler.converters.converters.CrawlerTemplate(template)
Bases:
Template
- braceidpattern = '\\D[.\\w]*'
- pattern = re.compile('\n \\$(?:\n (?P<escaped>\\$) | # Escape sequence of two delimiters\n (?P<named>(?a:[_a-z][_a-z0-9]*)) | # delimiter and a Python identifier\n , re.IGNORECASE|re.VERBOSE)
- caoscrawler.converters.converters.str_to_bool(x)
- exception caoscrawler.converters.converters.ConverterValidationError(msg)
Bases:
Exception
To be raised if contents of an element to be converted are invalid.
- caoscrawler.converters.converters.create_path_value(func)
Decorator for create_values functions that adds a value containing the path.
Should be used for StructureElements that are associated with file system objects that have a path, like File or Directory.
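The decorator pattern described above can be sketched as follows. This is only an illustration, not the library's implementation: the stand-in classes `FakeFile` and `FakeConverter` are hypothetical, and the key under which the path is stored (`<converter name>.path`) is an assumption.

```python
from functools import wraps


def create_path_value_sketch(func):
    """Wrap a create_values-style method so that it also stores the element's path."""
    @wraps(func)
    def inner(self, values, element):
        func(self, values, element)
        # Assumption: the path is stored under "<converter name>.path".
        values[f"{self.name}.path"] = element.path
    return inner


class FakeFile:
    """Hypothetical stand-in for a file-like StructureElement."""

    def __init__(self, name, path):
        self.name = name
        self.path = path


class FakeConverter:
    """Hypothetical stand-in for a Converter; only `name` is needed here."""

    def __init__(self, name):
        self.name = name

    @create_path_value_sketch
    def create_values(self, values, element):
        # The undecorated method stores only the element's name.
        values[self.name] = element.name


store = {}
FakeConverter("DataFile").create_values(store, FakeFile("a.csv", "/data/a.csv"))
print(store)
```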
- caoscrawler.converters.converters.replace_variables(propvalue: Any, values: GeneralStore)
This function replaces variables in property values (and possibly other locations, where the crawler can replace cfood-internal variables).
If propvalue is a single variable name preceded by a $ (e.g. $var or ${var}), then the corresponding value stored in values is returned. In any other case the variable substitution is carried out as defined by string templates and a new string with the replaced variables is returned.
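The two cases can be illustrated with a minimal sketch built on the standard library's `string.Template`. It uses a plain dict in place of a GeneralStore, and the handling of unknown variables (left untouched) is an assumption, not documented behaviour.

```python
import re
from string import Template

# Matches a string that consists of exactly one variable: "$var" or "${var}".
_SINGLE_VAR = re.compile(r"^\$(\{)?(?P<name>[_a-zA-Z][_a-zA-Z0-9]*)(?(1)\})$")


def replace_variables_sketch(propvalue, values: dict):
    """Return the stored object for a lone variable, else a substituted string."""
    if not isinstance(propvalue, str):
        return propvalue
    single = _SINGLE_VAR.match(propvalue)
    if single is not None and single.group("name") in values:
        # A lone variable keeps the type of the stored value (int, list, ...).
        return values[single.group("name")]
    # Assumption: unknown variables are left in place rather than raising.
    return Template(propvalue).safe_substitute(values)


store = {"a": 4, "items": [1, 2]}
print(replace_variables_sketch("$a", store))        # the int 4
print(replace_variables_sketch("${items}", store))  # the list itself
print(replace_variables_sketch("id_$a", store))     # a new string
```

The point of the first branch is that a lone variable returns the stored object itself, while any surrounding text forces a string result.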
- caoscrawler.converters.converters.handle_value(value: dict | str | list, values: GeneralStore)
- Determine whether the given value needs to set a property, be added to an existing value (create a list), or be added as an additional property (multiproperty).
Variable names (starting with a “$”) are replaced by the corresponding value stored in the values GeneralStore.
- Parameters:
value (Union[dict, str, list]) –
If str, the value to be interpreted, e.g. “4”, “hello” or “$a”. No unit is set and the collection mode is determined from the first character:
- ‘+’ corresponds to “list”
- ‘*’ corresponds to “multiproperty”
- everything else is “single”
If dict, it must have a value key and may have unit and collection_mode keys. The returned tuple is directly created from the corresponding values if they are given; unit defaults to None and collection_mode is determined from value as explained for the str case above, i.e.,
- if it starts with ‘+’, collection mode is “list”,
- in case of ‘*’, collection mode is “multiproperty”,
- and everything else is “single”.
If list, each element is checked for variable replacement and the resulting list is used as the (list) value of the property.
- Returns:
out – a tuple of three items:
- the final value of the property; variable names contained in values are replaced.
- the final unit of the property; variable names contained in values are replaced.
- the collection mode (can be single, list or multiproperty)
- Return type:
tuple
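The str and dict cases can be sketched as below. This is a simplified illustration with a plain dict as the store. It assumes that the marker character is stripped from the value, and it looks up only a bare "$name"; list handling is left out.

```python
PREFIX_MODES = {"+": "list", "*": "multiproperty"}


def handle_value_sketch(value, store: dict):
    """Return (value, unit, collection_mode) for a str or dict definition."""
    unit = None
    mode = None
    if isinstance(value, dict):
        if "value" not in value:
            raise ValueError("a dict definition needs a 'value' key")
        unit = value.get("unit")
        mode = value.get("collection_mode")
        value = value["value"]
    if isinstance(value, str):
        # The first character selects the mode unless one was given explicitly.
        if mode is None and value[:1] in PREFIX_MODES:
            mode = PREFIX_MODES[value[0]]
            value = value[1:]  # assumption: the marker is not part of the value
        # Simplified variable replacement: only a bare "$name" is looked up.
        if value.startswith("$") and value[1:] in store:
            value = store[value[1:]]
    return value, unit, mode or "single"


print(handle_value_sketch("+$a", {"a": 4}))
print(handle_value_sketch({"value": "4", "unit": "m"}, {}))
```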
- caoscrawler.converters.converters.create_records(values: GeneralStore, records: RecordStore, def_records: dict)
- class caoscrawler.converters.converters.Converter(definition: dict, name: str, converter_registry: dict)
Bases:
object
Converters treat StructureElements contained in the hierarchical structure.
This is the abstract super class for all Converters.
- setup()
Analogous to cleanup. Can be used to set up variables that are permanently stored in this converter.
- static converter_factory(definition: dict, name: str, converter_registry: dict)
Create a Converter instance of the appropriate class.
The type key in the definition defines the Converter class which is being used.
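Dispatching on the type key can be pictured with the following sketch. The registry layout used here (a flat dict from type name to class) and all class names are hypothetical; the actual structure of converter_registry is not specified in this section.

```python
class ConverterSketch:
    """Hypothetical base class mirroring the documented constructor arguments."""

    def __init__(self, definition: dict, name: str, registry: dict):
        self.definition = definition
        self.name = name
        self.registry = registry


class DirectorySketch(ConverterSketch):
    pass


class FileSketch(ConverterSketch):
    pass


def converter_factory_sketch(definition: dict, name: str, registry: dict):
    """Instantiate the class that the definition's "type" key refers to."""
    if "type" not in definition:
        raise RuntimeError(f"converter {name!r} has no 'type' key")
    type_name = definition["type"]
    if type_name not in registry:
        raise RuntimeError(f"unknown converter type {type_name!r}")
    return registry[type_name](definition, name, registry)


registry = {"Directory": DirectorySketch, "File": FileSketch}
conv = converter_factory_sketch({"type": "Directory"}, "DataDir", registry)
print(type(conv).__name__)
```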
- create_values(values: GeneralStore, element: StructureElement)
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- apply_transformers(values: GeneralStore, transformer_functions: dict)
Check if transformers are defined using the “transform” keyword. Then apply the transformers to the variables defined in GeneralStore “values”.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
transformer_functions (dict) –
A dictionary of registered functions that can be used within this transformer block. The keys of the dict are the function keys and the values the callable functions of the form:
- def func(in_value: Any, in_parameters: dict) -> Any:
pass
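As an illustration, two functions of this form and a dictionary that registers them might look as follows. The function names, the "marker" parameter and the step layout consumed by `apply_sketch` are hypothetical and only show how such callables could be chained over a store.

```python
from typing import Any


def split_sketch(in_value: Any, in_parameters: dict) -> Any:
    """Split a string at a configurable marker (defaults to a comma)."""
    return str(in_value).split(in_parameters.get("marker", ","))


def upper_sketch(in_value: Any, in_parameters: dict) -> Any:
    """Upper-case the string representation of the value."""
    return str(in_value).upper()


# Keys are the function keys, values are the callables.
transformer_functions = {"split": split_sketch, "upper": upper_sketch}


def apply_sketch(values: dict, steps: list, functions: dict) -> None:
    """Apply each step to an input variable and store the result (hypothetical layout)."""
    for step in steps:
        func = functions[step["function"]]
        values[step["out"]] = func(values[step["in"]], step.get("params", {}))


values = {"raw": "a;b"}
apply_sketch(
    values,
    [
        {"function": "upper", "in": "raw", "out": "big"},
        {"function": "split", "in": "big", "out": "parts", "params": {"marker": ";"}},
    ],
    transformer_functions,
)
print(values["parts"])
```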
- abstract create_children(values: GeneralStore, element: StructureElement)
- create_records(values: GeneralStore, records: RecordStore, element: StructureElement)
- filter_children(children_with_strings: list[tuple[StructureElement, str]], expr: str, group: str, rule: str)
Filter children according to regexp expr and rule.
- abstract typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- static debug_matching(kind=None)
- abstract match(element: StructureElement) -> dict | None
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- cleanup()
This function is called when the converter runs out of scope and can be used to clean up objects that were needed in the converter or its children.
- class caoscrawler.converters.converters.DirectoryConverter(definition: dict, name: str, converter_registry: dict)
Bases:
Converter
Converter that matches and handles structure elements of type directory.
This is one typical starting point of a crawling procedure.
- create_children(generalStore: GeneralStore, element: StructureElement)
- create_values(values: GeneralStore, element: StructureElement)
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.SimpleFileConverter(definition: dict, name: str, converter_registry: dict)
Bases:
Converter
Just a file, ignore the contents.
- typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- create_children(generalStore: GeneralStore, element: StructureElement)
- create_values(values: GeneralStore, element: StructureElement)
Extract information from the structure element and store it as values in the general store.
- Parameters:
values (GeneralStore) – The GeneralStore to store values in.
element (StructureElement) – The StructureElement to extract values from.
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.FileConverter(*args, **kwargs)
Bases:
SimpleFileConverter
- class caoscrawler.converters.converters.MarkdownFileConverter(definition: dict, name: str, converter_registry: dict)
Bases:
SimpleFileConverter
Read the YAML header of markdown files (if such a header exists).
- create_children(generalStore: GeneralStore, element: StructureElement)
- caoscrawler.converters.converters.convert_basic_element(element: list | dict | bool | int | float | str | None, name=None, msg_prefix='')
Convert basic Python objects to the corresponding StructureElements.
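The type dispatch behind such a conversion can be sketched as below. The sketch returns only the name of an element kind instead of constructing objects; the kind names are assumptions derived from the converter class names in this module (NoneElement in particular is a guess).

```python
def element_kind_sketch(value):
    """Return the (assumed) name of the element kind for a basic Python object."""
    if value is None:
        return "NoneElement"
    # bool must be tested before int, because isinstance(True, int) is True.
    dispatch = (
        (bool, "BooleanElement"),
        (int, "IntegerElement"),
        (float, "FloatElement"),
        (str, "TextElement"),
        (list, "ListElement"),
        (dict, "DictElement"),
    )
    for py_type, kind in dispatch:
        if isinstance(value, py_type):
            return kind
    raise NotImplementedError(f"no element kind for {type(value).__name__}")


for sample in (True, 3, 2.5, "text", [1], {"k": 1}, None):
    print(repr(sample), "->", element_kind_sketch(sample))
```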
- caoscrawler.converters.converters.validate_against_json_schema(instance, schema_resource: dict | str)
Validate given instance against given schema_resource.
- Parameters:
instance – Instance to be validated, typically dict but can be list, str, etc.
schema_resource – Either a path to the JSON file containing the schema or a dict with the schema.
- class caoscrawler.converters.converters.DictElementConverter(definition: dict, name: str, converter_registry: dict)
Bases:
Converter
Operates on:
caoscrawler.structure_elements.DictElement
Generates:
caoscrawler.structure_elements.StructureElement
- create_children(generalStore: GeneralStore, element: StructureElement)
- typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.PropertiesFromDictConverter(definition: dict, name: str, converter_registry: dict, referenced_record_callback: callable | None = None)
Bases:
DictElementConverter
Extend the DictElementConverter by a heuristic to set property values from the dictionary keys.
- create_records(values: GeneralStore, records: RecordStore, element: StructureElement)
- class caoscrawler.converters.converters.DictConverter(*args, **kwargs)
Bases:
DictElementConverter
- class caoscrawler.converters.converters.DictDictElementConverter(*args, **kwargs)
Bases:
DictElementConverter
- class caoscrawler.converters.converters.JSONFileConverter(definition: dict, name: str, converter_registry: dict)
Bases:
SimpleFileConverter
- create_children(generalStore: GeneralStore, element: StructureElement)
- class caoscrawler.converters.converters.YAMLFileConverter(definition: dict, name: str, converter_registry: dict)
Bases:
SimpleFileConverter
- create_children(generalStore: GeneralStore, element: StructureElement)
- caoscrawler.converters.converters.match_name_and_value(definition, name, value)
- Take match definitions from the definition argument and apply regular expressions to name and possibly value.
One of the keys ‘match_name’ and ‘match’ needs to be available in definition; ‘match_value’ is optional.
- Returns:
out – None, if match_name or match lead to no match. Otherwise, a dictionary with the matched groups, possibly including matches from using match_value.
- Return type:
dict | None
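A minimal sketch of this matching logic, using only the standard `re` module, is given below. It assumes that a failing match_value also yields None and that the value is converted to str before matching; neither is stated above.

```python
import re


def match_name_and_value_sketch(definition: dict, name: str, value):
    """Return the merged named groups of the name and value matches, or None."""
    name_pattern = definition.get("match_name", definition.get("match"))
    if name_pattern is None:
        raise ValueError("definition needs 'match_name' or 'match'")
    name_match = re.match(name_pattern, name)
    if name_match is None:
        return None
    groups = name_match.groupdict()
    if "match_value" in definition:
        # Assumption: values are matched via their string representation.
        value_match = re.match(definition["match_value"], str(value))
        if value_match is None:
            return None  # assumption: a failed value match also means no match
        groups.update(value_match.groupdict())
    return groups


definition = {"match_name": r"(?P<key>temp_.*)", "match_value": r"(?P<num>\d+)"}
print(match_name_and_value_sketch(definition, "temp_a", 42))
print(match_name_and_value_sketch(definition, "pressure", 42))
```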
- class caoscrawler.converters.converters.BooleanElementConverter(definition: dict, name: str, converter_registry: dict)
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': True, 'accept_float': False, 'accept_int': True, 'accept_text': False}
- class caoscrawler.converters.converters.DictBooleanElementConverter(*args, **kwargs)
Bases:
BooleanElementConverter
- class caoscrawler.converters.converters.FloatElementConverter(definition: dict, name: str, converter_registry: dict)
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': False, 'accept_float': True, 'accept_int': True, 'accept_text': False}
- class caoscrawler.converters.converters.DictFloatElementConverter(*args, **kwargs)
Bases:
FloatElementConverter
- class caoscrawler.converters.converters.TextElementConverter(definition, *args, **kwargs)
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': True, 'accept_float': True, 'accept_int': True, 'accept_text': True}
- class caoscrawler.converters.converters.DictTextElementConverter(*args, **kwargs)
Bases:
TextElementConverter
- class caoscrawler.converters.converters.IntegerElementConverter(definition: dict, name: str, converter_registry: dict)
Bases:
_AbstractScalarValueElementConverter
- default_matches = {'accept_bool': False, 'accept_float': False, 'accept_int': True, 'accept_text': False}
- class caoscrawler.converters.converters.DictIntegerElementConverter(*args, **kwargs)
Bases:
IntegerElementConverter
- class caoscrawler.converters.converters.ListElementConverter(definition: dict, name: str, converter_registry: dict)
Bases:
Converter
- create_children(generalStore: GeneralStore, element: StructureElement)
- typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.DictListElementConverter(*args, **kwargs)
Bases:
ListElementConverter
- class caoscrawler.converters.converters.TableConverter(definition: dict, name: str, converter_registry: dict)
Bases:
Converter
This converter reads tables in different formats line by line and allows matching the corresponding rows.
The subtree generated by the table converter consists of DictElements, each being a row. The corresponding header elements will become the dictionary keys.
The rows can be matched using a DictElementConverter.
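The row-to-dictionary mapping can be illustrated with the standard library's `csv.DictReader`. This is a stand-in for illustration only; it makes no claim about which library the converter itself uses to read tables.

```python
import csv
import io


def rows_as_dicts_sketch(text: str) -> list:
    """Read CSV text and return one dict per row, keyed by the header cells."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


table = "sample,mass\nA,1.5\nB,2.0\n"
rows = rows_as_dicts_sketch(table)
for row in rows:
    # Each row dict plays the role of one DictElement in the subtree.
    print(row)
```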
- get_options() -> dict
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- typecheck(element: StructureElement)
Check whether the current structure element can be converted using this converter.
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.XLSXTableConverter(definition: dict, name: str, converter_registry: dict)
Bases:
TableConverter
Operates on:
caoscrawler.structure_elements.File
Generates:
caoscrawler.structure_elements.DictElement
- get_options()
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- create_children(generalStore: GeneralStore, element: StructureElement)
- class caoscrawler.converters.converters.CSVTableConverter(definition: dict, name: str, converter_registry: dict)
Bases:
TableConverter
- get_options()
Get specific options, e.g. from self.definitions.
This method may be overwritten by the specific table converter to provide information about the possible options. Implementors may use TableConverter._get_options(...) to get (and convert) options from self.definitions.
- Returns:
out – An options dict.
- Return type:
dict
- create_children(generalStore: GeneralStore, element: StructureElement)
- class caoscrawler.converters.converters.DateElementConverter(definition, *args, **kwargs)
Bases:
TextElementConverter
Allows converting different text formats of dates to Python date objects.
The text to be parsed must be contained in the “date” group. The format string can be supplied under “date_format” in the Converter definition. Parsing relies on the datetime library; see its documentation for how to write the format string.
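The interplay of the “date” group and the format string can be sketched with `re` and `datetime.strptime`. The function name and the default format used here are assumptions for illustration.

```python
import re
from datetime import date, datetime


def parse_date_sketch(text: str, pattern: str, date_format: str = "%Y-%m-%d"):
    """Parse the named group "date" of a regex match into a date object."""
    match = re.match(pattern, text)
    if match is None or "date" not in match.groupdict():
        return None
    # strptime yields a datetime; .date() drops the time part.
    return datetime.strptime(match.group("date"), date_format).date()


parsed = parse_date_sketch(
    "measured on 24.12.2023",
    r"measured on (?P<date>.*)",
    date_format="%d.%m.%Y",
)
print(parsed)
```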
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.
- class caoscrawler.converters.converters.DatetimeElementConverter(definition, *args, **kwargs)
Bases:
TextElementConverter
Convert text so that it is formatted in a way that LinkAhead can understand it.
The text to be parsed must be in the val parameter. The format string can be supplied in the datetime_format node. This class uses the datetime module, so datetime_format must follow this specification: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
- match(element: StructureElement)
This method is used to implement detailed checks for matching compatibility of the current structure element with this converter.
The return value is a dictionary providing possible matched variables from the structure element's information.