caoscrawler.scanner module

This is the scanner module.

It contains the functionality that extracts data from the file system. Formerly, this was the location of the _crawl(...) function from crawl.py.

caoscrawler.scanner.create_converter_registry(definition: dict)

Currently the converter registry is a dictionary containing for each converter:
  • key is the short code, abbreviation for the converter class name

  • module is the name of the module to be imported which must be installed

  • class is the converter class to load and associate with this converter entry

Formerly known as “load_converters”.

All other info for the converter needs to be included in the converter plugin directory:
  • schema.yml file

  • README.md documentation
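A registry of this shape can be written out as a plain dictionary. The converter name "MyTableConverter" and the module path "my_plugin.converters" below are illustrative assumptions, not entries shipped with the package:

```python
# Hypothetical converter registry; the short code, module path and class
# name are made-up examples of the structure described above.
converter_registry = {
    "MyTableConverter": {                     # key: short code used in the cfood
        "module": "my_plugin.converters",     # module to import (must be installed)
        "class": "MyTableConverter",          # converter class inside that module
    },
}

# Every entry must carry both lookup keys.
for name, entry in converter_registry.items():
    assert {"module", "class"} <= entry.keys()
```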

caoscrawler.scanner.create_transformer_registry(definition: dict[str, dict[str, str]])

Currently the transformer registry is a dictionary containing for each transformer:
  • key is the short code, abbreviation for the transformer function name

  • module is the name of the module to be imported which must be installed

  • class is the transformer function to load and associate with this transformer entry

All other info for the transformer needs to be included in the converter plugin directory:
  • schema.yml file

  • README.md documentation

Please refer to the docstring of function “scanner” for more information about the detailed structure of the transformer functions.
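The definition passed to create_transformer_registry follows the same key/module/class shape. The short code "submatch" and the module path below are illustrative assumptions:

```python
# Hypothetical transformer definition as it might look after parsing a
# cfood; "submatch" and "my_plugin.transformers" are made-up names.
transformer_definition = {
    "submatch": {                             # key: short code used in "transform" blocks
        "module": "my_plugin.transformers",   # module to import (must be installed)
        "class": "submatch",                  # transformer function inside that module
    },
}
```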

caoscrawler.scanner.initialize_converters(crawler_definition: dict, converter_registry: dict)

Takes the cfood as a dict (crawler_definition) and creates the converter objects that are defined on the highest level. Child converters will in turn be created during the initialization of their parent converters.

caoscrawler.scanner.load_definition(crawler_definition_path: str)

Load a cfood from the crawler definition file located at crawler_definition_path and validate it using cfood-schema.yml.
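The file at crawler_definition_path is a YAML cfood. A minimal sketch of such a definition is shown below; the converter names "ExperimentDir" and "DataFile" and the match patterns are illustrative assumptions, not part of the schema:

```yaml
# Hypothetical minimal cfood: a directory converter with one file
# converter in its subtree. Names and patterns are made up.
ExperimentDir:
  type: Directory
  match: experiments
  subtree:
    DataFile:
      type: SimpleFile
      match: data\.csv
```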

caoscrawler.scanner.scan_directory(dirname: str, crawler_definition_path: str, restricted_path: list[str] | None = None, debug_tree: DebugTree | None = None)

Crawl a single directory.

Formerly known as “crawl_directory”.

Convenience function that starts the crawler (calls scan_structure_elements, formerly start_crawling) with a single directory as the initial StructureElement.

Parameters:

restricted_path (optional, list of strings) – Traverse the data tree only along the given path. When the end of the given path is reached, traverse the full tree as normal. See docstring of ‘scanner’ for more details.

Returns:

crawled_data – the final list with the target state of Records.

Return type:

list
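The restricted_path behaviour described above can be illustrated with a small, self-contained sketch (this is a toy model, not caoscrawler code): at each level only the element named by the head of the path is followed, and once the path is exhausted, everything is traversed as normal.

```python
# Toy model of restricted_path semantics (not caoscrawler code).
def select_children(children, restricted_path):
    if not restricted_path:            # end of path reached: no restriction
        return list(children), []
    head, rest = restricted_path[0], restricted_path[1:]
    # Only the element matching the path head is followed at this level.
    return [c for c in children if c == head], rest

# A directory containing files "a", "b" and "c", restricted to "b":
kept, remaining = select_children(["a", "b", "c"], ["b"])
# kept == ["b"]; remaining == [], so deeper levels are traversed in full
```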

caoscrawler.scanner.scan_structure_elements(items: list[StructureElement] | StructureElement, crawler_definition: dict, converter_registry: dict, restricted_path: list[str] | None = None, debug_tree: DebugTree | None = None, registered_transformer_functions: dict | None = None) list[Record]

Start point of the crawler recursion.

Formerly known as “start_crawling”.

Parameters:
  • items (list) – A list of structure elements (or a single StructureElement) that is used for generating the initial items for the crawler. This could e.g. be a Directory.

  • crawler_definition (dict) – A dictionary representing the crawler definition, possibly from a yaml file.

  • restricted_path (list[str], optional) – Traverse the data tree only along the given path. When the end of the given path is reached, traverse the full tree as normal. See docstring of ‘scanner’ for more details.

Returns:

crawled_data – the final list with the target state of Records.

Return type:

list[db.Record]

caoscrawler.scanner.scanner(items: list[StructureElement], converters: list[Converter], general_store: GeneralStore | None = None, record_store: RecordStore | None = None, structure_elements_path: list[str] | None = None, converters_path: list[str] | None = None, restricted_path: list[str] | None = None, crawled_data: list[Record] | None = None, debug_tree: DebugTree | None = None, registered_transformer_functions: dict | None = None) list[Record]

Crawl a list of StructureElements and apply any matching converters.

Formerly known as _crawl(...).

Parameters:
  • items (list[StructureElement]) – structure_elements (e.g. files and folders on one level on the hierarchy)

  • converters (list[Converter]) – locally defined converters for treating structure elements. A locally defined converter could be one that is only valid for a specific subtree of the originally crawled StructureElement structure.

  • general_store (GeneralStore, optional) – This recursion of the crawl function should only operate on copies of the global stores of the Crawler object.

  • record_store (RecordStore, optional) – This recursion of the crawl function should only operate on copies of the global stores of the Crawler object.

  • restricted_path (list[str], optional) – traverse the data tree only along the given path. For example, when a directory contains files a, b and c, and b is given as restricted_path, a and c will be ignored by the crawler. When the end of the given path is reached, traverse the full tree as normal. The first element of the list provided by restricted_path should be the name of the StructureElement at this level, i.e. denoting the respective element in the items argument.

  • registered_transformer_functions (dict, optional) –

    A dictionary of transformer functions that can be used in the “transform” block of a converter and that allow simple transformations to be applied to variables extracted by the current converter or to other variables found in the current variable store.

    Each entry of the dictionary consists of:

    • The key is the name of the function to be looked up in the dictionary of registered transformer functions.

    • The value is the function, which needs to be of the form:

      def func(in_value: Any, in_parameters: dict) -> Any:
          pass
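Assuming the registry maps names to plain callables of this signature, an entry might look as follows. The function name "split" and the parameter key "marker" are illustrative assumptions, not functions shipped with caoscrawler:

```python
from typing import Any

# Hypothetical transformer function; "split" and the parameter key
# "marker" are made-up names for illustration.
def split(in_value: Any, in_parameters: dict) -> Any:
    """Split a string value on a separator taken from the parameters."""
    return in_value.split(in_parameters.get("marker", " "))

# The key is the name looked up from the "transform" block of a converter.
registered_transformer_functions = {"split": split}

# e.g. split("2024-01-05", {"marker": "-"}) -> ["2024", "01", "05"]
```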