caoscrawler.crawl module

Crawl a file structure using a YAML cfood definition and synchronize the acquired data with LinkAhead.

exception caoscrawler.crawl.ForbiddenTransaction

Bases: Exception

caoscrawler.crawl.check_identical(record1: Entity, record2: Entity, ignore_id=False)

Check whether two entities are identical.

This function uses compare_entities to check whether two entities are identical, according to the following rules:

  • If one of the entities has additional parents or additional properties -> not identical

  • If the value of one of the properties differs -> not identical

  • If datatype, importance or unit are reported as different for a property by compare_entities,

    return False only if these attributes are set explicitly by record1; ignore the difference otherwise.

  • If description, name, id or path appear in the list of differences -> not identical.

  • If file, checksum or size appear -> different only if explicitly set by record1.

record1 serves as the reference, so datatype, importance and unit checks are carried out using the attributes from record1. In that respect, the function is not symmetrical in its arguments.
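A minimal usage sketch, assuming the LinkAhead Python client is importable as linkahead (older installations may use caosdb instead); the record content is purely illustrative:

    import linkahead as db

    from caoscrawler.crawl import check_identical

    # Two records with the same parent and the same property value
    rec_a = db.Record(name="sample_1")
    rec_a.add_parent("Sample")
    rec_a.add_property("temperature", 25.0)

    rec_b = db.Record(name="sample_1")
    rec_b.add_parent("Sample")
    rec_b.add_property("temperature", 25.0)

    # ignore_id=True skips the id comparison, e.g. when one record comes
    # from the server and the other was just crawled
    print(check_identical(rec_a, rec_b, ignore_id=True))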

class caoscrawler.crawl.SecurityMode(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

RETRIEVE = 0
INSERT = 1
UPDATE = 2
class caoscrawler.crawl.Crawler(generalStore: GeneralStore | None = None, debug: bool | None = None, identifiableAdapter: IdentifiableAdapter | None = None, securityMode: SecurityMode = SecurityMode.UPDATE)

Bases: object

Crawler class that encapsulates crawling functions. Furthermore, it keeps track of the storage for records (record store) and the storage for values (general store).
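A minimal construction sketch; the effect attributed to each SecurityMode value below is an assumption about typical behaviour (RETRIEVE: no changes committed, INSERT: only inserts allowed, UPDATE: inserts and updates allowed):

    from caoscrawler.crawl import Crawler, SecurityMode

    # Restrict the crawler to inserts only; pending updates would then need
    # separate authorization (assumed typical behaviour of SecurityMode)
    crawler = Crawler(securityMode=SecurityMode.INSERT)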

load_converters(definition: dict)
load_definition(crawler_definition_path: str)
initialize_converters(crawler_definition: dict, converter_registry: dict)
generate_run_id()
start_crawling(items: list[StructureElement] | StructureElement, crawler_definition: dict, converter_registry: dict, restricted_path: list[str] | None = None)
property crawled_data
crawl_directory(crawled_directory: str, crawler_definition_path: str, restricted_path: list[str] | None = None)

The new main function to run the crawler on a directory.
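A usage sketch with placeholder paths:

    from caoscrawler.crawl import Crawler

    crawler = Crawler()
    # Both paths are placeholders for your data directory and cfood definition
    crawler.crawl_directory("/path/to/data", "/path/to/cfood.yml")
    records = crawler.crawled_data  # records produced by the crawl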

replace_entities_with_ids(rec: Record)
static compact_entity_list_representation(entities, referencing_entities: List) -> str

A more readable representation than the standard XML representation.

TODO this can be removed once the yaml format representation is in pylib

static remove_unnecessary_updates(crawled_data: list[Record], identified_records: list[Record])

Compare the Records to be updated with their remote counterparts. Only update if there are actual differences.

Return type:

update list without unnecessary updates
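An illustrative sketch, again assuming the client is importable as linkahead; the records are hypothetical and deliberately identical, so the update is dropped:

    import linkahead as db

    from caoscrawler.crawl import Crawler

    crawled = db.Record(id=123, name="sample_1")
    crawled.add_parent("Sample")
    crawled.add_property("temperature", 25.0)

    remote = db.Record(id=123, name="sample_1")
    remote.add_parent("Sample")
    remote.add_property("temperature", 25.0)

    # Identical content, so nothing needs to be sent to the server
    updates = Crawler.remove_unnecessary_updates([crawled], [remote])
    print(updates)  # expected to be empty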

static execute_parent_updates_in_list(to_be_updated, securityMode, run_id, unique_names)

Execute the updates of changed parents.

This method is used before the standard inserts and needed because some changes in parents (e.g. of Files) might fail if they are not updated first.

static replace_name_with_referenced_entity_id(prop: Property)

Changes the given property in place if it is a reference property that has a name as its value.

If the Property has a List datatype, each element is treated separately. If the datatype is generic, i.e. FILE or REFERENCE, values stay unchanged. If the value is not a string, the value stays unchanged. If the query using the datatype and the string value does not uniquely identify an Entity, the value stays unchanged. If an Entity is identified, then the string value is replaced by the ID.
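A sketch of the intended effect; it requires a connection to a LinkAhead server, and the RecordType and record names used here are placeholders:

    import linkahead as db

    from caoscrawler.crawl import Crawler

    # A reference property whose value is a name instead of an ID
    prop = db.Property(name="sample_ref", datatype="Sample", value="sample_1")

    # If a query for a "Sample" record named "sample_1" yields exactly one
    # entity, the string value is replaced by that entity's ID in place
    Crawler.replace_name_with_referenced_entity_id(prop)
    print(prop.value)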

static execute_inserts_in_list(to_be_inserted, securityMode, run_id: UUID | None = None, unique_names=True)
static set_ids_and_datatype_of_parents_and_properties(rec_list)
static execute_updates_in_list(to_be_updated, securityMode, run_id: UUID | None = None, unique_names=True)
static check_whether_parent_exists(records: list[Entity], parents: list[str])

Returns a list of all records in records that have a parent that is listed in parents.

synchronize(commit_changes: bool = True, unique_names: bool = True, crawled_data: list[Record] | None = None, no_insert_RTs: list[str] | None = None, no_update_RTs: list[str] | None = None, path_for_authorized_run: str | None = '')

This function applies several stages: 1) Retrieve identifiables for all records in crawled_data. 2) Compare crawled_data with existing records. 3) Insert and update records based on the set of identified differences.

This function makes use of an IdentifiableAdapter which is used to register and retrieve identifiables.

If commit_changes is True, the changes are synchronized to the CaosDB server. For debugging it can be useful to set this to False.

Parameters:
  • no_insert_RTs (list[str], optional) – list of RecordType names. Records that have one of those RecordTypes as parent will not be inserted

  • no_update_RTs (list[str], optional) – list of RecordType names. Records that have one of those RecordTypes as parent will not be updated

  • path_for_authorized_run (str, optional) – only used if there are changes that need authorization before being applied. The form for rerunning the crawler with the authorization of these changes will be generated with this path. See caosadvancedtools.crawler.Crawler.save_form for more info about the authorization form.

Returns:

the final to_be_inserted and to_be_updated as a tuple.

Return type:

inserts and updates
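A usage sketch; paths and RecordType names are placeholders, and a connection to a LinkAhead server is still needed to identify existing records even when nothing is committed:

    from caoscrawler.crawl import Crawler

    crawler = Crawler()
    crawler.crawl_directory("/path/to/data", "/path/to/cfood.yml")

    # Compute the diff without committing anything (useful for debugging)
    inserts, updates = crawler.synchronize(commit_changes=False)

    # Commit, but leave records of these RecordTypes untouched
    crawler.synchronize(commit_changes=True,
                        no_insert_RTs=["Person"],
                        no_update_RTs=["Project"])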

static create_entity_summary(entities: list[Entity])

Creates a summary string representation of a list of entities.

static inform_about_pending_changes(pending_changes, run_id, path, inserts=False)
static debug_build_usage_tree(converter: Converter)
save_debug_data(filename: str, debug_tree: DebugTree | None = None)

Save the information contained in a debug_tree to a file named filename.

caoscrawler.crawl.crawler_main(crawled_directory_path: str, cfood_file_name: str, identifiables_definition_file: str | None = None, debug: bool = False, provenance_file: str | None = None, dry_run: bool = False, prefix: str = '', securityMode: SecurityMode = SecurityMode.UPDATE, unique_names: bool = True, restricted_path: list[str] | None = None, remove_prefix: str | None = None, add_prefix: str | None = None, sss_max_log_level: int | None = None)
Parameters:
  • crawled_directory_path (str) – path to be crawled

  • cfood_file_name (str) – filename of the cfood to be used

  • identifiables_definition_file (str) – filename of an identifiable definition yaml file

  • debug (bool) – DEPRECATED, use a provenance file instead.

  • provenance_file (str) – Provenance information will be stored in a file with given filename

  • dry_run (bool) – do not commit any changes to the server

  • prefix (str) – DEPRECATED, remove the given prefix from file paths

  • securityMode (int) – securityMode of Crawler

  • unique_names (bool) – whether or not to update or insert entities in spite of name conflicts

  • restricted_path (optional, list of strings) – Traverse the data tree only along the given path. When the end of the given path is reached, traverse the full tree as normal. See docstring of ‘scanner’ in module ‘scanner’ for more details.

  • remove_prefix (Optional[str]) – Remove the given prefix from file paths. See docstring of ‘_fix_file_paths’ for more details.

  • add_prefix (Optional[str]) – Add the given prefix to file paths. See docstring of ‘_fix_file_paths’ for more details.

  • sss_max_log_level (Optional[int]) – If given, set the maximum log level of the server-side scripting log separately from the general debug option. If None is given, the maximum sss log level will be determined from the value of debug: logging.INFO if debug is False, logging.DEBUG if debug is True.

Returns:

return_value – 0 if successful

Return type:

int
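A usage sketch with placeholder file names; running it generally requires a configured LinkAhead connection:

    from caoscrawler.crawl import crawler_main, SecurityMode

    return_value = crawler_main(
        crawled_directory_path="/path/to/data",
        cfood_file_name="cfood.yml",
        identifiables_definition_file="identifiables.yml",
        dry_run=True,                        # inspect without committing
        securityMode=SecurityMode.UPDATE,
    )
    assert return_value == 0  # 0 signals success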

caoscrawler.crawl.parse_args()
caoscrawler.crawl.split_restricted_path(path)

Split a path string into components separated by slashes (os.path.sep). Empty elements will be removed.
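For example, on a POSIX system where os.path.sep is a slash:

    from caoscrawler.crawl import split_restricted_path

    print(split_restricted_path("/ExperimentalData//2024/run_01/"))
    # -> ['ExperimentalData', '2024', 'run_01']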

caoscrawler.crawl.main()