caoscrawler.crawl module

Crawl a file structure using a yaml cfood definition and synchronize the acquired data with LinkAhead.

class caoscrawler.crawl.Crawler(generalStore: GeneralStore | None = None, debug: bool | None = None, identifiableAdapter: IdentifiableAdapter | None = None, securityMode: SecurityMode = SecurityMode.UPDATE)

Bases: object

Crawler class that encapsulates crawling functions. Furthermore, it keeps track of the storage for records (record store) and the storage for values (general store).

static bend_references_to_new_object(old, new, entities)

Bend references from the old object to the new one: iterate over all entities in entities, check the values of all properties for occurrences of the old Entity, and replace them with the new Entity.

static check_whether_parent_exists(records: list[Entity], parents: list[str])

Returns a list of all records in records that have a parent that is in parents.

static compact_entity_list_representation(entities, referencing_entities: List) → str

A more readable representation than the standard XML representation.

TODO this can be removed once the yaml format representation is in pylib

crawl_directory(crawled_directory: str, crawler_definition_path: str, restricted_path: list[str] | None = None)

The new main function to run the crawler on a directory.
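
For example, a minimal usage sketch (both paths are placeholders):

   from caoscrawler.crawl import Crawler

   crawler = Crawler()
   # Scan everything below /data/experiments using the rules from the cfood:
   crawler.crawl_directory("/data/experiments", "cfood.yml")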

property crawled_data
static create_entity_summary(entities: list[Entity])

Creates a summary string representation of a list of entities.

static create_flat_list(ent_list: list[Entity], flat: list[Entity] | None = None)

Recursively adds entities and all their properties contained in ent_list to the output list flat.

TODO: This function will be moved to pylib as it is also needed by the high level API.

static create_reference_mapping(flat: list[Entity])

Create a dictionary of dictionaries of the form: dict[int, dict[str, list[Union[int,None]]]]

  • The integer index is the Python id of the value object.

  • The string is the name of the first parent of the referencing object.

Each value object is taken from the values of all properties in the list flat.

The returned mapping thus maps ids of entities to the ids of the objects that refer to them.
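
Illustratively, a returned mapping might look like this (all ids and names are made up):

   {
       # id(value object) -> {name of first parent of a referencing object:
       #                      [ids of the referencing entities (or None)]}
       140216423: {"Experiment": [1001, None]},
       140216501: {"Sample": [1001]},
   }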

static debug_build_usage_tree(converter: Converter)
static detect_circular_dependency(flat: list[Entity])

Detects whether there are circular references in the given entity list. Returns a list in which the entities are ordered according to the chain of references (only the entities contained in the cycle are included), or None if no circular dependency is found.

TODO: for the sake of detecting problems for split_into_inserts_and_updates we should only consider references that are identifying properties.
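
A sketch of a detectable cycle (record and property names are made up):

   import linkahead as db
   from caoscrawler.crawl import Crawler

   a = db.Record(name="A").add_parent(name="Thing")
   b = db.Record(name="B").add_parent(name="Thing")
   a.add_property(name="partner", value=b)
   b.add_property(name="partner", value=a)

   # Both records form a reference cycle, so a list is returned;
   # without the cycle the result would be None.
   cycle = Crawler.detect_circular_dependency([a, b])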

static execute_inserts_in_list(to_be_inserted, securityMode, run_id: UUID | None = None, unique_names=True)
static execute_parent_updates_in_list(to_be_updated, securityMode, run_id, unique_names)

Execute the updates of changed parents.

This method is used before the standard inserts and needed because some changes in parents (e.g. of Files) might fail if they are not updated first.

static execute_updates_in_list(to_be_updated, securityMode, run_id: UUID | None = None, unique_names=True)
generate_run_id()
static inform_about_pending_changes(pending_changes, run_id, path, inserts=False)
initialize_converters(crawler_definition: dict, converter_registry: dict)
load_converters(definition: dict)
load_definition(crawler_definition_path: str)
static remove_unnecessary_updates(crawled_data: list[Record], identified_records: list[Record])

Compare the Records to be updated with their remote counterparts. Only update if there are actual differences.

Return type:

update list without unnecessary updates

replace_entities_with_ids(rec: Record)
static replace_name_with_referenced_entity_id(prop: Property)

Changes the given property in place if it is a reference property that has a name as its value.

If the Property has a List datatype, each element is treated separately. If the datatype is generic, i.e. FILE or REFERENCE, values stay unchanged. If the value is not a string, the value stays unchanged. If the query using the datatype and the string value does not uniquely identify an Entity, the value stays unchanged. If an Entity is identified, then the string value is replaced by the ID.
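
A sketch of the in-place replacement (the “Sample” RecordType and the record name are assumptions, and a server connection is needed for the query):

   import linkahead as db
   from caoscrawler.crawl import Crawler

   # A reference property whose value is a name instead of an ID:
   prop = db.Property(name="sample", datatype="Sample", value="sample_001")
   # If exactly one "Sample" named "sample_001" exists on the server,
   # the string value is replaced by that record's ID:
   Crawler.replace_name_with_referenced_entity_id(prop)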

replace_references_with_cached(record: Record, referencing_entities: dict)

Replace all references with the versions stored in the cache.

If the cache version is not identical, raise an error.

save_debug_data(filename: str, debug_tree: DebugTree | None = None)

Save the information contained in a debug_tree to a file named filename.

static set_ids_and_datatype_of_parents_and_properties(rec_list)
split_into_inserts_and_updates(ent_list: list[Entity])
start_crawling(items: list[StructureElement] | StructureElement, crawler_definition: dict, converter_registry: dict, restricted_path: list[str] | None = None)
synchronize(commit_changes: bool = True, unique_names: bool = True, crawled_data: list[Record] | None = None, no_insert_RTs: list[str] | None = None, no_update_RTs: list[str] | None = None, path_for_authorized_run: str | None = '')

This function applies several stages: 1) Retrieve identifiables for all records in crawled_data. 2) Compare crawled_data with existing records. 3) Insert and update records based on the set of identified differences.

This function makes use of an IdentifiableAdapter, which is used to register and retrieve identifiables.

If commit_changes is True, the changes are synchronized to the CaosDB server. For debugging, it can be useful to set this to False.

Parameters:
  • no_insert_RTs (list[str], optional) – list of RecordType names. Records that have one of those RecordTypes as parent will not be inserted

  • no_update_RTs (list[str], optional) – list of RecordType names. Records that have one of those RecordTypes as parent will not be updated

  • path_for_authorized_run (str, optional) – only used if there are changes that need authorization before being applied. The form for rerunning the crawler with the authorization of these changes will be generated with this path. See caosadvancedtools.crawler.Crawler.save_form for more info about the authorization form.

Returns:

the final to_be_inserted and to_be_updated as a tuple.

Return type:

inserts and updates
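
A dry-run sketch (assuming the data was crawled beforehand, so that the internally stored crawled_data is used):

   from caoscrawler.crawl import Crawler

   crawler = Crawler()
   crawler.crawl_directory("/data/experiments", "cfood.yml")
   # Compare against the server, but do not commit anything:
   inserts, updates = crawler.synchronize(commit_changes=False)
   print(f"Would insert {len(inserts)} and update {len(updates)} entities.")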

exception caoscrawler.crawl.ForbiddenTransaction

Bases: Exception

class caoscrawler.crawl.SecurityMode(value)

Bases: Enum

An enumeration.

INSERT = 1
RETRIEVE = 0
UPDATE = 2
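
For example, to allow inserts but forbid updates (assuming the usual semantics in which each mode also permits the less invasive transactions):

   from caoscrawler.crawl import Crawler, SecurityMode

   # RETRIEVE: no changes at all, INSERT: inserts only,
   # UPDATE: inserts and updates.
   crawler = Crawler(securityMode=SecurityMode.INSERT)
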
class caoscrawler.crawl.TreatedRecordLookUp

Bases: object

Tracks Records and Identifiables for which it was checked whether they exist on the remote server.

For a given Record it can be checked whether it exists on the remote server if

  • it has a (valid) ID,

  • it has a (valid) path (FILEs only), or

  • an identifiable can be created for the Record.

Records are added by calling the add function and are then placed on the internal existing or missing list, depending on whether the Record has a valid ID. Additionally, the Record is added to three look-up dicts, whose keys are paths, IDs, and the representations of the identifiables.

The extreme case that one could imagine would be that the same Record occurs three times as different Python objects: one that only has an ID, one with only a path, and one without ID and path but with identifying properties. During split_into_inserts_and_updates all three must be identified with each other (and must be merged). Since we require that treated entities have a valid ID if they exist on the remote server, all three objects would be identified with each other simply using the IDs.

In the case that the Record is not yet in the remote server, there cannot be a Python object with an ID. Thus we might have one with a path and one with an identifiable. If that Record does not yet exist, it is necessary that both Python objects have at least either the path or the identifiable in common.

add(record: Entity, identifiable: Identifiable | None = None)

Add a Record that was treated, such that it is contained in the internal look-up dicts.

This Record MUST have an ID if it was found in the remote server.

get_any(record: Entity, identifiable: Identifiable | None = None)

Check whether this Record was already added. Identity is based on ID, path, or Identifiable representation.

get_existing(record: Entity, identifiable: Identifiable | None = None)

Check whether this Record exists on the remote server

Returns: The stored Record

get_existing_list()

Return all Records that exist in the remote server

get_missing(record: Entity, identifiable: Identifiable | None = None)

Check whether this Record is missing on the remote server

Returns: The stored Record

get_missing_list()

Return all Records that are missing in the remote server
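
A minimal sketch (the record is made up; per the add docstring, a Record that exists on the remote server must carry a valid ID):

   import linkahead as db
   from caoscrawler.crawl import TreatedRecordLookUp

   lookup = TreatedRecordLookUp()
   rec = db.Record(name="sample_001").add_parent(name="Sample")
   rec.id = 1234  # valid ID: the Record counts as existing remotely

   lookup.add(rec)
   assert lookup.get_any(rec) is rec        # found again via its ID
   assert rec in lookup.get_existing_list()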

caoscrawler.crawl.check_identical(record1: Entity, record2: Entity, ignore_id=False)

Check whether two entities are identical.

This function uses compare_entities to check whether two entities are identical in a quite complex fashion:

  • If one of the entities has additional parents or additional properties -> not identical

  • If the value of one of the properties differs -> not identical

  • If datatype, importance or unit are reported as different for a property by compare_entities, return “not_identical” only if these attributes are set explicitly by record1. Ignore the difference otherwise.

  • If description, name, id or path appear in list of differences -> not identical.

  • If file, checksum or size appear -> only different if explicitly set by record1.

record1 serves as the reference, so datatype, importance and unit checks are carried out using the attributes from record1. In that respect, the function is not symmetrical in its arguments.
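
For example (record and property names are made up):

   import linkahead as db
   from caoscrawler.crawl import check_identical

   r1 = db.Record(name="sample_001").add_parent(name="Sample")
   r2 = db.Record(name="sample_001").add_parent(name="Sample")
   print(check_identical(r1, r2))  # True: same name and parents

   r2.add_property(name="weight", value=1.2)
   print(check_identical(r1, r2))  # False: r2 has an additional property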

caoscrawler.crawl.crawler_main(crawled_directory_path: str, cfood_file_name: str, identifiables_definition_file: str | None = None, debug: bool = False, provenance_file: str | None = None, dry_run: bool = False, prefix: str = '', securityMode: SecurityMode = SecurityMode.UPDATE, unique_names: bool = True, restricted_path: list[str] | None = None, remove_prefix: str | None = None, add_prefix: str | None = None)

Parameters:
  • crawled_directory_path (str) – path to be crawled

  • cfood_file_name (str) – filename of the cfood to be used

  • identifiables_definition_file (str) – filename of an identifiable definition yaml file

  • debug (bool) – DEPRECATED, use a provenance file instead.

  • provenance_file (str) – Provenance information will be stored in a file with given filename

  • dry_run (bool) – do not commit any changes to the server

  • prefix (str) – DEPRECATED, use remove_prefix instead. Remove the given prefix from file paths

  • securityMode (int) – securityMode of Crawler

  • unique_names (bool) – whether or not to update or insert entities in spite of name conflicts

  • restricted_path (optional, list of strings) – Traverse the data tree only along the given path. When the end of the given path is reached, traverse the full tree as normal. See docstring of ‘scanner’ in module ‘scanner’ for more details.

  • remove_prefix (Optional[str]) – Remove the given prefix from file paths. See docstring of ‘_fix_file_paths’ for more details.

  • add_prefix (Optional[str]) – Add the given prefix to file paths. See docstring of ‘_fix_file_paths’ for more details.

Returns:

return_value – 0 if successful

Return type:

int
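
A typical invocation as a sketch (all file paths are placeholders):

   from caoscrawler.crawl import crawler_main, SecurityMode

   ret = crawler_main(
       crawled_directory_path="/data/experiments",
       cfood_file_name="cfood.yml",
       identifiables_definition_file="identifiables.yml",
       dry_run=True,  # report the changes, do not commit them
       securityMode=SecurityMode.UPDATE,
   )
   assert ret == 0  # 0 if successful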

caoscrawler.crawl.main()
caoscrawler.crawl.parse_args()
caoscrawler.crawl.split_restricted_path(path)

Split a path string into components separated by slashes or the OS path separator (os.path.sep). Empty elements will be removed.
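
For example:

   from caoscrawler.crawl import split_restricted_path

   split_restricted_path("/ExperimentalData//2023/")
   # -> ['ExperimentalData', '2023']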