caoscrawler.sync_graph module

A data model class for the graph of entities that shall be created during synchronization of the crawler.

class caoscrawler.sync_graph.SyncGraph(entities: list[linkahead.common.models.Entity], identifiableAdapter: IdentifiableAdapter)

Bases: object

A data model class for the graph of entities that shall be created during synchronization of the crawler.

The SyncGraph combines nodes in the graph based on their identity in order to create a graph of objects that can either be inserted or updated in(to) the remote server. This combination of SyncNodes happens during initialization and later on when the ID of SyncNodes is set.

When the SyncGraph is initialized, the properties of given entities are scanned and used to create multiple reference maps that track how SyncNodes reference each other. These maps are kept up to date when SyncNodes are merged because they are identified with each other. During initialization, SyncNodes are first merged based on their ID, path or identifiable.

When additional information is added to the graph by setting the ID of a node (via set_id_of_node), the graph is updated accordingly:

  • If this information implies that the node is equivalent to another node (e.g. it has the same ID), then the two nodes are merged.

  • If it becomes known that a node does not exist in the remote server, this might imply that some other node also does not exist, namely if the identity of that other node relies on the missing one.

  • The new ID might make it possible to create the identifiables of connected nodes and thus might trigger further merging of nodes based on the new identifiables.
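The propagation of "does not exist" described in the list above can be illustrated with a toy model. This is a minimal sketch and not the caoscrawler implementation: `ToyNode`, `depends_on` and `mark_missing` are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a SyncNode (not the real class)."""
    name: str
    exists: Optional[bool] = None  # None means "not yet checked"
    # Nodes that this node's identity relies on (illustrative assumption).
    depends_on: list["ToyNode"] = field(default_factory=list)


def mark_missing(node: ToyNode, all_nodes: list[ToyNode]) -> None:
    """Mark `node` as missing and propagate to nodes identified through it."""
    node.exists = False
    for other in all_nodes:
        # Compare by object identity so that dataclass equality is not used.
        if other.exists is None and any(dep is node for dep in other.depends_on):
            mark_missing(other, all_nodes)


project = ToyNode("project")
experiment = ToyNode("experiment", depends_on=[project])
measurement = ToyNode("measurement", depends_on=[experiment])
unrelated = ToyNode("unrelated")
nodes = [project, experiment, measurement, unrelated]

# Learning that the project is missing settles the two nodes that depend on it.
mark_missing(project, nodes)
```

In this toy model, `unrelated` stays unchecked because its identity does not rely on any of the missing nodes.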

A SyncGraph should only be manipulated via one function:

  • set_id_of_node: a positive integer means the Entity exists, None means it is missing. (TODO: what about string IDs?)

The SyncGraph can be converted back to lists of entities, which can then be used to perform the desired inserts and updates.

Usage:

  • Initialize the graph with a list of entities. Those will be converted to the SyncNodes of the graph.

  • SyncNodes that can be merged are merged automatically, and SyncNodes whose existence can be determined are automatically removed from the list of unchecked SyncNodes: graph.unchecked.

  • You manipulate the graph by setting the ID of a SyncNode (either to a valid ID or to None). For example, you can check whether a SyncNode has an identifiable and then query the remote server and use the result to set the ID.

  • After each manipulation, the graph updates accordingly (see above)

  • Ideally, the unchecked list is empty after some manipulation.

  • You can export a list of entities to be inserted and one of entities to be updated with export_record_lists.
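The usage steps above could be driven by a loop like the following. This is only a sketch under stated assumptions: `resolve_unchecked` and the `lookup` callback are hypothetical helpers, the attribute name `identifiable` on nodes is assumed, and `FakeNode` and `FakeGraph` are tiny stand-ins that exist only so the example runs without a server. Only `unchecked` and `set_id_of_node` are taken from the description above.

```python
from typing import Callable, Optional


def resolve_unchecked(graph, lookup: Callable[[object], Optional[int]]) -> None:
    """Set an ID on every unchecked node that currently has an identifiable.

    `lookup` is a hypothetical callback that would query the remote server
    and return the ID of the matching entity, or None if there is none.
    """
    progress = True
    while progress and graph.unchecked:
        progress = False
        # Iterate over a copy: setting an ID may shrink graph.unchecked.
        for node in list(graph.unchecked):
            if node not in graph.unchecked:
                continue  # already settled as a side effect of an earlier call
            if getattr(node, "identifiable", None) is None:
                continue  # cannot be looked up yet
            graph.set_id_of_node(node, lookup(node))
            progress = True


class FakeNode:
    """Stand-in for a SyncNode; not the real class."""

    def __init__(self, name, identifiable=None):
        self.name = name
        self.identifiable = identifiable
        self.id = None


class FakeGraph:
    """Stand-in for a SyncGraph that does no merging; not the real class."""

    def __init__(self, nodes):
        self.unchecked = list(nodes)

    def set_id_of_node(self, node, node_id=None):
        node.id = node_id if node_id is not None else -1
        self.unchecked.remove(node)


known_on_server = {"a": 101}  # pretend server content
a = FakeNode("a", identifiable="a")
b = FakeNode("b", identifiable="b")
c = FakeNode("c")  # no identifiable yet, so it stays unchecked
graph = FakeGraph([a, b, c])
resolve_unchecked(graph, lambda node: known_on_server.get(node.identifiable))
```

The loop stops as soon as a full pass makes no progress, so nodes that never obtain an identifiable remain in the unchecked list instead of causing an endless loop.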

Last review by Alexander Schlemmer on 2024-05-24.

set_id_of_node(node: SyncNode, node_id: Optional[str] = None)

Sets the ID attribute of the given SyncNode to node_id.

If node_id is None, a negative ID will be assigned, indicating that the node does not exist on the remote server. Furthermore, the node will be marked as missing using _mark_missing.
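The convention of assigning a negative ID to missing nodes can be sketched as follows. This is a toy model, not the library code: `ToyGraph`, the dict-based nodes and the `missing` list are hypothetical, integer IDs are assumed, and handing out distinct negative IDs from a counter is an assumption made for this illustration.

```python
import itertools
from typing import Optional


class ToyGraph:
    """Hypothetical stand-in that only models the ID convention."""

    def __init__(self) -> None:
        # Counter that yields -1, -2, -3, ... (assumed scheme).
        self._next_missing_id = itertools.count(-1, -1)
        self.missing: list[dict] = []

    def set_id_of_node(self, node: dict, node_id: Optional[int] = None) -> None:
        if node_id is None:
            # No ID given: the node does not exist remotely.
            node["id"] = next(self._next_missing_id)
            self.missing.append(node)
        else:
            node["id"] = node_id


toy = ToyGraph()
existing, new_one, new_two = {}, {}, {}
toy.set_id_of_node(existing, 42)
toy.set_id_of_node(new_one)
toy.set_id_of_node(new_two)
```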

Last review by Alexander Schlemmer on 2024-05-24.

export_record_lists()

Exports the SyncGraph in the form of db.Entity objects.

All nodes are converted to db.Entity objects and reference values that are SyncNodes are replaced by their corresponding (newly created) db.Entity objects.

Since the result is returned in the form of two lists, one with Entities that have a valid ID and one with those that do not, an error is raised if there are any SyncNodes without a (possibly negative) ID.
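The split into two lists can be pictured with the sketch below. It is an illustration only: `split_by_id` is a hypothetical helper working on plain dicts, integer IDs are assumed, and the order of the returned lists (inserts first, updates second) is an assumption rather than a documented guarantee.

```python
def split_by_id(entities: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split entities into (to_insert, to_update) based on the sign of the ID."""
    to_insert: list[dict] = []
    to_update: list[dict] = []
    for entity in entities:
        entity_id = entity.get("id")
        if entity_id is None:
            # Mirrors the documented behaviour: every node needs an ID first.
            raise ValueError("entity without ID cannot be exported")
        if entity_id < 0:
            to_insert.append(entity)  # negative ID: missing on the server
        else:
            to_update.append(entity)  # valid ID: exists on the server
    return to_insert, to_update


inserts, updates = split_by_id([{"id": 5}, {"id": -1}, {"id": 7}])
```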

Last review by Alexander Schlemmer on 2024-05-24.

unchecked_contains_circular_dependency()

Detects whether there are circular references in the given entity list and returns a list where the entities are ordered according to the chain of references (only the entities contained in the cycle are included). Returns None if no circular dependency is found.
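One common way to find such a cycle is a depth-first search that remembers the current path. The sketch below uses that technique on a plain dictionary of names; it is not taken from the caoscrawler source, and `find_cycle` and the `references` mapping are hypothetical.

```python
from typing import Optional


def find_cycle(references: dict[str, list[str]]) -> Optional[list[str]]:
    """Return the nodes of one reference cycle in reference order, or None."""
    visiting: list[str] = []  # nodes on the current depth-first path
    done: set[str] = set()    # nodes fully explored without finding a cycle

    def visit(node: str) -> Optional[list[str]]:
        if node in visiting:
            # The path from the first occurrence of `node` onwards is the cycle.
            return visiting[visiting.index(node):]
        if node in done:
            return None
        visiting.append(node)
        for target in references.get(node, []):
            cycle = visit(target)
            if cycle is not None:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in references:
        cycle = visit(start)
        if cycle is not None:
            return cycle
    return None


# "a" references "b", and "b" and "c" reference each other.
cycle = find_cycle({"a": ["b"], "b": ["c"], "c": ["b"], "d": []})
no_cycle = find_cycle({"a": ["b"], "b": []})
```

Note that `"a"` is not part of the returned list: it leads into the cycle but is not contained in it.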

TODO: for the sake of detecting problems for split_into_inserts_and_updates we should only consider references that are identifying properties.

get_equivalent(entity: SyncNode) → Optional[SyncNode]

Return an equivalent SyncNode.

Equivalent means that ID, path or identifiable are the same. If new information was added to the given SyncNode (e.g. the ID), it might then be possible to identify an equivalent node (i.e. one with the same ID in this example). There might be more than one equivalent node in the graph; however, only the first one found is returned. (When an equivalent node is found, the given node is typically merged into the one that was found, and after the merge the graph is checked again for equivalent nodes.)

Returns None if no equivalent node is found.
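The notion of equivalence can be sketched with a linear scan over toy nodes. This is an illustration only and makes no claim about how the library performs the lookup: `SketchNode` and `find_equivalent` are hypothetical names, and the attribute types are simplified.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # keep identity-based comparison between nodes
class SketchNode:
    """Hypothetical stand-in for a SyncNode (not the real class)."""
    id: Optional[int] = None
    path: Optional[str] = None
    identifiable: Optional[str] = None


def find_equivalent(node: SketchNode, nodes: list[SketchNode]) -> Optional[SketchNode]:
    """Return the first other node sharing a known ID, path or identifiable."""
    for other in nodes:
        if other is node:
            continue
        for attr in ("id", "path", "identifiable"):
            value = getattr(node, attr)
            # Unknown (None) attributes never count as a match.
            if value is not None and value == getattr(other, attr):
                return other
    return None


first = SketchNode(id=1)
second = SketchNode(path="/data/x")
third = SketchNode(id=1, path="/data/y")  # same ID as `first`
fourth = SketchNode(identifiable="key")
all_nodes = [first, second, third, fourth]
```

Here `find_equivalent(third, all_nodes)` yields `first` because both carry the same ID, while `fourth` has no equivalent and yields None.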

Last review by Alexander Schlemmer on 2024-05-28.