WIP Tutorial: Single structured file

Warning

This tutorial is still a work in progress. It may be better than nothing, but it is incomplete and probably contains serious errors.

Use at your own risk.

In this tutorial, we will create a crawler that reads a single structured file, such as an XLSX file.
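If you do not have a suitable spreadsheet at hand, you can create a minimal test file with openpyxl. The column headers below are the ones matched by the cfood further down; the file name and the example data row are made up for illustration only.

from openpyxl import Workbook

# Create a single-sheet workbook with a header row and one example data row.
wb = Workbook()
ws = wb.active
ws.append(["Art der Maßnahme", "Titel der Maßnahme", "Organisation", "Laufzeit"])
ws.append([1, "Brandschutzschulung", "Muster GmbH", "2023"])
wb.save("praeventionsmassnahmen.xlsx")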

Declarations

identifiables.yml

Präventionsmaßnahme:
  - Organisation
  - titel
  - Laufzeit
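This declares that a Präventionsmaßnahme record is identified by the combination of its Organisation, titel and Laufzeit properties. The corresponding data model has to exist on the server before crawling. The following is only a rough sketch of how it could be inserted with the LinkAhead Python client: the names come from identifiables.yml, but the property datatypes and the modelling of Organisation as a RecordType of its own are assumptions.

import linkahead as db  # on older installations: import caosdb as db

# The referenced organisation, here assumed to be a RecordType of its own.
organisation = db.RecordType(name="Organisation")

# Properties of a Präventionsmaßnahme; the datatypes are assumptions.
titel = db.Property(name="titel", datatype=db.TEXT)
laufzeit = db.Property(name="Laufzeit", datatype=db.TEXT)

# The RecordType whose identifiable is defined above.
praeventionsmassnahme = db.RecordType(name="Präventionsmaßnahme")
praeventionsmassnahme.add_property(organisation)
praeventionsmassnahme.add_property(titel)
praeventionsmassnahme.add_property(laufzeit)

cont = db.Container()
cont.extend([organisation, titel, laufzeit, praeventionsmassnahme])
cont.insert()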

cfood.yml

---
metadata:
  crawler-version: 0.6.1
---

Präventionsmaßnahme der Organisation:  # An Excel file with prevention measures
  type: XLSXTableConverter
  match: ".*xlsx$"  # Any xlsx file.
  subtree:
    Maßnahme:  # One row in the file
      type: DictElement
      match_name: .*
      match_value: .*
      records:
        Präventionsmaßnahme:  # Records that are created or updated for each row
          name: ""
      subtree:
        MaßnahmenArt:  # Column with the type of measure
          type: IntegerElement
          match_name: Art der Maßnahme  # Name of the column in the table file
          match_value: (?P<column_value>.*)
        MaßnahmenTitel:
          type: TextElement
          match_name: Titel der Maßnahme  # Name of the column in the table file
          match_value: (?P<column_value>.*)
          records:  # Records that are updated for each cell
            Präventionsmaßnahme:
              titel: $column_value
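The match_value expressions use a named regular expression group. Whatever the group named column_value captures is made available as the variable $column_value inside the records section of that converter, which is how the content of the "Titel der Maßnahme" cell ends up in the titel property. The following snippet only illustrates the named-group mechanism itself; the cell text is made up.

import re

# The same pattern as in the cfood: one named group capturing the whole cell.
match = re.fullmatch(r"(?P<column_value>.*)", "Brandschutzschulung")
print(match.group("column_value"))  # prints: Brandschutzschulung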

Python code

#!/usr/bin/env python3

# Crawler for Präventionsmaßnahme records
#
# Copyright (C) 2023 IndiScale GmbH <info@indiscale.com>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
#
# You should have received a copy of the GNU Affero General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.

"""Crawler für Präventionsmaßnahmen"""

import argparse

from caoscrawler.scanner import (load_definition, create_converter_registry,
                                 scan_structure_elements)
from caoscrawler.structure_elements import File


def crawl_file(filename: str, dry_run: bool = False):
    """Read an XLSX file into a LinkAhead container.

Parameters
----------
filename : str
  The name of the XLSX file.

dry_run : bool
  If True, do not modify the database.
    """
    definition = load_definition("cfood.yml")
    converter_registry = create_converter_registry(definition)

    # Wrap the file in a File structure element (its name only has to match
    # the cfood's ".*xlsx$" pattern) and scan it into LinkAhead records.
    records = scan_structure_elements(items=File(name="somename.xlsx", path=filename),
                                      crawler_definition=definition,
                                      converter_registry=converter_registry)
    # Drop into an interactive shell to inspect the scanned records
    # (requires IPython).  Nothing is sent to the server at this point.
    from IPython import embed
    embed()


def _parse_arguments():
    """Parse the arguments."""
    parser = argparse.ArgumentParser(description='Crawler for Präventionsmaßnahme records')
    parser.add_argument('-n', '--dry-run', help="Do not modify the database.", action="store_true")
    parser.add_argument('xlsx_file', metavar="XLSX_FILE", help="The XLSX file to be crawled.")
    return parser.parse_args()


def main():
    """Main function."""
    args = _parse_arguments()
    crawl_file(args.xlsx_file, dry_run=args.dry_run)


if __name__ == '__main__':
    main()
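The crawl_file function above only scans the file and drops into an interactive shell; the dry_run argument is not used yet. A possible continuation is to synchronize the scanned records with the server. The following is only a sketch: the class, method and argument names are assumptions based on the caoscrawler package and may differ between crawler versions.

from caoscrawler.crawl import Crawler
from caoscrawler.identifiable_adapters import CaosDBIdentifiableAdapter


def synchronize_records(records, dry_run: bool = False):
    """Insert or update the scanned records on the LinkAhead server."""
    # Tell the crawler how existing records are identified (identifiables.yml).
    ident = CaosDBIdentifiableAdapter()
    ident.load_from_yaml_definition("identifiables.yml")

    crawler = Crawler(identifiableAdapter=ident)
    if not dry_run:
        # NOTE: the exact synchronize() signature depends on the caoscrawler
        # version; check the caoscrawler documentation for your installation.
        crawler.synchronize(crawled_data=records)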