Conversion between LinkAhead data models, JSON schema, and XLSX (and vice versa)

This file describes the conversion between JSON schema files and XLSX templates, and between JSON data files following a given schema and XLSX files with data. This conversion is handled by the Python modules in the table_json_conversion library.

Data models in JSON Schema and JSON data

Let’s start simple! If you would describe a Person Record with the Properties family_name and given_name in JSON, it would probably look like this:

{
    "Person":
        {
            "family_name": "Steve",
            "given_name": "Stevie"
        }
}

The data model in LinkAhead defines the types of records present in a LinkAhead instance and their structure. This data model can also be represented in a JSON Schema, which defines the structure of JSON files containing records pertaining to the data model.

You can define this kind of structure with the following JSON schema:

{
  "type": "object",
  "properties": {
    "Person": {
      "type": "object",
      "properties": {
        "family_name": {
          "type": "string"
        },
        "given_name": {
          "type": "string"
        }
      }
    }
  },
  "$schema": "https://json-schema.org/draft/2020-12/schema"
}

The above schema (and schemas created by json_schema_exporter.merge_schemas(...)) is, from a broad view, a dict with all the top level recordtypes (the recordtype names are the keys). This is sufficient to describe the data model. However, actual data often consists of multiple entries of the same type (e.g. multiple Persons).

Since the data model schema does not match multiple data sets, there is a utility function which creates a data array schema out of the data model schema: It basically replaces the top-level entries of the data model by lists which may contain data.

For example, the following JSON describes two “Person” Records:

{
    "Person": [
        {
            "family_name": "Steve",
            "given_name": "Stevie"
        },
        {
            "family_name": "Star",
            "given_name": "Stella"
        }
    ]
}

The JSON Schema for a JSON like the above one could look like the following:

{
  "type": "object",
  "properties": {
    "Person": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "family_name": {
            "type": "string"
          },
          "given_name": {
            "type": "string"
          }
        }
      }
    }
  },
  "$schema": "https://json-schema.org/draft/2020-12/schema"
}

This would define that the top level object/dict may have a key Person which has as value an array of objects that in turn have the properties family_name and given_name.

You can create a data array schema from a data model schema using xlsx_utils.array_schema_from_model_schema.

From JSON to XLSX: Data Representation

The following describes how JSON files representing LinkAhead records are converted into XLSX files, or how JSON files with records are created from XLSX files.

The attribute name (e.g., “Person” above) determines the RecordType, and the value of this attribute can either be an object or a list. If it is an object (as in the example above), a single record is represented. In the case of a list, multiple records sharing the same RecordType as the parent are represented.

The Properties of the record (e.g., family_name and given_name above) become columns in the XLSX file. Thus the XLSX file created from the above example would have a sheet “Person” with the following table:

given_name

family_name

Stevie

Steve

Stella

Star

The properties of objects (Records) in the JSON have an attribute name and a value. The value can be:

  1. A primitive (text, number, boolean, …)

  2. A record

  3. A list of primitive types

  4. A list of unique enums (multiple choice)

  5. A list of records

In cases a. and c., a cell is created in the column corresponding to the property in the XLSX file. In case b., columns are created for the Properties of the record, where for each of the Properties the cases a. - e. are considered recursively. Case d. leads to a number of columns, one for each of the possible choices.

For case e. however, the two-dimensional structure of an XLSX sheet is not sufficient. Therefore, for such cases, new XLSX sheets/tables are created.

In these sheets/tables, the referenced records are treated as described above (new columns for the Properties). However, there are now additional columns that indicate from which “external” record these records are referenced.

Let’s now consider these five cases in detail and with examples:

a. Properties with primitive data types

{
    "Training": [
      {
        "date": "2023-01-01",
        "url": "www.indiscale.com",
        "duration": 1.0,
        "participants": 1,
        "remote": false
      },
      {
        "date": "2023-06-15",
        "url": "www.indiscale.com/next",
        "duration": 2.5,
        "participants": None,
        "remote": true
      }
    ]
}

This entry will be represented in an XLSX sheet with the following content:

date

url

duration

participants

remote

2023-01-01

www.indiscale.com

1.0

1

false

2023-06-15

www.indiscale.com/next

2.5

true

b. Property referencing a record

{
    "Training": [
      {
        "date": "2023-01-01",
        "supervisor": {
            "family_name": "Stevenson",
            "given_name": "Stevie"
        }
      }
    ]
}

This entry will be represented in an XLSX sheet named “Training” with the following content:

date

supervisor.family_name

supervisor.given_name

2023-01-01

Stevenson

Stevie

c. Properties containing lists of primitive data types

{
    "Training": [
      {
        "url": "www.indiscale.com",
        "subjects": ["Math", "Physics"],
      }
    ]
}

This entry would be represented in an XLSX sheet with the following content:

url

subjects

www.indiscale.com

Math;Physics

The list elements are written into the cell separated by ; (semicolon). If the elements contain the separator ;, it is escaped with \.

d. Multiple choice properties

{
    "Training": [
      {
        "date": "2024-04-17",
        "skills": [
              "Planning",
              "Evaluation"
        ]
      }
    ]
}

If the skills list is denoted as an enum array with "uniqueItems": true in the json schema, this entry would be represented like this in an XLSX:

date

skills.Planning

skills.Communication

skills.Evaluation

2024-04-17

x

x

Note that this example assumes that the list of possible choices, as given in the json schema, was “Planning, Communication, Evaluation”.

e. Properties containing lists with references

{
    "Training": [
      {
        "date": "2023-01-01",
        "coach": [
            {
              "family_name": "Sky",
              "given_name": "Max",
            },
            {
              "family_name": "Sky",
              "given_name": "Min",
            }
        ]
      }
    ]
}

Since the two coaches cannot be represented properly in a single cell, another worksheet is needed to contain the properties of the coaches.

The sheet for the Trainings in this example only contains the “date” column

date

2023-01-01

Additionally, there is another sheet where the coaches are stored. Here, it is crucial to define how the correct element is chosen from potentially multiple “Trainings”. In this case, it means that the “date” must be unique.

The second sheet looks like this:

date

coach.family_name

coach.given_name

2023-01-01

Sky

Max

2023-01-01

Sky

Min

Note: This uniqueness requirement is not strictly checked right now, it is your responsibility as a user that such “foreign properties” are truly unique.

When converting JSON files that contain Records that were exported from LinkAhead it might be a good idea to use the LinkAhead ID as a unique identifier for Records. However, if your Records do not yet have LinkAhead IDs you need to find some other identifying properties/foreign keys. Note, that those properties only need to identify a Record uniquely within the list of Records: In the above example the “coach” Record needs to be identified in the list of coaches.

Data in XLSX: Hidden automation logic

First column: Marker for row types

The first column in each sheet will be hidden and it will contain an entry in each row that needs special treatment. The following values are used:

  • IGNORE: This row is ignored. It can be used for explanatory texts or layout.

  • COL_TYPE: Typically the first row that is not IGNORE. It indicates the row that defines the type of columns (FOREIGN, SCALAR, LIST, MULTIPLE_CHOICE, IGNORE). This row must occur exactly once per sheet.

  • PATH: Indicates that the row is used to define the path within the JSON. These rows are typically hidden for users.

An example table could look like this:

IGNORE

Welcome

to

this

file

IGNORE

Please

enter your

data here:

COL_TYPE

IGNORE

SCALAR

SCALAR

LIST

SCALAR

PATH

Training

Training

Training

Training

PATH

url

date

subjects

supervisor

PATH

email

IGNORE

Please enter one training per line.

Training URL

Training date

Subjects

Supervisor’s email

example.com/mp

2024-02-27

Math;Physics

steve@example.com

example.com/m

2024-02-28

Math

stella@example.com

Parsing XLSX data

To extract the value of a given cell, we traverse all path elements (in PATH rows) from top to bottom. The final element of the path is the name of the Property to which the value belongs. In the example above, steve@example.com is the value of the email Property in the path ["Training", "supervisor", "email"].

The path elements are sufficient to identify the object within a JSON, at least if the corresponding JSON element is a single object. If the JSON element is an array, the appropriate object within the array needs to be selected.

For this selection additional FOREIGN columns are used. The paths in these columns must all have the same base and one additional unique key component. For example, two FOREIGN columns could be ["Training", "date"] and ["Training", "url"], where ["Training"] is the base path and "date" and "url" are the unique keys.

The base path defines the table (or recordtype) to which the entries belong, and the values of the unique keys define the actual rows to which data belongs.

For example, this table defines three coaches for the two trainings from the last table:

COL_TYPE

FOREIGN

FOREIGN

SCALAR

PATH

Training

Training

Training

PATH

date

url

coach

PATH

given_name

IGNORE

Date of training

URL of training

The coach’s given name

IGNORE

from sheet ‘Training’

from sheet ‘Training’

2024-02-27

example.com/mp

Ada

2024-02-27

example.com/mp

Berta

2024-02-28

example.com/m

Chris

Sepcial case: multiple choice “checkboxes”

As a special case, enum arrays with "uniqueItems": true can be represented as multiple columns, with one column per choice. The choices are denoted as the last PATH component, the column type must be MULTIPLE_CHOICE.

Stored data is denoted as an “x” character in the respective cell, empty cells denote that the item was not selected. Additionally, the implementation also allows TRUE or 1 for selected items, and FALSE, 0 or cells with only whitespace characters for deselected items:

COL_TYPE

MULTIPLE_CHOICE

MULTIPLE_CHOICE

MULTIPLE_CHOICE

PATH

skills

skills

skills

PATH

Planning

Communication

Evaluation

IGNORE

skills.Planning

skills.Communication

skills.Evaluation

x

X

“ “

TRUE

FALSE

0

x

1

These rows correspond to:

  1. Planning, Evaluation

  2. Communication

  3. Communication, Evaluation

User Interaction

The primary and most straight forward use case of this utility is to export LinkAhead data as JSON and then as XLSX tables. This can be done fully automatic.

TODO show how!

The hidden cells for automation are designed such that the XLSX template that is created can be customized such that it is a nicely formatted table. The hidden content must remain. See below for tips on how to manipulate the table.

The second use case is to use XLSX to collect data and then import it into LinkAhead. Here, it may be necessary to define foreign keys in order to identify Records in lists.

Table Manipulation

  • All formatting is ignored

  • Nothing has to be observed when adding new data rows

  • When adding new descriptory rows (for example one for descriptions of the columns), the COL_TYPE must be set to IGNORE

  • You can freely rename sheets.

  • You can freely rename columns (since the row containing the column names is set to IGNROE; the Property name is taken from the last path element)

  • You can change the order of columns. However, you have to make sure to move the full column including hidden elements. Thus you should not select a range of cells, but click on the column index in your spread sheet program.

Note: Requirements

This conversion does not allow arbitrary JSON schema files nor does it support arbitrary JSON files since conversion to XLSX files would not make sense. Instead, this utility is tailored to supported conversion of data (models) that are structured like data (models) in LinkAhead:

  • The JSON schema describes a data model of RecordTypes and Properties as it would be generated by the caosadvancedtools.json_schema_exporter module.

  • The JSON files must contain arrays of Records complying with such a data model.

Thus, when converting from a JSON schema, the top level of the JSON schema must be a dict. The keys of the dict are RecordType names.

Current limitations

The current implementation still lacks the following:

  • Files handling is not implemented yet.