Aggregate Data Value Import/Export 2.0 #355
Replies: 1 comment
FYI: While ADX currently is the only format that uses COs, it does not identify CC collisions and thus is not correct if a DS with a CC override is used.
Motivation
The aggregate data value import/export is one of the most essential foundations of the DHIS2 platform.
Yet it leaves much to be desired both for users and maintainers.
The main issues are:
Problem Causes
The main causes of the above issues are design flaws and feature creep.
Proposal
Data files need a header that is specific about the data that is contained in the file.
The data in the file must be ordered in a specified way to allow batching.
The format should be universal and flexible where possible, without compromising the ability to batch process.
Data Format
A file is a self-contained unit that defines a fixed scope and states the used key objects in a "header":
If only one identifier is given for a key dimension that dimension is implicit and omitted entirely in the data array.
If multiple identifiers are given for a key dimension the data array refers to the actual value for the row by index.
If multiple OUs and/or Periods are used the data has to be organized in the given order,
with Period being the 1st grouping dimension, and OU the 2nd, if both have multiple values.
Index references to the group header occur in the order: Period, OU, n x COs or m x COCs.
COs are given as an array of arrays for each category in the order defined by the CC,
or by using an object where the keys are the C identifiers and the values the COs for that category.
Additionally a value (row) array defines, always in the same order:
- the value
- a flag (default `false`)
- a flag (`null` = keep as is, or `false` for insert)
- a comment (`null` = keep, or none for insert)

Defaults can be omitted as long as they aren't followed by a non-default value.
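To make the layout concrete, below is a minimal sketch of one group, assuming hypothetical property names (`dataElement`, `periods`, `orgUnits`, `values`) and placeholder identifiers; none of these names are taken from the actual specification.

```java
// Illustrative sketch only: property names and identifiers are assumptions.
// One DE and one period are given, so both are implicit in the rows;
// two OUs are given, so each row starts with the OU index, followed by the value.
public final class ExampleGroup {

    static final String EXAMPLE = """
        {
          "dataElement": "DE_UID",
          "periods": ["202401"],
          "orgUnits": ["OU_A_UID", "OU_B_UID"],
          "values": [
            [0, 12],
            [1, 34]
          ]
        }
        """;
}
```

Because only one identifier is given for DE and Period, the rows carry just the OU index reference plus the value; all omitted parts use their defaults.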
Note
A file might also consist of an array of the above group objects.
In that case each group is its own import transaction (TX), similar to sending each of them in a separate file.
There is only one extra "option", `delete`, which can be set to `"soft"` or `"hard"` (or omitted to not delete).
If set to soft, all data matching the key dimensions of the header is marked deleted.
If set to hard, all data matching the key dimensions is deleted from the table.
Both can be used either with new data stated in the data array or without (size = 0).
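As an equally hedged illustration (same assumed property names), a group that uses the delete option without inserting new data could look like this:

```java
// Illustrative sketch only: property names are assumptions.
// "delete": "hard" removes all data matching the header's key dimensions;
// the empty data array (size = 0) means no new values are inserted.
public final class ExampleDeleteGroup {

    static final String EXAMPLE = """
        {
          "dataElement": "DE_UID",
          "periods": ["202401"],
          "orgUnits": ["OU_A_UID"],
          "delete": "hard",
          "values": []
        }
        """;
}
```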
Warning
DE + CO vs DE + COC vs DE + CO + CC vs DE + CO + DS
When using DE + CO as identifiers, the CC used isn't necessarily unique in case the DS connects to multiple DEs with a CC override.
In such cases the collision is detected and the import fails with a message asking the user to either provide a CC or a DS, or use COCs instead of COs.
When exporting, the user should be forced to select one of these options:
Data Format Specification
Notation:
- `?` : property is not always required
- `|` : or
- `[...]` : an array of ...

Values
The actual values of a data value are always strings in the DB.
However, the import format allows and encourages the use of specific JSON types:
Values do not need to be consistent in their use of types. Each row can use the type that fits best.
Conversion tools can always use string for simplicity.
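For instance, two rows for the same numeric DE could mix the JSON number type and its string form; a small sketch reusing the assumed layout from above:

```java
// Illustrative only: the first row uses a JSON number, the second the string
// form of the same kind of value; conversion tools may always fall back to strings.
public final class MixedTypeRows {

    static final String EXAMPLE = """
        [
          [0, 12],
          [1, "13"]
        ]
        """;
}
```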
Validation
When looking at the file format it becomes apparent that most validations target the header, as the data array
really just contains index references, a value, some flags and a comment.
The only value level validations that remain are ones like:
does the file the value refers to exist? (super rare value type FILE case). All other validations are really just consistency checks of the relations between the key dimensions as stated in the header.
Thus, all "expensive" validations can be performed in O(1).
Very complicated group level validation should be moved to data integrity checks instead.
The validations on the data array rows are simple pure function validators that do not need any metadata input once they are constructed,
and thus run fast in O(n) with n being the number of data rows.
The validation strategy follows that insight: a header level validation failure fails the entire import.
A value level validation failure or an out of bounds index reference ignores that row.
No modes or switches. Users can test the result by using a dry-run.
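A minimal sketch of that strategy, assuming a simplified row shape of `[ouIndex, value, ...]` (not the actual implementation): the metadata-dependent work happens once when the validator is built from the header, and the per-row check is a pure function.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch only: a row validator built once from header information (here just
// the OU count). The per-row check needs no metadata lookups, so validating
// n rows is O(n).
final class RowValidators {

    /** Builds a validator for rows of the assumed shape [ouIndex, value, ...]. */
    static Predicate<List<?>> forHeader(int orgUnitCount) {
        return row -> {
            if (row.size() < 2) return false;                 // needs at least index + value
            if (!(row.get(0) instanceof Integer index)) return false;
            return index >= 0 && index < orgUnitCount;        // out of bounds => row is ignored
        };
    }
}
```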
Batching and Stream Processing
The file layout allows reading only the header and running most validations.
Once these succeed, the data can be stream processed quite easily.
While the json-tree library doesn't work on input streams, it can perform an iteration through the JSON without holding on to memory that cannot be GCed once an array element has been processed.
Alternatively we could also use true JSON stream processing that already exists but would need to be integrated.
That means that as long as there is enough memory to fit the file being processed, the import only needs a fixed amount of extra
memory to process it. With that, files below 100MB in size are clearly no issue.
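Purely as an assumption about which existing library could be integrated, a sketch using the Jackson streaming API shows the general shape: read up to the data array, then hand over one row sub-array at a time, so only a single row is materialised in memory.

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.InputStream;
import java.util.function.Consumer;

// Sketch only: the "values" property name is the same assumption as in the
// earlier format example, not part of the actual specification.
final class StreamingImportSketch {

    static void process(InputStream in, Consumer<JsonNode> rowConsumer) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (JsonParser p = mapper.getFactory().createParser(in)) {
            while (p.nextToken() != null) {
                // position on the (assumed) "values" property holding the data rows
                if (p.currentToken() == JsonToken.FIELD_NAME && "values".equals(p.getCurrentName())) {
                    if (p.nextToken() == JsonToken.START_ARRAY) {
                        // read one row sub-array at a time; only that row is in memory
                        while (p.nextToken() == JsonToken.START_ARRAY) {
                            JsonNode row = mapper.readTree(p);
                            rowConsumer.accept(row);
                        }
                    }
                }
            }
        }
    }
}
```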
Given the compact format a typical data value row will need around 10-20 bytes.
Assuming each takes 25 bytes, 1KB can fit 40 values, 1MB ~40K values, 100MB ~4 million values.
Since the data input is required to occur in the given Period and OU order when multiple OUs and/or Periods are used,
the fetching of the existing data values can be split into manageable chunks if wanted, or everything can be fetched in one query.
With the number of existing values and of values in the file known up-front, a decision can be made to batch in one of 4 modes:
SQL
The import should be implemented mostly using native SQL instead of using the object model.
Queries are (in order of occurrence):
The SQL for performing the import changes should be implemented in such a way that a dry-run can include the generated native SQL in the response, if the user desires it.
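A hedged sketch of such a native upsert (PostgreSQL syntax, issued via JDBC batching); the table name, column names and conflict constraint below are illustrative assumptions, not the actual DHIS2 schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch only: illustrates the "native SQL instead of the object model" idea.
final class NativeSqlUpsertSketch {

    // Hypothetical row holder with already resolved internal ids.
    record Row(long periodId, long orgUnitId, long cocId, String value) {}

    static final String UPSERT = """
        insert into datavalue (dataelementid, periodid, sourceid, categoryoptioncomboid, value)
        values (?, ?, ?, ?, ?)
        on conflict (dataelementid, periodid, sourceid, categoryoptioncomboid)
        do update set value = excluded.value
        """;

    static void upsert(Connection c, long dataElementId, List<Row> rows) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(UPSERT)) {
            for (Row r : rows) {
                ps.setLong(1, dataElementId);
                ps.setLong(2, r.periodId());
                ps.setLong(3, r.orgUnitId());
                ps.setLong(4, r.cocId());
                ps.setString(5, r.value());
                ps.addBatch();               // batched per chunk to keep round-trips low
            }
            ps.executeBatch();
        }
    }
}
```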
Legacy Format Support
There are already multiple formats for aggregate data values: DXF in the form of JSON, CSV, XML or PDF, and ADX in the form of XML.
Adding yet another one on top of the existing code just adds more overall complexity.
Therefore, the goal must be to create file rewriting functions that rewrite old formats into the new one
so that the only processing pipeline needed is the new one.
The rewriting can be done either by a stand-alone command line tool or by the server at the outer controller layer.
This will allow old data files to be used but will remove some of the import options users could select in the app.
Since the new format always assumes a single DE, an old file containing multiple DEs has to be split into multiple files.
The new format is designed so rewriting is possible without additional knowledge, except for the identifier schema used.
One exception would be if a file uses AOCs that are associated with different datasets.
But this was never really correctly supported by the old import, so it should be fair to assume such files do not exist.
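A rough sketch of the splitting step, assuming a hypothetical old payload shape of `{ "dataValues": [ { "dataElement": ..., ... } ] }`: group the old rows by DE so each group can be emitted as one new-format file.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: both the old and the new property names are assumptions.
final class LegacyRewriteSketch {

    /** Groups old-format data values by data element, one bucket per new-format group. */
    static Map<String, ArrayNode> splitByDataElement(JsonNode oldPayload, ObjectMapper mapper) {
        Map<String, ArrayNode> byDataElement = new LinkedHashMap<>();
        for (JsonNode dv : oldPayload.path("dataValues")) {
            String de = dv.path("dataElement").asText();
            byDataElement
                .computeIfAbsent(de, key -> mapper.createArrayNode())
                .add(dv);
        }
        return byDataElement;
    }
}
```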
App
The new file format does not require a complicated app.
Mostly there is a button to select a file, a dry-run button and a run button.
As usual the app shows the progress once the import is started.
Timeline
A rough estimate of the required steps (BE only), each being about 1 week of focused work:
At that point the import can be used to power ADE and data value synchronization.
The effort required to switch these to the new import/export is not included.