Aggregate Data Value Import/Export 2.0 #355
Replies: 1 comment
FYI: While ADX currently is the only format that uses COs, it does not identify CC collisions and thus is not correct if a DS with a CC override is used.
Motivation
The aggregate data value import/export is one of the most essential foundations of the DHIS2 platform.
Yet it leaves much to be desired both for users and maintainers.
The main issues are:
Problem Causes
The main causes of the above issues are design flaws and feature creep.
Proposal
Data files need a header that is specific about the data that is contained in the file.
The data in the file must be ordered in a specified way to allow batching.
The format should be universal and flexible where possible, without compromising the ability to batch process.
Data Format
A file is a self-contained unit that defines a fixed scope and states the used key objects in a "header":
If only one identifier is given for a key dimension that dimension is implicit and omitted entirely in the data array.
If multiple identifiers are given for a key dimension the data array refers to the actual value for the row by index.
If multiple OUs and/or Periods are used the data has to be organized in the given order,
with Period being the 1st grouping dimension, and OU the 2nd, if both have multiple values.
Index references to the group header occur in the order: Period, OU, n x COs or m x COCs.
COs are given as an array of arrays for each category in the order defined by the CC,
or by using an object where the keys are the C identifiers and the values the COs for that category.
Additionally a value (row) array defines, always in the same order:
- the value
- a flag (default `false`)
- a flag (`null` = keep as is, or `false` for insert)
- a comment (`null` = keep, or none for insert)

Defaults can be omitted as long as they aren't followed by a non-default value.
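To make the layout concrete, below is a minimal sketch of one group, assuming hypothetical property names (`dataElement`, `periods`, `orgUnits`, `values`) and placeholder identifiers; none of these names are taken from the actual specification.

```java
// Illustrative sketch only: property names and identifiers are assumptions.
// One DE and one period are given, so both are implicit in the rows;
// two OUs are given, so each row starts with the OU index, followed by the value.
public final class ExampleGroup {

    static final String EXAMPLE = """
        {
          "dataElement": "DE_UID",
          "periods": ["202401"],
          "orgUnits": ["OU_A_UID", "OU_B_UID"],
          "values": [
            [0, 12],
            [1, 34]
          ]
        }
        """;
}
```

Because only one identifier is given for DE and Period, the rows carry just the OU index reference plus the value; all omitted parts use their defaults.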
Note
A file might also consist of an array of the above group objects.
In that case each group is its own import transaction (TX), similar to sending each of them in a separate file.
There is only one extra "option", `delete`, which can be set to `"soft"` or `"hard"` (or omitted to not delete).
If set to soft, all data matching the key dimensions of the header is marked deleted.
If set to hard, all data matching the key dimensions is deleted from the table.
Both can be used either with new data stated in the data array or without (size = 0).
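As an equally hedged illustration (same assumed property names), a group that uses the delete option without inserting new data could look like this:

```java
// Illustrative sketch only: property names are assumptions.
// "delete": "hard" removes all data matching the header's key dimensions;
// the empty data array (size = 0) means no new values are inserted.
public final class ExampleDeleteGroup {

    static final String EXAMPLE = """
        {
          "dataElement": "DE_UID",
          "periods": ["202401"],
          "orgUnits": ["OU_A_UID"],
          "delete": "hard",
          "values": []
        }
        """;
}
```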
Warning
DE + CO vs DE + COC vs DE + CO + CC vs DE + CO + DS
When using DE + CO as identifiers, the CC used isn't necessarily unique in case the DS connects to multiple DEs with a CC override.
In such cases the collision is detected and the import fails with a message asking the user to either provide a CC or a DS, or use COCs instead of COs.
When exporting, the user should be forced to select one of these options:
Data Format Specification
Notation:
- `?` : property is not always required
- `|` : or
- `[...]` : an array of ...

Values
The actual values of a data value are always strings in the DB.
However, the import format allows and encourages the use of specific JSON types:
Values do not need to be consistent in their use of types. Each row can use the type that fits best.
Conversion tools can always use string for simplicity.
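For instance, two rows for the same numeric DE could mix the JSON number type and its string form; a small sketch reusing the assumed layout from above:

```java
// Illustrative only: the first row uses a JSON number, the second the string
// form of the same kind of value; conversion tools may always fall back to strings.
public final class MixedTypeRows {

    static final String EXAMPLE = """
        [
          [0, 12],
          [1, "13"]
        ]
        """;
}
```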
Validation
When looking at the file format it becomes apparent that most validations target the header, as the data array
really just contains index references, a value, some flags and a comment.
The only value level validations that remain are ones like:
does the file the value refers to exist? (super rare value type FILE case). All other validations are really just consistency checks of the relations between the key dimensions as stated in the header.
Thus, all "expensive" validations can be performed in O(1).
Very complicated group level validation should be moved to data integrity checks instead.
The validations on the data array rows are simple pure function validators that do not need any metadata input once they are constructed,
and thus run fast in O(n) with n being the number of data rows.
The validation strategy follows that insight: a header level validation failure fails the entire import.
A value level validation failure or an out of bounds index reference ignores that row.
No modes or switches. Users can test the result by using a dry-run.
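A minimal sketch of that strategy, assuming a simplified row shape of `[ouIndex, value, ...]` (not the actual implementation): the metadata-dependent work happens once when the validator is built from the header, and the per-row check is a pure function.

```java
import java.util.List;
import java.util.function.Predicate;

// Sketch only: a row validator built once from header information (here just
// the OU count). The per-row check needs no metadata lookups, so validating
// n rows is O(n).
final class RowValidators {

    /** Builds a validator for rows of the assumed shape [ouIndex, value, ...]. */
    static Predicate<List<?>> forHeader(int orgUnitCount) {
        return row -> {
            if (row.size() < 2) return false;                 // needs at least index + value
            if (!(row.get(0) instanceof Integer index)) return false;
            return index >= 0 && index < orgUnitCount;        // out of bounds => row is ignored
        };
    }
}
```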
Batching and Stream Processing
The file layout allows reading only the header and running most validations.
Once these succeed, the data can be stream processed quite easily.
While the json-tree library doesn't work on input streams, it can perform an iteration through the JSON without holding on to memory that cannot be GCed once an array element has been processed.
Alternatively we could also use true JSON stream processing that already exists but would need to be integrated.
That means that as long as there is enough memory to fit the file being processed, the import only needs a fixed amount of extra
memory to process it. With that, files below 100MB in size are clearly no issue.
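Purely as an assumption about which existing library could be integrated, a sketch using the Jackson streaming API shows the general shape: read up to the data array, then hand over one row sub-array at a time, so only a single row is materialised in memory.

```java
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.InputStream;
import java.util.function.Consumer;

// Sketch only: the "values" property name is the same assumption as in the
// earlier format example, not part of the actual specification.
final class StreamingImportSketch {

    static void process(InputStream in, Consumer<JsonNode> rowConsumer) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        try (JsonParser p = mapper.getFactory().createParser(in)) {
            while (p.nextToken() != null) {
                // position on the (assumed) "values" property holding the data rows
                if (p.currentToken() == JsonToken.FIELD_NAME && "values".equals(p.getCurrentName())) {
                    if (p.nextToken() == JsonToken.START_ARRAY) {
                        // read one row sub-array at a time; only that row is in memory
                        while (p.nextToken() == JsonToken.START_ARRAY) {
                            JsonNode row = mapper.readTree(p);
                            rowConsumer.accept(row);
                        }
                    }
                }
            }
        }
    }
}
```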
Given the compact format a typical data value row will need around 10-20 bytes.
Assuming each takes 25 bytes, 1KB can fit 40 values, 1MB ~40K values, 100MB ~4 million values.
Since the data input is required to occur in the given Period and OU order when multiple OUs and/or Periods are used,
the fetching of the existing data values can be split into manageable chunks if wanted, or everything can be fetched in one query.
With the number of existing values and of values in the file known up-front, a decision can be made to batch in one of 4 modes:
SQL
The import should be implemented mostly using native SQL instead of using the object model.
Queries are (in order of occurrence):
The SQL for performing the import changes should be implemented in such a way that a dry-run can include the generated native SQL in the response, if the user desires it.
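A hedged sketch of such a native upsert (PostgreSQL syntax, issued via JDBC batching); the table name, column names and conflict constraint below are illustrative assumptions, not the actual DHIS2 schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch only: illustrates the "native SQL instead of the object model" idea.
final class NativeSqlUpsertSketch {

    // Hypothetical row holder with already resolved internal ids.
    record Row(long periodId, long orgUnitId, long cocId, String value) {}

    static final String UPSERT = """
        insert into datavalue (dataelementid, periodid, sourceid, categoryoptioncomboid, value)
        values (?, ?, ?, ?, ?)
        on conflict (dataelementid, periodid, sourceid, categoryoptioncomboid)
        do update set value = excluded.value
        """;

    static void upsert(Connection c, long dataElementId, List<Row> rows) throws Exception {
        try (PreparedStatement ps = c.prepareStatement(UPSERT)) {
            for (Row r : rows) {
                ps.setLong(1, dataElementId);
                ps.setLong(2, r.periodId());
                ps.setLong(3, r.orgUnitId());
                ps.setLong(4, r.cocId());
                ps.setString(5, r.value());
                ps.addBatch();               // batched per chunk to keep round-trips low
            }
            ps.executeBatch();
        }
    }
}
```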
Legacy Format Support
There are already multiple formats for aggregate data values: DXF in the form of JSON, CSV, XML or PDF, and ADX in the form of XML.
Adding yet another one on top of the existing code just adds more overall complexity.
Therefore, the goal must be to create file rewriting functions that rewrite old formats into the new one
so that the only processing pipeline needed is the new one.
The rewriting can be done either by a stand-alone command line tool or by the server at the outer controller layer.
This will allow old data files to be used but will remove some of the import options users could select in the app.
Since the new format always assumes a single DE, an old file containing multiple DEs has to be split into multiple files.
The new format is designed so rewriting is possible without additional knowledge, except for the identifier schema used.
One exception would be if a file uses AOCs that are associated with different datasets.
But this was never really correctly supported by the old import, so it should be fair to assume such files do not exist.
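A rough sketch of the splitting step, assuming a hypothetical old payload shape of `{ "dataValues": [ { "dataElement": ..., ... } ] }`: group the old rows by DE so each group can be emitted as one new-format file.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: both the old and the new property names are assumptions.
final class LegacyRewriteSketch {

    /** Groups old-format data values by data element, one bucket per new-format group. */
    static Map<String, ArrayNode> splitByDataElement(JsonNode oldPayload, ObjectMapper mapper) {
        Map<String, ArrayNode> byDataElement = new LinkedHashMap<>();
        for (JsonNode dv : oldPayload.path("dataValues")) {
            String de = dv.path("dataElement").asText();
            byDataElement
                .computeIfAbsent(de, key -> mapper.createArrayNode())
                .add(dv);
        }
        return byDataElement;
    }
}
```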
App
The new file format does not require a complicated app.
Mostly there is a button to select a file, a dry-run button and a run button.
As usual the app shows the progress once the import is started.
Timeline
A rough estimate of the required steps (BE only), each being about 1 week of focused work:
At that point the import can be used to power ADE and data value synchronization.
The effort required to switch these to the new import/export is not included.