Proposal to reference DSD from data message via agencyID and version #145

Open
hoehrmann opened this issue Apr 1, 2024 · 4 comments

@hoehrmann

hoehrmann commented Apr 1, 2024

Introduction:

The current SDMX-JSON data message specification requires that the data.structure object reference the Data Structure Definition (DSD) through a link to the Data Provision Agreement (DPA) or Dataflow (DF). This approach necessitates multiple lookups and parsing URN references to identify the relevant DSD. It also presents challenges when dealing with non-URN references or situations where the referenced web version becomes unavailable.

Proposed Improvement:

To enhance clarity, convenience, and validation capabilities, this proposal recommends including the agencyID and version of the referenced DSD directly within the data.structure object of SDMX-JSON messages. This eliminates the need for intermediary references through DPA or DF and streamlines the process of identifying the relevant DSD.

Benefits:

  • Reduced Lookups: Eliminates the need for additional requests to locate the DSD through DPA or DF.
  • Improved Convenience: Direct access to agencyID and version is very simple.
  • Improved Persistence: Ensures a persistent reference to the DSD even when web versions disappear.
  • Enhanced Validation: Facilitates easier validation of the message structure using JSON Schema.

Implementation:

  • Modify the data.structure object schema to include two additional properties:

    • agencyID: reflects DSD agencyID
    • version: reflects DSD version
  • Remove requirement to have a links property.
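For illustration only, a minimal Python sketch of how a consumer could read the DSD identification if the proposed properties existed. The helper name `dsd_reference` is hypothetical, and the sketch assumes the structure object carries `id` next to the proposed `agencyID` and `version`; none of this is part of the current specification.

```python
def dsd_reference(structure):
    """Return (agencyID, id, version) from a structure object.

    Assumption: the structure object has the proposed agencyID and
    version properties in addition to id. This is not in the current spec.
    """
    missing = [key for key in ("agencyID", "id", "version") if key not in structure]
    if missing:
        # Fail loudly so an incomplete reference is never silently accepted.
        raise ValueError("structure lacks: " + ", ".join(missing))
    return structure["agencyID"], structure["id"], structure["version"]


# Hypothetical structure object following the proposal.
structure = {"agencyID": "EXAMPLE", "id": "EXAMPLE", "version": "1.0.0"}
print(dsd_reference(structure))
```

No link traversal or URN parsing is needed in this variant; the three values are read directly.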

Conclusion:

This proposal offers a more efficient and reliable approach to referencing DSDs within SDMX-JSON messages. Direct inclusion of agencyID and version within the data.structure object simplifies data access, enhances validation, and ensures persistent DSD references, fostering a more streamlined and robust data exchange experience.

Alternative:

Having a datastructure URN reference would also be okay, but then id becomes redundant.

(If in doubt, please handle this as a public review comment on SDMX 3.1 once the comment period begins.)

@dosse
Contributor

dosse commented Apr 2, 2024

Hi @hoehrmann, the need is not clear:

  • Reduced Lookups: Eliminates the need for additional requests to locate the DSD through DPA or DF.
    --> Using the dataflow identification, it is very easy to obtain the DSD (through the references URL parameter) together with the dataflow artefact (and potentially other very relevant artefacts such as allowed content constraint, codelists, concept schemes, ...) in one single request. Additional requests should not be necessary. One call would always be necessary.
  • Improved Convenience: Direct access to agencyID and version is very simple.
    --> For what other purpose than retrieving the DSD?
  • Improved Persistence: Ensures a persistent reference to the DSD even when web versions disappear.
    --> The reference of the DF to the DSD is persistent. If "web versions disappear" then also the DF is gone, which will make the data "homeless". Having only the DF and DSD identifications is not sufficient.
  • Enhanced Validation: Facilitates easier validation of the message structure using JSON Schema.
    --> Could you please further clarify? What is your underlying validation process, where a single call for all related structures would not be sufficient?

@hoehrmann
Author

hoehrmann commented Apr 2, 2024

Summary:
The intent of the current requirement seems to be to ensure that the DSD can be identified from data messages, but the requirement as written is insufficient to achieve this. The idiomatic fix would be to add the agencyID and version properties, or to add a new mandatory property such as datastructure that contains the URN. It would also be possible to require that the dataStructure link has a urn property. However, implementations that read data messages would then have to go through all links, go through all rel values of each link, and check the urn field, which is more complicated than reading direct properties. A JSON Schema validation rule that checks that the DSD URN is present or can be computed would also be more complicated.

Details:

Right now the requirement in the specification is satisfied by:

data:
  structures:
    - links:
        - rel: dataStructure
          href: https://example.org/X347.xml

There is no way to infer the URN of the DSD from this. The message is valid because only some link to the DSD is required: specifying its URN is not required, and the DSD does not have to be hosted by an SDMX-REST web service (otherwise you could guess the needed URN parts from the URL). This could be addressed by requiring the urn property on the dataStructure link.
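To make the point concrete, here is a rough Python sketch of the lookup a reader has to perform today. The function name `find_dsd_urn` is hypothetical, and treating rel as a space-separated list of relation names is an assumption made for this sketch.

```python
def find_dsd_urn(message):
    """Return the urn of the first dataStructure link, or None.

    Assumption: rel may hold several space-separated relation names.
    """
    for structure in message.get("data", {}).get("structures", []):
        for link in structure.get("links", []):
            if "dataStructure" in link.get("rel", "").split():
                # The link exists, but nothing forces it to carry a urn.
                return link.get("urn")
    return None


# The minimal message from the example above, as a Python dict.
message = {
    "data": {
        "structures": [
            {"links": [{"rel": "dataStructure", "href": "https://example.org/X347.xml"}]}
        ]
    }
}
print(find_dsd_urn(message))  # None: the required link is there, the URN is not
```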

As for validation, take this example: https://sdmx.oecd.org/public/rest/v2/data/dataflow/OECD.SDD.NAD.SEEA/DSD_NAT_RES@DF_NAT_RES/1.0/CAN.A.T.LEAD.*.A?. It claims to be an SDMX-JSON 2.0.0 data message. Ignoring the error in the contentLanguages property, the incorrect use of ~ in dimension index lists, and the wrong time format for TIME_PERIOD, the message is valid according to the JSON Schema for SDMX-JSON data messages, even though it does not have the required link (it puts the link on the dataSet instead of the structure).

It is probably possible to amend the schema to require that in this specific case there must be one link with rel containing dataStructure, but it would increase the complexity of the schema.

As for lookups, the current requirement is also satisfied by referencing a provisioning agreement. Even if you are lucky and the link points to an SDMX-REST endpoint, you can probably only get the dataflow with a single request. In theory you could use references=descendants or references=all, but those are likely to return an unreasonable amount of data and/or might be disabled or throttled on public endpoints as denial-of-service protection.

As for persistence, if I knew the URN I could try to look it up elsewhere (e.g., I may have old data and the web server just changed its address), but without it I would have to guess the URN (based on the IDs of the fields).

@dosse
Contributor

dosse commented Apr 3, 2024

Thanks for the clarifications. The intent of the current link requirement is to ensure that the artefact (either DF, DSD or ProvisionAgreement) for which data have been requested can be fully identified from the data message. The choice of the artefact type is not arbitrary but must correspond to the artefact used in the original data request. This is what was meant by the wording "At least the link to the Data Structure Definition, Dataflow or Data Provision Agreement to which the data relates is required.", but the wording in the field guide can be improved. E.g., if data was requested for a dataflow, then the dataflow identification is required. If data was requested for a DSD, then the DSD identification is required. Also, in order for the full artefact identification to be available immediately, that link requires the usage of 'self' for the relationship and indeed the URN of the artefact, e.g.,

	"href": "https://registry.sdmx.org/ws/rest/dataflow/ECB.DISS/BSI_PUB/1.0",
	"rel": "self",
	"urn": "urn:sdmx:org.sdmx.infomodel.datastructure.dataflow=ECB.DISS:BSI_PUB(1.0)"

This information is sufficient to retrieve all required structure artefacts in one single request. This can be further clarified in the field guide.
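As a sketch of how a client might split such a URN into its parts, assuming the layout seen in the example above (`urn:sdmx:org.sdmx.infomodel.<package>.<class>=<agency>:<id>(<version>)`). The names `URN_PATTERN` and `parse_urn` are hypothetical, and the pattern is deliberately loose rather than a full grammar for SDMX URNs.

```python
import re

# Loose pattern for the URN layout shown above; not a complete URN grammar.
URN_PATTERN = re.compile(
    r"^urn:sdmx:org\.sdmx\.infomodel\.(?P<package>[^.]+)\.(?P<cls>[^=]+)"
    r"=(?P<agency>[^:]+):(?P<id>[^(]+)\((?P<version>[^)]+)\)$"
)


def parse_urn(urn):
    """Split an SDMX URN into package, class, agency, id and version."""
    match = URN_PATTERN.match(urn)
    if match is None:
        raise ValueError("not a recognised SDMX URN: " + urn)
    return match.groupdict()


# URN copied from the link example above.
parts = parse_urn("urn:sdmx:org.sdmx.infomodel.datastructure.dataflow=ECB.DISS:BSI_PUB(1.0)")
print(parts["agency"], parts["id"], parts["version"])  # ECB.DISS BSI_PUB 1.0
```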

If a client requires the structure information at a later time, then the client is free to extract and store the structure information at the same time as the data. If you need to get just the DSD for a DF, then you could use the references=datastructure parameter. Disabling structure retrieval through references as a 'denial of service' protection seems to me an unreasonable approach. Compared to data extractions, structure messages are usually much smaller.
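A small Python sketch of building such a request URL from the href of the 'self' link. The helper name `with_references` is hypothetical; the sketch only constructs the URL and sends no request, so whether a given endpoint honours the parameter is not checked here.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def with_references(href, references):
    """Append a references query parameter to a structure URL."""
    parts = urlsplit(href)
    query = urlencode({"references": references})
    if parts.query:
        # Keep any query string the href already has.
        query = parts.query + "&" + query
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


# href copied from the link example earlier in this comment.
url = with_references(
    "https://registry.sdmx.org/ws/rest/dataflow/ECB.DISS/BSI_PUB/1.0", "datastructure"
)
print(url)
# https://registry.sdmx.org/ws/rest/dataflow/ECB.DISS/BSI_PUB/1.0?references=datastructure
```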

I would conclude that this ticket specifically requests that the URN of the underlying artefact can be found in a more straightforward way (without looping through the links array), by taking it out of the links array and adding it as a separate structure property (similar to the SDMX-ML data messages) or, e.g., by requiring that link to be positioned as the first value of the links array.

For issues you find in the practical SDMX implementation https://sdmx.oecd.org/public/rest/, could you please open tickets in this separate code repository https://gitlab.com/sis-cc/.stat-suite/dotstatsuite-core-sdmxri-nsi-ws/-/issues/ ?

@hoehrmann
Author

I would like to add the following point: in a structure message, external structures are referenced like this:

data:
  dataStructures:
    - id: EXAMPLE
      agencyID: EXAMPLE
      version: "1.0.0"
      name: Example
      isExternalReference: true

Using links to reference a DSD in data messages is inconsistent with this pattern.
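For comparison, a short Python sketch showing that the three properties of such a stub are enough to derive a URN without touching any links. The helper name `dsd_urn` is hypothetical, and the `datastructure.DataStructure` package and class segment is my assumption about the usual SDMX URN form for a DSD, not something stated in this thread.

```python
def dsd_urn(stub):
    """Build a DSD URN from an external-reference stub.

    Assumption: DSD URNs use the datastructure.DataStructure class segment.
    """
    return (
        "urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure="
        f"{stub['agencyID']}:{stub['id']}({stub['version']})"
    )


# The stub from the example above, as a Python dict.
stub = {
    "id": "EXAMPLE",
    "agencyID": "EXAMPLE",
    "version": "1.0.0",
    "name": "Example",
    "isExternalReference": True,
}
print(dsd_urn(stub))
# urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=EXAMPLE:EXAMPLE(1.0.0)
```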
