A catalog metadata source format to support automatic ingestion #482

Open
jsheunis opened this issue Jul 3, 2024 · 1 comment · May be fixed by #484

jsheunis commented Jul 3, 2024

Context: https://github.com/psychoinformatics-de/org/pull/310

There are currently multiple catalog instances in production (ABCD-J, SFB1451, demo catalog, Public nEUro) that have heterogeneous maintenance workflows, i.e. different ways of providing and transforming metadata into a state that existing datalad-catalog commands can handle. This is not ideal.

To improve this situation, we can create, document, and publish a specification for a datalad-catalog compatible collection of dataset records in a well-defined format.

This will:

  • Enable users to create and maintain such collections without having to employ datalad-catalog tooling
  • Support disentangling inter-dependencies of formats and tooling to elevate accessibility
  • Reduce heterogeneity of catalog maintenance workflows
  • Support automation of generic aspects of catalog maintenance and rollout

After initial discussion, the following structure was produced:

- catalog.json (should this be versioned? e.g., `config/v1/...`)
- records/
  - <name-id>/
    - config.json
    - <version-id>/
      - ...<format-id>...
      - ...<format-id>...

These would be standalone "dataset-version" metadata records living in the presented structure on a file system, with a top-level configuration that supports per-catalog customizations. Metadata records may be in various formats (e.g. ScientificDataset YAML and tabby XLSX), i.e. the specification concerns the directory structure, not the file format or content of the records.
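
As a rough illustration of how tooling could consume such a layout, here is a minimal Python sketch that walks the structure and lists every dataset-version directory. This is not existing datalad-catalog functionality: `RecordVersion`, `iter_record_versions`, and `load_json_if_present` are hypothetical names made up for this example, and it assumes that `catalog.json` and the per-record `config.json` are plain JSON files.

```python
import json
from pathlib import Path
from typing import Iterator, List, NamedTuple


class RecordVersion(NamedTuple):
    """One <name-id>/<version-id>/ directory (hypothetical helper type)."""
    name_id: str
    version_id: str
    files: List[Path]  # metadata files found for this version, in any format


def iter_record_versions(root: Path) -> Iterator[RecordVersion]:
    """Yield one entry per records/<name-id>/<version-id>/ directory."""
    records_dir = root / "records"
    if not records_dir.is_dir():
        return
    for name_dir in sorted(p for p in records_dir.iterdir() if p.is_dir()):
        # config.json sits next to the version directories and is skipped
        # here because only subdirectories are treated as versions.
        for version_dir in sorted(p for p in name_dir.iterdir() if p.is_dir()):
            files = sorted(f for f in version_dir.iterdir() if f.is_file())
            yield RecordVersion(name_dir.name, version_dir.name, files)


def load_json_if_present(path: Path) -> dict:
    """Return the parsed JSON content, or an empty dict if the file is absent."""
    return json.loads(path.read_text()) if path.is_file() else {}
```

A caller could read the top-level configuration with `load_json_if_present(root / "catalog.json")` and then loop over `iter_record_versions(root)`, dispatching each file to a format-specific reader. The sketch deliberately does not inspect file contents, in line with the specification being about structure only.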

TODO

  • Document the specification for catalog maintainers:
    • I think it makes sense to do this as part of the datalad-catalog documentation, perhaps as a new "Metadata Ingestion" or "Metadata Source Specification" section
    • It could be useful to have an additional user-specific description of the same structure, i.e. not for maintainers of a catalog but rather for people who need to deposit the metadata in the specified structure in some location. This would be documentation that could be reused in any deployment that describes "How do I add my metadata?" to users.
  • Create issues relating to the implementation that would support ingestion of this format

jsheunis commented Jul 4, 2024

Been reading through the existing documentation and I think the best candidate for placing this new addition would be the Pipeline Description section, which describes a functioning but outdated view of generating a catalog entry from a datalad dataset using metalad and catalog translators. I think that whole page can be rewritten with the focus being the content proposed in the current issue.

Afterwards, we should also revamp/update the Metadata and datalad-catalog page to bring it in line with the metadata source description.
