Replies: 7 comments 12 replies
-
In NOMAD we use regular expressions on file names, MIME types, and file contents to match parsers to files. This lets us determine with reasonable effort whether a parser is applicable. It's not always 100% correct, but it is practical.
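A minimal sketch of this kind of matching, using only the standard library. The `ParserRule` class, its field names, and the two example rules are hypothetical and are not NOMAD's actual API; they only illustrate combining patterns on file name, MIME type, and the first part of the file contents.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParserRule:
    """Hypothetical matching rule; names are illustrative, not NOMAD's API."""
    name: str
    filename_re: Optional[str] = None  # must match the whole file name
    mimetype_re: Optional[str] = None  # must match the whole MIME type
    contents_re: Optional[str] = None  # searched in the head of the file

    def matches(self, filename: str, mimetype: str, head: str) -> bool:
        # Every pattern that is set must match; unset patterns are ignored.
        if self.filename_re and not re.fullmatch(self.filename_re, filename):
            return False
        if self.mimetype_re and not re.fullmatch(self.mimetype_re, mimetype):
            return False
        if self.contents_re and not re.search(self.contents_re, head):
            return False
        return True


def find_parsers(rules, filename, mimetype, head):
    """Return the names of all rules that accept the file."""
    return [r.name for r in rules if r.matches(filename, mimetype, head)]


# Made-up rules for two made-up formats.
RULES = [
    ParserRule("example-xy", filename_re=r".*\.xy", mimetype_re=r"text/.*"),
    ParserRule(
        "example-json",
        mimetype_re=r"application/json",
        contents_re=r'"format"\s*:\s*"demo"',
    ),
]

print(find_parsers(RULES, "scan_001.xy", "text/plain", "1.0 2.0\n"))
```

Because the rules are cheap to evaluate, several can match the same file; a real registry would still need a policy for ranking or rejecting ambiguous matches.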
-
The usual schema-language suspects (such as those listed above) are typically not tailored to scientific data, in either expressiveness or scale. E.g., it is cumbersome to define multi-dimensional data and handle it efficiently (json, xml, csv instead of hdf5, pandas, numpy, ...). Maybe we should not just treat the output as another stream/file, but as data in, e.g., a Python runtime.
-
Related to the discussion of multi-dimensional data was the idea of calculating derived quantities. To clarify in a comment some potential situations... Let's say you have some file in a (purposefully gross) home-baked format that looks something like:
Are we putting any constraint on what a "parser" or "metadata extractor" (by our definition) should return? e.g.,
We obviously want to allow/encourage 1, 2 and 3. Perhaps 4 if the average value is well-described. 5 would be useful for actual re-use, yet the conversion is somehow lossy, as is 6. The answer, I guess, would be "whatever the parser wants", but should these differences somehow be expressed in the schema to make re-use easier? (E.g., an ELN probably does not want to store/index huge useless arrays in primary data, but automated use of a parser that does 2/3/4 would do this by default.)
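One way such differences could be expressed is sketched below, purely as an assumption: each returned quantity carries a label saying whether it is raw, derived, or lossy, plus its size, so that a consumer such as an ELN can filter before indexing. The `Provenance` vocabulary, the `Quantity` class, and the `select_for_index` helper are all hypothetical and are not taken from any existing schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any


class Provenance(Enum):
    # Hypothetical labels; the thread does not define this vocabulary.
    RAW = "raw"          # value copied verbatim from the file
    DERIVED = "derived"  # computed from raw values that are still available
    LOSSY = "lossy"      # converted or summarised; original not recoverable


@dataclass
class Quantity:
    name: str
    value: Any
    provenance: Provenance
    size: int = 1  # number of scalar elements in `value`


def select_for_index(quantities, max_size=100, allowed=frozenset(Provenance)):
    """Names of quantities small enough, and of an allowed kind, to index."""
    return [
        q.name
        for q in quantities
        if q.size <= max_size and q.provenance in allowed
    ]


signal = list(range(10_000))
output = [
    Quantity("sample_id", "abc-1", Provenance.RAW),
    Quantity("signal", signal, Provenance.RAW, size=len(signal)),
    Quantity("mean_signal", sum(signal) / len(signal), Provenance.DERIVED),
]

# An ELN-like consumer skips the large array but keeps the small values.
print(select_for_index(output))
```

With labels like these, the parser can still return "whatever it wants", while the decision about what to store moves to the consumer.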
-
After the first office hours today we made repos for each of the main topics. I hope this discussion thread can continue for general topics, but if we want to comment on specific ideas/code then we can use PRs over at https://github.com/marda-alliance/metadata_extractors_schema (you will see a similar comment on each of the other threads with the appropriate link).
-
After some feedback from @PeterKraus, I've just merged a draft file type schema at https://github.com/marda-alliance/metadata_extractors_schema that we can begin iterating on. The schema is authored as YAML using LinkML and can be converted into many different formats (e.g., JSONSchema, auto-generated Python models, etc.). Before the next meeting we will work on a demo of the registry and API based on this schema. Please take a look if you are interested!
-
Parsing of coupled files
This was a discussion item raised in our Office Hours on 2023-01-24: What do we do for As an example, @ml-evs mentioned the Bruker Topspin files, where the nmrglue library requires the parent folder or "stem" of the files: https://nmrglue.readthedocs.io/en/latest/examples/proc_bruker_1d.html#instructions I see two options on how to deal with this:
I'm in favour of 1), as in both cases we would still need to touch the
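To make the coupled-files case concrete, here is a minimal sketch of handing an extractor the parent folder instead of the matched file. It assumes a hypothetical per-extractor flag, `needs_directory`, which is not part of the draft schema, and it does not call nmrglue itself; the example path is made up.

```python
from pathlib import Path


def resolve_target(path: str, needs_directory: bool) -> Path:
    """Return what the extractor should be given for a matched file.

    `needs_directory` is a hypothetical per-extractor flag: when set, the
    extractor receives the folder that contains the matched file, so it can
    find the sibling files it depends on.
    """
    p = Path(path)
    return p.parent if needs_directory else p


# A file matched inside an experiment folder resolves to that folder.
print(resolve_target("data/exp1/fid", needs_directory=True))
# An extractor for a standalone file keeps the path unchanged.
print(resolve_target("data/scan_001.xy", needs_directory=False))
```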
-
Metadata vs data
OK, at the last Office Hours (2023-01-24) we agreed it is time to discuss this. This might be of particular importance for resolving #7 (comment), as I imagine a two-pass extraction might go a long way:
The questions I would like opinions on from "the community" are:
-
A lightweight metadata schema for parsers and associated tooling for software libraries to self-report: