
Distribution composed of more than one file, but not packaged #482

Closed
dr-shorthair opened this issue Oct 22, 2018 · 27 comments

@dr-shorthair (Contributor) commented Oct 22, 2018

A Distribution may be composed of multiple files which cannot be used independently, such as a shapefile and its attendant sidecars (index and database files). These might not be packaged into a single distributable artefact such as a tar or zip archive (see #54 and #259). So a dataset's distribution, while a single entity, is composed of multiple artefacts. We need to show patterns for how these will appear in a catalog.
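For illustration, a minimal Turtle sketch of the situation (all URIs are hypothetical): one logical distribution whose content is spread across a shapefile and its sidecar files, with no archive for dcat:downloadURL to point at.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

<http://example.org/dataset/roads> a dcat:Dataset ;
    dcat:distribution <http://example.org/dataset/roads/shapefile> .

# One logical distribution, three physical files (roads.shp, roads.shx, roads.dbf)
# that are only usable together. dcat:downloadURL is defined for a single
# downloadable file, so there is no obvious place to list all three.
<http://example.org/dataset/roads/shapefile> a dcat:Distribution ;
    dct:title "Roads (Esri Shapefile, unpackaged)" ;
    dcat:accessURL <http://example.org/files/roads/> .
```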

@agbeltran (Member)

One particular case of a distribution with multiple files is that of checksum files. This is something we want to include in DATS (see datatagsuite/schema#11) and it would be good to have specific vocabulary to refer to them.

@agreiner (Contributor) commented Nov 7, 2018

I have a use case where we have log files that we release as datasets. One "log" could contain hundreds of raw files, making it impractical to treat each as a separate dataset. They are often grouped in multiple layers of directories, such as a directory per cabinet, within that a directory per chassis, within that a directory per slot, etc. We've been releasing them as a single gzip, but they are getting so big that it's impractical to use a simple HTTP download. (We offer Globus instead.)

@makxdekkers (Contributor)

This seems to be a completely new requirement. And, given the resource and time constraints, I think it will be hard to incorporate a solution in the current revision of DCAT. I also think that it might be difficult to define a single approach/vocabulary that covers all possible relationships -- even the three examples from @dr-shorthair, @agbeltran and @agreiner seem to require different solutions.
In the current revision, you could use dcat:accessURL to point to a landing page where you link to the various files and explain their relationships. The use of dcat:downloadURL is only possible for a single file according to the definition: "The URL of the downloadable file in a given format".
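A minimal Turtle sketch of that workaround, with hypothetical URIs:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# The landing page lists the individual files and explains how they relate;
# no dcat:downloadURL is given, since that property is defined for a single file.
<http://example.org/dist/logs-2018> a dcat:Distribution ;
    dct:description "Multi-file distribution; see the linked page for the file list." ;
    dcat:accessURL <http://example.org/logs/2018/index.html> .
```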
Should we leave this for future work?

@dr-shorthair (Contributor, Author) commented Nov 7, 2018

@agreiner I wonder if this is just another case of part-whole, as discussed in #411 (proposed new property dcat:componentDistribution) ?

@makxdekkers Yes - I tend to agree - our docket is quite full. However, perhaps we can recommend the solution agreed for #256 (i.e. use dct:relation) and leave stronger semantics to another day.
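A rough Turtle sketch of both options (URIs are hypothetical; dcat:componentDistribution is only the proposal from #411, not an adopted DCAT term, so it is shown commented out):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# Weak semantics, available today: relate the sibling files with dct:relation.
<http://example.org/dist/roads-shp> a dcat:Distribution ;
    dcat:downloadURL <http://example.org/files/roads.shp> ;
    dct:relation <http://example.org/files/roads.shx> ,
                 <http://example.org/files/roads.dbf> .

# Stronger semantics, as proposed (not adopted) in #411:
# <http://example.org/dist/roads> dcat:componentDistribution
#     <http://example.org/dist/roads-shp> , <http://example.org/dist/roads-shx> .
```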

@agreiner (Contributor) commented Nov 7, 2018

dcat:componentDistribution seems like it would work for my use case. dct:relation, if restricted to files that have no relationship, would not. The tricky bit is to find a place to use componentDistribution. I don't fancy giving people separate RDF descriptions for each of the hundreds of files in a set. I would really like to have a property of a distribution that is a list of component parts, by relative path, sort of a manifest, or maybe even just a URI for a manifest. But we are creating our own vocabulary for handling log files, so we will invent our own for that if DCAT can't cover it.

@makxdekkers (Contributor)

@agreiner You can always use dcat:accessURL and link to a page or directory where your files live, with a README file to explain what the files are. Or even use dcat:landingPage on the Dataset description to link to such a page.
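A minimal sketch of the dcat:landingPage variant (hypothetical URIs):

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .

# The dataset links to a page/directory holding the files plus a README.
<http://example.org/dataset/system-logs> a dcat:Dataset ;
    dcat:landingPage <http://example.org/logs/README.html> ;
    dcat:distribution [
        a dcat:Distribution ;
        dcat:accessURL <http://example.org/logs/>
    ] .
```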

@agreiner (Contributor) commented Nov 7, 2018

That would certainly work for the purposes of getting things into a catalog in a helpful way. What I've been hoping for is a way to reason about these things using RDF. I realize the main purpose of DCAT is just to let people catalog datasets, but it is so close to being useful for this use case, and there are now several related use cases that seem like they would benefit, that it just seems a shame not to seize the opportunity.

@davebrowning (Contributor)

While it's clear that we have quite a bit to do, I'd prefer that we don't absolutely rule a problem area out quite yet. We have a list of high-priority requirements referenced here, as discussed and agreed at the F2F, and a target date of mid-January for the rec-track work. We've also talked about generating more examples and/or a primer after that date. That plan gives us some flexibility in how we address this issue - extend the timescales of the rec-track work, provide examples and suggestions, or leave it for a further iteration. (All subject to agreement within the WG and the broader W3M.)

On this specific case, I would agree with @agreiner's comment - it would be good to 'seize the opportunity' - and it would be great if @dr-shorthair's pithy summary "a dataset's distribution, while a single entity, is composed of multiple artefacts" found its way into the recommendation, even if just as a comment. I don't think it has higher priority than the requirements that we're focussing on now. But I hope we have the luxury of deferring any final inclusion decisions to January.

@smrgeoinfo (Contributor)

For distributions that are aggregations, pointing to something like an OAI-ORE resource map would be a solution. See also the DataONE discussion of packaging for an implementation.
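A rough Turtle sketch of that pattern (hypothetical URIs), with the distribution pointing at an ORE resource map that describes the aggregation of files:

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix ore:  <http://www.openarchives.org/ore/terms/> .

# The distribution points at an OAI-ORE resource map, which in turn
# describes the aggregation of the individual files.
<http://example.org/dist/roads> a dcat:Distribution ;
    dcat:accessURL <http://example.org/ore/roads.rdf> .

<http://example.org/ore/roads.rdf> a ore:ResourceMap ;
    ore:describes <http://example.org/ore/roads#aggregation> .

<http://example.org/ore/roads#aggregation> a ore:Aggregation ;
    ore:aggregates <http://example.org/files/roads.shp> ,
                   <http://example.org/files/roads.shx> ,
                   <http://example.org/files/roads.dbf> .
```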

@makxdekkers (Contributor)

Based on the discussion at https://www.w3.org/2019/02/05-dxwgdcat-minutes, I will draft a proposal suggesting that a situation where several files need to be considered together could be handled by using dcat:accessURL to link to a page that lists the files and describes their relationships.

@makxdekkers (Contributor)

I think this is taken care of in #730?

@davebrowning added the "due for closing" label (issue to be closed if there are no objections within 6 days) on Feb 26, 2019
@davebrowning (Contributor)

Addressed in #746 and closing as agreed at https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08.

@davebrowning removed the "due for closing" label on Feb 28, 2019
@agreiner (Contributor)

Hm, #746 doesn't really address my use case. Indicating the compression and packaging algorithms used to combine multiple files into one doesn't help with the case where we have multiple separate files and don't want to combine them.

@makxdekkers (Contributor)

@agreiner I agree that #746 doesn't address your use case. As far as I understand, a solution using dcat:accessURL or dcat:landingPage would give you a way to link to a set of files, although I admit that this places the solution outside of DCAT.
Would something like an rdf:Bag help here?
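A rough sketch of what the rdf:Bag idea might look like (hypothetical URIs; the linking property, dct:hasPart here, is an arbitrary choice, since DCAT does not define one for this purpose):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dct:  <http://purl.org/dc/terms/> .

# An rdf:Bag enumerating the member files of one distribution.
<http://example.org/dist/logs-2018> a dcat:Distribution ;
    dct:hasPart [
        a rdf:Bag ;
        rdf:_1 <http://example.org/logs/2018/cabinet01/chassis01/slot01.log> ;
        rdf:_2 <http://example.org/logs/2018/cabinet01/chassis01/slot02.log>
    ] .
```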

@davebrowning reopened this on Mar 1, 2019
@davebrowning (Contributor) commented Mar 1, 2019

Re-opening issue to track discussion between @agreiner and @makxdekkers - unless you want to open a separate new issue to discuss that use case.

@agreiner (Contributor) commented Mar 1, 2019

I'm interested in finding a way that specific files within a dataset can be described in DCAT such that they can be reasoned about as individual entities. That would allow us to store metadata for each file from a set of log files and perform queries about temporal coverage or subject. Since we've now agreed that distributions can be informationally different, it seems that they come very close to fitting the bill. We would just need to add attributes to distributions that distinguish them from each other. Perhaps temporal coverage and subject.

@makxdekkers (Contributor)

@agreiner I agree with you that this is an interesting discussion and we addressed this in several issues over the last year, e.g. #52, #317, #411, #531.
@dr-shorthair came to a conclusion at #531 (comment), that "the consensus is that anything short of losslessly-convertible would be use-case specific". @davebrowning wrote at #411 (comment): "As it stands this appears to be an issue best addressed using profiles".
I don't know what more we can do on this, and I'd like to hand this back to the editors @agbeltran, @davebrowning, @dr-shorthair and @pwin.

@agbeltran (Member)

While we agreed that we shouldn't require distributions to be informationally equivalent, as in some cases we need more flexibility, we also discussed that the distinction between dataset and distribution is important.

Ensuring that distributions are informationally equivalent would potentially require automated transformations between them. So we still leave it to the data providers' judgement to determine what can be distributions of a dataset, but we don't want to encourage a dataset having multiple unrelated distributions.

Including at the distribution level properties that now belong to the dataset level (such as subject and temporal coverage) would blur the distinction between dataset and distribution. Thus, I think that if you need to specify the subject or temporal coverage (or similar properties) of a distribution, you should consider whether they are actually distributions of different datasets.

@agbeltran (Member)

@agreiner maybe you have in mind a specific example where the distinction between dataset and distribution is maintained and there is still a need to provide more details about the distribution - if so, please let us know. Otherwise, I think that adding more properties to the distribution will not be helpful.

@agreiner (Contributor) commented Mar 2, 2019

My use case here is the one I described at the top of the thread, where we have log data for supercomputing systems. I realize this is a very specific domain, so I wanted to suggest properties that would be more general. By "subject", I meant to identify the particular node for which the log file records data. It is "subject" in the sense of a thing being operated upon, not an abstract topic or domain. The nodes are all part of the same system, so they are indeed very closely related. A set of files for different nodes all record the same fields.

You could think of the nodes as parts of a whole, so maybe "part" or "partOfWhole" would make more sense. Others may have better ideas as to what to name such a property. I just didn't want to be as domain-specific as to say "node", of course. I suppose similar issues could be addressed in geographic datasets by allowing locales to be identified as parts.

Back when distributions just varied by media type, they had a property to describe the media type. It just seems that once we are choosing to allow them to differ more, we should provide analogous properties to describe these new ways of differing. We seem to be okay with adding spatial and temporal resolution properties to fulfill that need; why not enable other common differences as well, if it can be done cleanly?

@makxdekkers (Contributor)

@agreiner As far as I can see, there are two ways to understand "not informationally equivalent". You can read it as "don't have to be the same data", as in your case where the same kind of data is recorded for different entities, such as 'nodes', 'sensors', 'stations' or what have you; and you can read it in the sense of "not exactly the same", for example as a result of profiling or lossy transformation.
From my understanding of the earlier discussion, the consensus seemed to be the second interpretation, because we felt that requiring exact equivalence was too strict -- but I don't think we agreed that distributions can contain different data.

@agbeltran (Member) commented Mar 5, 2019

Further to @makxdekkers' comment, indeed by saying that distributions don't need to be strictly fully informationally equivalent, we don't mean that they can hold totally different data (and I am not talking about the data type as in @agreiner's example of log files, but about the data itself). So, in your use case @agreiner, those would be different datasets and not different distributions of the same dataset.

The ED currently states:

In some cases all distributions of a dataset will be fully informationally equivalent, in the sense that lossless transformations between the representations are possible. An example would be different serializations of an RDF graph using RDF/XML, Turtle, N3, JSON-LD. However, in other cases the distributions might have different levels of fidelity to the underlying data. For example, a graphical representation alongside a CSV file. The question of whether different representations can be understood to be distributions of the same dataset is use-case specific, so the judgement is the responsibility of the provider.

In my opinion, that text clarifies the points we made a few times in this discussion. The example of a CSV file alongside a graphical representation shows that different representations may not convey identical information, but we are not implying that they can be totally different.
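A minimal Turtle sketch of that distinction (hypothetical URIs; media types shown as IANA registration IRIs): two informationally equivalent serializations of the same graph, plus a lower-fidelity graphical rendering.

```turtle
@prefix dcat: <http://www.w3.org/ns/dcat#> .

<http://example.org/dataset/vocab> a dcat:Dataset ;
    # Informationally equivalent: two lossless serializations of the same graph.
    dcat:distribution [
        a dcat:Distribution ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/text/turtle> ;
        dcat:downloadURL <http://example.org/vocab.ttl>
    ] , [
        a dcat:Distribution ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/application/ld+json> ;
        dcat:downloadURL <http://example.org/vocab.jsonld>
    ] ,
    # Lower fidelity: a graphical rendering of the same data.
    [
        a dcat:Distribution ;
        dcat:mediaType <https://www.iana.org/assignments/media-types/image/png> ;
        dcat:downloadURL <http://example.org/vocab-diagram.png>
    ] .
```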

@agreiner do you think we need to add further clarifications on this? If so, can you please suggest some text? Thanks

@agbeltran (Member)

Link to the relevant section in the ED: https://w3c.github.io/dxwg/dcat/#Class:Distribution

@agbeltran (Member)

I added a bit more detail in the note about distributions - see PR: #789

@agreiner (Contributor) commented Mar 7, 2019

I think this is much clearer now. Thanks! It looks like our log files use case is going to be awkward for DCAT, but we are developing our own extension that should work.

@davebrowning modified the milestones: DCAT Backlog, DCAT CR on Mar 14, 2019
@davebrowning added the "due for closing" label (issue to be closed if there are no objections within 6 days) on Apr 2, 2019
@davebrowning (Contributor)

This now looks ready to close - the ED is clear enough about what it means. There might be additional requirements around the log files use case, but that would be future work.

@davebrowning removed the "due for closing" label on Apr 8, 2019