Distribution composed of more than one file, but not packaged #482
One particular case for the multiple files of a distribution is that of the checksum files. This is something we want to include in DATS (see datatagsuite/schema#11) and it would be good to have specific vocabulary to refer to them.
I have a use case where we have log files that we release as datasets. One "log" could contain hundreds of raw files, making it impractical to treat each as a separate dataset. They are often grouped in multiple layers of directories, such as a directory per cabinet, and within that a directory per chassis, and within that a directory per slot, etc. We've been releasing them as a single gzip, but they are getting so big that it's impractical to use a simple HTTP download. (We offer Globus instead.)
This seems to be a completely new requirement. And, given the resource and time constraints, I think it will be hard to incorporate a solution in the current revision of DCAT. I also think that it might be difficult to define a single approach/vocabulary that covers all possible relationships -- even the three examples from @dr-shorthair, @agbeltran and @agreiner seem to require different solutions.
@agreiner I wonder if this is just another case of part-whole, as discussed in #411 (proposed new property @makxdekkers Yes - I tend to agree - our docket is quite full. However, perhaps we can recommend the solution agreed for #256 (i.e. use |
dcat:componentDistribution seems like it would work for my use case. dct:relation, if restricted to files that have no relationship, would not. The tricky bit is to find a place to use componentDistribution. I don't fancy giving people separate RDF descriptions for each of the hundreds of files in a set. I would really like to have a property of a distribution that is a list of component parts, by relative path, sort of a manifest, or maybe even just a URI for a manifest. But we are creating our own vocabulary for handling log files, so we will invent our own for that if DCAT can't cover it.
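A minimal sketch of the manifest idea described above, assuming the component files sit under one local directory. The function names `build_manifest` and `write_manifest` and the JSON field names (`path`, `bytes`, `sha256`) are hypothetical; they are not part of DCAT or any agreed vocabulary. A checksum is included per file because checksums came up earlier in the thread.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(root):
    """List every file under root by relative path, with size and SHA-256."""
    root = Path(root)
    entries = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        entries.append({
            # POSIX-style relative path, e.g. "cabinet0/chassis1/slot2.log"
            "path": path.relative_to(root).as_posix(),
            "bytes": path.stat().st_size,
            # Reads the whole file into memory; very large logs would need
            # chunked hashing instead.
            "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        })
    return entries


def write_manifest(root, manifest_path):
    """Serialise the manifest as JSON so a distribution could point to it by URI."""
    Path(manifest_path).write_text(json.dumps(build_manifest(root), indent=2))
```

Writing the manifest outside `root` keeps it from listing itself on later runs. A catalog record could then carry a single link to the manifest rather than one RDF description per file.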
@agreiner You can always use |
That would certainly work for the purposes of getting things into a catalog in a helpful way. What I've been hoping for is a way to reason about these things using RDF. I realize the main purpose of DCAT is just to let people catalog datasets, but it is so close to being useful for this use case, and there are now several related use cases that seem like they would benefit. It just seems a shame not to seize the opportunity.
While it's clear that we have quite a bit to do, I'd prefer that we don't absolutely rule a problem area out quite yet. We have a list of high priority requirements referenced here as discussed and agreed at the F2F, and a target date of mid-January for the rec-track work. We've also talked about generating more examples and/or a primer after that date. That plan gives us some flex on how we can talk to this issue - extend timescales of the rec-track work, provide examples and suggestions or leave it for a further iteration. (All subject to agreement within the WG and the broader W3M.) On this specific case, I would agree with @agreiner's comment - it would be good to 'seize the opportunity' - and that it would be great if @dr-shorthair's pithy summary "a dataset's distribution, while a single entity, is composed of multiple artefacts" found its way into the recommendation even if just as a comment. I don't think it has higher priority than the requirements that we're focussing on now. But I hope we have the luxury of deferring any final inclusion decisions to January.
For distributions that are aggregations, pointing to something like an OAI-ORE resource map would be a solution. See also DataOne discussion of packaging for an implementation.
Based on the discussion at https://www.w3.org/2019/02/05-dxwgdcat-minutes, I will draft a proposal to suggest that a situation where several files need to be considered together could be handled by using
I think this is taken care of in #730?
Addressed in #746 and closing as agreed at https://www.w3.org/2019/02/27-dxwgdcat-minutes#x08.
Hm, #746 doesn't really address my use case. Indicating the compression and packaging algorithms used to combine multiple files into one doesn't help with the case where we have multiple separate files and don't want to combine them.
Re-opening issue to track discussion between @agreiner and @makxdekkers - unless you want to open a separate new issue to discuss that use case.
I'm interested in finding a way that specific files within a dataset can be described in DCAT such that they can be reasoned about as individual entities. That would allow us to store metadata for each file from a set of log files and perform queries about temporal coverage or subject. Since we've now agreed that distributions can be informationally different, it seems that they come very close to filling the bill. We would just need to add attributes to a distribution that distinguish them from each other. Perhaps temporal coverage and subject.
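To make the kind of query meant here concrete, the following is a rough sketch in plain Python rather than RDF. The `FileRecord` class, its field names, and the `covering` function are all hypothetical illustrations; nothing here is a DCAT property or an agreed design.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class FileRecord:
    path: str        # relative path of the file within the set
    node: str        # hypothetical "subject": the node the log records
    start: datetime  # start of the file's temporal coverage
    end: datetime    # end of the file's temporal coverage


def covering(records, start, end, node=None):
    """Return records whose coverage overlaps [start, end], optionally for one node."""
    return [
        r for r in records
        if r.start <= end and r.end >= start
        and (node is None or r.node == node)
    ]
```

The overlap test treats both ends of each interval as inclusive. The same filter could be expressed as a SPARQL query if each file carried equivalent properties in RDF.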
@agreiner I agree with you that this is an interesting discussion and we addressed this in several issues over the last year, e.g. #52, #317, #411, #531.
While we agreed that we shouldn't impose that distributions should be informationally equivalent, as in some cases we need more flexibility, we also discussed that the distinction between dataset and distributions is important. To ensure that distributions are informationally equivalent would potentially require automated transformations between distributions. So, we still leave it to a judgment call of the data providers to determine what can be distributions of a dataset, but we don't want to encourage a dataset having multiple distributions that are unrelated. Including at the distribution level properties that now belong to the dataset level (such as subject and temporal coverage) would blur the distinction between dataset and distribution. Thus, I think that if you need to specify the subject or temporal coverage (or similar properties) of a distribution, you should be considering if they are actually distributions of different datasets. |
@agreiner maybe you have in mind a specific example where the distinction between dataset and distribution is maintained and there is still a need to provide more details about the distribution - if so, please let us know, but otherwise I think that adding more properties to the distribution will not be helpful.
My use case here is the one I described at the top of the thread, where we have log data for supercomputing systems. I realize this is a very specific domain, so I wanted to make suggestions of properties that would be more general. By "subject", I meant to identify the particular node for which the log file records data. It is "subject" in the sense of a thing being operated upon, not an abstract topic or domain. The nodes are all part of the same system, so they are indeed very closely related. A set of files for different nodes all record the same fields. You could think of the nodes as parts of a whole, so maybe "part" or "partOfWhole" would make more sense. Others may have better ideas as to what to name such a property. I just didn't want to be as domain specific as to say "node", of course. I suppose similar issues could be addressed in geographic datasets by allowing for identifying locales as parts. Back when distributions just varied by media type, they had a property to describe the media type. It just seems that once we are choosing to allow them to differ more, we should provide analogous properties to describe these new ways of differing. We seem to be okay with adding spatial and temporal resolution properties to fulfill that need; why not enable other common differences as well, if it can be done cleanly? |
@agreiner As far as I can see, there are two ways to understand "not informationally equivalent". You can read it as "don't have to be the same data", as in your case where the same kind of data is recorded for different entities, such as 'nodes', 'sensors', 'stations' or what have you; and you can read it in the sense of "not exactly the same", for example as a result of profiling or lossy transformation.
Further to @makxdekkers' comment, indeed by saying that distributions don't need to be strictly fully informationally equivalent, we don't mean that they can hold totally different data (and I am not talking about the data type as in @agreiner's example of log files, but about the data itself). So, in your use case @agreiner, those would be different datasets and not different distributions of the same dataset. The ED currently states:
In my opinion, that text clarifies the points we made a few times in this discussion. The example given about a CSV file and a graphical representation shows that in terms of the information they convey, different representations may not be identical in information, but we are not implying that they can be totally different. @agreiner do you think we need to add further clarifications on this? If so, can you please suggest some text? Thanks
Link to the relevant section in the ED: https://w3c.github.io/dxwg/dcat/#Class:Distribution |
I added a bit more detail in the note about distributions - see PR: #789
I think this is much more clear now. Thanks! It looks like our log files use case is going to be awkward for DCAT, but we are developing our own extension that should work.
This now looks ready to close - the ED is clear enough about what it means. There might be additional requirements around the log files use case, but that would be future work.
As notified/discussed: https://www.w3.org/2017/dxwg/wiki/Meetings:Telecon2019.04.02
A Distribution may be composed of multiple files which cannot be used independently, such as a shapefile and its attendant sidecars (index and database files). These might not be packaged into a single distributable artefact, such as a tar or zip archive (see #54 and #259). So a dataset's distribution, while a single entity, is composed of multiple artefacts. We need to show patterns about how these will appear in a catalog.
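As an illustration of why such files cannot be used independently, here is a small sketch that groups a file listing by base name and reports shapefiles whose mandatory sidecars are missing. It assumes the usual shapefile convention that a `.shp` main file needs a matching `.shx` index and `.dbf` attribute table; the function names `group_components` and `incomplete` are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Mandatory parts of a shapefile: main file, index, attribute table.
REQUIRED = {".shp", ".shx", ".dbf"}


def group_components(paths):
    """Map each base name (path without its last suffix) to the suffixes found."""
    groups = defaultdict(set)
    for p in paths:
        pp = PurePosixPath(p)
        groups[str(pp.with_suffix(""))].add(pp.suffix.lower())
    return dict(groups)


def incomplete(paths):
    """Return {base name: missing suffixes} for shapefiles lacking a mandatory part."""
    return {
        stem: sorted(REQUIRED - exts)
        for stem, exts in group_components(paths).items()
        if ".shp" in exts and REQUIRED - exts
    }
```

For example, `incomplete(["roads.shp", "roads.shx", "roads.dbf", "rivers.shp", "rivers.dbf"])` reports that `rivers` lacks its `.shx` index. Sidecars with double suffixes, such as `roads.shp.xml`, fall into a separate group under this simple rule.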