Further clarification for distributions #789
I am just wondering if we would want to make the note even more clear by explicitly saying that all distributions of a dataset should contain (be about?) the same data. Maybe even give an example like "For example, budget data for different years or observations from different sensors should be modelled as different datasets"? |
The clarification is good. Makx's suggestion could be an additional note, or perhaps just a sentence following?
Yes, thanks, I agree and will add Makx's sentence as a further clarification in the same note. |
@makxdekkers, I'm not sure I agree with the recommendation you propose. IMO, even in the scenario you mention, the decision is up to the data provider, and it ends up depending on the dataset granularity used in different communities and on how the data are supposed to be used.
@andrea-perego Yes, I see what you mean. It is basically the tension between interoperability and flexibility. Not including this clarification means that people can argue that the specification does not explicitly recommend against having distributions with different data. In any case, even if the specification includes the clarification, people will still do what they want. The clarification just intends to help people who are looking for advice on how to do it to ensure maximum interoperability.
@makxdekkers, I totally agree with providing guidance, but in cases like this one I think we should provide alternatives. Specifically for time series, there's the option of having either different datasets or different distributions of the same dataset. And there are also cases where there's only one dataset with one distribution that is updated every year. Moreover, the problem I see when we recommend using different datasets is that a user who finds just one of them has no clue that other datasets exist for earlier / later years, unless the metadata include a specific relationship that makes this explicit. But this is not part of the current DCAT spec, and such a feature is not commonly supported in existing catalogue platforms. Note that I'm not recommending against this approach. Only, if we provide guidance, it should be clear what the pros and cons are.
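To make the alternatives in this comment concrete, here is a minimal Python sketch using plain dataclasses rather than an RDF library. The class names, field names, and identifiers such as `budget-2017` are hypothetical and only illustrate the shape of each option; none of them come from the DCAT specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Distribution:
    title: str
    media_type: str


@dataclass
class Dataset:
    identifier: str
    title: str
    distributions: List[Distribution] = field(default_factory=list)
    # Identifiers of related datasets; empty unless the publisher adds them.
    related: List[str] = field(default_factory=list)


YEARS = [2016, 2017, 2018]

# Option A: one dataset per year, each with its own distribution.
option_a = [
    Dataset(
        identifier=f"budget-{year}",
        title=f"Budget {year}",
        distributions=[Distribution(f"Budget {year} (CSV)", "text/csv")],
    )
    for year in YEARS
]

# Option B: one dataset with one distribution per year.
option_b = Dataset(
    identifier="budget",
    title="Budget",
    distributions=[Distribution(f"Budget {y} (CSV)", "text/csv") for y in YEARS],
)

# Option C: one dataset with a single distribution that is updated every year.
option_c = Dataset(
    identifier="budget",
    title="Budget",
    distributions=[Distribution("Budget (CSV, latest)", "text/csv")],
)


def discoverable_from(dataset: Dataset) -> List[str]:
    """Return identifiers of other datasets reachable from this record alone."""
    return list(dataset.related)


# With option A, a user who finds only the 2017 record sees no siblings
# until explicit relationships are added to the metadata.
found = option_a[1]
siblings_before = discoverable_from(found)
found.related = [d.identifier for d in option_a if d is not found]
siblings_after = discoverable_from(found)
```

In this sketch, `siblings_before` is empty, which is the discoverability problem described above; it only goes away once the publisher records the relationships explicitly.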
@andrea-perego As far as I understand, the discussion that we've had reached a consensus that distributions under a dataset should all be about the same data. We established that differences between distributions might be the result of lossy translations, different profiles or different representations (e.g. spreadsheet versus graphic visualisation). If that was the consensus, my proposal was to make that consensus explicit in the clarification. In fact, the "for example" was just to reinforce the sentences right before it that state that distributions should be about the same data. |
Hm, "the same data" isn't what you get with a different profile. It might be, but you might get completely disjoint sets with two different profiles. I'm pretty agnostic about how we ultimately define distributions, because both options have their own resulting trickle-down effects, some of which I like and some of which I don't. IMHO, one of the effects of saying that distributions must have the same data is that profiles can't define distributions. |
so maybe DCAT could recognize a 'series' as another kind of resource type? |
Thanks for pointing this out, @makxdekkers . Yes, I guess I'm not completely happy also with other points of the note. And, actually, I think there's another thing to be fixed, concerning As you say, since your revision is implementing the current consensus, it should indeed be merged. I'll open two separate issues for further discussion. |
@agreiner When I wrote "same data", I was not implying that the output from conversion or profiling is the same -- you are right, the result could look quite different -- but that the input to the conversion or profiling is the same. |
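As a rough illustration of "same input, different output", the following Python sketch derives two representations from one hypothetical source table: a full CSV and a lossy JSON rendering with rounded amounts. The data, function names, and the rounding rule are assumptions made for this example only.

```python
import csv
import io
import json

# Hypothetical source data behind a single dataset.
SOURCE = [
    {"department": "health", "amount": 1234.56},
    {"department": "education", "amount": 987.65},
]


def to_csv(rows):
    """Full-fidelity representation: every value is kept as-is."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["department", "amount"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_rounded_json(rows):
    """Lossy representation: amounts are rounded to whole units."""
    return json.dumps([{**r, "amount": round(r["amount"])} for r in rows])


# Both outputs come from the same input, but they are not
# informationally equivalent: the JSON has lost the decimals.
full_csv = to_csv(SOURCE)
lossy_json = to_rounded_json(SOURCE)
```

Under the reading given in this comment, both outputs could be distributions of one dataset, because the input to each conversion is the same.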
@agbeltran can you maybe have a look at where we are on this? If there are serious concerns about what I thought was consensus, and both @agreiner and @andrea-perego seem to disagree, maybe we should consider either falling back to the silence of DCAT-2014 or seeing whether we can, in the next few days, come up with text that gives advice for various approaches?
To be clear, I'm not pushing back against saying that distributions can be informationally nonequivalent. I think the text now does a much better job than before of clarifying what we mean by that. I just meant to point out that using a different profile returns something more different than a distribution. I wonder if it would be helpful to think about profiles in terms of data services. In a way, they return subsets the way a service does. The service serves the dataset in its entirety, but individual queries return subsets. |
@makxdekkers said:
This is indeed an option. Thinking about this again, and considering the different cases and approaches we could provide as examples, I wonder whether providing guidance on something that deals with data management practices is in the scope of DCAT. In DCAT-AP this has been done separately, with the work on the DCAT-AP Implementation Guidelines. So it might be more in the scope of a DCAT primer (although we may not be able to prepare one).
it seems that the controversial phrase is the example that @makxdekkers proposed, i.e. "For example, budget data for different years or observations from different sensors should be modeled as different datasets" - or @andrea-perego are you against the whole clarification about distributions? |
Re-reading the note, I think my main concern is on the proposed clarification:
This looks to me possibly conflicting with the preceding and following sentences:
The clarification seems to say that "budget data for different years" and "observations from different sensors" are both to be considered as different data, which might be questionable (time series are supposed to follow the same data schema, data collection methodology, etc., whereas data from different sensors may have nothing in common). Moreover, the clarification seems to contradict the second sentence - the one saying that it's ultimately up to the data provider to decide. That said, I do think the content of the note would be more in the scope of a primer. The current definition and usage note of distribution is, IMO, good enough to clarify what it should be used for:
BTW, I have also a concern about the last two paragraphs of the note - I reported this in a separate issue (#809) |
@smrgeoinfo you might be right about 'Data Series' - this is perhaps a common application (I know it has its own slot in ISO 19115, for example). Perhaps you could write a new UC for this and we can put it on the backlog for the next (soon) revision. But in general I agree with the original intention that different years-worth would typically correspond with different datasets. It's just that these datasets have a rather predictable relationship between them - i.e. they are part of a series in which only (e.g.) the temporal extent is different. We do have a general mechanism to deal with 'relationships between datasets' (i.e. qualified relations) but data-series is probably a special case that is worth giving special treatment. |
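A minimal sketch of how the existing qualified-relation mechanism could link datasets in a yearly series, written as a Python function that emits Turtle. It assumes the `dcat:qualifiedRelation`, `dcat:Relationship`, and `dcat:hadRole` terms from the DCAT revision under discussion; the `ex:` namespace, the dataset identifiers, and the `ex:nextInSeries` role are hypothetical and not defined by DCAT.

```python
from typing import List

PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
    "@prefix ex: <http://example.org/> .\n"
)


def series_turtle(years: List[int]) -> str:
    """Emit Turtle for yearly datasets, each linked to the next year's dataset."""
    blocks = [PREFIXES]
    for i, year in enumerate(years):
        lines = [
            f"ex:budget-{year} a dcat:Dataset ;",
            f'    dct:title "Budget {year}"',
        ]
        if i + 1 < len(years):
            nxt = years[i + 1]
            lines[-1] += " ;"
            lines += [
                "    dcat:qualifiedRelation [",
                "        a dcat:Relationship ;",
                f"        dct:relation ex:budget-{nxt} ;",
                "        dcat:hadRole ex:nextInSeries",  # hypothetical role
                "    ]",
            ]
        # Close the statement for this dataset.
        lines[-1] += " ."
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(series_turtle([2016, 2017, 2018]))
```

A dedicated series construct, as suggested above, would presumably replace the ad hoc role with something standardised; this sketch only shows what is expressible with the general mechanism.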
Where @makxdekkers had written
perhaps we could just tweak it to "For example, budget data for different years would typically be modeled as different datasets" (as a sensor-observations guy I think the discussion of different sensors is potentially much bigger and out of scope here.) |
@dr-shorthair -- I'll get a UC in the backlog. Another common 'series' is satellite remote-sensed data -- same sensor, but different spatial and temporal extents.
And @pwin (Scottish Government) has many! |
looks clearer
Remove paras referring to change of scope
I've attempted to address these matters in #832 |
I would change my review to 'approve' if #832 is accepted. But I guess with a PR on the PR this will require two cycles of plenary approval to get through ...
Good addition, even better with @dr-shorthair 's additions (#832).
Does this mean we can drop the subsequent note saying that the intention of the phrase "informationally equivalent" needs to be clarified? #411 is actually already closed and this PR improves on our description - by discussing different levels of fidelity, making it explicit that it's down to the data provider to judge, as well as providing a counter example.
The #411 note tidy-up is now in #839, but it would be good to add @dr-shorthair's merge.... I can do this if everyone (particularly @agbeltran) is okay with it?
this PR now also includes the changes by @dr-shorthair |
Extended the clarification on not fully informationally equivalent distributions related to discussion in #482
Preview in the note after the Distribution definition: https://rawgit.com/w3c/dxwg/agb-issue-482/dcat/index.html#Class:Distribution