Packaging and compression of files in a distribution #746

Merged
merged 15 commits into from
Feb 20, 2019

makxdekkers
Contributor

I added text for the packaging/compression of files in the distribution. I simplified the proposal by only adding one property: dcat:wrapFormat.

I could have created four properties dcat:compressionFormat, dcat:compressionMediaType, dcat:packagingFormat and dcat:packagingMediaType, but I thought that would make the solution much too complex. I would think that it would be fairly trivial for the system to decompress or unpack using a single property, and the duplication between mediaType and format seemed unnecessary.

@agbeltran
Member

@smrgeoinfo
Contributor

Why is the range dct:MediaTypeOrExtent instead of dct:MediaType?

@jakubklimek
Contributor

I still think that 2 or 4 properties would be cleaner than just the one.

  1. The separation between mediaType and format is due to the incompleteness of the IANA Media Types list (e.g. it does not contain a media type for .tar files) and, therefore, the use of other vocabularies alongside it, e.g. the File Type EU MDR NAL in DCAT-AP. Since we have both properties describing the main content of the distribution, why should the packaging and compression be described by only one of those?
  2. Could we go back to "How to express distributions provided as compressed files" (#259) and try to represent the various use cases (compressed file, packaged files, compressed and packaged files) using just the wrapFormat property, and see how the files can be automatically processed based on that description? I think this will only work for a simply compressed file, e.g. .csv.gz. We will be unable to determine that the file is actually an archive of multiple files, e.g. a .zip containing multiple .csv files, and how to unpack it (e.g. compared to a .tar.gz containing multiple .csv files). That is why I would rather have a robust solution than a less complex one. I would say that the three use cases are pretty common and it would be nice to be able to describe them properly.
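
To make the three use cases above concrete, here is a minimal Python sketch of how a consumer might unwrap a download if compression and packaging were described separately. The function `unwrap`, its parameters `compress_format` and `pack_format`, and the format labels are assumptions made for illustration; they are not part of the pull request, and only gzip, ZIP and tar are handled.

```python
import gzip
import io
import tarfile
import zipfile

# Hypothetical format labels; a real description would use media type IRIs.
GZIP = "application/gzip"
ZIP = "application/zip"
TAR = "application/x-tar"  # a label only: tar has no IANA-registered media type


def unwrap(data, compress_format=None, pack_format=None):
    """Return {member_name: bytes} for the bytes of a downloaded distribution.

    compress_format and pack_format stand in for the two proposed
    properties; None means that layer is absent.
    """
    # Step 1: undo the compression layer, if one is declared.
    if compress_format == GZIP:
        data = gzip.decompress(data)
    elif compress_format is not None:
        raise ValueError(f"unsupported compression: {compress_format}")

    # Step 2: undo the packaging layer, if one is declared.
    if pack_format is None:
        return {"": data}  # a single file, nothing to unpack
    if pack_format == ZIP:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return {name: zf.read(name) for name in zf.namelist()}
    if pack_format == TAR:
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
            return {
                member.name: tf.extractfile(member).read()
                for member in tf.getmembers()
                if member.isfile()
            }
    raise ValueError(f"unsupported packaging: {pack_format}")
```

Under these assumptions, a `.csv.gz` is `unwrap(data, compress_format=GZIP)`, a `.zip` of CSV files is `unwrap(data, pack_format=ZIP)`, and a `.tar.gz` is `unwrap(data, compress_format=GZIP, pack_format=TAR)`. With one single-valued property, only one of the two layers of the `.tar.gz` case could be stated.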

@makxdekkers
Contributor Author

I am happy to consider the more complex solution with multiple properties, if that is what the group wants. In response to @jakubklimek I would say that 'cleaner' is not the main objective here; a more complex solution should be chosen only if it is essential and a simpler solution does not work.
Furthermore, I think that even the approach with four properties can only distinguish simple cases; you can only indicate the outermost and the innermost formats. For example, there is no way to know whether a RAR archive contains multiple ZIP files that in turn contain CSV and XLS files.
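
The nesting limitation can be illustrated with a small Python sketch. It uses a ZIP inside a ZIP rather than RAR, because the Python standard library has no RAR reader; the helper names `zip_bytes` and `layers` and the file names are hypothetical. The point is only that intermediate layers are discovered by inspecting the bytes, not from a description that names the outermost and innermost formats.

```python
import io
import zipfile


def zip_bytes(members):
    """Build an in-memory ZIP archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, payload in members.items():
            zf.writestr(name, payload)
    return buf.getvalue()


def layers(data, depth=0):
    """Yield (depth, member_name) for every member, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        for name in zf.namelist():
            yield depth, name
            payload = zf.read(name)
            # Nested archives are detected from their content, not from metadata.
            if zipfile.is_zipfile(io.BytesIO(payload)):
                yield from layers(payload, depth + 1)


inner = zip_bytes({"data.csv": b"a,b\n1,2\n"})
outer = zip_bytes({"2019-01.zip": inner, "readme.txt": b"hello"})
for depth, name in layers(outer):
    print("  " * depth + name)
```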

@makxdekkers
Contributor Author

@smrgeoinfo I made the range dct:MediaTypeOrExtent instead of dct:MediaType to align with the range of dcat:mediaType and dct:format, but you are right, it could easily be dct:MediaType.

@smrgeoinfo
Contributor

I was assuming that for a packaged distribution, one could use dct:conformsTo to specify the packaging convention (e.g. bagit, OASIS OpenDocument Package Format...). In those schemes there is a manifest and metadata document in the package that details the internal structure and file formats.

@jakubklimek
Contributor

@smrgeoinfo That may apply to distributions complying with the package formats you mention. However, we could have a looser package. If we had, for instance, a tar or zip file with 100 CSV files, one per day, all with the same schema, this schema (a CSV on the Web JSON-LD descriptor) would go in dct:conformsTo, and we would need a separate property (or properties) to describe the package/compression.
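
As a rough illustration of that split, the following Python sketch builds a JSON-LD-style description of such a zip of daily CSV files. Everything specific in it is an assumption: the example.org IRIs are placeholders, and `dcat:packFormat` is only the name floated in this thread, not a confirmed term.

```python
import json

# Illustrative description only: example.org IRIs are placeholders and
# dcat:packFormat is the name proposed in this thread, not a confirmed term.
distribution = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "https://example.org/dataset/daily/distribution/zip",
    "@type": "dcat:Distribution",
    "dcat:downloadURL": {"@id": "https://example.org/files/daily.zip"},
    # Format of the files inside the package.
    "dcat:mediaType": {
        "@id": "https://www.iana.org/assignments/media-types/text/csv"
    },
    # Shared schema of the CSV files (a CSV on the Web descriptor).
    "dct:conformsTo": {
        "@id": "https://example.org/schemas/daily.csv-metadata.json"
    },
    # The packaging is described separately from the schema.
    "dcat:packFormat": {
        "@id": "https://www.iana.org/assignments/media-types/application/zip"
    },
}

print(json.dumps(distribution, indent=2))
```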

@jakubklimek
Contributor

@makxdekkers You are right about the simple cases. However, I would say that double compression and similar techniques are often simply bad practice. I often see compressed single files, and packaged (and compressed) sets of files with the same schema, where it makes sense (splitting larger files into smaller ones or just reducing the size of a file on disk). However, when I come across a zip file containing many other zip files, it is often because the publisher does not know what they are doing. The question seems to be where we draw the line.

@makxdekkers
Contributor Author

Can I ask the members of the WG to vote for one of the two alternatives:

  1. use a single property, dcat:wrapFormat, for both compression and packaging, as in the pull request
  2. use two properties, e.g. dcat:compressFormat and dcat:packFormat, to distinguish between the two

Thanks.

@larsgsvensson
Contributor

@makxdekkers How does the vote take place? My suggestion would be one comment for each of the proposals (single property vs two properties) and then use thumbs-up for the vote.

@makxdekkers
Contributor Author

Proposal 1: use a single property, dcat:wrapFormat, for both compression and packaging, as in the pull request
Please give thumbs-up if you agree.

@makxdekkers
Contributor Author

Proposal 2: use two properties, e.g. dcat:compressFormat and dcat:packFormat, to distinguish between the two
Please give thumbs-up if you agree.

Contributor

@dr-shorthair dr-shorthair left a comment

  1. These new properties also need to be expressed in the RDF file (i.e. dcat.ttl)
  2. Also add an entry in the change-log (Annex E)

@agbeltran
Member

Linking here the related issues: #54, #259, #482

We should also add examples to cover the different use cases.

@agbeltran
Member

@makxdekkers I merged the other branch into this PR

@dr-shorthair if you could have a final check and merge, we resolved this on the call today: https://www.w3.org/2019/02/20-dxwgdcat-minutes.html#x13

@agbeltran
Member

The remaining work on this is to add the examples.

@dr-shorthair dr-shorthair merged commit e4ba77f into gh-pages Feb 20, 2019
@dr-shorthair dr-shorthair deleted the makxdekkers-patch-1 branch February 20, 2019 23:31
@dr-shorthair
Contributor

Merging for now, noting that instance examples also needed.
Working on the assumption that the discussion thread above is sufficient endorsement by the team.

@dr-shorthair
Contributor

Hmmm. This merge was probably premature. I don't think we have a decision recorded to support it. My bad. And not competent to unwind ...

@makxdekkers makxdekkers deleted the makxdekkers-patch-1 branch February 21, 2019 07:58
@agbeltran
Member

We did decide about this on the call yesterday: https://www.w3.org/2019/02/20-dxwgdcat-minutes.html#x13

@dr-shorthair
Contributor

dr-shorthair commented Feb 22, 2019

Phew. It was resolved after I left the meeting and before I read the minutes :-) No reverts needed.
