Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Droid upgrade #387

Open
pwinckles opened this issue Jul 23, 2023 · 12 comments · May be fixed by #388
Open

Droid upgrade #387

pwinckles opened this issue Jul 23, 2023 · 12 comments · May be fixed by #388

Comments

@pwinckles
Copy link
Contributor

@awoods

I started working on upgrading to the latest version of Droid. There have been some changes in Droid that require changing the FITS integration, which is on the hacky side of things because Droid does not expose an easy to use API that's ideally suited for what FITS wants to do.

I have a working, work in progress, branch that you can look at here: https://github.com/pwinckles/fits/tree/droid-6.7

In that branch, I change FITS to use the primary Droid entry point, SubmissionGateway, and use most of Droid's standard configuration. DroidWrapperFactory may look like its doing a ton of customization, but I sourced almost all of that from Droid's Spring xml. DroidWrapper is what does the analysis. With this setup, FITS would now output info on all of the archive types that Droid supports, and not just zips.

However, that branch does not currently replicate the existing FITS behavior entirely, and I want to know to what extent it needs to.

  1. It does not output originalSize and compressionMethod elements. This is because this data isn't really from Droid. Instead, it's being computed by FITS in ZipArchiveContentIdentifier.
  2. It will recursively descend into nested archives. The current FITS behavior only goes one level deep. This recursive behavior is not configurable on Droid (GUI takes much longer to analyse archives now that it expands sub archives recursively digital-preservation/droid#370). The test on multiple-file-types-and-folders.zip seems to suggest that FITS considers the behavior of only going one level deep to be the desirable behavior.

For the originalSize and compressionMethod, I assume that you don't want these values to disappear, which is fair. However, I would rather not hackup Droid again to get them back in, as they really have absolutely nothing to do with Droid. My proposal is to instead make a new "FITS zip tool" that runs on zip files and computes that information.

For the recursion, is there a hard requirement on it being restricted to a depth of 1? If so, I can try to figure out a way to make this happen, but it'll be a little tricky so I'm not going to unless it's required.

Thoughts?

@awoods
Copy link
Contributor

awoods commented Jul 25, 2023

Thanks, @pwinckles .

Your approach to '1.' sounds good: Make a new "FITS zip tool".

Regarding '2.', would you mind asking in Slack (c4l#fits-community) and on the list (https://groups.google.com/g/fits-users)? I will check locally as well.

@cvicary
Copy link

cvicary commented Jul 25, 2023

Hi @pwinckles,
You're right that we don't want originalSize and compressionMethod to disappear. They end up in the preservation record via containerMD. The "FITS zip tool" idea sounds good to me.

About the Droid recursion issue, we have no current use for file metadata below the first level. I'm concerned that the increased processing time for recursion would be an issue for some of our partners. I'm also wary of having a forked/hacked version of Droid in there. If there is a way to cleanly limit recursion to 1 level, or to make the depth configurable, I would vote for that. If not, it would be helpful to understand the performance implications of recursion on our typical use cases before deciding next steps.

@pwinckles
Copy link
Contributor Author

@awoods @cvicary:

Thanks for the input. I still need to investigate options for limiting Droid's recursion.

Note that with these changes, FITS would now be recursing into the following archive formats, which are all of the formats that Droid supports. I know that you'd expressed interest in having more formats supported in the past. Let me know if you are good with this list, or if you'd like any of them disabled.

ZIP
TAR
GZ
ARC
WARC
BZ
7Z
ISO
RAR
FAT

Additionally, which of those formats would you like supported with the new tool that will be computing originalSize and compressionMethod? This would require unique code per format, and, obviously, it doesn't really make sense for some formats, like tar.

@awoods
Copy link
Contributor

awoods commented Jul 26, 2023

We need to support: zip, gz, and 7z

@pwinckles
Copy link
Contributor Author

@awoods @cvicary

Okay, there's some updated WIP code in the branch that limits the recursion to a depth of 1, and also adds back originalSize and compressionMethod. I thought of a fairly easy way of doing this. All but one of the tests is passing now unmodified.

New questions:

  1. Can you provide me with example gz and 7z files to add to the test collection?
  2. I assume that compressionMethod, which can currently only have the values deflate or stored, only makes sense for zip files? Do your metadata folks have guidance for the other formats, or should I just emit for zip, which would not be a deviation from prior behavior because FITS has historically only supported zip here.

@pwinckles
Copy link
Contributor Author

@awoods @cvicary

The one test that isn't passing as is is for https://github.com/harvard-lts/fits/blob/main/testfiles/input/aliceDynamic_images_metadata_tableOfContents.epub

Droid doesn't consider that file to be an epub because it's missing a root mimetype file (spec), so it just treats it as a zip. I'm planning on accepting this because the file is invalid.

Which raises the question of whether or not there is interest in integrating epubcheck into FITS? If so, we could create a separate ticket for that.

@pwinckles pwinckles linked a pull request Jul 29, 2023 that will close this issue
@pwinckles
Copy link
Contributor Author

pwinckles commented Jul 29, 2023

I created a draft PR that has what I consider to be final a version of the code, just pending your input on #387 (comment)

@pwinckles
Copy link
Contributor Author

@awoods @cvicary Droid 6.7.0 has been released now, and I updated the PR. The PR is still pending answers to these questions:

  1. Can you provide me with example gz and 7z files to add to the test collection?
  2. I assume that compressionMethod, which can currently only have the values deflate or stored, only makes sense for zip files? Do your metadata folks have guidance for the other formats, or should I just emit for zip, which would not be a deviation from prior behavior because FITS has historically only supported zip here.

I can add made up gz and 7z test files if you'd prefer, but I thought it might be nice if they were representative of what you were interested in processing.

@pwinckles
Copy link
Contributor Author

@awoods @cvicary: Pinging this again in the new year. I understand if this is not a priority, just checking in. I have a draft PR up with the droid upgrade (#388). It would be great to include sample gz and 7z test files that demonstrate it working on your archives. However, I could also create some example archives, if that is not feasible.

@awoods
Copy link
Contributor

awoods commented Jan 27, 2024

@pwinckles : Thanks for resurfacing this issue and associated PR. I will work with @cvicary to dig up example/test files for gz and 7z.

For question 2, let's stick with only emitting compressionMethod for zip.

Thanks again.

@pwinckles
Copy link
Contributor Author

pwinckles commented Feb 3, 2024

@awoods Thanks!

See the current output for those files here: https://gist.github.com/pwinckles/6b76bee3d37b9142081781ffcfa35c3e

Points of interest:

  1. For the 7z file the droid output is not included because its file id does not include a mime type. I can open an issue with them to correct this. (Opened: 7z file format is missing mime type digital-preservation/droid#1065)
  2. There is not an agreement on the 7z format name. We see various tools report it as "7-zip archive data", "7-zip archive", and "7Zip format". Do you all have a preference for how this is normalized?
  3. For the gz file, the mime is being reported as "application/gzip" and as "application/x-gzip". Would you like this normalized? If so, which do you prefer?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants