Updates to commons library to support compressed and other alternate file paths for data #126
Most of the internal work is done in registry-common. Now, when a data file is missing, an attempt is made to find compressed versions based on regular expressions and a possible new extension. If found, the file is recorded with normal and extended compression information.
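To make the lookup described above concrete, here is a minimal sketch of the idea, not the registry-common code. `CompressedFileFinder`, `Rule`, `Found`, and the rule fields are hypothetical names, and the existence check is passed in as a predicate so the logic can be exercised without touching the file system.

```java
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these names come from registry-common.
final class CompressedFileFinder {
    /** If a file name matches `pattern`, try `replacement` as the compressed file name. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;
        final String algorithm;

        Rule(Pattern pattern, String replacement, String algorithm) {
            this.pattern = pattern;
            this.replacement = replacement;
            this.algorithm = algorithm;
        }
    }

    /** The path that was actually found and the compression algorithm recorded for it. */
    static final class Found {
        final Path path;
        final String algorithm;

        Found(Path path, String algorithm) {
            this.path = path;
            this.algorithm = algorithm;
        }
    }

    static Optional<Found> locate(Path expected, List<Rule> rules, Predicate<Path> exists) {
        // The file named in the label is present: record it as uncompressed.
        if (exists.test(expected)) {
            return Optional.of(new Found(expected, "none"));
        }
        String name = expected.getFileName().toString();
        for (Rule rule : rules) {
            Matcher m = rule.pattern.matcher(name);
            if (!m.matches()) {
                continue; // this rule does not apply to the file name
            }
            // Build the candidate name next to the expected file and probe for it.
            Path candidate = expected.resolveSibling(m.replaceFirst(rule.replacement));
            if (exists.test(candidate)) {
                return Optional.of(new Found(candidate, rule.algorithm));
            }
        }
        return Optional.empty(); // neither the original nor any compressed form exists
    }
}
```

In real use the predicate would likely be `java.nio.file.Files::exists`, and the rule list would come from the pattern list that harvest supplies; a rule such as `(.+)\.pdf` with replacement `$1.pdf.gz` is only an illustration.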
|
Can you check this to make sure this is the logic that you really are trying to describe? I need to have harvest update the pattern list next to complete it; meaning, there will be tweaks, but the basic changes are in and would be functional if |
src/main/java/gov/nasa/pds/registry/common/meta/FileMetadataExtractor.java
|
@al-niessner let's assume all versions of the products are available. if any of the files we expect are unavailable, let's put UNK or null (or maybe -1 for file size) |
|
The availability will tell the user which one is available. the file size/checksum, if available, will tell you that information. if it is not available, and all we have is a filename, then so be it. we will populate that information later or maybe never. |
Kept the original metadata, then added three new items. Two of them are like the rest: each is an array of strings indexed in parallel with the FileName array. They are the compression algorithm (default is none) and whether the reference is available. The third is a list of lists of objects that represent the alternate locations. When alternate locations are added, all of the same metadata other than alternate locations is expected for each. Since harvest is ingesting what it finds, it populates the original entry, allowing all of the other fields to be populated when the alternate location is known.
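As a rough illustration of the shape described above (not the actual dataInfo class), the following sketch keeps the new values index-aligned with the file names and stores a list of alternate locations per file. Every name here, including `DataInfoSketch`, `AltLocation`, and the field names, is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of index-aligned file metadata; names are illustrative only.
final class DataInfoSketch {
    /** One alternate location, carrying the same kind of metadata as the original entry. */
    static final class AltLocation {
        final String fileRef;
        final String compression;
        final boolean available;

        AltLocation(String fileRef, String compression, boolean available) {
            this.fileRef = fileRef;
            this.compression = compression;
            this.available = available;
        }
    }

    // All four lists share the same index: entry i describes fileNames.get(i).
    final List<String> fileNames = new ArrayList<>();
    final List<String> compression = new ArrayList<>();
    final List<Boolean> available = new ArrayList<>();
    final List<List<AltLocation>> altLocations = new ArrayList<>();

    /** Adds the file as harvest found it and returns its index. */
    int addFile(String name, String compressionAlgorithm, boolean isAvailable) {
        fileNames.add(name);
        // "none" is the default when no compression algorithm is known.
        compression.add(compressionAlgorithm == null ? "none" : compressionAlgorithm);
        available.add(isAvailable);
        altLocations.add(new ArrayList<>()); // filled in later, once alternates are known
        return fileNames.size() - 1;
    }

    /** Attaches an alternate location to the file at the given index. */
    void addAltLocation(int index, AltLocation location) {
        altLocations.get(index).add(location);
    }
}
```

Because every list grows together in `addFile`, an index obtained from the file name list can be used against any of the other lists, which is what keeps the relationship between a file and its alternates intact.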
|
Here is the form for the dataInfo to support all of the upcoming additions. First, I know it does not compile right now. I am using those to flesh out the implementation. However, the meat of the changes for data info is here. Second, I kept all of the old metadata as is. Hence, nothing that already works needs to change. Third, added 3 new metadata for future use:
I am going to add the bit that generates the JSON and uploads it in my test environment. Should have it all done by COB today unless the heat gets to me. It should solve all of the compressed, uncompressed, different protocol, etc. issues. You just add them all to altlocs. No more s3_file_ref etc. |
|
Taken from some test data with enough information to show my points and the lidvid for context: |
|
Now, let's say rest_inst_overview.docx moved to s3 glacier while rest_inst_overview.pdf got gzipped and a copy of it is archived in glacier. |
|
All I care about is the structure. If you want to change names that is great. Take a look at the examples on this PR. |
|
@al-niessner from a user perspective, is there a particular benefit to having that information in a separate alternative locations area of the response, versus putting it alongside the other metadata we have and letting them make the choice of what they want? This way they wouldn't need to know or care that there are alternative files; they can just make the decision based upon what is available, the protocol, and whether or not it is compressed. |
|
Yes. If you are changing the design of |
|
See above comment. As a more practical example. A user wants to search for all of the alternative locations for In the current altlocs, the search for With your example? Use regular expression? Make the registry-api super complicated and say that they are looking for |
I definitely understand your viewpoint, but I think what we were considering is the In a future iteration of the API responses (hopefully coming in the next few months), we will probably pull out the entire Do we think adding a flag like |
lol - no. I do appreciate the various attempts at avoiding the direct and simplest solution. I will change the code again and just pile more junk at the end of the array to be figured out later, when the information needed to do so has been lost. I have protested enough and have made no impression - Cassandra effect. |
|
All file relation information has been erased. It is now a list of random files that may or may not be related to each other or the label. You should like it now. |
|
@al-niessner I appreciate your willingness to change based on my dictatorship, but I was actually coming around to the other design 😄 . Was just in the process of iterating with GPT on other design considerations we should be taking into account. That being said, let's not change it again for now. We have an API WG meeting tomorrow where we can present what you have now, and what you implemented before and we can see if anyone has a preference. Can you provide an output of a response that we can include in our discussion tomorrow? |
|
Sorry, output of a response? |
|
Not a dictatorship. I have been a Cassandra most of my life and have not discovered a way out of the curse. |
|
@al-niessner something like this: |
|
This is what it looks like now: and your comment is what it will look like in the future. |
|
@al-niessner thanks! @tloubrieu-jpl see above for the two alternatives for handling alternate file paths in terms of the response. we may need to discuss at the breakout what else we might need for the config updates per https://github.com/NASA-PDS/harvest/pull/238/files#diff-524d40d335139db5bb01517fcd5be50deaeb94dd5205235fd646581acbc98c90 |
|
|
Okay, it is done. Merge when ready. @jordanpadams wanted it to be in draft for a bit longer, but all requests are implemented. Just accept it and when you want to re-org with structure we will. |



🗒️ Summary
If data files are not found, look for them in compressed form given some regular expressions and metadata information.
⚙️ Test Data and/or Report
NA
♻️ Related Issues
Basis (part 1 of 2) for NASA-PDS/registry#86