@al-niessner
Contributor

@al-niessner al-niessner commented May 7, 2025

🗒️ Summary

If data files are not found, look for them in compressed form given some regular expressions and metadata information.

⚙️ Test Data and/or Report

NA

♻️ Related Issues

Basis (part 1 of 2) for NASA-PDS/registry#86

Most of the internal work is done in registry-common. Now, when a data file is missing, an attempt is made to find compressed versions based on regular expressions and a possible new extension. If found, the file is recorded with the normal metadata plus the extended compression information.
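
A minimal sketch of the lookup described above, written in Python rather than as the registry-common code itself. The function name `find_compressed`, the `(regex, extension, algorithm)` tuple layout, and the sample patterns are assumptions made for illustration; the real pattern list is expected to come from harvest.

```python
import re
from pathlib import Path

# Hypothetical pattern list: (regex on the file name, extension to append, algorithm).
COMPRESSED_PATTERNS = [
    (r".*\.pdf$", ".gz", "gzip"),
    (r".*\.docx$", ".zip", "zip"),
]


def find_compressed(path, patterns=COMPRESSED_PATTERNS):
    """Return (found_path, algorithm) for a data file, or None if nothing is found.

    An existing file is returned with algorithm "none". Otherwise each pattern
    whose regex matches the file name is tried with its new extension appended.
    """
    path = Path(path)
    if path.exists():
        return path, "none"
    for regex, extension, algorithm in patterns:
        if re.fullmatch(regex, path.name):
            candidate = path.with_name(path.name + extension)
            if candidate.exists():
                return candidate, algorithm
    return None
```

With an empty pattern list the function can only ever report the uncompressed file, which matches the note in this thread that the change stays inert until harvest supplies patterns.
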
@al-niessner al-niessner self-assigned this May 7, 2025
@al-niessner al-niessner marked this pull request as draft May 7, 2025 21:47
@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Can you check this to make sure it is the logic that you are really trying to describe? I need to have harvest update the pattern list next to complete it; meaning, there will be tweaks, but the basic changes are in and would be functional if compressed could ever be a non-empty list.

@al-niessner al-niessner marked this pull request as ready for review May 16, 2025 17:37
@al-niessner al-niessner marked this pull request as draft May 16, 2025 17:37
@jordanpadams
Member

@al-niessner let's assume all versions of the products are available. if any of the files we expect are unavailable, let's put UNK or null (or maybe -1 for file size)

@jordanpadams
Member

The availability will tell the user which one is available. the file size/checksum, if available, will tell you that information. if it is not available, and all we have is a filename, then so be it. we will populate that information later or maybe never.

Kept the original metadata, then added three new items. Two of them are like the rest: each is an array of strings indexed in parallel with the file name array. They are the compression algorithm (default is none) and whether the reference is available.

The third is now a list of lists of objects that represent the alternate locations. When they are added, all of the same metadata other than alternate locations is expected.

Since harvest is ingesting what it finds, it populates the original, allowing all of the other fields to be populated when the alternate location is known.
@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Here is the form for the dataInfo to support all of the upcoming additions.

First, I know it does not compile right now. I am using those compile errors to flesh out the implementation. However, the meat of the changes for data info is here.

Second, I kept all of the old metadata as is. Hence, nothing needs to change that works already.

Third, added 3 new metadata for future use:

  1. compression algorithm - default is none and missing field should be considered none
  2. availability of reference - true/false that says if this and only this file_ref is available for public download
  3. altlocs - do not like this name but y'all seem concerned about name length so wanted to avoid alternate_locations, which seems more right to me. Anyway, this is an array of objects where each object is the same metadata as the basic DataFileInfo minus altlocs.

I am going to add the bit that generates the JSON and uploads it in my test environment. Should have it all done by COB today unless the heat gets to me.

It should solve all of the compressed, uncompressed, different protocol, etc. issues. You just add them all to altlocs. No more s3_file_ref etc.
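
One way to picture the shape described in this comment is the Python sketch below. It is not the registry-common class: the names `FileLocation` and `primary`, and the choice of `-1` and `"UNK"` as defaults for unknown values (taken from the suggestion earlier in this thread), are assumptions for illustration.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class FileLocation:
    """Metadata for one location of a file: everything except altlocs."""
    file_ref: str
    compression_algorithm: str = "none"  # a missing field is treated as none
    ref_file_available: bool = True      # is this exact file_ref downloadable?
    file_size: int = -1                  # -1 when unknown
    md5_checksum: str = "UNK"            # "UNK" when unknown
    mime_type: str = "UNK"


@dataclass
class DataFileInfo:
    """One file named in the label, plus any alternate locations for it."""
    file_name: str         # exactly as it appears in the label
    primary: FileLocation  # hypothetical name for the original file_ref entry
    altlocs: List[FileLocation] = field(default_factory=list)


# Usage: the original reference is gone, but a copy exists in S3.
docx = DataFileInfo(
    file_name="rex_inst_overview.docx",
    primary=FileLocation(
        file_ref="https://example.invalid/rex_inst_overview.docx",
        ref_file_available=False,
    ),
)
docx.altlocs.append(FileLocation(file_ref="s3://example-bucket/rex_inst_overview.docx"))
```

Because every alternate location reuses the same `FileLocation` fields, a compressed copy, an S3 copy, or a copy behind another protocol is just one more list entry rather than a new top-level field.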

@al-niessner
Contributor Author

Taken from some test data with enough information to show my points and the lidvid for context:

          "lidvid" : "urn:nasa:pds:nh_documents:rex:rex_inst_overview::1.0",

          "ops:Data_File_Info/ops:compression_algorithm" : [
            "none",
            "none"
          ],
          "ops:Data_File_Info/ops:creation_date_time" : [
            "2024-11-07T15:20:45Z",
            "2024-11-07T15:21:57Z"
          ],
          "ops:Data_File_Info/ops:file_name" : [
            "rex_inst_overview.docx",
            "rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_ref" : [
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_size" : [
            "4610645",
            "499098"
          ],
          "ops:Data_File_Info/ops:md5_checksum" : [
            "5180f1482619b7a80e8e1a3b8ce57af1",
            "a161853d316351d6a4f0a873daf3353f"
          ],
          "ops:Data_File_Info/ops:mime_type" : [
            "application/x-tika-ooxml",
            "application/pdf"
          ],
          "ops:Data_File_Info/ops:ref_file_available" : [
            "true",
            "true"
          ],
          "ops:Data_File_Info/ops:altlocs" : [
            [ ],
            [ ]
          ],
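
Since every `ops:Data_File_Info` array in the fragment above is indexed in parallel with `ops:file_name`, a reader can regroup the arrays into one record per file. The helper below is only a sketch; the name `records_by_file_name` is hypothetical, and it assumes all arrays have the same length.

```python
PREFIX = "ops:Data_File_Info/ops:"


def records_by_file_name(doc):
    """Regroup the parallel ops:Data_File_Info arrays into one dict per file.

    Assumes every array has the same length as ops:file_name.
    """
    fields = {k[len(PREFIX):]: v for k, v in doc.items() if k.startswith(PREFIX)}
    records = {}
    for i, name in enumerate(fields["file_name"]):
        record = {key: values[i] for key, values in fields.items() if key != "file_name"}
        # A missing compression_algorithm array is read as "none".
        record.setdefault("compression_algorithm", "none")
        records[name] = record
    return records
```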

@al-niessner
Contributor Author

al-niessner commented May 21, 2025

Now, let's say rex_inst_overview.docx moved to S3 Glacier while rex_inst_overview.pdf got gzipped and a copy of it is archived in Glacier.

          "lidvid" : "urn:nasa:pds:nh_documents:rex:rex_inst_overview::1.0",

          "ops:Data_File_Info/ops:compression_algorithm" : [
            "none",
            "none"
          ],
          "ops:Data_File_Info/ops:creation_date_time" : [
            "2024-11-07T15:20:45Z",
            "2024-11-07T15:21:57Z"
          ],
          "ops:Data_File_Info/ops:file_name" : [
            "rex_inst_overview.docx",
            "rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_ref" : [
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_size" : [
            "4610645",
            "499098"
          ],
          "ops:Data_File_Info/ops:md5_checksum" : [
            "5180f1482619b7a80e8e1a3b8ce57af1",
            "a161853d316351d6a4f0a873daf3353f"
          ],
          "ops:Data_File_Info/ops:mime_type" : [
            "application/x-tika-ooxml",
            "application/pdf"
          ],
          "ops:Data_File_Info/ops:ref_file_available" : [
            "false",
            "false"
          ],


          "ops:Data_File_Info/ops:altlocs" : [
            [
              {
                "ops:compression_algorithm" : "none",
                "ops:creation_date_time" : "2025-11-07T15:21:57Z",
                "ops:file_ref" : 
                    "s3://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
                "ops:file_size" : "4610645",
                "ops:md5_checksum" : "5180f1482619b7a80e8e1a3b8ce57af1",
                "ops:mime_type" : "application/x-tika-ooxml",
                "ops:ref_file_available" : "true"
              }
            ],

            [ 
              {
                "ops:compression_algorithm" : "gzip",
                "ops:creation_date_time" : "2025-12-07T15:21:57Z",
                "ops:file_ref" : 
                  "s3://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf.gzip",
                "ops:file_size" : "-1",
                "ops:md5_checksum" : "45abc",
                "ops:mime_type" : "UNK",
                "ops:ref_file_available" : "true"
              },
              {
                "ops:compression_algorithm" : "gzip",
                "ops:creation_date_time" : "2025-12-07T15:21:57Z",
                "ops:file_ref" : 
                    "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf.gzip",
                "ops:file_size" : "-1",
                "ops:md5_checksum" : "45abc",
                "ops:mime_type" : "UNK",
                "ops:ref_file_available" : "true"
              }
            ]
          ],

@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

All I care about is the structure. If you want to change names that is great. Take a look at the examples on this PR.

@jordanpadams
Member

jordanpadams commented May 21, 2025

@al-niessner from a user perspective, is there a particular benefit to having that information in a separate alternative locations area of the response, versus putting it alongside the other metadata we have and letting them make the choice of what they want? This way they wouldn't need to know or care that there are alternative files; they can just make the decision based upon what is available, the protocol, and whether or not it is compressed.

          "lidvid" : "urn:nasa:pds:nh_documents:rex:rex_inst_overview::1.0",

          "ops:Data_File_Info/ops:compression_algorithm" : [
            "none",
            "none",
            "none",
            "gzip"
          ],
          "ops:Data_File_Info/ops:creation_date_time" : [
            "2024-11-07T15:20:45Z",
            "2024-11-07T15:21:57Z",
            "2024-11-07T15:21:57Z",
            "2024-11-07T15:21:57Z"
          ],
          "ops:Data_File_Info/ops:file_name" : [
            "rex_inst_overview.docx",
            "rex_inst_overview.pdf",
            "rex_inst_overview.docx",
            "rex_inst_overview.docx.gzip"
          ],
          "ops:Data_File_Info/ops:file_ref" : [
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf",
            "s3://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx.gzip"
          ],
          "ops:Data_File_Info/ops:file_size" : [
            "4610645",
            "499098",
            "4610645",
            "12345"
          ],
          "ops:Data_File_Info/ops:md5_checksum" : [
            "5180f1482619b7a80e8e1a3b8ce57af1",
            "a161853d316351d6a4f0a873daf3353f"
          ],
          "ops:Data_File_Info/ops:mime_type" : [
            "application/x-tika-ooxml",
            "application/pdf",
            "UNK",
            "UNK"
          ],
          "ops:Data_File_Info/ops:ref_file_available" : [
            "false",
            "false",
            "true",
            "true"
          ],
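
For comparison, the client-side choice that this flat layout implies could look roughly like the sketch below. The function name `available_refs` and its `scheme` and `compression` parameters are hypothetical, not part of registry-api; the string values `"true"` and `"false"` follow the fragment above.

```python
from urllib.parse import urlparse

PREFIX = "ops:Data_File_Info/ops:"


def available_refs(doc, scheme=None, compression=None):
    """Return the file_ref values that are available, optionally filtered.

    scheme filters on the URL scheme ("https", "s3", ...); compression filters
    on the compression algorithm ("none", "gzip", ...). None means no filter.
    """
    selected = []
    rows = zip(
        doc[PREFIX + "file_ref"],
        doc[PREFIX + "ref_file_available"],
        doc[PREFIX + "compression_algorithm"],
    )
    for ref, is_available, algorithm in rows:
        if is_available != "true":
            continue
        if scheme is not None and urlparse(ref).scheme != scheme:
            continue
        if compression is not None and algorithm != compression:
            continue
        selected.append(ref)
    return selected
```

Note that the filter never consults `ops:file_name`, so it says nothing about which label file a returned reference corresponds to.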

@al-niessner
Contributor Author

Yes. ops:file_name is taken directly from the label. It is not just some random file on the file system. It allows one to say that the number of ops:file_name entries matches the number of data file names in the label. Using altlocs as a list of objects allows for a simple and direct mapping from the file name in a label that somebody is looking for to a list of possible alternatives - no guessing.

If you are changing the design of ops:Data_File_Info to no longer have any real relation to the label, then your solution is the shortest and brittlest. Brittle because nobody will ever find rex_inst_overview.docx.gzip in the label. Brittle because a lack of enforced name mappings - as if we can enforce anything - may leave you with file names that cannot be mapped back to the label file names.

@al-niessner
Contributor Author

@jordanpadams

See above comment.

As a more practical example, a user wants to search for all of the alternative locations for <file_name> from the label that they have.

With the current altlocs, they search for the ops:Data_File_Info/ops:file_name that matches their expectation and look at altlocs. Period. No changes to registry-api. No magic functions to try and map names around. Simple: what you ask for is what you get.

With your example? Use regular expressions? Make the registry-api super complicated so that, when they say they are looking for ops:Data_File_Info/ops:file_name, it does not listen to them and widens the search to find similar names? How would you differentiate the PDF from the DOCX from any other extension then? Is it really .docx.gzip or are half of them .dgz? Way more complication post harvest.
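
The altlocs lookup argued for in this comment can be sketched in a few lines of Python. The function name `alternate_locations` is hypothetical, and the sketch assumes the document layout from the altlocs example earlier in this thread, where `ops:altlocs` holds one list per `ops:file_name` entry.

```python
PREFIX = "ops:Data_File_Info/ops:"


def alternate_locations(doc, label_file_name):
    """Return the alternate locations recorded for a file name from the label.

    The match is exact: no regular expressions and no guessing at extensions.
    Raises ValueError if the name is not one of the label's file names.
    """
    names = doc[PREFIX + "file_name"]
    # A document without altlocs is read as having none for any file.
    altlocs = doc.get(PREFIX + "altlocs", [[] for _ in names])
    return altlocs[names.index(label_file_name)]
```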

@jordanpadams
Member

@al-niessner

If you are changing the design of ops:Data_File_Info to no longer have any real relation to the label, then your solution is the shortest and brittlest. Brittle because nobody will ever find rex_inst_overview.docx.gzip in the label. Brittle because a lack of enforced name mappings - as if we can enforce anything - may leave you with file names that cannot be mapped back to the label file names.

I definitely understand your viewpoint, but I think what we were considering is the ops: namespace is not really part of the label to begin with. It is meta-metadata (e.g. the file_ref is harvest config + label info, file_size is generated by harvest, etc.)

In a future iteration of the API responses (hopefully coming in the next few months), we will probably pull out the entire ops portion of the response to clearly delineate that information from the actual label information.

Do we think adding a flag like primary or archival or preferred (or several of these?) would help delineate between the files that are specified in the label vs. what is available vs. what is preferred for download?

@al-niessner
Contributor Author

@jordanpadams

Do we think adding a flag like primary or archival or preferred (or several of these?) would help delineate between the files that are specified in the label vs. what is available vs. what is preferred for download?

lol - no. I do appreciate the various attempts at avoiding the direct and simplest solution. I will change the code again and just pile more junk at the end of the array to be figured out later, when the information needed to do so has been lost. I have protested enough and have made no impression - Cassandra effect.

@al-niessner
Contributor Author

@jordanpadams

All file relation information has been erased. It is now a list of random files that may or may not be related to each other or the label. You should like it now.

@al-niessner al-niessner marked this pull request as ready for review May 27, 2025 15:33
@jordanpadams
Member

jordanpadams commented May 27, 2025

@al-niessner I appreciate your willingness to change based on my dictatorship, but I was actually coming around to the other design 😄 . Was just in the process of iterating with GPT on other design considerations we should be taking into account.

That being said, let's not change it again for now. We have an API WG meeting tomorrow where we can present what you have now, and what you implemented before and we can see if anyone has a preference.

Can you provide an output of a response that we can include in our discussion tomorrow?

@al-niessner
Contributor Author

@jordanpadams

Sorry, output of a response?

@al-niessner
Contributor Author

@jordanpadams

Not a dictatorship. I have been a Cassandra most of my life and have not discovered a way out of the curse.

@jordanpadams
Member

@al-niessner something like this:

#126 (comment)

@al-niessner
Contributor Author

@jordanpadams

This is what it looks like now:

          "lidvid" : "urn:nasa:pds:nh_documents:rex:rex_inst_overview::1.0",

          "ops:Data_File_Info/ops:compression_algorithm" : [
            "none",
            "none"
          ],
          "ops:Data_File_Info/ops:creation_date_time" : [
            "2024-11-07T15:20:45Z",
            "2024-11-07T15:21:57Z"
          ],
          "ops:Data_File_Info/ops:file_name" : [
            "rex_inst_overview.docx",
            "rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_ref" : [
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.docx",
            "https://pdssbn.astro.umd.edu/holdings/pds4-nh_documents-v4.0/rex/documents/rex_inst_overview.pdf"
          ],
          "ops:Data_File_Info/ops:file_size" : [
            "4610645",
            "499098"
          ],
          "ops:Data_File_Info/ops:md5_checksum" : [
            "5180f1482619b7a80e8e1a3b8ce57af1",
            "a161853d316351d6a4f0a873daf3353f"
          ],
          "ops:Data_File_Info/ops:mime_type" : [
            "application/x-tika-ooxml",
            "application/pdf"
          ],
          "ops:Data_File_Info/ops:ref_file_available" : [
            "true",
            "true"
          ],

and your comment is what it will look like in the future.

@jordanpadams
Member

@al-niessner thanks!

@tloubrieu-jpl see above for the two alternatives for handling alternate file paths in terms of the response. we may need to discuss at the breakout what else we might need for the config updates per https://github.com/NASA-PDS/harvest/pull/238/files#diff-524d40d335139db5bb01517fcd5be50deaeb94dd5205235fd646581acbc98c90

@jordanpadams jordanpadams changed the title from "registry 86: common changes for compression" to "Updates to commons library to support compressed and other alternate file paths for data." May 27, 2025
@jordanpadams jordanpadams changed the title from "Updates to commons library to support compressed and other alternate file paths for data." to "Updates to commons library to support compressed and other alternate file paths for data" May 27, 2025
@jordanpadams jordanpadams marked this pull request as draft May 27, 2025 21:27
@al-niessner
Contributor Author

@jordanpadams @tloubrieu-jpl

Okay, it is done. Merge when ready. @jordanpadams wanted it to be in draft for a bit longer, but all requests are implemented. Just accept it and when you want to re-org with structure we will.
