Deep archive working with registry produces unexpected results with primary vs secondary products #164
Comments
@nutjob4life added some details to the ticket |
@jordanpadams thanks, much appreciated! |
We need a workaround, since the API will not be ready for that any time soon. |
@nutjob4life I updated the ticket with the proposed / hacked solution |
@nutjob4life note: this is still blocked by NASA-PDS/registry#185 |
@jordanpadams copy that; standing by |
The "develop" branch of the registry loads data from this repository. There is one secondary collection. Thanks to @jordanpadams for pointing this out … and for providing this file over Slack. |
Note to self: to load the file in this comment:
Start the registry
$ git clone https://github.com/NASA-PDS/registry.git
$ cd registry/docker/certs
$ ./generate-certs.sh
$ cd ..
$ docker compose --profile=int-registry-service-loader up
Let that run for a while as it does its thing. Eventually you'll see
meaning things have more or less gone idle. Leave this running in a terminal session.
Load the file
In a new terminal session, unzip the
$ cd /tmp
$ unzip urn-nasa-pds-insight_rad.zip
Then create
<?xml version='1.0' encoding='UTF-8'?>
<harvest nodeName='PDS_ENG'>
<directories>
<path>/mnt/urn-nasa-pds-insight_rad</path>
</directories>
<registry url='https://elasticsearch:9200' index='registry' auth='/etc/es-auth.cfg'/>
<fileInfo>
<fileRef replacePrefix='/mnt/urn-nasa-pds-insight_rad' with='http://localhost:81/archive'/>
</fileInfo>
<autogenFields/>
</harvest>
Finally, back in
$ docker compose --profile int-registry-batch-loader run \
--rm --entrypoint harvest \
--volume /tmp/harvest-config.xml:/mnt/harvest-config.xml \
--volume /tmp/urn-nasa-pds-insight_rad:/mnt/urn-nasa-pds-insight_rad \
registry-loader-test-init \
-c /mnt/harvest-config.xml --overwrite
And eventually you'll see:
Query the API
Running
should then give you a valid response but gives 404 for some reason. So forget all of the above and just use the test data without loading. |
@jordanpadams I think I need a little help in reproducing this. I've started up a local Registry API loaded with test data and queried it with
And sure enough I see
{
  "id" : "urn:nasa:pds:insight_rad::2.1",
  …
  "pds:Bundle_Member_Entry.pds:lid_reference" : [
    "urn:nasa:pds:insight_rad:data_raw",
    "urn:nasa:pds:insight_rad:data_calibrated",
    "urn:nasa:pds:insight_rad:data_derived",
    "urn:nasa:pds:insight_documents:document_hp3rad"
  ],
  "pds:Bundle_Member_Entry.pds:member_status" : [
    "Primary",
    "Primary",
    "Primary",
    "Secondary"
  ],
  …
}
I run
Checking the primary references:
$ egrep -c 'data_raw|data_calibrated|data_derived' *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:48
insight_rad_v2.1_20240719_sip_v1.0.tab:48
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:48
But checking the secondary reference:
$ egrep -c document_hp3rad *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240719_sip_v1.0.tab:0
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:0
So … working as intended? Or maybe I'm just not "getting" it? |
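The bundle record quoted above lists member LIDs and member statuses as two separate arrays. A minimal sketch of reading them together, assuming (this is not confirmed anywhere in the thread) that entry i of `member_status` describes entry i of `lid_reference`; the function name `split_members` is hypothetical and not part of any PDS package:

```python
def split_members(bundle_doc):
    """Split a bundle record's member LIDs into (primary, secondary) lists."""
    lids = bundle_doc["pds:Bundle_Member_Entry.pds:lid_reference"]
    statuses = bundle_doc["pds:Bundle_Member_Entry.pds:member_status"]
    if len(lids) != len(statuses):
        raise ValueError("lid_reference and member_status differ in length")
    primary, secondary = [], []
    for lid, status in zip(lids, statuses):
        # Assumption: the two arrays are parallel, so position pairs them up.
        if status.strip().lower() == "primary":
            primary.append(lid)
        else:
            secondary.append(lid)
    return primary, secondary


# The fields shown in the comment above, with the elided fields left out.
bundle = {
    "id": "urn:nasa:pds:insight_rad::2.1",
    "pds:Bundle_Member_Entry.pds:lid_reference": [
        "urn:nasa:pds:insight_rad:data_raw",
        "urn:nasa:pds:insight_rad:data_calibrated",
        "urn:nasa:pds:insight_rad:data_derived",
        "urn:nasa:pds:insight_documents:document_hp3rad",
    ],
    "pds:Bundle_Member_Entry.pds:member_status": [
        "Primary",
        "Primary",
        "Primary",
        "Secondary",
    ],
}

primary, secondary = split_members(bundle)
print(primary)
print(secondary)
```

Under that assumption, the only secondary member is `document_hp3rad`, which matches the zero counts in the second `egrep` run.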
@nutjob4life so this may be working for bundles then, but it does not for the underlying collections. If you grep for "test" in the .tab files, they should show up. If not, this may be because deep-archive skipped them because the files don't actually exist on the file system. You may need to create those test files to make that work. Sorry. Not familiar enough with how the registry and API work. |
Okay, thanks @jordanpadams. Let me unpack what you said:
They don't:
$ egrep -c test *.tab
insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab:0
insight_rad_v2.1_20240719_sip_v1.0.tab:0
insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab:0
We're talking pds-deep-registry-archive, not pds-deep-archive; and pds-deep-registry-archive doesn't look at the filesystem; it uses the Registry API. Isn't that the crux of this ticket? That pds-deep-registry-archive produces different results from the filesystem version because the API does not convey "secondary"-ness? |
FYI @jordanpadams, thanks for the pseudocode. I've implemented it as follows:
bundlelid = bundlelidvid.split(".")[0]  # @jordanpadams, here
for collection in _getproducts(url, bundlelidvid, allcollections):
    if bundlelid in collection['properties']['lid']:  # @jordanpadams, and here
        _addfiles(collection, bac)
        for product in _getproducts(url, collection["id"]):
            if collection['properties']['lid'] in product['properties']['lid']:  # @jordanpadams, and finally here
                _addfiles(product, bac)
It results in trimmed down
$ wc -l *.tab
2 insight_rad_v2.1_20240719_checksum_manifest_v1.0.tab
2 insight_rad_v2.1_20240719_sip_v1.0.tab
2 insight_rad_v2.1_20240719_transfer_manifest_v1.0.tab
6 total
So I think I'm definitely missing the point! Should we split on
I've got to put out a CrowdStrike fire so I'll check back on this ticket over the weekend. |
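One possible explanation for the nearly empty manifests, offered as an assumption rather than a confirmed diagnosis: a PDS4 LIDVID joins the LID and the version with `::`, so splitting on `.` leaves part of the version attached to the LID. The helper name `lid_of` below is hypothetical.

```python
def lid_of(lidvid):
    """Return the LID portion of a LIDVID, i.e. the text before '::'."""
    lid, _sep, _vid = lidvid.partition("::")
    return lid


bundlelidvid = "urn:nasa:pds:insight_rad::2.1"

# Splitting on "." keeps the "::2" of the version in the result.
print(bundlelidvid.split(".")[0])  # urn:nasa:pds:insight_rad::2

# Splitting on "::" yields the bare LID.
print(lid_of(bundlelidvid))  # urn:nasa:pds:insight_rad
```

A string ending in `::2` is unlikely to appear inside any collection LID, which would be consistent with every collection being filtered out.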
Tried with Will try again when my thinking is clearer later on
Hah! Welcome to my world |
Update: multiple attempts to load custom data into a local registry have met with frustrating failure. Will try again on Monday. |
Monday update: okay, so I finally figured out why my specifically-crafted test data isn't getting loaded: |
@nutjob4life per one of your comments above, we should split on
|
Yep, tried it |
@nutjob4life here is an updated data set with those test foo/bar products actually included, and the LIDs updated to be valid. |
Oh okay, going from |
@jordanpadams okay, loaded the ZIP file from your comment into my local registry and generated a deep archive against it; here's what I get:
The registry never emitted the secondaries. I ran it again, turning on
And try it myself with
All the primaries are there, sure:
That makes sense; there are 7 primaries in the
Takeaway: my local registry is smart enough to not say a single peep about secondaries or something else is going on in that ticket.
PS: Just to make sure I wasn't using some other registry or was using older data or was otherwise confused, I edited the
And the secondaries still don't appear:
Nor do they appear in the generated deep archive, as before:
My Docker composition is running |
@nutjob4life copy that. Ok, maybe let's pause on this ticket and jump back to the Wordpress CD work for now until we have the new working registry and API up and running to test against. |
@jordanpadams sure thing. Say, is there a chance we can find out how in the other ticket they're invoking |
@nutjob4life I could ask, but, honestly, I doubt they remember at this point. I also think this person has moved positions. For now, I would just say let's pause and see how the software operates with the new registry and API up and running. If we can't reproduce, and the issue happens again, then we can go from there with fixing however they are running the software. |
@jordanpadams okie doke … letting go for now |
Tested this with the latest version of deep archive, and it no longer appears to be grabbing these secondary products. Closing as invalid. |
Whew! Thanks @jordanpadams! |
Checked for duplicates
Yes - I've already checked
Describe the bug
For context, please see this ticket and the discussion that followed.
Expected behavior
Honestly, not sure what to expect and I'm leaving on vacation today but hope @jshughes can provide some details in the meantime.
To Reproduce
See this ticket.
Environment Info
Version of Software Used
No response
Test Data / Additional context
No response
Related requirements
#50
Engineering Details
In PDS4, collections can be either primary or secondary members of a bundle. A primary member essentially means that, as far as the archive is concerned, this is where the collection "resides" in the archive forever. A secondary member can essentially be thought of as a symlink to a collection that does not technically belong to that bundle; it is more like a reference for informational purposes for the data user.
Since the API will not be updated in the near term to support this, let's hack this by looking at the LID of the products returned.
Here is some pseudocode:
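The sketch below is one possible reading of the LID-based check described above, not the pseudocode originally posted in the ticket. It assumes that a primary member's LID extends its parent's LID by a further colon-separated segment; `is_primary_member` and the product LID are hypothetical names used only for illustration.

```python
def is_primary_member(parent_lid, member_lid):
    """True if member_lid sits directly under parent_lid's LID namespace."""
    # Compare against parent_lid + ":" so that "urn:nasa:pds:insight_rad"
    # does not accidentally match a sibling such as "urn:nasa:pds:insight_rad2".
    return member_lid.startswith(parent_lid + ":")


bundle_lid = "urn:nasa:pds:insight_rad"
collection_lids = [
    "urn:nasa:pds:insight_rad:data_raw",
    "urn:nasa:pds:insight_rad:data_calibrated",
    "urn:nasa:pds:insight_rad:data_derived",
    "urn:nasa:pds:insight_documents:document_hp3rad",
]

# Keep only the collections whose LID lives under the bundle's LID.
kept = [lid for lid in collection_lids if is_primary_member(bundle_lid, lid)]
print(kept)

# The same test applies one level down, between a collection and its products.
hypothetical_product = "urn:nasa:pds:insight_rad:data_raw:example_product"
print(is_primary_member("urn:nasa:pds:insight_rad:data_raw", hypothetical_product))
```

A prefix test is used instead of a substring test so that a LID which merely contains the bundle's LID somewhere in the middle is not treated as a member.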
Integration & Test
No response