superseded_by fields with null values causing multiple versions of same products to appear in API searches #112

Closed
jordanpadams opened this issue Mar 28, 2024 · 9 comments
Labels: B15.0, bug (Something isn't working), s.high (High severity), sprint-backlog

Comments

@jordanpadams
Member

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

Querying https://pds.nasa.gov/api/search/1/products?q=(lid%20eq%20%22urn:nasa:pds:lro_lola_edr:data_raw%22) returns 2 records for a single LID.

🕵️ Expected behavior

I expected 1 product to be returned

📜 To Reproduce

https://pds.nasa.gov/api/search/1/products?q=(lid%20eq%20%22urn:nasa:pds:lro_lola_edr:data_raw%22)

"summary": {
  "q": "(lid eq \"urn:nasa:pds:lro_lola_edr:data_raw\")",
  "hits": 2,
  "took": 90,
  "search_after": [],
  "limit": 100,
  "sort": [],
  "properties": [
  ...
  "lidvid": [
    "urn:nasa:pds:lro_lola_edr:data_raw::4.0"
  ],
  ...
  "ops:Provenance.ops:superseded_by": [
    "null"
  ],
  ...
  "lidvid": [
    "urn:nasa:pds:lro_lola_edr:data_raw::5.0"
  ],
  ...
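
The reproduction above can be sketched in Python without touching the network. The helper names `build_lid_query_url` and `versions_by_lid` are hypothetical, and the second function only assumes that the `lidvid` strings have already been pulled out of the response; it does not assume anything about the rest of the response layout.

```python
from urllib.parse import quote

BASE_URL = "https://pds.nasa.gov/api/search/1/products"


def build_lid_query_url(lid: str) -> str:
    """Build the search URL for an exact-LID query, as used in this report."""
    q = f'(lid eq "{lid}")'
    # Keep parentheses and colons literal so the URL matches the one above.
    return f"{BASE_URL}?q={quote(q, safe='():')}"


def versions_by_lid(lidvids):
    """Group LIDVID strings by LID so duplicate versions are easy to spot."""
    grouped = {}
    for lidvid in lidvids:
        lid, _, vid = lidvid.partition("::")
        grouped.setdefault(lid, []).append(vid)
    return grouped


url = build_lid_query_url("urn:nasa:pds:lro_lola_edr:data_raw")
returned = [
    "urn:nasa:pds:lro_lola_edr:data_raw::4.0",
    "urn:nasa:pds:lro_lola_edr:data_raw::5.0",
]
duplicates = {lid: vids for lid, vids in versions_by_lid(returned).items() if len(vids) > 1}
print(url)
print(duplicates)
```

Any LID that maps to more than one version in `duplicates` is a case like the one reported here.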

🖥 Environment Info

Chrome

📚 Version of Software Used

Latest

🩺 Test Data / Additional context

No response

🦄 Related requirements

No response

⚙️ Engineering Details

No response

@jordanpadams jordanpadams added bug Something isn't working needs:triage labels Mar 28, 2024
@jordanpadams jordanpadams self-assigned this Mar 28, 2024
@jordanpadams jordanpadams added s.high High severity s.medium Medium level severity B15.0 and removed needs:triage s.high High severity s.medium Medium level severity labels Mar 28, 2024
@alexdunnjpl
Contributor

@jordanpadams I'll investigate shortly, but is it plausible that the two documents in question exist on different nodes and that's why they lack correct provenance history?

@jordanpadams
Member Author

@alexdunnjpl hmmm. that would be interesting. they both say GEO, but I wonder if we loaded that data as a test case into the EN registry?

@alexdunnjpl
Contributor

alexdunnjpl commented Apr 18, 2024

I'll dig into that now. If it ends up being the case that they're in two nodes,

  • is this an expected case which the API must handle? Is this actually a failure of the API, or a limitation of the definition of provenance metadata?
  • is the expected behaviour in this case that the higher VID should prevail?

@alexdunnjpl
Contributor

alexdunnjpl commented Apr 18, 2024

@jordanpadams if I'm interpreting the swagger docs correctly, I don't think the request/expectation in the original ticket is even valid.

latest/all subroutes are only defined for /products/{identifier}/{all | latest}. In that case, it's reasonable to expect that no latest/all specification yields only the latest product.

But why would there be any expectation that /products/?q=somequery would only return the latest version of the LIDs of the full set of products resulting from that query?

  1. Intuitively, the result of a query should be that you get what you ask for
  2. Even if 1) isn't persuasive, there's no way to specify /products/all?q=somequery because all would be parsed as an identifier. And it seems implausible that there should be no way to use a custom query to get all the products you want.

EDIT: Having said all that, there are actually four versions in the registry - 1.0, 2.0, 4.0, 5.0, so it looks as though the API may indeed be filtering to superseded_by=null in that query

Looking for the v5.0 doc, it's in geo just like the others, but it doesn't have any sweepers metadata set.

I'm running a local instance of sweepers on it - my suspicion is that GEO hasn't been running sweepers, based on what I'm seeing.
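
To illustrate why two records would come back, here is a minimal sketch of a "latest only" filter keyed on superseded_by. It is not the registry-api implementation. It assumes that a document counts as latest when the field is absent or holds the literal string "null" seen in the response above; the successor values shown for 1.0 and 2.0 are illustrative, and `looks_latest` is a hypothetical name.

```python
SUPERSEDED_BY = "ops:Provenance.ops:superseded_by"
LID = "urn:nasa:pds:lro_lola_edr:data_raw"


def looks_latest(doc: dict) -> bool:
    """Return True if the document carries no successor.

    Assumption: a missing field and the literal ["null"] both mean
    "not superseded".
    """
    values = doc.get(SUPERSEDED_BY)
    return values is None or values == ["null"]


# Four versions, as found in the registry. 4.0 still says "null" because
# sweepers has not run since 5.0 arrived, and 5.0 has no sweepers metadata.
docs = [
    {"lidvid": f"{LID}::1.0", SUPERSEDED_BY: [f"{LID}::2.0"]},  # illustrative
    {"lidvid": f"{LID}::2.0", SUPERSEDED_BY: [f"{LID}::4.0"]},  # illustrative
    {"lidvid": f"{LID}::4.0", SUPERSEDED_BY: ["null"]},
    {"lidvid": f"{LID}::5.0"},
]

hits = [doc["lidvid"] for doc in docs if looks_latest(doc)]
print(hits)
```

Under those assumptions both 4.0 and 5.0 pass the filter, which matches the two hits in the original report.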

@jordanpadams
Member Author

> But why would there be any expectation that /products/?q=somequery would only return the latest version of the LIDs of the full set of products resulting from that query?

It is a requirement of the system that only the latest versions of products be returned, by default, unless otherwise stated.

NASA-PDS/registry-api#428
NASA-PDS/registry-api#426

Moving the rest of the discussion to NASA-PDS/registry-api#428, since I think that is the more appropriate place for it.

@alexdunnjpl
Contributor

alexdunnjpl commented Apr 18, 2024

Task definition is v5 (ARN arn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:5) despite the latest version of that task being v9 (arn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:9).

This may explain #123 as well, if it is also lagging behind in task-definition versions and ending up using an out-of-date image as a result.

@sjoshi-jpl is there any reason why the schedulers shouldn't be targeting the latest version of the relevant task definition?
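
A quick way to quantify the lag is to compare the trailing revision numbers of the two ARNs. This is a rough sketch that only parses the strings quoted above; `task_definition_revision` is a hypothetical helper and makes no AWS calls.

```python
def task_definition_revision(arn: str) -> int:
    """Return the trailing revision number of an ECS task-definition ARN."""
    _, _, revision = arn.rpartition(":")
    return int(revision)


scheduled = "arn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:5"
latest = "arn:aws:ecs:us-west-2:445837347542:task-definition/pds-geo-prod-registry-sweepers-task:9"

lag = task_definition_revision(latest) - task_definition_revision(scheduled)
print(f"scheduler target is {lag} revisions behind")
```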

@alexdunnjpl
Contributor

@sjoshi-jpl re-pinging for the question above. "This doesn't matter because we're moving to MCP" is an acceptable answer, too.

@tloubrieu-jpl
Member

If we use the latest version of registry-sweepers, this should not happen again.

superseded_by: null should be set only on the latest product.
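
That invariant can be checked with a short script. The sketch below takes every LIDVID whose superseded_by is null (or unset) and flags any that has a newer sibling in the same set. The names are hypothetical, and it assumes version IDs are dot-separated integers such as 4.0.

```python
from collections import defaultdict


def vid_key(lidvid: str) -> tuple:
    """Sort key for the version part of a LIDVID, e.g. '...::4.0' -> (4, 0)."""
    vid = lidvid.partition("::")[2]
    return tuple(int(part) for part in vid.split("."))


def stale_latest_markers(unsuperseded_lidvids):
    """Return LIDVIDs marked as latest that have a newer sibling in the input."""
    by_lid = defaultdict(list)
    for lidvid in unsuperseded_lidvids:
        by_lid[lidvid.partition("::")[0]].append(lidvid)
    stale = []
    for siblings in by_lid.values():
        newest = max(siblings, key=vid_key)
        stale.extend(sorted((s for s in siblings if s != newest), key=vid_key))
    return stale


flagged = stale_latest_markers([
    "urn:nasa:pds:lro_lola_edr:data_raw::4.0",
    "urn:nasa:pds:lro_lola_edr:data_raw::5.0",
])
print(flagged)
```

For the product in this ticket, 4.0 is the entry that should no longer carry a null superseded_by.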

@alexdunnjpl
Contributor

N.B. the root cause here is that sweepers has not run successfully since the latest product was ingested (probably due to the timeout issue).

It's not a bug, but may resurface.
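
For reference, the state a successful sweepers run should leave behind can be sketched as below. This is not the registry-sweepers code. It assumes each version is superseded by the next-highest version of the same LID and that the latest maps to no successor; `compute_superseded_by` is a hypothetical name.

```python
def vid_key(vid: str) -> tuple:
    """Sort key for a version ID such as '4.0' -> (4, 0)."""
    return tuple(int(part) for part in vid.split("."))


def compute_superseded_by(lid: str, vids: list) -> dict:
    """Map each LIDVID to its successor LIDVID, or None for the latest."""
    ordered = sorted(vids, key=vid_key)
    successors = {}
    for current, following in zip(ordered, ordered[1:] + [None]):
        successors[f"{lid}::{current}"] = f"{lid}::{following}" if following else None
    return successors


LID = "urn:nasa:pds:lro_lola_edr:data_raw"
expected_state = compute_superseded_by(LID, ["1.0", "2.0", "4.0", "5.0"])
for lidvid, successor in expected_state.items():
    print(lidvid, "->", successor)
```

Until a run like this completes after each ingestion, the newest and second-newest versions can both look like the latest.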

Projects
Status: 🏁 Done
Development

No branches or pull requests

3 participants