Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected discrepancy in image counts between 2024-01-09 and 2024-02-01 pipelines #2549

Closed
paul-butcher opened this issue Feb 7, 2024 · 15 comments
Assignees

Comments

@paul-butcher
Copy link
Contributor

The production API is currently linked to the 01-09 pipeline

https://api.wellcomecollection.org/catalogue/v2/images

The staging API is currently linked to the 02-01 pipeline
https://api-stage.wellcomecollection.org/catalogue/v2/images

There are 140736 images in production, but only 138333 in staging. A shortfall of 2403.

@paul-butcher paul-butcher self-assigned this Feb 7, 2024
@paul-butcher
Copy link
Contributor Author

paul-butcher commented Feb 7, 2024

One such missing image is w245y2sf (M0001627), which is absent from stage.

The discrepancy in numbers seems to start with images-initial which has 140737 in 01-09 and 138334 in 02-01. (don't know why both these indices have exactly one more image in them than the API reports, but I don't think that is pertinent to this investigation).

The difference of behaviour is therefore either in or upstream of the merger.

@paul-butcher paul-butcher moved this from Backlog to In progress in Digital platform Feb 7, 2024
@paul-butcher
Copy link
Contributor Author

The matcher graph is identical between the two indices.

npbc5rxa-02-01

@paul-butcher
Copy link
Contributor Author

I have also checked all of those records: wregvytb, zr767v5e, fnaw3tjz, npbc5rxa in works-identified, and there are no significant differences between them.

This implies that the problem must be within the merger? The data going in to the merger is the same.

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Feb 7, 2024

But the only merger changes between the two pipelines have been to tests and documentation.

@paul-butcher
Copy link
Contributor Author

The changes between the two pipelines in the whole repository seem to also be irrelevant. They are either test/docs, internal tools, or transformer related.

Since the content in works-indexed is the same for the records I've examined above, I wouldn't expect transformer changes to be the culprit.

There are some global changes that would touch the merger, but they can't be significant, as they simply make implicit things explicit to fix some warnings (#2533, #2532).

@paul-butcher
Copy link
Contributor Author

One possibility is that the new pipeline is actually correct, and it is the old pipeline that is wrong. Perhaps this is an evidence that we need to Support states for images in the pipeline, a la Works, and that these images have all been made suppressed at some point during January.

I cannot see M0001627 on the Suppression List

This idea was inspired by another of the missing Images, k5bp74r4.

Its corresponding Work describes a potentially controversial photograph. The terms of use state that

This item has been digitised but is not available online. It needs to be accessed onsite using specialist equipment

However, https://wellcomecollection.org/search/images?query=k5bp74r4 is a search for that very photograph, and displays, online, this image which is "not available".

However, I cannot see M0000158 on the suppression list either.

@paul-butcher
Copy link
Contributor Author

The relevant log entries for the two pipelines look the same:

2024-01-09:

13:06:56.736 [main-actor-system-akka.actor.default-dispatcher-10] INFO w.p.merger.services.PlatformMerger$ - Merged target(id=Work[calm-record-id/dae7eeb1-3114-4bb2-b247-d94b0d0d06c6]) with redirected[(id=Work[miro-image-number/M0001627]),(id=Work[sierra-system-number/b33067491]),(id=Work[mets/b33067491])] and remaining[]

2024-02-09:

10:13:27.567 [main-actor-system-akka.actor.default-dispatcher-7] INFO w.p.merger.services.PlatformMerger$ - Merged target(id=Work[calm-record-id/dae7eeb1-3114-4bb2-b247-d94b0d0d06c6]) with redirected[(id=Work[miro-image-number/M0001627]),(id=Work[sierra-system-number/b33067491]),(id=Work[mets/b33067491])] and remaining[]

The same calm record is the target, then M0001627 and the two b33067491s are redirected to it. There are no remaining records to become records on their own.

@paul-butcher
Copy link
Contributor Author

I believe I have found out what is happening, but it is a surprise that this has not happened to such a degree before.

The magic is in these three log entries from 2024-01-09:

Jan 9, 2024 @ 16:34:47.737 [main-actor-system-akka.actor.default-dispatcher-8] INFO w.p.transformer.TransformerWorker - TransformerWorker: from Version(M0001627,1) transformed work with id Work[miro-image-number/M0001627]

Jan 9, 2024 @ 16:39:02.717
16:39:02.717 [main-actor-system-akka.actor.default-dispatcher-8] INFO w.p.merger.services.PlatformMerger$ - Processing (id=Work[miro-image-number/M0001627])

Jan 9, 2024 @ 16:39:02.717
16:39:02.717 [main-actor-system-akka.actor.default-dispatcher-8] INFO w.p.merger.services.PlatformMerger$ - Created images [(id=Image[miro-image-number/M0001627])]

Which are followed quickly by:

Jan 9, 2024 @ 16:40:20.534
16:40:20.534 [main-actor-system-akka.actor.default-dispatcher-9] INFO w.p.merger.services.PlatformMerger$ - Attempting to merge target(id=Work[sierra-system-number/b33067491]) with sources[(id=Work[miro-image-number/M0001627])]

Jan 9, 2024 @ 16:40:20.536
16:40:20.536 [main-actor-system-akka.actor.default-dispatcher-9] INFO w.p.merger.services.PlatformMerger$ - Merged target(id=Work[sierra-system-number/b33067491]) with redirected[(id=Work[miro-image-number/M0001627])] and remaining[]

Whereas in 02-01, the first time M0001627 is encountered, it is already part of a match with b33067491.

09:21:47.614 [main-actor-system-akka.actor.default-dispatcher-11] INFO w.p.merger.services.PlatformMerger$ - Attempting to merge target(id=Work[sierra-system-number/b33067491]) with sources[(id=Work[miro-image-number/M0001627])]

16:40:20.536 [main-actor-system-akka.actor.default-dispatcher-9] INFO w.p.merger.services.PlatformMerger$ - Merged target(id=Work[sierra-system-number/b33067491]) with redirected[(id=Work[miro-image-number/M0001627])] and remaining[]

So. In 01-09, the Image is created as its own entity because the merger first encounters it on its own. Then later, the merger encounters all the other things it has merged with, but there is no mechanism by which that image on its own can be turned into a redirect.

In 02-01, the image is never created, because the first time it appears, it is as part of a merge.

@paul-butcher
Copy link
Contributor Author

So, the questions are:

When a merge subsumes an Image in this scenario, currently, no "standalone images" (imagesWithSources) are emitted. Should they be? Is ImagesRule incorrect?

Should Images have state so that they can become redirects?

Less important, but how did this not happen at this scale before? It is vaguely possible that changes to transformers have led to the METS Transformer being a bit slower and maybe the Sierra transformer is a bit faster, so now the records land on the merger in a different order, but that's clutching at straws a bit.

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Feb 7, 2024

This is also an argument in favour of a DAG-based approach to the pipeline.

This discrepancy happens when the matcher/merger operates with an information deficit. In the earlier pipeline, the Miro record is processed (in principle, incorrectly) in full before the other records in its matcher graph are encountered. It is then processed (in principle, correctly) again.

I say "in principle" there, because there is a possibility that neither process is actually correct, but the correctness stated is from the PoV of the spirit of the existing code and entirety of the data.

@paul-butcher
Copy link
Contributor Author

I wonder if the reason this has not been spotted before is because we have become desensitised to diff_tool reports that simply declare that the "result count differs". When 01-09 was deployed, there was a difference reported for /catalogue/v2/images, and also for the one in October, but there is no record as to how much they differed, nor whether anyone actually looked.

@paul-butcher
Copy link
Contributor Author

Following an excellent discussion on Slack, it looks like the current behaviour as expressed in the merger application is correct. However, the bug is caused by the information deficit mentioned above.

There is currently no way for the Image record to be revoked upon it later being revealed as digmiro.

@paul-butcher
Copy link
Contributor Author

This issue is not sufficient to block the deployment of a new pipeline.

  • It is already present on the current production pipeline
  • The new pipeline is numerically better, as it has 2400 fewer records that shouldn't be there.

So I'll release that pipeline now.

@paul-butcher
Copy link
Contributor Author

paul-butcher commented Feb 9, 2024

It looks like this could be quite a significant amount of work to fix this definitively. Pipeline Rearchitecture stuff. A short-term fix could be to change the manual part of the reindex process to run everything-but-Miro when reindexing, then run miro.

@paul-butcher
Copy link
Contributor Author

I have run that process for a new pipeline 2023-02-09, and it resulted in 2500 fewer (unwanted) Images than 02-01,

@pollecuttn pollecuttn moved this from In progress to Next in Digital platform Feb 12, 2024
@pollecuttn pollecuttn moved this from Next to In progress in Digital platform Feb 15, 2024
@paul-butcher paul-butcher moved this from In progress to Blocked in Digital platform Feb 20, 2024
@paul-butcher paul-butcher moved this from Blocked to Done in Digital platform Feb 20, 2024
@pollecuttn pollecuttn moved this from Done to Archive in Digital platform Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant