Unexpected discrepancy in image counts between 2024-01-09 and 2024-02-01 pipelines #2549
Comments
One such missing image is w245y2sf (M0001627), which is absent from stage. The discrepancy in numbers seems to start with images-initial, which has 140737 images in 01-09 and 138334 in 02-01. (I don't know why both of these indices contain exactly one more image than the API reports, but I don't think that is pertinent to this investigation.) The difference in behaviour is therefore either in or upstream of the merger.
I have also checked all of those records (wregvytb, zr767v5e, fnaw3tjz, npbc5rxa) in works-identified, and there are no significant differences between them. This implies that the problem must be within the merger? The data going into the merger is the same.
But the only merger changes between the two pipelines have been to tests and documentation.
The changes between the two pipelines in the whole repository also seem to be irrelevant. They are either test/docs, internal tools, or transformer related. Since the content in works-indexed is the same for the records I've examined above, I wouldn't expect transformer changes to be the culprit. There are some global changes that would touch the merger, but they can't be significant, as they simply make implicit things explicit to fix some warnings (#2533, #2532).
One possibility is that the new pipeline is actually correct, and it is the old pipeline that is wrong. Perhaps this is evidence that we need to Support states for images in the pipeline, à la Works, and that these images were all suppressed at some point during January. I cannot see M0001627 on the Suppression List. This idea was inspired by another of the missing Images, k5bp74r4. Its corresponding Work describes a potentially controversial photograph. The terms of use state that
However, https://wellcomecollection.org/search/images?query=k5bp74r4 is a search for that very photograph, and it displays, online, this image which is "not available". I cannot see M0000158 on the suppression list either.
The relevant log entries for the two pipelines look the same: 2024-01-09:
2024-02-09:
The same calm record is the target, then M0001627 and the two b33067491s are redirected to it. There are no remaining records to become records on their own.
I believe I have found out what is happening, but it is a surprise that this has not happened to such a degree before. The magic is in these three log entries from 2024-01-09:
Which are followed quickly by:
Whereas in 02-01, the first time M0001627 is encountered, it is already part of a match with b33067491.
So, in 01-09, the Image is created as its own entity because the merger first encounters it on its own. Later, the merger encounters all the other things it has merged with, but there is no mechanism by which that image on its own can be turned into a redirect. In 02-01, the image is never created, because the first time it appears, it is as part of a merge.
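The order dependence described above can be sketched with a toy model. This is not the real merger code: `standalone_images`, the `miro/`, `sierra/` and `calm/` prefixes, and the `calm/target` identifier are all hypothetical names invented for illustration. The only behaviour it assumes is what is stated above: a Miro record seen on its own becomes an Image, a merge emits no standalone image, and nothing is ever retracted.

```python
def standalone_images(arrivals):
    """Toy model of the order dependence; not the real merger.

    Each arrival is the set of records the merger sees together at that
    moment. Identifiers and prefixes are hypothetical.
    """
    emitted = set()
    for graph in arrivals:
        if len(graph) == 1:
            (record,) = graph
            # A Miro record seen on its own becomes a standalone Image.
            if record.startswith("miro/"):
                emitted.add(record)
        # A larger graph is a merge: no standalone image is emitted,
        # and nothing emitted earlier is revoked.
    return emitted


full_graph = {"calm/target", "miro/M0001627", "sierra/b33067491"}

# 01-09: the Miro record arrives alone first, then as part of the merge.
arrivals_01_09 = [{"miro/M0001627"}, full_graph]

# 02-01: the Miro record is already matched the first time it is seen.
arrivals_02_01 = [{"miro/M0001627", "sierra/b33067491"}, full_graph]

images_01_09 = standalone_images(arrivals_01_09)
images_02_01 = standalone_images(arrivals_02_01)
```

Under these assumptions the first ordering leaves `miro/M0001627` behind as an Image, while the second emits nothing, even though both end with the same matcher graph.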
So, the questions are: When a merge subsumes an Image in this scenario, currently, no "standalone images" (imagesWithSources) are emitted. Should they be? Is ImagesRule incorrect? Should Images have state so that they can become redirects? Less important, but how did this not happen at this scale before? It is vaguely possible that changes to transformers have made the METS Transformer a bit slower and the Sierra transformer a bit faster, so the records now land on the merger in a different order, but that's clutching at straws a bit.
This is also an argument in favour of a DAG-based approach to the pipeline. This discrepancy happens when the matcher/merger operates with an information deficit. In the earlier pipeline, the Miro record is processed (in principle, incorrectly) in full before the other records in its matcher graph are encountered. It is then processed (in principle, correctly) again. I say "in principle" there, because there is a possibility that neither process is actually correct, but the correctness stated is from the PoV of the spirit of the existing code and the entirety of the data.
I wonder if the reason this has not been spotted before is that we have become desensitised to diff_tool reports that simply declare that the "result count differs". When 01-09 was deployed, there was a difference reported for /catalogue/v2/images, and also for the one in October, but there is no record of how much they differed, nor of whether anyone actually looked.
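As a rough illustration of a report that records the size of the difference rather than only its existence, here is a minimal sketch. It is not the diff_tool code; `describe_count_diff` is a hypothetical helper, and the counts are the production and staging figures quoted in this issue.

```python
def describe_count_diff(label, old_count, new_count):
    """Describe how far two result counts differ (hypothetical helper)."""
    delta = new_count - old_count
    if delta == 0:
        return f"{label}: result count unchanged ({old_count})"
    # Express the change relative to the old count; guard against zero.
    percent = abs(delta) / old_count * 100 if old_count else float("inf")
    direction = "more" if delta > 0 else "fewer"
    return (
        f"{label}: {abs(delta)} {direction} results "
        f"({old_count} -> {new_count}, {percent:.2f}% of the old count)"
    )


message = describe_count_diff("/catalogue/v2/images", 140736, 138333)
print(message)
```

A message of this shape would leave a record of the magnitude each time, so a jump of a few thousand images stands out from a drift of a handful.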
Following an excellent discussion on Slack, it looks like the current behaviour as expressed in the merger application is correct. However, the bug is caused by the information deficit mentioned above. There is currently no way for the Image record to be revoked upon it later being revealed as digmiro.
This issue is not sufficient to block the deployment of a new pipeline.
So I'll release that pipeline now.
It looks like fixing this definitively could be quite a significant amount of work: Pipeline Rearchitecture stuff. A short-term fix could be to change the manual part of the reindex process to run everything but Miro first, then run Miro.
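The idea behind the short-term fix can be sketched as follows, under the simplifying assumption that the merger only knows about matches with records that have already arrived. This is a toy model, not the reindex tooling: `stray_images`, `miro_last`, the `links` mapping and the source prefixes are all hypothetical.

```python
def stray_images(arrival_order, links):
    """Return Miro records emitted as standalone images in a toy merger.

    `links` maps each record to the records it is ultimately matched with.
    A Miro record becomes a standalone image if none of its matches has
    arrived yet, and is never revoked afterwards.
    """
    seen = set()
    images = set()
    for record in arrival_order:
        seen.add(record)
        known_matches = links.get(record, set()) & seen
        if record.startswith("miro/") and not known_matches:
            images.add(record)
    return images


def miro_last(records):
    """Order records so that everything but Miro is processed first."""
    # sorted() is stable, and False sorts before True.
    return sorted(records, key=lambda r: r.startswith("miro/"))


links = {
    "miro/M0001627": {"sierra/b33067491"},
    "sierra/b33067491": {"miro/M0001627"},
}
unlucky_order = ["miro/M0001627", "sierra/b33067491"]

before = stray_images(unlucky_order, links)
after = stray_images(miro_last(unlucky_order), links)
```

In this model, processing Miro last means every Miro record arrives with its matches already known, so no unwanted standalone image is created. It sidesteps the information deficit for a full reindex only; it does not give Images a way to be revoked later.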
I have run that process for a new pipeline, 2024-02-09, and it resulted in 2500 fewer (unwanted) Images than 02-01.
The production API is currently linked to the 01-09 pipeline
https://api.wellcomecollection.org/catalogue/v2/images
The staging API is currently linked to the 02-01 pipeline
https://api-stage.wellcomecollection.org/catalogue/v2/images
There are 140736 images in production, but only 138333 in staging. A shortfall of 2403.