Decentralise duplicate detection #225

Open
wants to merge 1 commit into master
Commits on May 23, 2023

  1. Decentralise duplicate detection

    Previously we detected duplicate samples upon job submission. This was
    a very intricate process covering two kinds of detection: local
    duplicates and other Peekaboo instances in a cluster analysing the
    same sample concurrently. Apart from being hard to understand and
    maintain, this was inefficient for analyses which didn't involve any
    expensive operations such as offloading a job to Cuckoo. It degraded
    into a downright throughput bottleneck for analyses of large numbers
    (> 10000) of non-identical samples which are eventually ignored.
    
    This change moves duplicate handling out of the queueing into a new
    duplicate toolbox module. Duplicate detection is moved into individual
    rules. Resubmission of withheld samples is done in the worker at the end
    of ruleset processing after the processing result is saved to the
    database.
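    
    As a sketch of the rule-side check (DuplicateRule,
    withhold_if_duplicate and the result type are illustrative
    assumptions, not the actual Peekaboo API):
    
        from collections import namedtuple
    
        # hypothetical stand-in for the real rule result class
        RuleResult = namedtuple("RuleResult", "result further_analysis")
    
        class DuplicateRule:
            """Withholds all but the first identical sample in flight."""
            def __init__(self, duplicate_handler):
                self.handler = duplicate_handler
    
            def evaluate(self, sample):
                if self.handler.withhold_if_duplicate(sample):
                    # stop ruleset processing; the sample waits in the
                    # handler's backlog until its sibling finishes
                    return RuleResult("duplicate", further_analysis=False)
                return RuleResult("unknown", further_analysis=True)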
    
    Handling of local and cluster duplicates is strictly separated. While
    this doesn't make the actual code much easier to understand and
    maintain, at least the underlying concepts are somewhat untangled.
    
    The cluster duplicate handler stays mostly the same, primarily
    consisting of a coroutine which periodically tries to lock samples
    from its backlog and then submits them to the local queue.
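    
    In sketch form (try_lock, submit and the attribute names are assumed
    interfaces, not the actual implementation):
    
        import asyncio
    
        class ClusterDuplicateHandler:
            """Polling side of cluster duplicate handling (sketch)."""
            def __init__(self, db, queue, interval=5):
                self.db = db              # shared cluster job database
                self.queue = queue        # local job queue
                self.interval = interval  # seconds between polling runs
                self.backlog = {}         # sample identity -> held samples
    
            async def poll(self):
                # periodically retry locking backlogged samples and, once
                # a lock succeeds, submit the held samples in bulk
                while True:
                    await asyncio.sleep(self.interval)
                    for identity in list(self.backlog):
                        if await self.db.try_lock(identity):
                            for sample in self.backlog.pop(identity):
                                await self.queue.submit(sample)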
    
    The local duplicate handler is now a distinct module very similar to
    the cluster duplicate handler but doesn't need any repeated polling.
    Instead, withheld potential duplicates are resubmitted once a sample
    finishes processing.
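    
    Roughly (again with assumed names, not the actual code):
    
        class LocalDuplicateHandler:
            """Event-driven counterpart without a polling loop (sketch)."""
            def __init__(self, queue):
                self.queue = queue
                self.backlog = {}  # sample identity -> withheld duplicates
    
            def withhold_if_duplicate(self, sample):
                # the first identical sample passes, later ones wait
                if sample.identity in self.backlog:
                    self.backlog[sample.identity].append(sample)
                    return True
                self.backlog[sample.identity] = []
                return False
    
            async def submit_held(self, finished_sample):
                # called by the worker after the analysis result has been
                # saved; resubmitted duplicates then hit the known rule
                for held in self.backlog.pop(finished_sample.identity, []):
                    await self.queue.submit(held)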
    
    The cluster duplicate handler no longer directly interacts with the
    local duplicate handler by putting samples from its backlog into the
    latter's backlog. Instead, cluster duplicates are submitted to the
    local queue in bulk, and the local duplicate handler is expected to
    either never come into play again (because the known rule finds the
    cached previous analysis result) or automatically detect the local
    duplicates and put all but one of them into its own backlog.
    
    This new design highlighted an additional point for optimisation: if
    a sample can be locked by the cluster duplicate handler (i.e. is not
    currently being processed by another instance) but we find siblings
    of it in our own cluster duplicate backlog, then this sample was
    evidently a cluster duplicate at an earlier point in time and the
    withheld samples are waiting for the next polling run to be
    resubmitted. In this case we short-circuit the cluster duplicate
    detection and submit them to the job queue immediately.
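    
    The short-circuit might look roughly like this in the cluster
    handler's submit path (a sketch under the same assumed names as
    above):
    
        async def submit(self, sample):
            if not await self.db.try_lock(sample.identity):
                # another instance is analysing it: hold the sample back
                self.backlog.setdefault(sample.identity, []).append(sample)
                return
            # lock acquired: siblings already in our backlog were cluster
            # duplicates earlier; submit them right away instead of
            # waiting for the next polling run
            held = self.backlog.pop(sample.identity, [])
            for each in [sample] + held:
                await self.queue.submit(each)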
    michaelweiser committed May 23, 2023 (commit 346f9bc)