Decentralise duplicate detection #225

Open
wants to merge 1 commit into master
Commits on May 23, 2023

  1. Decentralise duplicate detection

    Previously we detected duplicate samples upon job submission. This was
    a very intricate process covering two kinds of detection: local
    duplicates and other Peekaboo instances in a cluster analysing the
    same sample concurrently. Apart from being hard to understand and
    maintain, this was inefficient for analyses which didn't involve any
    expensive operations such as offloading a job to Cuckoo. It degraded
    into a downright throughput bottleneck for analyses of large numbers
    (> 10000) of non-identical samples which are eventually ignored.
    
    This change moves duplicate handling out of the queueing into a new
    duplicate toolbox module. Duplicate detection is moved into individual
    rules. Resubmission of withheld samples is done in the worker at the end
    of ruleset processing after the processing result is saved to the
    database.
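    
    As a sketch of the rule-side check (DuplicateRule,
    withhold_if_duplicate and the result type are illustrative
    assumptions, not the actual Peekaboo API):
    
        from collections import namedtuple
    
        # hypothetical stand-in for the real rule result class
        RuleResult = namedtuple("RuleResult", "result further_analysis")
    
        class DuplicateRule:
            """Withholds all but the first identical sample in flight."""
            def __init__(self, duplicate_handler):
                self.handler = duplicate_handler
    
            def evaluate(self, sample):
                if self.handler.withhold_if_duplicate(sample):
                    # stop ruleset processing; the sample waits in the
                    # handler's backlog until its sibling finishes
                    return RuleResult("duplicate", further_analysis=False)
                return RuleResult("unknown", further_analysis=True)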
    
    Handling of local and cluster duplicates is strictly separated. While
    this doesn't make the actual code much easier to understand and
    maintain, at least the underlying concepts are somewhat untangled.
    
    The cluster duplicate handler stays mostly the same, primarily
    consisting of a coroutine which periodically tries to lock samples
    from its backlog and then submits them to the local queue.
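    
    In sketch form (try_lock, submit and the attribute names are assumed
    interfaces, not the actual implementation):
    
        import asyncio
    
        class ClusterDuplicateHandler:
            """Polling side of cluster duplicate handling (sketch)."""
            def __init__(self, db, queue, interval=5):
                self.db = db              # shared cluster job database
                self.queue = queue        # local job queue
                self.interval = interval  # seconds between polling runs
                self.backlog = {}         # sample identity -> held samples
    
            async def poll(self):
                # periodically retry locking backlogged samples and, once
                # a lock succeeds, submit the held samples in bulk
                while True:
                    await asyncio.sleep(self.interval)
                    for identity in list(self.backlog):
                        if await self.db.try_lock(identity):
                            for sample in self.backlog.pop(identity):
                                await self.queue.submit(sample)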
    
    The local duplicate handler is now a distinct module very similar to
    the cluster duplicate handler but doesn't need any repeated polling.
    Instead, withheld potential duplicates are resubmitted once a sample
    finishes processing.
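    
    Roughly (again with assumed names, not the actual code):
    
        class LocalDuplicateHandler:
            """Event-driven counterpart without a polling loop (sketch)."""
            def __init__(self, queue):
                self.queue = queue
                self.backlog = {}  # sample identity -> withheld duplicates
    
            def withhold_if_duplicate(self, sample):
                # the first identical sample passes, later ones wait
                if sample.identity in self.backlog:
                    self.backlog[sample.identity].append(sample)
                    return True
                self.backlog[sample.identity] = []
                return False
    
            async def submit_held(self, finished_sample):
                # called by the worker after the analysis result has been
                # saved; resubmitted duplicates then hit the known rule
                for held in self.backlog.pop(finished_sample.identity, []):
                    await self.queue.submit(held)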
    
    The cluster duplicate handler no longer directly interacts with the
    local duplicate handler by putting samples from its backlog into the
    latter's backlog. Instead, cluster duplicates are submitted to the
    local queue in bulk, and the local duplicate handler is expected to
    either never come into play again (because the known rule finds the
    cached previous analysis result) or automatically detect the local
    duplicates and put all but one of them into its own backlog.
    
    This new design highlighted an additional point for optimisation: if
    a sample can be locked by the cluster duplicate handler (i.e. is not
    currently being processed by another instance) but we find siblings
    of it in our own cluster duplicate backlog, then this sample was
    evidently a cluster duplicate at an earlier point in time and the
    withheld samples are waiting for the next polling run to be
    resubmitted. In this case we short-circuit the cluster duplicate
    detection and submit them to the job queue immediately.
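    
    The short-circuit might look roughly like this in the cluster
    handler's submit path (a sketch under the same assumed names as
    above):
    
        async def submit(self, sample):
            if not await self.db.try_lock(sample.identity):
                # another instance is analysing it: hold the sample back
                self.backlog.setdefault(sample.identity, []).append(sample)
                return
            # lock acquired: siblings already in our backlog were cluster
            # duplicates earlier; submit them right away instead of
            # waiting for the next polling run
            held = self.backlog.pop(sample.identity, [])
            for each in [sample] + held:
                await self.queue.submit(each)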
    michaelweiser committed May 23, 2023 (commit 346f9bc)