
Add fastq-sync service #871

Open
alexiswl opened this issue Feb 21, 2025 · 9 comments · May be fixed by #900

@alexiswl
Member

alexiswl commented Feb 21, 2025

This is a simple service that performs the following:

  1. Listens to 'CheckFastqAvailableSync' events generated by workflow glue services.

    • The input contains a library id and a task token.
    • Registers the task token in the database with the library id as the primary key (since multiple tasks may rely on the same library id becoming available).
    • Workflow glue services will not move off their step function state until they are unlocked by this service.
  2. Listens to 'FastqAvailable' events from the (future) fastq-glue services. See Add 'fastq-glue' service #870.

    • When a fastq set becomes available for that library (and has readSet information for all fastq objects in the set), a SendTaskSuccess is sent through for that task token, see point 1 above.
  3. Listens to events from glue services requesting fastq unarchiving and calls the unarchive manager (asynchronous call from glue services; see Add 'fastq-glue' service #870).

  4. Listens to the unarchive manager (Add synchronous s3 copy service into Orcabus #869) and checks any pending FastqAvailableSync task tokens to determine whether the data is now available for their workflow.
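The four steps above can be sketched as a minimal in-memory model of the sync logic (the real service would presumably keep the token table in a database and call `boto3`'s `send_task_success`; the class and method names here are hypothetical illustrations, not the actual implementation):

```python
from collections import defaultdict


class FastqSyncService:
    """In-memory stand-in for the fastq-sync service described above.

    Task tokens are keyed by library id, since several paused step
    function tasks may be waiting on the same library becoming available.
    """

    def __init__(self, send_task_success):
        # send_task_success stands in for sfn_client.send_task_success
        self._send_task_success = send_task_success
        self._tokens = defaultdict(list)  # library_id -> [task_token, ...]

    def on_check_fastq_available_sync(self, library_id, task_token):
        # Step 1: register the task token against the library id.
        self._tokens[library_id].append(task_token)

    def on_fastq_available(self, library_id, has_all_read_sets):
        # Step 2: only release waiters once readSet information exists
        # for every fastq object in the set.
        if not has_all_read_sets:
            return []
        released = self._tokens.pop(library_id, [])
        for token in released:
            self._send_task_success(
                taskToken=token, output='{"status": "available"}'
            )
        return released
```

Note how a single FastqAvailable event releases every paused workflow waiting on that library id, which is why the library id (not the token) is the key.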

This allows the glue services to perform the following:

  1. As soon as the fastq-glue service says that there are new sets available on the instrument run, we can trigger a step function that starts with 'CheckFastqAvailable' events for each library required in the analysis; these will hang until the data for those libraries is available.

  2. Glue services can then re-query the readSet, with the fastq manager now pointing to the restored file URIs, and raise a READY event knowing that the data is readily available.
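The "hang until available" step on the glue side maps onto Step Functions' `.waitForTaskToken` service integration with EventBridge `PutEvents`. A hedged sketch of what that state definition might look like, expressed as a Python dict of Amazon States Language (the event bus name, source, and detail field names are assumptions, not the actual Orcabus configuration):

```python
# Illustrative ASL for the glue step function's wait state: it publishes a
# CheckFastqAvailableSync event carrying its task token, then pauses until
# the fastq-sync service calls SendTaskSuccess with that token.
check_fastq_available_state = {
    "CheckFastqAvailableSync": {
        "Type": "Task",
        "Resource": "arn:aws:states:::events:putEvents.waitForTaskToken",
        "Parameters": {
            "Entries": [
                {
                    "EventBusName": "OrcaBusMain",        # assumed bus name
                    "Source": "orcabus.workflowglue",     # assumed source
                    "DetailType": "CheckFastqAvailableSync",
                    "Detail": {
                        "libraryId.$": "$.libraryId",
                        # $$.Task.Token is the context-object path that
                        # injects this task's token into the event payload.
                        "taskToken.$": "$$.Task.Token",
                    },
                }
            ]
        },
        "Next": "GenerateReadyEvent",
    }
}
```

The `.waitForTaskToken` suffix is what makes the state pause: the execution stays in this state until something calls `SendTaskSuccess` (or `SendTaskFailure`) with the token it emitted.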

@reisingerf
Member

The task token here is to synchronise the "task success" event for a set of libraries?

@alexiswl
Member Author

> The task token here is to synchronise the "task success" event for a set of libraries?

No, the workflow glue service step function task would be waiting for one library.

[Image: step function diagram]

@reisingerf
Member

Ah, the task token is to be able to resume (possibly multiple) paused step functions (as those are waiting for a particular token to be passed to them)?

I can't say I like that concept, but I can understand where it's coming from in that case.

Would it be possible to simply fail the "check fastqs" part and repeat it each time a new "fastq available" event comes through?
Granted it would mean potentially quite a few unnecessary checks / executions, but they should be quick and you'd remove the direct coupling via task tokens (which also means any fastq could be "made available" without knowing any task tokens).

Note: conceptual thinking for the future, does not have to influence initial implementations!

@alexiswl
Member Author

alexiswl commented Feb 23, 2025

> Ah, the task token is to be able to resume (possibly multiple) paused step functions (as those are waiting for a particular token to be passed to them)?

It would only be able to pause one single step function, as the task token is bound to the task itself. It cannot be generated prior to the task (in this case a put event).

> Would it be possible to simply fail the "check fastqs" part and repeat it each time a new "fastq available" event comes through?
> Granted it would mean potentially quite a few unnecessary checks / executions, but they should be quick and you'd remove the direct coupling via task tokens (which also means any fastq could be "made available" without knowing any task tokens).

The benefit here is that you already have the T/N pairing coupling ready to go. If I'm pulling a normal out of archive to run against a new tumour sample for a patient with multiple tumours, how do we know what to pair when we get the FastqAvailable event for the thawed normal? With task token syncing, the pairing happens at the samplesheet initialisation stage, where we have pairing knowledge for a given sequencing run.

Fastqs will still be made available by the FastqAvailable event, and can (and will) be made available independently of any task tokens. This service is essentially a wrapper around that.

@reisingerf
Member

Understood.
And that's fine.

I am not sure I fully grasp the details here, so I was just wondering if another, less coupled path would also be feasible, but I recognise that this may only be possible with additional setup/services in the future.

E.g. the pairing would happen on a predefined trigger: a new sequencing run, the arrival of new data, etc. That's independent of the availability of any data though. So if I see this correctly, your current path would start the related workflows, which in turn would check the FASTQ availability. Depending on that, an execution might end up paused, waiting until the required file(s) become available. That "waiting" is realised via task tokens sent along with the availability check, and corresponding "release" events for each execution/token whenever a fastq is restored/becomes available.
Sorry, I may be off, but that's what I understood at a high level.

My idea is very similar, but without the direct coupling or task tokens: I'd simply "fail" (or exit) the initial workflow execution if the required files are not available. I'd record those as "pending" or "waiting for requirement" and on each new file available event I'd run them again to see if the requirements are now met with the new data.
Granted that's very crude and could/should be refined. It also assumes that an execution of the workflow (at least up until the "FASTQ check") is idempotent and quick/cheap to run.

It would however align with a future vision where the "READY-ness" of a workflow is evaluated outside and independently of any workflow execution. Such a "READY-ness check" would have to be quick to run and might have to consider a number of factors changing. Any workflow that was triggered before it was ready, would simply "fail" (with appropriate response).
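The decoupled "fail and re-check" alternative being proposed here could be sketched like this (an illustrative registry, not an existing Orcabus component; all names are hypothetical):

```python
class ReadinessRegistry:
    """Sketch of the 'fail and re-check' pattern: workflows whose
    requirements are unmet are recorded as pending, and every new
    fastq-available event re-runs the (quick, idempotent) readiness
    check for all of them. No task tokens are held anywhere."""

    def __init__(self):
        self._pending = {}      # workflow_id -> set of required library ids
        self._available = set()  # library ids whose fastqs are available

    def submit(self, workflow_id, required_libraries):
        # Try immediately; if requirements are unmet, "fail" the run
        # and record it as waiting-for-requirement.
        if set(required_libraries) <= self._available:
            return [workflow_id]
        self._pending[workflow_id] = set(required_libraries)
        return []

    def on_fastq_available(self, library_id):
        # A new file arrived: re-evaluate every pending workflow and
        # return the ones that are now ready to launch.
        self._available.add(library_id)
        ready = [wf for wf, req in self._pending.items()
                 if req <= self._available]
        for wf in ready:
            del self._pending[wf]
        return ready
```

The trade-off is exactly as described above: each availability event costs one re-check per pending workflow, but any fastq can be made available without anyone knowing a task token.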

Again, this question / idea is more for the future and not meant to replace/change any current setups. For now I am just interested whether it would make conceptual sense at all...

@alexiswl
Member Author

> So if I see this correctly, your current path would start the related workflows

No, the step function above is in the 'glue' bit; only after the fastqs are available would the READY command be generated.

@reisingerf
Member

> the ready command would be generated

And that would then start the workflow?

@alexiswl
Member Author

Yep, see the diagram above: the READY event isn't triggered until the fastqs become available through the sync service.

@reisingerf
Member

OK, so it's the same scenario, but one level up?
Essentially your sync services optimise the "READY-ness" evaluation, introducing a coupling between the sync partners (for now, the glue bit and the fastq sync/availability service). Again, that's perfectly OK. I just have the feeling that this might run into issues if/when the "READY-ness" evaluation starts to get more complex (but that's a future issue).
