Improve the skip pre-analysis parameter logic #337

Open
ypriverol opened this issue Jan 10, 2024 · 2 comments
Labels
dia analysis, enhancement (New feature or request)

Comments

@ypriverol
Member

Description of feature

In PR #335, @jspaezp introduced the possibility to perform the DIA analysis without the pre-analysis step. It would be great to use the SDRF information to sub-select a few RAW files for the pre-analysis, which should yield better final results. Some of the SDRF columns that could be used to drive this selection (see the sketch after this list) are:

  • factor value: we can randomly select files for each factor value.
  • technical and biological replicates: we should not pre-analyse biological and technical replicates.
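
A minimal sketch of how such a selection could look in Nextflow, assuming the SDRF is read with splitCsv; the params.sdrf path and the column names (factor value[phenotype], comment[data file]) are hypothetical and only meant to illustrate the idea:

preanalysis_files = Channel
    .fromPath( params.sdrf )
    .splitCsv( header: true, sep: '\t' )
    // keep only the factor value and the RAW file name for each sample
    .map { row -> tuple( row['factor value[phenotype]'], row['comment[data file]'] ) }
    // group the RAW files that share the same factor value
    .groupTuple()
    // pick one file at random per factor value for the pre-analysis
    .map { factor, files -> files[ new Random().nextInt( files.size() ) ] }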
@ypriverol added the enhancement label Jan 10, 2024
@jspaezp

jspaezp commented Jan 12, 2024

In internal discussion we talked about an automatic random subsetting of files to generate an empirical library.

We favored subsetting to a 'maximum number of files' rather than to a 'percentage subset', since it offers:

  1. More re-usability of configurations (the cap would simply not apply on small runs).
  2. Better control over computational resources.

Suggested implementation:
Config option

params {
   ...
   empirical_assembly_sample_n = 200
}

(not real code ... just meant to show where in the workflow it would happen)

if ( all_files.size() > params['empirical_assembly_sample_n'] ) {
    // more files than the cap: randomly sample the subset used for the empirical assembly
    empirical_assembly_files = all_files
        .randomSample( params['empirical_assembly_sample_n'] )
} else {
    // small run: use every file
    empirical_assembly_files = all_files
}

first_search_results = FIRST_SEARCH(empirical_assembly_files)
empirical_assembly = EMPIRICAL_ASSEMBLY_FILES(first_search_results)

final_individual_results = FINAL_INDIV_SEARCH(all_files.mix(empirical_assembly))

@ypriverol
Member Author

As @jspaezp mentioned, we should refine the current proposal for skipping the assembly library and pre-analysis using the following logic:

  • One variable, random_preliminary_analysis, will control whether a random selection of files is used for the pre-analysis.
  • Another variable, empirical_assembly_sample_n, should control the number of files randomly selected for the pre-analysis.

This will let us run the pre-analysis on only a subset of files, depending on the size of the cluster and the available resources; when random_preliminary_analysis is set to false, all files will always be used. For really large datasets, users can define the number of files they want to use (see the sketch below).
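
A minimal sketch of how the two parameters could fit together, purely illustrative: the parameter names are the ones proposed above, all_files stands for the channel of RAW files from the earlier pseudocode, and preanalysis_files is a hypothetical name.

params {
    // proposed: if false, the pre-analysis always uses all files
    random_preliminary_analysis = true
    // proposed: maximum number of files randomly selected for the pre-analysis
    empirical_assembly_sample_n = 200
}

if ( params.random_preliminary_analysis ) {
    // randomly sample up to the configured number of files
    preanalysis_files = all_files
        .randomSample( params.empirical_assembly_sample_n )
} else {
    // use every file for the pre-analysis
    preanalysis_files = all_files
}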

My main question is how the previous PR #335 by @jspaezp overlaps with this idea. In that PR, does the user need to configure job1 and job2 manually?

@ypriverol linked a pull request Jan 13, 2024 that will close this issue
@ypriverol removed a link to a pull request Jan 13, 2024