
Added custom/restoregffids #5002

Closed
wants to merge 2 commits

Conversation

GallVp
Member

@GallVp GallVp commented Feb 26, 2024

PR checklist

Closes #XXX

  • Added custom/restoregffids for Added a draft for fasta_ltrretriever_lai #4984
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@GallVp GallVp requested a review from a team as a code owner February 26, 2024 21:05
@GallVp GallVp requested review from Aratz, mahesh-panchal and a team and removed request for a team February 26, 2024 21:05
@mahesh-panchal
Member

Have you checked if there are other ways to achieve this? Generally custom modules should be applicable to a wider audience to help reduce maintenance burden and increase findability. Can this be done, for example, with some existing tool and a combination of Groovy/Channel manipulation?

@GallVp
Member Author

GallVp commented Feb 27, 2024

Hi @mahesh-panchal

That's a good point. I have searched for this on Google and all the results led me to custom scripts.

Yes, we can replace it with Groovy code.

@mahesh-panchal
Member

Just to be clear, I mean replace the whole module in the subworkflow, so we don't have to add a custom script to modules.

@GallVp
Member Author

GallVp commented Feb 29, 2024

It is possible, but the way I am doing it might be flimsy. Here is, for example, another part of the subworkflow that I have implemented in Groovy. It resumes correctly. If you are happy with it, I'll also port custom/restoregffids to pure Groovy.

Specifically, I am not happy with the way I am extracting meta.id out of the file name.

    ch_short_monoploid_seqs         = ch_short_ids_tsv
                                    | join(
                                        ch_monoploid_seqs ?: Channel.empty()
                                    )
                                    | map { meta, short_ids_tsv, monoploid_seqs ->
                                        map_monoploid_seqs_to_new_ids(meta, short_ids_tsv, monoploid_seqs)
                                    }
                                    | collectFile(newLine:true)
                                    | map { seqs ->
                                        def id = seqs.name.split('.mapped.monoploid.seqs.txt')[0]

                                        [ [ id: id ], seqs ]
                                    }



def map_monoploid_seqs_to_new_ids(meta, short_ids_tsv, monoploid_seqs) {

    def short_ids_head = short_ids_tsv.text.split('\n')[0]

    if (short_ids_head == "IDs have acceptable length and character. No change required.") {
        return [ "${meta.id}.mapped.monoploid.seqs.txt" ] + monoploid_seqs.text.split('\n')
    }

    def orig_to_new_ids = [:]
    short_ids_tsv.text.eachLine { line ->
        def (original_id, renamed_id) = line.split('\t')
        orig_to_new_ids[original_id] = renamed_id
    }

    def mapped_ids = []
    monoploid_seqs.text.eachLine { original_id ->
        if (!orig_to_new_ids[original_id]) {
            error "Failed to find $original_id in ${monoploid_seqs}. " +
                "The monoploid_seqs file is malformed!"
        }

        mapped_ids.add(orig_to_new_ids[original_id])
    }

    return [ "${meta.id}.mapped.monoploid.seqs.txt" ] + mapped_ids
}

@mahesh-panchal
Member

collectFile can take a closure that lets you name the file how you want. You can use that to your advantage for the split, by also passing the meta along with the list. See nf-core/website#2242 for inspiration.

Also Nextflow has some file operators that have the same args as the equivalent channel operators. This means you can call

short_ids_tsv.splitCsv(skip:1, header:['original_id','renamed_id'], sep: '\t')

for example.
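
A minimal sketch of both suggestions (the channel shape and file names here are hypothetical, not taken from this PR): the closure passed to collectFile names each file from meta.id, and splitCsv is invoked directly on the file object rather than on a channel.

```nextflow
// Hypothetical channel of [ meta, list_of_ids ] tuples
ch_mapped_ids
    | collectFile(newLine: true) { item ->
        def (meta, ids) = item
        // Name the file from the meta so the id is recoverable downstream
        [ "${meta.id}.mapped.monoploid.seqs.txt", ids.join('\n') ]
    }

// File operator form of splitCsv: parse a TSV straight from a Path object
def id_map = short_ids_tsv
    .splitCsv(header: ['original_id', 'renamed_id'], sep: '\t')
    .collectEntries { row -> [ row.original_id, row.renamed_id ] }
```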

@GallVp
Member Author

GallVp commented Mar 1, 2024

Thank you @mahesh-panchal

How is nf-core/website#2242 different from what I have done above? In both cases, meta.id is first injected as the first item in the data list, collectFile uses it as the file name, and then meta.id is extracted back out of the file name. It works but lacks "grooviness".

Ideally, collectFile should be changed such that the following data flow is possible:

Channel.of( [ [ id:'test', etc:'etc' ], [ 'a', 'b', 'c', 'd' ] ] )
| collectFile { meta, data ->  [ "${meta.id}.txt", data.join("\n"), meta] } // Closure returns n items: file name, data, any other items which are passed through as is
| map { file, meta -> [ meta, file ] }

@mahesh-panchal
Member

Apologies, I didn't notice the function was doing that. I'm used to seeing the renaming in a closure passed to collectFile, so one can quickly see that the file is being named from the meta. It was too hidden for me in the function.

Make an issue on Nextflow for the change in behaviour of collectFile as it's a common enough use case. With records being supported soon, I wonder though if there's already a solution.

@GallVp GallVp marked this pull request as draft March 7, 2024 23:57
@GallVp GallVp added the WIP Work in progress label Mar 7, 2024
@GallVp
Member Author

GallVp commented Mar 7, 2024

Thank you @mahesh-panchal

I have raised an issue on the Nextflow repo. In the meantime, I'll use your suggestions to convert this module into a Groovy function.

@GallVp
Member Author

GallVp commented Mar 20, 2024

Hi @mahesh-panchal

I am looking at this module again after a break on some other work. After converting the Python code to Groovy, I realised that I can no longer publish the *.restored.ids.gff3 file (emitted as tuple val(meta), path("*.restored.ids.gff3"), emit: restored_ids_gff3) through publishDir. Subworkflow users must be able to control the output files through the familiar publishDir directive. Any suggestions? Thank you!

@mahesh-panchal
Member

I've asked a question on #subworkflows.
Native execution aside, one solution is to use a process in the same file as the subworkflow, but I'm not sure if it's permitted. This means the process gets maintained with the subworkflow.

Another solution is to add a params to the workflow for where the output gets published, since one can set storeDir on collectFile. This can then be set using addParams when including the workflow. I'm not fond of this idea though, as it adds yet another place to check for settings.
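
A rough sketch of the storeDir idea (the param name, channel shape, and file names here are invented for illustration, not taken from this PR):

```nextflow
// Hypothetical: publish the collected file directly from collectFile,
// with the destination directory supplied as a workflow param
ch_mapped_ids
    | collectFile(newLine: true, storeDir: params.restored_ids_outdir) { item ->
        def (meta, ids) = item
        [ "${meta.id}.mapped.monoploid.seqs.txt", ids.join('\n') ]
    }
```

The param would then be supplied at include time via addParams, e.g. addParams(restored_ids_outdir: 'results/lai') on the include statement for the subworkflow.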

@GallVp
Member Author

GallVp commented May 9, 2024

Hi @mahesh-panchal

Another way to resolve this issue is to use Groovy functions for this module and the other custom module (#5001), emit the output files, and let the pipeline users decide how and where to publish these files.

@mahesh-panchal
Member

Sorry, I need to be reminded what we were discussing. What were the issues again?

  • Avoid custom processes?
  • Subworkflow needs to publish certain files?

Can you remind me what the functions did then? There's not much difference, I guess, between a Nextflow function and a Nextflow exec process, aside from caching.

@GallVp
Member Author

GallVp commented May 13, 2024

Yes, we want to avoid custom modules which may not have wider application across subworkflows. We were discussing whether we can replace the custom modules with processes or functions local to the subworkflow. You had also created a post on Slack.

Functions run on the head node where the Nextflow parent job is executed. Processes may run on the head node or on other nodes, depending on resource requirements.
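
To illustrate the distinction, a hypothetical sketch of both options living in the subworkflow file (names and data shapes invented for illustration, not taken from this PR):

```nextflow
// Option 1: a plain Groovy function. Always runs where the Nextflow
// driver runs (the head node) and is not cached across resumes.
def restoreIds(List<String> ids, Map<String, String> idMap) {
    ids.collect { idMap[it] ?: it }
}

// Option 2: a native process. The exec: body is Groovy rather than a
// shell script, and the task participates in Nextflow's caching/resume.
process RESTORE_IDS {
    input:
    tuple val(meta), val(ids), val(idMap)

    output:
    tuple val(meta), val(restored)

    exec:
    restored = ids.collect { idMap[it] ?: it }
}
```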

@GallVp
Member Author

GallVp commented Jun 5, 2024

Replaced with Groovy code, see #4984

@GallVp GallVp closed this Jun 5, 2024
@GallVp GallVp deleted the restoregffids branch June 7, 2024 09:36