
Added custom/restoregffids #5002

Closed
wants to merge 2 commits

Conversation

GallVp
Member

@GallVp GallVp commented Feb 26, 2024

PR checklist

Closes #XXX

  • Added custom/restoregffids for Added a draft for fasta_ltrretriever_lai #4984
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@GallVp GallVp requested a review from a team as a code owner February 26, 2024 21:05
@GallVp GallVp requested review from Aratz, mahesh-panchal and a team and removed request for a team February 26, 2024 21:05
@mahesh-panchal
Member

Have you checked if there are other ways to achieve this? Generally custom modules should be applicable to a wider audience to help reduce maintenance burden and increase findability. Can this be done, for example, with some existing tool and a combination of Groovy/Channel manipulation?

@GallVp
Member Author

GallVp commented Feb 27, 2024

Hi @mahesh-panchal

That's a good point. I have searched for this on Google and all the results led me to custom scripts.

Yes, we can replace it with Groovy code.

@mahesh-panchal
Member

Just to be clear, I mean replace the whole module in the subworkflow, so we don't have to add a custom script to modules.

@GallVp
Member Author

GallVp commented Feb 29, 2024

It is possible, but the way I am doing it might be flimsy. Here is, for example, another part of the subworkflow that I have implemented in Groovy. It resumes correctly. If you are happy with it, I'll also port custom/restoregffids to pure Groovy.

Specifically, I am not happy with the way I am extracting meta.id out of the file name.

    ch_short_monoploid_seqs         = ch_short_ids_tsv
                                    | join(
                                        ch_monoploid_seqs ?: Channel.empty()
                                    )
                                    | map { meta, short_ids_tsv, monoploid_seqs ->
                                        map_monoploid_seqs_to_new_ids(meta, short_ids_tsv, monoploid_seqs)
                                    }
                                    | collectFile(newLine:true)
                                    | map { seqs ->
                                        def id = seqs.name.split('.mapped.monoploid.seqs.txt')[0]

                                        [ [ id: id ], seqs ]
                                    }



def map_monoploid_seqs_to_new_ids(meta, short_ids_tsv, monoploid_seqs) {

    def short_ids_head = short_ids_tsv.text.split('\n')[0]

    if (short_ids_head == "IDs have acceptable length and character. No change required.") {
        return [ "${meta.id}.mapped.monoploid.seqs.txt" ] + monoploid_seqs.text.split('\n')
    }

    def orig_to_new_ids = [:]
    short_ids_tsv.text.eachLine { line ->
        def (original_id, renamed_id) = line.split('\t')
        orig_to_new_ids[original_id] = renamed_id
    }

    def mapped_ids = []
    monoploid_seqs.text.eachLine { original_id ->
        if (!orig_to_new_ids[original_id]) {
            error "Failed to find $original_id in ${monoploid_seqs}. " +
                "The monoploid_seqs file is malformed!"
        }

        mapped_ids.add(orig_to_new_ids[original_id])
    }

    return [ "${meta.id}.mapped.monoploid.seqs.txt" ] + mapped_ids
}

@mahesh-panchal
Member

collectFile can take a closure that lets you name the file how you want. You can use that to your advantage for the split, by also passing the meta along with the list. See nf-core/website#2242 for inspiration.

Also Nextflow has some file operators that have the same args as the equivalent channel operators. This means you can call

short_ids_tsv.splitCsv(skip:1, header:['original_id','renamed_id'], sep: '\t')

for example.
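
A minimal sketch of both suggestions (the channel shape and file names here are hypothetical, not taken from this PR): the closure passed to collectFile names each file from meta.id, and splitCsv is invoked directly on the file object rather than on a channel.

```nextflow
// Hypothetical channel of [ meta, list_of_ids ] tuples
ch_mapped_ids
    | collectFile(newLine: true) { item ->
        def (meta, ids) = item
        // Name the file from the meta so the id is recoverable downstream
        [ "${meta.id}.mapped.monoploid.seqs.txt", ids.join('\n') ]
    }

// File operator form of splitCsv: parse a TSV straight from a Path object
def id_map = short_ids_tsv
    .splitCsv(header: ['original_id', 'renamed_id'], sep: '\t')
    .collectEntries { row -> [ row.original_id, row.renamed_id ] }
```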

@GallVp
Member Author

GallVp commented Mar 1, 2024

Thank you @mahesh-panchal

How is nf-core/website#2242 different from what I have done above? In both cases, meta.id is first injected as the first item in the data list, collectFile uses it as the file name, and then meta.id is extracted back out of the file name. It works but lacks "grooviness".

Ideally, collectFile should be changed such that the following data flow is possible:

Channel.of( [ [ id:'test', etc:'etc' ], [ 'a', 'b', 'c', 'd' ] ] )
| collectFile { meta, data ->  [ "${meta.id}.txt", data.join("\n"), meta] } // Closure returns n items: file name, data, any other items which are passed through as is
| map { file, meta -> [ meta, file ] }

@mahesh-panchal
Member

Apologies, I didn't notice the function was doing that. I'm used to seeing the renaming in a closure passed to collectFile, so one can quickly see that the file is being named from the meta. It was too hidden for me in the function.

Make an issue on Nextflow for the change in behaviour of collectFile as it's a common enough use case. With records being supported soon, I wonder though if there's already a solution.

@GallVp GallVp marked this pull request as draft March 7, 2024 23:57
@GallVp GallVp added the WIP Work in progress label Mar 7, 2024
@GallVp
Member Author

GallVp commented Mar 7, 2024

Thank you @mahesh-panchal

I have raised an issue on the Nextflow repo. In the meantime, I'll use your suggestions to convert this module into a Groovy function.

@GallVp
Member Author

GallVp commented Mar 20, 2024

Hi @mahesh-panchal

I am looking at this module again after a break on some other work. After converting the Python code to Groovy, I realised that I can no longer publish the *.restored.ids.gff3 file (emitted as tuple val(meta), path("*.restored.ids.gff3"), emit: restored_ids_gff3) through publishDir. Subworkflow users must be able to control the output files through the familiar publishDir directive. Any suggestions? Thank you!

@mahesh-panchal
Member

I've asked a question on #subworkflows.
Native execution aside, one solution is to use a process in the same file as the subworkflow, but I'm not sure if it's permitted. This means the process gets maintained with the subworkflow.

Another solution is to add a params to the workflow for where the output gets published, since one can set storeDir on collectFile. This can then be set using addParams when including the workflow. I'm not fond of this idea though, as it adds yet another place to check for settings.
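
A rough sketch of the storeDir idea (the param name, channel shape, and file names here are invented for illustration, not taken from this PR):

```nextflow
// Hypothetical: publish the collected file directly from collectFile,
// with the destination directory supplied as a workflow param
ch_mapped_ids
    | collectFile(newLine: true, storeDir: params.restored_ids_outdir) { item ->
        def (meta, ids) = item
        [ "${meta.id}.mapped.monoploid.seqs.txt", ids.join('\n') ]
    }
```

The param would then be supplied at include time via addParams, e.g. addParams(restored_ids_outdir: 'results/lai') on the include statement for the subworkflow.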

@GallVp
Member Author

GallVp commented May 9, 2024

Hi @mahesh-panchal

Another way to resolve this issue is to use Groovy functions for this module and the other custom module (#5001), emit the output files, and let the pipeline users decide how and where to publish these files.

@mahesh-panchal
Member

Sorry, I need to be reminded what we were discussing. What were the issues again?

  • Avoid custom processes?
  • Subworkflow needs to publish certain files?

Can you remind me what the functions did then? There's not much difference, I guess, between a Nextflow function and a Nextflow exec process, aside from caching.

@GallVp
Member Author

GallVp commented May 13, 2024

Yes, we want to avoid custom modules which may not have wider application across subworkflows. We were discussing whether we can replace the custom modules with processes or functions local to the subworkflow. You had also created a post on Slack.

Functions run on the head node where the Nextflow parent job is executed. Processes may run on the head node or on other nodes, depending on resource requirements.
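
To illustrate the distinction, a hypothetical sketch of both options living in the subworkflow file (names and data shapes invented for illustration, not taken from this PR):

```nextflow
// Option 1: a plain Groovy function. Always runs where the Nextflow
// driver runs (the head node) and is not cached across resumes.
def restoreIds(List<String> ids, Map<String, String> idMap) {
    ids.collect { idMap[it] ?: it }
}

// Option 2: a native process. The exec: body is Groovy rather than a
// shell script, and the task participates in Nextflow's caching/resume.
process RESTORE_IDS {
    input:
    tuple val(meta), val(ids), val(idMap)

    output:
    tuple val(meta), val(restored)

    exec:
    restored = ids.collect { idMap[it] ?: it }
}
```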

@GallVp
Member Author

GallVp commented Jun 5, 2024

Replaced with Groovy code, see #4984

@GallVp GallVp closed this Jun 5, 2024
@GallVp GallVp deleted the restoregffids branch June 7, 2024 09:36