Error executing process > 'format_EX_clusters_input' #35

evaesquinas · 2023-04-13T13:50:16Z

Hello,

I have been trying to use ExOrthist and although I have tried several things, talking to @fedemantica included, I am still unable to make it run.

Although the title (and the error) of this issue is almost the same as the one I opened a month ago, my current problem is different. In that previous issue, I had problems with the test data because I did not remove the brackets in the command. That was solved, and I managed to run the test data without any problems.

However, now that I am trying to use the whole genome and annotations (not subsetted), I am getting the same error (and this time I am writing the command properly).

executor >  local (14)
[d8/0c4443] process > check_input (1)                [100%] 1 of 1 ✔
[cf/4db79c] process > generate_annotations (hg38)    [100%] 2 of 2 ✔
[2c/2c53dd] process > split_clusters_by_species_p... [100%] 1 of 1 ✔
[bc/450e28] process > split_clusters_in_chunks (h... [100%] 1 of 1 ✔
[59/860fab] process > parse_IPA_prot_aln (hg38-mm... [100%] 1 of 1 ✔
[02/61c530] process > split_EX_pairs_to_realign (1)  [100%] 1 of 1 ✔
[83/3e953e] process > realign_EX_pairs (1)           [100%] 1 of 1 ✔
[24/c2b623] process > merge_PROT_EX_INT_aln_info ... [100%] 1 of 1 ✔
[2d/a65731] process > score_EX_matches (hg38-mm10)   [100%] 1 of 1 ✔
[86/669d1b] process > filter_and_select_best_EX_m... [100%] 1 of 1 ✔
[11/faabcb] process > join_filtered_EX_matches       [100%] 1 of 1 ✔
[87/831d4a] process > collapse_overlapping_matches   [100%] 1 of 1 ✔
[8d/174a47] process > format_EX_clusters_input       [  0%] 0 of 1
[-        ] process > cluster_EXs                    -
[-        ] process > format_EX_clusters_output      -
[-        ] process > recluster_genes_by_species_... -
[-        ] process > recluster_EXs_by_species_pair  -
Error executing process > 'format_EX_clusters_input'

Caused by:
  Missing output file(s) `PART_*-cluster_input.tab` expected by process `format_EX_clusters_input`

Command executed:

  if [ `echo mm10_hg38_v100_fromBroccoli.tab | grep ".gz"` ]; then
      zcat mm10_hg38_v100_fromBroccoli.tab > cluster_file
      D1_format_EX_clusters_input.pl cluster_file filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
      rm cluster_file
   else
      D1_format_EX_clusters_input.pl mm10_hg38_v100_fromBroccoli.tab filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
   fi

Command exit status:
  0

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  Number of parts:	0
  INFO:    Cleaning up image...

Work dir:
  /mnt/lustre/scratch/nlsas/home/usc/gr/eer/Tools/ExOrthist/work/8d/174a47b714b384f3d6036eb6b93cd6

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`


Failed to invoke `workflow.onComplete` event handler

 -- Check script 'main.nf' at line: 676 or see '.nextflow.log' file for more details
--- Pipeline BIOCORE@CRG ExOrthist ---
Started at  2023-04-13T14:26:58.654647070+02:00
Finished at 2023-04-13T14:30:42.002558998+02:00
Time elapsed: 3m 43s
Execution status: failed

executor >  local (14)
[d8/0c4443] process > check_input (1)                [100%] 1 of 1 ✔
[cf/4db79c] process > generate_annotations (hg38)    [100%] 2 of 2 ✔
[2c/2c53dd] process > split_clusters_by_species_p... [100%] 1 of 1 ✔
[bc/450e28] process > split_clusters_in_chunks (h... [100%] 1 of 1 ✔
[59/860fab] process > parse_IPA_prot_aln (hg38-mm... [100%] 1 of 1 ✔
[02/61c530] process > split_EX_pairs_to_realign (1)  [100%] 1 of 1 ✔
[83/3e953e] process > realign_EX_pairs (1)           [100%] 1 of 1 ✔
[24/c2b623] process > merge_PROT_EX_INT_aln_info ... [100%] 1 of 1 ✔
[2d/a65731] process > score_EX_matches (hg38-mm10)   [100%] 1 of 1 ✔
[86/669d1b] process > filter_and_select_best_EX_m... [100%] 1 of 1 ✔
[11/faabcb] process > join_filtered_EX_matches       [100%] 1 of 1 ✔
[87/831d4a] process > collapse_overlapping_matches   [100%] 1 of 1 ✔
[8d/174a47] process > format_EX_clusters_input       [100%] 1 of 1, failed: 1 ✘
[-        ] process > cluster_EXs                    -
[-        ] process > format_EX_clusters_output      -
[-        ] process > recluster_genes_by_species_... -
[-        ] process > recluster_EXs_by_species_pair  -
Error executing process > 'format_EX_clusters_input'

Caused by:
  Missing output file(s) `PART_*-cluster_input.tab` expected by process `format_EX_clusters_input`

Command executed:

  if [ `echo mm10_hg38_v100_fromBroccoli.tab | grep ".gz"` ]; then
      zcat mm10_hg38_v100_fromBroccoli.tab > cluster_file
      D1_format_EX_clusters_input.pl cluster_file filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
      rm cluster_file
   else
      D1_format_EX_clusters_input.pl mm10_hg38_v100_fromBroccoli.tab filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
   fi

Command exit status:
  0

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  Number of parts:	0
  INFO:    Cleaning up image...

Work dir:
  /mnt/lustre/scratch/nlsas/home/usc/gr/eer/Tools/ExOrthist/work/8d/174a47b714b384f3d6036eb6b93cd6

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

Find the hidden log attached (.nextflow.log).

hidden.nextflow.log

This is the script that I am using:

#!/usr/bin/bash
#SBATCH --job-name=ExOrthist
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH -o slurm_outputs/ExOrthist.o
#SBATCH -e slurm_outputs/ExOrthist.e

cd /home/usc/gr/eer/LUSTRE/Tools/ExOrthist
path='/mnt/lustre/scratch/nlsas/home/usc/gr/eer/Analyses/CESGA/Splicing/Results/MALU29_dataset/ExOrthist/'

module load nextflow
module load singularity
module load cesga/2020 gcccore/system mafft/7.475-with-extensions

nextflow run main.nf -with-singularity > $path'log.txt'

#move the hidden .log
mv .nextflow.log $path

Note that I am working on a cluster (Slurm environment). When I load Nextflow, the following appears:

module load nextflow
jdk/17.0.2 loaded
squashfs/4.3 loaded
singularity/3.6.3 loaded
nextflow/22.10.4 loaded

And when I load Singularity (after Nextflow):

module load singularity
singularity/3.6.3 unloaded
go/1.17.8 loaded
singularity/3.9.7 loaded

The following have been reloaded with a version change:
  1) singularity/3.6.3 => singularity/3.9.7

You may wonder why I load both. At the beginning I was loading only nextflow, but just in case, I tried loading singularity afterwards, and I am now using both.

After talking with @fedemantica, I also tried to load mafft ( module unload cesga/2020 gcccore/system mafft/7.475-with-extensions) in case that was the problem, but it seems that it is not.

This is how the "params.config" file looks:

Note that I also tried to use NXF_VER=20.04.1 as I did for the testing data (that works):
NXF_VER=20.04.1 nextflow run main.nf -with-singularity > test_log.txt

but it gives me the same error.

Also note that:
a) I am using Ensembl release 100 for both the genome and the GTF, from here:
GENOME:

MOUSE: https://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz
HUMAN: https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
(both were renamed to hg38_gDNA.fasta.gz & mm10_gDNA.fasta.gz)

Just in case you want to know the format of each one (to compare with the GTF file):

zgrep '^>' mm10_gDNA.fasta.gz
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF

zgrep '^>' hg38_gDNA.fasta.gz
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF

GTF:

HUMAN : https://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
MOUSE : https://ftp.ensembl.org/pub/release-100/gtf/mus_musculus/Mus_musculus.GRCm38.100.gtf.gz
(both were renamed to hg38_annot.gtf.gz & mm10_annot.gtf.gz)

Format:

 zcat mm10_annot.gtf.gz | head
#!genome-build GRCm38.p6
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.8
#!genebuild-last-updated 2020-02
1       havana  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1";
1       havana  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; tag "basic"; transcript_support_level "NA";
1       havana  exon    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1       ensembl gene    3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1       ensembl transcript      3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; tag "basic"; transcript_support_level "NA";

zcat hg38_annot.gtf.gz | head
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.28
#!genebuild-last-updated 2019-06
1       havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1       havana  exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1       havana  exon    12613   12721   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1       havana  exon    13221   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

b) The gene orthogroups file was generated with Broccoli in the following way:

Run broccoli with the proteome FASTA (Ensembl version 100):
HUMAN: https://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
MOUSE: https://ftp.ensembl.org/pub/release-100/fasta/mus_musculus/pep/Mus_musculus.GRCm38.pep.all.fa.gz
I downloaded the file orthologous_groups.txt from dir3.
I created a txt file with some information from the proteome FASTA files (PEP, GeneID, Species and Biotype) like this:

awk -F '[: ]' '/^>/ {print $1,$4,$10,$14}' Homo_sapiens.GRCh38.pep.all.fa > proteome_info_reduced_hg38.txt
awk -F '[: ]' '/^>/ {print $1,$4,$10,$14}' Mus_musculus.GRCm38.pep.all.fa > proteome_info_reduced_mm10.txt

With those two files I went to R, loaded them and did the following: removed orthogroups containing more than 20 genes, merged my orthogroups with the info file from the FASTA, kept only the genes that are protein_coding, removed duplicates (because there can be several PEP entries for the same gene), and did some formatting to remove the version from the gene ID. I then generated the file mm10_hg38_v100_fromBroccoli.tab with the following format:

OG_1	mm10	ENSMUSG00000038324
OG_1	hg38	ENSG00000100991
OG_10	mm10	ENSMUSG00000114004
OG_10	hg38	ENSG00000284638
OG_100	mm10	ENSMUSG00000025193
OG_100	hg38	ENSG00000119929
OG_10000	mm10	ENSMUSG00000015053
OG_10000	hg38	ENSG00000179348
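
Since the FASTA headers, GTF heads and cluster file are listed above "to compare", here is a minimal shell sketch of that comparison. It is not part of ExOrthist: the function names `gtf_only_seqnames` and `missing_cluster_genes` are hypothetical, and the second check assumes the pipeline matches the third column of the cluster file against the `gene_id` attribute of the GTF. Both functions decompress the full input, so they can take a while on whole genomes.

```shell
# Sketch only: function names are hypothetical, not part of ExOrthist.

# Print sequence names used in a gzipped GTF that have no record in a
# gzipped FASTA. Usage: gtf_only_seqnames genome.fasta.gz annot.gtf.gz
gtf_only_seqnames() {
    {
        # FASTA: first token of each header line, without the leading '>'.
        gzip -dc "$1" | awk '/^>/ {sub(/^>/, "", $1); print "F\t" $1}'
        # GTF: column 1 of every non-comment line with all 9 columns.
        gzip -dc "$2" | awk -F '\t' '!/^#/ && NF >= 9 {print "G\t" $1}'
    } | awk -F '\t' '$1 == "F" {seen[$2] = 1; next}
                     !($2 in seen) && !printed[$2]++ {print $2}'
}

# Print gene IDs that the cluster file lists for one species but that never
# appear as gene_id in that species' gzipped GTF.
# Usage: missing_cluster_genes clusters.tab species annot.gtf.gz
missing_cluster_genes() {
    {
        # GTF: extract the value of the gene_id attribute from column 9.
        gzip -dc "$3" | awk -F '\t' '!/^#/ && match($9, /gene_id "[^"]+"/) {
            print "G\t" substr($9, RSTART + 9, RLENGTH - 10)}'
        # Cluster file: column 2 is the species, column 3 the gene ID.
        awk -F '\t' -v sp="$2" '$2 == sp {print "C\t" $3}' "$1"
    } | awk -F '\t' '$1 == "G" {seen[$2] = 1; next} !($2 in seen) {print $2}'
}
```

For example, `gtf_only_seqnames mm10_gDNA.fasta.gz mm10_annot.gtf.gz` and `missing_cluster_genes mm10_hg38_v100_fromBroccoli.tab mm10 mm10_annot.gtf.gz` should both print nothing if the names and IDs are consistent.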

I do not know how to continue and start working with the tool. Could anybody help me, please?
Sorry if I posted too much information, but I wanted to give you enough information to work out what is going on. Of course, if you need to check any particular file or you need more information, let me know. What is more, if you prefer to meet through a video call, feel free to contact me and we can set it up.

Thanks very much in advance

Kind Regards,
Eva


malszycki · 2024-06-04T13:16:24Z

I just had the same error and fixed it by creating a separate conda environment with a fresh mafft installation (I also needed to install the hashmap R package from GitHub).
Did you check whether you have empty alignments in the step "parse_IPA_prot_aln"? In my case this resulted in the same error as yours in the step "format_EX_clusters_input". I had to go back step by step, and all the output files were empty (headers only).
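
The "go back step by step" check described above could be scripted roughly like this. It is only a sketch: `header_only_files` is a hypothetical helper, the `*.tab` pattern in the example is an assumption about how the intermediate outputs are named, and it only looks at plain-text files (not gzipped ones).

```shell
# Sketch only: hypothetical helper, not part of ExOrthist or Nextflow.
# List regular files under a directory that match a name pattern and hold
# at most one line (empty, or header only). Symlinked inputs are skipped
# because of -type f, so only files written by the tasks are reported.
# Usage: header_only_files DIR 'PATTERN'
header_only_files() {
    find "$1" -type f -name "$2" | while IFS= read -r f; do
        # Strip padding that some wc implementations add around the count.
        n=$(wc -l < "$f" | tr -d ' ')
        if [ "$n" -le 1 ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run from the ExOrthist directory as, for example, `header_only_files work '*.tab'`; any file it prints belongs to a task whose output is worth inspecting first.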
