Error executing process > 'format_EX_clusters_input' #35

Open
evaesquinas opened this issue Apr 13, 2023 · 1 comment

@evaesquinas

Hello,

I have been trying to use ExOrthist and, although I have tried several things (including talking to @fedemantica), I am still unable to make it run.

Although the title (or the error) of this issue is almost the same as the one I opened a month ago, my current problem is different. In that previous issue I had problems with the test data because I had not removed the brackets from the command. That problem was solved and I managed to run the test data without any issues.

However, now that I am trying to use the whole genomes and annotations (not subsetted), I am getting the same error, even though I am writing the command properly.

executor >  local (14)
[d8/0c4443] process > check_input (1)                [100%] 1 of 1 ✔
[cf/4db79c] process > generate_annotations (hg38)    [100%] 2 of 2 ✔
[2c/2c53dd] process > split_clusters_by_species_p... [100%] 1 of 1 ✔
[bc/450e28] process > split_clusters_in_chunks (h... [100%] 1 of 1 ✔
[59/860fab] process > parse_IPA_prot_aln (hg38-mm... [100%] 1 of 1 ✔
[02/61c530] process > split_EX_pairs_to_realign (1)  [100%] 1 of 1 ✔
[83/3e953e] process > realign_EX_pairs (1)           [100%] 1 of 1 ✔
[24/c2b623] process > merge_PROT_EX_INT_aln_info ... [100%] 1 of 1 ✔
[2d/a65731] process > score_EX_matches (hg38-mm10)   [100%] 1 of 1 ✔
[86/669d1b] process > filter_and_select_best_EX_m... [100%] 1 of 1 ✔
[11/faabcb] process > join_filtered_EX_matches       [100%] 1 of 1 ✔
[87/831d4a] process > collapse_overlapping_matches   [100%] 1 of 1 ✔
[8d/174a47] process > format_EX_clusters_input       [  0%] 0 of 1
[-        ] process > cluster_EXs                    -
[-        ] process > format_EX_clusters_output      -
[-        ] process > recluster_genes_by_species_... -
[-        ] process > recluster_EXs_by_species_pair  -
Error executing process > 'format_EX_clusters_input'

Caused by:
  Missing output file(s) `PART_*-cluster_input.tab` expected by process `format_EX_clusters_input`

Command executed:

  if [ `echo mm10_hg38_v100_fromBroccoli.tab | grep ".gz"` ]; then
      zcat mm10_hg38_v100_fromBroccoli.tab > cluster_file
      D1_format_EX_clusters_input.pl cluster_file filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
      rm cluster_file
   else
      D1_format_EX_clusters_input.pl mm10_hg38_v100_fromBroccoli.tab filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab 500
   fi

Command exit status:
  0

Command output:
  (empty)

Command error:
  INFO:    Convert SIF file to sandbox...
  Number of parts:	0
  INFO:    Cleaning up image...

Work dir:
  /mnt/lustre/scratch/nlsas/home/usc/gr/eer/Tools/ExOrthist/work/8d/174a47b714b384f3d6036eb6b93cd6

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`


Failed to invoke `workflow.onComplete` event handler

 -- Check script 'main.nf' at line: 676 or see '.nextflow.log' file for more details
--- Pipeline BIOCORE@CRG ExOrthist ---
Started at  2023-04-13T14:26:58.654647070+02:00
Finished at 2023-04-13T14:30:42.002558998+02:00
Time elapsed: 3m 43s
Execution status: failed

Find the hidden log attached (.nextflow.log).

hidden.nextflow.log
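
Since the failing step prints `Number of parts: 0` and exits with status 0, one thing that may be worth checking is whether the two inputs of that step share any gene IDs at all. The following is only a minimal sketch: `count_lines_with_cluster_genes` is a hypothetical helper (not part of ExOrthist), and it assumes that the gene IDs in column 3 of the cluster file appear verbatim, as whole words, somewhere in the lines of the matches file.

```shell
# Hypothetical helper, not part of ExOrthist.
# Prints how many lines of the matches file mention at least one gene ID
# taken from column 3 of the (tab-separated) cluster file.
count_lines_with_cluster_genes() {
    clusters=$1
    matches=$2
    # Keep only non-empty gene IDs from well-formed lines, then use them
    # as fixed-string, whole-word patterns. grep -c exits with status 1
    # when the count is 0, so "|| true" keeps the function from failing.
    awk -F'\t' 'NF >= 3 && $3 != "" {print $3}' "$clusters" \
        | sort -u \
        | grep -c -w -F -f - "$matches" || true
}

# Possible usage from inside the failing work dir reported in the log:
#   cd /mnt/lustre/scratch/nlsas/home/usc/gr/eer/Tools/ExOrthist/work/8d/174a47b714b384f3d6036eb6b93cd6
#   count_lines_with_cluster_genes mm10_hg38_v100_fromBroccoli.tab \
#       filtered_best_scored_EX_matches_by_targetgene-NoOverlap.tab
```

A result of 0 would suggest that the matches file is empty (or header only), or that its gene IDs are written differently from those in the cluster file.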

This is the script that I am using:

#!/usr/bin/bash
#SBATCH --job-name=ExOrthist
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH -o slurm_outputs/ExOrthist.o
#SBATCH -e slurm_outputs/ExOrthist.e

cd /home/usc/gr/eer/LUSTRE/Tools/ExOrthist
path='/mnt/lustre/scratch/nlsas/home/usc/gr/eer/Analyses/CESGA/Splicing/Results/MALU29_dataset/ExOrthist/'

module load nextflow
module load singularity
module load cesga/2020 gcccore/system mafft/7.475-with-extensions

nextflow run main.nf -with-singularity > $path'log.txt'

#move the hidden .log
mv .nextflow.log $path

Note that I am working on a cluster (Slurm environment). This is what appears when I load each module:

  • Nextflow:
module load nextflow
jdk/17.0.2 loaded
squashfs/4.3 loaded
singularity/3.6.3 loaded
nextflow/22.10.4 loaded
  • Singularity (after Nextflow):
module load singularity
singularity/3.6.3 unloaded
go/1.17.8 loaded
singularity/3.9.7 loaded

The following have been reloaded with a version change:
  1) singularity/3.6.3 => singularity/3.9.7

You may wonder why I loaded both. At the beginning I was only loading Nextflow but, just in case, I tried loading Singularity afterwards, and now I am using both.

After talking with @fedemantica, I also tried loading mafft (module unload cesga/2020 gcccore/system mafft/7.475-with-extensions) in case that was the problem, but it seems that it is not.

This is how the "params.config" file looks:

[image: screenshot of the params.config file]

Note that I also tried using NXF_VER=20.04.1, as I did for the test data (which works):
NXF_VER=20.04.1 nextflow run main.nf -with-singularity > test_log.txt

but it gives me the same error.

Also note that:
a) I am using Ensembl release 100 for both the genome and the GTF.
GENOME:

In case you want to see the header format of each genome (to compare with the GTF files):

zgrep '^>' mm10_gDNA.fasta.gz
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
>10 dna:chromosome chromosome:GRCm38:10:1:130694993:1 REF
>11 dna:chromosome chromosome:GRCm38:11:1:122082543:1 REF
>12 dna:chromosome chromosome:GRCm38:12:1:120129022:1 REF
zgrep '^>' hg38_gDNA.fasta.gz
>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
>10 dna:chromosome chromosome:GRCh38:10:1:133797422:1 REF
>11 dna:chromosome chromosome:GRCh38:11:1:135086622:1 REF
>12 dna:chromosome chromosome:GRCh38:12:1:133275309:1 REF

GTF:

Format:

 zcat mm10_annot.gtf.gz | head
#!genome-build GRCm38.p6
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.8
#!genebuild-last-updated 2020-02
1       havana  gene    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1";
1       havana  transcript      3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; tag "basic"; transcript_support_level "NA";
1       havana  exon    3073253 3074322 .       +       .       gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935"; havana_gene_version "1"; transcript_name "4933401J01Rik-201"; transcript_source "havana"; transcript_biotype "TEC"; havana_transcript "OTTMUST00000127109"; havana_transcript_version "1"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic"; transcript_support_level "NA";
1       ensembl gene    3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1       ensembl transcript      3102016 3102125 .       +       .       gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; tag "basic"; transcript_support_level "NA";
zcat hg38_annot.gtf.gz | head
#!genome-build GRCh38.p13
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.28
#!genebuild-last-updated 2019-06
1       havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
1       havana  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
1       havana  exon    11869   12227   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1       havana  exon    12613   12721   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1       havana  exon    13221   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
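
The comparison between the FASTA headers and the GTF files mentioned above can also be done automatically. This is a rough sketch, not something taken from ExOrthist: `gtf_seqnames_missing_from_fasta` is a hypothetical helper, and it assumes that both files are gzip-compressed and that the GTF is tab-separated (as Ensembl GTF files are).

```shell
# Hypothetical helper, not part of ExOrthist.
# Prints every sequence name used in column 1 of the GTF that has no
# matching header in the genome FASTA. No output means all names match.
gtf_seqnames_missing_from_fasta() {
    fasta_gz=$1
    gtf_gz=$2
    fasta_names=$(mktemp)
    gtf_names=$(mktemp)
    # FASTA sequence name: text between '>' and the first whitespace.
    gzip -dc "$fasta_gz" \
        | awk '/^>/ {sub(/^>/, "", $1); print $1}' \
        | sort -u > "$fasta_names"
    # GTF sequence name: first tab-separated column of non-comment lines.
    gzip -dc "$gtf_gz" \
        | awk -F'\t' '!/^#/ && NF > 1 {print $1}' \
        | sort -u > "$gtf_names"
    # comm -13 keeps the lines that occur only in the second file.
    comm -13 "$fasta_names" "$gtf_names"
    rm -f "$fasta_names" "$gtf_names"
}

# Possible usage with the files shown above:
#   gtf_seqnames_missing_from_fasta mm10_gDNA.fasta.gz mm10_annot.gtf.gz
#   gtf_seqnames_missing_from_fasta hg38_gDNA.fasta.gz hg38_annot.gtf.gz
```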

b) The gene orthogroups file was generated with Broccoli in the following way:

awk -F '[: ]' '/^>/ {print $1,$4,$10,$14}' Homo_sapiens.GRCh38.pep.all.fa > proteome_info_reduced_hg38.txt
awk -F '[: ]' '/^>/ {print $1,$4,$10,$14}' Mus_musculus.GRCm38.pep.all.fa > proteome_info_reduced_mm10.txt
  • With those two files I went to R, loaded them and did the following: removed orthogroups containing more than 20 genes; merged my orthogroups with the info file derived from the FASTA; kept only the protein_coding genes; removed duplicates (there can be several PEP entries for the same gene); and stripped the version from each gene ID. I then generated the file mm10_hg38_v100_fromBroccoli.tab with the following format:
OG_1	mm10	ENSMUSG00000038324
OG_1	hg38	ENSG00000100991
OG_10	mm10	ENSMUSG00000114004
OG_10	hg38	ENSG00000284638
OG_100	mm10	ENSMUSG00000025193
OG_100	hg38	ENSG00000119929
OG_10000	mm10	ENSMUSG00000015053
OG_10000	hg38	ENSG00000179348
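
A quick sanity check of a cluster file in this three-column format could look like the sketch below. It is only an illustration under assumptions: `check_cluster_file` is a hypothetical helper (not part of ExOrthist or Broccoli), the columns are assumed to be tab-separated, and the default limit of 20 genes simply mirrors the filter described above.

```shell
# Hypothetical helper, not part of ExOrthist or Broccoli.
# Reports, for a tab-separated file with columns
# <orthogroup> <species> <gene ID>:
#   - lines that do not have exactly three columns,
#   - orthogroups with more genes than the limit (default 20),
#   - orthogroups whose genes all come from a single species.
check_cluster_file() {
    clusters=$1
    max_genes=${2:-20}
    awk -F'\t' -v max="$max_genes" '
        NF != 3 { bad++; next }
        {
            size[$1]++
            # Count each species once per orthogroup.
            if (!seen[$1, $2]++) nspecies[$1]++
        }
        END {
            for (og in size) {
                if (size[og] > max) large++
                if (nspecies[og] < 2) single++
            }
            printf "malformed_lines=%d large_orthogroups=%d single_species_orthogroups=%d\n",
                   bad, large, single
        }
    ' "$clusters"
}

# Possible usage:
#   check_cluster_file mm10_hg38_v100_fromBroccoli.tab
```

Non-zero counts would point to lines worth inspecting by hand; all zeros only mean that the file passes these three basic checks.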

I do not know how to continue so that I can start working with the tool. Could anybody help me, please?
Sorry if I posted too much information, but I wanted to give you enough information to work out what is going on. Of course, if you need to check any particular file or need more information, let me know. Moreover, if you prefer to meet through a video call, feel free to contact me and we can set it up.

Thanks very much in advance

Kind Regards,
Eva

@malszycki
malszycki commented Jun 4, 2024

I just had the same error and fixed it by creating a separate conda environment with a fresh mafft installation (I also needed to install the hashmap R package from GitHub).
Did you check whether you have empty alignments in the step "parse_IPA_prot_aln"? In my case this resulted in the same error as yours in the step "format_EX_clusters_input". I had to go back step by step, and all the output files were empty (headers only).
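
Going back step by step to look for header-only outputs can be partly automated. The sketch below is only one way to do it: `list_header_only_files` is a hypothetical helper, and the default `*.tab` pattern is an assumption about how the intermediate files are named, so it may need adjusting.

```shell
# Hypothetical helper, not part of ExOrthist.
# Lists regular files under a directory whose name matches a pattern and
# that contain at most one line (empty, or header only).
# Inputs staged by Nextflow as symlinks are skipped by "-type f", so only
# files actually written by a process are reported.
list_header_only_files() {
    dir=$1
    pattern=${2:-*.tab}
    find "$dir" -type f -name "$pattern" | sort | while IFS= read -r f; do
        # Arithmetic expansion strips any padding that wc may add.
        n=$(($(wc -l < "$f")))
        if [ "$n" -le 1 ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Possible usage from the pipeline directory:
#   list_header_only_files work
```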
