Metazoan genomes

Vyacheslav Brover edited this page Sep 28, 2021 · 11 revisions

Get the input assemblies

Suppose genome.list is a list of metazoan assembly ids.
For each assembly <asm> in this list, let genome/<asm>/<asm>.prot be its protein FASTA file.
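
Before going further it may be worth checking that every listed assembly actually has its protein file. The sketch below assumes that genome.list holds one assembly id per line (the page does not state the format); the function name missing_prot is hypothetical and not part of the toolkit.

```shell
# Print the ids from a list file whose protein FASTA is missing or empty.
# Assumption: the list holds one assembly id per line.
# Usage: missing_prot genome.list [genome_root]
missing_prot() {
  list=$1
  root=${2:-genome}
  # "|| [ -n ... ]" keeps a last line that has no trailing newline
  while IFS= read -r asm || [ -n "$asm" ]; do
    [ -n "$asm" ] || continue                      # skip blank lines
    [ -s "$root/$asm/$asm.prot" ] || printf '%s\n' "$asm"
  done < "$list"
}
```

Running missing_prot genome.list should print nothing when all inputs are in place.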

Create the library of universal HMMs hmm-univ.LIB

Download https://busco-data.ezlab.org/v5/data/lineages/metazoa_odb10.2021-02-24.tar.gz and unpack into the directory metazoa_odb10/.

cat metazoa_odb10/hmms/* > hmms
$TT/genetics/hmmAddCutoff hmms metazoa_odb10/scores_cutoff GA hmm-univ.LIB
rm hmms

Find universal proteins in assemblies

For each assembly <asm> run

$TT/genetics/prots2hmm_univ.sh genome/<asm>/<asm> hmm-univ.LIB 1 <asm>.log

which will create files

genome/<asm>/<asm>.univ
genome/<asm>/<asm>.prot-univ
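
The per-assembly command can be driven by a loop over genome.list. This is only a sketch: it assumes one assembly id per line and that TT is set in the environment, and it passes the third argument (1) through unchanged because its meaning is not documented here. The function name run_univ is hypothetical.

```shell
# Run prots2hmm_univ.sh for every assembly in a list, one log per assembly.
# Assumptions: one id per line; TT points to the toolkit root.
# Usage: run_univ genome.list
run_univ() {
  list=$1
  while IFS= read -r asm || [ -n "$asm" ]; do
    [ -n "$asm" ] || continue                      # skip blank lines
    # stdin is redirected so the script cannot consume the list being read
    "$TT/genetics/prots2hmm_univ.sh" "genome/$asm/$asm" hmm-univ.LIB 1 "$asm.log" \
      < /dev/null || echo "failed: $asm" >&2
  done < "$list"
}
```

Failures are reported on stderr and the loop continues, so a single bad assembly does not stop the whole run.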

Remove from genome.list every assembly whose number of universal proteins (in the file genome/<asm>/<asm>.prot-univ) is below 850.
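
The filtering can be scripted roughly as follows. The sketch assumes that <asm>.prot-univ is a FASTA file with one record (one '>' header line) per universal protein; if the file is a plain list instead, the count would have to come from wc -l. The function name keep_assemblies is hypothetical.

```shell
# Print the ids whose .prot-univ file holds at least `min` FASTA records.
# Assumptions: one id per line in the list; .prot-univ is FASTA with one
# '>' header line per universal protein.
# Usage: keep_assemblies genome.list [min] [genome_root]
keep_assemblies() {
  list=$1
  min=${2:-850}
  root=${3:-genome}
  while IFS= read -r asm || [ -n "$asm" ]; do
    [ -n "$asm" ] || continue                      # skip blank lines
    f=$root/$asm/$asm.prot-univ
    n=0                                            # a missing file counts as 0
    if [ -f "$f" ]; then
      n=$(grep -c '^>' "$f" || true)               # grep exits 1 on a zero count
    fi
    if [ "$n" -ge "$min" ]; then
      printf '%s\n' "$asm"
    fi
  done < "$list"
}
```

A possible use is keep_assemblies genome.list > genome.list.new followed by mv genome.list.new genome.list, after checking the new list by eye.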

Compute dissimilarities based on universal proteins

Get the standard script to compute dissimilarities for Metazoa:

$TT/phylogeny/distTree_inc_init_stnd.sh inc genome/Metazoa "" "" "" ""

Create the file with pairs of assemblies:

$TT/list2pairs genome.list > pairs

Compute the dissimilarities:

inc/pairs2dissim.sh pairs "" dissim log

(The file pairs can be split into parts, inc/pairs2dissim.sh can be run on each part separately, and then the dissim files can be concatenated.)
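
The split-and-concatenate approach could look like the sketch below. It runs the parts one after another for simplicity; in practice each part would more likely go to a separate job. It assumes that inc/pairs2dissim.sh accepts the same four arguments for a part as for the whole file, with a per-part name in the fourth position. The function name dissim_in_parts and the .part. naming are hypothetical.

```shell
# Split a pairs file into chunks, run a script on each chunk, and
# concatenate the per-chunk results into one output file.
# Assumption: the script takes (input, "", output, log) for a part just as
# it does for the whole file.
# Usage: dissim_in_parts pairs lines_per_part dissim [script]
dissim_in_parts() {
  pairs=$1
  lines=$2
  out=$3
  script=${4:-inc/pairs2dissim.sh}
  # split writes pairs.part.aa, pairs.part.ab, ... (two-letter suffixes)
  split -l "$lines" "$pairs" "$pairs.part."
  : > "$out"                                       # start with an empty result
  for part in "$pairs".part.??; do
    "$script" "$part" "" "$part.dissim" "$part.log" < /dev/null || return 1
    cat "$part.dissim" >> "$out"
  done
}
```

Leftover pairs.part.?? files from an earlier run would be picked up by the glob, so they should be removed first.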

Convert the dissimilarity file dissim into the Data Master format:

$TT/dm/pairs2dm dissim 1 "cons" 6  -distance > data.dm

Make the tree

$TT/phylogeny/makeDistTree  -threads 5  -data data  -dissim_attr "cons"  -variance linExp  \
   -optimize  -subgraph_iter_max 10  -noqual  -output_tree tree