Skip to content

Bacterial genomes

Vyacheslav Brover edited this page Sep 28, 2021 · 36 revisions

Get bacterial genomes

Suppose there is a list of bacterial GenBank assembly ids genome.list.
Create a directory for each of them grouped by file2hash i a new directory genome/:

$TT/trav genome.list "mkdir -p genome/%h/%f"

Put the FASTA file of each genome with contigs longer than 1000 bp into genome/.
The result for 120 assemblies should look like this:

$ ls -d genome/*/* | wc -l
120
$ ls -d genome/*/* | head -3
genome/106/2568628
genome/11/5064078
genome/110/5687818
$ ls genome/106/2568628/
2568628.dna

Prepare data in genome/

Compute nucleotide statistics of genomes:

$TT/trav genome.list "$TT/genetics/dna2stat genome/%h/%f/%f.dna > genome/%h/%f/%f.stat"

Find CDSs in the genomes with Prodigal, create files <asm>.cds:

$TT/trav genome.list  \
    "prodigal  -o /dev/null  -i genome/%h/%f/%f.dna  -d genome/%h/%f/%f.cds"

Create files with CDS hashes <asm>.hash-CDS:

$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
    genome/%h/%f/%f.hash-CDS  -cds  -gene_finder prodigal  -min_prot_len 150"

Create files with protein hashes <asm>.hash-PRT:

$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
    genome/%h/%f/%f.hash-PRT  -cds  -gene_finder prodigal  -min_prot_len 150 \
    -translate"

Create protein FASTA files <asm>.prot:

$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
    genome/%h/%f/%f.hash-PRT1  -cds  -gene_finder prodigal  -translate \
    -min_prot_len 0 -out_prot genome/%h/%f/%f.prot \
    && rm genome/%h/%f/%f.hash-PRT1"

Find universal proteins using the BUSCO HMMs bacteria_odb10.LIB and create FASTA files <asm>.prot-univ with segments of universal proteins matching HMMs:

$TT/trav genome.list \
   "$TT/genetics/prots2hmm_univ.sh genome/%h/%f/%f bacteria_odb10.LIB 1 log"

Create files with Pfam HMM hashes <asm>.hash-HMM:

$TT/trav genome.list "$TT/genetics/prots2hmm_hash.sh genome/%h/%f/%f.prot Pfam-A.hmm \
    0 genome/%h/%f/%f.HMM genome/%h/%f/%f.hash-HMM log"

Prepare the database

Insert the list of assemblies into the table Genome:

$TT/database/bulk.sh $SERVER $BULK_LOCAL $BULK_REMOTE genome.list $DATABASE..List
sqsh-ms  -S $SERVER  -D $DATABASE  -C \
    "insert into Genome (id, tax_root, annot_tried)  select id, 2, 1 from List"

Populate column Genome.total_len:

$TT/trav genome.list \
   "echo -e %Qupdate Genome set total_len = %G$TT/genetics/fasta2len \
    genome/%h/%f/%f.dna | cut -f 2 -d ' ' | $TT/dm/count | grep -w ^sum \
    | cut -f 2%G where id = %f\ngo%Q" > total_len.sql
sqsh-ms  -S $SERVER  -D $DATABASE  -i total_len.sql

Populate table GenomeHash:

$TT/trav genome.list "$TT/database/hash2table.sh %f CDS" >  GenomeHash
$TT/trav genome.list "$TT/database/hash2table.sh %f PRT" >> GenomeHash
$TT/trav genome.list "$TT/database/hash2table.sh %f HMM" >> GenomeHash
cat GenomeHash | tr '\t' '|' > GenomeHash.bulk
$TT/database/bulk.sh $SERVER $BULK_LOCAL $BULK_REMOTE \
    GenomeHash.bulk $DATABASE..GenomeHash

Populate columns Genome.cdss and Genome.prots:

select genome, [type], count(*) c
  into #T
  from GenomeHash (nolock)
  group by genome, [type];
update Genome
  set cdss = #T.c
  from      #T
       join Genome on Genome.id = #T.genome
  where     #T.[type] = 'CDS'
        and #T.c != isnull (Genome.cdss,0);
update Genome
  set prots = #T.c
  from      #T
       join Genome on Genome.id = #T.genome
  where     #T.[type] = 'PRT'
        and #T.c != isnull (Genome.prots,0);

Populate columns Genome.{long_len,ambig,gc,repeat8}:

$TT/trav genome.list "cat genome/%h/%f/%f.stat | sed 's|^.*/||1' | sed 's/\.dna//1' \
    | awk '{printf %Qupdate Genome set long_len = %Pd, ambig = %Pf, gc = %Pf, \
         repeat8 = %Pf where id = %Pd\ngo\n%Q, %D2, %D3, %D4, %D5, %D1};'" > stat.sql
sqsh-ms  -S $SERVER  -D $DATABASE  -i stat.sql

Create an incremental distance tree directory inc/

$TT/phylogeny/distTree_inc_init_stnd.sh inc $TT/phylogeny/inc/genome/bacteria \
   $SERVER $DATABASE $BULK_LOCAL $BULK_REMOTE

If the Univa Grid Engine is not available then the example sequences can be processed by disabling the grid engine by this command:

echo "10000" > inc/grid_min

Build an initial tree

Create a list of objects start.list for the initial tree:

sort -R genome.list | head -100 | sort > start.list

Build an initial tree for 100 sequences:

$TT/phylogeny/distTree_inc_complete.sh inc start.list

Build a tree incrementally from the initial tree

Create a list of objects new.list to add to the tree incrementally:

$TT/setMinus genome.list start.list > new.list

Transfer the objects in new.list to inc/new/:

$TT/trav new.list "touch inc/new/%f"

Run on a computer with large memory:

$TT/phylogeny/distTree_inc.sh inc 1

Update the table FreqHash

This is needed to make inc/request_closest.sh faster:

select GenomeHash.[type], GenomeHash.[hash]   -- PAR
  into #T
  from      GenomeHash (nolock)
       join Genome (nolock) on Genome.id = GenomeHash.genome
  where Genome.tax_root = 2  
  group by GenomeHash.[hash], GenomeHash.[type]
  having count(*) > 5000  -- PAR 
  option (hash join);
truncate table FreqHash;
insert into FreqHash (tax_root, [type], [hash])
  select                     2, [type], [hash]
    from #T;