-
Notifications
You must be signed in to change notification settings - Fork 8
Bacterial genomes
Suppose there is a list of bacterial GenBank assembly ids genome.list
.
Create a directory for each of them grouped by file2hash
i a new directory genome/
:
$TT/trav genome.list "mkdir -p genome/%h/%f"
Put the FASTA file of each genome with contigs longer than 1000 bp into genome/
.
The result for 120 assemblies should look like this:
$ ls -d genome/*/* | wc -l
120
$ ls -d genome/*/* | head -3
genome/106/2568628
genome/11/5064078
genome/110/5687818
$ ls genome/106/2568628/
2568628.dna
Compute nucleotide statistics of genomes:
$TT/trav genome.list "$TT/genetics/dna2stat genome/%h/%f/%f.dna > genome/%h/%f/%f.stat"
Find CDSs in the genomes with Prodigal, create files <asm>.cds
:
$TT/trav genome.list \
"prodigal -o /dev/null -i genome/%h/%f/%f.dna -d genome/%h/%f/%f.cds"
Create files with CDS hashes <asm>.hash-CDS
:
$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
genome/%h/%f/%f.hash-CDS -cds -gene_finder prodigal -min_prot_len 150"
Create files with protein hashes <asm>.hash-PRT
:
$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
genome/%h/%f/%f.hash-PRT -cds -gene_finder prodigal -min_prot_len 150 \
-translate"
Create protein FASTA files <asm>.prot
:
$TT/trav genome.list "$TT/genetics/fasta2hash genome/%h/%f/%f.cds \
genome/%h/%f/%f.hash-PRT1 -cds -gene_finder prodigal -translate \
-min_prot_len 0 -out_prot genome/%h/%f/%f.prot \
&& rm genome/%h/%f/%f.hash-PRT1"
Find universal proteins using the BUSCO HMMs bacteria_odb10.LIB
and create FASTA files <asm>.prot-univ
with segments of universal proteins matching HMMs:
$TT/trav genome.list \
"$TT/genetics/prots2hmm_univ.sh genome/%h/%f/%f bacteria_odb10.LIB 1 log"
Create files with Pfam HMM hashes <asm>.hash-HMM
:
$TT/trav genome.list "$TT/genetics/prots2hmm_hash.sh genome/%h/%f/%f.prot Pfam-A.hmm \
0 genome/%h/%f/%f.HMM genome/%h/%f/%f.hash-HMM log"
Insert the list of assemblies into the table Genome
:
$TT/database/bulk.sh $SERVER $BULK_LOCAL $BULK_REMOTE genome.list $DATABASE..List
sqsh-ms -S $SERVER -D $DATABASE -C \
"insert into Genome (id, tax_root, annot_tried) select id, 2, 1 from List"
Populate column Genome.total_len
:
$TT/trav genome.list \
"echo -e %Qupdate Genome set total_len = %G$TT/genetics/fasta2len \
genome/%h/%f/%f.dna | cut -f 2 -d ' ' | $TT/dm/count | grep -w ^sum \
| cut -f 2%G where id = %f\ngo%Q" > total_len.sql
sqsh-ms -S $SERVER -D $DATABASE -i total_len.sql
Populate table GenomeHash
:
$TT/trav genome.list "$TT/database/hash2table.sh %f CDS" > GenomeHash
$TT/trav genome.list "$TT/database/hash2table.sh %f PRT" >> GenomeHash
$TT/trav genome.list "$TT/database/hash2table.sh %f HMM" >> GenomeHash
cat GenomeHash | tr '\t' '|' > GenomeHash.bulk
$TT/database/bulk.sh $SERVER $BULK_LOCAL $BULK_REMOTE \
GenomeHash.bulk $DATABASE..GenomeHash
Populate columns Genome.cdss
and Genome.prots
:
select genome, [type], count(*) c
into #T
from GenomeHash (nolock)
group by genome, [type];
update Genome
set cdss = #T.c
from #T
join Genome on Genome.id = #T.genome
where #T.[type] = 'CDS'
and #T.c != isnull (Genome.cdss,0);
update Genome
set prots = #T.c
from #T
join Genome on Genome.id = #T.genome
where #T.[type] = 'PRT'
and #T.c != isnull (Genome.prots,0);
Populate columns Genome.{long_len,ambig,gc,repeat8}
:
$TT/trav genome.list "cat genome/%h/%f/%f.stat | sed 's|^.*/||1' | sed 's/\.dna//1' \
| awk '{printf %Qupdate Genome set long_len = %Pd, ambig = %Pf, gc = %Pf, \
repeat8 = %Pf where id = %Pd\ngo\n%Q, %D2, %D3, %D4, %D5, %D1};'" > stat.sql
sqsh-ms -S $SERVER -D $DATABASE -i stat.sql
$TT/phylogeny/distTree_inc_init_stnd.sh inc $TT/phylogeny/inc/genome/bacteria \
$SERVER $DATABASE $BULK_LOCAL $BULK_REMOTE
If the Univa Grid Engine is not available then the example sequences can be processed by disabling the grid engine by this command:
echo "10000" > inc/grid_min
Create a list of objects start.list
for the initial tree:
sort -R genome.list | head -100 | sort > start.list
Build an initial tree for 100 sequences:
$TT/phylogeny/distTree_inc_complete.sh inc start.list
Create a list of objects new.list
to add to the tree incrementally:
$TT/setMinus genome.list start.list > new.list
Transfer the objects in new.list
to inc/new/
:
$TT/trav new.list "touch inc/new/%f"
Run on a computer with large memory:
$TT/phylogeny/distTree_inc.sh inc 1
This is needed to make inc/request_closest.sh
faster:
select GenomeHash.[type], GenomeHash.[hash] -- PAR
into #T
from GenomeHash (nolock)
join Genome (nolock) on Genome.id = GenomeHash.genome
where Genome.tax_root = 2
group by GenomeHash.[hash], GenomeHash.[type]
having count(*) > 5000 -- PAR
option (hash join);
truncate table FreqHash;
insert into FreqHash (tax_root, [type], [hash])
select 2, [type], [hash]
from #T;