Pannagram is a package for constructing pan-genome alignments, analyzing structural variants, and translating annotations between genomes. Additionally, Pannagram contains useful functions for visualization. The manual is available in the examples folder.
Make sure you have Conda or Mamba installed. To create and activate the package environment, run:
conda env create -f pannagram.yaml
conda activate pannagram
# OR
mamba env create -f pannagram.yaml
mamba activate pannagram
Creating the environment downloads the required R interpreter version and all needed libraries, including BLAST, MAFFT, and others.
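Because either Conda or Mamba can be used, a small helper can pick whichever one is installed. The following is only a sketch: the function name `pick_env_manager` is hypothetical and not part of Pannagram, and preferring Mamba over Conda is an arbitrary choice.

```shell
# Hypothetical helper (not part of Pannagram): print the first available
# environment manager, trying mamba first and then conda; print "none" otherwise.
pick_env_manager() {
    local tool
    for tool in mamba conda; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Usage, from the directory that contains pannagram.yaml:
#   manager=$(pick_env_manager) && "$manager" env create -f pannagram.yaml
```

The function returns a non-zero status when neither tool is found, so the `&&` in the usage line stops before attempting to create the environment.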
macOS users should also run:
brew install coreutils
to make sure all the needed shell commands are installed.
Windows users can try running the code from this repository under WSL, since Bash and the / path separator are used extensively in the code. However, it has never been tested in that environment, so there is no guarantee that it will work.
Pangenome alignment can be built in two modes:
- reference-free:
./pannagram.sh -path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8
- reference-based:
./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8
- quick look: if no information on the genomes and their corresponding chromosomes is available, you can run only the preparation steps:
./pannagram.sh -ref '<reference genome name>' \
-path_in '<genome files directory path>' \
-path_out '<output files path>' \
-cores 8 -pre
An extended description of the parameters for all three scripts is available by executing the scripts with the flag -help.
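Before launching a long alignment run, it can help to confirm that the input directory actually contains genome files. The sketch below is illustrative only: the helper name `count_genomes` is hypothetical, and the default `fasta` extension is an assumption about how the genome files are named, so adjust it to match your data.

```shell
# Hypothetical pre-flight helper (not part of Pannagram): count files with a
# given extension directly inside a directory. The default extension "fasta"
# is an assumption about the input file naming.
count_genomes() {
    local dir=$1
    local ext=${2:-fasta}
    # tr strips the padding that some wc implementations add around the count
    find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d ' '
}

# Usage:
#   n=$(count_genomes '<genome files directory path>')
#   [ "$n" -gt 0 ] || echo "No genome files found" >&2
```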
Synteny blocks, SNPs, and sequence consensus (for the IGV browser) can be extracted from the alignment:
# -blocks: find synteny block information for visualisation
# -seq:    create the consensus sequence of the pangenome
# -snp:    call SNPs
./analys.sh -path_msa '<output path with consensus>' \
-path_chr '<path with chromosomes>' \
-blocks \
-seq \
-snp
When the pangenome linear alignment is built, SVs can be called using the following script:
# -sv_call:  create output .gff and .fasta files with SVs
# -sv_sim:   compare SVs with a set of sequences (e.g., TEs)
# -sv_graph: construct the graph of SVs
./analys.sh -path_msa '<output path with consensus>' \
-sv_call \
-sv_sim te.fasta \
-sv_graph
Pannagram contains a number of useful methods for visualization in R.
All genomes together:
A dotplot for a pair of genomes:
Every node is an SV:
Every node is a unique sequence; node size reflects the amount of this sequence in SVs:
- In the ACTG-mode:
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/msaplot.R') # Visualisation
aln.seq = readFastaMy('aln.fasta') # Vector of strings
aln.mx = aln2mx(aln.seq) # Transform into a matrix
msaplot(aln.mx) # ggplot object
- In the Polymorphism mode:
# --- Quick start code ---
msadiff(aln.mx) # ggplot object
Simultaneously on forward (dark color) and reverse complement (pink color) strands:
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/dotplot.R') # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE) # Random nucleotide vector
dotplot(s, s, 15, 9) # ggplot object
# --- Quick start code ---
source('utils/utils.R') # Functions to work with sequences
source('visualisation/orfplot.R') # Visualisation
s = sample(c("A","C","G","T"), 100, replace = TRUE) # Random nucleotide vector
str = nt2seq(s)
orfs = orfFinder(str)
orfplot(orfs$pos) # ggplot object
The first approach searches against entire genomes or individual chromosomes. A quick-start toy example:
./simsearch.sh -in_seq genes.fasta -on_genome genome.fasta -out out.txt
The result is a GFF file with hits matching the similarity threshold.
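To get a quick overview of such a result, the hits can be summarised per sequence. The sketch below assumes only the standard tab-separated GFF layout (column 1 is the sequence ID, columns 4 and 5 are the 1-based inclusive start and end). The function name `gff_summary` is hypothetical, and the exact contents of Pannagram's GFF output are not assumed beyond those columns.

```shell
# Hypothetical helper (not part of Pannagram): for each sequence ID in a GFF
# file, print the ID, the number of hits, and the summed hit length.
# Overlapping hits are counted separately, so the length is an upper bound
# on the covered region.
gff_summary() {
    awk -F'\t' '
        /^#/ || NF < 5 { next }               # skip comments and malformed lines
        { n[$1]++; len[$1] += $5 - $4 + 1 }   # inclusive coordinates
        END { for (id in n) printf "%s\t%d\t%d\n", id, n[id], len[id] }
    ' "$1" | sort
}

# Usage:
#   gff_summary out.txt
```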
The second approach, in contrast, searches for similarities against another set of sequences. A quick-start toy example:
./simsearch.sh -in_seq genes.fasta -on_seq genome.fasta -out out.txt
The result is a table saved as an RDS file (R's serialized object format). This table shows the coverage of one sequence over another and includes a flag column that indicates whether the sequences meet the similarity threshold. Additionally, the second approach takes the coverage strand into account, determining not just whether a sequence is covered, but also whether it is covered in a specific orientation.
Development:
- Anna Igolkina - Lead Developer and Project Initiator
- Alexander Bezlepsky - Assistant
Testing:
- Anna Igolkina: Lead Tester
- Anna Glushkevich: Testing the alignment on A. lyrata genomes
- Elizaveta Grigoreva: Testing the alignment on A. thaliana and A. lyrata genomes
- Jilong Ma: Testing the SV-graph on spider genomes
- Alexander Bezlepsky: Testing Pannagram's functionality on Rhizobial genomes
- Gregoire Bohl-Viallefond: Testing the annotation converter on A. thaliana alignment
Resources:
- Logo was generated with the help of DALL-E
- Parallel Processing Tool: O. Tange (2018): GNU Parallel 2018, ISBN 9781387509881, DOI https://doi.org/10.5281/zenodo.1146014.