Previously known as HOMEBREW, developed by Rudiger Brauning.
Existing SNP calling pipelines often come with built-in filtering processes, which can introduce systematic bias. Because each method uses its own algorithm to identify and define SNPs, the final outputs can vary moderately to substantially between pipelines. A simple, intuitive and consistent bioinformatics workflow is thus needed for developing new analytical methods.
Here we present snpGBS, a simple three-step approach to identify SNPs from GBS data:
We use cutadapt to demultiplex the raw GBS data (i.e. .fastq or .fastq.gz file). More information about cutadapt: https://cutadapt.readthedocs.io/en/stable/guide.html#demultiplexing
## 1.1 trimming common adapter
cutadapt -j 8 -a common_adapter=AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG -o example.trimmed.fastq example.fastq >01.trimmed.stdout 2>01.trimmed.stderr
## 1.2 demultiplexing
cutadapt -j 8 -e 0 --no-indels -g file:barcodes.fasta -o "demultiplexed_{name}.fastq.gz" example.trimmed.fastq >01.demultiplexed.stdout 2>01.demultiplexed.stderr

We use bowtie2 to align and map GBS reads. More information about bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
## 2.1 indexing genome
bowtie2-build ref.fa example >index.stdout 2>index.stderr
## 2.2 alignment
for i in ./demultiplexed_barcode*.fastq.gz
  do
    echo "$i";
    ## bowtie2 writes its alignment summary to stderr, which is captured in the log file below
    bowtie2 --very-fast-local -x example -U "$i" -S "./${i##*/}.sam" 2>"./${i##*/}.bowtie2.stdout";
done

We use bcftools-mpileup to identify SNPs. More information about bcftools-mpileup: http://www.htslib.org/doc/bcftools.html#mpileup
## 3.1 convert SAM to BAM
for i in *.sam;
  do
    echo "$i";
    samtools view -bS "$i" > "${i%.sam}.bam";
done
## 3.2 sort bam
for i in *.bam;
  do
    echo "$i";
    samtools sort "$i" -o "${i%.bam}.sorted.bam";
done
## 3.3 create bamlist
for i in *.sorted.bam;
  do
    echo "$i";
done > bamlist;
## 3.4 calling SNPs
bcftools mpileup -I -Ou -f ref.fa -b bamlist -a AD | bcftools call -cv - | bcftools view -M2 - >example.vcf

- Raw GBS data (e.g. `example.fastq`)
- Barcode sequences (e.g. `barcodes.fasta`). `createBarcodeFASTA.sh` is provided to convert `barcodes.txt` to `barcodes.fasta` as below:

      #!/bin/bash
      filename='barcodes.txt'
      n=0
      while IFS= read -r line; do
        echo ">barcode$n"
        echo "^$line"
        n=$((n+1))
      done < "$filename"

- Reference genome (e.g. `ref.fa`)
- Nil
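
Because the demultiplexing step uses zero-mismatch matching (`-e 0`), a malformed or duplicated barcode would silently change which reads are assigned to which sample. The sketch below is one possible pre-check of the barcode list before it is converted to FASTA. The function name `check_barcodes` is hypothetical and not part of snpGBS, and it assumes one uppercase barcode per line, as read by `createBarcodeFASTA.sh`.

```shell
#!/bin/bash
# Hypothetical helper (not part of snpGBS): sanity-check a barcode list.
# Assumes one barcode per line, uppercase A/C/G/T only.
check_barcodes() {
    local file="$1"
    local bad dup
    # Count lines that contain anything other than A, C, G or T
    # (empty lines are counted as invalid too).
    bad=$(grep -vcE '^[ACGT]+$' "$file")
    # Count distinct barcodes that occur more than once.
    dup=$(sort "$file" | uniq -d | wc -l | tr -d ' ')
    echo "invalid=$bad duplicated=$dup"
    # Succeed only if no problems were found.
    [ "$bad" -eq 0 ] && [ "$dup" -eq 0 ]
}
```

Running `check_barcodes barcodes.txt` prints a one-line summary and returns a non-zero exit status if any line is invalid or duplicated, so it can gate the conversion step in a script.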
 
To help users test snpGBS, we put together an example with the following files:
- Raw GBS data: `example.fastq` generated by SimGBS (can be downloaded from https://figshare.com/articles/dataset/snpGBS/13591274)
- Barcode sequences: `barcodes.txt` and `barcodes.fasta` are stored in https://github.com/AgResearch/snpGBS/tree/main/example/datasets
- Reference genome: `ref.fa` can also be found in https://figshare.com/articles/dataset/snpGBS/13591274
Here's a list of expected outputs:
- Demultiplexed FASTQ files: `demultiplexed-barcode*.fastq.gz` in https://github.com/AgResearch/snpGBS/tree/main/example/demultiplexed_fastq
- Mapped, sorted and indexed BAM files: `demultiplexed-barcode*.fastq.gz.sorted.bam(.bai)` in https://github.com/AgResearch/snpGBS/tree/main/example/mapping
- VCF file: `example.vcf` stores all the SNP information; it can be downloaded from https://figshare.com/articles/dataset/snpGBS/13591274
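
When comparing your own run against the expected outputs, two quick numbers are often enough for a first check: the number of reads in each demultiplexed FASTQ file and the number of variant records in the VCF. The following is a minimal sketch using only standard shell tools. The function names `count_reads` and `count_variants` are hypothetical and not part of snpGBS, and the sketch assumes four-line FASTQ records and an uncompressed VCF.

```shell
#!/bin/bash
# Hypothetical helpers (not part of snpGBS) for a quick look at the outputs.

# Number of reads in a FASTQ file (plain or gzipped), assuming the usual
# four lines per record.
count_reads() {
    local file="$1" lines
    case "$file" in
        *.gz) lines=$(gzip -cd "$file" | wc -l) ;;
        *)    lines=$(wc -l < "$file") ;;
    esac
    echo $((lines / 4))
}

# Number of variant records in an uncompressed VCF: every non-empty line
# that is not a header line (header lines start with '#').
count_variants() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_reads demultiplexed_barcode0.fastq.gz` and `count_variants example.vcf` each print a single integer that can be compared between runs.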
If you use snpGBS, please cite
- Kang, J., Dodds, K., Byrne, S., Faville, M., Black, M., Hess, A., Hess, M., McCulloch, A., Jacobs, J., Milbourne, D., Wilcox, P., & Brauning, R. (n.d.). snpGBS: A Simple and Flexible Bioinformatics Workflow to Identify SNPs from Genotyping-by-Sequencing Data. In Exploiting genetic diversity of forages to fulfil their economic and environmental roles: Proceedings of the 2021 Meeting of the Fodder Crops and Amenity Grasses Section of EUCARPIA. Exploiting genetic diversity of forages to fulfil their economic and environmental roles. Univerzita Palackého v Olomouci. https://doi.org/10.5507/vup.21.24459677.16
 
- KGD: R code for the analysis of genotyping-by-sequencing (GBS) data, primarily to construct a genomic relationship matrix for the genotyped individuals.
- GUSLD: An R package for estimating linkage disequilibrium using low and/or high coverage sequencing data without requiring filtering with respect to read depth.
- SMAP: A software package that analyzes read mapping distributions and performs haplotype calling to create multi-allelic molecular markers.