SNPbinner is a Python 2.7 package and command line utility for the generation of genotype binmaps based on SNP genotype data across populations of recombinant inbred lines (RILs). Analysis using SNPbinner is performed in three parts: crosspoints
, bins
, and visualize
.
SNPbinner can be cited as:
Gonda, I., H. Ashrafi, D.A. Lyon, S.R. Strickler, A.M. Hulse-Kemp, Q. Ma, H. Sun, K. Stoffel, A.F. Powell, S. Futrell, T.W. Thannhauser, Z. Fei, A.E. Van Deynze, L.A. Mueller, J.J. Giovannoni, and M.R. Foolad. 2019. Sequencing-based bin map construction of a tomato mapping population, facilitating high-resolution quantitative trait loci detection. Plant Genome 12:180010. doi:10.3835/plantgenome2018.02.0010
Installation and Usage
Commands
crosspoints
bins
visualize
SNPbinner requires Python 2.7. Python 3 is currently not supported.
The only non‑standard dependency of SNPbinner is Pillow, a PIL fork.
To install the SNPbinner utility, download or clone the repository and run
$ pip install REPO-PATH
Once installed, one can execute any of the commands below like so
$ snpbinner COMMAND [ARGS...]
Alternatively, without installing the package, one can execute any of the commands below using
$ python REPO-PATH/snpbinner COMMAND [ARGS...]
Description | Usage | Input Format | Output Format |
---|
crosspoints
uses genotyped SNP data to identify likely crossover points. First, the script uses a pair of hidden Markov models (HMM) to predict genotype regions along the chromosome both with (3‑state) and without (2‑state) heterozygous regions. Then, the script identifies groupings of regions which are too short (based on a minimum distance between crosspoints set by the user). After that it follows the rules below to find crosspoints and merge away regions which are too short. The script then outputs the crosspoints for each RIL and the genotyped regions between them to a CSV file.
- If a group of alternating too‑short regions is long enough to be its own acceptably‑long genotype region, it will be treated as such and assigned the most likely genotype using the 3‑state HMM.
- If a group of alternating too‑short regions is surrounded by regions of the same genotype, all regions within that group are assigned the surrounding genotype.
- If a too‑short region has been genotyped as heterozygous by the 3‑state HMM, that section is replaced by the regions identified by the 2‑sate HMM.
- If the first or last too‑short region is neighboring an acceptably‑long heterozygous region, the whole grouping will be assigned the heterozygous genotype.
- If a group of alternating too-short regions is bounded by two homozygous regions, the leftmost or rightmost too-short region (whichever is shortest) will be merged with it's bounding homozygous region. This repeats until the group is empty, the contents having been merged into the two bounding regions.
Running the crosspoints
command requires an input path, output path, and a minimum size argument. There are also three optional arguments which can be found in the table below.
$ snpbinner crosspoints --input PATH --output PATH (--min-length INT | --min-ratio FLOAT) [optional args]
Type | Description | ||
---|---|---|---|
‑i |
‑‑input |
PATH |
Path to a SNP TSV, multiple paths, or a glob (e.g. myGenome.chr*.tsv). |
‑o |
‑‑output |
PATH |
Path for the output CSV when there is a single input, or for a folder when there are multiple. |
‑m |
‑‑min‑length |
INT |
Minimum distance between crosspoints in basepairs. Cannot be used with min‑ratio . |
‑r |
‑‑min‑ratio |
FLOAT |
Minimum distance between crosspoints as a ratio. (0.01 would be 1% of the chromosome.) Cannot be used with min‑length . |
Input should be formatted as a tab‑separated value (TSV) file with the following columns. | |
---|---|
0 | The SNP marker ID. |
1 | The position of the marker in base pairs from the start of the chromosome. |
2+ | RIL ID (header) and the called genotype of the RIL at each position. |
Output is formatted as a comma‑separated value (CSV) file with the following columns. | |
---|---|
0 | The RIL ID |
Odd | Location of a crosspoint. (Empty after the chromosome ends.) |
Even | Genotype in between the surrounding crosspoints. (Empty after the chromosome ends.) |
Description | Usage | Input Format | Output Format |
---|
bins
takes the crosspoints predicted for each RIL and combines similar crosspoint locations to create a combined map of all crossover points across the RILs at a specified resolution. It then projects the genotype regions of the RIL back onto the map and outputs the average genotype of each RIL in each bin on the map. The procedure is as follows. It should be noted that, to insure the changes are obvious, the illustrations below are showing a map with very low resolution (bin size) and therefore there is significant loss of information. A smaller bin size would create a more accurate map.
- The script begins by combining the crosspoints from all lines, including duplicates occurring at the same location.
- Contiguous series of crosspoints are then grouped together if they are closer to a neighbor than the specified minimum bin size.
- One‑dimensional k‑means optimization is then used to find the best placement for the bin boundaries (steps 2 and 4 below). This is repeated for every possible number of boundaries that can fit in the span of each group. In order to account for the minimum bin‑size constraint, once a possible set of boundaries has been converged upon by the k‑means algorithm, each mean is adjusted to insure it is at least the minimum distance from it's neighbors (steps 3 and 4 below). If this enters a cycle instead of converging on a working solution, the script will accept the adjusted boundaries without the second optimization step. Otherwise, optimization continues until a solution is reached with appropriately spaced boundaries.
This k=3 example finishes due to a cycle (steps 3‑5).
- For each group, the solution with a value of k leading to the least variance from the adjusted means are placed into a list of final boundaries. These boundaries are then used to create bins for the final binmap.
- Each RIL is then projected onto this bin and the results are output as a CSV. Bins are genotyped as whatever genotype represents a plurality of its contents.
Running the bins command requires an input path, output path, and a minimum size argument. Optionally, a binmap ID may also be provided.
$ snpbinner bins --input PATH --output PATH --min-bin-size INT [--binmap-id ID]
Type | Description | ||
---|---|---|---|
‑i |
‑‑input |
PATH |
Path to a crosspoints CSV, multiple paths, or a glob (e.g. myGenome.chr*.crosp.csv). |
‑o |
‑‑output |
PATH |
Path for the output CSV when there is a single input, or for a folder when there are multiple. |
‑l |
‑‑min‑bin‑size |
INT |
Sets the minimum size (in bp) of each bin. |
Type | Description | ||
---|---|---|---|
‑n |
‑‑binmap‑id |
ID |
If a binmap ID is provided, a header row will be added and each column labeled with the given string. |
bins
uses the output from crosspoints
.
For details, see the crosspoints
Output Format.
Output is formatted as a comma‑separated value (CSV) file and has the following rows. | |
---|---|
0 | (Optional) The binmap ID |
1 | The start of each bin (in base pairs). |
2 | The end of each bin (in base pairs). |
3 | The center of each bin (in base pairs). |
4+ | RIL ID in the first cell, then the genotypes of each bin for that RIL. |
Description | Usage | Input Format | Output Format |
---|
visualize
plots the inputs and outputs of bins
and crosspoints
. It can be used to visually check the results of the above commands to help determine the best values for each of the parameters. It can accept three filetypes (SNP input TSV, crosspoint CSV, and bin CSV). It then parses the files and groups the data by RIL, creating an image for each. In each row of the resulting images, regions are colored red, green, or blue, for genotype a, heterozygous, or genotype b, respectively. The binmap is represented in gray with adjacent bins alternating dark and light. The script can accept any combination or number of files for each of the different filetypes.
$ snpbinner visualize --out PATH [--bins PATH]... [--crosspoints PATH]... [--snps PATH]...
Type | Description | ||
---|---|---|---|
‑o |
‑‑out |
PATH |
Folder to which the resulting images should be saved. |
Type | Description | ||
---|---|---|---|
‑b |
‑‑bins |
PATH |
bins output file to be added to the visualization. |
‑c |
‑‑crosspoints |
PATH |
crosspoints output file to be added to the visualization. |
‑s |
‑‑snps |
PATH |
SNP (crosspoints input file) file to be added to the visualization. |