Fasta to SampleData #674
-
I have a Fasta file with ~1000 phased sequences turned into biallelic characters (0's and 1's), from SNPs and indels. What's the best way to turn this into a SampleData object for use with the tsinfer CLI? Note: I'm not a Python programmer. Example:
|
Replies: 2 comments 1 reply
-
The simplest thing to do would be to convert to a VCF in some way, and then follow the standard methods outlined in the tutorial.
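For what it's worth, here is a rough sketch of what "convert to a VCF in some way" could look like for 0/1 sequences, using only the Python standard library. Everything specific in it is an assumption rather than something from this thread: the function name `write_binary_vcf` is made up, the contig name `"1"` is arbitrary, the position is simply the 1-based column index, and `A`/`T` are placeholder alleles standing in for "0" and "1" because the real alleles are not in the FASTA. Each sequence is written as a haploid sample, so check that whatever VCF reader you use downstream accepts haploid genotypes.

```python
import io


def write_binary_vcf(names, rows, out, chrom="1"):
    """Write haploid 0/1 sequences as a minimal VCF, one record per column.

    names: list of sample names; rows: list of equal-length "0101..." strings.
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all sequences must have the same length")
    out.write("##fileformat=VCFv4.2\n")
    out.write(f"##contig=<ID={chrom}>\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]
    out.write("\t".join(header + list(names)) + "\n")
    n_sites = len(rows[0]) if rows else 0
    for j in range(n_sites):
        # One haploid genotype ("0" or "1") per sequence at this column.
        genotypes = [row[j] for row in rows]
        # ASSUMPTION: "A"/"T" are placeholders; POS is the 1-based column index.
        fields = [chrom, str(j + 1), ".", "A", "T", ".", "PASS", ".", "GT"]
        out.write("\t".join(fields + genotypes) + "\n")


# Tiny in-memory example with three made-up sequences.
buf = io.StringIO()
write_binary_vcf(["seq1", "seq2", "seq3"], ["0101", "0011", "1001"], buf)
vcf_text = buf.getvalue()
```

To write a real file, pass an open file handle (for example `open("my_data.vcf", "w")`) instead of the `io.StringIO` buffer.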
-
Alternatively, if you just want to read it in as a massive matrix:

    import numpy as np
    import tsinfer

    # slurp it all into a big matrix: assumes all data for a seq is on one line
    # otherwise use a text editor to delete all newlines except those
    # followed by ">"
    binary_data = np.genfromtxt(
        "tmp.fasta",
        comments=">",  # ignore any lines starting with ">"
        delimiter=1,  # one char per value.
        dtype=int,
    )
    with tsinfer.SampleData(
        path="my_data.samples",
        sequence_length=binary_data.shape[1]
    ) as sd:
        for pos, column in enumerate(binary_data.T):  # iterate over transposed matrix
            sd.add_site(pos, column)
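If the sequences are wrapped over several lines, a small reader like the one below could replace the text-editor step. This is only a sketch under some assumptions: `read_binary_fasta` is a made-up helper name, the file is assumed to contain nothing but `>` header lines and lines of `0`/`1` characters, and all sequences are assumed to have the same length.

```python
def read_binary_fasta(lines):
    """Parse FASTA-style lines of 0/1 characters into (names, rows).

    Sequence data may be wrapped across several lines per record.
    """
    names, rows = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header starts a new record.
            names.append(line[1:].strip())
            rows.append([])
        else:
            if not rows:
                raise ValueError("sequence data found before the first '>' header")
            for ch in line:
                if ch not in "01":
                    raise ValueError(f"unexpected character {ch!r} in sequence data")
                rows[-1].append(int(ch))
    if len({len(r) for r in rows}) > 1:
        raise ValueError("sequences have different lengths")
    return names, rows


# Tiny in-memory example; seq1 is wrapped over two lines.
example = """>seq1
0101
10
>seq2
001110
""".splitlines()
names, rows = read_binary_fasta(example)
```

For a real file, pass the open file object (for example `read_binary_fasta(open("tmp.fasta"))`); `np.array(rows)` should then give the same kind of samples-by-sites matrix as `binary_data` above.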