This directory contains the scripts suitable for protein sequencing experiments which will identify proteins from a database data/uniprot-9606.fasta
. Given a decoded readout from a sequencing device (eg. sample-reads-P30419.csv
), main.py
will construct a HMM and scan the database to identify the protein corresponding to the readouts. Further instructions on setting the environment and running the program are given below.
python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
python dependencies
pandas 1.3.4
numpy 1.22.3
pyhmmer 0.6.3
# hmmsearch :: search profile(s) against a sequence database
# HMMER 3.3.2 (Nov 2020); http://hmmer.org/
# Copyright (C) 2020 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Usage: hmmsearch [options] <hmmfile> <seqdb>
Basic options:
-h : show brief help on version and usage
The input files are the decoded readouts from the sequencing device i.e.. they are the posterior probabilites of the signal after decoding. Therfore, this will be of size read length x 20. Few sample reads are provided as csv. For eg, this reading has length 434, therefore is a 434 x 20 matrix (headers 0, 1, .., 19 shown below for clarity and should not be in the readings).
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
1 | 0.80 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
2 | 0.01 | 0.01 | 0.01 | 0.80 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
3 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.80 | 0.01 | 0.01 | 0.01 | 0.01 |
4 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
431 | 0.80 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
432 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.80 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
433 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 | 0.05 |
python3 main.py -r <your_readings_file>
This will print and save the results of the found proteins inside the results
directory as a csv file. For eg:
user@host:~$ python3 main.py -r sample-reads-P30419.csv
Query Hit E-value Score
0 sample-reads-P30419.csv P30419 1.638253e-48 166.687073
1 sample-reads-P30419.csv O60551 2.344075e-19 70.342026
2 sample-reads-P30419.csv Q9UF47 3.997793e+00 6.790119
The sequence with highest score is the likely protein for the readouts.
The current database is ~20K human proteins as a fasta file inside data
directory. If you prefer a different database, please edit the database location inside functions/hmmer.py
file to the location of your database.
If your readouts contains low amount of errors, you can also adjust the transition probabilites inside the functions/hmmer.py
file.
EU Horizon 2020 grant agreement no. 964363
If you use this tool, please cite:
- Bikash Kumar Bhandari, Nick Goldman, A generalized protein identification method for novel and diverse sequencing technologies, NAR Genomics and Bioinformatics, Volume 6, Issue 3, September 2024, lqae126, https://doi.org/10.1093/nargab/lqae126