A little nicer
andrewdalpino committed Oct 15, 2024
1 parent 8785c20 commit d82ee9f
Showing 3 changed files with 15 additions and 9 deletions.
20 changes: 13 additions & 7 deletions README.md
@@ -2,6 +2,7 @@

A data structure and tokenization library for counting short DNA sequences for use in bioinformatics. DNA Hash stores k-mer sequence counts by their up2bit encoding, a two-way hash that works with variable-length sequences. As such, DNA Hash uses considerably less memory than a lookup table that stores sequences in plaintext. In addition, DNA Hash's novel autoscaling Bloom filter eliminates the need to explicitly store counts for sequences that have only been seen once.

- **Variable** sequence lengths
- **Ultra-low** memory footprint
- **Embarrassingly** parallelizable
- **Open-source** and free to use commercially
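
As a rough illustration of the counting scheme described above, the sketch below packs each k-mer into an integer (2 bits per base under a leading sentinel bit) and keeps explicit counts only for k-mers seen at least twice. It is not DNA Hash's implementation: the base-to-bits mapping, the function names `encode`, `kmers`, and `count_kmers`, and the plain `set` standing in for the autoscaling Bloom filter are all assumptions made for the example.

```python
# Hypothetical 2-bit codes; the mapping DNA Hash actually uses is not shown here.
BASE_ENCODE_MAP = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(sequence: str) -> int:
    """Pack a sequence into an int, 2 bits per base, under a leading sentinel bit."""
    code = 1  # sentinel bit keeps "A" (0b100) distinct from "AA" (0b10000)
    for base in reversed(sequence):
        code = (code << 2) | BASE_ENCODE_MAP[base]
    return code


def kmers(sequence: str, k: int):
    """Yield every overlapping window of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def count_kmers(sequence: str, k: int):
    """Count k-mers, storing explicit counts only for k-mers seen at least twice."""
    seen = set()  # exact stand-in for the Bloom filter (no false positives)
    counts = {}   # encoded k-mer -> count, only for k-mers seen 2+ times
    for kmer in kmers(sequence, k):
        key = encode(kmer)
        if key in counts:
            counts[key] += 1
        elif key in seen:
            counts[key] = 2  # second sighting: promote to an explicit count
        else:
            seen.add(key)    # first sighting: remember membership only
    num_singletons = len(seen) - len(counts)
    return counts, num_singletons


counts, num_singletons = count_kmers("ACGTACGT", 4)
print(counts, num_singletons)
```

Because of the sentinel bit, sequences of different lengths never collide, which is what lets a single table hold variable-length k-mers. A real Bloom filter would trade the exactness of the `set` for a fixed, much smaller memory footprint at the cost of occasional false positives.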
@@ -17,13 +18,6 @@ Install DNA Hash using a Python package manager, example pip:
pip install dnahash
```

## Parameters
| # | Name | Default | Type | Description |
|---|---|---|---|---|
| 1 | max_false_positive_rate | 0.01 | float | The upper bound on the false positive rate. |
| 2 | num_hashes | 4 | int | The number of hash functions used, i.e. the number of slices per layer. |
| 3 | layer_size | 32000000 | int | The size of each layer of the Bloom filter in bits. |

## Example Usage

```python
@@ -57,6 +51,18 @@ plt.ylabel('Frequency')
plt.show()
```

```
TAACAA: 70
TTAAAA: 68
ACAACA: 65
...
CATTAA: 49
Total sequences: 29876
# of unique sequences: 2013
# of singletons: 100
```

## References
- [1] https://github.com/JohnLonginotto/ACGTrie/blob/master/docs/UP2BIT.md.
- [2] P. Melsted et al. (2011). Efficient counting of k-mers in DNA sequences using a bloom filter.
Binary file added docs/images/sars-cov-2-histogram.png
4 changes: 2 additions & 2 deletions src/dna_hash/dna_hash.py
@@ -55,9 +55,9 @@ def _decode(cls, hash: int) -> str:
         sequence = ''

         for i in range(0, int(math.log(hash, 2)), 2):
-            base = (hash >> i) & 3
+            encoding = (hash >> i) & 3

-            sequence += cls.BASE_DECODE_MAP[base]
+            sequence += cls.BASE_DECODE_MAP[encoding]

         return sequence
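
For readers following the decoding loop in this hunk, here is a minimal standalone sketch of the same idea: read 2 bits at a time from the low end of the integer until the sentinel bit is reached. The mapping in `BASE_ENCODE_MAP` and the helper `encode` are assumptions added so the example can round-trip. The sketch also swaps in `int.bit_length()` to locate the sentinel, as an exact integer alternative to the floating-point `math.log` call used in the commit; this is not the author's code.

```python
# Hypothetical 2-bit codes; the library's real BASE_DECODE_MAP is not shown in the diff.
BASE_ENCODE_MAP = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE_DECODE_MAP = {bits: base for base, bits in BASE_ENCODE_MAP.items()}


def encode(sequence: str) -> int:
    """Assumed inverse of decode: first base in the lowest 2 bits, sentinel on top."""
    code = 1
    for base in reversed(sequence):
        code = (code << 2) | BASE_ENCODE_MAP[base]
    return code


def decode(code: int) -> str:
    """Recover the sequence by reading 2-bit groups below the sentinel bit."""
    sentinel_position = code.bit_length() - 1  # exact, no floating point involved
    bases = []
    for i in range(0, sentinel_position, 2):
        encoding = (code >> i) & 3
        bases.append(BASE_DECODE_MAP[encoding])
    return "".join(bases)


# Round trip on a sequence long enough that its encoding exceeds 64 bits.
long_kmer = "T" * 40
assert decode(encode(long_kmer)) == long_kmer
print(decode(encode("GATTACA")))
```

Collecting the bases in a list and joining once avoids repeated string concatenation, though for short k-mers the difference is negligible.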
