Merge pull request #11 from krassowski/testing-and-docs
Document `parse_dbsnp_variants`, add py3.11
krassowski authored Jan 7, 2023
2 parents 2bfdcbd + 2a2ffec commit 19376dd
Showing 10 changed files with 552 additions and 29 deletions.
12 changes: 9 additions & 3 deletions .github/workflows/job.test.yml
Original file line number Diff line number Diff line change
@@ -12,7 +12,7 @@ jobs:
runs-on: ${{ matrix.os }}
strategy:
matrix:
python: ["3.7", "3.8", "3.9", "3.10"]
python: ["3.7", "3.8", "3.9", "3.10", "3.11"]
os: [ubuntu-latest, macos-latest, windows-latest]
steps:
- uses: actions/checkout@v3
@@ -24,6 +24,12 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install -r requirements/dev.txt
- name: Run tests
- name: Run core tests
run: |
python -m pytest
python -m pytest -m "not optional"
- name: Install optional dependencies
run: |
pip install -r requirements/optional.txt
- name: Run optional tests
run: |
python -m pytest -m optional
63 changes: 53 additions & 10 deletions README.md
@@ -15,7 +15,7 @@ Easy-entrez:
- does not use the stateful API, because it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7), as the alternative *entrezpy* illustrates.


**Status:** beta (pending tutorial write-up and documentation improvements before official release).
### Examples

```python
from easy_entrez import EntrezAPI
@@ -38,7 +38,7 @@ See more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-e

For a real-world example (i.e. used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)) see notebooks in [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.

#### Example: fetching genes for a variant from dbSNP
#### Fetching genes for a variant from dbSNP

Fetch the SNP record for `rs6311`:

@@ -84,7 +84,7 @@ print(gene_names)

> `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`
#### Example: obtaining the chromosomal position from SNP rsID number
#### Obtaining the chromosomal position from SNP rsID number

```python
from pandas import DataFrame
@@ -111,7 +111,47 @@ variant_positions
> | 1 | rs662138 | 6 | 160143444 |

#### Example: obtaining the SNP rs ID number from chromosomal position
#### Converting full variation/mutation data to tabular format

Parsing utilities can quickly extract the data into a `VariantSet` object
holding pandas `DataFrame`s with coordinates and alternative allele frequencies:

```python
from easy_entrez.parsing import parse_dbsnp_variants

variants = parse_dbsnp_variants(result)
variants
```

> `<VariantSet with 2 variants>`

To get the coordinates:

```python
variants.coordinates
```

> | rs_id | ref | alts | chrom | pos | chrom_prev | pos_prev | consequence |
> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|
> | rs6311 | C | A,T | 13 | 46897343 | 13 | 47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |
> | rs662138 | C | G | 6 | 160143444 | 6 | 160564476 | intron_variant |

For frequencies:

```python
variants.alt_frequencies.head(5) # using head to only display first 5 for brevity
```

> | | rs_id | allele | source_frequency | total_count | study | count |
> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|
> | 0 | rs6311 | T | 0.44349 | 2221 | 1000Genomes | 984.991 |
> | 1 | rs6311 | T | 0.411261 | 1585 | ALSPAC | 651.849 |
> | 2 | rs6311 | T | 0.331696 | 1486 | Estonian | 492.9 |
> | 3 | rs6311 | T | 0.35 | 14 | GENOME_DK | 4.9 |
> | 4 | rs6311 | T | 0.402529 | 56309 | GnomAD | 22666 |
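
In the sample rows shown here, the `count` column appears to equal `source_frequency × total_count`. The standard-library sketch below checks that relationship on those five rows. The `rows` list and the `expected_count` helper are illustrative names, not part of easy-entrez, and the relationship is inferred from the displayed values rather than from the parser's documentation.

```python
from math import isclose

# (study, source_frequency, total_count, count) copied from the rs6311 sample rows.
rows = [
    ('1000Genomes', 0.44349, 2221, 984.991),
    ('ALSPAC', 0.411261, 1585, 651.849),
    ('Estonian', 0.331696, 1486, 492.9),
    ('GENOME_DK', 0.35, 14, 4.9),
    ('GnomAD', 0.402529, 56309, 22666),
]


def expected_count(source_frequency, total_count):
    """Estimated number of observations of the allele (assumed relationship)."""
    return source_frequency * total_count


for study, frequency, total, count in rows:
    # A small relative tolerance absorbs the rounding of the displayed values.
    consistent = isclose(expected_count(frequency, total), count, rel_tol=1e-4)
    print(f'{study}: count consistent with frequency * total = {consistent}')
```

If the relationship holds for your data as well, `count` can be non-integer because it is derived from a rounded frequency rather than reported directly.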

#### Obtaining the SNP rs ID number from chromosomal position

You can use the query string directly:

@@ -143,7 +183,7 @@ The base position should use the latest genome assembly (GRCh38 at the time of w
you can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.
For more information on the arguments accepted by the SNP database, see the [Entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on the NCBI website.

### Example: find PubMed ID from DOI
#### Find PubMed ID from DOI

When searching the GWAS Catalog, a PMID is needed rather than a DOI. You can convert one to the other using:

@@ -183,14 +223,17 @@ If you wish to enable (optional, tqdm-based) progress bars use:
pip install easy-entrez[with_progress_bars]
```

### Alternatives:
If you wish to enable (optional, pandas-based) parsing utilities use:

```bash
pip install easy-entrez[with_parsing_utils]
```

### Alternatives

You might want to try:

- [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably a good choice if you already use it
- [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses
- [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite different API

I have tried and do not recommend:

- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - in addition to the history problems, watch out for [documentation issues](https://gitlab.com/ncbipy/entrezpy/-/issues/8) and basically [no reaction](https://gitlab.com/ncbipy/entrezpy/-/merge_requests/1) to pull requests.
- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since
3 changes: 2 additions & 1 deletion docs/conf.py
@@ -18,7 +18,7 @@
# -- Project information -----------------------------------------------------

project = 'Easy-entrez'
copyright = '2020, Michał Krassowski'
copyright = '2023, Michał Krassowski'
author = 'Michał Krassowski'

# -- General configuration ---------------------------------------------------
@@ -32,6 +32,7 @@
'sphinx.ext.napoleon',
'sphinx_autodoc_typehints',
'sphinx_copybutton',
'myst_parser',
# 'sphinx.ext.linkcode'
# todo something like https://github.com/numpy/numpy/blob/master/doc/source/conf.py?
]
6 changes: 3 additions & 3 deletions docs/index.rst
@@ -3,19 +3,19 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Easy-entrez documentation
=======================================
.. include:: ../README.md
:parser: myst_parser.sphinx_

.. toctree::
:maxdepth: 2
:caption: Contents:

usage
queries
parsing
types



Indices and tables
==================

8 changes: 8 additions & 0 deletions docs/parsing.rst
@@ -0,0 +1,8 @@
**********************
Parsing
**********************

.. currentmodule:: easy_entrez.parsing

.. automodule:: easy_entrez.parsing
:undoc-members:
44 changes: 38 additions & 6 deletions easy_entrez/parsing.py
@@ -17,20 +17,40 @@
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}


def xml_to_string(element):
def xml_to_string(element, indent=' ' * 4):
"""Convert provided XML element to pretty indented string.
Parameters:
element: the XML element to convert (`data` attribute of entrez result)
indent: the indentation to use, 4 spaces by default
"""
return (
minidom.parseString(ElementTree.tostring(element))
.toprettyxml(indent=' ' * 4)
.toprettyxml(indent=indent)
)


@dataclass
class VariantSet:
"""Result of parsing with `parse_dbsnp_variants()`."""
#: Coordinates of the SNPs in the genome and consequence (e.g. intron_variant).
coordinates: DataFrame
#: Frequencies of the alternative alleles.
alt_frequencies: DataFrame
#: Preferred identifiers map (old → new); old != new for merged variants.
preferred_ids: dict

def __repr__(self):
return f'<VariantSet with {len(self.coordinates)} variants>'


def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) -> VariantSet:
"""Parse coordinates, frequencies and preferred IDs of dbSNP variants.
Parameters:
snps_result: result of a fetch query in XML format, usually against the `'snp'` database
verbose: whether to print out full problematic XML if SPDI cannot be parsed
"""
if DataFrame is None:
raise ValueError('pandas is required for parse_dbsnp_variants')
if not is_xml_response(snps_result):
@@ -41,13 +61,14 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->

results = []
alt_frequencies = []
preferred_id = {}

for i, snp in enumerate(snps):
error = snp.find('.//ns0:error', namespaces)
if error is not None:
warn(f'Failed to retrieve {snps_result.query.ids[i]} due to error: {error.text}')
continue
rs_id = snp.find('.//ns0:SNP_ID', namespaces).text
rs_id = snp.attrib['uid']
spdi_text = snp.find('.//ns0:SPDI', namespaces).text
if not spdi_text:
warn(f'Failed to retrieve {snps_result.query.ids[i]}: SPDI not found')
@@ -59,6 +80,13 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->
chrom_prev, pos_prev = snp.find('.//ns0:CHRPOS_PREV_ASSM', namespaces).text.split(':')
sig_class = snp.find('.//ns0:FXN_CLASS', namespaces).text

merged_into = snp.find('.//ns0:SNP_ID', namespaces).text
if rs_id != merged_into:
was_merged = snp.find('.//ns0:MERGED_SORT', namespaces).text
assert was_merged == '1'

preferred_id[f'rs{rs_id}'] = f'rs{merged_into}'

expected_ref = {
s.split(':')[-2]
for s in spdi
@@ -101,13 +129,17 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->
'ref': list(expected_ref)[0],
'alts': ','.join(expected_alt),
'chrom': chrom,
'pos': float(pos),
'pos': int(pos),
'chrom_prev': chrom_prev,
'pos_prev': float(pos_prev),
'pos_prev': int(pos_prev),
'consequence': sig_class
})

return VariantSet(
coordinates=DataFrame(results).set_index('rs_id'),
alt_frequencies=DataFrame(alt_frequencies)
alt_frequencies=DataFrame(alt_frequencies),
preferred_ids=preferred_id
)


__all__ = ['VariantSet', 'parse_dbsnp_variants', 'xml_to_string', 'namespaces']
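
The change above reads the requested identifier from the `uid` attribute and compares it with the `SNP_ID` element to detect merged variants. A minimal standard-library sketch of that idea follows. The XML document is a hypothetical, heavily trimmed stand-in (the `ExchangeSet`/`DocumentSummary` element names and the numeric IDs are assumptions for illustration, not a real dbSNP response), and `preferred_ids_from_docsum` is not a function of easy-entrez.

```python
from xml.etree import ElementTree

# Namespace mapping as defined in easy_entrez/parsing.py.
namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}

# Hypothetical, trimmed docsum: element layout and IDs are made up.
SAMPLE_XML = """
<ns0:ExchangeSet xmlns:ns0="https://www.ncbi.nlm.nih.gov/SNP/docsum">
    <ns0:DocumentSummary uid="111">
        <ns0:SNP_ID>111</ns0:SNP_ID>
    </ns0:DocumentSummary>
    <ns0:DocumentSummary uid="222">
        <ns0:SNP_ID>333</ns0:SNP_ID>
    </ns0:DocumentSummary>
</ns0:ExchangeSet>
"""


def preferred_ids_from_docsum(root):
    """Map each requested rs ID to the ID reported in its SNP_ID element."""
    preferred = {}
    for snp in root.findall('.//ns0:DocumentSummary', namespaces):
        requested = snp.attrib['uid']
        current = snp.find('.//ns0:SNP_ID', namespaces).text
        preferred[f'rs{requested}'] = f'rs{current}'
    return preferred


root = ElementTree.fromstring(SAMPLE_XML.strip())
preferred_ids = preferred_ids_from_docsum(root)

# Entries where old != new correspond to merged variants.
merged = {old: new for old, new in preferred_ids.items() if old != new}
print(merged)  # {'rs222': 'rs333'}
```

Keeping the full old-to-new map, rather than only the merged entries, lets callers translate any requested ID without first checking whether it was merged.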
3 changes: 2 additions & 1 deletion requirements/docs.txt
@@ -1,5 +1,6 @@
-r ./minimal.txt
sphinx==3.2.1
sphinx<6.0
pydata-sphinx-theme
sphinx-autodoc-typehints
sphinx-copybutton
myst-parser
10 changes: 6 additions & 4 deletions setup.py
@@ -14,7 +14,7 @@ def get_long_description(file_name):
package_data={'easy_entrez': ['data/*.tsv', 'py.typed']},
# required for mypy to work
zip_safe=False,
version='0.3.3',
version='0.3.4',
license='MIT',
description='Python REST API for Entrez E-Utilities: stateless, easy to use, reliable.',
long_description=get_long_description('README.md'),
@@ -24,7 +24,7 @@ def get_long_description(file_name):
url='https://github.com/krassowski/easy-entrez',
keywords=['entrez', 'pubmed', 'e-utilities', 'ncbi', 'rest', 'api', 'dbsnp', 'literature', 'mining'],
classifiers=[
'Development Status :: 4 - Beta',
'Development Status :: 5 - Production/Stable',
'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
'Operating System :: Microsoft :: Windows',
'Operating System :: POSIX :: Linux',
@@ -39,11 +39,13 @@ def get_long_description(file_name):
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: 3.9'
'Programming Language :: Python :: 3.9',
'Programming Language :: Python :: 3.10',
'Programming Language :: Python :: 3.11'
],
install_requires=['requests', 'typing_extensions', 'dataclasses>="0.7";python_version<"3.7"'],
extras_require={
'with_progress_bars': ['tqdm'],
'parsing': ['tqdm']
'with_parsing_utils': ['pandas']
}
)
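
The `extras_require` entries make tqdm and pandas optional, and `parse_dbsnp_variants` raises `ValueError` when `DataFrame is None`. The sketch below shows one way such an optional-dependency guard can be written with the standard library only. `optional_import` and `require` are hypothetical helpers for illustration, not functions provided by easy-entrez, and the actual import code in `parsing.py` is not shown in this diff.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def require(module, feature, extra):
    """Raise a helpful error when an optional dependency is missing."""
    if module is None:
        raise ValueError(
            f'{feature} needs an optional dependency; '
            f'install it with: pip install easy-entrez[{extra}]'
        )
    return module


# pandas may or may not be present; the guard only fails when it is used.
pandas = optional_import('pandas')
if pandas is None:
    print('pandas is not installed; parsing utilities are unavailable')
else:
    print('pandas is available')
```

Deferring the error until the feature is actually called keeps the core package importable with only the minimal requirements installed.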
1 change: 0 additions & 1 deletion tests/test_api.py
@@ -41,4 +41,3 @@ def test_fetch():

with raises(ValueError, match='Received str but a list-like container of identifiers was expected'):
entrez_api.fetch('4', max_results=1, database='snp')
