Merge pull request #11 from krassowski/testing-and-docs

Document `parse_dbsnp_variants`, add py3.11
krassowski · Jan 7, 2023 · 19376dd
2 parents 2bfdcbd + 2a2ffec
commit 19376dd

Showing 10 changed files with 552 additions and 29 deletions.
diff --git a/.github/workflows/job.test.yml b/.github/workflows/job.test.yml
@@ -12,7 +12,7 @@ jobs:
     runs-on: ${{ matrix.os }}
     strategy:
       matrix:
-        python: ["3.7", "3.8", "3.9", "3.10"]
+        python: ["3.7", "3.8", "3.9", "3.10", "3.11"]
         os: [ubuntu-latest, macos-latest, windows-latest]
     steps:
     - uses: actions/checkout@v3
@@ -24,6 +24,12 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         pip install -r requirements/dev.txt
-    - name: Run tests
+    - name: Run core tests
       run: |
-        python -m pytest
+        python -m pytest -m "not optional"
+    - name: Install optional dependencies
+      run: |
+        pip install -r requirements/optional.txt
+    - name: Run optional tests
+      run: |
+        python -m pytest -m optional
diff --git a/README.md b/README.md
@@ -15,7 +15,7 @@ Easy-entrez:
 - does not use the stateful API as it is [error-prone](https://gitlab.com/ncbipy/entrezpy/-/issues/7) as seen on example of the alternative *entrezpy*.
 
 
-**Status:** beta (pending tutorial write-up and documentation improvements before official release).
+### Examples
 
 ```python
 from easy_entrez import EntrezAPI
@@ -38,7 +38,7 @@ See more in the [Demo notebook](./Demo.ipynb) and [documentation](https://easy-e
 
 For a real-world example (i.e. used for [this publication](https://www.frontiersin.org/articles/10.3389/fgene.2020.610798/full)) see notebooks in [multi-omics-state-of-the-field](https://github.com/krassowski/multi-omics-state-of-the-field) repository.
 
-#### Example: fetching genes for a variant from dbSNP 
+#### Fetching genes for a variant from dbSNP
 
 Fetch the SNP record for `rs6311`:
 
@@ -84,7 +84,7 @@ print(gene_names)
 
 > `{'rs6311': ['HTR2A'], 'rs662138': ['SLC22A1']}`
 
-#### Example: obtaining the chromosomal position from SNP rsID number
+#### Obtaining the chromosomal position from SNP rsID number
 
 ```python
 from pandas import DataFrame
@@ -111,7 +111,47 @@ variant_positions
 > |  1 | rs662138 |            6 |  160143444 |
 
 
-#### Example: obtaining the SNP rs ID number from chromosomal position
+#### Converting full variation/mutation data to tabular format
+
+Parsing utilities can quickly extract the data to a `VariantSet` object
+holding pandas `DataFrame`s with coordinates and alternative alleles frequencies:
+
+```python
+from easy_entrez.parsing import parse_dbsnp_variants
+
+variants = parse_dbsnp_variants(result)
+variants
+```
+
+> `<VariantSet with 2 variants>`
+
+To get the coordinates:
+
+```python
+variants.coordinates
+```
+
+> | rs_id    | ref   | alts   |   chrom |       pos |   chrom_prev |   pos_prev | consequence                                                                  |
+> |:---------|:------|:-------|--------:|----------:|-------------:|-----------:|:-----------------------------------------------------------------------------|
> | rs6311   | C     | A,T    |      13 |  46897343 |           13 |   47471478 | upstream_transcript_variant,intron_variant,genic_upstream_transcript_variant |
> | rs662138 | C     | G      |       6 | 160143444 |            6 |  160564476 | intron_variant                                                               |
+
+For frequencies:
+
+```python
+variants.alt_frequencies.head(5)  # using head to only display first 5 for brevity
+```
+
+> |    | rs_id   | allele   |   source_frequency |   total_count | study       |     count |
+> |---:|:--------|:---------|-------------------:|--------------:|:------------|----------:|
+> |  0 | rs6311  | T        |           0.44349  |          2221 | 1000Genomes |   984.991 |
+> |  1 | rs6311  | T        |           0.411261 |          1585 | ALSPAC      |   651.849 |
+> |  2 | rs6311  | T        |           0.331696 |          1486 | Estonian    |   492.9   |
+> |  3 | rs6311  | T        |           0.35     |            14 | GENOME_DK   |     4.9   |
+> |  4 | rs6311  | T        |           0.402529 |         56309 | GnomAD      | 22666     |
+
+
+#### Obtaining the SNP rs ID number from chromosomal position
 
 You can use the query string directly:
 
@@ -143,7 +183,7 @@ The base position should use the latest genome assembly (GRCh38 at the time of w
 you can use the position in previous assembly coordinates by replacing `POSITION` with `POSITION_GRCH37`.
 For more information of the arguments accepted by the SNP database see the [entrez help page](https://www.ncbi.nlm.nih.gov/snp/docs/entrez_help/) on NCBI website.
 
-### Example: find PubMed ID from DOI
+#### Find PubMed ID from DOI
 
 When searching the GWAS catalog, a PMID is needed rather than a DOI. You can convert one to the other using:
 
@@ -183,14 +223,17 @@ If you wish to enable (optional, tqdm-based) progress bars use:
 pip install easy-entrez[with_progress_bars]
 ```
 
-### Alternatives:
+If you wish to enable (optional, pandas-based) parsing utilities use:
+
+```bash
+pip install easy-entrez[with_parsing_utils]
+```
+
+### Alternatives
 
 You might want to try:
 
 - [biopython.Entrez](https://biopython.org/docs/1.74/api/Bio.Entrez.html) - biopython is a heavy dependency, but probably good choice if you already use it
 - [pubmedpy](https://github.com/dhimmel/pubmedpy) - provides interesting utilities for parsing the responses
 - [entrez](https://github.com/jordibc/entrez) - appears to have a comparable scope but quite different API
-
-I have tried and do not recommend:
-
-- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - in addition to the history problems, watch out for [documentation issues](https://gitlab.com/ncbipy/entrezpy/-/issues/8) and basically [no reaction](https://gitlab.com/ncbipy/entrezpy/-/merge_requests/1) to pull requests.
+- [entrezpy](https://gitlab.com/ncbipy/entrezpy) - this one did not work well for me (hence this package), but may have improved since
diff --git a/docs/conf.py b/docs/conf.py
@@ -18,7 +18,7 @@
 # -- Project information -----------------------------------------------------
 
 project = 'Easy-entrez'
-copyright = '2020, Michał Krassowski'
+copyright = '2023, Michał Krassowski'
 author = 'Michał Krassowski'
 
 # -- General configuration ---------------------------------------------------
@@ -32,6 +32,7 @@
     'sphinx.ext.napoleon',
     'sphinx_autodoc_typehints',
     'sphinx_copybutton',
+    'myst_parser',
     # 'sphinx.ext.linkcode'
     # todo something like https://github.com/numpy/numpy/blob/master/doc/source/conf.py?
 ]

diff --git a/docs/index.rst b/docs/index.rst
@@ -3,19 +3,19 @@
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
 
-Easy-entrez documentation
-=======================================
+.. include:: ../README.md
+   :parser: myst_parser.sphinx_
 
 .. toctree::
    :maxdepth: 2
    :caption: Contents:
 
    usage
    queries
+   parsing
    types
 
 
-
 Indices and tables
 ==================
 

diff --git a/docs/parsing.rst b/docs/parsing.rst
@@ -0,0 +1,8 @@
+**********************
+Parsing
+**********************
+
+.. currentmodule:: easy_entrez.parsing
+
+.. automodule:: easy_entrez.parsing
+    :undoc-members:
diff --git a/easy_entrez/parsing.py b/easy_entrez/parsing.py
@@ -17,20 +17,40 @@
 namespaces = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}
 
 
-def xml_to_string(element):
+def xml_to_string(element, indent=' ' * 4):
+    """Convert provided XML element to pretty indented string.
+
+    Parameters:
+        element: the XML element to convert (`data` attribute of entrez result)
+        indent: the indentation to use, 4 spaces by default
+    """
     return (
         minidom.parseString(ElementTree.tostring(element))
-        .toprettyxml(indent=' ' * 4)
+        .toprettyxml(indent=indent)
     )
 
 
 @dataclass
 class VariantSet:
+    """Result of parsing with `parse_dbsnp_variants()`."""
+    #: Coordinates of the SNPs in the genome and consequence (e.g. intron_variant).
     coordinates: DataFrame
+    #: Frequencies of the alternative alleles.
     alt_frequencies: DataFrame
+    #: Preferred identifiers map (old → new); old != new for merged variants.
+    preferred_ids: dict
+
+    def __repr__(self):
+        return f'<VariantSet with {len(self.coordinates)} variants>'
 
 
 def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) -> VariantSet:
+    """Parse coordinates, frequencies and preferred IDs of dbSNP variants.
+
+    Parameters:
+        snps_result: result of fetch query in XML format, usually to `'snp'` database
+        verbose: whether to print out full problematic XML if SPDI cannot be parsed
+    """
     if DataFrame is None:
         raise ValueError('pandas is required for parser_dbsnp_variants')
     if not is_xml_response(snps_result):
@@ -41,13 +61,14 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->
 
     results = []
     alt_frequencies = []
+    preferred_id = {}
 
     for i, snp in enumerate(snps):
         error = snp.find('.//ns0:error', namespaces)
         if error is not None:
             warn(f'Failed to retrieve {snps_result.query.ids[i]} due to error: {error.text}')
             continue
-        rs_id = snp.find('.//ns0:SNP_ID', namespaces).text
+        rs_id = snp.attrib['uid']
         spdi_text = snp.find('.//ns0:SPDI', namespaces).text
         if not spdi_text:
             warn(f'Failed to retrieve {snps_result.query.ids[i]}: SPDI not found')
@@ -59,6 +80,13 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->
         chrom_prev, pos_prev = snp.find('.//ns0:CHRPOS_PREV_ASSM', namespaces).text.split(':')
         sig_class = snp.find('.//ns0:FXN_CLASS', namespaces).text
 
+        merged_into = snp.find('.//ns0:SNP_ID', namespaces).text
+        if rs_id != merged_into:
+            was_merged = snp.find('.//ns0:MERGED_SORT', namespaces).text
+            assert was_merged == '1'
+
+        preferred_id[f'rs{rs_id}'] = f'rs{merged_into}'
+
         expected_ref = {
             s.split(':')[-2]
             for s in spdi
@@ -101,13 +129,17 @@ def parse_dbsnp_variants(snps_result: EntrezResponse, verbose: bool = False) ->
             'ref': list(expected_ref)[0],
             'alts': ','.join(expected_alt),
             'chrom': chrom,
-            'pos': float(pos),
+            'pos': int(pos),
             'chrom_prev': chrom_prev,
-            'pos_prev': float(pos_prev),
+            'pos_prev': int(pos_prev),
             'consequence': sig_class
         })
 
     return VariantSet(
         coordinates=DataFrame(results).set_index('rs_id'),
-        alt_frequencies=DataFrame(alt_frequencies)
+        alt_frequencies=DataFrame(alt_frequencies),
+        preferred_ids=preferred_id
     )
+
+
+__all__ = ['VariantSet', 'parse_dbsnp_variants', 'xml_to_string', 'namespaces']
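
The `preferred_ids` mapping introduced in this diff can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: the XML is hand-written, and the `ExchangeSet`/`DocumentSummary` element names, the rs numbers, and the helper `build_preferred_ids` are assumptions made for illustration. Only the `uid` attribute, the `SNP_ID` element, and the namespace URI are taken from the diff above.

```python
from xml.etree import ElementTree

# Namespace URI as declared in easy_entrez/parsing.py.
NAMESPACES = {'ns0': 'https://www.ncbi.nlm.nih.gov/SNP/docsum'}

# Hypothetical, hand-written records; real dbSNP responses carry many more fields.
TOY_XML = """
<ExchangeSet xmlns="https://www.ncbi.nlm.nih.gov/SNP/docsum">
  <DocumentSummary uid="1000">
    <SNP_ID>2000</SNP_ID>
  </DocumentSummary>
  <DocumentSummary uid="3000">
    <SNP_ID>3000</SNP_ID>
  </DocumentSummary>
</ExchangeSet>
""".strip()


def build_preferred_ids(root):
    """Map each requested rs ID (the `uid` attribute) to the ID in `SNP_ID`."""
    mapping = {}
    for snp in root.findall('.//ns0:DocumentSummary', NAMESPACES):
        requested = snp.attrib['uid']
        current = snp.find('.//ns0:SNP_ID', NAMESPACES).text
        mapping[f'rs{requested}'] = f'rs{current}'
    return mapping


root = ElementTree.fromstring(TOY_XML)
mapping = build_preferred_ids(root)
# Records whose requested ID differs from the preferred one were merged.
merged = {old: new for old, new in mapping.items() if old != new}
print(mapping)
print(merged)
```

Under these assumptions, a caller could use such a mapping to rename identifiers in downstream tables when a requested variant has been merged into another record.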
diff --git a/requirements/docs.txt b/requirements/docs.txt
@@ -1,5 +1,6 @@
 -r ./minimal.txt
-sphinx==3.2.1
+sphinx<6.0
 pydata-sphinx-theme
 sphinx-autodoc-typehints
 sphinx-copybutton
+myst-parser
diff --git a/setup.py b/setup.py
@@ -14,7 +14,7 @@ def get_long_description(file_name):
         package_data={'easy_entrez': ['data/*.tsv', 'py.typed']},
         # required for mypy to work
         zip_safe=False,
-        version='0.3.3',
+        version='0.3.4',
         license='MIT',
         description='Python REST API for Entrez E-Utilities: stateless, easy to use, reliable.',
         long_description=get_long_description('README.md'),
@@ -24,7 +24,7 @@ def get_long_description(file_name):
         url='https://github.com/krassowski/easy-entrez',
         keywords=['entrez', 'pubmed', 'e-utilities', 'ncbi', 'rest', 'api', 'dbsnp', 'literature', 'mining'],
         classifiers=[
-            'Development Status :: 4 - Beta',
+            'Development Status :: 5 - Production/Stable',
             'License :: OSI Approved :: GNU Lesser General Public License v3 (LGPLv3)',
             'Operating System :: Microsoft :: Windows',
             'Operating System :: POSIX :: Linux',
@@ -39,11 +39,13 @@ def get_long_description(file_name):
             'Programming Language :: Python :: 3.6',
             'Programming Language :: Python :: 3.7',
             'Programming Language :: Python :: 3.8',
-            'Programming Language :: Python :: 3.9'
+            'Programming Language :: Python :: 3.9',
+            'Programming Language :: Python :: 3.10',
+            'Programming Language :: Python :: 3.11'
         ],
         install_requires=['requests', 'typing_extensions', 'dataclasses>="0.7";python_version<"3.7"'],
         extras_require={
             'with_progress_bars': ['tqdm'],
-            'parsing': ['tqdm']
+            'with_parsing_utils': ['pandas']
         }
     )
diff --git a/tests/test_api.py b/tests/test_api.py
@@ -41,4 +41,3 @@ def test_fetch():
 
     with raises(ValueError, match='Received str but a list-like container of identifiers was expected'):
         entrez_api.fetch('4', max_results=1, database='snp')
-
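
As a usage sketch for the new `indent` parameter of `xml_to_string`, the snippet below copies the function body from the `easy_entrez/parsing.py` diff so that it runs without installing the package. The toy XML element is an assumption for illustration; in practice the element would be the `data` attribute of an Entrez result, as the docstring states.

```python
from xml.dom import minidom
from xml.etree import ElementTree


def xml_to_string(element, indent=' ' * 4):
    """Convert provided XML element to pretty indented string (copied from the diff)."""
    return (
        minidom.parseString(ElementTree.tostring(element))
        .toprettyxml(indent=indent)
    )


# Hypothetical element standing in for the `data` attribute of an Entrez result.
element = ElementTree.fromstring('<a><b>text</b></a>')

pretty_default = xml_to_string(element)         # four-space indentation
pretty_two = xml_to_string(element, indent='  ')  # two-space indentation
print(pretty_two)
```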