Implement input API and processor for EVEX database #1393

Open
wants to merge 35 commits into
base: master
87de13d
Add downloader for EVEX standoff
bgyori Sep 14, 2022
4208a54
Implement Evex source file processing
bgyori Sep 15, 2022
a2f7730
Implement basic Statement extraction
bgyori Sep 15, 2022
2455ea1
Extract text references in evidence
bgyori Sep 15, 2022
623d410
Implement standoff file processing
bgyori Sep 19, 2022
6aaea09
Get evidence text based on offset and use dataclasses
bgyori Sep 19, 2022
3135c77
Deal with unresolved regulations
bgyori Sep 19, 2022
8e17396
Start implementing regulation finding
bgyori Sep 19, 2022
a00e970
Extract negation
bgyori Sep 20, 2022
afe6ebb
First end-to-end text getting
bgyori Sep 20, 2022
acd7ed6
Handle more special cases
bgyori Sep 20, 2022
4fbcd5c
Implement standoff cache
bgyori Sep 20, 2022
e961cfa
Expose useful code in init
bgyori Sep 20, 2022
687b991
Cache standoff index since it's slow to build
bgyori Sep 21, 2022
6bf2502
Handle corner cases and implement better debugging
bgyori Sep 21, 2022
c7422d2
Implement polarity handling and path finding in standoff
bgyori Sep 26, 2022
60dd3b6
Fix bugs and corner cases for Regulation handling
bgyori Sep 26, 2022
d5b91ea
Construct annotated path
bgyori Sep 26, 2022
ee94629
Implement constraint-based path matching
bgyori Sep 27, 2022
c8af8da
Propagate agent texts into annotations
bgyori Sep 27, 2022
4e4c951
Fix getting evidence text
bgyori Sep 27, 2022
5cbb0c5
Fix handling of list arguments when finding entrez
bgyori Sep 28, 2022
0bf8a93
Refactor processing loop into functions
bgyori Sep 28, 2022
67ec544
Handle multiple articles for a general event ID
bgyori Sep 28, 2022
56386c2
Handle matching binding
bgyori Sep 28, 2022
b08d05e
Add EVEX to package and docs
bgyori Sep 28, 2022
eb8c0e3
Allow missing standoff
bgyori Sep 28, 2022
5f9d030
Add end-to-end test and cleanup
bgyori Sep 28, 2022
ed9d962
Annotation should be string
bgyori Sep 28, 2022
1d4b63e
Add more evidence info and refactor static functions
bgyori Sep 28, 2022
4158aa9
Fix setting coordinates
bgyori Sep 28, 2022
d0e424a
Add test for evidence-level uniqueness
bgyori Sep 28, 2022
4a1393c
Add docstrings for EXEV API and allow custom base folder
bgyori Sep 28, 2022
09fe17a
Add more comments to processor code
bgyori Sep 28, 2022
3a1416d
Add negation, speculation and confidence to evidence
bgyori Sep 29, 2022
1 change: 1 addition & 0 deletions README.md
@@ -57,6 +57,7 @@ Reading systems:
| Sparser | [`indra.sources.sparser`](https://indra.readthedocs.io/en/latest/modules/sources/sparser/index.html#) | https://github.com/ddmcdonald/sparser |
| Eidos | [`indra.sources.eidos`](https://indra.readthedocs.io/en/latest/modules/sources/eidos/index.html#) | https://github.com/clulab/eidos |
| TEES | [`indra.sources.tees`](https://indra.readthedocs.io/en/latest/modules/sources/tees/index.html) | https://github.com/jbjorne/TEES |
| EVEX | [`indra.sources.evex`](https://indra.readthedocs.io/en/latest/modules/sources/evex/index.html) | http://evexdb.org/ |
| MedScan | [`indra.sources.medscan`](https://indra.readthedocs.io/en/latest/modules/sources/medscan/index.html) | https://doi.org/10.1093/bioinformatics/btg207 |
| RLIMS-P | [`indra.sources.rlimsp`](https://indra.readthedocs.io/en/latest/modules/sources/rlimsp/index.html) | https://research.bioinformatics.udel.edu/rlimsp |
| ISI/AMR | [`indra.sources.isi`](https://indra.readthedocs.io/en/latest/modules/sources/isi/index.html) | https://github.com/sgarg87/big_mech_isi_gg |
17 changes: 17 additions & 0 deletions doc/modules/sources/evex.rst
@@ -0,0 +1,17 @@
EVEX (:py:mod:`indra.sources.evex`)
===================================

.. automodule:: indra.sources.evex
    :members:

EVEX API (:py:mod:`indra.sources.evex.api`)
-------------------------------------------

.. automodule:: indra.sources.evex.api
    :members:

EVEX Processor (:py:mod:`indra.sources.evex.processor`)
-------------------------------------------------------

.. automodule:: indra.sources.evex.processor
    :members:
1 change: 1 addition & 0 deletions doc/modules/sources/index.rst
@@ -16,6 +16,7 @@ Reading Systems
sparser/index
medscan/index
tees/index
evex
isi/index
geneways/index
rlimsp/index
6 changes: 4 additions & 2 deletions indra/resources/default_belief_probs.json
@@ -34,7 +34,8 @@
 "creeds": 0.01,
 "ubibrowser": 0.01,
 "acsn": 0.01,
-"semrep": 0.05
+"semrep": 0.05,
+"evex": 0.05
 },
 "rand": {
 "eidos": 0.3,
@@ -71,6 +72,7 @@
 "creeds": 0.1,
 "ubibrowser": 0.1,
 "acsn": 0.1,
-"semrep": 0.3
+"semrep": 0.3,
+"evex": 0.3
 }
 }
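To make the two new numbers concrete, here is a minimal sketch of how they could translate into a belief score. It assumes that the first hunk belongs to a section holding systematic error rates (its key is not visible in the diff, `syst` is assumed here), that the second hunk belongs to the `rand` section, and that belief for a single source with `n` evidences is `1 - (syst + rand ** n)`. The function name is hypothetical and this is not presented as INDRA's actual scoring code.

```python
def belief_from_evidence(syst: float, rand: float, n_evidence: int) -> float:
    """Belief for one source, assuming the model 1 - (syst + rand ** n)."""
    if n_evidence < 1:
        raise ValueError('n_evidence must be at least 1')
    return 1.0 - (syst + rand ** n_evidence)


# Values added for "evex" in this diff (section names are an assumption)
EVEX_SYST = 0.05
EVEX_RAND = 0.3

# Belief for 1, 2 and 5 independent EVEX evidences under the assumed model
beliefs = {n: belief_from_evidence(EVEX_SYST, EVEX_RAND, n)
           for n in (1, 2, 5)}
for n, belief in beliefs.items():
    print(f'{n} evidence(s): belief ~ {belief:.4f}')
```

Under this model the systematic term caps the belief at 0.95 no matter how many EVEX evidences support a Statement, while the random term shrinks quickly as evidence accumulates.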
10 changes: 10 additions & 0 deletions indra/resources/source_info.json
@@ -299,6 +299,16 @@
"background-color": "#6600cc"
}
},
"evex": {
"name": "EVEX",
"link": "http://evexdb.org/",
"type": "reader",
"domain": "biology",
"default_style": {
"color": "white",
"background-color": "#295c8d"
}
},
"creeds": {
"name": "CREEDS",
"link": "https://maayanlab.cloud/CREEDS/",
2 changes: 2 additions & 0 deletions indra/sources/evex/__init__.py
@@ -0,0 +1,2 @@
from .api import download_evex, process_human_events
from .processor import EvexProcessor, EvexStandoff
144 changes: 144 additions & 0 deletions indra/sources/evex/api.py
@@ -0,0 +1,144 @@
import os
import glob
import logging
import pickle
import tarfile
from urllib.request import urlretrieve
import requests
import pandas
import tqdm

from .processor import EvexProcessor

logger = logging.getLogger(__name__)

human_network = 'http://evexdb.org/download/network-format/Metazoa/' \
    'Homo_sapiens.tar.gz'
standoff_root = 'http://evexdb.org/download/standoff-annotation/version-0.1/'


def process_human_events(base_folder=None):
    """Process all human events available in EVEX.

    Note that unless the standoff files have already been downloaded using the
    `download_evex` function, the Statements produced by this function
    will not carry the evidence text, agent text and other metadata
    that can only be obtained from the standoff files.

    Parameters
    ----------
    base_folder : Optional[str]
        If provided, the given base folder is used to download the human
        network file from EVEX. Otherwise, the `pystow` package is used
        to create an `evex` folder within the pystow base path,
        typically ~/.data/evex.

    Returns
    -------
    EvexProcessor
        An EvexProcessor instance with the extracted INDRA Statements
        as its statements attribute.
    """
    if not base_folder:
        import pystow
        base_folder = pystow.join('evex').as_posix()
    # Look for standoff archives in the same folder as the network file,
    # so that a custom base folder is honored here as well
    standoff_index = build_standoff_index(base_folder=base_folder)
    network_file = os.path.join(base_folder, 'Homo_sapiens.tar.gz')
    if not os.path.exists(network_file):
        urlretrieve(human_network, network_file)
    # Read both tables while the archive is still open
    with tarfile.open(network_file, 'r:gz') as fh:
        relations_file = fh.extractfile('EVEX_relations_9606.tab')
        articles_file = fh.extractfile('EVEX_articles_9606.tab')
        relations_df = pandas.read_csv(relations_file, sep='\t')
        articles_df = pandas.read_csv(articles_file, sep='\t')
    ep = EvexProcessor(relations_df, articles_df, standoff_index)
    ep.process_statements()
    return ep


def build_standoff_index(cached=True, base_folder=None):
    """Build an index of publications in standoff bulk archive files.

    This index is necessary to figure out which standoff archive the
    annotations for a given article are in.

    Parameters
    ----------
    cached : Optional[bool]
        If True, the standoff index is cached in the base folder and isn't
        regenerated if this function is called again, just reloaded.
        This is useful since generating the full standoff file index
        can take a long time. Default: True
    base_folder : Optional[str]
        If provided, the given base folder is searched for standoff
        archives (files whose names start with `batch`) and is used to
        store the cached index. Otherwise, the `pystow` package is used
        to create an `evex` folder within the pystow base path,
        typically ~/.data/evex.

    Returns
    -------
    dict
        A dict mapping paper ID tuples (the first two underscore-separated
        parts of each annotation file name) to the path of the standoff
        archive that contains the annotations for that paper.
    """
    if not base_folder:
        import pystow
        base_folder = pystow.join('evex').as_posix()
    cache_file = os.path.join(base_folder, 'standoff_index.pkl')
    if cached and os.path.exists(cache_file):
        logger.info('Loading standoff index from %s', cache_file)
        with open(cache_file, 'rb') as fh:
            return pickle.load(fh)
    index = {}
    for fname in tqdm.tqdm(glob.glob(os.path.join(base_folder, 'batch*')),
                           desc='Building standoff index'):
        try:
            with tarfile.open(fname, 'r:gz') as fh:
                names = fh.getnames()
        except tarfile.ReadError:
            # Skip archives that are corrupt or only partially downloaded
            logger.error('Could not read tarfile %s', fname)
            continue
        ids = {tuple(os.path.splitext(name)[0].split('_')[:2])
               for name in names if name.endswith('ann')}
        for paper_id in ids:
            index[paper_id] = fname
    if cached:
        with open(cache_file, 'wb') as fh:
            pickle.dump(index, fh)
    return index


def download_evex(base_folder=None):
    """Download EVEX human network and standoff output files.

    This function downloads the human network file as well as a large number
    of standoff output files. These files are necessary to find evidence text,
    agent text and agent coordinates to be used in INDRA. Note that there
    are over 4 thousand such files, and the overall size is around 6 GB.

    Parameters
    ----------
    base_folder : Optional[str]
        If provided, the given base folder is used to download the human
        network file and the standoff files from EVEX. Otherwise, the
        `pystow` package is used to create an `evex` folder within the
        pystow base path, typically ~/.data/evex.
    """
    from bs4 import BeautifulSoup
    if not base_folder:
        import pystow
        base_folder = pystow.join('evex').as_posix()
    # Download human network first
    fname = os.path.join(base_folder, 'Homo_sapiens.tar.gz')
    if not os.path.exists(fname):
        urlretrieve(human_network, fname)
    # Now download all the standoff files; href=True skips anchors
    # without an href attribute, for which node.get('href') would be None
    res = requests.get(standoff_root)
    soup = BeautifulSoup(res.text, 'html.parser')
    children = [standoff_root + node.get('href')
                for node in soup.find_all('a', href=True)
                if node.get('href').startswith('files')]
    for child in tqdm.tqdm(children):
        res = requests.get(child)
        soup = BeautifulSoup(res.text, 'html.parser')
        downloadables = [child + node.get('href')
                         for node in soup.find_all('a', href=True)
                         if node.get('href').startswith('batch')]
        for downloadable in downloadables:
            # Files that already exist locally are not downloaded again
            fname = os.path.join(base_folder, downloadable.split('/')[-1])
            if not os.path.exists(fname):
                urlretrieve(downloadable, fname)
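
The indexing and caching logic in `build_standoff_index` can be tried out without downloading anything. The following is a minimal, self-contained sketch under stated assumptions: the archive and member names (`batch-0.tar.gz`, `pmc_123_0.ann`, `pubmed_456.ann`) are made up for illustration and may not match the real EVEX naming, and the helpers `make_archive`, `index_archives` and `load_or_build` are hypothetical names that are not part of this PR.

```python
import os
import pickle
import tarfile
import tempfile


def make_archive(folder, archive_name, member_names):
    """Create a small .tar.gz with empty members (fixture for this sketch)."""
    path = os.path.join(folder, archive_name)
    with tarfile.open(path, 'w:gz') as tar:
        for name in member_names:
            info = tarfile.TarInfo(name=name)
            info.size = 0
            tar.addfile(info)
    return path


def index_archives(archive_paths):
    """Map (source, paper number) tuples to the archive that contains them."""
    index = {}
    for path in archive_paths:
        try:
            with tarfile.open(path, 'r:gz') as tar:
                names = tar.getnames()
        except tarfile.ReadError:
            # Unreadable archives are skipped, as in the PR
            continue
        for name in names:
            if name.endswith('ann'):
                key = tuple(os.path.splitext(name)[0].split('_')[:2])
                index[key] = path
    return index


def load_or_build(cache_file, build):
    """Return the pickled result if it exists, else build and cache it."""
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as fh:
            return pickle.load(fh)
    result = build()
    with open(cache_file, 'wb') as fh:
        pickle.dump(result, fh)
    return result


with tempfile.TemporaryDirectory() as folder:
    good = make_archive(folder, 'batch-0.tar.gz',
                        ['pmc_123_0.ann', 'pmc_123_0.txt', 'pubmed_456.ann'])
    # A file that is not a valid gzip archive, to exercise the error path
    bad = os.path.join(folder, 'batch-1.tar.gz')
    with open(bad, 'wb') as fh:
        fh.write(b'not a tar archive')
    cache_file = os.path.join(folder, 'standoff_index.pkl')
    first = load_or_build(cache_file, lambda: index_archives([good, bad]))
    # The second call never runs its builder because the cache file exists
    second = load_or_build(cache_file, lambda: {})
    demo_index = {key: os.path.basename(path) for key, path in first.items()}
    cache_hit = second == first

print(demo_index)
print('cache hit:', cache_hit)
```

One consequence of this design is that a cached index goes stale if more standoff archives are downloaded later; in the PR, calling `build_standoff_index(cached=False)` skips both reading and writing the cache file.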