
Commit 71349e4

Merge pull request #36 from ccb-hms/development

Remove extra functions and add unmapped

2 parents d3fe0a3 + d2634fb

File tree

13 files changed: +304 −182 lines

.readthedocs.yaml

Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+# .readthedocs.yaml
+# Read the Docs configuration file
+# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
+
+# Required
+version: 2
+
+# Set the OS, Python version and other tools you might need
+build:
+  os: ubuntu-22.04
+  tools:
+    python: "3.11"
+    # You can also specify other tool versions:
+    # nodejs: "19"
+    # rust: "1.64"
+    # golang: "1.19"
+
+# Build documentation in the "docs/" directory with Sphinx
+sphinx:
+  configuration: docs/conf.py
+
+# Optionally build your docs in additional formats such as PDF and ePub
+# formats:
+#   - pdf
+#   - epub
+
+# Optional but recommended, declare the Python requirements required
+# to build your documentation
+# See https://docs.readthedocs.io/en/stable/guides/reproducible-builds.html
+python:
+  install:
+    - requirements: requirements.txt

README.md

Lines changed: 31 additions & 45 deletions
@@ -13,13 +13,14 @@ pip install text2term
 import text2term
 import pandas

-df1 = text2term.map_file("test/unstruct_terms.txt", "http://www.ebi.ac.uk/efo/efo.owl")
+df1 = text2term.map_terms("test/unstruct_terms.txt", "http://www.ebi.ac.uk/efo/efo.owl")
 df2 = text2term.map_terms(["asthma", "acute bronchitis"], "http://www.ebi.ac.uk/efo/efo.owl")
+df3 = text2term.map_terms({"asthma":"disease", "acute bronchitis":["disease", "lungs"]}, "http://www.ebi.ac.uk/efo/efo.owl")
 ```
 Below is an example of caching, assuming the same imports as above:
 ```python
 text2term.cache_ontology("http://www.ebi.ac.uk/efo/efo.owl", "EFO")
-df1 = text2term.map_file("test/unstruct_terms.txt", "EFO", use_cache=True)
+df1 = text2term.map_terms("test/unstruct_terms.txt", "EFO", use_cache=True)
 df2 = text2term.map_terms(["asthma", "acute bronchitis"], "EFO", use_cache=True)
 text2term.clear_cache("EFO")
 ```
@@ -48,10 +49,10 @@ Then, after running this, the following command is equivalent:
 `python text2term -s test/unstruct_terms.txt -t EFO`

 ## Programmatic Usage
-The tool can be executed in Python with any of the three following functions:
+The tool can be executed in Python with the `map_terms` function:

 ```python
-text2term.map_file(input_file='/some/file.txt',
+text2term.map_terms(source_terms,
                    target_ontology='http://some.ontology/v1.owl',
                    base_iris=(),
                    csv_columns=(),
@@ -64,45 +65,15 @@ text2term.map_file(input_file='/some/file.txt',
                    save_mappings=False,
                    separator=',',
                    use_cache=False,
-                   term_type='classes')
-```
-or
-```python
-text2term.map_terms(source_terms=['term one', 'term two'],
-                    target_ontology='http://some.ontology/v1.owl',
-                    base_iris=(),
-                    excl_deprecated=False,
-                    max_mappings=3,
-                    min_score=0.3,
-                    mapper=Mapper.TFIDF,
-                    output_file='',
-                    save_graphs=False,
-                    save_mappings=False,
-                    source_terms_ids=(),
-                    use_cache=False,
-                    term_type='classes')
-```
-or
-```python
-text2term.map_tagged_terms(tagged_terms_dict={'term one': ["tag 1", "tag 2"]},
-                           target_ontology='http://some.ontology/v1.owl',
-                           base_iris=(),
-                           excl_deprecated=False,
-                           max_mappings=3,
-                           min_score=0.3,
-                           mapper=Mapper.TFIDF,
-                           output_file='',
-                           save_graphs=False,
-                           save_mappings=False,
-                           source_terms_ids=(),
-                           use_cache=False,
-                           term_type='classes')
+                   term_type='classes',
+                   incl_unmapped=False)
 ```
+NOTE: As of 3.0.0, the three former functions (`map_file`, `map_terms`, `map_tagged_terms`) have been consolidated into a single `map_terms` function. In old code, any of these calls can simply be renamed to `map_terms`; the function inspects the type of its input to preserve the behavior of each one.

 ### Arguments
-For `map_file`, the first argument 'input_file' specifies a path to a file containing the terms to be mapped. It also has a `csv_column` argument that allows the user to specify a column to map if a csv is passed in as the input file.
-For `map_terms`, the first argument 'source_terms' takes in a list of the terms to be mapped.
-For `map_tagged_terms`, everything is the same as `map_terms` except the first argument is either a dictionary of terms to a list of tags, or a list of TaggedTerm objects (see below). Currently, the tags do not affect the mapping in any way, but they are added to the output dataframe at the end of the process.
+For `map_terms`, the first argument can be any of the following: 1) a string specifying a path to a file containing the terms to be mapped, 2) a list of the terms to be mapped, or 3) a dictionary mapping terms to lists of tags, or a list of TaggedTerm objects (see below).
+Currently, the tags do not affect the mapping in any way, but they are added to the output dataframe at the end of the process. The exception is the Ignore tag, which causes the term not to be mapped at all, though it still appears in the results if the `incl_unmapped` argument is True (see below).

 All other arguments are the same, and have the same functionality:
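The NOTE above says the unified `map_terms` inspects the type of its input to preserve each legacy behavior. A minimal sketch of such type-based dispatch — the helper name and return labels are hypothetical, not text2term's actual code:

```python
def dispatch_source_terms(source_terms):
    """Illustrative sketch: classify map_terms' first argument by type.

    Returns a label naming the legacy entry point the input corresponds
    to. Hypothetical helper, not the shipped text2term implementation.
    """
    if isinstance(source_terms, str):
        return "file"          # old map_file: a path to a file of terms
    if isinstance(source_terms, dict):
        return "tagged"        # old map_tagged_terms: {term: tags}
    if isinstance(source_terms, (list, tuple)):
        if source_terms and hasattr(source_terms[0], "get_term"):
            return "tagged"    # a list of TaggedTerm objects
        return "terms"         # old map_terms: a plain list of strings
    raise TypeError(f"Unsupported input type: {type(source_terms)!r}")
```

With this kind of dispatch, renaming any of the three old calls to `map_terms` keeps working because the argument's type alone selects the behavior.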
@@ -115,6 +86,9 @@ All other arguments are the same, and have the same functionality:
 Map only to ontology terms whose IRIs start with one of the strings given in this tuple, for example:
 ('http://www.ebi.ac.uk/efo','http://purl.obolibrary.org/obo/HP')

+`csv_columns` : tuple
+Name(s) of the column(s) to map when a CSV file is given as the input. Ignored if the input is not a file path.
+
 `source_terms_ids` : tuple
 Collection of identifiers for the given source terms
 WARNING: While this is still available for the tagged term function, it is worth noting that dictionaries do not necessarily preserve order, so it is not recommended. If using the TaggedTerm object, the source terms can be attached there to guarantee order.
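Conceptually, the CSV-column argument selects one named column of the input file as the term list. A stdlib-only sketch of that selection — the helper and its exact semantics are assumptions for illustration, not the library's parsing code:

```python
import csv
import io

def select_csv_column(csv_text, column_name, separator=","):
    """Illustrative: pull the values of one named column out of CSV text,
    mimicking what a csv_columns-style argument selects for mapping."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=separator)
    # Skip rows where the column is empty, roughly matching the
    # documented NA-row handling.
    return [row[column_name] for row in reader if row.get(column_name)]
```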
@@ -141,12 +115,18 @@ All other arguments are the same, and have the same functionality:
 `save_mappings` : bool
 Save the generated mappings to a file (specified by `output_file`)

+`separator` : str
+Character that separates the source term values if a file input is given. Ignored if the input is not a file path.
+
 `use_cache` : bool
 Use the cache for the ontology. More details are below.

 `term_type` : str
 Determines whether the ontology should be parsed for its classes (ThingClass), properties (PropertyClass), or both. Possible values are ['classes', 'properties', 'both']. If it does not match one of these values, the program will throw a ValueError.

+`incl_unmapped` : bool
+Include all unmapped terms in the output. If a term has been tagged Ignore (see below) or falls below the `min_score` threshold, it is included without a mapped term at the end of the output.
+
 All default values, if they exist, can be seen above.
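The combined effect of `min_score` and `incl_unmapped` can be sketched as a post-filter over candidate mappings. The function name and data shapes here are assumptions for illustration, not text2term internals:

```python
def filter_mappings(candidates, min_score=0.3, incl_unmapped=False):
    """Illustrative sketch of score filtering.

    candidates: {source_term: (mapped_term, score)} with None for terms
    that produced no candidate at all. Mappings at or above min_score
    are kept; with incl_unmapped=True, the remaining terms are appended
    at the end with no mapped term, mirroring the documented behavior.
    """
    mapped, unmapped = [], []
    for term, hit in candidates.items():
        if hit is not None and hit[1] >= min_score:
            mapped.append((term, hit[0], hit[1]))
        else:
            unmapped.append((term, None, None))
    return mapped + unmapped if incl_unmapped else mapped
```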

### Return Value
@@ -185,9 +165,6 @@ As of version 1.2.0, text2term includes regex-based preprocessing functionality

 Like the "map" functions above, the two functions differ on whether the input is a file or a list of strings:
 ```python
-preprocess_file(file_path, template_path, output_file='', blocklist_path='', blocklist_char='', rem_duplicates=False)
-```
-```python
 preprocess_terms(terms, template_path, output_file='', blocklist_path='', blocklist_char='', rem_duplicates=False)
 ```
 ```python
@@ -202,7 +179,7 @@ NOTE: As of version 2.1.0, the arguments were changed to "blocklist" from "black
 The Remove Duplicates `rem_duplicates` functionality will remove all duplicate terms after processing, if set to `True`.
 WARNING: Removing duplicates at any point does not guarantee which original term is kept. This is particularly important if original terms have different tags, so user caution is advised.

-The functions `preprocess_file()` and `preprocess_terms()` both return a dictionary where the keys are the original terms and the values are the preprocessed terms.
+The function `preprocess_terms()` returns a dictionary where the keys are the original terms and the values are the preprocessed terms.
 The `preprocess_tagged_terms()` function returns a list of TaggedTerm items with the following function contracts:
 ```python
 def __init__(self, term=None, tags=[], original_term=None, source_term_id=None)
@@ -214,10 +191,19 @@ def get_term(self)
 def get_tags(self)
 def get_source_term_id(self)
 ```
-As mentioned in the mapping section above, this can then be passed directly to map_tagged_terms(), allowing for easy programmatic usage. Note that this allows multiple of the same preprocessed term with different tags.
+As mentioned in the mapping section above, this list can then be passed directly to `map_terms`, allowing for easy programmatic usage. Note that this allows multiple instances of the same preprocessed term with different tags.

 **Note on NA values in input**: As of v2.0.3, when the input to text2term is a table file, any rows that contain `NA` values in the specified term column, or in the term ID column (if provided), will be ignored.

+### Tag Usage
+As of 3.0.0, some tags carry additional functionality when attached to a term:
+
+IGNORE:
+If an Ignore tag is added to a term, that term will not be mapped to any terms in the ontology. It will only be included in the output if the `incl_unmapped` argument is True. The following values count as Ignore tags:
+```python
+IGNORE_TAGS = ["ignore", "Ignore", "ignore ", "Ignore "]
+```
+
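As a sketch of how the Ignore check could be applied to a term's tags — the helper name `is_ignored` is hypothetical; only the `IGNORE_TAGS` values come from the source:

```python
IGNORE_TAGS = ["ignore", "Ignore", "ignore ", "Ignore "]

def is_ignored(tags):
    """Sketch: a term is skipped during mapping if any of its tags is one
    of the documented ignore-tag values. The helper name is made up for
    illustration; text2term's internal check may differ."""
    return any(tag in IGNORE_TAGS for tag in tags)
```

For example, a term tagged `["ignore", "response"]` would be skipped during mapping and only appear in the output when `incl_unmapped=True`.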
 ## Command Line Usage

 After installation, execute the tool from a command line as follows:

docs/Makefile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?=
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

docs/conf.py

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# For the full list of built-in configuration values, see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+# -- Project information -----------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#project-information
+
+project = 'text2term'
+copyright = '2023, Harvard Medical School'
+author = 'Rafael Goncalves and Jason Payne'
+
+# -- General configuration ---------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration
+
+extensions = ["myst_parser"]
+
+templates_path = ['_templates']
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+
+
+# -- Options for HTML output -------------------------------------------------
+# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
+
+html_theme = 'alabaster'
+html_static_path = ['_static']

docs/index.rst

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+.. text2term documentation master file, created by
+   sphinx-quickstart on Tue Jul 11 10:34:29 2023.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to text2term's documentation!
+=====================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`

docs/make.bat

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=.
+set BUILDDIR=_build
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd

test/simple-test.py

Lines changed: 4 additions & 2 deletions
@@ -5,14 +5,16 @@ def main():
     efo = "http://www.ebi.ac.uk/efo/efo.owl#"
     pizza = "https://protege.stanford.edu/ontologies/pizza/pizza.owl"
     ncit = "http://purl.obolibrary.org/obo/ncit/releases/2022-08-19/ncit.owl"
-    # print(bioregistry.get_owl_download("eFo"))
     if not text2term.cache_exists("EFO"):
         cached_onto = text2term.cache_ontology("EFO")
         # df = cached_onto.map_terms(["asthma", "disease location", "obsolete food allergy"], excl_deprecated=True, term_type="classes")
         print("Cache exists:", cached_onto.cache_exists())
     # caches = text2term.cache_ontology_set("text2term/resources/ontologies.csv")
-    df = text2term.map_terms(["asthma", "disease location", "obsolete food allergy"], "EFO", min_score=.8, mapper=text2term.Mapper.JARO_WINKLER, excl_deprecated=True, use_cache=True, term_type="classes")
+    # df = text2term.map_terms(["asthma", "disease location", "obsolete food allergy"], "EFO", min_score=.8, mapper=text2term.Mapper.JARO_WINKLER, excl_deprecated=True, use_cache=True, term_type="classes")
     # df = text2term.map_terms(["contains", "asthma"], "EFO", term_type="classes")
+    df = text2term.map_terms({"asthma":"disease", "allergy":["ignore", "response"], "assdhfbswif":["sent"], "isdjfnsdfwd":""}, "EFO", excl_deprecated=True, use_cache=True, incl_unmapped=True)
+    # taggedterms = text2term.preprocess_tagged_terms("test/simple_preprocess.txt")
+    # df = text2term.map_terms(taggedterms, "EFO", excl_deprecated=True, use_cache=True, incl_unmapped=True)
     print(df.to_string())

 if __name__ == '__main__':

test/simple_preprocess.txt

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+asthma;:;disease
+acute bronchitis;:;important,tags
+colon disease
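The `;:;` format used in this test file — a term, then a comma-separated tag list — can be parsed in a few lines of stdlib Python. This is a sketch of the format, not the library's actual parser:

```python
def parse_tagged_line(line, separator=";:;"):
    """Sketch: split 'term;:;tag1,tag2' into (term, [tags]).
    A line with no separator, like 'colon disease', yields an empty
    tag list."""
    if separator in line:
        term, _, tag_str = line.partition(separator)
        tags = [t.strip() for t in tag_str.split(",") if t.strip()]
        return term.strip(), tags
    return line.strip(), []
```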

text2term/__init__.py

Lines changed: 0 additions & 3 deletions
@@ -1,12 +1,9 @@
 from .t2t import map_terms
-from .t2t import map_file
-from .t2t import map_tagged_terms
 from .t2t import cache_ontology
 from .onto_cache import cache_ontology_set
 from .onto_cache import cache_exists
 from .onto_cache import clear_cache
 from .mapper import Mapper
-from .preprocess import preprocess_file
 from .preprocess import preprocess_terms
 from .preprocess import preprocess_tagged_terms
 from .tagged_terms import TaggedTerm

text2term/preprocess.py

Lines changed: 4 additions & 29 deletions
@@ -3,32 +3,11 @@
 from enum import Enum
 from .tagged_terms import TaggedTerm

-def preprocess_file(file_path, template_path, output_file="", blocklist_path="", \
-        blocklist_char='', blacklist_path="", blacklist_char='', \
-        rem_duplicates=False):
-    # Allows backwards compatibility to blacklist. Will eventually be deleted
-    if blocklist_char == '':
-        blocklist_char = blacklist_char
-    if blocklist_path == "":
-        blocklist_path = blacklist_path
-    terms = _get_values(file_path)
-    processed_terms = preprocess_terms(terms, template_path, output_file=output_file, \
-        blocklist_path=blocklist_path, blocklist_char=blocklist_char, \
-        rem_duplicates=rem_duplicates)
-
-    return processed_terms
-
 ## Tags should be stored with their terms in the same line, delineated by ";:;"
 ## ex: Age when diagnosed with (.*) ;:; age,diagnosis
 ## "Age when diagnosed with cancer" becomes: {"cancer", ["age", "diagnosis"]}
 def preprocess_tagged_terms(file_path, template_path="", blocklist_path="", \
-        blocklist_char='', blacklist_path="", blacklist_char='', \
-        rem_duplicates=False, separator=";:;"):
-    # Allows backwards compatibility to blacklist. Will eventually be deleted
-    if blocklist_char == '':
-        blocklist_char = blacklist_char
-    if blocklist_path == "":
-        blocklist_path = blacklist_path
+        blocklist_char='', rem_duplicates=False, separator=";:;"):
     # Separate tags from the terms, put in TaggedTerm and add to list
     raw_terms = _get_values(file_path)
     terms = []
@@ -80,13 +59,9 @@ def preprocess_tagged_terms(file_path, template_path="", blocklist_path="", \
     return processed_terms

 def preprocess_terms(terms, template_path, output_file="", blocklist_path="", \
-        blocklist_char='', blacklist_path="", blacklist_char='', \
-        rem_duplicates=False):
-    # Allows backwards compatibility to blacklist. Will eventually be deleted
-    if blocklist_char == '':
-        blocklist_char = blacklist_char
-    if blocklist_path == "":
-        blocklist_path = blacklist_path
+        blocklist_char='', rem_duplicates=False):
+    if isinstance(terms, str):
+        terms = _get_values(terms)
 # Form the templates as regular expressions
 	template_strings = []
 	if template_path != "":
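The template mechanism referenced in the comments above (e.g. `Age when diagnosed with (.*)`) can be illustrated with plain `re`, assuming templates are full-match regexes whose first capture group is kept — a sketch, not the shipped implementation:

```python
import re

def apply_templates(term, templates):
    """Illustrative sketch of regex-template preprocessing: return the
    first capture group of the first template that fully matches the
    term, otherwise the term unchanged. E.g. the template
    'Age when diagnosed with (.*)' turns
    'Age when diagnosed with cancer' into 'cancer'."""
    for pattern in templates:
        match = re.fullmatch(pattern, term)
        if match and match.groups():
            return match.group(1)
    return term
```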
