Add support to spaCy v3 (#52)
* Update to spaCy v3 (#49)

* Resolved #48 by migrating the code to match the nlp pipeline in spacy v3. See: https://nightly.spacy.io/usage/v3#migrating-add-pipe

* updated tests in contextualSpellCheck.py to match the pipeline in spacy v3

* updated spacy dependency number

* black lint

* Update tests

Co-authored-by: R1j1t <[email protected]>

* updated the type check based on PEP 3107

Ref:
- https://stackoverflow.com/a/21384492/7630458
- https://docs.python.org/3/library/typing.html
- https://www.python.org/dev/peps/pep-3107/
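
For context, PEP 3107 annotations attach types to parameters without any runtime enforcement, which is what allows the explicit isinstance checks to be dropped further down in this diff. A minimal sketch in the style of the new `__init__` signature (the helper name is illustrative, not from the codebase):

```python
# PEP 3107 function annotations: metadata only, never enforced at runtime.
# `configure` is a hypothetical helper, not part of contextualSpellCheck.
def configure(
    vocab_path: str = "",
    model_name: str = "bert-base-cased",
    max_edit_dist: int = 10,
    debug: bool = False,
) -> dict:
    # Callers may still pass wrong types; annotations alone won't stop them.
    return {
        "vocab_path": vocab_path,
        "model_name": model_name,
        "max_edit_dist": max_edit_dist,
        "debug": debug,
    }
```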

* updated README and controller

* reflected changes in examples and other housekeeping

* preparing for release

* removed optional config from README usage

Co-authored-by: jonmun <[email protected]>
R1j1t and jonmun authored Feb 16, 2021
1 parent f18c69c commit 5b65bad
Showing 11 changed files with 113 additions and 90 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
-ignore = W503, E203
+ignore = W503, E203, F401
exclude = .git,__pycache__,build,peters_code,.ipynb_checkpoints,setup.py
max-complexity = 15
per-file-ignores =
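F401 is flake8's "imported but unused" check. A plausible reason it is ignored from now on: with the spaCy v3 factory pattern, `__init__.py` keeps an import purely for its registration side effect, roughly:

```python
# Sketch (assumed, based on the __init__.py diff below): the import runs
# the @Language.factory decorator and registers "contextual spellchecker";
# the imported name itself is never referenced, which flake8 flags as F401.
from .contextualSpellCheck import ContextualSpellCheck
```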
3 changes: 1 addition & 2 deletions .github/stale.yml
@@ -13,7 +13,6 @@ staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
-  recent activity. It will be closed if no further activity occurs. Thank you
-  for your contributions.
+  recent activity. It will be closed if no further activity occurs.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
3 changes: 3 additions & 0 deletions .gitignore
@@ -145,3 +145,6 @@ contextualSpellCheck/tests/debugFile.txt

# vs code ignore
.vscode/
+
+# PyCharm config ignore
+.idea/
37 changes: 20 additions & 17 deletions README.md
@@ -12,13 +12,13 @@ Contextual word checker for better suggestions

## Types of spelling mistakes

-It is essential to understand that identifying whether a candidate is a spelling error is a big task. You can see the below quote from a research paper:
+It is essential to understand that identifying whether a candidate is a spelling error is a big task.

> Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
>
> -- [Monojit Choudhury et al. (2007)][1]

-This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model. The idea of using BERT was to use the context when correcting OOV. In the coming days, I would like to focus on RWE and optimising the package by implementing it in cython.
+This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using the BERT model. The idea of using BERT was to use the context when correcting OOV. To improve this package, I would like to extend the functionality to identify RWE, optimise the package, and improve the documentation.

## Install

@@ -28,26 +28,24 @@ The package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/):

```bash
pip install contextualSpellCheck
```

-Also, please install the dependencies from requirements.txt

## Usage

-**Note:** For other language examples check [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.
+**Note:** For use in other languages, check the [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.

### How to load the package in spacy pipeline

```bash
>>> import contextualSpellCheck
>>> import spacy
>>>
->>> ## We require NER to identify if it is PERSON
->>> ## also require parser because we use Token.sent for context
>>> nlp = spacy.load("en_core_web_sm")
>>>
+>>> ## We require NER to identify if a token is a PERSON
+>>> ## also require parser because we use `Token.sent` for context
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
-<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
-['tagger', 'parser', 'ner', 'contextual spellchecker']
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
```
@@ -60,19 +58,24 @@ Or you can add to spaCy pipeline manually!
```bash
>>> import spacy
>>> import contextualSpellCheck
>>>
->>> nlp = spacy.load('en')
->>> checker = contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck()
->>> nlp.add_pipe(checker)
+>>> nlp = spacy.load("en_core_web_sm")
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
+>>> # You can pass optional parameters to the contextualSpellCheck,
+>>> # e.g. to set a max edit distance use config={"max_edit_dist": 3}
+>>> nlp.add_pipe("contextual spellchecker")
+<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0>
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.

```

-After adding contextual spell checker in the pipeline, you use the pipeline normally. The spell check suggestions and other data can be accessed using extensions.
+After adding `contextual spellchecker` to the pipeline, you can use the pipeline normally. The spell check suggestions and other data can be accessed using [extensions](#Extensions).

### Using the pipeline

@@ -108,7 +111,7 @@

## Extensions

-To make the usage simpler spacy provides custom extensions which a library can use. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. Below tables summaries the extensions.
+To make the usage easy, `contextual spellchecker` provides custom spacy extensions which your code can consume. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. The tables below summarise the extensions.
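
For instance, a minimal sketch using only the `Doc`-level extensions that appear in the examples above (assumes the pipe is already added to `nlp`):

```python
doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
if doc._.performed_spellCheck:       # True when the checker ran on this doc
    print(doc._.outcome_spellCheck)  # the corrected text
```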

### `spaCy.Doc` level extensions

@@ -142,7 +145,7 @@ Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY
Note: Your browser can handle the text encoding

```
-http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
+GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
```

Response:
4 changes: 1 addition & 3 deletions contextualSpellCheck/__init__.py
@@ -5,6 +5,4 @@


def add_to_pipe(nlp, **kwargs):
-    checker = ContextualSpellCheck(**kwargs)
-    nlp.add_pipe(checker)
-    return nlp
+    nlp.add_pipe("contextual spellchecker", **kwargs)
44 changes: 20 additions & 24 deletions contextualSpellCheck/contextualSpellCheck.py
@@ -11,8 +11,10 @@
from spacy.tokens import Doc, Token, Span
from spacy.vocab import Vocab
from transformers import AutoModelForMaskedLM, AutoTokenizer
+from spacy.language import Language


+@Language.factory("contextual spellchecker")
class ContextualSpellCheck(object):
"""
Class object for Out Of Vocabulary(OOV) corrections
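
A note on what `@Language.factory` does here: spaCy v3 looks components up by the registered string, which is why the README now calls `nlp.add_pipe("contextual spellchecker")` instead of passing an instance. A standalone sketch of the same pattern with a toy component (not from this repo):

```python
from spacy.language import Language

@Language.factory("toy_logger")  # hypothetical component for illustration
def create_toy_logger(nlp, name):
    def toy_logger(doc):
        print(f"{name} processed {len(doc)} tokens")
        return doc
    return toy_logger

# Usage: nlp.add_pipe("toy_logger") on any loaded pipeline.
```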
@@ -22,11 +24,13 @@ class ContextualSpellCheck(object):

    def __init__(
        self,
-        vocab_path="",
-        model_name="bert-base-cased",
-        max_edit_dist=10,
-        debug=False,
-        performance=False,
+        nlp,
+        name,
+        vocab_path: str = "",
+        model_name: str = "bert-base-cased",
+        max_edit_dist: int = 10,
+        debug: bool = False,
+        performance: bool = False,
    ):
"""To create an object for this class. It does not require any special
@@ -43,27 +47,14 @@ def __init__(
            by individual steps in spell check.
            Defaults to False.
        """
-        if (
-            not isinstance(vocab_path, str)
-            or not isinstance(debug, type(True))
-            or not isinstance(performance, type(True))
-        ):
-            raise TypeError(
-                "Please check datatype provided. vocab_path should be str,"
-                " debug and performance should be bool"
-            )
-        try:
-            int(float(max_edit_dist))
-        except ValueError:
-            raise ValueError(
-                f"cannot convert {max_edit_dist} to int. Please provide a "
-                f"valid integer "
-            )

        if vocab_path != "":
+            vocab_path = str(vocab_path)
            try:
                # First open() for user specified word addition to vocab
                with open(vocab_path, encoding="utf8") as f:
+                    print(vocab_path)
+                    print("inside vocab path")
                    # if want to remove '[unusedXX]' from vocab
                    # words = [
                    #     line.rstrip()
@@ -75,12 +66,14 @@ def __init__(
                    # The below code adds the necessary words like numbers
                    # /punctuations/tokenizer specific words like [PAD]/[
                    # unused0]/##M
+                    print("file opened!")
                    current_path = os.path.dirname(__file__)
                    vocab_path = os.path.join(current_path, "data", "vocab.txt")
                    extra_token = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
                    words.extend(extra_token)

                    with open(vocab_path, encoding="utf8") as f:
+                        print("Inside [unused....]")
                        # if want to remove '[unusedXX]' from vocab
                        # words = [
                        #     line.rstrip()
@@ -110,7 +103,7 @@ def __init__(
            words = []

        self.max_edit_dist = int(float(max_edit_dist))
-        self.model_name = model_name
+        self.model_name = str(model_name)
        self.BertTokenizer = AutoTokenizer.from_pretrained(self.model_name)

        if vocab_path == "":
@@ -651,8 +644,11 @@ def deep_tokenize_in_vocab(self, text):
        raise AttributeError(
            "parser is required please enable it in nlp pipeline"
        )
-    checker = ContextualSpellCheck(debug=True, max_edit_dist=3)
-    nlp.add_pipe(checker)
+    # checker = ContextualSpellCheck(debug=True, max_edit_dist=3)
+    nlp.add_pipe(
+        "contextual spellchecker", config={"debug": True, "max_edit_dist": 3}
+    )
+
    # nlp.add_pipe(merge_ents)

    doc = nlp(
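Taken together, a sketch of how the new registration is consumed end to end (assuming spacy>=3.0): spaCy supplies the `nlp` and `name` arguments itself, and the keys under `config` are forwarded to the remaining keyword parameters of `__init__`.

```python
import spacy
import contextualSpellCheck  # noqa: F401 - importing registers the factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"max_edit_dist": 3, "debug": True},  # forwarded to __init__
)
doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
print(doc._.outcome_spellCheck)
```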
