Add support to spaCy v3 (#52)
* Update to spaCy v3 (#49)

* Resolved #48 by migrating the code to match the nlp pipeline in spacy v3. See: https://nightly.spacy.io/usage/v3#migrating-add-pipe

* updated tests in contextualSpellCheck.py to match the pipeline in spacy v3

* updated spacy dependency number

* black lint

* Update tests

Co-authored-by: R1j1t <[email protected]>

* updated the type check based on PEP 3107

Ref:
- https://stackoverflow.com/a/21384492/7630458
- https://docs.python.org/3/library/typing.html
- https://www.python.org/dev/peps/pep-3107/
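
For context, PEP 3107 annotations attach types to parameters without any runtime enforcement, which is what allows the explicit isinstance checks to be dropped further down in this diff. A minimal sketch in the style of the new `__init__` signature (the helper name is illustrative, not from the codebase):

```python
# PEP 3107 function annotations: metadata only, never enforced at runtime.
# `configure` is a hypothetical helper, not part of contextualSpellCheck.
def configure(
    vocab_path: str = "",
    model_name: str = "bert-base-cased",
    max_edit_dist: int = 10,
    debug: bool = False,
) -> dict:
    # Callers may still pass wrong types; annotations alone won't stop them.
    return {
        "vocab_path": vocab_path,
        "model_name": model_name,
        "max_edit_dist": max_edit_dist,
        "debug": debug,
    }
```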

* updated README and controller

* reflected changes in examples and other housekeeping

* preparing for release

* removed optional config from README usage

Co-authored-by: jonmun <[email protected]>
R1j1t and jonmun authored Feb 16, 2021
1 parent f18c69c commit 5b65bad
Showing 11 changed files with 113 additions and 90 deletions.
2 changes: 1 addition & 1 deletion .flake8
@@ -1,5 +1,5 @@
[flake8]
-ignore = W503, E203
+ignore = W503, E203, F401
exclude = .git,__pycache__,build,peters_code,.ipynb_checkpoints,setup.py
max-complexity = 15
per-file-ignores =
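F401 is flake8's "imported but unused" check. A plausible reason it is ignored from now on: with the spaCy v3 factory pattern, `__init__.py` keeps an import purely for its registration side effect, roughly:

```python
# Sketch (assumed, based on the __init__.py diff below): the import runs
# the @Language.factory decorator and registers "contextual spellchecker";
# the imported name itself is never referenced, which flake8 flags as F401.
from .contextualSpellCheck import ContextualSpellCheck
```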
3 changes: 1 addition & 2 deletions .github/stale.yml
@@ -13,7 +13,6 @@ staleLabel: wontfix
# Comment to post when marking an issue as stale. Set to `false` to disable
markComment: >
  This issue has been automatically marked as stale because it has not had
-  recent activity. It will be closed if no further activity occurs. Thank you
-  for your contributions.
+  recent activity. It will be closed if no further activity occurs.
# Comment to post when closing a stale issue. Set to `false` to disable
closeComment: false
3 changes: 3 additions & 0 deletions .gitignore
@@ -145,3 +145,6 @@ contextualSpellCheck/tests/debugFile.txt

# vs code ignore
.vscode/
+
+# PyCharm config ignore
+.idea/
37 changes: 20 additions & 17 deletions README.md
@@ -12,13 +12,13 @@ Contextual word checker for better suggestions

## Types of spelling mistakes

-It is essential to understand that identifying whether a candidate is a spelling error is a big task. You can see the below quote from a research paper:
+It is essential to understand that identifying whether a candidate is a spelling error is a big task.

> Spelling errors are broadly classified as non-word errors (NWE) and real word errors (RWE). If the misspelt string is a valid word in the language, then it is called an RWE, else it is an NWE.
>
> -- [Monojit Choudhury et al. (2007)][1]

-This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using BERT model. The idea of using BERT was to use the context when correcting OOV. In the coming days, I would like to focus on RWE and optimising the package by implementing it in cython.
+This package currently focuses on Out of Vocabulary (OOV) word or non-word error (NWE) correction using the BERT model. The idea of using BERT was to use the context when correcting OOV. To improve this package, I would like to extend the functionality to identify RWE, optimise the package, and improve the documentation.

## Install

@@ -28,26 +28,24 @@ The package can be installed using [pip](https://pypi.org/project/contextualSpellCheck/):

```bash
pip install contextualSpellCheck
```

-Also, please install the dependencies from requirements.txt

## Usage

-**Note:** For other language examples check [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.
+**Note:** For use in other languages, check the [`examples`](https://github.com/R1j1t/contextualSpellCheck/tree/master/examples) folder.

### How to load the package in spacy pipeline

```bash
>>> import contextualSpellCheck
>>> import spacy
>>>
->>> ## We require NER to identify if it is PERSON
->>> ## also require parser because we use Token.sent for context
>>> nlp = spacy.load("en_core_web_sm")
>>>
+>>> ## We require NER to identify if a token is a PERSON
+>>> ## also require parser because we use `Token.sent` for context
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
>>> contextualSpellCheck.add_to_pipe(nlp)
-<spacy.lang.en.English object at 0x12839a2d0>
>>> nlp.pipe_names
-['tagger', 'parser', 'ner', 'contextual spellchecker']
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp('Income was $9.4 milion compared to the prior year of $2.7 milion.')
>>> doc._.outcome_spellCheck
```
@@ -60,19 +58,24 @@ Or you can add to spaCy pipeline manually!
```bash
>>> import spacy
>>> import contextualSpellCheck
>>>
->>> nlp = spacy.load('en')
->>> checker = contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck()
->>> nlp.add_pipe(checker)
+>>> nlp = spacy.load("en_core_web_sm")
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer']
+>>> # You can pass optional parameters to the contextualSpellCheck,
+>>> # e.g. to set a max edit distance use config={"max_edit_dist": 3}
+>>> nlp.add_pipe("contextual spellchecker")
+<contextualSpellCheck.contextualSpellCheck.ContextualSpellCheck object at 0x1049f82b0>
+>>> nlp.pipe_names
+['tok2vec', 'tagger', 'parser', 'ner', 'attribute_ruler', 'lemmatizer', 'contextual spellchecker']
>>>
>>> doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
>>> print(doc._.performed_spellCheck)
True
>>> print(doc._.outcome_spellCheck)
Income was $9.4 million compared to the prior year of $2.7 million.

```

-After adding contextual spell checker in the pipeline, you use the pipeline normally. The spell check suggestions and other data can be accessed using extensions.
+After adding `contextual spellchecker` to the pipeline, you can use the pipeline normally. The spell check suggestions and other data can be accessed using [extensions](#Extensions).

### Using the pipeline

@@ -108,7 +111,7 @@

## Extensions

-To make the usage simpler spacy provides custom extensions which a library can use. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. Below tables summaries the extensions.
+To make the usage easy, `contextual spellchecker` provides custom spacy extensions which your code can consume. This makes it easier for the user to get the desired data. contextualSpellCheck provides extensions on the `doc`, `span` and `token` level. The tables below summarise the extensions.
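
For instance, a minimal sketch using only the `Doc`-level extensions that appear in the examples above (assumes the pipe is already added to `nlp`):

```python
doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
if doc._.performed_spellCheck:       # True when the checker ran on this doc
    print(doc._.outcome_spellCheck)  # the corrected text
```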

### `spaCy.Doc` level extensions

@@ -142,7 +145,7 @@ Query: You can use the endpoint http://127.0.0.1:5000/?query=YOUR-QUERY
Note: Your browser can handle the text encoding

```
-http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
+GET Request: http://localhost:5000/?query=Income%20was%20$9.4%20milion%20compared%20to%20the%20prior%20year%20of%20$2.7%20milion.
```

Response:
4 changes: 1 addition & 3 deletions contextualSpellCheck/__init__.py
@@ -5,6 +5,4 @@


def add_to_pipe(nlp, **kwargs):
-    checker = ContextualSpellCheck(**kwargs)
-    nlp.add_pipe(checker)
-    return nlp
+    nlp.add_pipe("contextual spellchecker", **kwargs)
44 changes: 20 additions & 24 deletions contextualSpellCheck/contextualSpellCheck.py
@@ -11,8 +11,10 @@
from spacy.tokens import Doc, Token, Span
from spacy.vocab import Vocab
from transformers import AutoModelForMaskedLM, AutoTokenizer
+from spacy.language import Language


+@Language.factory("contextual spellchecker")
class ContextualSpellCheck(object):
"""
Class object for Out Of Vocabulary(OOV) corrections
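
A note on what `@Language.factory` does here: spaCy v3 looks components up by the registered string, which is why the README now calls `nlp.add_pipe("contextual spellchecker")` instead of passing an instance. A standalone sketch of the same pattern with a toy component (not from this repo):

```python
from spacy.language import Language

@Language.factory("toy_logger")  # hypothetical component for illustration
def create_toy_logger(nlp, name):
    def toy_logger(doc):
        print(f"{name} processed {len(doc)} tokens")
        return doc
    return toy_logger

# Usage: nlp.add_pipe("toy_logger") on any loaded pipeline.
```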
@@ -22,11 +24,13 @@ class ContextualSpellCheck(object):

    def __init__(
        self,
-        vocab_path="",
-        model_name="bert-base-cased",
-        max_edit_dist=10,
-        debug=False,
-        performance=False,
+        nlp,
+        name,
+        vocab_path: str = "",
+        model_name: str = "bert-base-cased",
+        max_edit_dist: int = 10,
+        debug: bool = False,
+        performance: bool = False,
    ):
"""To create an object for this class. It does not require any special
@@ -43,27 +47,14 @@ def __init__(
            by individual steps in spell check.
            Defaults to False.
        """
-        if (
-            not isinstance(vocab_path, str)
-            or not isinstance(debug, type(True))
-            or not isinstance(performance, type(True))
-        ):
-            raise TypeError(
-                "Please check datatype provided. vocab_path should be str,"
-                " debug and performance should be bool"
-            )
-        try:
-            int(float(max_edit_dist))
-        except ValueError:
-            raise ValueError(
-                f"cannot convert {max_edit_dist} to int. Please provide a "
-                f"valid integer "
-            )

        if vocab_path != "":
+            vocab_path = str(vocab_path)
            try:
                # First open() for user specified word addition to vocab
                with open(vocab_path, encoding="utf8") as f:
+                    print(vocab_path)
+                    print("inside vocab path")
                    # if want to remove '[unusedXX]' from vocab
                    # words = [
                    #     line.rstrip()
@@ -75,12 +66,14 @@ def __init__(
                    # The below code adds the necessary words like numbers
                    # /punctuations/tokenizer specific words like [PAD]/[
                    # unused0]/##M
+                    print("file opened!")
                    current_path = os.path.dirname(__file__)
                    vocab_path = os.path.join(current_path, "data", "vocab.txt")
                    extra_token = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
                    words.extend(extra_token)

                    with open(vocab_path, encoding="utf8") as f:
+                        print("Inside [unused....]")
                        # if want to remove '[unusedXX]' from vocab
                        # words = [
                        #     line.rstrip()
@@ -110,7 +103,7 @@ def __init__(
            words = []

        self.max_edit_dist = int(float(max_edit_dist))
-        self.model_name = model_name
+        self.model_name = str(model_name)
        self.BertTokenizer = AutoTokenizer.from_pretrained(self.model_name)

        if vocab_path == "":
@@ -651,8 +644,11 @@ def deep_tokenize_in_vocab(self, text):
        raise AttributeError(
            "parser is required please enable it in nlp pipeline"
        )
-    checker = ContextualSpellCheck(debug=True, max_edit_dist=3)
-    nlp.add_pipe(checker)
+    # checker = ContextualSpellCheck(debug=True, max_edit_dist=3)
+    nlp.add_pipe(
+        "contextual spellchecker", config={"debug": True, "max_edit_dist": 3}
+    )
+
    # nlp.add_pipe(merge_ents)

    doc = nlp(
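Taken together, a sketch of how the new registration is consumed end to end (assuming spacy>=3.0): spaCy supplies the `nlp` and `name` arguments itself, and the keys under `config` are forwarded to the remaining keyword parameters of `__init__`.

```python
import spacy
import contextualSpellCheck  # noqa: F401 - importing registers the factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "contextual spellchecker",
    config={"max_edit_dist": 3, "debug": True},  # forwarded to __init__
)
doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
print(doc._.outcome_spellCheck)
```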
