ERRANT v2.3.0
Christopher Bryant committed Jul 15, 2021
1 parent 6c0d521 commit 9111c6c
Showing 10 changed files with 349 additions and 125 deletions.
40 changes: 33 additions & 7 deletions CHANGELOG.md
@@ -2,6 +2,32 @@

This log describes all the changes made to ERRANT since its release.

## v2.3.0 (15-07-21)

1. Added some new rules to reduce the number of OTHER-type 1:1 edits by reclassifying them as more specific error types. Specifically, there are now ~40% fewer 1:1 OTHER edits and ~15% fewer n:n OTHER edits overall (tested on the FCE and W&I training sets combined). The changes are as follows:

* A possessive suffix at the start of a merge sequence is now always split:

| Example | people life -> people 's lives |
|---------|------------------------------------------------------------|
| Old | _life_ -> _'s lives_ (R:OTHER) |
| New | _ε_ -> _'s_ (M:NOUN:POSS), _life_ -> _lives_ (R:NOUN:NUM) |

* NUM <-> DET edits are now classified as R:DET; e.g. _one (cat)_ -> _a (cat)_. Thanks to [@katkorre](https://github.com/katkorre/ERRANT-reclassification)!

* Changed the string similarity score in the classifier from the Levenshtein ratio to the normalised Levenshtein distance based on the length of the longest input string. This is because we felt some ratio scores were unintuitive; e.g. _smt_ -> _something_ has a ratio score of 0.5 despite the insertion of 6 characters (the new normalised score is 0.33). A short sketch of both scores follows this list.

* The non-word spelling error rules were updated slightly to take the new normalised Levenshtein score into account. Additionally, dissimilar strings are now classified based on the POS tag of the correction rather than as OTHER; e.g. _amougnht_ -> _number_ (R:NOUN).

* The new normalised Levenshtein score is also used to classify many of the remaining 1:1 replacement edits that were previously classified as OTHER. Many of these are real-word spelling errors (e.g. _their_ <-> _there_), but there are also some morphological errors (e.g. _health_ -> _healthy_) and POS-based errors (e.g. _transport_ -> _travel_). Note that these rules are a little complex and depend on both the similarity score and the lengths of the original and corrected strings. For example, _form_ -> _from_ (R:SPELL) and _eventually_ -> _finally_ (R:ADV) both have the same similarity score of 0.5, yet are assigned different error types based on their string lengths.

2. Various minor updates:
* `out_m2` in `parallel_to_m2.py` and `m2_to_m2.py` is now opened and closed properly. [#20](https://github.com/chrisjbryant/errant/pull/20)
* Fixed a bracketing error that deleted a valid edit in rare circumstances. [#26](https://github.com/chrisjbryant/errant/issues/26) [#28](https://github.com/chrisjbryant/errant/issues/28)
* Updated the English wordlist.
* Minor changes to the readme.
* Tidied up some code comments.
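
As flagged in the similarity-score bullet above, here is a minimal sketch of the old and new scores. It uses the [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library that ERRANT already depends on; the exact formula ERRANT applies internally is an assumption based on this changelog entry.

```python
# A minimal sketch of the old (ratio) and new (normalised distance) scores;
# the exact internal formula is an assumption based on the changelog above.
import Levenshtein

def old_score(a, b):
    # Levenshtein ratio: indel distance normalised by the combined length
    return Levenshtein.ratio(a, b)

def new_score(a, b):
    # 1 - edit distance / length of the longest input string
    return 1 - Levenshtein.distance(a, b) / max(len(a), len(b))

print(old_score("smt", "something"))       # 0.5, despite 6 inserted characters
print(new_score("smt", "something"))       # ~0.33
print(new_score("form", "from"))           # 0.5: short strings -> R:SPELL
print(new_score("eventually", "finally"))  # 0.5: longer strings -> R:ADV
```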

## v2.2.3 (12-02-21)

1. Changed the dependency version requirements in `setup.py` since ERRANT v2.2.x is not compatible with spaCy 3.
@@ -27,13 +53,13 @@ Fixed key error in the classifier for rare spaCy 2 POS tags: _SP, BES, HVS.
1. The character level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of Python's native `difflib.SequenceMatcher`. This makes ERRANT 3x faster! A brief comparison sketch follows the list below.

2. Various minor updates:
* Updated the English wordlist.
* Fixed a broken rule for classifying contraction errors.
* Changed a condition in the calculation of transposition errors to be more intuitive.
* Partially updated the ERRANT POS tag map to match the updated [Universal POS tag map](https://universaldependencies.org/tagset-conversion/en-penn-uposf.html). Specifically, EX now maps to PRON rather than ADV, LS maps to X rather than PUNCT, and CONJ has been renamed CCONJ. I did not change the mapping of RP from PART to ADP yet because this breaks several rules involving phrasal verbs.
* Added an `errant.__version__` attribute.
* Added a warning about using ERRANT with spaCy 2.
* Tidied some code in the classifier.
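
As a rough illustration of point 1 above, both libraries expose a comparable character-level ratio; the difference is speed, since python-Levenshtein is a C extension. The exact cost formula ERRANT uses in its alignment is an assumption here.

```python
# Sketch: two ways to score character-level similarity between a token pair.
# The exact cost formula ERRANT uses in its alignment is an assumption.
from difflib import SequenceMatcher
import Levenshtein

orig_tok, cor_tok = "transport", "travel"

slow = SequenceMatcher(None, orig_tok, cor_tok).ratio()  # pure Python
fast = Levenshtein.ratio(orig_tok, cor_tok)              # C implementation

print(slow, fast)  # both 0.4 for this pair
```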

## v2.0.0 (10-12-19)

11 changes: 6 additions & 5 deletions README.md
@@ -1,12 +1,12 @@
# ERRANT v2.2.3
# ERRANT v2.3.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html).
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html). In particular, see Chapter 5 for definitions of error types.

# Overview

@@ -40,11 +40,11 @@ python3 -m spacy download en
```
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then update some setup tools and install ERRANT, [spaCy](https://spacy.io/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.

#### ERRANT and spaCy 2
#### ERRANT and spaCy

ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7 however, and so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system to trade speed for accuracy, this means ERRANT v2.2.0 is **~4x slower** than ERRANT v2.1.0.
ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7 however, and so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system to trade speed for accuracy, this means ERRANT v2.2 is **~4x slower** than ERRANT v2.1. We have not yet extended ERRANT to work with spaCy 3, but preliminary tests suggest ERRANT will become even slower.

There is no way around this if you use Python >= 3.7, but *we recommend installing ERRANT v2.1.0 if you use Python < 3.7*.
Consequently, we recommend ERRANT v2.1.0 if speed is a priority and you can use Python < 3.7.
```
pip3 install errant==2.1.0
```
@@ -64,6 +64,7 @@ git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install -e .
python3 -m spacy download en
```
2 changes: 1 addition & 1 deletion errant/__init__.py
@@ -3,7 +3,7 @@
from errant.annotator import Annotator

# ERRANT version
__version__ = '2.2.3'
__version__ = '2.3.0'

# Load an ERRANT Annotator object for a given language
def load(lang, nlp=None):
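For context, a minimal usage sketch of the `load` entry point above, mirroring how the command scripts below use the returned annotator; the example sentences are invented.

```python
# Minimal usage sketch of errant.load, mirroring the calls made in
# m2_to_m2.py and parallel_to_m2.py below; example sentences are invented.
import errant

annotator = errant.load("en")  # load spaCy and the English resources

orig = annotator.parse("This are a example .")
cor = annotator.parse("This is an example .")

for edit in annotator.annotate(orig, cor):
    # Each edit carries original/corrected strings and an error type
    print(edit.o_str, "->", edit.c_str, edit.type)
```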
122 changes: 64 additions & 58 deletions errant/commands/m2_to_m2.py
@@ -7,66 +7,72 @@ def main():
print("Loading resources...")
# Load Errant
annotator = errant.load("en")
# Open output M2 file
out_m2 = open(args.out, "w")

print("Processing M2 file...")
# Open the m2 file and split it into text+edit blocks
m2 = open(args.m2_file).read().strip().split("\n\n")
# Loop through the blocks
for m2_block in m2:
m2_block = m2_block.strip().split("\n")
# Write the original text to the output M2 file
out_m2.write(m2_block[0]+"\n")
# Parse orig with spacy
orig = annotator.parse(m2_block[0][2:])
# Simplify the edits and sort by coder id
edit_dict = simplify_edits(m2_block[1:])
# Loop through coder ids
for id, raw_edits in sorted(edit_dict.items()):
# If the first edit is a noop
if raw_edits[0][2] == "noop":
# Write the noop and continue
out_m2.write(noop_edit(id)+"\n")
continue
# Apply the edits to generate the corrected text
# Also redefine the edits as orig and cor token offsets
cor, gold_edits = get_cor_and_edits(m2_block[0][2:], raw_edits)
# Parse cor with spacy
cor = annotator.parse(cor)
# Save detection edits here for auto
det_edits = []
# Loop through the gold edits
for gold_edit in gold_edits:
# Do not minimise detection edits
if gold_edit[-2] in {"Um", "UNK"}:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
min=False, old_cat=args.old_cats)
# Overwrite the pseudo correction and set it in the edit
edit.c_toks = annotator.parse(gold_edit[-1])
# Save the edit for auto
det_edits.append(edit)
# Write the edit for gold
if args.gold:
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Gold annotation
elif args.gold:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
not args.no_min, args.old_cats)
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Auto annotations
if args.auto:
# Auto edits
edits = annotator.annotate(orig, cor, args.lev, args.merge)
# Combine detection and auto edits and sort by orig offsets
edits = sorted(det_edits+edits, key=lambda e:(e.o_start, e.o_end))
# Write the edits to the output M2 file
for edit in edits:
out_m2.write(edit.to_m2(id)+"\n")
# Write a newline when there are no more edits
out_m2.write("\n")
# Open the m2 file and split it into text+edits blocks. Also open out_m2.
with open(args.m2_file) as m2, open(args.out, "w") as out_m2:
# Store the current m2_block here
m2_block = []
# Loop through m2 lines
for line in m2:
line = line.strip()
# If the line isn't empty, add it to the m2_block
if line: m2_block.append(line)
# Otherwise, process the complete blocks
else:
# Write the original text to the output M2 file
out_m2.write(m2_block[0]+"\n")
# Parse orig with spacy
orig = annotator.parse(m2_block[0][2:])
# Simplify the edits and sort by coder id
edit_dict = simplify_edits(m2_block[1:])
# Loop through coder ids
for id, raw_edits in sorted(edit_dict.items()):
# If the first edit is a noop
if raw_edits[0][2] == "noop":
# Write the noop and continue
out_m2.write(noop_edit(id)+"\n")
continue
# Apply the edits to generate the corrected text
# Also redefine the edits as orig and cor token offsets
cor, gold_edits = get_cor_and_edits(m2_block[0][2:], raw_edits)
# Parse cor with spacy
cor = annotator.parse(cor)
# Save detection edits here for auto
det_edits = []
# Loop through the gold edits
for gold_edit in gold_edits:
# Do not minimise detection edits
if gold_edit[-2] in {"Um", "UNK"}:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
min=False, old_cat=args.old_cats)
# Overwrite the pseudo correction and set it in the edit
edit.c_toks = annotator.parse(gold_edit[-1])
# Save the edit for auto
det_edits.append(edit)
# Write the edit for gold
if args.gold:
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Gold annotation
elif args.gold:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
not args.no_min, args.old_cats)
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Auto annotations
if args.auto:
# Auto edits
edits = annotator.annotate(orig, cor, args.lev, args.merge)
# Combine detection and auto edits and sort by orig offsets
edits = sorted(det_edits+edits, key=lambda e:(e.o_start, e.o_end))
# Write the edits to the output M2 file
for edit in edits:
out_m2.write(edit.to_m2(id)+"\n")
# Write a newline when there are no more edits
out_m2.write("\n")
# Reset the m2 block
m2_block = []

# Parse command line args
def parse_args():
9 changes: 2 additions & 7 deletions errant/commands/parallel_to_m2.py
@@ -8,13 +8,11 @@ def main():
print("Loading resources...")
# Load Errant
annotator = errant.load("en")
# Open output m2 file
out_m2 = open(args.out, "w")

print("Processing parallel files...")
# Process an arbitrary number of files line by line simultaneously. Python 3.3+
# See https://tinyurl.com/y4cj4gth
with ExitStack() as stack:
# See https://tinyurl.com/y4cj4gth . Also opens the output m2 file.
with ExitStack() as stack, open(args.out, "w") as out_m2:
in_files = [stack.enter_context(open(i)) for i in [args.orig]+args.cor]
# Process each line of all input files
for line in zip(*in_files):
@@ -45,9 +43,6 @@ def main():
out_m2.write(edit.to_m2(cor_id)+"\n")
# Write a newline when we have processed all corrections for each line
out_m2.write("\n")

# pr.disable()
# pr.print_stats(sort="time")

# Parse command line args
def parse_args():
(Diffs for the remaining changed files are not shown.)
