ERRANT v2.3.0
Christopher Bryant committed Jul 15, 2021
1 parent 6c0d521 commit 9111c6c
Showing 10 changed files with 349 additions and 125 deletions.
40 changes: 33 additions & 7 deletions CHANGELOG.md
@@ -2,6 +2,32 @@

This log describes all the changes made to ERRANT since its release.

## v2.3.0 (15-07-21)

1. Added some new rules to reduce the number of OTHER-type 1:1 edits by reclassifying them as more specific error types. Specifically, there are now ~40% fewer 1:1 OTHER edits and ~15% fewer n:n OTHER edits overall (tested on the FCE and W&I training sets combined). The changes are as follows:

* A possessive suffix at the start of a merge sequence is now always split:

| Example | people life -> people 's lives |
|---------|------------------------------------------------------------|
| Old | _life_ -> _'s lives_ (R:OTHER) |
| New | _ε_ -> _'s_ (M:NOUN:POSS), _life_ -> _lives_ (R:NOUN:NUM) |

* NUM <-> DET edits are now classified as R:DET; e.g. _one (cat)_ -> _a (cat)_. Thanks to [@katkorre](https://github.com/katkorre/ERRANT-reclassification)!

* Changed the string similarity score in the classifier from the Levenshtein ratio to the normalised Levenshtein distance based on the length of the longest input string. This is because we felt some ratio scores were unintuitive; e.g. _smt_ -> _something_ has a ratio score of 0.5 despite the insertion of 6 characters (the new normalised score is 0.33). A short sketch of both scores follows this list.

* The non-word spelling error rules were updated slightly to take the new normalised Levenshtein score into account. Additionally, dissimilar strings are now classified based on the POS tag of the correction rather than as OTHER; e.g. _amougnht_ -> _number_ (R:NOUN).

* The new normalised Levenshtein score is also used to classify many of the remaining 1:1 replacement edits that were previously classified as OTHER. Many of these are real-word spelling errors (e.g. _their_ <-> _there_), but there are also some morphological errors (e.g. _health_ -> _healthy_) and POS-based errors (e.g. _transport_ -> _travel_). Note that these rules are a little complex and depend on both the similarity score and the lengths of the original and corrected strings. For example, _form_ -> _from_ (R:SPELL) and _eventually_ -> _finally_ (R:ADV) both have the same similarity score of 0.5, yet are assigned different error types based on their string lengths.

2. Various minor updates:
* `out_m2` in `parallel_to_m2.py` and `m2_to_m2.py` is now opened and closed properly. [#20](https://github.com/chrisjbryant/errant/pull/20)
* Fixed a bracketing error that deleted a valid edit in rare circumstances. [#26](https://github.com/chrisjbryant/errant/issues/26) [#28](https://github.com/chrisjbryant/errant/issues/28)
* Updated the English wordlist.
* Minor changes to the readme.
* Tidied up some code comments.
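
As flagged in the similarity-score bullet above, here is a minimal sketch of the old and new scores. It uses the [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library that ERRANT already depends on; the exact formula ERRANT applies internally is an assumption based on this changelog entry.

```python
# A minimal sketch of the old (ratio) and new (normalised distance) scores;
# the exact internal formula is an assumption based on the changelog above.
import Levenshtein

def old_score(a, b):
    # Levenshtein ratio: indel distance normalised by the combined length
    return Levenshtein.ratio(a, b)

def new_score(a, b):
    # 1 - edit distance / length of the longest input string
    return 1 - Levenshtein.distance(a, b) / max(len(a), len(b))

print(old_score("smt", "something"))       # 0.5, despite 6 inserted characters
print(new_score("smt", "something"))       # ~0.33
print(new_score("form", "from"))           # 0.5: short strings -> R:SPELL
print(new_score("eventually", "finally"))  # 0.5: longer strings -> R:ADV
```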

## v2.2.3 (12-02-21)

1. Changed the dependency version requirements in `setup.py` since ERRANT v2.2.x is not compatible with spaCy 3.
@@ -27,13 +53,13 @@ Fixed key error in the classifier for rare spaCy 2 POS tags: _SP, BES, HVS.
1. The character level cost in the sentence alignment function is now computed by the much faster [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) library instead of Python's native `difflib.SequenceMatcher`. This makes ERRANT 3x faster! A brief comparison sketch follows the list below.

2. Various minor updates:
* Updated the English wordlist.
* Fixed a broken rule for classifying contraction errors.
* Changed a condition in the calculation of transposition errors to be more intuitive.
* Partially updated the ERRANT POS tag map to match the updated [Universal POS tag map](https://universaldependencies.org/tagset-conversion/en-penn-uposf.html). Specifically, EX now maps to PRON rather than ADV, LS maps to X rather than PUNCT, and CONJ has been renamed CCONJ. I did not change the mapping of RP from PART to ADP yet because this breaks several rules involving phrasal verbs.
* Added an `errant.__version__` attribute.
* Added a warning about using ERRANT with spaCy 2.
* Tidied some code in the classifier.
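
As a rough illustration of point 1 above, both libraries expose a comparable character-level ratio; the difference is speed, since python-Levenshtein is a C extension. The exact cost formula ERRANT uses in its alignment is an assumption here.

```python
# Sketch: two ways to score character-level similarity between a token pair.
# The exact cost formula ERRANT uses in its alignment is an assumption.
from difflib import SequenceMatcher
import Levenshtein

orig_tok, cor_tok = "transport", "travel"

slow = SequenceMatcher(None, orig_tok, cor_tok).ratio()  # pure Python
fast = Levenshtein.ratio(orig_tok, cor_tok)              # C implementation

print(slow, fast)  # both 0.4 for this pair
```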

## v2.0.0 (10-12-19)

11 changes: 6 additions & 5 deletions README.md
@@ -1,12 +1,12 @@
# ERRANT v2.2.3
# ERRANT v2.3.0

This repository contains the grammatical ERRor ANnotation Toolkit (ERRANT) described in:

> Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. [**Automatic annotation and evaluation of error types for grammatical error correction**](https://www.aclweb.org/anthology/P17-1074/). In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada.
> Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. [**Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments**](https://www.aclweb.org/anthology/C16-1079/). In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html).
If you make use of this code, please cite the above papers. More information about ERRANT can be found [here](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-938.html). In particular, see Chapter 5 for definitions of error types.

# Overview

@@ -40,11 +40,11 @@ python3 -m spacy download en
```
This will create and activate a new python3 environment called `errant_env` in the current directory. `pip` will then update some setup tools and install ERRANT, [spaCy](https://spacy.io/), [python-Levenshtein](https://pypi.org/project/python-Levenshtein/) and spaCy's default English model in this environment. You can deactivate the environment at any time by running `deactivate`, but must remember to activate it again whenever you want to use ERRANT.

#### ERRANT and spaCy 2
#### ERRANT and spaCy

ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7 however, and so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system to trade speed for accuracy, this means ERRANT v2.2.0 is **~4x slower** than ERRANT v2.1.0.
ERRANT was originally designed to work with spaCy v1.9.0 and works best with this version. SpaCy v1.9.0 does not work with Python >= 3.7 however, and so we were forced to update ERRANT to be compatible with spaCy 2. Since spaCy 2 uses a neural system to trade speed for accuracy, this means ERRANT v2.2 is **~4x slower** than ERRANT v2.1. We have not yet extended ERRANT to work with spaCy 3, but preliminary tests suggest ERRANT will become even slower.

There is no way around this if you use Python >= 3.7, but *we recommend installing ERRANT v2.1.0 if you use Python < 3.7*.
Consequently, we recommend ERRANT v2.1.0 if speed is a priority and you can use Python < 3.7.
```
pip3 install errant==2.1.0
```
@@ -64,6 +64,7 @@ git clone https://github.com/chrisjbryant/errant.git
cd errant
python3 -m venv errant_env
source errant_env/bin/activate
pip3 install -U pip setuptools wheel
pip3 install -e .
python3 -m spacy download en
```
2 changes: 1 addition & 1 deletion errant/__init__.py
@@ -3,7 +3,7 @@
from errant.annotator import Annotator

# ERRANT version
__version__ = '2.2.3'
__version__ = '2.3.0'

# Load an ERRANT Annotator object for a given language
def load(lang, nlp=None):
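For context, a minimal usage sketch of the `load` entry point above, mirroring how the command scripts below use the returned annotator; the example sentences are invented.

```python
# Minimal usage sketch of errant.load, mirroring the calls made in
# m2_to_m2.py and parallel_to_m2.py below; example sentences are invented.
import errant

annotator = errant.load("en")  # load spaCy and the English resources

orig = annotator.parse("This are a example .")
cor = annotator.parse("This is an example .")

for edit in annotator.annotate(orig, cor):
    # Each edit carries original/corrected strings and an error type
    print(edit.o_str, "->", edit.c_str, edit.type)
```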
122 changes: 64 additions & 58 deletions errant/commands/m2_to_m2.py
@@ -7,66 +7,72 @@ def main():
print("Loading resources...")
# Load Errant
annotator = errant.load("en")
# Open output M2 file
out_m2 = open(args.out, "w")

print("Processing M2 file...")
# Open the m2 file and split it into text+edit blocks
m2 = open(args.m2_file).read().strip().split("\n\n")
# Loop through the blocks
for m2_block in m2:
m2_block = m2_block.strip().split("\n")
# Write the original text to the output M2 file
out_m2.write(m2_block[0]+"\n")
# Parse orig with spacy
orig = annotator.parse(m2_block[0][2:])
# Simplify the edits and sort by coder id
edit_dict = simplify_edits(m2_block[1:])
# Loop through coder ids
for id, raw_edits in sorted(edit_dict.items()):
# If the first edit is a noop
if raw_edits[0][2] == "noop":
# Write the noop and continue
out_m2.write(noop_edit(id)+"\n")
continue
# Apply the edits to generate the corrected text
# Also redefine the edits as orig and cor token offsets
cor, gold_edits = get_cor_and_edits(m2_block[0][2:], raw_edits)
# Parse cor with spacy
cor = annotator.parse(cor)
# Save detection edits here for auto
det_edits = []
# Loop through the gold edits
for gold_edit in gold_edits:
# Do not minimise detection edits
if gold_edit[-2] in {"Um", "UNK"}:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
min=False, old_cat=args.old_cats)
# Overwrite the pseudo correction and set it in the edit
edit.c_toks = annotator.parse(gold_edit[-1])
# Save the edit for auto
det_edits.append(edit)
# Write the edit for gold
if args.gold:
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Gold annotation
elif args.gold:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
not args.no_min, args.old_cats)
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Auto annotations
if args.auto:
# Auto edits
edits = annotator.annotate(orig, cor, args.lev, args.merge)
# Combine detection and auto edits and sort by orig offsets
edits = sorted(det_edits+edits, key=lambda e:(e.o_start, e.o_end))
# Write the edits to the output M2 file
for edit in edits:
out_m2.write(edit.to_m2(id)+"\n")
# Write a newline when there are no more edits
out_m2.write("\n")
# Open the m2 file and split it into text+edits blocks. Also open out_m2.
with open(args.m2_file) as m2, open(args.out, "w") as out_m2:
# Store the current m2_block here
m2_block = []
# Loop through m2 lines
for line in m2:
line = line.strip()
# If the line isn't empty, add it to the m2_block
if line: m2_block.append(line)
# Otherwise, process the complete blocks
else:
# Write the original text to the output M2 file
out_m2.write(m2_block[0]+"\n")
# Parse orig with spacy
orig = annotator.parse(m2_block[0][2:])
# Simplify the edits and sort by coder id
edit_dict = simplify_edits(m2_block[1:])
# Loop through coder ids
for id, raw_edits in sorted(edit_dict.items()):
# If the first edit is a noop
if raw_edits[0][2] == "noop":
# Write the noop and continue
out_m2.write(noop_edit(id)+"\n")
continue
# Apply the edits to generate the corrected text
# Also redefine the edits as orig and cor token offsets
cor, gold_edits = get_cor_and_edits(m2_block[0][2:], raw_edits)
# Parse cor with spacy
cor = annotator.parse(cor)
# Save detection edits here for auto
det_edits = []
# Loop through the gold edits
for gold_edit in gold_edits:
# Do not minimise detection edits
if gold_edit[-2] in {"Um", "UNK"}:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
min=False, old_cat=args.old_cats)
# Overwrite the pseudo correction and set it in the edit
edit.c_toks = annotator.parse(gold_edit[-1])
# Save the edit for auto
det_edits.append(edit)
# Write the edit for gold
if args.gold:
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Gold annotation
elif args.gold:
edit = annotator.import_edit(orig, cor, gold_edit[:-1],
not args.no_min, args.old_cats)
# Write the edit
out_m2.write(edit.to_m2(id)+"\n")
# Auto annotations
if args.auto:
# Auto edits
edits = annotator.annotate(orig, cor, args.lev, args.merge)
# Combine detection and auto edits and sort by orig offsets
edits = sorted(det_edits+edits, key=lambda e:(e.o_start, e.o_end))
# Write the edits to the output M2 file
for edit in edits:
out_m2.write(edit.to_m2(id)+"\n")
# Write a newline when there are no more edits
out_m2.write("\n")
# Reset the m2 block
m2_block = []

# Parse command line args
def parse_args():
9 changes: 2 additions & 7 deletions errant/commands/parallel_to_m2.py
@@ -8,13 +8,11 @@ def main():
print("Loading resources...")
# Load Errant
annotator = errant.load("en")
# Open output m2 file
out_m2 = open(args.out, "w")

print("Processing parallel files...")
# Process an arbitrary number of files line by line simultaneously. Python 3.3+
# See https://tinyurl.com/y4cj4gth
with ExitStack() as stack:
# See https://tinyurl.com/y4cj4gth . Also opens the output m2 file.
with ExitStack() as stack, open(args.out, "w") as out_m2:
in_files = [stack.enter_context(open(i)) for i in [args.orig]+args.cor]
# Process each line of all input files
for line in zip(*in_files):
@@ -45,9 +43,6 @@ def main():
out_m2.write(edit.to_m2(cor_id)+"\n")
# Write a newline when we have processed all corrections for each line
out_m2.write("\n")

# pr.disable()
# pr.print_stats(sort="time")

# Parse command line args
def parse_args():
(Diffs for the remaining changed files are not shown.)
