title | date | author | authorAvatar | tags | categories | image | draft | |||
---|---|---|---|---|---|---|---|---|---|---|
Spelling Error Detection using Deep Neural Networks |
2022-08-01 15:36:35 +0530 |
Stanley George |
img/ada.jpg |
|
img/writing.jpg |
false |
This project aims to detect spelling errors in a given sentence using a Deep Learning approach.
CONTENTS
- INTRODUCTION
The advent of computers into almost all spheres of human-life has made us more dependent on the keypad to convey information rather than filling out forms by hand. The use of a keypad (physical or virtual as in a smartphone) has enabled us to read legibly unlike the instance when one can't decipher the handwriting inscribed on a letter. But, human beings are still prone to make spelling errors. In this project, we try to detect spelling errors in an English language input text. - PROBLEM STATEMENT: Given a Sentence in English language as the input, output the words that have been incorrectly spelled.
- DATASETS In this project, we used two datasets for training and evaluation respectively. We decided to use two datasets so as to ensure our model has been able to generalise better instead of simply memorizing all words.
- Wikipedia Dataset In this project, the Wikipedia dataset (collected by the AD Chair) was used for training the models. The datasets were already split up as development, training and testing files - however we used only the training file for our model's training task.
- BEA-60k It is a dataset collected by the team who developed the NeuSpell spelling correction toolkit. It is made up of close to 70K spelling errors in around 63000 sentences.
- DATA ENCODING An important task when working with textual data for a Deep Learning model (or any Machine Learning model) is to decide on the encoding technique. We would need a mechanism by which we convert our textual data into a numerical format using which our models can learn the data and optimise itself. As part of this project we experimented with two such encoding techniques where are discussed below:
- Semi-Character Vector This technique was introduced in the 2017 paper by Sakaguchi et al. in their paper titled Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network. Human beings often can undertsand words in a sentence even if a word is misspelled.(E.g. Look at the last two sentences. There were a couple of misspellings, but you still figured it out !!)
- The boy cuold not slove the probelm so he aksed for help.
- The boy coudl not solev the problme so he askde for help.
- The boy oculd not oslve the rpoblem so he saked for help.
- Character level One-hot Encoding
- MODELS: As part of this project, we mainly used the LSTM networks for our analysis. Below we briefly introduce what are Recurrent Neural Networks (RNN) and LSTM networks.
- Recurrent Neural Networks: Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while having hidden states. They can be visusalised as below:
- Long Short-term Memory:
- DATASET PREPARATION:
- Experimental Focus: Our experiments were focused on two angles: context based and non-context based. As the name suggests, in the non-context based approach, the model is trained only on individual words. The input was not a sentence of words but just a single word. For the context based approach, the model was trained on the target word and the contextual words that appeared before and after it. For our experiments, the models were shown two context words before and after the target word.
- Training Dataset:
- Real-time Error generation:
- Replace a character with another random character
- Introduce an extra character at a random position
- Remove a character from a random position
- Evaluation Dataset:
- TRAINING: For this project, we used the popular machine learning Pytorch which is based on the Torch library. The models were trained on 4 Nvidia Titan X (Pascal) GPUs. This can be done easily (i.e. using multiple GPUs for Parallel training), thanks to Pytorch's nn.DataParallel module. A code snippet can be found here showing how its implemented.
- RESULTS: The table below shows the accuracy that was achieved on the test dataset and the corresponding F1 Score. Further we also have plotted the Confusion Matrix for the same dataset. It is clearly evident that Context based models outperform non-context based model.
- OBSERVATIONS:
- Does Context really Matter ?
- One-Hot Encoding OR Semi-Character Encoding ? A closer look at the training plots placed in Fig 4 shows the orange line reaching better metric levels faster than the semi-character encodings. A possible reason for this behaviour lies in the underlying concept of one-hot vectors. One-hot vectors are simply character-by-character encoding with no special technique applied.
- Does the Case matter ? At the beginning of writing my code, I was under the assumption that the models would be happy with just the spellings. But just a trial of removing the \({.lower()}\) function resulted in a huge spike in the validation accuracy (~10%).
- THE MILLION DOLLAR QUESTION: Let's discuss some advantages of these models:
- It can detect spelling errors
- For context-based models, in many situations, it highlights error words which aren't error but hinting towards a grammatical mistake or a missing filler word.
- It produces False-Positives.(mostly prevalent in the context-based approach)
- The one-hot model classifies the first word mostly as Positive
- It doesn't consider punctuations.
- Train on a bigger dataset.
- Expand the alphabet to wisely include punctuations as part of the context.
- A closer look at the results of the test dataset showed some scenarios where only either of the model was correct. Maybe an ensemble based approach might work here.
Motivation to choose it :Since Wikipedia is a public website, we expect less spelling mistakes. Moreover, it also offers articles from a wide variety of topics. This is in contrast to providing as input a JSON file containing an open-sourced book. Even though the contents would be well-proofed from a grammatical and spelling error perspective, it however will be limited to a small context of words on which the book has been based.
Also, as part of the initial investigations, some other datasets too were considered which can also serve the same purpose. Some of them that can be used by future students who are conducting research on similar areas are:
For this project, we splitted the data equally (50-50) and was used as our validation and test dataset.
Here the authors built on top of the research conducted by other Scientists (psycholinguists) to understand the level of difficulty faced by humans to decipher a word based on the position at which a spelling error occurred in a word.
Consider the below three sentences:
It was found that reading sentence (1) and (2) were comparetively easier than (3) because it was the BEGinning characters that were jumbled. In case of (1), the INTernal words were swapped whereas in (2) the ENDing characters were altered. Overall, the difficulty in reading jumbled words can be summarized as: N ≤ INT < END < BEG where N denotes No error. Thus, we see that the beginning letter and the ending letter have more importance in human word recognition.
We use this principle to construct a word vector which is made up of three sub-vectors (\(b_{n},i_{n},e_{n})\) that correspond to the characters position. The first and third sub-vectors represent the first and last character of the n-th word. They are analogous to the one-hot representation technique which is popular in the field of Deep Learning. The second sub-vector \(i_{n}\) represents the character count of each word except the first and last word. Refer the below figure for a sample word vector for the word 'Dictionary'.
![](/georgestanley/Spelling-Error-DeepNLP/raw/main/assets/semi_character_example.png)
So, considering we have a five word sentence (e.g. My favourite dictionary is Oxford) as input to the model with a Vocabulary set of 52 elements (English alphabets in lower and upper case), we have a word vector of shape 5* 156 (5 words * (\(b_{n}+i_{n}+e_{n}\))) which will be passed on as the input to our model.
One-hot encoding is often used in the field of Machine learning to encode categorical data into numerical data. It is a binary representation vector wherein a categorical class' index is assigned a positive value (1) and rest all indexes have a negative (0) value. A brief tutorial of this concept can be found here.
However, we can extend this concept further and use it to encode a Word at Character level where our Vocabulary set is made up of the alphabets of the language. This is same as the bn and en vectors created in section 4.i
So, considering the same five word sentence used for the Semi-Character example (i.e. My favourite dictionary is Oxford) , we will have a One-hot vector of shape 33*53 where 33 is length of the sentence and 53 is the length of the vocabulary. This means that every single character has a vector associated with it. In the vocabulary set we add an additional vector to account for Spaces in the sentence. Hence, the vocabulary length now becomes 52+1 = 53.
![](/georgestanley/Spelling-Error-DeepNLP/raw/main/assets/rnn_demo_stanford.png)
Such networks are mainly used in the field of Natural language processing and Speech recognition. However, RNN networks also experience a phenomena known as exploding/vanishing gradient problem. It happens when the network finds it difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers.
LSTMs solve the problem of vanishing gradient by introducing gates inside the network that regulate the flow of information within a cell. Gates are a way to optionally let information through. They are composed out of a sigmoid neural net layer and a pointwise multiplication operation. Due to this, networks can now learn from sequences that are of longer length.
If you would like to learn more about LSTM and how they function, check out this well explained blog by Chris Olah Understanding LSTM Networks where he explains how each LSTM modules operate with good visual cues.
Note: The project began by experimenting on Multi-Layer Perceptrons (MLP) and Recurrent Neural Networks(RNN). However, since the results in the initial stages were not very promising, we dedicated more focus towards LSTM networks. Hence, only LSTM evaluation results are listed below.
Example: In the sentence We need suffcient carbohydrates in our body , for the target word suffcient, the input sentence would be We need suffcient carbohydrates in i.e. the two sequential words before and after the target.
For the non-context based approach, we extracted all the individual words from 200,000 Wikipedia articles and filtered out the words which occurred less than 20 times. This additional filtering had to be done to remove words that were misspelt or words that occured very rarely in the encyclopedia. The final training dataset contained 147,011 words.
As for the context based approach,we didn't need to do any extra preparation tasks. However, due to the limitations with GPU compute, we trained only on a randomly selected corpus of 5000 wikipedia articles. The final dataset contained 1,743,076 5-gram pairs.
Additionally, for the one-hot encoding technique, we also needed to decide on the length of the one-hot encoded vector. For the same, we plotted a distribution of the length of 5-word sentences of the entire dataset (Fig 3). Based on the results, we decided to set 60 characters as the maximum length of the vector. So, any 5-word sentences greater than 60 characters would be trimmed to 60 characters and sentences that are lesser than 60 characters would be given extra right-end paddings.
![](/georgestanley/Spelling-Error-DeepNLP/raw/main/assets/img_1.png)
Since we are following a Supervised learning approach, our models need to be trained to distinguish between Positive and Negative Samples by training it on Positive and Negative Samples. However, we assume that our dataset is a clean dataset i.e. without any errors. Hence, we introduce errors manually during the training epochs.
We introduce randomly either of the below three error on the target word:
Due to this, in every epoch, the network sees a different negative word thereby avoiding a possible overfitting .
The BEA-60k dataset was modified as collection of positive and negative sample of 5-word sentences. So,the final dataset size was <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-46ru{background-color:#96fffb;border-color:inherit;text-align:left;vertical-align:top} .tg .tg-0pky{border-color:inherit;text-align:left;vertical-align:top} .tg .tg-fymr{border-color:inherit;font-weight:bold;text-align:left;vertical-align:top} .tg .tg-elvq{background-color:#fffc9e;border-color:inherit;text-align:left;vertical-align:top} </style>
Samples With Error | Samples Without Error | Total Samples | ||
---|---|---|---|---|
Context-Based (i.e. Sentences) | 10,865 | 10,781 | 21,646 | Validation Set |
11,220 | 11,202 | 22,422 | Test Set | |
Non-Context (i.e. Words) | 13,899 | 6,384 | 20,283 | Validation Set |
14,299 | 6,671 | 20,970 | Test Set |
The below table lists some important hyperparameters used during the training.
Fig 4. shows the some of the important training metrics.
LSTM without context | LSTM with context | LSTM with context | |
---|---|---|---|
Encoding | Semi-Character | Semi-Character | One-hot encoding |
Learning Rate | 0.01 | 0.001 | 0.001 |
Hidden Dimension | 256 | 512 | 512 |
Optimizer | ADAM | ADAM | ADAM |
Loss | Cross Entropy Loss | Cross Entropy Loss | Cross Entropy Loss |
Training time(per Epoch) | 1 min | 16 min | 20 min |
Dataset Size | ~150k words | 5000 Wikipedia articles | 5000 Wikipedia articles |
![](/georgestanley/Spelling-Error-DeepNLP/raw/main/assets/pic_all_graphs.png)
Accuracy | F1 Score | |
---|---|---|
LSTM without Context | 76.01% | 0.798 |
LSTM with Context (Semi-Character) | 87.90% | 0.879 |
LSTM with Context (One-hot) | 87.77% | 0.874 |
![](/georgestanley/Spelling-Error-DeepNLP/raw/main/assets/cm_all_Screenshot.png)
Some sample evaluations:
ID | Sentence | Semi-Character Encoding with Context | One-hot Encoding with context | Semi-Charcter Encoding without Context |
---|---|---|---|---|
1 | understant | - | understant | - |
2 | The quick brown fox jumps Over the lazy dog | fox | The,quick,fox | over |
3 | We need to appreciat the developer for his efforts | appreciat | We,appreciat | - |
4 | An impartial invstigation into the crash was conducted by the agency | invstigation | An,impartial,invstigation | invstigation,was |
5 | for no apparant reason she laughed | apparant | for,no,apparant | - |
6 | Students must focus on their Pronanciation skills for better grades | Pronanciation | - | - |
7 | Students must focus on their pronuncation skills for better grades | pronuncation | pronuncation | - |
As seen from the Accuracy and F1 score in Section 7 Table 3 or by looking at the examples in Table 4 or on playing with the console (or webapp), one can easily see that having context improves the score. Let us now look at some examples from Table 4.
Consider the third example (We need to appreciat the developer for his efforts). Here, only the context-based models detected the error word. Something similar happens in the fifth example.
Similarly, the word 'appresiate'. The non-context classifier classified it as Negative. However, both Context based classifiers classified it as Positive when provided in a contextual sentence as 'I really appresiate my host'.
So, YES, the context matters !!
On the other hand, the semi-character encoding was a special kind of algorithm for which the model took time to learn but once it understood the encoding, it started producing good results. Hence the model learnt faster.
This however was also accompanied by an increased compute time and a large memory usage (as observed during the training process). The examples shown in Table 4 show certain errors which were detected only by either of these models.
Hence, we can't decisively say which one is supreme.
Hence, any spelling detection model should be provided with the actual case in which it was written !!
Now, what are some of the major problems:
So, I would honestly say NO !!
What can be done to make it better:
I would like to thank Mr. Matthias Hertel for supervising this project and providing valuable suggestions and prompt responses whenever approached.