---
title: Spelling Error Detection using Deep Neural Networks
date: 2022-08-01 15:36:35 +0530
author: Stanley George
authorAvatar: img/ada.jpg
tags: ["deep learning", "spelling error", "LSTM"]
categories: []
image: img/writing.jpg
draft: false
---

This project aims to detect spelling errors in a given sentence using a Deep Learning approach.

1. INTRODUCTION

    The advent of computers into almost all spheres of human life has made us more dependent on the keypad to convey information, rather than filling out forms by hand. Typing on a keypad (physical, or virtual as on a smartphone) produces legible text, unlike handwriting that can sometimes be hard to decipher. Human beings, however, are still prone to making spelling errors. In this project, we try to detect spelling errors in English-language input text.

2. PROBLEM STATEMENT

    Given a sentence in the English language as input, output the words that have been spelled incorrectly.

3. DATASETS

    In this project, we used two datasets, one for training and one for evaluation. We decided to use two datasets to ensure that our model generalises well instead of simply memorizing all words.

    i. Wikipedia Dataset

      The Wikipedia dataset (collected by the AD Chair) was used for training the models. The dataset was already split into development, training and test files; however, we used only the training file for our model's training task.

      Motivation to choose it: Since Wikipedia is a public website, we expect few spelling mistakes. Moreover, it offers articles on a wide variety of topics. This is in contrast to providing as input, say, a JSON file containing an open-sourced book: even though its contents would be well proofread from a grammatical and spelling perspective, it would be limited to the narrow vocabulary and context on which the book is based.

      As part of the initial investigations, some other datasets that could serve the same purpose were also considered; they may be useful to future students conducting research in similar areas.

    ii. BEA-60k

      This dataset was collected by the team that developed the NeuSpell spelling correction toolkit. It contains close to 70K spelling errors in around 63,000 sentences.

      For this project, we split this data equally (50-50) into a validation set and a test set.

4. DATA ENCODING

    An important task when working with textual data for a Deep Learning model (or any Machine Learning model) is deciding on the encoding technique. We need a mechanism to convert our textual data into a numerical format from which our models can learn and optimise themselves. As part of this project we experimented with two such encoding techniques, which are discussed below.

    i. Semi-Character Vector

      This technique was introduced by Sakaguchi et al. in their 2017 paper titled Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network. Human beings can often undertsand words in a sentence even if a word is misspelled. (E.g. look at the last two sentences: there were a couple of misspellings, but you still figured them out!)


      Here the authors built on research conducted by psycholinguists into how difficult it is for humans to decipher a word depending on the position within the word at which the spelling error occurs.

      Consider the below three sentences:

      1. The boy cuold not slove the probelm so he aksed for help.
      2. The boy coudl not solev the problme so he askde for help.
      3. The boy oculd not oslve the rpoblem so he saked for help.

      It was found that reading sentences (1) and (2) was comparatively easier than reading (3), because in (3) it is the BEGinning characters that are jumbled; in (1) the INTernal characters are swapped, whereas in (2) the ENDing characters are altered. Overall, the difficulty of reading jumbled words can be summarized as N ≤ INT < END < BEG, where N denotes No error. Thus, we see that the beginning and ending letters carry the most weight in human word recognition.

      We use this principle to construct a word vector made up of three sub-vectors \((b_{n}, i_{n}, e_{n})\) that correspond to character positions. The first and third sub-vectors represent the first and last character of the n-th word; they are analogous to the one-hot representation technique which is popular in the field of Deep Learning. The second sub-vector \(i_{n}\) contains the counts of the word's characters, excluding its first and last character. Refer to the figure below for a sample word vector for the word 'Dictionary'.

      Fig 1. An example of the Semi-Character Vector generated for the word Dictionary

      So, if we have a five-word sentence (e.g. My favourite dictionary is Oxford) as input to the model, with a vocabulary set of 52 elements (the English alphabet in lower and upper case), we obtain a word vector of shape 5 × 156 (5 words × \((b_{n}+i_{n}+e_{n})\) = 5 × (52+52+52)), which is passed as the input to our model.
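      To make the encoding concrete, here is a minimal sketch of how such a semi-character vector could be built with NumPy. The alphabet, the function name `semi_character_vector`, and the handling of short words are illustrative assumptions, not the project's actual code.

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # 52 characters (assumed vocabulary)
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def semi_character_vector(word: str) -> np.ndarray:
    """Build the (b_n, i_n, e_n) vector of length 3 * 52 = 156 for one word."""
    b = np.zeros(len(ALPHABET))  # one-hot of the first character
    i = np.zeros(len(ALPHABET))  # counts of the internal characters
    e = np.zeros(len(ALPHABET))  # one-hot of the last character

    b[CHAR_TO_IDX[word[0]]] = 1
    e[CHAR_TO_IDX[word[-1]]] = 1
    for ch in word[1:-1]:        # every character except the first and last
        i[CHAR_TO_IDX[ch]] += 1
    return np.concatenate([b, i, e])

# A five-word sentence then becomes a 5 x 156 matrix:
sentence = "My favourite dictionary is Oxford"
matrix = np.stack([semi_character_vector(w) for w in sentence.split()])
print(matrix.shape)  # (5, 156)
```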

    ii. Character-level One-hot Encoding

      One-hot encoding is often used in the field of Machine Learning to encode categorical data as numerical data. It is a binary representation vector in which the index of a categorical class is assigned the value 1 and all other indexes the value 0. A brief tutorial of this concept can be found here.

      However, we can extend this concept and use it to encode a word at the character level, where our vocabulary set is made up of the alphabet of the language. This is the same as the \(b_{n}\) and \(e_{n}\) vectors created in section 4.i.

      So, considering the same five-word sentence used for the Semi-Character example (i.e. My favourite dictionary is Oxford), we obtain a one-hot matrix of shape 33 × 53, where 33 is the length of the sentence in characters and 53 is the size of the vocabulary. This means that every single character has a vector associated with it. To the vocabulary set we add one additional entry to account for spaces in the sentence; hence the vocabulary size becomes 52 + 1 = 53.
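      Below is a minimal sketch of this character-level one-hot encoding, assuming a vocabulary of 52 letters plus a space symbol; the function name is illustrative.

```python
import string
import numpy as np

VOCAB = string.ascii_lowercase + string.ascii_uppercase + " "  # 53 symbols (space included)
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def one_hot_encode(sentence: str) -> np.ndarray:
    """Return a (len(sentence), 53) matrix with one one-hot row per character."""
    mat = np.zeros((len(sentence), len(VOCAB)))
    for pos, ch in enumerate(sentence):
        mat[pos, CHAR_TO_IDX[ch]] = 1
    return mat

print(one_hot_encode("My favourite dictionary is Oxford").shape)  # (33, 53)
```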

5. MODELS

    As part of this project, we mainly used LSTM networks for our analysis. Below we briefly introduce Recurrent Neural Networks (RNNs) and LSTM networks.

    i. Recurrent Neural Networks

      Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. They can be visualised as below:

      Fig 2. A block diagram of a generic RNN network. Source: Stanford CS230 lecture notes

      Such networks are mainly used in the fields of Natural Language Processing and Speech Recognition. However, RNNs also suffer from a phenomenon known as the exploding/vanishing gradient problem: the network finds it difficult to capture long-term dependencies because multiplicative gradients can decrease or increase exponentially with the number of layers.

    ii. Long Short-Term Memory

      LSTMs address the vanishing gradient problem by introducing gates inside the network that regulate the flow of information within a cell. Gates are a way to optionally let information through; they are composed of a sigmoid neural-net layer and a pointwise multiplication operation. Thanks to this, the network can learn from longer sequences.

      If you would like to learn more about LSTMs and how they function, check out Chris Olah's well-explained blog post Understanding LSTM Networks, where he explains how each LSTM module operates with good visual cues.

    Note: The project began by experimenting with Multi-Layer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs). However, since the results in the initial stages were not very promising, we dedicated more focus to LSTM networks. Hence, only LSTM evaluation results are listed below.
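    For readers who want a concrete picture of the kind of model used, here is a minimal sketch of an LSTM binary classifier in PyTorch. The class name, layer sizes, and the use of the last hidden state for classification are illustrative assumptions, not the project's exact architecture.

```python
import torch
import torch.nn as nn

class SpellingErrorLSTM(nn.Module):
    """Minimal LSTM classifier: encodes a sequence of word/character vectors and
    predicts whether the target word is misspelled (1) or not (0)."""

    def __init__(self, input_dim: int = 156, hidden_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])   # logits: (batch, num_classes)

model = SpellingErrorLSTM()
dummy = torch.randn(8, 5, 156)            # batch of 8 five-word semi-character inputs
print(model(dummy).shape)                 # torch.Size([8, 2])
```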

6. DATASET PREPARATION

    i. Experimental Focus

      Our experiments focused on two angles: context-based and non-context-based. As the name suggests, in the non-context-based approach the model is trained only on individual words; the input is not a sentence but a single word. For the context-based approach, the model is trained on the target word together with the contextual words that appear before and after it. In our experiments, the model was shown two context words before and two after the target word.

      Example: In the sentence We need suffcient carbohydrates in our body, for the target word suffcient the input would be We need suffcient carbohydrates in, i.e. the two words before and the two words after the target (a small sketch of this windowing is shown below).
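      A minimal sketch of how such 5-gram windows could be extracted from a tokenized sentence; the function name and the truncation at sentence boundaries are assumptions for illustration.

```python
def context_windows(tokens, window: int = 2):
    """Yield (target, 5-gram) pairs: each target word with `window` words
    of context on either side (truncated at the sentence boundaries)."""
    for idx, target in enumerate(tokens):
        left = tokens[max(0, idx - window): idx]
        right = tokens[idx + 1: idx + 1 + window]
        yield target, left + [target] + right

sentence = "We need suffcient carbohydrates in our body".split()
for target, gram in context_windows(sentence):
    if target == "suffcient":
        print(" ".join(gram))  # We need suffcient carbohydrates in
```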

    ii. Training Dataset

      For the non-context-based approach, we extracted all individual words from 200,000 Wikipedia articles and filtered out the words that occurred fewer than 20 times. This additional filtering had to be done to remove words that were misspelt or occurred very rarely in the encyclopedia. The final training dataset contained 147,011 words (a minimal sketch of this filtering follows).
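      A sketch of this frequency filtering, assuming the articles are already tokenized into lists of words; the threshold of 20 comes from the text, everything else is illustrative.

```python
from collections import Counter

def build_vocabulary(articles, min_count: int = 20):
    """Count every word across all articles and keep those seen at least `min_count` times."""
    counts = Counter()
    for article in articles:   # each article: a list of word tokens
        counts.update(article)
    return {word for word, c in counts.items() if c >= min_count}

# vocabulary = build_vocabulary(tokenized_wikipedia_articles)
```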

      As for the context-based approach, we didn't need any extra preparation. However, due to limited GPU compute, we trained only on a randomly selected corpus of 5000 Wikipedia articles. The final dataset contained 1,743,076 5-gram pairs.

      Additionally, for the one-hot encoding technique, we also needed to decide on the length of the one-hot encoded input. To do so, we plotted the distribution of the character length of the 5-word sentences in the entire dataset (Fig 3). Based on the results, we set the maximum length to 60 characters: any 5-word sentence longer than 60 characters is trimmed to 60 characters, and shorter sentences are padded on the right.

      Fig 3. Histogram of the character length of the 5-word sentences in the entire dataset.
    iii. Real-time Error Generation

      Since we follow a supervised learning approach, our models need to be trained on both positive and negative samples in order to distinguish between them. However, we assume that our dataset is clean, i.e. free of errors. Hence, we introduce errors manually during the training epochs.

      We randomly introduce one of the following three errors into the target word (a small sketch follows this list):

      • Replace a character with another random character
      • Introduce an extra character at a random position
      • Remove a character from a random position

      Due to this, the network sees a different negative sample in every epoch, thereby reducing the risk of overfitting.
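      A minimal sketch of such on-the-fly error injection; the function name, the alphabet used for replacements and insertions, and the uniform choice among the three error types are illustrative assumptions.

```python
import random
import string

def corrupt(word: str) -> str:
    """Apply one random error: replace, insert, or delete a character."""
    pos = random.randrange(len(word))
    error_type = random.choice(["replace", "insert", "delete"])
    if error_type == "replace":
        return word[:pos] + random.choice(string.ascii_lowercase) + word[pos + 1:]
    if error_type == "insert":
        return word[:pos] + random.choice(string.ascii_lowercase) + word[pos:]
    return word[:pos] + word[pos + 1:]   # delete

print(corrupt("sufficient"))  # e.g. 'suficient', 'sufficiemt', 'suffxicient', ...
```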

    iv. Evaluation Dataset

      The BEA-60k dataset was converted into a collection of positive and negative samples of 5-word sentences. The resulting dataset sizes are listed below.

      |                                |                | Samples With Error | Samples Without Error | Total Samples |
      |--------------------------------|----------------|--------------------|-----------------------|---------------|
      | Context-Based (i.e. Sentences) | Validation Set | 10,865             | 10,781                | 21,646        |
      |                                | Test Set       | 11,220             | 11,202                | 22,422        |
      | Non-Context (i.e. Words)       | Validation Set | 13,899             | 6,384                 | 20,283        |
      |                                | Test Set       | 14,299             | 6,671                 | 20,970        |

      Table 1. Validation and test dataset sizes.
7. TRAINING

    For this project, we used the popular machine learning framework PyTorch, which is based on the Torch library. The models were trained on 4 Nvidia Titan X (Pascal) GPUs. Training in parallel across multiple GPUs is straightforward thanks to PyTorch's nn.DataParallel module. A code snippet can be found here showing how it is implemented; a rough sketch is also given below.
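    As a rough illustration of multi-GPU training with nn.DataParallel (the placeholder model below stands in for the project's LSTM classifier; it is not the project's code):

```python
import torch
import torch.nn as nn

# Placeholder model; in the project this would be the LSTM spelling classifier.
model = nn.Sequential(nn.Linear(156, 512), nn.ReLU(), nn.Linear(512, 2))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model across all visible GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Training then proceeds as usual: DataParallel splits each input batch across the
# GPUs, runs the forward passes in parallel, and gathers the outputs on the default device.
```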
    The table below lists some important hyperparameters used during training.

    Fig 4 shows some of the important training metrics.

    |                           | LSTM without context | LSTM with context       | LSTM with context       |
    |---------------------------|----------------------|-------------------------|-------------------------|
    | Encoding                  | Semi-Character       | Semi-Character          | One-hot encoding        |
    | Learning Rate             | 0.01                 | 0.001                   | 0.001                   |
    | Hidden Dimension          | 256                  | 512                     | 512                     |
    | Optimizer                 | ADAM                 | ADAM                    | ADAM                    |
    | Loss                      | Cross Entropy Loss   | Cross Entropy Loss      | Cross Entropy Loss      |
    | Training time (per epoch) | 1 min                | 16 min                  | 20 min                  |
    | Dataset Size              | ~150k words          | 5000 Wikipedia articles | 5000 Wikipedia articles |

    Table 2. Important hyperparameters used during training.

    Fig 4. Some important metrics of the training phase.
8. RESULTS

    The table below shows the accuracy achieved on the test dataset and the corresponding F1 score. We have also plotted the confusion matrix for the same dataset. It is clearly evident that the context-based models outperform the non-context-based model.

    |                                    | Accuracy | F1 Score |
    |------------------------------------|----------|----------|
    | LSTM without Context               | 76.01%   | 0.798    |
    | LSTM with Context (Semi-Character) | 87.90%   | 0.879    |
    | LSTM with Context (One-hot)        | 87.77%   | 0.874    |

    Table 3. Accuracy and F1 Score on the test dataset.

    Fig 5. Confusion Matrix for the three models.

    Some sample evaluations:

    | ID | Sentence | Semi-Character Encoding with Context | One-hot Encoding with Context | Semi-Character Encoding without Context |
    |----|----------|--------------------------------------|-------------------------------|-----------------------------------------|
    | 1  | understant | - | understant | - |
    | 2  | The quick brown fox jumps Over the lazy dog | fox | The, quick, fox | over |
    | 3  | We need to appreciat the developer for his efforts | appreciat | We, appreciat | - |
    | 4  | An impartial invstigation into the crash was conducted by the agency | invstigation | An, impartial, invstigation | invstigation, was |
    | 5  | for no apparant reason she laughed | apparant | for, no, apparant | - |
    | 6  | Students must focus on their Pronanciation skills for better grades | Pronanciation | - | - |
    | 7  | Students must focus on their pronuncation skills for better grades | pronuncation | pronuncation | - |

    Table 4. Certain demonstrative examples comparing the performance of the three models.
9. OBSERVATIONS

    • Does Context really Matter?

      As seen from the accuracy and F1 scores in Table 3, from the examples in Table 4, or by playing with the console (or webapp), one can easily see that having context improves the score. Let us look at some examples from Table 4.

      Consider the third example (We need to appreciat the developer for his efforts). Here, only the context-based models detected the error word. Something similar happens in the fifth example.

      Similarly, consider the word 'appresiate'. The non-context classifier classified it as negative (no error). However, both context-based classifiers classified it as positive when it was provided within the contextual sentence 'I really appresiate my host'.

      So, YES, the context matters!

    • One-Hot Encoding OR Semi-Character Encoding?

      A closer look at the training plots in Fig 4 shows the one-hot model (orange line) reaching better metric levels faster than the semi-character model. A possible reason for this behaviour lies in the underlying concept of one-hot vectors: they are a simple character-by-character encoding with no special technique applied, so the model learns them faster.

      The semi-character encoding, on the other hand, is a more specialised representation which the model took longer to learn; but once it had understood the encoding, it also started producing good results.

      The faster learning of the one-hot model, however, was accompanied by increased compute time and larger memory usage (as observed during the training process). Moreover, the examples in Table 4 show certain errors that were detected by only one of these models.

      Hence, we can't decisively say which encoding is superior.

    • Does the Case matter?

      At the beginning of writing my code, I was under the assumption that the models would be happy with just lower-cased spellings. But a trial run in which the `.lower()` call was removed resulted in a huge jump in validation accuracy (~10%).

      Hence, any spelling detection model should be given the text in the actual case in which it was written!

10. THE MILLION DOLLAR QUESTION

    "Will I use my spell-checker for my next big revolutionary Word Processing Software?"

    Let's discuss some advantages of these models:

    • They can detect spelling errors.
    • In many situations, the context-based models highlight words which are not misspelled but which hint at a grammatical mistake or a missing filler word.

    Now, what are some of the major problems?

    • They produce false positives (mostly prevalent in the context-based approach).
    • The one-hot model mostly classifies the first word as positive.
    • They don't consider punctuation.

    So, I would honestly say NO!!

    What can be done to make it better:

    • Train on a bigger dataset.
    • Expand the alphabet to sensibly include punctuation as part of the context.
    • A closer look at the test-dataset results showed some scenarios where only one of the models was correct. Maybe an ensemble-based approach might work here.

Acknowledgements:

I would like to thank Mr. Matthias Hertel for supervising this project and providing valuable suggestions and prompt responses whenever approached.