---
title: Spelling Error Detection using Deep Neural Networks
date: 2022-08-01 15:36:35 +0530
author: Stanley George
authorAvatar: img/ada.jpg
tags: ["deep learning", "spelling error", "LSTM"]
categories: []
image: img/writing.jpg
draft: false
---

This project aims to detect spelling errors in a given sentence using a Deep Learning approach.

1. INTRODUCTION

    The advent of computers into almost all spheres of human life has made us more dependent on the keypad to convey information, rather than filling out forms by hand. Typing on a keypad (physical, or virtual as on a smartphone) produces legible text, unlike handwriting that can sometimes be hard to decipher. Human beings, however, are still prone to making spelling errors. In this project, we try to detect spelling errors in English-language input text.

2. PROBLEM STATEMENT

    Given a sentence in the English language as input, output the words that have been spelled incorrectly.

3. DATASETS

    In this project, we used two datasets, one for training and one for evaluation. We decided to use two datasets to ensure that our model generalises well instead of simply memorizing all words.

    i. Wikipedia Dataset

      The Wikipedia dataset (collected by the AD Chair) was used for training the models. The dataset was already split into development, training and test files; however, we used only the training file for our model's training task.

      Motivation to choose it: Since Wikipedia is a public website, we expect few spelling mistakes. Moreover, it offers articles on a wide variety of topics. This is in contrast to providing as input, say, a JSON file containing an open-sourced book: even though its contents would be well proofread from a grammatical and spelling perspective, it would be limited to the narrow vocabulary and context on which the book is based.

      As part of the initial investigations, some other datasets that could serve the same purpose were also considered; they may be useful to future students conducting research in similar areas.

    ii. BEA-60k

      This dataset was collected by the team that developed the NeuSpell spelling correction toolkit. It contains close to 70K spelling errors in around 63,000 sentences.

      For this project, we split this data equally (50-50) into a validation set and a test set.

4. DATA ENCODING

    An important task when working with textual data for a Deep Learning model (or any Machine Learning model) is deciding on the encoding technique. We need a mechanism to convert our textual data into a numerical format from which our models can learn and optimise themselves. As part of this project we experimented with two such encoding techniques, which are discussed below.

    i. Semi-Character Vector

      This technique was introduced by Sakaguchi et al. in their 2017 paper titled Robsut Wrod Reocginiton via Semi-Character Recurrent Neural Network. Human beings can often undertsand words in a sentence even if a word is misspelled. (E.g. look at the last two sentences: there were a couple of misspellings, but you still figured them out!)


      Here the authors built on research conducted by psycholinguists into how difficult it is for humans to decipher a word depending on the position within the word at which the spelling error occurs.

      Consider the below three sentences:

      1. The boy cuold not slove the probelm so he aksed for help.
      2. The boy coudl not solev the problme so he askde for help.
      3. The boy oculd not oslve the rpoblem so he saked for help.

      It was found that reading sentences (1) and (2) was comparatively easier than reading (3), because in (3) it is the BEGinning characters that are jumbled; in (1) the INTernal characters are swapped, whereas in (2) the ENDing characters are altered. Overall, the difficulty of reading jumbled words can be summarized as N ≤ INT < END < BEG, where N denotes No error. Thus, we see that the beginning and ending letters carry the most weight in human word recognition.

      We use this principle to construct a word vector made up of three sub-vectors \((b_{n}, i_{n}, e_{n})\) that correspond to character positions. The first and third sub-vectors represent the first and last character of the n-th word; they are analogous to the one-hot representation technique which is popular in the field of Deep Learning. The second sub-vector \(i_{n}\) contains the counts of the word's characters, excluding its first and last character. Refer to the figure below for a sample word vector for the word 'Dictionary'.

      Fig 1. An example of the Semi-Character Vector generated for the word Dictionary

      So, if we have a five-word sentence (e.g. My favourite dictionary is Oxford) as input to the model, with a vocabulary set of 52 elements (the English alphabet in lower and upper case), we obtain a word vector of shape 5 × 156 (5 words × \((b_{n}+i_{n}+e_{n})\) = 5 × (52+52+52)), which is passed as the input to our model.
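      To make the encoding concrete, here is a minimal sketch of how such a semi-character vector could be built with NumPy. The alphabet, the function name `semi_character_vector`, and the handling of short words are illustrative assumptions, not the project's actual code.

```python
import string
import numpy as np

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # 52 characters (assumed vocabulary)
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def semi_character_vector(word: str) -> np.ndarray:
    """Build the (b_n, i_n, e_n) vector of length 3 * 52 = 156 for one word."""
    b = np.zeros(len(ALPHABET))  # one-hot of the first character
    i = np.zeros(len(ALPHABET))  # counts of the internal characters
    e = np.zeros(len(ALPHABET))  # one-hot of the last character

    b[CHAR_TO_IDX[word[0]]] = 1
    e[CHAR_TO_IDX[word[-1]]] = 1
    for ch in word[1:-1]:        # every character except the first and last
        i[CHAR_TO_IDX[ch]] += 1
    return np.concatenate([b, i, e])

# A five-word sentence then becomes a 5 x 156 matrix:
sentence = "My favourite dictionary is Oxford"
matrix = np.stack([semi_character_vector(w) for w in sentence.split()])
print(matrix.shape)  # (5, 156)
```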

    ii. Character-level One-hot Encoding

      One-hot encoding is often used in the field of Machine Learning to encode categorical data as numerical data. It is a binary representation vector in which the index of a categorical class is assigned the value 1 and all other indexes the value 0. A brief tutorial of this concept can be found here.

      However, we can extend this concept and use it to encode a word at the character level, where our vocabulary set is made up of the alphabet of the language. This is the same as the \(b_{n}\) and \(e_{n}\) vectors created in section 4.i.

      So, considering the same five-word sentence used for the Semi-Character example (i.e. My favourite dictionary is Oxford), we obtain a one-hot matrix of shape 33 × 53, where 33 is the length of the sentence in characters and 53 is the size of the vocabulary. This means that every single character has a vector associated with it. To the vocabulary set we add one additional entry to account for spaces in the sentence; hence the vocabulary size becomes 52 + 1 = 53.
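      Below is a minimal sketch of this character-level one-hot encoding, assuming a vocabulary of 52 letters plus a space symbol; the function name is illustrative.

```python
import string
import numpy as np

VOCAB = string.ascii_lowercase + string.ascii_uppercase + " "  # 53 symbols (space included)
CHAR_TO_IDX = {c: i for i, c in enumerate(VOCAB)}

def one_hot_encode(sentence: str) -> np.ndarray:
    """Return a (len(sentence), 53) matrix with one one-hot row per character."""
    mat = np.zeros((len(sentence), len(VOCAB)))
    for pos, ch in enumerate(sentence):
        mat[pos, CHAR_TO_IDX[ch]] = 1
    return mat

print(one_hot_encode("My favourite dictionary is Oxford").shape)  # (33, 53)
```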

5. MODELS

    As part of this project, we mainly used LSTM networks for our analysis. Below we briefly introduce Recurrent Neural Networks (RNNs) and LSTM networks.

    i. Recurrent Neural Networks

      Recurrent neural networks, also known as RNNs, are a class of neural networks that allow previous outputs to be used as inputs while maintaining hidden states. They can be visualised as below:

      Fig 2. A block diagram of a generic RNN network. Source: Stanford CS230 lecture notes

      Such networks are mainly used in the fields of Natural Language Processing and Speech Recognition. However, RNNs also suffer from a phenomenon known as the exploding/vanishing gradient problem: the network finds it difficult to capture long-term dependencies because multiplicative gradients can decrease or increase exponentially with the number of layers.

    ii. Long Short-Term Memory

      LSTMs address the vanishing gradient problem by introducing gates inside the network that regulate the flow of information within a cell. Gates are a way to optionally let information through; they are composed of a sigmoid neural-net layer and a pointwise multiplication operation. Thanks to this, the network can learn from longer sequences.

      If you would like to learn more about LSTMs and how they function, check out Chris Olah's well-explained blog post Understanding LSTM Networks, where he explains how each LSTM module operates with good visual cues.

    Note: The project began by experimenting with Multi-Layer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs). However, since the results in the initial stages were not very promising, we dedicated more focus to LSTM networks. Hence, only LSTM evaluation results are listed below.
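    For readers who want a concrete picture of the kind of model used, here is a minimal sketch of an LSTM binary classifier in PyTorch. The class name, layer sizes, and the use of the last hidden state for classification are illustrative assumptions, not the project's exact architecture.

```python
import torch
import torch.nn as nn

class SpellingErrorLSTM(nn.Module):
    """Minimal LSTM classifier: encodes a sequence of word/character vectors and
    predicts whether the target word is misspelled (1) or not (0)."""

    def __init__(self, input_dim: int = 156, hidden_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim)
        _, (h_n, _) = self.lstm(x)        # h_n: (1, batch, hidden_dim)
        return self.classifier(h_n[-1])   # logits: (batch, num_classes)

model = SpellingErrorLSTM()
dummy = torch.randn(8, 5, 156)            # batch of 8 five-word semi-character inputs
print(model(dummy).shape)                 # torch.Size([8, 2])
```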

6. DATASET PREPARATION

    i. Experimental Focus

      Our experiments focused on two angles: context-based and non-context-based. As the name suggests, in the non-context-based approach the model is trained only on individual words; the input is not a sentence but a single word. For the context-based approach, the model is trained on the target word together with the contextual words that appear before and after it. In our experiments, the model was shown two context words before and two after the target word.

      Example: In the sentence We need suffcient carbohydrates in our body, for the target word suffcient the input would be We need suffcient carbohydrates in, i.e. the two words before and the two words after the target (a small sketch of this windowing is shown below).
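      A minimal sketch of how such 5-gram windows could be extracted from a tokenized sentence; the function name and the truncation at sentence boundaries are assumptions for illustration.

```python
def context_windows(tokens, window: int = 2):
    """Yield (target, 5-gram) pairs: each target word with `window` words
    of context on either side (truncated at the sentence boundaries)."""
    for idx, target in enumerate(tokens):
        left = tokens[max(0, idx - window): idx]
        right = tokens[idx + 1: idx + 1 + window]
        yield target, left + [target] + right

sentence = "We need suffcient carbohydrates in our body".split()
for target, gram in context_windows(sentence):
    if target == "suffcient":
        print(" ".join(gram))  # We need suffcient carbohydrates in
```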

    ii. Training Dataset

      For the non-context-based approach, we extracted all individual words from 200,000 Wikipedia articles and filtered out the words that occurred fewer than 20 times. This additional filtering had to be done to remove words that were misspelt or occurred very rarely in the encyclopedia. The final training dataset contained 147,011 words (a minimal sketch of this filtering follows).
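      A sketch of this frequency filtering, assuming the articles are already tokenized into lists of words; the threshold of 20 comes from the text, everything else is illustrative.

```python
from collections import Counter

def build_vocabulary(articles, min_count: int = 20):
    """Count every word across all articles and keep those seen at least `min_count` times."""
    counts = Counter()
    for article in articles:   # each article: a list of word tokens
        counts.update(article)
    return {word for word, c in counts.items() if c >= min_count}

# vocabulary = build_vocabulary(tokenized_wikipedia_articles)
```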

      As for the context-based approach, we didn't need any extra preparation. However, due to limited GPU compute, we trained only on a randomly selected corpus of 5000 Wikipedia articles. The final dataset contained 1,743,076 5-gram pairs.

      Additionally, for the one-hot encoding technique, we also needed to decide on the length of the one-hot encoded input. To do so, we plotted the distribution of the character length of the 5-word sentences in the entire dataset (Fig 3). Based on the results, we set the maximum length to 60 characters: any 5-word sentence longer than 60 characters is trimmed to 60 characters, and shorter sentences are padded on the right.

      Fig 3. Histogram of the character length of the 5-word sentences in the entire dataset.
    iii. Real-time Error Generation

      Since we follow a supervised learning approach, our models need to be trained on both positive and negative samples in order to distinguish between them. However, we assume that our dataset is clean, i.e. free of errors. Hence, we introduce errors manually during the training epochs.

      We randomly introduce one of the following three errors into the target word (a small sketch follows this list):

      • Replace a character with another random character
      • Introduce an extra character at a random position
      • Remove a character from a random position

      Due to this, the network sees a different negative sample in every epoch, thereby reducing the risk of overfitting.
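      A minimal sketch of such on-the-fly error injection; the function name, the alphabet used for replacements and insertions, and the uniform choice among the three error types are illustrative assumptions.

```python
import random
import string

def corrupt(word: str) -> str:
    """Apply one random error: replace, insert, or delete a character."""
    pos = random.randrange(len(word))
    error_type = random.choice(["replace", "insert", "delete"])
    if error_type == "replace":
        return word[:pos] + random.choice(string.ascii_lowercase) + word[pos + 1:]
    if error_type == "insert":
        return word[:pos] + random.choice(string.ascii_lowercase) + word[pos:]
    return word[:pos] + word[pos + 1:]   # delete

print(corrupt("sufficient"))  # e.g. 'suficient', 'sufficiemt', 'suffxicient', ...
```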

    iv. Evaluation Dataset

      The BEA-60k dataset was converted into a collection of positive and negative samples of 5-word sentences. The resulting dataset sizes are listed below.

      |                                |                | Samples With Error | Samples Without Error | Total Samples |
      |--------------------------------|----------------|--------------------|-----------------------|---------------|
      | Context-Based (i.e. Sentences) | Validation Set | 10,865             | 10,781                | 21,646        |
      |                                | Test Set       | 11,220             | 11,202                | 22,422        |
      | Non-Context (i.e. Words)       | Validation Set | 13,899             | 6,384                 | 20,283        |
      |                                | Test Set       | 14,299             | 6,671                 | 20,970        |

      Table 1. Validation and test dataset sizes.
7. TRAINING

    For this project, we used the popular machine learning framework PyTorch, which is based on the Torch library. The models were trained on 4 Nvidia Titan X (Pascal) GPUs. Training in parallel across multiple GPUs is straightforward thanks to PyTorch's nn.DataParallel module. A code snippet can be found here showing how it is implemented; a rough sketch is also given below.
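    As a rough illustration of multi-GPU training with nn.DataParallel (the placeholder model below stands in for the project's LSTM classifier; it is not the project's code):

```python
import torch
import torch.nn as nn

# Placeholder model; in the project this would be the LSTM spelling classifier.
model = nn.Sequential(nn.Linear(156, 512), nn.ReLU(), nn.Linear(512, 2))

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicate the model across all visible GPUs
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Training then proceeds as usual: DataParallel splits each input batch across the
# GPUs, runs the forward passes in parallel, and gathers the outputs on the default device.
```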
    The table below lists some important hyperparameters used during training.

    Fig 4 shows some of the important training metrics.

    |                           | LSTM without context | LSTM with context       | LSTM with context       |
    |---------------------------|----------------------|-------------------------|-------------------------|
    | Encoding                  | Semi-Character       | Semi-Character          | One-hot encoding        |
    | Learning Rate             | 0.01                 | 0.001                   | 0.001                   |
    | Hidden Dimension          | 256                  | 512                     | 512                     |
    | Optimizer                 | ADAM                 | ADAM                    | ADAM                    |
    | Loss                      | Cross Entropy Loss   | Cross Entropy Loss      | Cross Entropy Loss      |
    | Training time (per epoch) | 1 min                | 16 min                  | 20 min                  |
    | Dataset Size              | ~150k words          | 5000 Wikipedia articles | 5000 Wikipedia articles |

    Table 2. Important hyperparameters used during training.

    Fig 4. Some important metrics of the training phase.
8. RESULTS

    The table below shows the accuracy achieved on the test dataset and the corresponding F1 score. We have also plotted the confusion matrix for the same dataset. It is clearly evident that the context-based models outperform the non-context-based model.

    |                                    | Accuracy | F1 Score |
    |------------------------------------|----------|----------|
    | LSTM without Context               | 76.01%   | 0.798    |
    | LSTM with Context (Semi-Character) | 87.90%   | 0.879    |
    | LSTM with Context (One-hot)        | 87.77%   | 0.874    |

    Table 3. Accuracy and F1 Score on the test dataset.

    Fig 5. Confusion Matrix for the three models.

    Some sample evaluations:

    | ID | Sentence | Semi-Character Encoding with Context | One-hot Encoding with Context | Semi-Character Encoding without Context |
    |----|----------|--------------------------------------|-------------------------------|-----------------------------------------|
    | 1  | understant | - | understant | - |
    | 2  | The quick brown fox jumps Over the lazy dog | fox | The, quick, fox | over |
    | 3  | We need to appreciat the developer for his efforts | appreciat | We, appreciat | - |
    | 4  | An impartial invstigation into the crash was conducted by the agency | invstigation | An, impartial, invstigation | invstigation, was |
    | 5  | for no apparant reason she laughed | apparant | for, no, apparant | - |
    | 6  | Students must focus on their Pronanciation skills for better grades | Pronanciation | - | - |
    | 7  | Students must focus on their pronuncation skills for better grades | pronuncation | pronuncation | - |

    Table 4. Certain demonstrative examples comparing the performance of the three models.
9. OBSERVATIONS

    • Does Context really Matter?

      As seen from the accuracy and F1 scores in Table 3, from the examples in Table 4, or by playing with the console (or webapp), one can easily see that having context improves the score. Let us look at some examples from Table 4.

      Consider the third example (We need to appreciat the developer for his efforts). Here, only the context-based models detected the error word. Something similar happens in the fifth example.

      Similarly, consider the word 'appresiate'. The non-context classifier classified it as negative (no error). However, both context-based classifiers classified it as positive when it was provided within the contextual sentence 'I really appresiate my host'.

      So, YES, the context matters!

    • One-Hot Encoding OR Semi-Character Encoding?

      A closer look at the training plots in Fig 4 shows the one-hot model (orange line) reaching better metric levels faster than the semi-character model. A possible reason for this behaviour lies in the underlying concept of one-hot vectors: they are a simple character-by-character encoding with no special technique applied, so the model learns them faster.

      The semi-character encoding, on the other hand, is a more specialised representation which the model took longer to learn; but once it had understood the encoding, it also started producing good results.

      The faster learning of the one-hot model, however, was accompanied by increased compute time and larger memory usage (as observed during the training process). Moreover, the examples in Table 4 show certain errors that were detected by only one of these models.

      Hence, we can't decisively say which encoding is superior.

    • Does the Case matter?

      At the beginning of writing my code, I was under the assumption that the models would be happy with just lower-cased spellings. But a trial run in which the `.lower()` call was removed resulted in a huge jump in validation accuracy (~10%).

      Hence, any spelling detection model should be given the text in the actual case in which it was written!

10. THE MILLION DOLLAR QUESTION

    "Will I use my spell-checker for my next big revolutionary Word Processing Software?"

    Let's discuss some advantages of these models:

    • They can detect spelling errors.
    • In many situations, the context-based models highlight words which are not misspelled but which hint at a grammatical mistake or a missing filler word.

    Now, what are some of the major problems?

    • They produce false positives (mostly prevalent in the context-based approach).
    • The one-hot model mostly classifies the first word as positive.
    • They don't consider punctuation.

    So, I would honestly say NO!!

    What can be done to make it better:

    • Train on a bigger dataset.
    • Expand the alphabet to sensibly include punctuation as part of the context.
    • A closer look at the test-dataset results showed some scenarios where only one of the models was correct. Maybe an ensemble-based approach might work here.

Acknowledgements:

I would like to thank Mr. Matthias Hertel for supervising this project and providing valuable suggestions and prompt responses whenever approached.