AraSpell (Arabic Spelling Correction)

This work introduces AraSpell (Arabic Spelling Correction) trained on more than 6.9 Million sentences, the offical implemntation of AraSpell Paper.

The paper would be available here

Model

We implemented 3 models arcthictures as described in the paper they are:

Attentional vanilla Seq2Seq using RNN
Attentional Seq2Seq with stacked RNN blocks
Transformer

The image below shows all the model architectures proposed and implemented

Datasets

The data set used is Wikipedia dataset and can be found here on Kaggle, and below are the training, testing, and dev sets we have used after performing the proper transformations.

Data	Link
Train	Here
Test	Here
Dev	Here

To generate the mixed dataset use the below for both train.csv and test.csv

cat train.csv | awk -F , '{print $2","$3}' | sed "s/clean\,distorted_0\.05/clean,distorted_mix/g" > train_mix.csv
cat train.csv | awk -F , '{if (NR>1) print $2","$4}' >> train_mix.csv

Setup

Setup development environment

create enviroment

python -m venv env

activate the enviroment

source env/bin/activate

install pytorch

pick the version with the proper cuda version for your hardware from here (optional)

pip install torch --extra-index-url https://download.pytorch.org/whl/cu116

install the required dependencies

pip install -r requirements.txt

Setting up docker image

docker build .

Try it out

You can try it out using the jupyter notebook provided infer.ipynb and download the tokenizer and one of the published models

Results

Model	CER (%) on 5%	CER (%) on 10%	WER (%) on 5%	WER (%) on 10%
Transformer_0.05	1.24	4.15	5.35	18.38
Transformer_0.1	1.45	2.82	5.95	10.36
Transformer_mixed	1.11	2.8	4.8	10.65
Transformer_varied	1.22	3.16	5.41	12.35
RNNB_0.05	1.74	4.45	7.76	18.8
RNNB_0.1	1.86	4.01	7.8	15.41
RNNB_mixed	1.67	4.06	7.51	16.55
RNNB_varied	1.77	4.36	8.14	18.01
VanillaRNN_0.05	1.89	4.99	8.33	20.67
VanillaRNN_0.1	2.01	4.52	18.19	16.95
VanillaRNN_mixed	1.86	4.53	7.84	17.36
VanillaRNN_varied	2.01	4.97	8.76	19.49

Pre-trained Models

All Pre-trained models are available here

Fine tune the published models

Download one of the published model
Prepare the dataset as mentioned in here
run the following command to train the model

python train.py --pre_trained_path path/to/checkpoint.pt

Training on Other Languages

There are 2 different ways you can train on your data:

Your data is ready:

Your data on a form of 3 CSV files train, test, and dev and each CSV file looks like the following

clean, distorted
clean line, distorted line
clean line, distorted line

Change the valid characters in the constants.py file according to your language
Start the training using the following command after changing the value to fit your needs

python train.py --epochs number_of_epochs --n_gpus max_sent_length --train_path path/to/train.csv \
      --test_path path/to/train.csv --max_len max_sent_length --distortion_ratio distortion_ratio

Process the data using our processing pipeline

format the data to be in the following structuer

data
│   file1.txt
│   file2.txt
│   file3.txt

Change the valid characters in the constants.py file according to your language
run the below command to process the data

python process_data.py

Split your data into train, test and dev CSV files
Start the training using the following command after changing the value to fit your needs

python train.py --epochs number_of_epochs --n_gpus max_sent_length --train_path path/to/train.csv \
      --test_path path/to/train.csv --max_len max_sent_length --distortion_ratio distortion_ratio

Name		Name	Last commit message	Last commit date
Latest commit History 117 Commits
.gitattributes		.gitattributes
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
args.py		args.py
callback.py		callback.py
constants.py		constants.py
data.py		data.py
decorators.py		decorators.py
interfaces.py		interfaces.py
layers.py		layers.py
logger.py		logger.py
loss.py		loss.py
models.py		models.py
optimizer.py		optimizer.py
process_data.py		process_data.py
processes.py		processes.py
processors.py		processors.py
requirements.txt		requirements.txt
setup.cfg		setup.cfg
tokenizer.py		tokenizer.py
train.py		train.py
utils.py		utils.py
words.json		words.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AraSpell (Arabic Spelling Correction)

Model

Datasets

Setup

Setup development environment

Setting up docker image

Try it out

Results

Pre-trained Models

Fine tune the published models

Training on Other Languages

About

Releases

Packages

Languages

License

msalhab96/AraSpell

Folders and files

Latest commit

History

Repository files navigation

AraSpell (Arabic Spelling Correction)

Model

Datasets

Setup

Setup development environment

Setting up docker image

Try it out

Results

Pre-trained Models

Fine tune the published models

Training on Other Languages

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages