Update readme.md #1

Open
wants to merge 1 commit into
base: master
32 changes: 17 additions & 15 deletions readme.md
@@ -1,15 +1,15 @@
# Tamil Spell Checker

Idea for building simple Tamil Spell Checker came from a conversation with T Shrinivasan from open-tamil team.
The idea for building a simple Tamil Spell Checker came from a conversation with T Shrinivasan from the open-tamil team.

Tamil Spell Checker uses below approach to suggest different spellings for a word
Tamil Spell Checker uses the following approach to suggest alternative spellings for a word:

- Check whether it is a valid Tamil word using Bloom Filter
- Use Levenstein Distance (edit distance of 2) to suggest words when it is not a tamil word
- Use Levenshtein distance (edit distance of up to 2) to suggest words when it is not a valid Tamil word
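
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `SimpleBloomFilter`, `edits1`, and `suggest` are hypothetical names, the filter is sized with the textbook formulas, and edits are applied to Unicode code points rather than to whole Tamil letters, which is a simplification.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Toy Bloom filter for illustration (hypothetical, not TamilwordChecker)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Textbook sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes
        self.size = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / math.log(2) ** 2))
        self.hash_count = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Derive k bit positions from SHA-256 digests salted with the hash index
        for seed in range(self.hash_count):
            digest = hashlib.sha256(f"{seed}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        # True may be a false positive; False is always correct
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


def edits1(word, alphabet):
    """All strings one Levenshtein edit (delete, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    replaces = [left + ch + right[1:]
                for left, right in splits if right for ch in alphabet]
    inserts = [left + ch + right for left, right in splits for ch in alphabet]
    return set(deletes + replaces + inserts)


def suggest(word, bloom, alphabet):
    """Return the word itself if known, else known words within 2 edits."""
    if word in bloom:
        return [word]
    one_edit = edits1(word, alphabet)
    candidates = {w for w in one_edit if w in bloom}
    if not candidates:
        # Fall back to edit distance 2 only when distance 1 finds nothing
        candidates = {w2 for w1 in one_edit
                      for w2 in edits1(w1, alphabet) if w2 in bloom}
    return sorted(candidates)


# Tiny demo word list; a real list would come from tamilwordlist.txt
words = ["மேகம்", "தமிழ்", "மழை"]
alphabet = sorted({ch for w in words for ch in w})

# Sized for far more items than inserted, so false positives are very unlikely
bloom = SimpleBloomFilter(expected_items=1000)
for w in words:
    bloom.add(w)

print(suggest("மேகம்", bloom, alphabet))
print(suggest("மேக்ம்", bloom, alphabet))
```

Because a Bloom filter can report false positives, a suggestion produced this way may occasionally be a string that is not in the word list; checking candidates against the actual word list would remove that risk at the cost of more memory.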

## Project Madurai Crawler

Project Madurai has good collection of tamil works. Use Project Madurai Crawler to generate Tamil unique word list.
Project Madurai has a good collection of Tamil works. Use Project Madurai Crawler to generate a list of unique Tamil words.

To run it, use the command below:
```
@@ -18,37 +18,39 @@ python ProjectMaduraiCrawler.py

## Create Bloom Filter File

Bloom Filter is a space efficient and compute optimized probablistic datastructure designed to tell whether an item is present in a set or not. More information on Bloom Filter can be found in [wiki](https://en.wikipedia.org/wiki/Bloom_filter).
A Bloom filter is a space-efficient, compute-optimized probabilistic data structure that tests whether an item is present in a set. It may report false positives but never false negatives. More information can be found on [Wikipedia](https://en.wikipedia.org/wiki/Bloom_filter).

- Spellchecker is using Bloom Filter to check whether a word is a valid tamil word or not.
- Bloom Filter Datastructure file has to be created first before using to check validity of a word
- The spell checker uses a Bloom filter to check whether a word is a valid Tamil word.
- The Bloom filter file has to be created before it can be used to check the validity of a word.

To generate Bloom Filter file use the below command
To generate the Bloom filter file, use the command below:

```
python TamilBloomFilterCreator.py
```

## Sample code to check whether a word is valid tamil word
## Sample code to check whether a word is a valid Tamil word

```
from TamilwordChecker import TamilwordChecker
from tamilspellchecker.TamilwordChecker import TamilwordChecker
from tamilspellchecker.TamilSpellingAutoCorrect import get_data

unique_word_count = 2043478
tamilwordchecker = TamilwordChecker(unique_word_count,"tamil_bloom_filter.txt")
tamilwordchecker = TamilwordChecker(unique_word_count,get_data("tamil_bloom_filter.txt"))
print(tamilwordchecker.tamil_word_exists("மேகம்"))
```

## Sample code to get spell check corrections

```
from TamilSpellingAutoCorrect import TamilSpellingAutoCorrect
spellchecker = TamilSpellingAutoCorrect("tamil_bloom_filter.txt","tamilwordlist.txt")
from tamilspellchecker.TamilSpellingAutoCorrect import TamilSpellingAutoCorrect, get_data
spellchecker = TamilSpellingAutoCorrect(get_data("tamil_bloom_filter.txt"), get_data("tamilwordlist.txt"))
from_spell_checker_list = spellchecker.tamil_correct_spelling("மேக்ம்")
print(from_spell_checker_list)
```

## Norvig Algorithm
Norvig algorithm can run faster than exhaustive search method; you
The Norvig algorithm can run faster than the exhaustive search method; you
can use it as follows:

```
@@ -65,7 +67,7 @@ assert 'தமிழ்நாடு' in results

## Accuracy Issues

Accuracy of Tamilwordchecker depends on the list of unique words that is there in tamilwordlist.txt. Need to add more unique words from other sources.
The accuracy of TamilwordChecker depends on the unique words listed in tamilwordlist.txt. More unique words from other sources need to be added.