This project demonstrates how to process Persian (Farsi) text using the Parsivar NLP library. It includes text normalization, tokenization, stemming, and spell checking, with additional tools to handle proper display of Persian characters.
- ✅ Normalization – Cleans and standardizes Persian text.
- ✅ Tokenization – Splits text into sentences and words.
- ✅ Stemming – Converts words to their root forms.
- ✅ Spell Checking – Detects and corrects misspellings in Persian.
- ✅ Display Support – Uses `arabic_reshaper` and `python-bidi` to fix RTL display issues.
- `parsivar` – NLP tools for Persian.
- `arabic_reshaper` – Reshapes characters into their correct forms.
- `python-bidi` – Ensures proper display of RTL scripts like Persian.
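
As a rough illustration of how the two display libraries fit together, the sketch below reshapes a string and then reorders it for a left-to-right terminal. The helper name `to_display` is hypothetical, and returning the text unchanged when the libraries are missing is an assumption of this sketch, not something the project necessarily does.

```python
def to_display(text):
    """Reshape Persian text and reorder it for terminals that print left to right."""
    try:
        import arabic_reshaper
        from bidi.algorithm import get_display
    except ImportError:
        # Assumption of this sketch: fall back to the raw text
        # when the display libraries are not installed.
        return text
    # Join letters into their contextual (initial/medial/final) forms.
    reshaped = arabic_reshaper.reshape(text)
    # Apply the Unicode bidirectional algorithm to get the visual order.
    return get_display(reshaped)


print(to_display("سلام دنیا"))
```

Whether this step is needed depends on the terminal: some terminals already handle RTL shaping, in which case the raw text may look better without it.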
- Read Persian text from a `.txt` file.
- Normalize the text using Parsivar.
- Tokenize the normalized text into words and sentences.
- Apply stemming to get root forms of words.
- Use spell correction on custom input.
- Display reshaped output for better readability in terminals.
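
The steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual script: the function names `load_parsivar`, `process_text`, and `process_file` are hypothetical, while `Normalizer`, `Tokenizer`, and `FindStems` are the Parsivar classes for normalization, tokenization, and stemming.

```python
from pathlib import Path


def load_parsivar():
    """Build the Parsivar components (imported lazily so the module loads without them)."""
    from parsivar import Normalizer, Tokenizer, FindStems
    return Normalizer(), Tokenizer(), FindStems()


def process_text(text, normalizer, tokenizer, stemmer):
    """Normalize, tokenize, and stem `text`; return every intermediate result."""
    normalized = normalizer.normalize(text)
    sentences = tokenizer.tokenize_sentences(normalized)
    words = tokenizer.tokenize_words(normalized)
    stems = [stemmer.convert_to_stem(word) for word in words]
    return {
        "normalized": normalized,
        "sentences": sentences,
        "words": words,
        "stems": stems,
    }


def process_file(path):
    """Read a UTF-8 text file and run it through the pipeline."""
    text = Path(path).read_text(encoding="utf-8")
    return process_text(text, *load_parsivar())
```

Passing the components in as arguments keeps `process_text` easy to try with stand-in objects. With Parsivar installed, `process_file("input.txt")` would run the whole chain on a file (the file name is only a placeholder).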
pip install parsivar arabic_reshaper python-bidi
pip install -r requirements.txt
In the code, the text is first normalized, then tokenized, and then stemmed. Spell checking needs two additional data files, which you have to download and put in a `spell` folder inside Parsivar's resource directory.

First, create a `spell` folder in this path:

`venv\Lib\site-packages\parsivar\resource`

Then place these two files in the `spell` folder:
- onegram.pckl
- mybigram_lm.pckl
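
To check the spell-check setup before running anything, a small helper along these lines may be useful. It is only a sketch: the function names are hypothetical, and it assumes the `resource/spell` layout described above. The only Parsivar calls used are `SpellCheck` and its `spell_corrector` method.

```python
import importlib.util
from pathlib import Path

# File names taken from the list above.
SPELL_FILES = ("onegram.pckl", "mybigram_lm.pckl")


def ensure_spell_dir(package_dir):
    """Create <package_dir>/resource/spell if it does not exist and return it."""
    spell_dir = Path(package_dir) / "resource" / "spell"
    spell_dir.mkdir(parents=True, exist_ok=True)
    return spell_dir


def missing_spell_files(package_dir):
    """Return the required file names that are not yet in the spell folder."""
    spell_dir = ensure_spell_dir(package_dir)
    return [name for name in SPELL_FILES if not (spell_dir / name).is_file()]


def find_parsivar_dir():
    """Locate the installed parsivar package without importing it (None if absent)."""
    spec = importlib.util.find_spec("parsivar")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


def correct_spelling(text):
    """Run Parsivar's spell corrector; requires both data files to be in place."""
    from parsivar import SpellCheck
    return SpellCheck().spell_corrector(text)


if __name__ == "__main__":
    package_dir = find_parsivar_dir()
    if package_dir is None:
        print("parsivar is not installed in this environment")
    else:
        print("missing files:", missing_spell_files(package_dir))
```

An empty list from `missing_spell_files` suggests the files are where the spell checker expects them, after which `correct_spelling` can be tried on custom input.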
- python
- nltk
- parsivar
- bidi
- arabic_reshaper