Arabic text reversed with connected letters not reshaped correctly #69

AnasAG · 2021-06-16T16:00:58Z

I have a script that extracts Arabic text from PDFs, using the pdfminer library for PDF parsing. When extracting the Arabic text, the sentences came out reversed, but the letters in each word were still connected.

Original text in PDF: "وضح المقصود بكل من المصطلحات التالية"
Extracted text from PDF: "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

When using arabic_reshaper I noticed a situation where the Arabic text is not formatted correctly.

Sample Code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result: ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: ﻮﻀﺣ ﻼﻤﻘﺻﻭﺩ ﺐﻜﻟ ﻢﻧ ﻼﻤﺼﻄﻠﺣﺎﺗ ﻼﺗﻼﻳﺓ

But when using similar (reversed) Arabic text in which the letters are isolated (not connected), arabic_reshaper worked properly.

Original text in PDF: "على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة"
Extracted text from PDF: "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

Sample code:

import arabic_reshaper
from bidi.algorithm import get_display

text = "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"

reshaped_text = arabic_reshaper.reshape(text)    # correct its shape
print(reshaped_text)
# result:  ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ

bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة

I couldn't find out why it behaves this way. I also tried the ArabicReshaper class with a custom configuration, changing options such as use_unshaped_instead_of_isolated and support_ligatures, but the behavior was the same.
The PDF font affects the extracted text, which might also be why the text is sometimes extracted with connected letters and sometimes with isolated ones. In general, though, I'm not sure whether this is a bug, something related to ligatures, or something else.
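
One way to narrow this down is to look at the actual code points in the extracted string. The sketch below uses only the standard library; `describe` and `count_presentation_forms` are hypothetical helper names, and the ranges checked are the Arabic Presentation Forms-A and -B blocks. If the extracted text consists of presentation forms (already shaped glyphs) rather than base Arabic letters, that would at least be consistent with the shapes not changing after reshape(), though this is a diagnostic aid, not a confirmed explanation.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for every character in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


def count_presentation_forms(text):
    """Count characters in the Arabic Presentation Forms-A/-B blocks."""
    return sum(
        1
        for ch in text
        if 0xFB50 <= ord(ch) <= 0xFDFF or 0xFE70 <= ord(ch) <= 0xFEFC
    )


if __name__ == "__main__":
    # Replace with a string copied from the pdfminer output.
    sample = "\uFEFC\uFEDF"
    for code_point, name in describe(sample):
        print(code_point, name)
    print("presentation forms:", count_presentation_forms(sample))
```

Running it on both extracted strings shows whether they differ in the kind of characters they contain or only in the positional forms that were chosen.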


naourass · 2023-01-29T05:15:10Z

I'm running into this same issue. All my target text is in joined form. Is it possible to isolate the letters when they're joined?
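
A possible workaround, sketched here with the standard library only: Unicode compatibility normalization (NFKC) maps Arabic presentation forms back to their base letters, and splits ligatures such as lam-alef into two letters. The helper names `to_base_letters` and `unshape_and_reverse` are hypothetical, and the plain string reversal is an assumption that only holds for lines that are purely Arabic; embedded Latin text such as "(n-l-m-s)" would be reversed as well, so mixed lines need a proper bidi step instead.

```python
import unicodedata


def is_presentation_form(ch):
    """True for characters in the Arabic Presentation Forms-A/-B blocks."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def to_base_letters(text):
    """Map shaped glyphs back to base letters; leave other characters alone."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )


def unshape_and_reverse(text):
    """Unshape first, then reverse, so ligatures are split before reordering.

    Assumption: the line is purely Arabic and stored in visual order.
    """
    return to_base_letters(text)[::-1]


if __name__ == "__main__":
    # The lam-alef ligature becomes lam + alef, then the order is flipped.
    print(unshape_and_reverse("\uFEFC"))
```

The result contains only base letters in logical order, so it can then be passed to arabic_reshaper and get_display as usual if it needs to be shaped again for display.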

abdelmalek13 · 2024-03-21T03:39:21Z

I have the same problem when extracting data from a PDF.

AnasAG changed the title ~~Arabic revered but connected letters not reshaped correctly~~ Arabic text reversed with connected letters not reshaped correctly Jun 16, 2021
