I have a script that extracts Arabic text from PDFs, using the pdfminer library for PDF parsing. When extracting the Arabic text, the sentences came out reversed, but the letters in each word were connected.
Original text in PDF: "وضح المقصود بكل من المصطلحات التالية"
Extracted text from PDF: "ﺓﻳﻼﺗﻼ ﺗﺎﺣﻠﻄﺼﻤﻼ ﻧﻢ ﻟﻜﺐ ﺩﻭﺻﻘﻤﻼ ﺣﻀﻮ"
When I pass this text to arabic_reshaper, it is not reshaped correctly.
However, with similarly reversed Arabic text whose letters are isolated (not connected), arabic_reshaper worked properly.
Original text in PDF: "على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة"
Extracted text from PDF: "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"
Sample code:
import arabic_reshaper
from bidi.algorithm import get_display
text = "ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ"
reshaped_text = arabic_reshaper.reshape(text) # correct its shape
print(reshaped_text)
# result: ﺐﻴﺗﺮﺘﻟﺍ ﻰﻠﻋ (n-l-m-s) ﺔﻌﺑﺭﻻﺍ ﻢﻜﻟﺍ ﺩﺍﺪﻋﺍ ﻊﻴﻤﺟ ﺐﺘﻛﺍ
bidi_text = get_display(reshaped_text)
print(bidi_text)
# result: على الترتيب (n-l-m-s) اكتب جميع اعداد الكم الاربعة
I couldn't find out why it behaves this way. I also tried using the ArabicReshaper class with a configuration, changing arguments such as use_unshaped_instead_of_isolated and support_ligatures, but the behavior was the same.
The PDF font affects the extracted text, which may be why the text is sometimes extracted with connected letters and sometimes with isolated ones. In general, though, I'm not sure whether this is a bug, something related to ligatures, or has another cause.
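One way to check which letter forms an extracted string actually contains is to count the code points that fall in the Arabic Presentation Forms blocks and map them back to base letters with Unicode NFKC normalization. The sketch below is only a diagnostic aid, not a fix for the reversed order. The helper names is_presentation_form and inspect_extracted are hypothetical, and it relies only on the standard library unicodedata module.

```python
import unicodedata


def is_presentation_form(ch: str) -> bool:
    """Return True if ch lies in Arabic Presentation Forms-A or Forms-B."""
    cp = ord(ch)
    # Forms-A: U+FB50..U+FDFF, Forms-B letters: U+FE70..U+FEFC.
    # U+FEFF (the byte order mark) is deliberately excluded.
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC


def inspect_extracted(text: str):
    """Count presentation-form characters and return the NFKC-normalized text.

    NFKC replaces each presentation form with its base letter(s); a lam-alef
    ligature becomes two code points. It does not change the character order.
    """
    count = sum(1 for ch in text if is_presentation_form(ch))
    normalized = unicodedata.normalize("NFKC", text)
    return count, normalized


if __name__ == "__main__":
    # Hypothetical sample: ALEF ISOLATED FORM, BEH ISOLATED FORM, then ASCII.
    sample = "\uFE8D\uFE8F a"
    count, normalized = inspect_extracted(sample)
    print(count, [hex(ord(ch)) for ch in normalized])
```

If the count is non-zero, the extractor returned already-shaped glyph code points rather than base letters, which might help narrow down whether the font, the extraction step, or the reshaper is responsible.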
AnasAG changed the title from "Arabic revered but connected letters not reshaped correctly" to "Arabic text reversed with connected letters not reshaped correctly" on Jun 16, 2021.