I tried to search a PDF file created by Markdown PDF using pypdf, but it didn't work as expected because the text extracted by pypdf frequently contains a space after a capital letter.
I think this is a problem with Markdown PDF, as I couldn't reproduce the error with PDF files from other sources, though I did not check extensively.
Here is the pdf file I created with Markdown PDF: testfile.pdf
Here is the Python script that reproduces the bug (I used pypdf version 4.2.0):
import pypdf

PDF_FILE = "testfile.pdf"


def get_all_text():
    all_text = ""
    complete_text = pypdf.PdfReader(PDF_FILE)
    for page_obj in complete_text.pages:
        text = page_obj.extract_text()
        all_text += text
    with open("all_text.txt", "w") as file:
        file.write(all_text)


if __name__ == "__main__":
    get_all_text()
Here is the output (watch the spaces after the capital letters). The output seems to be reproducible.
testfile.md 2024-06-18
1 / 1A Lot Of Capitalized W ords Like S witzerland For Example. Where Is R obert?
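Until the cause is found, the extracted text could be post-processed before searching. The following is only a rough sketch of such a stopgap, not a fix for the underlying problem: the function name `join_split_capitals` and the regular expression are my own assumptions, and the heuristic will wrongly join legitimate one-letter words such as "A" or "I" when they are followed by a lowercase word.

```python
import re

# Heuristic (an assumption, not part of pypdf): a standalone capital letter
# followed by exactly one space and then a lowercase letter is treated as a
# word that was split after its first letter.
_SPLIT_CAPITAL = re.compile(r"\b([A-Z]) (?=[a-z])")


def join_split_capitals(text: str) -> str:
    """Remove the space between a lone capital letter and the lowercase text after it."""
    return _SPLIT_CAPITAL.sub(r"\1", text)


if __name__ == "__main__":
    sample = "1 / 1A Lot Of Capitalized W ords Like S witzerland For Example. Where Is R obert?"
    print(join_split_capitals(sample))
```

Applied to the output above, this rejoins "W ords", "S witzerland" and "R obert" while leaving "A Lot" untouched, because the next word there starts with a capital letter.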