Handling special characters #16

Open
pvcastro opened this issue Dec 10, 2024 · 2 comments
Labels: docling (Related to Docling library and models), enhancement (New feature or request)

Comments

@pvcastro

Hi @ines, how are you?

How can I handle special characters, such as accented characters, ª, º, ç, etc.? Some of the Portuguese PDFs I'm processing contain many of these characters, and the extracted text gets them wrong, for example:

Na forma de jurisprudncia **(should be jurisprudência)** do Superior Tribunal de Justiça - AgRg no REsp 1269246/RS, Rel. Ministro Luis Felipe Salomªo **(should be Salomão)** -, danos morais in re ipsa, em casos de atraso de voos somente sªo **(should be são)** constatados em tempo de demora superior a oito (08) horas.

No caso concreto, o tempo foi de cerca de cinco horas e, nªo havendo circunstâncias extraordinÆrias **(should be extraordinárias)**, excluem-se os danos morais.

Juiz JosØ **(should be José)** ...

5052582-09.2020.8.09.0051-212510970_Voto.pdf

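The code is simple:

import spacy
from spacy_layout import spaCyLayout

nlp = spacy.load('pt_core_news_lg')
layout = spaCyLayout(nlp)
doc = layout(sample_path)  # sample_path: path to the PDF attached above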

Thanks!

ines added the docling label Dec 11, 2024
@ines
Member

ines commented Dec 11, 2024

Thanks for the report – this can definitely happen and it comes straight from the Docling model.

I wonder if there's some postprocessing that could be done, e.g. with ftfy? If this works, this could also be included in spacy-layout by default.

Edit: Okay, it seems like this isn't really working because it's not an encoding issue at its core, it's more that certain characters aren't recognised correctly, e.g. ã as ª. I guess that comes down to the model?
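One way to see the difference described in the edit above is to compare the reported output with what a typical encoding mix-up looks like. The sketch below uses only the Python standard library; `undo_latin1_mojibake` is a hypothetical helper name, and it covers just one common mix-up (UTF-8 bytes decoded as Latin-1), not everything ftfy handles.

```python
def undo_latin1_mojibake(text):
    """Reverse UTF-8 bytes that were wrongly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # the text is not this kind of encoding mix-up


# Classic mojibake: "ã" encoded as UTF-8, then decoded as Latin-1.
garbled = "ã".encode("utf-8").decode("latin-1")
print(repr(garbled))                        # 'Ã£'
print(undo_latin1_mojibake(garbled))        # ã

# The text from this issue has a single wrong character instead,
# so there is no byte sequence to reinterpret.
print(undo_latin1_mojibake("Salomªo"))      # None
```

A real encoding error leaves a reversible byte pattern behind, while "ª" in place of "ã" is simply a different character, which fits the observation that encoding repair does not help here.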

One thing that would be relatively easy to add for now is an API for adding fix-up rules, so if there are common cases the model gets wrong, you can at least replace them yourself.
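As a rough illustration of what such fix-up rules could look like on the user side, here is a minimal sketch that works on plain strings. It is not spacy-layout's API: `FIXUP_RULES` and `apply_fixups` are hypothetical names, and the rules are based only on the substitutions reported in this issue.

```python
import re

# Hypothetical fix-up rules as (compiled pattern, replacement) pairs.
FIXUP_RULES = [
    # "ª" right after a letter is assumed to be a misread "ã" ("Salomªo", "nªo").
    # After a digit ("1ª") it is left alone as a genuine ordinal indicator.
    (re.compile(r"(?<=[^\W\d_])ª"), "ã"),
    # "Æ" and "Ø" are not used in Portuguese, so map them back directly.
    (re.compile(r"Æ"), "á"),
    (re.compile(r"Ø"), "é"),
]


def apply_fixups(text, rules=FIXUP_RULES):
    """Apply each replacement rule to the text in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


print(apply_fixups("Ministro Luis Felipe Salomªo"))
print(apply_fixups("nªo havendo circunstâncias extraordinÆrias"))
print(apply_fixups("Juiz JosØ, 1ª Turma"))
```

Rules like these are heuristics: a legitimate "ª" after a letter (for example in an abbreviation such as "Srª") would also be rewritten, and characters that were dropped entirely, as in "jurisprudncia", cannot be restored by replacement rules at all.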

ines added the enhancement label Dec 11, 2024
@pvcastro
Author

I think ftfy could help, but it wouldn't fix everything: I have also seen cases where the special character is dropped entirely rather than replaced by something else.
