How can I handle special characters, such as accented characters, ª, º, ç, etc? Some PDFs I'm processing for Portuguese have lots of these characters, and I'm getting some errors extracting text from them, such as:
Na forma de jurisprudncia **(should be jurisprudência)** do Superior Tribunal de Justiça - AgRg no REsp 1269246/RS, Rel. Ministro Luis Felipe Salomªo **(should be Salomão)** -, danos morais in re ipsa, em casos de atraso de voos somente sªo **(should be são)** constatados em tempo de demora superior a oito (08) horas.
No caso concreto, o tempo foi de cerca de cinco horas e, nªo havendo circunstâncias extraordinÆrias **(should be extraordinárias)**, excluem-se os danos morais.
Juiz JosØ **(should be José)** ...
Thanks for the report – this can definitely happen and it comes straight from the Docling model.
I wonder if there's some postprocessing that could be done, e.g. with ftfy? If this works, this could also be included in spacy-layout by default.
Edit: Okay, it seems like this isn't really working because it's not an encoding issue at its core, it's more that certain characters aren't recognised correctly, e.g. ã as ª. I guess that comes down to the model?
One thing that would be relatively easy to add for now is an API for adding fix-up rules, so if there are common cases the model gets wrong, you can at least replace them yourself.
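To make the fix-up idea concrete, here is a minimal sketch of what user-supplied rules could look like as a plain post-processing step on the extracted text. This is not an existing spacy-layout or Docling API: `FIXUP_RULES` and `apply_fixups` are hypothetical names, and the rules only cover the substitutions reported in this thread (ª for ã, Æ for á, Ø for é). Since ª and º are legitimate ordinal indicators in Portuguese (e.g. "1ª"), the character rules only fire when the symbol follows a letter.

```python
import re

# Hypothetical rule table: (compiled pattern, replacement) pairs.
# These are corpus-specific guesses based on the errors reported above,
# not a general-purpose repair table.
LETTER = r"[^\W\d_]"  # any Unicode letter (word character minus digits and "_")

FIXUP_RULES = [
    # "ª" between two letters is assumed to be a misread "ã" (e.g. "nªo").
    # After a digit ("1ª") it is a real ordinal indicator, so it is left alone.
    (re.compile(rf"(?<={LETTER})ª(?={LETTER})"), "ã"),
    # "Æ" between two letters is assumed to be a misread "á".
    (re.compile(rf"(?<={LETTER})Æ(?={LETTER})"), "á"),
    # "Ø" after a letter is assumed to be a misread "é" (e.g. "JosØ").
    (re.compile(rf"(?<={LETTER})Ø"), "é"),
    # Dropped characters cannot be restored by character rules,
    # so they need explicit word-level entries.
    (re.compile(r"\bjurisprudncia\b"), "jurisprudência"),
]


def apply_fixups(text, rules=FIXUP_RULES):
    """Apply each (pattern, replacement) rule in order and return the result."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sample = "Na forma de jurisprudncia, nªo havendo circunstâncias extraordinÆrias. Juiz JosØ, 1ª Turma."
print(apply_fixups(sample))
```

The word-level entry shows the limitation: when a character is dropped rather than substituted, each affected word has to be listed by hand (or looked up in a dictionary), so rules like these are a stopgap rather than a fix for the underlying recognition errors.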
I think ftfy could help, but it wouldn't fix everything: I have also seen cases where the special character is deleted outright rather than replaced by something else.
Hi @ines , how are you?
5052582-09.2020.8.09.0051-212510970_Voto.pdf
Code is simple:
Thanks!