Expected behaviour
When searching for text in a document, whitespace (more specifically, the space character) should be ignored. For example, both "text" and "t e x t" should be returned as search results for the string "text".
Actual behaviour
Only "text" is returned as a search result for the search string "text".
Reason for the change
OCR engines often introduce whitespace into the searchable text of PDF files. This text is usually not visible because it sits under the scanned image. As a result, Atril is less viable as an option for searching through OCR documents.
Open Questions
Should this be implemented for each file type supported by Atril? I'm not sure. As I understand it, there are several back-ends for the different file types, and at a quick glance the search would have to be modified for each back-end. I think tackling PDF would already cover a large part of the user base and should be the first step.
Should this be the default? In my opinion, yes. I'm using the commercial PDF-XChange Editor, where it is the default. I was very surprised that Atril did not find certain text that PDF-XChange Editor would find, until I realized that the hidden text in my OCR files has whitespace in some parts and none in others.
Should you be able to turn this off with a toggle or a similar configuration option? In my opinion, not necessarily.
Which whitespace should be ignored: space, tab, etc.? I think space should be sufficient. CR, LF and tab are used for a reason, and matching words across multiple lines (CR and LF) or different alignments (tab) doesn't make sense to me, as I would not expect words to be broken up in these situations.
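To make the requested behaviour concrete, here is a minimal Python sketch of a search that ignores only the space character and leaves CR, LF and tab significant. It is an illustration only, not Atril code (Atril is written in C), and the function name `find_ignoring_spaces` is hypothetical. It strips spaces from the page text, remembers where each remaining character came from, and maps matches back to positions in the original text so they could still be highlighted.

```python
def find_ignoring_spaces(haystack: str, needle: str) -> list[tuple[int, int]]:
    """Return (start, end) spans in haystack that match needle when spaces are ignored.

    Hypothetical helper for illustration; only ' ' is ignored, not tab, CR or LF.
    """
    needle = needle.replace(" ", "")
    if not needle:
        return []

    # Build the space-free text and record the original index of each kept character.
    kept_chars = []
    origin = []
    for index, char in enumerate(haystack):
        if char != " ":
            kept_chars.append(char)
            origin.append(index)
    stripped = "".join(kept_chars)

    # Search the space-free text, then translate each hit back to original offsets.
    spans = []
    pos = stripped.find(needle)
    while pos != -1:
        start = origin[pos]
        end = origin[pos + len(needle) - 1] + 1
        spans.append((start, end))
        pos = stripped.find(needle, pos + 1)
    return spans


if __name__ == "__main__":
    print(find_ignoring_spaces("a text and t e x t", "text"))
```

One trade-off of this approach is that matches can span word boundaries: "cat extra" would also match "text", because the space between the two words is ignored as well.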
Package version
Atril 1.26.0
Linux Distribution
Debian 12
I am guessing the current search capability is regex-based, same as in Pluma. I've never worked on that part of the code, but testing a PR to fix this from anyone inside or outside the team should be simple enough.
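If the search does turn out to be regex-based, one possible approach is to allow optional spaces between the characters of the query. The following Python sketch shows the idea under that assumption. It is not taken from Atril or Pluma, and the function name `space_tolerant_pattern` is hypothetical.

```python
import re


def space_tolerant_pattern(query: str) -> re.Pattern:
    """Compile a pattern that matches query with any number of spaces between characters.

    Hypothetical helper for illustration; tab, CR and LF are not treated as ignorable.
    """
    # Escape each character so regex metacharacters in the query are matched literally.
    chars = [re.escape(char) for char in query if char != " "]
    if not chars:
        raise ValueError("query must contain at least one non-space character")
    # " *" between characters permits zero or more literal spaces.
    return re.compile(" *".join(chars))


if __name__ == "__main__":
    pattern = space_tolerant_pattern("text")
    print([m.group() for m in pattern.finditer("text, t e x t, te\txt")])
```

Because the pattern is applied to the original text, match offsets need no translation, at the cost of a longer pattern for long queries.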