Search should ignore whitespace #620

nifri · 2024-12-12T18:50:33Z

Expected behaviour

When searching text in a document, whitespace (more specifically, the space character) should be ignored. For example, both "text" and "t e x t" should be returned as search results for the string "text".

Actual behaviour

Only "text" will return as a search result for the search string "text".

Reason for the change

OCR engines often introduce whitespace into the searchable text layer of PDF files. This text is usually not visible because it sits under the scanned image. As a result, Atril is less viable as an option for searching through OCR documents.

Open Questions

Should this be implemented for each file type supported by Atril? I'm not sure. I understand that there are several back-ends for the different file types, and at a quick glance the search would have to be modified for each one. I think tackling PDF would already cover a large part of the user base and should be the first step.
Should this be the default? In my opinion, yes. I use the commercial PDF-Xchange Editor, where it is the default. I was very surprised that Atril did not find certain text that PDF-Xchange Editor would find, until I realized that the hidden text in my OCR files has whitespace in some parts and none in others.
Should you be able to turn this off with a toggle or a similar configuration option? In my opinion, not necessarily.
Which whitespace should be ignored? Space, Tab, etc.? I think Space should be sufficient. CR, LF and Tab are used for a reason, and searching for words across multiple lines (CR and LF) or different alignments (Tab) doesn't make sense to me, as I would not expect words to be broken up in these situations.
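
The matching proposed above could be sketched roughly as follows. This is an illustration in Python, not code from Atril or any of its back-ends, and the function name `find_ignoring_spaces` is hypothetical. It assumes that only the space character (U+0020) is skipped, as suggested in the last question, and it keeps a map back to the original offsets so that a hit could still be highlighted in the unmodified text.

```python
def find_ignoring_spaces(text, query):
    """Return (start, end) spans in `text` that match `query` when spaces are ignored.

    Only U+0020 is skipped; Tab, CR and LF still break a match.
    The function name and signature are hypothetical, for illustration only.
    """
    needle = query.replace(" ", "")
    if not needle:
        return []

    # Build the space-free text plus a map from each kept character
    # back to its offset in the original text.
    kept_chars = []
    offsets = []
    for i, ch in enumerate(text):
        if ch != " ":
            kept_chars.append(ch)
            offsets.append(i)
    stripped = "".join(kept_chars)

    spans = []
    pos = stripped.find(needle)
    while pos != -1:
        start = offsets[pos]
        end = offsets[pos + len(needle) - 1] + 1  # exclusive end in the original text
        spans.append((start, end))
        pos = stripped.find(needle, pos + 1)
    return spans
```

For example, `find_ignoring_spaces("a t e x t b", "text")` returns `[(2, 9)]`, which covers "t e x t" in the original string. One side effect of this approach is that matches can span word boundaries: "get extra" also contains a hit for "text".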

Package version

Atril 1.26.0

Linux Distribution

Debian 12

lukefromdc · 2024-12-13T04:37:42Z

I am guessing the current search capability is regex-based, same as in Pluma. I've never worked on that part of the code, but testing a PR to fix this from anyone inside or outside the team should be simple enough.
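
If the search does turn out to be regex-based (a guess in this thread, not something confirmed), one possible approach would be to allow optional spaces between the characters of the query. The following is a minimal Python sketch of that idea, not Atril or Pluma code, and the helper name `space_tolerant_pattern` is hypothetical.

```python
import re


def space_tolerant_pattern(query):
    """Compile a regex that matches `query` with any number of spaces between its characters.

    Hypothetical helper for illustration; only U+0020 is tolerated,
    so Tab, CR and LF between characters still prevent a match.
    """
    # Escape each character so regex metacharacters in the query are matched literally.
    chars = [re.escape(ch) for ch in query if ch != " "]
    return re.compile(" *".join(chars))
```

With this helper, `space_tolerant_pattern("text").search("t e x t")` matches the whole string, while the same pattern finds nothing in `"te\nxt"`.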

lukefromdc added the "feature request" and "confirmed" labels on Dec 13, 2024
