Search should ignore whitespace #620

Open
nifri opened this issue Dec 12, 2024 · 1 comment
Comments

@nifri
nifri commented Dec 12, 2024

Expected behaviour

When searching text in a document, whitespace (more specifically, the space character) should be ignored. For example, "text", "t e x t", and "t e x t" should all be returned as search results for the string "text".

Actual behaviour

Only "text" is returned as a search result for the search string "text".
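
To make the difference between the expected and the actual behaviour concrete, here is a minimal Python sketch of a space-insensitive search. It is only an illustration, not Atril's code: the function name `find_ignoring_spaces` and the sample string are made up for this example. The idea is to search a space-free copy of the text and map each hit back to offsets in the original text so that it could still be highlighted.

```python
def find_ignoring_spaces(text, query):
    """Return (start, end) spans in `text` that match `query` when spaces are ignored.

    Hypothetical helper for illustration; only U+0020 is ignored,
    so Tab, CR and LF still break a match.
    """
    needle = query.replace(" ", "")
    if not needle:
        return []

    # Build a space-free copy of the text and remember the original
    # index of every character that was kept.
    kept = []
    origin = []
    for i, ch in enumerate(text):
        if ch != " ":
            kept.append(ch)
            origin.append(i)
    stripped = "".join(kept)

    # Search the space-free copy, then translate hits back to the original.
    spans = []
    pos = stripped.find(needle)
    while pos != -1:
        start = origin[pos]
        end = origin[pos + len(needle) - 1] + 1
        spans.append((start, end))
        pos = stripped.find(needle, pos + 1)
    return spans


page = "plain text, spaced t e x t, and wide t  e  x  t"
for start, end in find_ignoring_spaces(page, "text"):
    print(start, end, repr(page[start:end]))
```

One trade-off of this approach is that a match can now span a word boundary (for example, "text" would match inside "next extra"), which may or may not be acceptable for a default setting.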

Reason for the change

OCR engines often introduce extra whitespace into the searchable text layer of PDF files. This text is usually not visible because it sits under the scanned image. As a result, Atril is less viable as a tool for searching through OCR documents.

Open Questions

  • Should this be implemented for every file type supported by Atril? I'm not sure. As I understand it, each file type has its own back-end, and at a quick glance the search would have to be modified in each one. Tackling PDF alone would probably cover a large part of the user base and should be the first step.
  • Should this be the default? In my opinion, yes. I use the commercial PDF-XChange Editor, where it is the default. I was very surprised that Atril did not find certain text that PDF-XChange Editor finds, until I realized that the hidden text in my OCR files contains spaces in some parts and none in others.
  • Should it be possible to turn this off with a toggle or a similar configuration option? In my opinion, not necessarily.
  • Which whitespace should be ignored? Space, Tab, etc.? I think Space should be sufficient. CR, LF, and Tab are used for a reason, and matching words across multiple lines (CR and LF) or across different alignments (Tab) doesn't make sense to me, as I would not expect words to be broken up in these situations.

Package version

Atril 1.26.0

Linux Distribution

Debian 12

@lukefromdc
Member

I am guessing the current search capability is regex-based, same as in Pluma. I've never worked on that part of the code, but testing a PR to fix this from anyone inside or outside the team should be simple enough.
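
If the search does turn out to be regex-based (which is only a guess at this point), one possible approach is to rewrite the query so that optional spaces are allowed between its characters. The Python sketch below illustrates that idea under this assumption; `build_pattern` and its `ignored` parameter are hypothetical names, not anything from Atril or Pluma. Passing an empty `ignored` string gives the current exact-match behaviour, which is how a toggle could be modelled.

```python
import re


def build_pattern(query, ignored=" "):
    """Compile a regex matching `query` with optional `ignored` characters between letters.

    Hypothetical helper for illustration only.
    """
    chars = [c for c in query if c not in ignored]
    if not chars:
        raise ValueError("query has no searchable characters")
    # Zero or more ignored characters may sit between two query characters.
    gap = f"[{re.escape(ignored)}]*" if ignored else ""
    return re.compile(gap.join(re.escape(c) for c in chars))


sample = "text, t e x t, t\te\tx\tt"

# Default: only spaces are ignored, so the Tab-separated variant is not found.
print([m.span() for m in build_pattern("text").finditer(sample)])

# Also ignoring Tab finds all three occurrences.
print([m.span() for m in build_pattern("text", ignored=" \t").finditer(sample)])
```

Because the pattern runs against the original text, the match spans can be used directly, with no offset mapping needed.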
