Preserve no break spaces in Latex to Unicode conversion#15174
Preserve no break spaces in Latex to Unicode conversion#15174faneeshh wants to merge 9 commits intoJabRef:mainfrom
Conversation
Review Summary by QodoFix LaTeX tilde to non-breaking space conversion
WalkthroughsDescription• Replace LaTeX tildes with non-breaking spaces instead of standard spaces
• Use negative lookbehind regex to preserve LaTeX tilde accents like \~{n}
• Add test case for no-break space preservation in author names
• Update CHANGELOG with fix description
Diagramflowchart LR
A["LaTeX input with tilde"] -->|"negative lookbehind regex"| B["Standalone tildes identified"]
B -->|"replace with \\u00a0"| C["Non-breaking space output"]
D["LaTeX accents like \\~{n}"] -->|"not matched"| E["Preserved unchanged"]
File Changes1. jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java
|
Code Review by Qodo
1. NBSP breaks SQL search
|
| String toFormat = inField.replaceAll("(?<!\\\\)~", "\u00a0"); | ||
| toFormat = UNDERSCORE_MATCHER.matcher(toFormat).replaceAll(REPLACEMENT_CHAR); |
There was a problem hiding this comment.
1. Nbsp breaks sql search 🐞 Bug ✓ Correctness
LatexToUnicodeAdapter.format now emits NBSP for ~, and the search index stores this value as the transformed field. Because search queries use LIKE/ILIKE against the transformed column without normalizing whitespace, searching with a normal space may no longer match entries whose transformed text contains NBSP.
Agent Prompt
## Issue description
`LatexToUnicodeAdapter` now converts LaTeX `~` to NBSP (U+00A0). This value is written into the search index’s transformed column and is then queried using SQL LIKE/ILIKE. Because LIKE/ILIKE are character-sensitive, search terms containing normal spaces (U+0020) will not match transformed content containing NBSP.
## Issue Context
The adapter is used for both UI-ish formatting and backend indexing/search normalization. Preserving NBSP for display is desirable, but indexing/search likely needs whitespace normalization to maintain expected matching behavior.
## Fix Focus Areas
- jablib/src/main/java/org/jabref/logic/search/indexing/BibFieldsIndexer.java[474-476]
- jablib/src/main/java/org/jabref/logic/search/query/SearchToSqlVisitor.java[222-263]
- jablib/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java[32-35]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
…me to something more clear
|
Your pull request conflicts with the target branch. Please merge with your code. For a step-by-step guide to resolve merge conflicts, see https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/addressing-merge-conflicts/resolving-a-merge-conflict-using-the-command-line. |
|
Hello! I am a new contributor looking to get involved. I would love to work on this issue. Could you please assign it to me? |
Are you aware that this is a PR and not an issue? |
🚨 TestLens detected 5 failed tests 🚨Here is what you can do:
Test Summary
🏷️ Commit: 1586e0a Test FailuresBstPreviewLayoutTest > generatePreviewForSimpleEntryUsingAbbr() (:jablib:test in Source Code Tests / Unit tests – jablib)BstPreviewLayoutTest > generatePreviewForSliceTheoremPaperUsingAbbr() (:jablib:test in Source Code Tests / Unit tests – jablib)BstPreviewLayoutTest > generatePreviewForSliceTheoremPaperUsingIEEE() (:jablib:test in Source Code Tests / Unit tests – jablib)BstPreviewLayoutTest > monthMayIsCorrectlyRendered() (:jablib:test in Source Code Tests / Unit tests – jablib)CitationStyleGeneratorTest > doiFieldMapping(String, BibDatabaseMode, BibEntry, String, CitationStyleOutputFormat) > [6] "Corti, R., Flammer, A. J., Hollenberg, N. K., & Lüscher, T. F. (2009). Cocoa and Cardiovascular Health. Circulation, 119(10), 1433–1441. https://doi.org/10.1161/circ.108_827022 special\n", BIBLATEX, @Article{, author = {Corti, Roberto and Flammer, Andreas J. and Hollenberg, Norman K. and Lüscher, Thomas F.}, date = {2009-03}, doi = {10.1161/circ.108_827022~special}, issn = {1524-4539}, journaltitle = {Circulation}, number = {10}, pages = {1433--1441}, publisher = {Ovid Technologies (Wolters Kluwer Health)}, title = {Cocoa and Cardiovascular Health}, volume = {119}, _jabref_shared = {sharedId: -1, version: 1}}, "apa.csl", text (:jablib:test in Source Code Tests / Unit tests – jablib)Muted TestsSelect tests to mute in this pull request:
Reuse successful test results:
Click the checkbox to trigger a rerun:
Learn more about TestLens at testlens.app. |
|
|
||
| private static final Pattern TILDE_MATCHER = Pattern.compile("(?<!\\\\)~"); | ||
|
|
||
| private static final String NON_BREAKING_SPACE = "\u00a0"; |
There was a problem hiding this comment.
| private static final String NON_BREAKING_SPACE = "\u00a0"; | |
| private static final String NO_BREAK_SPACE = "\u00a0"; |
There was a problem hiding this comment.
The formal unicode character name is "no-break" space (https://www.unicode.org/charts/charindex.html), which I think is a better choice than the alternative name "non-breaking space" unless JabRef has a precedent of using the alternative. Thanks for the fix, @faneeshh.
| /// @return an `Optional<String>` with LaTeX resolved into Unicode or `empty` on failure. | ||
| public static Optional<String> parse(@NonNull String inField) { | ||
| String toFormat = UNDERSCORE_MATCHER.matcher(inField).replaceAll(REPLACEMENT_CHAR); | ||
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE); |
There was a problem hiding this comment.
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NON_BREAKING_SPACE); | |
| String toFormat = TILDE_MATCHER.matcher(inField).replaceAll(NO_BREAK_SPACE); |
| } | ||
|
|
||
| @Test | ||
| void formatPreservesNoBreakingSpaces() { |
There was a problem hiding this comment.
| void formatPreservesNoBreakingSpaces() { | |
| void formatPreservesNoBreakSpaces() { |
CHANGELOG.md
Outdated
|
|
||
| ### Fixed | ||
|
|
||
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) |
There was a problem hiding this comment.
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of non-breaking spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) | |
| - We fixed an issue where LaTeX to Unicode conversion replaced tildes with standard spaces instead of no-break spaces. ([#15158](https://github.com/JabRef/jabref/issues/15158)) |
|
8 |
I'm guessing it's failing because it expects a standard space where I've now introduced the no-break space(\u00a0)? If that is the case then should I update the test's expected string to match the new Unicode char? Does that sound right |
|
JUnit tests of You can then run these tests in IntelliJ to reproduce the failing tests locally. We offer a quick test running howto in the section Final build system checks in our setup guide. |
|
It looks like my change is introducing no-break spaces into generated previews and DOIs. I can update the tests to expect \u00a0 but I feel like that might break DOI links if a user copies them. Should i restrict the conversion so it doesn't affect DOIs or should I update the test assertions |
|
Interesting unexpected side effect. Will label as Devcall discussion. |
|
@faneeshh you can work on something else in the meanwhile if you wish to. Feel free to ping if you get "maximum assignment limit reached". |
I really appreciate the concern. I was lurking around this SLR support improving issue (#12642). I ended up reading through the old reverted PR history and I really feel like I have an overview in my mind of how to fix the missing V2 requirements. Now this might be a bit of a long shot but I could try implementing a cleaner foundation for the study.yaml format and see if I can get the migrator and those string based tests right... @koppor Let me know what you think of the same... |
Please feel free to go ahead, we can comment further once you give a solution sketch or a draft PR. |
Will also happen when spaces are introduced. I am not sure when DOIs contain ~ If URLs contain ~, they should be kept. Maybe follow-up issue? |
Closes #15158
PR Description
I updated the Latex to Unicode conversion to ensure the tilde (~) is converted to a non breaking space (\u00a0) instead of a standard space and I should mention that I used a negative lookbehind regex to make sure this only affects standalone tildes and doesn't break Latex tilde accents like ~{n}. I think this would maintain support for accented characters.
Screenshot
Checklist
CHANGELOG.mdin a way that can be understood by the average user (if change is visible to the user)