
Soft Hyphens and Zero Width Spaces #806

Open
cskeeters opened this issue Mar 1, 2025 · 3 comments
Comments

@cskeeters

U+00AD is a soft hyphen. It marks a point where a word may be broken if necessary, with a hyphen inserted when the break occurs.

U+200B is a zero width space. It marks a point where a word may be broken if necessary, without a hyphen being inserted when the break occurs.

Both MS Word and harper-ls in Neovim fail to check the full word for spelling mistakes when either of these characters is used. Has there been any thought to stripping them before checking against the dictionary?
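To make the request concrete, here is a minimal sketch in Python of what "stripping before checking" could look like. It is not taken from Harper's code base: `strip_break_hints`, `is_known`, and the toy dictionary are hypothetical names used only for illustration.

```python
# Hypothetical pre-processing step; not taken from Harper's code base.
SOFT_HYPHEN = "\u00ad"
ZERO_WIDTH_SPACE = "\u200b"

# str.translate drops every code point that maps to None.
_STRIP_TABLE = {ord(SOFT_HYPHEN): None, ord(ZERO_WIDTH_SPACE): None}


def strip_break_hints(word: str) -> str:
    """Return the word without soft hyphens or zero width spaces."""
    return word.translate(_STRIP_TABLE)


def is_known(word: str, dictionary: set[str]) -> bool:
    """Look the cleaned, lowercased word up in a toy dictionary."""
    return strip_break_hints(word).lower() in dictionary


toy_dictionary = {"supercalifragilistic", "documentation"}
print(is_known("super\u00adcali\u00adfragilistic", toy_dictionary))  # True
print(is_known("docu\u200bmentation", toy_dictionary))  # True
print(is_known("docu\u200bmentaton", toy_dictionary))  # False
```

A real checker would presumably also need to map any reported error back to offsets in the original, unstripped text, which this sketch does not attempt.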

@elijah-potter
Collaborator

That's an interesting proposition. I can't imagine it would be the most challenging ask. I'm curious: how do these characters make their way into your documents?

@hippietrail
Contributor

There is another issue with hyphens. In some formats, especially plain text or files automatically converted from plain text (think Project Gutenberg), hyphens are used at the ends of lines between either syllables or morphemes, so the real word needs to be constructed from the two halves before it can take part in spelling and grammar checks. But some of the words this happens to are hyphenated compounds, so there is an ambiguity. It's similar to sentences that end with abbreviations that end with a period.
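As a rough illustration of that ambiguity, the following Python sketch rejoins line-end hyphenation with a dictionary lookup. The word list, the regular expression, and the tie-breaking order (prefer the joined form, then the hyphenated compound) are all assumptions made for the example, not a description of how Harper works.

```python
import re

# Toy word list; a real checker would consult its full dictionary.
TOY_DICTIONARY = {"spelling", "well-known", "checker"}

# A word fragment, a hyphen, a line break (with optional indentation),
# and the fragment that starts the next line.
LINE_END_HYPHEN = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")


def rejoin(match: re.Match) -> str:
    """Decide whether a line-end hyphen was a break or part of a compound."""
    head, tail = match.group(1), match.group(2)
    joined = head + tail
    hyphenated = f"{head}-{tail}"
    if joined.lower() in TOY_DICTIONARY:
        return joined
    # Either a known compound or unknown both ways: keep the hyphen
    # so that no character from the source is silently lost.
    return hyphenated


def dehyphenate(text: str) -> str:
    """Rejoin every word that was split across a line break."""
    return LINE_END_HYPHEN.sub(rejoin, text)


sample = "A well-\nknown spell-\ning check-\ner."
print(dehyphenate(sample))  # A well-known spelling checker.
```

When both forms are valid words (for example "resign" and "re-sign"), a lookup like this cannot decide, which is the same kind of problem as a sentence-final abbreviation.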

We won't have to worry about this for some time though.

@cskeeters
Author

> I'm curious: how do these characters make their way into your documents?

I just learned about these and am in the exploration phase, so I don't have a set way and am not using them often. I'm trying to find out whether it would make sense to use Unicode characters in documentation if the tooling around them were better. I know this is long. It's just for those who are curious.

Typst supports using #sym.hyph.soft; and #sym.zws;. When I try it in the body, it works, but that's a lot of syntax to read if you're using them often. Since my Typst/markdown source is typically UTF-8, I thought maybe I could insert the Unicode characters directly. This approach makes it easier to reuse text (for example, if I were copying the source into markdown or even MS Word).

Since I don't want to take the time to memorize Unicode values to enter into Neovim using the <C-v>u1234 method, I wrote a Lua function to select/insert Unicode characters with FZF that uses UnicodeData.txt as the source. It's similar to alduraibi/telescope-glyph.nvim, which is good, but doesn't include soft hyphens or zero width spaces. (They want to support Font Awesome and Nerd Font symbols.)
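For anyone curious about the lookup side of such a picker, here is a small sketch in Python rather than Lua. It is not the function described above: the two sample lines are meant to mirror the semicolon-separated UnicodeData.txt layout (code point, name, category, and further fields), and `parse_unicode_data` and `search` are hypothetical helpers.

```python
# Two sample lines in the semicolon-separated UnicodeData.txt layout:
# code point; character name; general category; further fields...
SAMPLE_UNICODE_DATA = """\
00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
"""


def parse_unicode_data(text: str) -> dict[str, str]:
    """Map each character name to the character itself."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point, name = fields[0], fields[1]
        table[name] = chr(int(code_point, 16))
    return table


def search(table: dict[str, str], query: str) -> list[tuple[str, str]]:
    """Return (name, 'U+XXXX') pairs whose name contains the query."""
    query = query.upper()
    return [
        (name, f"U+{ord(char):04X}")
        for name, char in sorted(table.items())
        if query in name
    ]


table = parse_unicode_data(SAMPLE_UNICODE_DATA)
print(search(table, "hyphen"))  # [('SOFT HYPHEN', 'U+00AD')]
print(search(table, "zero"))    # [('ZERO WIDTH SPACE', 'U+200B')]
```

The sketch ignores special entries in the real file, such as ranges and control characters, whose name field is a placeholder in angle brackets.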

Reading text with unicode characters in Neovim isn't great either (<200B>). I can use the syntax match Entity "­" conceal cchar=⌿ syntax in vim to fix the visual issues for text files but tree-sitter parsers are not compatible with these simple rules. I found Freed-Wu/conceal.vim which uses tree-sitter queries to convert unicode and emojis, but I think it only works when text is the only text in the (inline) of a markdown document. With the help of Grok, I was able to make my own solution where I use nvim_buf_set_extmark to create custom conceal areas for matched text. It only works in markdown right now.

For Typst, a show rule can be defined once to affect all later uses. Here's an example (shortened to fit on one line):

#set page(margin:.1in, width:2in, height:1in,)

// Either work
// #show "supercalifragilistic": "super­cali­fragilistic" // There are soft hyphens that don't show in github
#show "supercalifragilistic": [super#sym.hyph.soft;cali#sym.hyph.soft;fragilistic]

That is very supercalifragilistic.

For word joiners (which are not relevant to spelling), I wrote an unpublished Lua filter for Pandoc that can be used with ptp. Given this source:

---
title: Word Joiner Example
words:
    - 800-53
---

RMF Controls are documented in NIST SP 800-53.

The filter reads the list of words and uses a Typst template to generate a show rule for each word that adds a word joiner character (U+2060) after every /, -, and space. In this case, the Unicode characters aren't entered into the markdown source at all.
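The transformation itself is simple. The sketch below shows it in Python, applied directly to a string instead of generating Typst show rules as the filter does; `join_word` and `protect_words` are hypothetical names, not part of the unpublished filter.

```python
import re

WORD_JOINER = "\u2060"


def join_word(word: str) -> str:
    """Insert a word joiner after every '/', '-', or space in the word."""
    return re.sub(r"([/\- ])", lambda m: m.group(1) + WORD_JOINER, word)


def protect_words(text: str, words: list[str]) -> str:
    """Replace each listed word in the text with its word-joined form."""
    for word in words:
        text = text.replace(word, join_word(word))
    return text


source = "RMF Controls are documented in NIST SP 800-53."
protected = protect_words(source, ["800-53"])
print(protected.encode("unicode_escape").decode("ascii"))
# RMF Controls are documented in NIST SP 800-\u206053.
```

Printing the escaped form makes the otherwise invisible U+2060 visible after the hyphen.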

So, there are a lot of roadblocks to using these characters. Spell checking is just one obstacle to inserting soft hyphens and zero width spaces into text. I suspect many people don't know about the utility of these characters, and I don't see that changing significantly. It might just be too complex and confusing for those who don't have these tools set up, so I'm not sure how much I'll use them in the future.
