-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: count string by codepoint #44
Conversation
3ee7a33
to
ad37387
Compare
1ece1c7
to
ac71454
Compare
ac71454
to
c32cb9f
Compare
Unicode says that there are 4 ways to count string length. https://unicode.org/faq/char_combmark.html#7 This commit supports counting by Code points.
c32cb9f
to
3bc67bf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" })
is more precise, but also more complex to implement due to language dependencies.
(probably, Intl.Segmenter is slower than others)
https://blog.jxck.io/entries/2017-03-02/unicode-in-javascript.html#unicode-text-segmentation
https://github.com/tc39/proposal-intl-segmenter
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
src/sentence-length.ts
Outdated
* By default or set to "code units", count string by UTF-16 code unit(= using `String.prototype.length`). | ||
* If set to "codepoints", count string by codepoint. | ||
*/ | ||
countBy?: "code units" | "codepoints"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
countBy?: "code units" | "codepoints"; | |
countBy?: "codeunits" | "codepoints"; |
I think it would be better to align without spaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/sentence-length.ts
Outdated
const reporter: TextlintRuleReporter<Options> = (context, options = {}) => { | ||
const maxLength = options.max ?? defaultOptions.max; | ||
const skipPatterns = options.skipPatterns ?? options.exclusionPatterns ?? defaultOptions.skipPatterns; | ||
const skipUrlStringLink = options.skipUrlStringLink ?? defaultOptions.skipUrlStringLink; | ||
const strLen = | ||
options.countBy == null || options.countBy === "code units" ? (s: string) => s.length : strLenByCodePoint; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create a function like strLenByCodeUnits
and use it?
const countBy = options?.countBy ?? defaultOptions.countBy;
const strLen = countBy === "codeunits" ? strLenByCodeUnits : strLenByCodePoint;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ref: - textlint-rule#44 (comment) Co-authored-by: azu <[email protected]>
ref: - textlint-rule#44 (comment) Co-authored-by: azu <[email protected]>
@azu Thank you for your review! I applied your suggestions.
I just now noticed the API. When we pass However, I think it's out of this PR's scope. |
Yes, I agree. |
Abstruct
Unicode says that there are 4 ways to count string length. https://unicode.org/faq/char_combmark.html#7
This commit supports counting by Code points.
Motivation
When we write text something like Japanese, surrogate pair will be used as usual. In such context, restricting string length is painful without considering surrogate pair.