feat: count string by codepoint #44

yumetodo · 2024-08-13T03:56:54Z

Abstract

Unicode says that there are 4 ways to count string length. https://unicode.org/faq/char_combmark.html#7

This commit supports counting by Code points.

Motivation

When writing text in a language such as Japanese, characters encoded as surrogate pairs appear routinely. In that context, enforcing a maximum string length is painful unless surrogate pairs are taken into account.
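
To make the difference concrete, here is a minimal sketch comparing the two counts. The helper names `lengthByCodeUnits` and `lengthByCodePoints` are hypothetical and are not taken from the PR's implementation, which is not shown in this thread.

```typescript
// Hypothetical helpers for illustration only; not the PR's actual code.

// Counts UTF-16 code units, which is what String.prototype.length reports.
function lengthByCodeUnits(s: string): number {
    return s.length;
}

// Counts Unicode code points by treating each valid surrogate pair as one unit.
function lengthByCodePoints(s: string): number {
    let count = 0;
    for (let i = 0; i < s.length; i++) {
        const unit = s.charCodeAt(i);
        // A high surrogate followed by a low surrogate encodes a single code point.
        if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
            const next = s.charCodeAt(i + 1);
            if (next >= 0xdc00 && next <= 0xdfff) {
                i++; // skip the low surrogate
            }
        }
        count++;
    }
    return count;
}

// U+20BB7 (written here as the surrogate pair \uD842\uDFB7) followed by two BMP kanji.
const sample = "\uD842\uDFB7野家";
console.log(lengthByCodeUnits(sample)); // 4
console.log(lengthByCodePoints(sample)); // 3
```

A sentence containing such characters therefore reaches a code unit limit sooner than a reader counting visible characters would expect.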


azu

FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" }) is more precise, but also more complex to implement due to language dependencies.
(Intl.Segmenter is probably also slower than the other approaches.)

https://blog.jxck.io/entries/2017-03-02/unicode-in-javascript.html#unicode-text-segmentation
https://github.com/tc39/proposal-intl-segmenter
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
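
As a rough illustration of grapheme counting, the sketch below uses `Intl.Segmenter` with the locale from the comment above. It assumes a runtime and TypeScript lib setting that provide `Intl.Segmenter` (ES2022); the helper name `lengthByGraphemes` is hypothetical and is not part of this PR.

```typescript
// Hypothetical helper for illustration; assumes Intl.Segmenter is available (ES2022).
function lengthByGraphemes(s: string, locale: string): number {
    const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
    // segment() returns an iterable with one entry per grapheme cluster.
    return Array.from(segmenter.segment(s)).length;
}

// HIRAGANA LETTER KA (U+304B) followed by the combining voiced sound mark (U+3099).
// A reader sees a single character, but it is two code units and two code points.
const combined = "\u304B\u3099";
console.log(combined.length); // 2
console.log(lengthByGraphemes(combined, "ja-JP")); // 1
```

Passing the locale explicitly keeps the result independent of the environment's default locale.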

azu · 2024-08-13T06:10:31Z

src/sentence-length.ts

+     * By default or set to "code units", count string by UTF-16 code unit(= using `String.prototype.length`).
+     * If set to "codepoints", count string by codepoint.
+     */
+    countBy?: "code units" | "codepoints";


Suggested change

-    countBy?: "code units" | "codepoints";
+    countBy?: "codeunits" | "codepoints";

I think it would be better to align without spaces.

azu · 2024-08-13T06:13:48Z

src/sentence-length.ts

 const reporter: TextlintRuleReporter<Options> = (context, options = {}) => {
    const maxLength = options.max ?? defaultOptions.max;
    const skipPatterns = options.skipPatterns ?? options.exclusionPatterns ?? defaultOptions.skipPatterns;
    const skipUrlStringLink = options.skipUrlStringLink ?? defaultOptions.skipUrlStringLink;
+    const strLen =
+        options.countBy == null || options.countBy === "code units" ? (s: string) => s.length : strLenByCodePoint;


Can you create a function like strLenByCodeUnits and use it?

const countBy = options?.countBy ?? defaultOptions.countBy;
const strLen = countBy === "codeunits" ? strLenByCodeUnits : strLenByCodePoint;
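
A self-contained sketch of that suggestion might look like the following. The `Options` shape, the `defaultOptions` value, the function bodies, and the `resolveStrLen` wrapper are assumptions made for illustration; only the names `strLenByCodeUnits` and `strLenByCodePoints` and the `countBy` values come from this thread.

```typescript
// Assumed option shape; the rule's real Options type has more fields.
type CountBy = "codeunits" | "codepoints";
interface Options {
    countBy?: CountBy;
}

// Assumed default, matching the doc comment that code units are used by default.
const defaultOptions: Required<Options> = { countBy: "codeunits" };

// Counts UTF-16 code units.
const strLenByCodeUnits = (s: string): number => s.length;

// Counts code points by collapsing each valid surrogate pair into one placeholder character.
const strLenByCodePoints = (s: string): number =>
    s.replace(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g, "_").length;

// Hypothetical wrapper: picks the counting function once from the resolved option.
function resolveStrLen(options: Options = {}): (s: string) => number {
    const countBy = options.countBy ?? defaultOptions.countBy;
    return countBy === "codeunits" ? strLenByCodeUnits : strLenByCodePoints;
}

const strLen = resolveStrLen({ countBy: "codepoints" });
console.log(strLen("\uD842\uDFB7")); // 1
```

Resolving the default in one place keeps the selection to a single comparison and avoids the separate null check.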

yumetodo · 2024-08-13T13:31:45Z

@azu Thank you for your review! I applied your suggestions.

FYI: new Intl.Segmenter("ja-JP", { granularity: "grapheme" }) is more precise, but also more complex to implement due to language dependencies.

I only just noticed this API. If we pass undefined as the locale, the runtime's default locale is used, so lint results could vary between environments. We would therefore need to decide which locale to specify and how to specify it.

However, I think that is out of scope for this PR. countBy? could later be extended to something like countBy?: "codeunits" | "codepoints" | "grapheme";.

azu · 2024-08-14T04:16:41Z

However, I think that is out of scope for this PR. countBy? could later be extended to something like countBy?: "codeunits" | "codepoints" | "grapheme";.

Yes, I agree.

azu · 2024-08-14T12:26:09Z

https://github.com/textlint-rule/textlint-rule-sentence-length/releases/tag/v5.2.0 released

yumetodo force-pushed the feat/count_by_codepoint branch from 3ee7a33 to ad37387 Compare August 13, 2024 03:58

yumetodo changed the title ~~feat: count string by codepoint~~ Draft! feat: count string by codepoint Aug 13, 2024

yumetodo force-pushed the feat/count_by_codepoint branch 2 times, most recently from 1ece1c7 to ac71454 Compare August 13, 2024 04:09

yumetodo changed the title ~~Draft! feat: count string by codepoint~~ feat: count string by codepoint Aug 13, 2024

yumetodo force-pushed the feat/count_by_codepoint branch from ac71454 to c32cb9f Compare August 13, 2024 04:41

feat: count string by codepoint

3bc67bf

Unicode says that there are 4 ways to count string length. https://unicode.org/faq/char_combmark.html#7 This commit supports counting by Code points.

yumetodo force-pushed the feat/count_by_codepoint branch from c32cb9f to 3bc67bf Compare August 13, 2024 05:03

azu reviewed Aug 13, 2024

yumetodo and others added 2 commits August 13, 2024 22:15

refactor: cut-out to strLenByCodeUnits function

c555591

ref: - textlint-rule#44 (comment) Co-authored-by: azu <[email protected]>

chore: s/code units/codeunits/

511ec16

ref: - textlint-rule#44 (comment) Co-authored-by: azu <[email protected]>

chore: s/strLenByCodePoint/strLenByCodePoints/

4f1a1a1

azu merged commit a6873ea into textlint-rule:master Aug 14, 2024
2 checks passed

azu added the Type: Feature New Feature label Aug 14, 2024

yumetodo deleted the feat/count_by_codepoint branch August 14, 2024 06:50
