
[RFC] Korean Tokenization: Jamo Decomposition as Pre-tokenizer (A Blueprint for Compositional Scripts) #2198

@nicezic

Description


API Impact

Korean text currently consumes 2-3x more tokens than semantically equivalent English text due to how syllable blocks are tokenized. This directly impacts API users: higher costs, faster context window exhaustion, and degraded multilingual performance — all without any gain in semantic resolution. A simple preprocessing step (jamo decomposition) could substantially reduce token overhead for Korean and other compositional scripts.

Summary

Korean (Hangul) characters should be decomposed into their constituent jamo (consonant/vowel components) before tokenization. While this proposal focuses on Korean, this sub-character preprocessing step serves as a highly scalable blueprint for improving token efficiency and pattern generalization across other compositional scripts (e.g., Tibetan, Devanagari) in multilingual models.

This is low-hanging fruit. Current approaches expand vocabulary tables by tens of thousands of entries to brute-force Korean coverage, at significant compute and memory cost. A ~10-line structural preprocessing step in the pre-tokenizer can achieve better coverage, roughly halve token overhead, and improve multilingual performance simultaneously.

Background: How Hangul Works

Hangul is a compositional writing system. Every syllable block is a combination of 2-3 components:

한 = ㅎ (initial) + ㅏ (vowel) + ㄴ (final)
글 = ㄱ (initial) + ㅡ (vowel) + ㄹ (final)
가 = ㄱ (initial) + ㅏ (vowel)          (no final)

The entire system uses only 68 jamo: 19 initials + 21 vowels + 28 finals (including none).

These combine into 11,172 possible syllable blocks (19 × 21 × 28) in Unicode (U+AC00–U+D7A3). The composition is pure arithmetic:

code = 0xAC00 + (initial * 588) + (vowel * 28) + final
# Decomposition is the reverse — 3 lines of code
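As a quick sketch, the composition formula above can be written directly in Python (the function name is illustrative, not part of any existing API):

```python
# Compose a Hangul syllable from jamo indices, per the formula above.
# initial in [0, 19), vowel in [0, 21), final in [0, 28); final 0 = no final.
def compose_hangul(initial: int, vowel: int, final: int = 0) -> str:
    return chr(0xAC00 + initial * 588 + vowel * 28 + final)

# ㅎ is initial index 18, ㅏ is vowel index 0, ㄴ is final index 4:
print(compose_hangul(18, 0, 4))  # → 한
```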

The Problem

Current BPE/Unigram tokenizers treat each of the 11,172 syllable blocks as independent symbols with no structural relationship:

한 (U+D55C) → token #8234
할 (U+D560) → token #9102
함 (U+D568) → token #7891

These three share initial ㅎ + vowel ㅏ (differing only in final consonant), but the tokenizer sees zero relationship. It's like treating "cat", "car", "can" as completely unrelated symbols instead of recognizing the "ca-" prefix.
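The shared structure is easy to make visible with the same Unicode arithmetic; a small sketch (the helper name is hypothetical):

```python
# Extract (initial, vowel, final) indices from a precomposed Hangul syllable
# using the Unicode arithmetic described above.
def jamo_indices(char: str) -> tuple[int, int, int]:
    code = ord(char) - 0xAC00
    return (code // 588, (code % 588) // 28, code % 28)

for ch in "한할함":
    print(ch, jamo_indices(ch))
# All three share (initial=18, vowel=0) = ㅎ + ㅏ; only the final differs.
```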

Result:

  • Korean text consumes 2-3x more tokens than equivalent English text.
  • Pattern generalization across similar-sounding words is lost.
  • Vocabulary table spends up to 11,172 entries on syllable blocks when 68 jamo would suffice.
  • Korean users pay more for API usage for the same semantic content.

Proposed Solution

Add an optional Hangul jamo decomposition step in the normalizer/pre-tokenizer pipeline:

Input:    "한글 처리"
Current:  [한] [글] [처] [리]              → 4+ tokens (opaque blocks)
Proposed: [ㅎㅏㄴ] [ㄱㅡㄹ] [ㅊㅓ] [ㄹㅣ]  → jamo sequences (composable)

The decomposition is trivial — Unicode arithmetic, ~10 lines of code:

# Jamo lookup tables built from the Unicode conjoining-jamo blocks
INITIALS = [chr(0x1100 + i) for i in range(19)]            # U+1100..U+1112
VOWELS   = [chr(0x1161 + i) for i in range(21)]            # U+1161..U+1175
FINALS   = [""] + [chr(0x11A7 + i) for i in range(1, 28)]  # U+11A8..U+11C2

def decompose_hangul(char: str) -> str:
    code = ord(char) - 0xAC00
    if not (0 <= code < 11172):
        return char  # pass non-Hangul characters through unchanged
    initial = code // 588            # 588 = 21 vowels * 28 final slots
    vowel = (code % 588) // 28
    final = code % 28
    result = INITIALS[initial] + VOWELS[vowel]
    if final > 0:
        result += FINALS[final]
    return result
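At the string level, this decomposition is exactly what Unicode NFD normalization performs for Hangul, so a pre-tokenizer pass needs only the standard library. A minimal sketch of such a pass (the function name is hypothetical):

```python
import unicodedata

def jamo_pretokenize(text: str) -> str:
    """Decompose precomposed Hangul syllables (U+AC00-U+D7A3) into
    conjoining jamo; every other character passes through unchanged."""
    return "".join(
        unicodedata.normalize("NFD", ch) if 0xAC00 <= ord(ch) <= 0xD7A3 else ch
        for ch in text
    )

print(jamo_pretokenize("한글 처리"))  # jamo sequence; the space is untouched
```

Restricting NFD to the Hangul syllable range keeps the pass from altering other composed characters (e.g. accented Latin letters) that NFD would also decompose.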

Expected Benefits

| Metric | Current | With Jamo Decomposition |
| --- | --- | --- |
| Vocabulary entries for Korean | ~11,172 syllable blocks | 68 jamo |
| Token count for Korean text | 2-3x vs English | Potentially close to parity |
| Cross-word pattern sharing | None (opaque blocks) | Structural (shared jamo = shared tokens) |
| Implementation cost | N/A | Minimal (Unicode arithmetic) |

Considerations

  • Token count per Korean word may increase (1 syllable → 2-3 jamo tokens), but vocabulary compression + pattern generalization should offset this.
  • Reconstruction is lossless — jamo → syllable block is deterministic.
  • This could be opt-in (language-detected or user-specified).
  • Multilingual Scalability: This sub-character decomposition approach provides a foundational framework that can be directly adapted to benefit other compositional scripts (e.g., Tibetan, Devanagari, Thai) for broader multilingual token optimization.
  • Empirical Validation: Research such as KR-BERT (Lee et al., 2020) has already demonstrated the effectiveness of sub-character BPE for Korean. Modern global tokenizers still rely on byte-level BPE (BBPE), which ignores the mathematical compositionality of these scripts. Supporting jamo decomposition in the pre-tokenizer is a near-zero-cost architectural win.
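The losslessness claim above is easy to sanity-check in Python: NFC recomposition exactly inverts the jamo (NFD) decomposition, so no information is lost in the pre-tokenizer.

```python
import unicodedata

text = "한글 토큰화는 무손실이다"
decomposed = unicodedata.normalize("NFD", text)        # syllables -> jamo
recomposed = unicodedata.normalize("NFC", decomposed)  # jamo -> syllables

assert recomposed == text            # exact round trip
assert len(decomposed) > len(text)   # jamo form is longer per syllable
print("round trip OK")
```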

References & Empirical Evidence

  • KR-BERT (Lee et al., 2020): Shows that sub-character BPE drastically reduces vocabulary size while outperforming character-level models. (arXiv:2008.03979)
  • Jeon et al., 2023: "Improving Korean NLP Tasks with Linguistically Informed Subword Tokenization and Sub-character Decomposition" (arXiv:2311.03928)
  • ACL 2024 Findings: "Korean Character Representations Based on the Combination Rules of Subcharacters" (ACL Anthology)
  • Unicode Hangul Syllable Decomposition: Official algorithmic specification. (Unicode Standard, Chapter 3.12)
  • Reference Implementation: MorphSubDecomp-Korean — Sub-character decomposition pre-tokenizer pipeline

Metadata

Labels

priority: p2 (moderately-important priority; fix may not be included in next release)
type: feature request ('nice-to-have' improvement, new feature, or different behavior or design)
