A compact Japanese text tokenizer for Dart. TinySegmenter is a Japanese word segmentation library based on the original JavaScript implementation by Taku Kudo.

ioridev/tiny_segmenter_dart

TinySegmenter for Dart

A compact Japanese text tokenizer for Dart. TinySegmenter is a Japanese word segmentation library based on the original JavaScript implementation by Taku Kudo.

Features

  • 🚀 Lightweight: No external dependencies, compact implementation
  • 🎯 Dictionary-free: Uses statistical modeling instead of requiring large dictionary files
  • 🔧 Simple API: Easy to use with just one method call
  • 📦 Pure Dart: Works on all Dart platforms (Flutter, Web, Server, etc.)
  • 🇯🇵 Japanese text support: Handles Hiragana, Katakana, Kanji, and mixed text

Installation

Add this to your package's pubspec.yaml file:

dependencies:
  tiny_segmenter_dart: ^1.0.0

Then run:

dart pub get

Usage

Basic Usage

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Segment Japanese text
  final words = segmenter.segment('私は日本人です');
  print(words); // ['私', 'は', '日本人', 'です']
  
  // Works with text that mixes kanji and hiragana
  final mixed = segmenter.segment('今日はいい天気です');
  print(mixed); // ['今日', 'は', 'いい', '天気', 'です']
}

More Examples

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Katakana text
  final katakana = segmenter.segment('コンピューターを使います');
  print(katakana); // ['コンピューター', 'を', '使い', 'ます']
  
  // Text with numbers
  final withNumbers = segmenter.segment('今日は2023年12月です');
  print(withNumbers); // ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です']
  
  // Complex sentence
  final complex = segmenter.segment('私は東京大学で日本語を勉強しています。');
  print(complex); // ['私', 'は', '東京', '大学', 'で', '日本語', 'を', '勉強', 'し', 'て', 'い', 'ます', '。']
  
  // Empty string handling
  final empty = segmenter.segment('');
  print(empty); // []
}

How it Works

TinySegmenter uses a statistical approach to segment Japanese text:

  1. Character Classification: Characters are classified into types (Hiragana, Katakana, Kanji, Alphabet, Numbers, etc.)
  2. Statistical Modeling: Uses pre-trained statistical models to determine word boundaries
  3. Context Analysis: Considers surrounding characters and their types to make segmentation decisions
  4. No Dictionary Required: Unlike traditional approaches, it doesn't need large dictionary files
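The steps above can be illustrated with a toy model. This is a minimal sketch, not the library's code: it assumes a linear scoring scheme in which a bias plus feature weights is summed at each candidate position and a boundary is placed when the score is positive. The names `toyType`, `toySegment`, `bias`, and `typeBigramWeights` are hypothetical, the weights are made up, and only one feature (the pair of character types on either side of the position) is used, whereas a trained model combines many features.

```dart
// Hypothetical toy weights keyed by "typeBefore|typeAfter".
// These values are invented for illustration, not taken from any trained model.
const int bias = -1;
const Map<String, int> typeBigramWeights = {
  'H|I': 3, 'I|H': 3, // kanji <-> hiragana: a type change favours a boundary
  'H|K': 3, 'K|H': 3, // kanji <-> katakana
  'K|I': 3, 'I|K': 3, // katakana <-> hiragana
  'H|H': -2, 'K|K': -4, 'I|I': -1, // same type: favour staying in one word
};

// Simplified classification by UTF-16 code unit (BMP characters only).
String toyType(String ch) {
  final c = ch.codeUnitAt(0);
  if (c >= 0x3041 && c <= 0x3093) return 'I'; // hiragana
  if ((c >= 0x30A1 && c <= 0x30F4) || c == 0x30FC) return 'K'; // katakana
  if (c >= 0x4E00 && c <= 0x9FA0) return 'H'; // kanji
  return 'O'; // everything else
}

List<String> toySegment(String text) {
  if (text.isEmpty) return [];
  final chars = text.split('');
  final words = <String>[];
  var current = StringBuffer(chars.first);
  for (var i = 1; i < chars.length; i++) {
    // Score the gap between chars[i - 1] and chars[i].
    final key = '${toyType(chars[i - 1])}|${toyType(chars[i])}';
    final score = bias + (typeBigramWeights[key] ?? 0);
    if (score > 0) {
      // Positive score: close the current word and start a new one.
      words.add(current.toString());
      current = StringBuffer();
    }
    current.write(chars[i]);
  }
  words.add(current.toString());
  return words;
}

void main() {
  print(toySegment('東京タワーです')); // [東京, タワー, です]
}
```

With a single feature the toy model splits only where the character type changes, so it cannot separate words of the same type, such as 東京 and 大学 in 東京大学. Considering more context is what allows a trained model to do better.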

Character Types

The segmenter recognizes these character types:

  • M: Kanji numerals (一二三四五六七八九十百千万億兆)
  • H: Kanji characters (一-龠々〆ヵヶ)
  • I: Hiragana (ぁ-ん)
  • K: Katakana, full-width and half-width (ァ-ヴーｱ-ﾝﾞｰ)
  • A: Alphabet, ASCII and full-width (a-zA-Zａ-ｚＡ-Ｚ)
  • N: Arabic numbers, ASCII and full-width (0-9０-９)
  • O: Other characters
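The classification above could be written as an ordered list of regular expressions. This is a sketch under assumptions: `charPatterns` and `charType` are hypothetical names, and whether this package structures its classifier this way is not confirmed here. Order matters in this sketch because the kanji numerals also fall inside the general kanji range, so M is tested before H.

```dart
// Ordered (pattern, label) pairs; the first match wins.
final List<MapEntry<RegExp, String>> charPatterns = [
  MapEntry(RegExp(r'[一二三四五六七八九十百千万億兆]'), 'M'), // kanji numerals
  MapEntry(RegExp(r'[一-龠々〆ヵヶ]'), 'H'), // kanji
  MapEntry(RegExp(r'[ぁ-ん]'), 'I'), // hiragana
  MapEntry(RegExp(r'[ァ-ヴーｱ-ﾝﾞｰ]'), 'K'), // katakana, full- and half-width
  MapEntry(RegExp(r'[a-zA-Zａ-ｚＡ-Ｚ]'), 'A'), // alphabet, ASCII and full-width
  MapEntry(RegExp(r'[0-9０-９]'), 'N'), // digits, ASCII and full-width
];

// Returns the type label for a single character, or 'O' if nothing matches.
String charType(String ch) {
  for (final entry in charPatterns) {
    if (entry.key.hasMatch(ch)) return entry.value;
  }
  return 'O';
}

void main() {
  // split('') yields UTF-16 code units, which is fine for these BMP characters.
  final text = '私は2023年にコンピューターを買った。';
  print(text.split('').map(charType).join()); // HINNNNHIKKKKKKKIHIIO
}
```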

Performance

TinySegmenter is designed to be fast and memory-efficient:

  • No external dependencies
  • Minimal memory footprint
  • Fast segmentation speed
  • Suitable for real-time applications
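To check these claims on your own hardware and input, a small timing harness can help. The sketch below is an assumption-laden illustration: `averageMicros` is a hypothetical helper, and it assumes `segment` can be passed as a function from `String` to `List<String>`. A stand-in splitter is used so the snippet runs without the package installed.

```dart
/// Returns the average wall-clock time per call, in microseconds.
double averageMicros(
  List<String> Function(String) segment,
  String text, {
  int runs = 1000,
}) {
  // Warm-up calls so one-time costs do not dominate the measurement.
  for (var i = 0; i < 100; i++) {
    segment(text);
  }
  final stopwatch = Stopwatch()..start();
  for (var i = 0; i < runs; i++) {
    segment(text);
  }
  stopwatch.stop();
  return stopwatch.elapsedMicroseconds / runs;
}

void main() {
  // Stand-in for illustration only; with the package installed,
  // pass TinySegmenter().segment here instead.
  List<String> standIn(String s) => s.split('');

  final micros = averageMicros(standIn, '私は東京大学で日本語を勉強しています。');
  print('${micros.toStringAsFixed(2)} microseconds per call');
}
```

Results vary by platform (VM, AOT, Web) and input length, so measure with text that is representative of your application.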

Limitations

  • Optimized for Japanese text; may not work well with other languages
  • Segmentation accuracy depends on the statistical model and may not be perfect for all text types
  • Numbers are often segmented character by character (for example, 2023 becomes '2', '0', '2', '3')
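If digit-by-digit output is a problem for your use case, one possible workaround is to merge adjacent digit tokens after segmentation. This is a minimal sketch of such a post-processing step; `mergeDigitRuns` and `digitsOnly` are hypothetical names and are not part of this package's API.

```dart
// Matches tokens made up entirely of ASCII or full-width digits.
final RegExp digitsOnly = RegExp(r'^[0-9０-９]+$');

// Joins runs of consecutive digit-only tokens into a single token.
List<String> mergeDigitRuns(List<String> tokens) {
  final merged = <String>[];
  var lastWasDigits = false;
  for (final token in tokens) {
    final isDigits = digitsOnly.hasMatch(token);
    if (isDigits && lastWasDigits) {
      // Extend the previous digit token instead of adding a new one.
      merged[merged.length - 1] += token;
    } else {
      merged.add(token);
    }
    lastWasDigits = isDigits;
  }
  return merged;
}

void main() {
  final tokens = ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です'];
  print(mergeDigitRuns(tokens)); // [今日, は, 2023, 年, 12, 月, です]
}
```

The helper works on any list of tokens, so it can be applied directly to the result of `segmenter.segment(...)`.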

Credits

  • Original JavaScript implementation by Taku Kudo
  • Dart port: this repository (ioridev/tiny_segmenter_dart)

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

If you encounter any issues or have suggestions, please open an issue on GitHub.
