TinySegmenter for Dart

A compact Japanese word segmenter (tokenizer) for Dart, ported from the original JavaScript TinySegmenter by Taku Kudo.

Features

🚀 Lightweight: No external dependencies, compact implementation
🎯 Dictionary-free: Uses statistical modeling instead of requiring large dictionary files
🔧 Simple API: Easy to use with just one method call
📦 Pure Dart: Works on all Dart platforms (Flutter, Web, Server, etc.)
🇯🇵 Japanese text support: Handles Hiragana, Katakana, Kanji, and mixed text

Installation

Add this to your package's pubspec.yaml file:

dependencies:
  tiny_segmenter_dart: ^1.0.0

Then run (or flutter pub get in a Flutter project):

dart pub get

Usage

Basic Usage

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Segment Japanese text
  final words = segmenter.segment('私は日本人です');
  print(words); // [私, は, 日本人, です]

  // Works with mixed Kanji and Hiragana
  final mixed = segmenter.segment('今日はいい天気です');
  print(mixed); // [今日, は, いい, 天気, です]
}

More Examples

import 'package:tiny_segmenter_dart/tiny_segmenter_dart.dart';

void main() {
  final segmenter = TinySegmenter();
  
  // Katakana text
  final katakana = segmenter.segment('コンピューターを使います');
  print(katakana); // [コンピューター, を, 使い, ます]

  // Text with numbers (digits come out one per token)
  final withNumbers = segmenter.segment('今日は2023年12月です');
  print(withNumbers); // [今日, は, 2, 0, 2, 3, 年, 1, 2, 月, です]

  // Complex sentence
  final complex = segmenter.segment('私は東京大学で日本語を勉強しています。');
  print(complex); // [私, は, 東京, 大学, で, 日本語, を, 勉強, し, て, い, ます, 。]

  // Empty string handling
  final empty = segmenter.segment('');
  print(empty); // []
}

How it Works

TinySegmenter uses a statistical approach to segment Japanese text:

Character Classification: Characters are classified into types (Hiragana, Katakana, Kanji, Alphabet, Numbers, etc.)
Statistical Modeling: Uses pre-trained statistical models to determine word boundaries
Context Analysis: Considers surrounding characters and their types to make segmentation decisions
No Dictionary Required: Unlike traditional approaches, it doesn't need large dictionary files
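
The steps above can be sketched as a toy linear scorer: at each gap between two characters, feature weights are added to a bias, and a boundary is placed when the total is positive. This mirrors the general shape of Kudo's original JavaScript implementation, but everything here is an assumption for illustration: the names toyCharType and toySegment are hypothetical, the weights are made up rather than trained, and the real model uses many more features over a wider character window.

```dart
// Toy illustration of boundary scoring. All weights below are invented for
// this example; they are NOT the package's trained model.

final RegExp _hiragana = RegExp(r'[ぁ-ん]');
final RegExp _kanji = RegExp(r'[一-龠]');

/// Reduced character classifier: Hiragana (I), Kanji (H), anything else (O).
String toyCharType(String ch) {
  if (_hiragana.hasMatch(ch)) return 'I';
  if (_kanji.hasMatch(ch)) return 'H';
  return 'O';
}

/// Starting score for every candidate boundary (negative: default is "no split").
const int bias = -10;

/// Feature name -> weight.
/// 'T:XY' fires when the left character has type X and the right has type Y.
/// 'P:c' fires when the left character is exactly c.
const Map<String, int> weights = {
  'T:HI': 30, // Kanji then Hiragana: often a word boundary
  'T:IH': 40, // Hiragana then Kanji: often a word boundary
  'T:HH': -20, // Kanji then Kanji: usually the same word
  'T:II': -5, // Hiragana then Hiragana: slightly prefer no boundary
  'P:は': 25, // left character is the particle は
};

List<String> toySegment(String text) {
  if (text.isEmpty) return [];
  final chars = text.split(''); // UTF-16 code units; fine for BMP text
  final result = <String>[];
  var word = chars.first;
  for (var i = 1; i < chars.length; i++) {
    final prev = chars[i - 1];
    final next = chars[i];
    // Sum the weights of the features that fire at this gap.
    var score = bias;
    score += weights['T:${toyCharType(prev)}${toyCharType(next)}'] ?? 0;
    score += weights['P:$prev'] ?? 0;
    if (score > 0) {
      result.add(word); // positive score: close the current word
      word = next;
    } else {
      word += next; // otherwise keep extending it
    }
  }
  result.add(word);
  return result;
}

void main() {
  print(toySegment('私は日本人です')); // [私, は, 日本人, です]
}
```

With only five hand-picked weights this toy handles the one sample sentence and little else; the point is the decision rule, not the accuracy.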

Character Types

The segmenter recognizes these character types:

M: Kanji numerals (一二三四五六七八九十百千万億兆)
H: Kanji and related marks (一-龠々〆ヵヶ)
I: Hiragana (ぁ-ん)
K: Katakana, full-width and half-width (ァ-ヴーｱ-ﾝﾞｰ)
A: Latin letters, half-width and full-width (a-zA-Zａ-ｚＡ-Ｚ)
N: Arabic numerals, half-width and full-width (0-9０-９)
O: Other characters
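
As a rough sketch of how the classes listed above could be applied, the following classifier tests each pattern in order and falls back to O. The function name charType and the table layout are assumptions for illustration, not the package's public API. Note that the order matters: the Kanji numerals in M also fall inside the H range, so M has to be checked first.

```dart
// Sketch of a character-type classifier built from the character classes
// listed above. `charType` is an illustrative name, not the package API.

/// Ordered (pattern, label) pairs. M comes before H because every Kanji
/// numeral would otherwise match the broader H range.
final List<MapEntry<RegExp, String>> _charTypePatterns = [
  MapEntry(RegExp(r'[一二三四五六七八九十百千万億兆]'), 'M'),
  MapEntry(RegExp(r'[一-龠々〆ヵヶ]'), 'H'),
  MapEntry(RegExp(r'[ぁ-ん]'), 'I'),
  MapEntry(RegExp(r'[ァ-ヴーｱ-ﾝﾞｰ]'), 'K'),
  MapEntry(RegExp(r'[a-zA-Zａ-ｚＡ-Ｚ]'), 'A'),
  MapEntry(RegExp(r'[0-9０-９]'), 'N'),
];

/// Returns the one-letter type of a single character, or 'O' if none match.
String charType(String ch) {
  for (final entry in _charTypePatterns) {
    if (entry.key.hasMatch(ch)) return entry.value;
  }
  return 'O';
}

void main() {
  // split('') yields UTF-16 code units, which is fine for BMP characters.
  final labels = '私は2023年にTokyoへ。'.split('').map(charType).join();
  print(labels); // HINNNNHIAAAAAIO
}
```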

Performance

TinySegmenter is designed to be fast and memory-efficient:

No external dependencies
Minimal memory footprint (no dictionary files to load)
Fast segmentation speed
Suitable for real-time applications
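
To check these claims on your own workload, a small timing harness along the following lines may help. It is only a sketch: microsPerCall is a hypothetical helper, and the stand-in segmenter in main() simply splits per character so the example runs without the package. Pass TinySegmenter().segment instead to measure the real thing.

```dart
// Minimal timing harness. `segment` is any function with the same shape as
// TinySegmenter.segment (String in, list of tokens out).

double microsPerCall(
  List<String> Function(String) segment,
  String text, {
  int iterations = 1000,
}) {
  // Warm up so that JIT compilation does not dominate the measurement.
  for (var i = 0; i < 100; i++) {
    segment(text);
  }
  final sw = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    segment(text);
  }
  sw.stop();
  return sw.elapsedMicroseconds / iterations;
}

void main() {
  // Stand-in: one token per character. Replace with TinySegmenter().segment.
  List<String> standIn(String s) => s.split('');
  final avg = microsPerCall(standIn, '私は東京大学で日本語を勉強しています。');
  print('average: ${avg.toStringAsFixed(2)} µs per call');
}
```

Results vary by platform (VM, AOT, Web), so it is worth running the harness on each target you care about.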

Limitations

Optimized for Japanese text; may not work well with other languages
Segmentation accuracy depends on the statistical model and may not be perfect for all text types
Numbers are often segmented character by character
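
If per-digit tokens are inconvenient, one possible workaround is to merge adjacent digit tokens after segmentation. The helper below is a sketch, not part of the package; the name mergeDigitRuns is hypothetical, and it treats only half-width and full-width Arabic digits as numeric.

```dart
// Post-processing sketch: join consecutive digit-only tokens into one token.

final RegExp _digitsOnly = RegExp(r'^[0-9０-９]+$');

List<String> mergeDigitRuns(List<String> tokens) {
  final merged = <String>[];
  var lastWasDigits = false;
  for (final token in tokens) {
    final isDigits = _digitsOnly.hasMatch(token);
    if (isDigits && lastWasDigits) {
      // Extend the previous numeric token instead of starting a new one.
      merged[merged.length - 1] += token;
    } else {
      merged.add(token);
    }
    lastWasDigits = isDigits;
  }
  return merged;
}

void main() {
  final tokens = ['今日', 'は', '2', '0', '2', '3', '年', '1', '2', '月', 'です'];
  print(mergeDigitRuns(tokens)); // [今日, は, 2023, 年, 12, 月, です]
}
```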

Credits

Original JavaScript implementation by Taku Kudo
Dart port implementation

License

This project is licensed under the BSD 3-Clause License - see the LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Issues

If you encounter any issues or have suggestions, please open an issue on GitHub.
