myanmar_word_segmenter

In Myanmar sentences, words are often written without spaces between them, which makes it hard for beginners to look them up in a dictionary.

This script takes a simple approach: it uses a list of known Myanmar words to split long words or phrases into smaller individual words, making them easier to find in a dictionary.
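To make the word-list idea concrete, here is a minimal sketch of greedy longest-match segmentation. It is an assumption about how such a splitter can work, not the code in MyWordSegmenter.js; the function name segmentByWordList and the three-word sample list are hypothetical.

```javascript
// Illustrative sketch only: greedy longest-match against a known word list.
// segmentByWordList and sampleWords are hypothetical, not part of MyWordSegmenter.js.
function segmentByWordList(text, words) {
  const wordSet = new Set(words);
  // The longest entry bounds how far ahead each lookup needs to reach.
  const maxLen = words.reduce((m, w) => Math.max(m, w.length), 1);
  const result = [];
  let unknown = ""; // characters not covered by any known word
  let i = 0;
  while (i < text.length) {
    let match = "";
    // Try the longest candidate first, then shrink one character at a time.
    for (let len = Math.min(maxLen, text.length - i); len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (wordSet.has(candidate)) {
        match = candidate;
        break;
      }
    }
    if (match) {
      // Flush any pending unknown run before emitting the matched word.
      if (unknown) {
        result.push(unknown);
        unknown = "";
      }
      result.push(match);
      i += match.length;
    } else {
      unknown += text[i];
      i += 1;
    }
  }
  if (unknown) result.push(unknown);
  return result;
}

const sampleWords = ["ညီအစ်ကို", "မသိတသိ", "အချိန်"];
// Join the words without spaces, then split them apart again.
console.log(segmentByWordList(sampleWords.join(""), sampleWords));
```

Greedy matching is the simplest choice and can pick a wrong split when a long entry overlaps the start of the next word; a real word list is also far larger than this sample.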

Usage

MyWordSegmenter.js is a standalone code file that can be used directly in a browser or as a Node.js module. See example.js for a working example.

const MyWordSegmenter = require("./MyWordSegmenter");
const mSegmenter = new MyWordSegmenter();

const imagineTestWord =
  "123ညီအစ်ကိုမသိတသိအချိန်ညီအစ်ကိုမသိတသိအချိန်Singaporeမသေကောင်းမပျောက်ကောင်းမှုန်မှုန်မွှားမွှား";

console.log("\nTest input", imagineTestWord);
console.log("Test output", mSegmenter.word_segment(imagineTestWord));

Expected output:
[
  '123',
  'ညီအစ်ကိုမသိတသိအချိန်',
  'ညီအစ်ကိုမသိတသိအချိန်',
  'Singapore',
  'မသေကောင်းမပျောက်ကောင်း',
  'မှုန်မှုန်မွှားမွှား'
]

Attributions


Developer notes

1. Download or update the myanmar-words list

Clone or download the myanmar-words repository into the same directory as prepareWordList.js, so that its word files are at myanmar-words/json-files relative to prepareWordList.js.

git clone --depth=1 https://github.com/myanmartools/myanmar-words.git

# then run
node prepareWordList.js

The script uses dev_standalone_template to generate MYWORDS.json and the standalone code file MyWordSegmenter.js.

2. Miscellaneous

  • Combine with dictionaries

Myanmar dictionaries can be integrated into MyWordSegmenter.js to build a basic Myanmar popup dictionary browser extension for learners.

  • Work with other languages

The theSegmenter method in MyWordSegmenter.js can be adapted to many other languages with minor modifications.

Try it yourself with the Pāḷi language.
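For the "Combine with dictionaries" idea, a rough sketch of the lookup step might look like the following. The lookupSegments helper and the sampleDictionary entries are hypothetical and only illustrate the shape of the data; in real use the segments would come from mSegmenter.word_segment(text).

```javascript
// Hypothetical dictionary: Myanmar headword -> English gloss (illustrative entries only).
const sampleDictionary = new Map([
  ["အချိန်", "time"],
  ["ညီအစ်ကို", "brothers"],
]);

// Pair each segment with its dictionary gloss, or null when no entry exists.
function lookupSegments(segments, dictionary) {
  return segments.map((segment) => ({
    segment,
    gloss: dictionary.has(segment) ? dictionary.get(segment) : null,
  }));
}

// Here the segments are hard-coded; a popup extension would instead pass
// the array returned by mSegmenter.word_segment(selectedText).
const entries = lookupSegments(["ညီအစ်ကို", "Singapore", "အချိန်"], sampleDictionary);
console.log(entries);
```

Returning null for missing entries lets a popup show the segment itself without a gloss, for example for non-Myanmar tokens such as 'Singapore'.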