GitHub - hideaki-t/whoosh-igo: tokenizers for Whoosh designed for Japanese language

hideaki-t / whoosh-igo Public

Notifications You must be signed in to change notification settings
Fork 3
Star 7

tokenizers for Whoosh designed for Japanese language

Notifications

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
whooshjp		whooshjp
CHANGES		CHANGES
LICENSE.txt		LICENSE.txt
MANIFEST.in		MANIFEST.in
README		README
TODO		TODO
setup.py		setup.py
test.py		test.py

Repository files navigation

================================
 Japanese Tokenizers for Whoosh
================================

About
=====

Tokenizers for Whoosh full text search library designed for Japanese language.
This package contains two Tokenizers.

* IgoTokenizer

 + requires igo-python(http://pypi.python.org/pypi/igo-python/) and its dictionary.

* TinySegmenterTokenizer

 + requires TinySegmenter (https://pypi.python.org/pypi/tinysegmenter3)

* MeCabTokenizer

 * requires one of MeCab python bindings https://pypi.python.org/pypi/mecab-python or https://pypi.python.org/pypi/mecab-python3


How To Use
==========

IgoTokenizer::

 import igo.Tagger
 import whooshjp
 from whooshjp.IgoTokenizer import IgoTokenizer

 tk = IgoTokenizer()
 scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True,stored=True), content=TEXT(analyzer=tk))


TinySegmenterTokenizer::

 import tinysegmenter
 import whooshjp
 from whooshjp.TinySegmenterTokenizer import TinySegmenterTokenizer

 tk = TinySegmenterTokenizer()
 scm = Schema(title=TEXT(stored=True, analyzer=tk), path=ID(unique=True,stored=True), content=TEXT(analyzer=tk))

Note
====

IgoTokenizer
------------

Whoosh stores a schema including a tokenizer into an index. If the stored schema includes the dictionary used in IgoTokenizer, the size of index might be too large. Storing a schema into an index is a great idea. You just need an index to search. Even your dictionary has been updated, the index has the dictionary used at indexing time.

Also an error will be happened when opening an index with mmap mode IgoTokenizer, since mmaped dictionary will not be stored into an index.
If you already built an index with mmap mode IgoTokenizer, you can open it by passing a schema to whoosh.index.open_dir. e.g. open_dir(path_to_index, schema=an_schema)

To avoid storing *full* tokenizer, IgoTokenizer has 2 instantiation modes.
 1. passing an igo tagger object. i.e. IgoTokenizer(IgoTagger(path='dic')). an instance created by this will store everything including loaded dictionary.
 2. passing arguments for instantiating an igo tokenizer. i.e. IgoTokenizer(path='dic'). Only the arguments will be stored, and instantiate an igo tokenizer with the arguments when an index which uses IgoTokenizer with this mode is opened. If no arguments are not given to IgoTokenizer initializer, i.e. IgoTokenizer(), it is also considered as a this mode, an IgoTagger will be instantiated without any arguments.