- Before 1.5, punctuations and spaces were removed during normal tokenization but kept during tokenization for transformation, which is used internally by the Coc Coc Search Engine. This update introduces the `keep_puncts` option in the `run_tokenize()` function, which can be used to keep punctuations (but not spaces and dots in segmented URLs) in normal tokenization. A hedged sketch of the updated call follows this list.
- New arguments `-k` and `-t` are introduced in the CLI to toggle `keep_puncts` and `for_transformation` when running the tokenizer.
- Before 1.5, `run_tokenize()` had a parameter named `dont_push_puncts`, which was used to prevent the inclusion of punctuations in the result when tokenizing for transformation. It was replaced by `keep_puncts`, which serves the same purpose but (1) can be used for both normal tokenization and tokenization for transformation, and (2) positive parameter naming is a better practice. The default value of `keep_puncts` is equal to `for_transformation`: false for normal tokenization, true for transformation. Previous behaviour therefore remains the same, but this will break old code that called `run_tokenize()` with `for_transformation = true` and `dont_push_puncts = false` (see the migration sketch after this list).
- Wrapper functions are added to the C++ implementation, matching those in the Java binding, for ease of use.
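
The rule that `keep_puncts` defaults to the value of `for_transformation` is easier to see in code. The following is only a sketch: the `Token` struct and the exact shape of `run_tokenize()` are assumptions made for illustration, not the project's actual declarations, and since C++ does not let one default argument refer to another parameter, the "follow `for_transformation`" default is modelled here with a sentinel value.

```cpp
#include <string>
#include <vector>

// Hypothetical token type for illustration; the real C++ implementation
// defines its own token structures.
struct Token {
    std::string text;
};

// Sketch of a 1.5-style interface. keep_puncts is a tri-state here:
// a negative value means "not specified, follow for_transformation".
std::vector<Token> run_tokenize(const std::string &text,
                                bool for_transformation = false,
                                int keep_puncts = -1)
{
    // Resolve the documented default: keep_puncts == for_transformation
    // (false for normal tokenization, true for transformation).
    const bool keep = (keep_puncts < 0) ? for_transformation : (keep_puncts != 0);

    std::vector<Token> result;
    // ... the real segmentation logic would run here, pushing punctuation
    // tokens into `result` only when `keep` is true ...
    (void)text;
    (void)keep;
    return result;
}
```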
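
For the breaking case above, a possible migration looks like this. It again assumes the hypothetical signature from the sketch, with that `run_tokenize()` in scope; only flipping (or dropping) the third argument is the point being illustrated.

```cpp
#include <string>

// Reuses the hypothetical run_tokenize() sketched above.
void migration_example()
{
    std::string text = "vi du: xin chao!";

    // Before 1.5 the third argument meant dont_push_puncts, so this call KEPT punctuations:
    //   auto tokens = run_tokenize(text, /*for_transformation=*/true, /*dont_push_puncts=*/false);
    // From 1.5 the same positional value is read as keep_puncts = false and now drops them.

    // Equivalent calls under the new parameter:
    auto tokens_default  = run_tokenize(text, /*for_transformation=*/true);                       // keep_puncts defaults to for_transformation (true)
    auto tokens_explicit = run_tokenize(text, /*for_transformation=*/true, /*keep_puncts=*/true); // or state it explicitly
    (void)tokens_default;
    (void)tokens_explicit;
}
```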