
Prerequisites for the application

NLTK

    - pip install nltk 

Flask

    - pip install Flask

Regex

    - re is part of the Python standard library; no installation is needed (just import re)

What is Tokenization?

Tokenization is a very common task in NLP. It is the task of chopping a character sequence into pieces, called tokens, often while throwing away certain characters, such as punctuation.

The tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the point where one word ends and the next begins. These tokens are useful for finding patterns and are considered a base step for stemming and lemmatization.
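As a dependency-free illustration of word-boundary splitting, a small sketch using Python's built-in re module (one of the prerequisites above) is shown below; this is an illustrative example, not necessarily the tokenizer the application itself uses:

```python
import re

def simple_tokenize(text):
    # Match either a run of word characters (a word/number) or any
    # single non-space, non-word character (so punctuation becomes
    # its own token).
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Tokenization is a very common task in NLP."))
# ['Tokenization', 'is', 'a', 'very', 'common', 'task', 'in', 'NLP', '.']
```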

Trained Corpus

Other Available Corpus

Examples:

Input:  We need small heroes so that big heroes can shine
Output: ['[CLS]', 'we', 'need', 'small', 'heroes', 'so', 'that', 'big', 'heroes', 'can', 'shine', '[SEP]']

Input:  Tokenization is a very common task in NLP
Output: ['[CLS]', 'token', '##ization', 'is', 'a', 'very', 'common', 'task', '[SEP]']
