NLTK
- pip install nltk
Flask
- pip install Flask
Regex
- the `re` module ships with Python's standard library (no install needed); the third-party package is installed with pip install regex
Tokenization is a very common task in NLP. It is essentially the task of chopping a character sequence into pieces, called tokens, while throwing away certain characters, such as punctuation.
Tokens may be words, numbers, or punctuation marks. Tokenization does this by locating word boundaries: the point where one word ends and the next begins. These tokens are useful for finding patterns and are also considered a base step for stemming and lemmatization. A minimal NLTK sketch follows.
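As a quick illustration, here is a minimal sketch of word tokenization with NLTK (listed in the installation section above); the sample sentence is only an example.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time download of the Punkt tokenizer models

text = "Tokenization is a very common task in NLP."
tokens = word_tokenize(text)  # splits on word boundaries, keeping punctuation as tokens
print(tokens)
# ['Tokenization', 'is', 'a', 'very', 'common', 'task', 'in', 'NLP', '.']
```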
- [Bert Large Uncased Vocabulary](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased-vocab.txt)
- [Bert Large Cased Vocabulary](https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt)
- [Bert Base Uncased Vocabulary](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt)
- [Bert Base Cased Vocabulary](https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt)
| Input | Output |
|---|---|
| We need small heroes so that big heroes can shine | ['[CLS]', 'we', 'need', 'small', 'heroes', 'so', 'that', 'big', 'heroes', 'can', 'shine', '[SEP]'] |
| Tokenization is a very common task in NLP | ['[CLS]', 'token', '##ization', 'is', 'a', 'very', 'common', 'task', 'in', 'nl', '##p', '[SEP]'] |
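The rows above can be reproduced with the BERT WordPiece tokenizer. The sketch below assumes the Hugging Face `transformers` package, which is not in the installation list above (`pip install transformers`), and uses the bert-base-uncased vocabulary linked earlier.

```python
# A minimal sketch, assuming `transformers` is installed (pip install transformers).
from transformers import BertTokenizer

# Loads the bert-base-uncased WordPiece vocabulary (the same file linked above).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "We need small heroes so that big heroes can shine"
ids = tokenizer.encode(text)  # encode() wraps the sentence in [CLS] ... [SEP]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'we', 'need', 'small', 'heroes', 'so', 'that', 'big',
#  'heroes', 'can', 'shine', '[SEP]']
```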