Tokenization

Notes:

Tokenization

a book about information retrieval

[a, book, about, information, retrieval]

"Information Retrieval": Also available as e-book!

[Information Retrieval]? [Information, Retrieval]?

[e-book]? [eBook]? [e, book]?
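
The ambiguity above can be made concrete with a small sketch. The three strategies below are illustrative assumptions, not the tokenizers Elasticsearch ships with; they only show that the same input produces different token lists depending on the splitting rule.

```python
import re

text = '"Information Retrieval": Also available as e-book!'

# Strategy 1: split on whitespace only; punctuation stays attached to tokens.
whitespace_tokens = text.split()

# Strategy 2: keep runs of word characters; quotes and hyphens act as separators.
word_tokens = re.findall(r"\w+", text)

# Strategy 3: like strategy 2, but allow hyphens inside a token.
hyphen_tokens = re.findall(r"\w+(?:-\w+)*", text)

for name, tokens in [
    ("whitespace", whitespace_tokens),
    ("word", word_tokens),
    ("hyphen-aware", hyphen_tokens),
]:
    print(f"{name:>12}: {tokens}")
```

Strategy 2 splits e-book into two tokens, strategy 3 keeps it whole, and strategy 1 leaves the exclamation mark attached. None of them is right in general; the choice depends on what users are expected to type.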

Notes:

Challenge

Inverted index can only find exact tokens

Term          Doc IDs
book          #1, #2, #3
information   #1, #2, #3
retrieval     #1
search        #2

Searching for e-book returns no results!
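
A minimal sketch of why the lookup fails, assuming a plain dictionary as the inverted index. The three documents are hypothetical and were chosen only so that the resulting postings match the table above.

```python
from collections import defaultdict

# Hypothetical documents that reproduce the postings in the table.
docs = {
    1: "information retrieval book",
    2: "book search information",
    3: "information book",
}

def build_index(docs):
    """Map each whitespace-separated token to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[token].add(doc_id)
    return index

def search(index, query_term):
    """Exact lookup: the query term must equal an indexed token."""
    return sorted(index.get(query_term, set()))

index = build_index(docs)

print(search(index, "book"))    # exact match on an indexed token
print(search(index, "e-book"))  # no such token in the index
print(search(index, "books"))   # plural form is a different string
```

The lookup is a plain string comparison, so e-book and books find nothing even though every document contains book.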

Try in Elasticsearch.

Notes:

Challenge

  • How can books find book?
  • wi-fi → wifi?
  • Jack's → Jack?
  • MMT → Multimediatechnology?
  • U.S.A. → USA?
  • running → run?

Notes: What are other examples?

Text analysis

  • Analyze docs and query
  • Add, remove, change terms

Try improved tokenization in Elasticsearch.
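
A toy analyzer, sketched under loud assumptions: the stopword list, the synonym table, and the plural rule are all hypothetical and far cruder than a real analysis chain. The point is only that the same function runs over documents and queries, and that it can change, remove, and add terms.

```python
import re

STOPWORDS = {"a", "the", "as"}                 # hypothetical stopword list
SYNONYMS = {"mmt": ["multimediatechnology"]}   # hypothetical synonym table

def analyze(text):
    """Turn raw text into index terms; use it for documents and queries alike."""
    terms = []
    for raw in text.split():
        token = raw.strip("\"'.,:;!?").lower()   # change: trim punctuation, fold case
        token = re.sub(r"'s$", "", token)        # change: jack's -> jack
        token = re.sub(r"[-.]", "", token)       # change: wi-fi -> wifi, u.s.a -> usa
        if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
            token = token[:-1]                   # change: naive plural stripping
        if not token or token in STOPWORDS:
            continue                             # remove: empty tokens and stopwords
        terms.append(token)
        terms.extend(SYNONYMS.get(token, []))    # add: synonyms of the term
    return terms

print(analyze('"Information Retrieval": Also available as e-book!'))
print(analyze("Jack's books about the U.S.A. and wi-fi"))
print(analyze("MMT"))
```

Because query and document pass through the same steps, books and book end up as the same term. The plural rule is deliberately naive (it would also mangle words like "this"), and mapping running to run would need a real stemmer, which this sketch does not attempt.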

Notes:

Nomenclature

Token
  • Character sequence that forms a meaningful semantic unit
  • Not yet analyzed
  • Examples: the, routers, the
Term
  • Token after analysis, as stored in the index
  • Example: router
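
The distinction can be sketched in a few lines. The stopword set and the plural rule are hypothetical stand-ins for a real analysis step; the token list is the example from the slide.

```python
# Tokens as they come out of the tokenizer: duplicates and stopwords included.
tokens = ["the", "routers", "the"]

STOPWORDS = {"the"}  # hypothetical stopword list

def to_term(token):
    """Hypothetical normalization: lowercase and strip a trailing plural 's'."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]
    return token

# Terms are the distinct, normalized entries that end up in the index.
terms = sorted({to_term(t) for t in tokens if t not in STOPWORDS})

print(tokens)
print(terms)
```

Three tokens collapse into the single term router: both occurrences of the are dropped and routers is normalized.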