Skip to content

Latest commit

 

History

History
129 lines (89 loc) · 4.9 KB

section_inverted_index.md

File metadata and controls

129 lines (89 loc) · 4.9 KB

Inverted index*

* This will be your homework

Notes:

  • Same information as Term-document Matrix, but
  • Do not store 0s
  • Store list of matching documents per term

Notes:

  • #1: a book about information retrieval
  • #2: a book about the search for information
  • #3: a book about information

Term Doc IDs
Book #1, #2, #3
Information #1, #2, #3
Retrieval #1
Search #2

Notes: Audience question

Nomenclature

Vocabulary / Dictionary / Lexicon
List of terms
Posting
Document which term occurs in
Postings list
All documents which term occurs in
Postings
All postings lists

Notes:

Index time

  1. Extract terms from docs
  2. Sort vocabulary alphabetically
  3. Sort postings lists by document ID

Notes:

Query time

  1. Look up each query term in vocabulary
  2. Retrieve postings
  3. Intersect / Union

Only exact terms can be found!

Notes: Why can only exact terms be found?

Term Doc IDs
Book #1, #2, #3
Information #1, #2, #3
Retrieval #1
Search #2

­Query: Information AND Search

(#1, #2, #3) AND (#2)

→ #2

Notes: Audience question

Time complexity

$$O(\text{num query terms} \times \text{num distinct terms})$$

Example

  • ­English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
  • ­Term-Document Matrix: 2 query terms × 1.2M distinct terms = 2.4M lookups
  • ­Inverted Index: 2 query terms × 1.2M distinct terms = 2.4M lookups

Size

$$O(\text{num distinct terms} \times \text{average postings list length} \times \text{bytes per document ID})$$

Example

  • ­English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
  • ­Term-Document Matrix: 1.2M distinct terms × 6M articles = 7.200.000.000.000 cells = 7.2 terrabit = 900 gigabyte
  • ­Inverted Index: 1.2M distinct terms × 10K postings per terms × 4 bytes per document ID = 48.000.000.000 bytes = 48 gigabyte