Inverted index*

* This will be your homework

Notes:

Same information as Term-document Matrix, but
Do not store 0s
Store list of matching documents per term

Notes:

#1: a book about information retrieval
#2: a book about the search for information
#3: a book about information

↓

Term	Doc IDs
Book	#1, #2, #3
Information	#1, #2, #3
Retrieval	#1
Search	#2

Notes: Audience question

Nomenclature

Vocabulary / Dictionary / Lexicon: List of terms
Posting: Document which term occurs in
Postings list: All documents which term occurs in
Postings: All postings lists

Notes:

Index time

Extract terms from docs
Sort vocabulary alphabetically
Sort postings lists by document ID

Notes:

Query time

Look up each query term in vocabulary
Retrieve postings
Intersect / Union

Only exact terms can be found!

Notes: Why can only exact terms be found?

Term	Doc IDs
Book	#1, #2, #3
Information	#1, #2, #3
Retrieval	#1
Search	#2

↓

Query: Information AND Search

(#1, #2, #3) AND (#2)

→ #2

Notes: Audience question

Time complexity

$$O(\text{num query terms} \times \text{num distinct terms})$$

Example

English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
Term-Document Matrix: 2 query terms × 1.2M distinct terms = 2.4M lookups
Inverted Index: 2 query terms × 1.2M distinct terms = 2.4M lookups

Size

$$O(\text{num distinct terms} \times \text{average postings list length} \times \text{bytes per document ID})$$

Example

English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
Term-Document Matrix: 1.2M distinct terms × 6M articles = 7.200.000.000.000 cells = 7.2 terrabit = 900 gigabyte
Inverted Index: 1.2M distinct terms × 10K postings per terms × 4 bytes per document ID = 48.000.000.000 bytes = 48 gigabyte

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

section_inverted_index.md

section_inverted_index.md

Inverted index*

Notes:

Notes:

Notes: Audience question

Nomenclature

Notes:

Index time

Notes:

Query time

Notes: Why can only exact terms be found?

Notes: Audience question

Time complexity

Example

Size

Example

Files

section_inverted_index.md

Latest commit

History

section_inverted_index.md

File metadata and controls

Inverted index*

Notes:

Notes:

Notes: Audience question

Nomenclature

Notes:

Index time

Notes:

Query time

Notes: Why can only exact terms be found?

Notes: Audience question

Time complexity

Example

Size

Example