* This will be your homework
- Same information as Term-document Matrix, but
- Do not store 0s
- Store list of matching documents per term
- #1: a book about information retrieval
- #2: a book about the search for information
- #3: a book about information
↓
Term | Doc IDs |
---|---|
Book | #1, #2, #3 |
Information | #1, #2, #3 |
Retrieval | #1 |
Search | #2 |
- Vocabulary / Dictionary / Lexicon
- List of terms
- Posting
- Document which term occurs in
- Postings list
- All documents which term occurs in
- Postings
- All postings lists
- Extract terms from docs
- Sort vocabulary alphabetically
- Sort postings lists by document ID
- Look up each query term in vocabulary
- Retrieve postings
- Intersect / Union
Only exact terms can be found!
Term | Doc IDs |
---|---|
Book | #1, #2, #3 |
Information | #1, #2, #3 |
Retrieval | #1 |
Search | #2 |
↓
Query: Information AND Search
(#1, #2, #3) AND (#2)
→ #2
- English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
- Term-Document Matrix: 2 query terms × 1.2M distinct terms = 2.4M lookups
- Inverted Index: 2 query terms × 1.2M distinct terms = 2.4M lookups
- English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
- Term-Document Matrix: 1.2M distinct terms × 6M articles = 7.200.000.000.000 cells = 7.2 terrabit = 900 gigabyte
- Inverted Index: 1.2M distinct terms × 10K postings per terms × 4 bytes per document ID = 48.000.000.000 bytes = 48 gigabyte