- Calculate result lists for every query term:
grep
- Apply boolean operators: Intersect, Union
- Query: book
- Document: a book about information retrieval
Compare query with document, from first to last character
- a␣book about information retrieval
- a␣book about information retrieval
- a book about information retrieval → Match
unzip files/simple-wikipedia.zip
cd corpus
# Search for "information" in a single document:
grep "information" Information.txt
# Recursively search for "information":
grep --recursive "information" .
# Measure the time it takes and discard the output:
time grep --recursive "information" . >/dev/null
# Count number of matching files
grep --files-with-matches "information" . | wc -l
Notes: How often is "information" found in Information.txt when searching with a text editor? Why the discrepancy?
(information AND retrieval) OR search
↓
[information, retrieval, search]
- a book about information retrieval
- a book about the search for information
- a book about retrieving information
Query term: information → [#1, #2, #3]
- #1 a book about information retrieval
- #2 a book about the search for information
- #3 a book about retrieving information
Query term: retrieval → [#1]
- #1 a book about information retrieval
- #2 a book about the search for information
- #3 a book about retrieving information
Query term: search → [#2]
- #1 a book about information retrieval
- #2 a book about the search for information
- #3 a book about retrieving information
-
#1: a book about information retrieval
-
#2: a book about the search for information
-
#3: a book about retrieving information
(information AND retrieval) OR search
- Big Ο notation
- Describe time or memory complexity of algorithms and data structures
- Which inputs influence their runtime and memory requirements?
- Look at worst case
How many loops do you need?
Big Ο | Name | Example | Explanation |
---|---|---|---|
Ο(1) | Constant | Odd or even number? | No loop |
Ο(n) | Linear | Calculate array sum | Iterate over all values |
Ο(n²) | Quadratic | Find duplicates in unsorted array | Compare every value against every other value |
Search every query term as a string in every document:
Can take reaaally long
- English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
- grep: 2 query terms × 12GB = 24 billion string comparisons
Notes: How can this be improved?
- Does books match book?
- Cannot deal with singular / plural, books does not match book.
- go vs went
- running vs run
- go vs gong
Not user friendly!
Notes: More examples?