Simple Boolean Retrieval Implementation

Calculate result lists for every query term: grep
Apply boolean operators: Intersect, Union

Grep

Query: book
Document: a book about information retrieval

Compare query with document, from first to last character

a␣book about information retrieval
a␣book about information retrieval
a book about information retrieval → Match

Notes: What are some issues with grep?

Try grep

unzip files/simple-wikipedia.zip
cd corpus

# Search for "information" in a single document:
grep "information" Information.txt

# Recursively search for "information":
grep --recursive "information" .

# Measure the time it takes and discard the output:
time grep --recursive "information" . >/dev/null

# Count number of matching files
grep --files-with-matches "information" . | wc -l

Notes: How often is "information" found in Information.txt when searching with a text editor? Why the discrepancy?

Calculate result lists

Generate a list of terms from the query

(information AND retrieval) OR search

↓

[information, retrieval, search]

Grep every term in every document

a book about information retrieval
a book about the search for information
a book about retrieving information

Notes:

Generate result lists

Query term: information → [#1, #2, #3]

#1 a book about information retrieval
#2 a book about the search for information
#3 a book about retrieving information

Query term: retrieval → [#1]

#1 a book about information retrieval
#2 a book about the search for information
#3 a book about retrieving information

Query term: search → [#2]

#1 a book about information retrieval
#2 a book about the search for information
#3 a book about retrieving information

Notes: Audience question

Intersect / Union

#1: a book about information retrieval
#2: a book about the search for information
#3: a book about retrieving information

(information AND retrieval) OR search

Notes: Audience question

Complexity

Big Ο notation
Describe time or memory complexity of algorithms and data structures
Which inputs influence their runtime and memory requirements?
Look at worst case

Time complexity examples

How many loops do you need?

Big Ο	Name	Example	Explanation
Ο(1)	Constant	Odd or even number?	No loop
Ο(n)	Linear	Calculate array sum	Iterate over all values
Ο(n²)	Quadratic	Find duplicates in unsorted array	Compare every value against every other value

Grep Complexity

Search every query term as a string in every document:

$$O(\text{num query terms} \times \text{total length of all documents})$$

Can take reaaally long

Notes: Audience question

Grep complexity example

English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
grep: 2 query terms × 12GB = 24 billion string comparisons

Notes: How can this be improved?

Grep disadvantages

Does books match book?
- Cannot deal with singular / plural, books does not match book.
go vs went
running vs run
go vs gong

Not user friendly!

Notes: More examples?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

section_grep.md

section_grep.md

Simple Boolean Retrieval Implementation

Grep

Notes: What are some issues with grep?

Try grep

Notes: How often is "information" found in Information.txt when searching with a text editor? Why the discrepancy?

Calculate result lists

Generate a list of terms from the query

Grep every term in every document

Notes:

Generate result lists

Notes: Audience question

Intersect / Union

Notes: Audience question

Complexity

Time complexity examples

Grep Complexity

Notes: Audience question

Grep complexity example

Grep disadvantages

Files

section_grep.md

Latest commit

History

section_grep.md

File metadata and controls

Simple Boolean Retrieval Implementation

Grep

Notes: What are some issues with grep?

Try grep

Notes: How often is "information" found in Information.txt when searching with a text editor? Why the discrepancy?

Calculate result lists

Generate a list of terms from the query

Grep every term in every document

Notes:

Generate result lists

Notes: Audience question

Intersect / Union

Notes: Audience question

Complexity

Time complexity examples

Grep Complexity

Notes: Audience question

Grep complexity example

Grep disadvantages