Skip to content

Latest commit

 

History

History
177 lines (116 loc) · 5.22 KB

section_grep.md

File metadata and controls

177 lines (116 loc) · 5.22 KB

Simple Boolean Retrieval Implementation

  1. Calculate result lists for every query term: grep
  2. Apply boolean operators: Intersect, Union

Grep

  • Query: book
  • Document: a book about information retrieval

Compare query with document, from first to last character

  1. a␣book about information retrieval
  2. a␣book about information retrieval
  3. a book about information retrieval → Match

Notes: What are some issues with grep?

Try grep

unzip files/simple-wikipedia.zip
cd corpus

# Search for "information" in a single document:
grep "information" Information.txt

# Recursively search for "information":
grep --recursive "information" .

# Measure the time it takes and discard the output:
time grep --recursive "information" . >/dev/null

# Count number of matching files
grep --files-with-matches "information" . | wc -l

Notes: How often is "information" found in Information.txt when searching with a text editor? Why the discrepancy?

Calculate result lists

Generate a list of terms from the query

(information AND retrieval) OR search

[information, retrieval, search]

Grep every term in every document

  1. a book about information retrieval
  2. a book about the search for information
  3. a book about retrieving information

Notes:

Generate result lists

Query term: information[#1, #2, #3]

  • #1 a book about information retrieval
  • #2 a book about the search for information
  • #3 a book about retrieving information

Query term: retrieval[#1]

  • #1 a book about information retrieval
  • #2 a book about the search for information
  • #3 a book about retrieving information

Query term: search[#2]

  • #1 a book about information retrieval
  • #2 a book about the search for information
  • #3 a book about retrieving information

Notes: Audience question

Intersect / Union

  • ­ #1: a book about information retrieval

  • ­ #2: a book about the search for information

  • ­ #3: a book about retrieving information

(information AND retrieval) OR search

...

Notes: Audience question

Complexity

  • Big Ο notation
  • Describe time or memory complexity of algorithms and data structures
  • Which inputs influence their runtime and memory requirements?
  • Look at worst case

Time complexity examples

How many loops do you need?

Big Ο Name Example Explanation
Ο(1) Constant Odd or even number? No loop
Ο(n) Linear Calculate array sum Iterate over all values
Ο(n²) Quadratic Find duplicates in unsorted array Compare every value against every other value

Grep Complexity

Search every query term as a string in every document:

$$O(\text{num query terms} \times \text{total length of all documents})$$

Can take reaaally long

Notes: Audience question

Grep complexity example

  • English Wikipedia: 6M articles, 12B characters, 1.2M distinct terms
  • grep: 2 query terms × 12GB = 24 billion string comparisons

Notes: How can this be improved?


Grep disadvantages

  • ­ Does books match book?
    • ­Cannot deal with singular / plural, books does not match book.
  • ­ go vs went
  • ­ running vs run
  • ­ go vs gong

Not user friendly!

Notes: More examples?