Multimedia Information Retrieval project A.Y. 2023-2024 (UniPi - AIDE)

francescogrillea/SearchEngine_MIRCVProject

Overview

This project develops a search engine from scratch. It consists of two main stages: building an inverted index from a collection of text documents (the MSMARCO Passages collection), and processing queries over that index.
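To make the two stages concrete, here is a minimal in-memory sketch of an inverted index. It is not the project's implementation: the class `TinyInvertedIndex`, its `Posting` type, and the whitespace-style tokenizer are hypothetical names and choices made only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not taken from the repository.
final class TinyInvertedIndex {
    /** One posting: a document ID and the term's frequency in that document. */
    static final class Posting {
        final int docId;
        final int termFrequency;
        Posting(int docId, int termFrequency) {
            this.docId = docId;
            this.termFrequency = termFrequency;
        }
    }

    // Lexicon: term -> posting list. Lists stay sorted by docId because
    // documents are assigned increasing IDs as they are added.
    private final Map<String, List<Posting>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Offline stage: tokenize one document and append its postings. */
    int addDocument(String text) {
        int docId = nextDocId++;
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                    .add(new Posting(docId, e.getValue()));
        }
        return docId;
    }

    /** Online stage: fetch the posting list of a query term (empty if unknown). */
    List<Posting> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }
}
```

A real index for a collection of this size would be built in blocks on disk and merged afterwards, which is presumably what the `data/intermediate_postings/` directories are for.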

Environment Setup

sudo apt install maven 

Install SearchEngine

mvn compile

Setup Environment

mkdir data/intermediate_postings/
mkdir data/intermediate_postings/index/
mkdir data/intermediate_postings/lexicon/
mkdir data/intermediate_postings/doc_index/

Cleanup Environment

cd data
chmod +x cleanup.sh
./cleanup.sh
cd ..

Run

Build Index

 mvn -e exec:java -Dexec.mainClass="org.offline_phase.MainClass"  -Dexec.args="-p -c"
[-p] apply stemming and stopword removal 
[-c] index compression
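The README does not say which compression scheme `-c` enables. As one plausible illustration, the sketch below applies variable-byte encoding to docID gaps, a common choice for posting lists. The class `VByteSketch` and its byte layout (high bit marks the last byte of a number) are assumptions, not the project's format.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the project's actual codec may differ.
final class VByteSketch {
    /** Encodes ascending, non-negative docIDs as gaps, 7 payload bits per byte. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous;
            previous = docId;
            while (gap >= 128) {
                out.write(gap & 0x7F); // low 7 bits, more bytes follow
                gap >>>= 7;
            }
            out.write(gap | 0x80);     // last byte of this gap: high bit set
        }
        return out.toByteArray();
    }

    /** Decodes the byte stream back into the original docIDs. */
    static int[] decode(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int value = 0;
        int shift = 0;
        int previous = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) != 0) {     // terminator: the gap is complete
                previous += value;
                ids.add(previous);
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return ids.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

Small gaps fit in a single byte instead of four, which is the kind of saving that shrinks an index on disk.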

Command Line Interface

 mvn -e exec:java -Dexec.mainClass="org.online_phase.MainClass"  -Dexec.args="-p -c -k=20 -s=bm25"
[-p] apply stemming and stopword removal
[-c] index compression
[-k=20] retrieve the top 20 documents
[-s=bm25] use the BM25 scoring function (otherwise TFIDF is applied)
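To show what the `-s` flag chooses between, here is a sketch of the two scoring functions in their common textbook form. The README does not give the exact formulas or parameters, so the log base, the constants `K1` and `B`, and the class name `ScoringSketch` are all assumptions.

```java
// Hypothetical illustration only; formulas and constants are textbook defaults,
// not values confirmed by the project.
final class ScoringSketch {
    static final double K1 = 1.5;  // assumed term-frequency saturation parameter
    static final double B = 0.75;  // assumed document-length normalization parameter

    /** Inverse document frequency: log10(N / df). */
    static double idf(long numDocs, long docFreq) {
        return Math.log10((double) numDocs / docFreq);
    }

    /** TF-IDF contribution of one term to one document. */
    static double tfidf(int tf, long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0;
        }
        return (1.0 + Math.log10(tf)) * idf(numDocs, docFreq);
    }

    /** BM25 contribution of one term to one document. */
    static double bm25(int tf, int docLen, double avgDocLen, long numDocs, long docFreq) {
        if (tf <= 0) {
            return 0.0;
        }
        // Longer-than-average documents get a larger normalizer, hence a lower score.
        double norm = K1 * ((1.0 - B) + B * docLen / avgDocLen);
        return tf / (norm + tf) * idf(numDocs, docFreq);
    }
}
```

A document's score for a query is the sum of these per-term contributions over the query terms it contains. Unlike TF-IDF, BM25 takes document length into account.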

Evaluation using TrecEval

 mvn -e exec:java -Dexec.mainClass="org.evaluation.MainClass"  -Dexec.args="-p -c -k=20 -s=bm25 -mode=d"
[-p] apply stemming and stopword removal
[-c] index compression
[-k=20] retrieve the top 20 documents
[-s=bm25] use the BM25 scoring function (otherwise TFIDF is applied)
[-mode=c] use DAAT in conjunctive mode
[-mode=d] use DAAT in disjunctive mode (if no mode is specified, MaxScore is used)
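The difference between the two DAAT (document-at-a-time) modes can be sketched as follows. This is a simplified illustration that only returns matching docIDs from in-memory posting lists. The class `DaatSketch` and its `match` method are hypothetical, and scoring and top-k selection are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the project's query processor.
final class DaatSketch {
    /**
     * Walks all posting lists in parallel, one docID at a time.
     * Each inner array holds ascending docIDs for one query term.
     * conjunctive = true keeps documents containing every term (AND);
     * conjunctive = false keeps documents containing at least one term (OR).
     */
    static List<Integer> match(int[][] lists, boolean conjunctive) {
        int[] cursor = new int[lists.length];
        List<Integer> result = new ArrayList<>();
        while (true) {
            // Smallest docID currently under any cursor.
            int current = Integer.MAX_VALUE;
            boolean anyExhausted = false;
            for (int i = 0; i < lists.length; i++) {
                if (cursor[i] < lists[i].length) {
                    current = Math.min(current, lists[i][cursor[i]]);
                } else {
                    anyExhausted = true;
                }
            }
            if (current == Integer.MAX_VALUE) {
                break; // every list is exhausted
            }
            if (conjunctive && anyExhausted) {
                break; // no further document can contain all terms
            }
            // Count the lists positioned on `current` and advance them.
            int hits = 0;
            for (int i = 0; i < lists.length; i++) {
                if (cursor[i] < lists[i].length && lists[i][cursor[i]] == current) {
                    hits++;
                    cursor[i]++;
                }
            }
            if (!conjunctive || hits == lists.length) {
                result.add(current);
            }
        }
        return result;
    }
}
```

For the lists `{1,3,5,7}`, `{3,4,7}` and `{2,3,7,9}`, conjunctive mode yields documents 3 and 7, while disjunctive mode yields every docID that appears in any list. MaxScore returns the same top-k results as disjunctive DAAT but skips documents that cannot enter the top k, which is consistent with the identical effectiveness figures and lower times reported under Query Execution.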

Performance

All experiments and benchmarks reported below were run on the same machine (MSI Prestige 14 Evo A11M) to keep the test environment consistent.

  • OS: Microsoft Windows 11 Home 64 bit Ver.2009(OS build 22000.675)
  • CPU: 11th Gen Intel(R) Core(TM) i7-1185G7 @ 3.00GHz
  • Memory: 16 GB @ 2133 MHz, 8 × 2048 MB, LPDDR4-4267
  • Graphics: Intel(R) Iris(R) Xe Graphics, 1024 MB
  • Disk: SSD, SAMSUNG MZVL2512HCJQ-00B00, 476.94 GB

Index Construction

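| flags | # Terms | Index (MB) | Lexicon (MB) | DocIndex (MB) | Time Elapsed |
|-------|-----------|------|------|-----|---------------------|
| none  | 1,369,123 | 2690 | 54.8 | 101 | 00:15:09 $\pm$ 01:30 |
| -c    | 1,369,123 | 1340 | 54.8 | 101 | 00:17:38 $\pm$ 02:22 |
| -p    | 1,170,498 | 1430 | 46.3 | 101 | 00:06:30 $\pm$ 01:34 |
| -p -c | 1,170,498 | 738  | 46.3 | 101 | 00:07:35 $\pm$ 01:16 |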

Query Execution

| mode | score | MAP | P@20 | NDCG | Time Elapsed (s) |
|------------|-------|-------|-------|-------|-------------------|
| DAAT Conj. | TFIDF | 0.139 | 0.329 | 0.262 | 0.023 $\pm$ 0.020 |
| DAAT Conj. | BM25  | 0.141 | 0.351 | 0.266 | 0.021 $\pm$ 0.017 |
| DAAT Disj. | TFIDF | 0.132 | 0.367 | 0.257 | 0.034 $\pm$ 0.030 |
| DAAT Disj. | BM25  | 0.182 | 0.460 | 0.323 | 0.035 $\pm$ 0.030 |
| MaxScore   | TFIDF | 0.132 | 0.367 | 0.257 | 0.028 $\pm$ 0.026 |
| MaxScore   | BM25  | 0.182 | 0.460 | 0.323 | 0.026 $\pm$ 0.024 |

For more information about the results and performance, please take a look at the final report.
