A lightweight, POC, vector-based search engine implementation with Porter stemming algorithm for improved text preprocessing and search accuracy. This simple project implements the idea from Basic Vector Space Search Engine Theory.
- Vector Search: Implements cosine similarity-based document ranking
- Porter Stemming: Reduces words to their root forms to improve matching (e.g., "running" → "run")
- Common Crawl Integration: Fetches real web documents from Common Crawl archive
- HTML Text Extraction: Strips HTML tags and extracts clean text content
- Modular Design: Separate classes for different functionalities
- Flexible API: Can be used with or without stemming
py-search-engine/
├── src/
│ ├── __init__.py
│ ├── algorithms/
│ │ ├── __init__.py
│ │ └── porter_stemming.py # Porter stemming algorithm
│ ├── crawl/
│ │ ├── __init__.py
│ │ └── common_crawl.py # Common Crawl client and HTML processing
│ └── search/
│ ├── __init__.py
│ └── vector_search.py # Vector search and search engine
├── example.py # Basic usage examples
├── example_crawl.py # Common Crawl integration example
├── hello.py # Simple hello world
└── README.md
- Clone the repository:
git clone <repository-url>
cd py-search-engine
- Ensure Python 3.x is installed:
python --version
- Install required dependencies:
pip install requests
from src.search import SearchEngine
# Create a search engine with stemming (default)
engine = SearchEngine(use_stemming=True)
# Add documents
documents = {
0: "Python is a great programming language",
1: "Machine learning algorithms require good data",
2: "Data science involves programming skills"
}
engine.add_documents(documents)
# Search
results = engine.search("programming")
for score, doc_id, content in results:
print(f"Score: {score:.4f} - {content}")
# Create engine without stemming
engine = SearchEngine(use_stemming=False)
Basic example with static data:
python example.py
Common Crawl example with real web documents:
python example_crawl.py
The basic example demonstrates the difference between search results with and without Porter stemming. The Common Crawl example fetches real web documents and shows how to search through them.
- Document Processing: Each document is converted to a concordance (word frequency map)
- Vector Representation: Documents and queries are represented as vectors in word space
- Similarity Calculation: Uses cosine similarity to rank document relevance
- Ranking: Results are sorted by relevance score (higher = more relevant)
The Porter stemming algorithm reduces inflected words to their root form through a series of rule-based transformations:
- Step 1: Handle plurals and past participles (e.g., "cats" → "cat", "running" → "run")
- Step 2: Handle double consonants and ly-endings
- Step 3: Handle various suffixes
- Step 4: Remove suffixes in longer words
- Step 5: Clean up remaining suffixes
- Improved Recall: Matches variations of words (e.g., "run", "running", "runs")
- Reduced Index Size: Fewer unique terms in the index
- Better Semantic Matching: Focuses on word roots rather than exact forms
Core vector operations for document similarity.
Methods:
magnitude(concordance)
: Calculate vector magnituderelation(concordance1, concordance2)
: Calculate cosine similarityconcordance(document, use_stemming, stemmer)
: Create word frequency map
High-level interface for document management and search.
Methods:
add_document(doc_id, content)
: Add single documentadd_documents(documents_dict)
: Add multiple documentsadd_crawl_documents(crawl_documents)
: Add documents from CommonCrawlClientsearch(query, max_results)
: Search and return ranked results
Implementation of the Porter stemming algorithm.
Methods:
stem(word)
: Reduce word to its root form
Fetches and processes web documents from Common Crawl archive.
Methods:
search_urls(domain, limit)
: Search for URLs from a specific domainfetch_document(filename, offset, length)
: Fetch a specific documentextract_text_from_html(html_content)
: Extract clean text from HTMLget_documents(domains, max_docs_per_domain)
: Fetch documents from multiple domainsget_sample_documents()
: Get sample documents for testing
Using Porter stemming provides several computational advantages:
- Reduced Vocabulary: Fewer unique terms to process
- Better Matching: Related word forms are treated as equivalent
- Smaller Index: Less memory usage for large document collections
- Improved Precision: More focused search results
Query: 'programming languages'
----------------------------------------
With Porter Stemming:
Score: 0.2357 - Doc 0: Python is a great programming language for machine learning applications
Score: 0.1543 - Doc 5: Learning new programming languages improves your coding skills
Without Stemming:
Score: 0.0000 - (no matches found)
Stemming found 2 results vs 0 without stemming