MFA press statements scraped from official websites, separated by country and date. Cleaned, processed, and ingested into Elasticsearch for easy searching and analysis.
This project is part of PopFigExpert.
- Scrape and clean press statements from the websites

  The `scrape()` function handles both the scraping and the cleaning, including selecting the correct elements and normalizing the irregular spacing of the text.
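The scrape-and-clean step can be sketched as below. This is a minimal illustration, not the project's actual implementation: the whitespace normalization matches the cleaning described above, but element selection is site-specific and omitted, and the use of `requests` reflects the dependency list.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse the irregular spacing found in scraped pages."""
    # Replace runs of spaces, tabs, and newlines with a single space.
    return re.sub(r"\s+", " ", raw).strip()


def scrape(url: str) -> str:
    """Fetch a press-statement page and return its cleaned text.

    Selecting the statement body out of the page HTML is site-specific
    and omitted here; this sketch cleans the whole response.
    """
    import requests  # listed in the notebook dependencies

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return clean_text(response.text)
```

The cleaning helper is kept separate from the network call so it can be reused on text from any source.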
- Chunk the press statements into smaller chunks

  Each document ranges from roughly 100 to 5000+ characters. The chunks need to be equalized in size because they will be fed into our RAG system.
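The project splits text with LangChain & Tiktoken; as a dependency-free sketch of the underlying idea, a character-based splitter with overlap looks like this (the chunk size and overlap values are illustrative, not taken from the project):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into roughly equal-sized chunks with some overlap.

    Overlap keeps context that straddles a chunk boundary retrievable
    from either side, which helps downstream RAG retrieval.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

A token-based splitter (as with Tiktoken) works the same way but counts tokens instead of characters, which matches LLM context limits more precisely.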
- Store the chunks in Elasticsearch

  Bulk-store the documents into Elasticsearch.
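Bulk storage typically goes through the Elasticsearch Python client's `helpers.bulk` API. A hedged sketch, with the action-building kept separate so it can be inspected on its own (the index name `press_statements` is an assumption, not taken from the project):

```python
from typing import Iterable, Iterator


def build_actions(docs: Iterable[dict], index: str = "press_statements") -> Iterator[dict]:
    """Yield one bulk action per document for helpers.bulk.

    The index name "press_statements" is a placeholder assumption.
    """
    for doc in docs:
        yield {"_index": index, "_source": doc}


def bulk_ingest(client, docs: Iterable[dict], index: str = "press_statements") -> int:
    """Bulk-store documents; returns the count successfully indexed."""
    from elasticsearch import helpers  # requires the elasticsearch package

    success, _errors = helpers.bulk(client, build_actions(docs, index))
    return success
```

Streaming actions from a generator avoids materializing all documents in memory, which matters once the corpus grows past a few thousand statements.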
- Extra: experimenting with retrieval

  Trying out methods to abstract the data retrieval step.
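One way to abstract retrieval is a small query builder that hides the Elasticsearch query DSL behind plain arguments. A sketch under assumptions: the field names follow the mapping below, and the choice of a `match` query with an optional country filter is illustrative rather than the project's actual approach.

```python
from typing import Optional


def build_search_query(text: str, country: Optional[str] = None, size: int = 10) -> dict:
    """Build an Elasticsearch query body for full-text search over content.

    Field names ("content", "country") follow the index mapping; the
    query shape itself is an illustrative assumption.
    """
    query: dict = {
        "size": size,
        "query": {"bool": {"must": [{"match": {"content": text}}]}},
    }
    if country:
        # "country" is mapped as text, so a match clause (not term) fits.
        query["query"]["bool"]["filter"] = [{"match": {"country": country}}]
    return query
```

The body would be passed to `client.search(index=..., body=query)`; callers never touch the DSL directly.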
```typescript
type ArticleSearchResult = {
  title: string;
  url: string;
  date: string;
  country: string;
  content: string;
};
```
```python
index_mappings = {
    "mappings": {
        "properties": {
            "date": {"type": "date"},
            "title": {"type": "text"},
            "url": {"type": "text"},
            "country": {"type": "text"},
            "content": {"type": "text"},
        }
    }
}
```
The notebook requires the following dependencies:
- Python 3.x
- Jupyter Notebook
- Pandas
- Requests
- LangChain & Tiktoken (to split text into chunks)