MFA-PressData

MFA press statements scraped from official websites, organized by country and date. The statements are cleaned, processed, and ingested into Elasticsearch for easy searching and analysis.

This project is part of PopFigExpert.

Scrape and clean press statements from the websites

The scrape() function handles both the scraping and the cleaning. This includes selecting the correct page elements and normalizing the irregular spacing in the text.
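As a rough illustration of the spacing cleanup, here is a minimal sketch. The helper name clean_text and the specific rules it applies are assumptions for illustration; they are not taken from the project's scrape() function.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize spacing in scraped text (hypothetical helper, not the project's scrape())."""
    # Fold non-breaking spaces and other compatibility characters into plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left on either side of a line break.
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Each site likely needs its own element selection before a step like this; only the whitespace handling is sketched here.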

Chunk the press statements into smaller chunks

Each document can range from 100 to 5000+ characters. The chunks need to be roughly equal in size because they are fed into our RAG system.
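The idea of equal-sized chunks can be sketched in plain Python. Note that this is a character-based stand-in, not the LangChain and Tiktoken splitting the notebook relies on; the function name chunk_text and the default chunk_size and overlap values are assumptions.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most chunk_size characters, overlapping neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    # Each new chunk starts `step` characters after the previous one.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, which tends to help retrieval. A token-based splitter would measure size in tokens rather than characters.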

Store the chunks in Elasticsearch

Bulk store the documents in Elasticsearch.
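A minimal sketch of preparing documents for a bulk request follows. The helper build_bulk_actions, the index name, and the cluster URL are placeholders, not names from the notebook; the action shape is the one accepted by the bulk helper in the official Python client.

```python
from typing import Iterable, Iterator


def build_bulk_actions(articles: Iterable[dict], index_name: str) -> Iterator[dict]:
    """Yield one bulk action per article, in the shape elasticsearch.helpers.bulk accepts."""
    for article in articles:
        yield {"_index": index_name, "_source": article}


# Usage sketch (requires the `elasticsearch` package and a reachable cluster;
# the URL and index name are placeholders):
#
#   from elasticsearch import Elasticsearch, helpers
#   client = Elasticsearch("http://localhost:9200")
#   helpers.bulk(client, build_bulk_actions(chunks, "press_statements"))
```

Using a generator means the full list of actions never has to sit in memory at once.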

Extra: Playing around with retrieval

Trying methods to abstract the data retrieval.
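One possible shape for such an abstraction is sketched below, assuming the field names from the data structure in this README. The function names build_query and to_results are hypothetical and do not come from the notebook.

```python
from typing import Optional


def build_query(text: str, country: Optional[str] = None, size: int = 5) -> dict:
    """Build a search body: full-text match on content, optionally narrowed by country."""
    query = {"bool": {"must": [{"match": {"content": text}}]}}
    if country is not None:
        # `country` is mapped as text, so a match clause is used rather than a term clause.
        query["bool"]["filter"] = [{"match": {"country": country}}]
    return {"query": query, "size": size}


def to_results(response: dict) -> list[dict]:
    """Flatten a search response into a list of article dicts (the _source of each hit)."""
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

The dict returned by build_query is a standard search request body, and to_results expects the usual response layout in which each hit carries the stored document under _source.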

Data Structure

TypeScript

type ArticleSearchResult = {
  title: string;
  url: string;
  date: string;
  country: string;
  content: string;
}

Elasticsearch index mappings

index_mappings = {
  "mappings": {
    "properties": {
      "date": {"type": "date"},
      "title": {"type": "text"},
      "url": {"type": "text"},
      "country": {"type": "text"},
      "content": {"type": "text"},
    }
  }
}

Dependencies

The notebook requires the following dependencies:

Python 3.x, Jupyter Notebook, Pandas, Requests, and Langchain & Tiktoken (to split text into chunks).
