MFA press statements scraped from official websites, separated by country and date. Cleaned, processed, and ingested into Elasticsearch for easy searching and analysis.
This project is part of PopFigExpert.
- Scrape and clean press statements from the websites

  The `scrape()` function handles both the scraping and the cleaning, including selecting the correct elements and normalizing the irregular spacing of the text.
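The scrape-and-clean step can be sketched as below. This is a minimal illustration, not the project's actual implementation: the whitespace normalization matches the cleaning described above, but element selection is site-specific and omitted, and the use of `requests` reflects the dependency list.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse the irregular spacing found in scraped pages."""
    # Replace runs of spaces, tabs, and newlines with a single space.
    return re.sub(r"\s+", " ", raw).strip()


def scrape(url: str) -> str:
    """Fetch a press-statement page and return its cleaned text.

    Selecting the statement body out of the page HTML is site-specific
    and omitted here; this sketch cleans the whole response.
    """
    import requests  # listed in the notebook dependencies

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return clean_text(response.text)
```

The cleaning helper is kept separate from the network call so it can be reused on text from any source.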
- Chunk the press statements into smaller chunks

  Each document ranges from roughly 100 to 5000+ characters. The chunks need to be equalized in size because they will be fed into our RAG system.
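The project splits text with LangChain & Tiktoken; as a dependency-free sketch of the underlying idea, a character-based splitter with overlap looks like this (the chunk size and overlap values are illustrative, not taken from the project):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into roughly equal-sized chunks with some overlap.

    Overlap keeps context that straddles a chunk boundary retrievable
    from either side, which helps downstream RAG retrieval.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

A token-based splitter (as with Tiktoken) works the same way but counts tokens instead of characters, which matches LLM context limits more precisely.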
- Store the chunks in Elasticsearch

  Bulk-store the documents into Elasticsearch.
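Bulk storage typically goes through the Elasticsearch Python client's `helpers.bulk` API. A hedged sketch, with the action-building kept separate so it can be inspected on its own (the index name `press_statements` is an assumption, not taken from the project):

```python
from typing import Iterable, Iterator


def build_actions(docs: Iterable[dict], index: str = "press_statements") -> Iterator[dict]:
    """Yield one bulk action per document for helpers.bulk.

    The index name "press_statements" is a placeholder assumption.
    """
    for doc in docs:
        yield {"_index": index, "_source": doc}


def bulk_ingest(client, docs: Iterable[dict], index: str = "press_statements") -> int:
    """Bulk-store documents; returns the count successfully indexed."""
    from elasticsearch import helpers  # requires the elasticsearch package

    success, _errors = helpers.bulk(client, build_actions(docs, index))
    return success
```

Streaming actions from a generator avoids materializing all documents in memory, which matters once the corpus grows past a few thousand statements.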
- Extra: experimenting with retrieval

  Trying out methods to abstract the data retrieval step.
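One way to abstract retrieval is a small query builder that hides the Elasticsearch query DSL behind plain arguments. A sketch under assumptions: the field names follow the mapping below, and the choice of a `match` query with an optional country filter is illustrative rather than the project's actual approach.

```python
from typing import Optional


def build_search_query(text: str, country: Optional[str] = None, size: int = 10) -> dict:
    """Build an Elasticsearch query body for full-text search over content.

    Field names ("content", "country") follow the index mapping; the
    query shape itself is an illustrative assumption.
    """
    query: dict = {
        "size": size,
        "query": {"bool": {"must": [{"match": {"content": text}}]}},
    }
    if country:
        # "country" is mapped as text, so a match clause (not term) fits.
        query["query"]["bool"]["filter"] = [{"match": {"country": country}}]
    return query
```

The body would be passed to `client.search(index=..., body=query)`; callers never touch the DSL directly.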
```typescript
type ArticleSearchResult = {
  title: string;
  url: string;
  date: string;
  country: string;
  content: string;
};
```
```python
index_mappings = {
    "mappings": {
        "properties": {
            "date": {"type": "date"},
            "title": {"type": "text"},
            "url": {"type": "text"},
            "country": {"type": "text"},
            "content": {"type": "text"},
        }
    }
}
```
The notebook requires the following dependencies:
- Python 3.x
- Jupyter Notebook
- Pandas
- Requests
- LangChain & Tiktoken (to split text into chunks)