Nefariousgh/articleScraper

21BCE7665_ML

Backend for document retrieval which can be used as context for LLMs

API Readme

This Flask API provides endpoints for scraping news articles from Google News. It includes background tasks that periodically scrape news for a given query, and a mechanism that logs API requests and tracks how often each user calls the API.

Setup Instructions

Prerequisites

Python 3.7 or higher
Docker (optional, for containerization)
Flask
SQLite

The API will be available at http://127.0.0.1:5000.

API Endpoints

/health: Checks whether the server is running. If it is, the response reports the status as active.
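
A health endpoint of this kind could be sketched in Flask as follows. This is a minimal illustration rather than the repository's actual code; the JSON shape {"status": "active"} is an assumption based on the description above.

```python
from flask import Flask, jsonify

app = Flask(__name__)


@app.route("/health", methods=["GET"])
def health():
    # Assumed response shape: the README only says the status is shown as active.
    return jsonify({"status": "active"}), 200
```

Calling app.run(host="127.0.0.1", port=5000) would serve this sketch at the address given above.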

/search: Runs a query. The request data takes the parameters text, top_k, threshold, and user_id. Since no frontend is implemented yet, the endpoint can be called from the terminal with one of the following commands. On PowerShell:

                    Invoke-RestMethod -Uri http://127.0.0.1:5000/search -Method Post -Body '{
                    "text": "testing",
                    "top_k": 3,
                    "threshold": 0.8,
                    "user_id": "user123"
                  }' -ContentType "application/json"  

On Linux:

                    curl -X POST http://127.0.0.1:5000/search \
                     -H "Content-Type: application/json" \
                     -d '{
                           "text": "testing",
                           "top_k": 3,
                           "threshold": 0.8,
                           "user_id": "user123"
                         }'  
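
The same request can also be sent from Python using only the standard library. This is a sketch: the helper names build_search_request and search are illustrative, and the response body is assumed to be JSON.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:5000/search"


def build_search_request(text, top_k=3, threshold=0.8, user_id="user123"):
    """Build a POST request carrying the same JSON body as the commands above."""
    payload = {
        "text": text,
        "top_k": top_k,
        "threshold": threshold,
        "user_id": user_id,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text, **kwargs):
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(build_search_request(text, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, search("testing") sends the query shown in the commands above.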

The results will be displayed as: (image omitted)

If the user's API limit is hit, the response is: (image omitted)

The program first checks the articles database to see whether the query has been made before. If it has, the data is returned from the database. If the query is not found, the scraper runs and the new data is added to articles.db, so that repeated queries do not trigger another scrape.
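
The lookup-then-scrape flow described above could look roughly like this. It is a sketch under assumptions: the articles table is assumed to have a query column for matching earlier queries, and scrape_fn is a hypothetical stand-in for the real scraper.

```python
import sqlite3


def get_links(conn, query, scrape_fn):
    """Return cached links for `query`, scraping only on a cache miss."""
    # Assumption: a `query` column exists alongside `id` and `link`.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, query TEXT, link TEXT)"
    )
    rows = conn.execute(
        "SELECT link FROM articles WHERE query = ? ORDER BY id", (query,)
    ).fetchall()
    if rows:
        return [link for (link,) in rows]  # cache hit: no scraping needed

    links = scrape_fn(query)  # cache miss: run the (hypothetical) scraper
    conn.executemany(
        "INSERT INTO articles (query, link) VALUES (?, ?)",
        [(query, link) for link in links],
    )
    conn.commit()
    return links
```

Passing the scraper in as a function keeps the caching logic independent of how the articles are actually fetched.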

Database Structure

articles.db

Table: articles
Columns:
    id: INTEGER PRIMARY KEY AUTOINCREMENT
    link: TEXT
Purpose: Stores news articles associated with queries.

api_requests.db

Table: api_requests
Columns:
    id: INTEGER PRIMARY KEY AUTOINCREMENT
    user_id: TEXT
    query: TEXT
    results: TEXT
    inference_time: REAL
    timestamp: DATETIME DEFAULT CURRENT_TIMESTAMP
Purpose: Logs API requests including query, results, and inference time.
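
As an illustration, the table above could be created and written to with Python's sqlite3 module as sketched here. The function names are hypothetical, and storing results as JSON text is an assumption about how the TEXT column is used.

```python
import json
import sqlite3


def init_api_requests(conn):
    """Create the api_requests table with the columns listed above."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS api_requests (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT,
               query TEXT,
               results TEXT,
               inference_time REAL,
               timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
           )"""
    )


def log_request(conn, user_id, query, results, inference_time):
    """Insert one log row; id and timestamp are filled in by SQLite."""
    # Assumption: results are serialized to JSON text before storage.
    conn.execute(
        "INSERT INTO api_requests (user_id, query, results, inference_time) "
        "VALUES (?, ?, ?, ?)",
        (user_id, query, json.dumps(results), inference_time),
    )
    conn.commit()
```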

user_calls.db

Table: user_calls
Columns:
    user_id: TEXT PRIMARY KEY
    call_frequency: INTEGER
Purpose: Tracks the frequency of API calls per user.
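
A per-user call counter on this table could be sketched as follows. The function names are hypothetical, and MAX_CALLS is an assumed value, since the actual limit is not stated here.

```python
import sqlite3

MAX_CALLS = 5  # assumed limit; the README does not state the real value


def init_user_calls(conn):
    """Create the user_calls table with the columns listed above."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_calls ("
        "user_id TEXT PRIMARY KEY, call_frequency INTEGER)"
    )


def register_call(conn, user_id, max_calls=MAX_CALLS):
    """Count one call for `user_id`; return False once the limit is exceeded."""
    # Create the row on first use, then increment it.
    conn.execute(
        "INSERT OR IGNORE INTO user_calls (user_id, call_frequency) VALUES (?, 0)",
        (user_id,),
    )
    conn.execute(
        "UPDATE user_calls SET call_frequency = call_frequency + 1 WHERE user_id = ?",
        (user_id,),
    )
    conn.commit()
    (count,) = conn.execute(
        "SELECT call_frequency FROM user_calls WHERE user_id = ?", (user_id,)
    ).fetchone()
    return count <= max_calls
```

A request handler could call register_call before running the search and return an error response when it yields False.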

Dockerization

The Docker image can be built with:

docker build -t my-flask-app .

To run the docker container:

docker run -p 5000:5000 my-flask-app

The application can be accessed at http://localhost:5000.
