saiteja-0408/Automated-Extraction-Sorting-Validation-Using-Playwright

📌 Project Overview

This project is an automated web scraper that extracts the latest articles from Hacker News using Playwright.
It checks whether the articles are sorted from newest to oldest, saves them in structured files, and provides multiple execution options.

The script:
✅ Extracts article titles, submission IDs, and timestamps.
✅ Validates sorting order using submission IDs.
✅ Handles pagination dynamically.
✅ Exports data to CSV & JSON for easy analysis.
✅ Supports flexible execution options (--headless, --limit).
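
The sorting check listed above can be sketched as follows. This is a minimal illustration, not the code in index.js: it assumes each article is an object with a numeric `submissionId`, and that a larger ID means a newer submission. The function name `findSortViolation` is hypothetical.

```javascript
// Hypothetical helper: returns the index of the first article that is
// out of order, or -1 when the list runs from newest to oldest.
// Assumption: a larger submissionId means a newer submission.
function findSortViolation(articles) {
  for (let i = 1; i < articles.length; i++) {
    // IDs are unique, so an equal or larger ID than the previous one
    // means the order is broken at position i.
    if (articles[i].submissionId >= articles[i - 1].submissionId) {
      return i;
    }
  }
  return -1;
}

// Sample data taken from the "Expected Output" section of this README.
const sample = [
  { title: 'AI Breakthrough in Healthcare', submissionId: 39543218 },
  { title: 'New JavaScript Framework Released', submissionId: 39543216 },
  { title: 'NASA’s Latest Discovery', submissionId: 39543214 },
];

console.log(findSortViolation(sample)); // -1
```

Returning the index rather than a plain boolean makes it easy to report which article broke the order.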

📌 Project Structure

QA_WOLF_TAKE_HOME
│── node_modules/ # Dependencies (not included in GitHub)
│── .gitignore # Ignore unnecessary files
│── hacker_news_articles_.csv # Extracted articles (CSV format)
│── hacker_news_articles_.json # Extracted articles (JSON format)
│── index.js # Main script
│── package.json # Node.js package file
│── package-lock.json # Package dependencies
│── playwright.config.js # Playwright configuration
│── README.md # Project documentation (this file)

📌 Installation Instructions

1️⃣ Prerequisites
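     Install Node.js (v14+ recommended; newer Playwright releases require a more recent Node.js, so check Playwright's system requirements)
     Install Playwright (if not installed):
     npm install playwright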

2️⃣ Clone the Repository
     git clone https://github.com/YOUR_GITHUB_USERNAME/qa-wolf-take-home.git
     cd qa-wolf-take-home

3️⃣ Install Dependencies
     npm install

📌 How to Run the Script

The script provides multiple execution options for flexibility:

1️⃣ Default Execution (Fetches 100 Articles in Visible Mode)
     node index.js
✔ Runs Playwright with a visible browser.
✔ Fetches 100 articles from Hacker News.
✔ Saves data to CSV & JSON files.

2️⃣ Run in Headless Mode (Faster Execution)
     node index.js --headless
✔ Runs without opening a browser window.
✔ Useful for CI/CD pipelines & faster execution.

3️⃣ Fetch a Custom Number of Articles (e.g., 50)
     node index.js --limit 50
✔ Retrieves 50 articles instead of 100.
✔ Saves output in structured files.

4️⃣ Fetch More Articles in Headless Mode
     node index.js --limit 200 --headless
✔ Fetches 200 articles.
✔ Runs without opening a browser for maximum speed.
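
One way the --headless and --limit flags could be read is sketched below. This is an assumption about how index.js might parse its arguments, not a copy of it; the function name `parseArgs` and the fallback to 100 on an invalid value are illustrative choices.

```javascript
// Hypothetical parser for the two flags documented above.
// Defaults mirror the README: visible browser, 100 articles.
function parseArgs(argv) {
  const options = { headless: false, limit: 100 };
  for (let i = 0; i < argv.length; i++) {
    if (argv[i] === '--headless') {
      options.headless = true;
    } else if (argv[i] === '--limit') {
      const value = Number.parseInt(argv[i + 1], 10);
      // Assumption: an invalid or missing value keeps the default limit.
      if (Number.isInteger(value) && value > 0) {
        options.limit = value;
        i++; // skip the value that was just consumed
      }
    }
  }
  return options;
}

// process.argv holds [node, script, ...flags], so drop the first two entries.
console.log(parseArgs(process.argv.slice(2)));
```

With this sketch, `node index.js --limit 200 --headless` would yield `{ headless: true, limit: 200 }`.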

📌 Expected Output

The extracted articles are saved as CSV and JSON files. Each filename includes a timestamp so that earlier runs are not overwritten.

1️⃣ CSV Output (hacker_news_articles_.csv)
S.No,Title,Submission ID,Time
1,"AI Breakthrough in Healthcare",39543218,"2 minutes ago"
2,"New JavaScript Framework Released",39543216,"5 minutes ago"
3,"NASA’s Latest Discovery",39543214,"8 minutes ago"
...
📂 Location: Same directory as the script.

2️⃣ JSON Output (hacker_news_articles_.json)

[
  {
    "S.No": 1,
    "title": "AI Breakthrough in Healthcare",
    "submissionId": 39543218,
    "time": "2 minutes ago"
  },
  {
    "S.No": 2,
    "title": "New JavaScript Framework Released",
    "submissionId": 39543216,
    "time": "5 minutes ago"
  }
]
📂 Location: Same directory as the script.
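
The two output formats above could be produced along these lines. This is a sketch under assumptions: the helper names (`toCsv`, `toJson`, `saveArticles`) are hypothetical, and the timestamp format in the filename (an ISO date with `:` and `.` replaced by `-`) is a guess, since the README does not state the exact pattern.

```javascript
const fs = require('fs');
const path = require('path');

// Wrap a value in double quotes and double any embedded quotes (CSV escaping).
const quote = (text) => `"${String(text).replace(/"/g, '""')}"`;

// Builds the CSV text with the header shown in the sample output.
function toCsv(articles) {
  const header = 'S.No,Title,Submission ID,Time';
  const rows = articles.map((a, i) =>
    [i + 1, quote(a.title), a.submissionId, quote(a.time)].join(','));
  return [header, ...rows].join('\n') + '\n';
}

// Builds the JSON text, adding the 1-based "S.No" field to each record.
function toJson(articles) {
  const records = articles.map((a, i) => ({ 'S.No': i + 1, ...a }));
  return JSON.stringify(records, null, 2);
}

// Writes both files next to the script by default and returns the shared base path.
// Assumption: the timestamp format below is illustrative only.
function saveArticles(articles, dir = __dirname, now = new Date()) {
  const stamp = now.toISOString().replace(/[:.]/g, '-');
  const base = path.join(dir, `hacker_news_articles_${stamp}`);
  fs.writeFileSync(`${base}.csv`, toCsv(articles));
  fs.writeFileSync(`${base}.json`, toJson(articles));
  return base;
}
```

Quoting every title keeps commas inside headlines from breaking the CSV columns.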

📌 How the Script Works

| Step | Description |
|------|-------------|
| 1 | Launches a Chromium browser and navigates to the Hacker News "Newest" page. |
| 2 | Extracts up to N articles (default: 100). |
| 3 | Handles pagination dynamically if more articles are needed. |
| 4 | Validates sorting using submission IDs to ensure articles are arranged from newest to oldest. |
| 5 | Saves extracted data to CSV & JSON files with dynamically generated filenames. |
| 6 | Displays the first 10 extracted articles in the console for verification. |
| 7 | Closes the browser session after successful execution. |
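
Steps 1 to 3 and step 7 could look roughly like the sketch below. It is not the actual index.js. The function names are hypothetical, and the URL and CSS selectors (`tr.athing`, `.titleline > a`, `.age`, `a.morelink`) are assumptions about the Hacker News markup that may change. The pagination loop is kept separate from Playwright so it can be exercised without a browser.

```javascript
// Collects up to `limit` articles, moving to the next page only while more are needed.
async function collectArticles(limit, readPage, goToNextPage) {
  const articles = [];
  while (articles.length < limit) {
    const batch = await readPage();
    if (batch.length === 0) break; // nothing on this page, stop
    articles.push(...batch);
    if (articles.length >= limit) break;
    const moved = await goToNextPage();
    if (!moved) break; // no further page available
  }
  return articles.slice(0, limit);
}

// Playwright wiring (assumed selectors and URL, see the note above).
async function scrapeNewest(limit = 100, headless = false) {
  const { chromium } = require('playwright'); // loaded only when scraping
  const browser = await chromium.launch({ headless });
  try {
    const page = await browser.newPage();
    await page.goto('https://news.ycombinator.com/newest');

    // Read title, ID, and relative time for every article row on the page.
    const readPage = () => page.$$eval('tr.athing', (rows) => rows.map((row) => ({
      title: row.querySelector('.titleline > a')?.textContent ?? '',
      submissionId: Number(row.id),
      time: row.nextElementSibling?.querySelector('.age')?.textContent ?? '',
    })));

    // Follow the "More" link, if there is one.
    const goToNextPage = async () => {
      const more = page.locator('a.morelink');
      if ((await more.count()) === 0) return false;
      const href = await more.first().getAttribute('href');
      if (!href) return false;
      await page.goto(new URL(href, page.url()).toString());
      return true;
    };

    return await collectArticles(limit, readPage, goToNextPage);
  } finally {
    await browser.close(); // always close the browser, even after an error
  }
}
```

A call such as `scrapeNewest(100, true).then((a) => console.table(a.slice(0, 10)))` would then print the first 10 articles, as in step 6.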

📌 Key Features & Optimizations

✔ Supports Headless & Visible Mode → Flexible execution options.
✔ Handles Pagination Automatically → Fetches multiple pages if needed.
✔ Parallelized Data Extraction → Uses Playwright’s fast element selection.
✔ Error Handling & Robust Execution → Skips missing data, prevents crashes.
✔ Dynamically Named Output Files → Prevents overwriting, keeps data organized.
✔ Command-Line Customization → Fetch any number of articles with --limit.
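
As an illustration of the "skips missing data" behavior listed above, a filter like the following could drop incomplete rows before validation and export. This is a sketch under assumptions: the name `keepCompleteArticles` is hypothetical, and the rule that a row needs a non-empty title and a positive integer ID is one plausible choice, not necessarily what index.js does.

```javascript
// Hypothetical filter: keeps rows that have a title and a valid ID,
// and counts the rows that were skipped.
function keepCompleteArticles(rawArticles) {
  const kept = [];
  let skipped = 0;
  for (const raw of rawArticles) {
    const title = typeof raw.title === 'string' ? raw.title.trim() : '';
    const submissionId = Number(raw.submissionId);
    if (title === '' || !Number.isInteger(submissionId) || submissionId <= 0) {
      skipped++; // incomplete row: skip it instead of throwing
      continue;
    }
    // A missing time is tolerated and stored as an empty string.
    kept.push({ title, submissionId, time: raw.time ?? '' });
  }
  return { kept, skipped };
}
```

Reporting the skipped count alongside the kept rows makes it visible when the page markup has changed and rows are silently being lost.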
