A powerful web scraping tool for collecting news articles from Google News about specific topics over a defined time period.
This project is designed to help researchers, journalists, and developers gather news data efficiently while respecting web scraping best practices. For educational purposes only.
- Scrape Google News search results for specified search terms
- Search across multiple date ranges automatically
- Stealth browser automation to avoid detection
- Extract article content, titles, dates, and sources
- Deduplication using content fingerprinting
- Store results in SQLite database
- Export data to Excel spreadsheet
- User agent rotation to reduce blocking
- Clone the repository
git clone https://github.com/gauravfs-14/gnews-harvester.git
cd gnews-harvester
- Install dependencies
npm install
Prerequisites:
- Node.js (v14 or higher recommended), available from nodejs.org
- npm package manager (bundled with Node.js)
Edit the configuration in config.js to customize your search:
// Configuration settings
const SEARCH_TERMS = ["climate change", "AI in education"]; // Topics to search for
const YEARS_TO_SEARCH = 2; // How many years back to search
const PAGES_PER_TERM = 5; // Pages per search term per date range
const OUTPUT_FILE = "news_output.xlsx"; // Excel output filename
const DB_FILE = "news_harvester.db"; // SQLite database filename
const DELAY_BETWEEN_REQUESTS = 2000; // Delay between requests (ms)
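As a rough illustration of how a setting like DELAY_BETWEEN_REQUESTS might be applied, the sketch below pauses between consecutive requests and adds a little random jitter. The helper names (sleep, jitteredDelay, forEachWithDelay) and the 50% jitter factor are assumptions made for this example; the project's own code may apply the delay differently.

```javascript
// Hypothetical helpers; not taken from the project's source.
const DELAY_BETWEEN_REQUESTS = 2000; // ms, mirrors the value in config.js

// Resolve after the given number of milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Stretch the base delay by up to 50% so requests are not evenly spaced.
// The 50% factor is an arbitrary choice for this sketch.
function jitteredDelay(baseMs, random = Math.random) {
  return Math.round(baseMs * (1 + 0.5 * random()));
}

// Run an async handler over each item, pausing between items (not after the last).
async function forEachWithDelay(items, handler, delayMs = DELAY_BETWEEN_REQUESTS) {
  for (let i = 0; i < items.length; i++) {
    await handler(items[i]);
    if (i < items.length - 1) {
      await sleep(jitteredDelay(delayMs));
    }
  }
}
```

Raising the base delay is the simplest way to lower the request rate if blocking becomes a problem.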
Run the application with:
npm start
The program will:
- Generate monthly date ranges based on the YEARS_TO_SEARCH setting
- For each search term, iterate through all date ranges
- Scrape Google News results page by page
- Extract and store article data in SQLite
- Export the collected data to an Excel file
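The first step, generating monthly date ranges, can be sketched as follows. This is a minimal illustration and not the contents of generateMonthlyRanges.js: the function signature, the use of UTC, and the YYYY-MM-DD output format are all assumptions.

```javascript
// Hypothetical sketch; the project's generateMonthlyRanges.js may differ.
// Returns one { start, end } pair per calendar month, oldest first,
// covering the last `yearsBack * 12` months up to and including the current one.
function generateMonthlyRanges(yearsBack, now = new Date()) {
  const ranges = [];
  const totalMonths = yearsBack * 12;
  for (let i = totalMonths - 1; i >= 0; i--) {
    // First day of the month that lies i months before the current month.
    // Date.UTC normalizes negative month indexes into earlier years.
    const start = new Date(Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - i, 1));
    // Day 0 of the following month is the last day of this month.
    const end = new Date(Date.UTC(start.getUTCFullYear(), start.getUTCMonth() + 1, 0));
    ranges.push({
      start: start.toISOString().slice(0, 10),
      end: end.toISOString().slice(0, 10),
    });
  }
  return ranges;
}

// Example: one year back from mid-March 2024 yields 12 ranges,
// from 2023-04-01..2023-04-30 through 2024-03-01..2024-03-31.
const exampleRanges = generateMonthlyRanges(1, new Date(Date.UTC(2024, 2, 15)));
console.log(exampleRanges.length, exampleRanges[0], exampleRanges[11]);
```

Note that in this sketch the final range runs to the end of the current month, so it may extend past today's date.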
The harvester produces two outputs:
- SQLite Database (data/news_harvester.db): Contains all scraped articles with metadata
- Excel Spreadsheet (data/news_output.xlsx): Formatted report with the following columns:
  - News Media Name
  - Date
  - Title of the News
  - Descriptive Text
  - URL
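To make the column layout concrete, here is a small sketch that maps a scraped article record onto the spreadsheet columns listed above. The record field names (source, date, title, text, url) and the helper toReportRow are assumptions for illustration; the actual database schema is not documented here.

```javascript
// Column headers as listed in this README, in order.
const REPORT_COLUMNS = [
  "News Media Name",
  "Date",
  "Title of the News",
  "Descriptive Text",
  "URL",
];

// Hypothetical mapping from an article record to one spreadsheet row.
// Missing fields become empty strings so every row has all five columns.
function toReportRow(article) {
  return {
    "News Media Name": article.source ?? "",
    "Date": article.date ?? "",
    "Title of the News": article.title ?? "",
    "Descriptive Text": article.text ?? "",
    "URL": article.url ?? "",
  };
}

const exampleRow = toReportRow({
  source: "Example Times",
  date: "2024-03-15",
  title: "Sample headline",
  url: "https://example.com/article",
});
console.log(exampleRow);
```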
- The tool uses Puppeteer with a stealth plugin to navigate Google News
- It searches for each term within specific monthly date ranges
- For each search result, it extracts the article URL
- It then visits each URL and extracts content using Cheerio
- Articles are deduplicated using content fingerprinting
- Results are stored in SQLite and exported to Excel
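The deduplication step above can be sketched like this. It is only an illustration of content fingerprinting in general, not the project's generateFingerprint.js: the choice of SHA-256, the normalization rules, and the title/text field names are assumptions.

```javascript
const crypto = require("node:crypto");

// Hypothetical fingerprint: hash of the lowercased, whitespace-collapsed
// title and body, so trivial formatting differences do not create duplicates.
function generateFingerprint(title, text) {
  const normalized = `${title} ${text}`.toLowerCase().replace(/\s+/g, " ").trim();
  return crypto.createHash("sha256").update(normalized, "utf8").digest("hex");
}

// Keep only the first article seen for each fingerprint.
function dedupe(articles) {
  const seen = new Set();
  return articles.filter((article) => {
    const fingerprint = generateFingerprint(article.title, article.text);
    if (seen.has(fingerprint)) return false;
    seen.add(fingerprint);
    return true;
  });
}

const unique = dedupe([
  { title: "Climate Report", text: "Temperatures rise." },
  { title: "climate  report", text: "Temperatures   rise. " }, // same content, different spacing and case
  { title: "AI in Schools", text: "New pilot program." },
]);
console.log(unique.length);
```

In a setup like this, the fingerprint could also be stored in a uniquely indexed database column so that duplicates are rejected across runs, though whether the project does so is not stated here.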
- CAPTCHA Issues: If you see CAPTCHA warnings in the console, the script will pause briefly. You may need to reduce the scraping frequency or use proxies.
- No Links Found: This may indicate that Google has changed its DOM structure. Check for updates.
- Scraping Errors: Individual article scraping errors are logged but won't stop the overall process.
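If CAPTCHA warnings keep appearing, one common way to reduce scraping frequency is to back off exponentially between retries. The sketch below is not the tool's built-in behavior; the function names, the `blocked` flag on the task result, and the default timings are assumptions chosen for illustration.

```javascript
// Delay doubles with each attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 2000, maxMs = 60000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical retry wrapper: `task` is any async function that resolves to an
// object with a boolean `blocked` property (true when a CAPTCHA was detected).
async function retryWithBackoff(task, { maxAttempts = 5, baseMs = 2000, maxMs = 60000 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await task();
    if (!result.blocked) return result;
    if (attempt < maxAttempts - 1) {
      const waitMs = backoffDelay(attempt, baseMs, maxMs);
      console.warn(`Possible CAPTCHA; waiting ${waitMs} ms before retrying`);
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
  throw new Error("Still blocked after retries; consider lowering the scraping frequency");
}
```

With the defaults above, the waits between attempts would be 2, 4, 8, and 16 seconds.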
This tool is for educational and research purposes only. When scraping websites:
- Respect robots.txt files
- Implement reasonable rate limiting
- Review and comply with terms of service for target websites
- Be aware that web scraping may be subject to legal restrictions in some jurisdictions
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
gnews-harvester/
├── config.js                 # Configuration settings
├── data/                     # Output directory for database and Excel files
├── scripts/
│   ├── index.js              # Main application entry point
│   └── utils/                # Utility functions
│       ├── getRandomUA.js
│       ├── generateFingerprint.js
│       ├── generateMonthlyRanges.js
│       └── setupSQLite.js
├── package.json              # Project dependencies
└── README.md                 # This documentation
Have questions? Reach out to the maintainer:
- GitHub: @gauravfs-14
- Twitter: @gaurav_fs_14