📰 Google News Scraper

A lightweight package that scrapes article data from Google News. Simply pass a keyword or phrase, and the results are returned as an array of JSON objects.

Installation 🔌

# Install via NPM
npm install google-news-scraper
# Install via Yarn
yarn add google-news-scraper

Usage 🕹️

Simply import the package and pass a config object.

import googleNewsScraper from 'google-news-scraper';
const articles = await googleNewsScraper({ searchTerm: "The Oscars" });

A minimum working example can be found in this repo. Full documentation on the config object can be found below.

Output 📲

The output is an array of JSON objects, with each article following the structure below:

[
    {
        "title": "Article title",
        "link": "http://url-to-website.com/path/to/article",
        "image": "http://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "datetime": "2024-05-13T08:02:22.000Z",
        "time": "Time/date published (human-readable)",
        "articleType": "String, one of ['regular' | 'topicFeatured' | 'topicSmall']"
    }
]
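
Because the result is a plain array, it can be processed with ordinary array logic. The sketch below counts articles per publication; `countBySource` and the sample data are hypothetical and exist only for illustration, they are not part of the package.

```javascript
// Hypothetical helper (not part of google-news-scraper): counts articles per publication.
function countBySource(articles) {
    const counts = {};
    for (const article of articles) {
        // Fall back to a placeholder when the source field is missing or empty.
        const source = article.source || "unknown";
        counts[source] = (counts[source] || 0) + 1;
    }
    return counts;
}

// Sample data shaped like the output above (values are made up for illustration).
const sampleArticles = [
    { title: "A", link: "http://example.com/a", source: "Example Times", articleType: "regular" },
    { title: "B", link: "http://example.com/b", source: "Example Times", articleType: "topicSmall" },
    { title: "C", link: "http://example.com/c", source: "", articleType: "regular" },
];

console.log(countBySource(sampleArticles));
```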

Config ⚙️

The config object passed to the function above has the following properties:

searchTerm

This is the search query you'd like to find articles for. Pass the search string like so: searchTerm: "The Oscars".

The search term is no longer a required field, as hahagu added support for topic pages in #44. If neither searchTerm nor baseUrl is supplied, the scraper will just return results from the Google News homepage.

baseUrl

The baseUrl property lets you specify an alternate base URL for your search. This is useful when you want to scrape, for example, a specific Google News topic.

PLEASE NOTE: Using both a baseUrl that points to a topic AND a searchTerm is not advised, as the searchTerm will typically be ignored in favour of the topic in the baseUrl.

In the scraped URL, your baseUrl will be immediately followed by query parameters (eg: ?hl=en-US&gl=US&ceid=US), so it doesn't matter whether your baseUrl has a trailing slash or not.

Defaults to https://news.google.com/search
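
To illustrate the trailing-slash point above, here is a rough sketch of appending query parameters to a base URL with the standard URL API. The `appendQuery` function is hypothetical and is not the package's internal code; it only shows that the query string follows the base URL either way.

```javascript
// Hypothetical illustration only: this is NOT how the package builds its URL internally.
function appendQuery(baseUrl, params) {
    const url = new URL(baseUrl);
    for (const [key, value] of Object.entries(params)) {
        // Each key/value pair is added as a query parameter after the path.
        url.searchParams.set(key, value);
    }
    return url.toString();
}

// The same parameters used in the example above.
const exampleParams = { hl: "en-US", gl: "US", ceid: "US" };

// With and without a trailing slash, the query string lands directly after the base URL.
console.log(appendQuery("https://news.google.com/search", exampleParams));
console.log(appendQuery("https://news.google.com/search/", exampleParams));
```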

prettyURLs

The URLs that Google News supplies for each article are "ugly" links (eg: "https://news.google.com/articles/CAIiEPgfWP_e7PfrSwLwvWeb5msqFwgEKg8IACoHCAowjuuKAzCWrzwwt4QY?hl=en-GB&gl=GB&ceid=GB%3Aen"). By default the scraper will retrieve the actual "pretty" URL (eg: "https://www.nytimes.com/2020/01/22/movies/expanded-best-picture-oscar.html"). This is done using some base64 decoding, so the overhead is negligible. To prevent this default behaviour and retrieve the "ugly" links instead, pass prettyURLs: false to the config object.

Credit to anthonyfranc for the base64 decode fix 🙏

Defaults to true.

timeframe

The results can be filtered to articles published within a given timeframe prior to the request. The timeframe is a string made up of a number followed by a letter representing the time operator. For example, 1y would signify 1 year. Full list of operators below:

  • h = hours (eg: 12h)
  • d = days (eg: 7d)
  • m = months (eg: 6m)
  • y = years (eg: 1y)

Defaults to 7d.
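
If the timeframe comes from user input, it may be worth validating it before passing it to the scraper. The sketch below checks a string against the format described above; `parseTimeframe` and `TIMEFRAME_UNITS` are hypothetical names, not part of the package, and the package may accept or reject values differently.

```javascript
// Hypothetical validator, not part of the package's API.
// Maps each operator letter from the list above to a readable unit name.
const TIMEFRAME_UNITS = { h: "hours", d: "days", m: "months", y: "years" };

function parseTimeframe(timeframe) {
    // A whole number followed by exactly one operator letter (h, d, m or y).
    const match = /^(\d+)([hdmy])$/.exec(timeframe);
    if (!match) {
        throw new Error(`Invalid timeframe: "${timeframe}"`);
    }
    return { amount: Number(match[1]), unit: TIMEFRAME_UNITS[match[2]] };
}

console.log(parseTimeframe("7d"));
console.log(parseTimeframe("12h"));
```

Strings such as "7w" or "d7" make this sketch throw, so a typo can be caught before a request is made.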

getArticleContent

By default, the scraper does not return the article content, as this would require Puppeteer to navigate to each individual article in the results (increasing execution time significantly). If you would like to enable this behaviour and receive the content of each article, pass getArticleContent: true in the config. This will add two fields to each article in the output: content and favicon.

[
    {
        "title": "Article title",
        "link": "https://url-to-website.com/path/to/article",
        "image": "https://url-to-website.com/path/to/image.jpg",
        "source": "Name of publication",
        "time": "Time/date published (human-readable)",
        "content": "The full text content of the article...",
        "favicon": "https://url-to-website.com/path/to/favicon.png"
    }
]

PLEASE NOTE: Due to the large number of variables to take into account, this feature fails on many websites. All errors are handled gracefully and will return an empty string as the content. Please ensure you handle such outcomes in your application.

Defaults to false
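
One way to handle the empty-content case is to separate the articles whose content was retrieved from those where it was not. This is a minimal sketch; `splitByContent` is a hypothetical helper, and it treats a missing or whitespace-only content field the same as an empty string, which is an assumption on top of the behaviour described above.

```javascript
// Hypothetical helper (not part of the package): separates articles with usable content.
function splitByContent(articles) {
    const withContent = [];
    const withoutContent = [];
    for (const article of articles) {
        // Failed extractions come back as an empty string; missing or blank values
        // are treated the same way here (an assumption made for this sketch).
        const hasContent = typeof article.content === "string" && article.content.trim() !== "";
        (hasContent ? withContent : withoutContent).push(article);
    }
    return { withContent, withoutContent };
}

// Made-up sample data for illustration.
const scraped = [
    { title: "A", content: "Full text of article A" },
    { title: "B", content: "" },
    { title: "C" },
];

const { withContent, withoutContent } = splitByContent(scraped);
console.log(withContent.length, withoutContent.length);
```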

logLevel

You can customise the log level to any of the following:

  • none: No logs will be output at all.
  • error: Only errors will be output to the log.
  • warn: Errors and warnings will be output to the log.
  • info: Info, errors and warnings will be output to the log.
  • verbose: All of the above and potentially more. Currently there are no specifically verbose logs, but in future I may move some of the info logs to verbose and/or add some debugging info there.

Defaults to error.
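
The levels above are cumulative, which can be pictured as an ordered list where a message is shown if its level is at or below the configured one. The sketch below is only a model of that description; `LOG_LEVELS` and `shouldLog` are hypothetical and are not taken from the package source.

```javascript
// Assumed ordering, inferred from the list above; not taken from the package source.
const LOG_LEVELS = ["none", "error", "warn", "info", "verbose"];

// Returns true when a message at messageLevel would be shown under configuredLevel.
function shouldLog(configuredLevel, messageLevel) {
    const configured = LOG_LEVELS.indexOf(configuredLevel);
    const message = LOG_LEVELS.indexOf(messageLevel);
    // "none" is a valid setting but not a valid level for an individual message.
    if (configured === -1 || message < 1) {
        throw new Error("Unknown log level");
    }
    return message <= configured;
}

console.log(shouldLog("warn", "error"));
console.log(shouldLog("error", "warn"));
console.log(shouldLog("none", "error"));
```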

queryVars

An object of additional query params to add to the Google News URL string, formatted as key value pairs. This can be useful if you want to search for articles in a specific language, for example:

const articles = await googleNewsScraper({
    searchTerm: "Últimas noticias en Madrid",
    queryVars: {
        gl:"ES",
        ceid:"ES:es"
    },
});

Defaults to null

puppeteerArgs

An array of Chromium flags to pass to the browser instance. By default, this will be an empty array. A full list of available flags can be found here. NB: if you are launching this in a Heroku app, you will need to pass the --no-sandbox and --disable-setuid-sandbox flags, as explained in this SO answer.

Defaults to []
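
As a rough sketch of the Heroku case, the config can be built conditionally so the sandbox flags are only added where they are needed. `buildScraperConfig` is a hypothetical helper, and detecting Heroku through the DYNO environment variable is an assumption; adjust the check to suit your deployment.

```javascript
// Hypothetical helper; the DYNO check is an assumption about how to detect Heroku.
function buildScraperConfig(searchTerm, env) {
    const onHeroku = Boolean(env.DYNO);
    return {
        searchTerm,
        // The two flags named above are only added when running on Heroku.
        puppeteerArgs: onHeroku ? ["--no-sandbox", "--disable-setuid-sandbox"] : [],
    };
}

console.log(buildScraperConfig("The Oscars", { DYNO: "web.1" }));
console.log(buildScraperConfig("The Oscars", {}));
```

The resulting object can then be passed straight to the scraper, for example googleNewsScraper(buildScraperConfig("The Oscars", process.env)).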

puppeteerHeadlessMode

Whether or not Puppeteer should run in headless mode. Running in headless mode increases performance by approximately 30% (credit to ole-ve for finding this). If you're not sure about this setting, leave it as it is.

Defaults to true

limit

The total number of articles that you would like to be returned. Please note that with higher numbers, the actual returned number may be lower. Typically the max is 99, but it varies depending on many variables in Puppeteer (such as rate limiting, network conditions etc.).

Defaults to 99

TypeScript 💙

Google News Scraper includes full TypeScript definitions.

Your IDE should pick the types up automatically, but if not you can find them in the dist/tsc/ folder.

Common JS 👴🏻

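Google News Scraper is built to work as an ESM module out of the box, but it also works as a Common JS module. Use require instead of import, and wrap the call in an async function, since top-level await is not available in Common JS:

const googleNewsScraper = require('google-news-scraper');

// Top-level await is not available in Common JS, so use an async wrapper.
(async () => {
    const articles = await googleNewsScraper({ searchTerm: "The Oscars" });
})();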

Performance 📈

My test query returned 94 results, which took 4.5 seconds with article content and 3.6 seconds without it. I'm on a fibre connection, and other queries may return a different number of results, so your mileage may vary.

Upkeep 🧹

Please note that this is a web-scraper, which relies on DOM selectors, so any fundamental changes in the markup on the Google News site will probably break this tool. I'll try my best to keep it up-to-date, but changes to the markup on Google News will be silent and therefore difficult to keep track of. Feel free to submit an issue if the tool stops working.

Bugs 🐞

Due to the size of Chromium, this package is too large to run on Vercel free tier. For more information please refer to this issue.

Please report bugs via the issue tracker.

Contribute 🤝

Feel free to submit a PR if you've fixed an open issue. Thank you.

Python version 🐍

If you're looking for a Python version, there's one here. Please note, the Python version is a fork and is maintained separately. If you have any issues with the Python version, please open an issue on that repo instead of here.