
Chapter 20, Challenge 1 (Web Scraper) #8

Open
jonathan-j-stone opened this issue Jun 6, 2017 · 1 comment

Comments

jonathan-j-stone commented Jun 6, 2017

The linked solution in the book (http://tinyurl.com/gkv6fuh), when run, returns URLs just as the practice example did. The same URLs, in fact.

I thought it was supposed to return headlines.

There's also so much unexplained new material in this chapter that I didn't feel I had any hope of solving the challenge. Even if the solution worked, I don't understand the code.

@totodo713

Me too!
I think the Google News front end changed from serving plain HTML to rendering the page with JavaScript.
Here is my URL-collecting version.
(Apologies if I've gotten something wrong.)

import urllib.request
from bs4 import BeautifulSoup


class Scraper:
    def __init__(self, site):
        self.site = site

    def scrape(self):
        r = urllib.request.urlopen(self.site)
        html = r.read()
        parser = "html.parser"
        sp = BeautifulSoup(html, parser)

        # Collect relative links to article pages; a set deduplicates automatically
        articles = set()
        for tag in sp.find_all("a"):
            article = tag.get("href")
            if article is None:
                continue
            if "articles" in article:
                articles.add(article)

        urls = set()
        for article in articles:
            # Strip the leading "./" from the relative link before joining it to the site URL
            r = urllib.request.urlopen(self.site + article[2:])
            html = r.read()
            parser = "html.parser"
            sp = BeautifulSoup(html, parser)
            title = sp.find('title').text.replace("Google News - ", "")

            if len(title) > 0:
                for tag in sp.find_all("a"):
                    url = tag.get("href")
                    if url is None:
                        continue
                    if "html" in url:
                        urls.add(url)

        # Write the collected URLs out (the ./out directory must already exist)
        with open("./out/news.txt", "w", encoding="utf8") as f:
            for url in urls:
                f.write(f"{url}\n")


if __name__ == '__main__':
    news = "https://news.google.com/"
    Scraper(news).scrape()
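For anyone who, like the original poster, expected headlines rather than URLs: the idea would be to collect the *text* of the article links instead of their `href` attributes. Here is a minimal sketch using only the standard library's `html.parser` (the markup below is a made-up stand-in for a news front page, not Google News's real structure):

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text of <a> tags whose href contains 'articles'."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_article_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            self._in_article_link = "articles" in href

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_article_link = False

    def handle_data(self, data):
        # Only record non-whitespace text found inside an article link
        if self._in_article_link and data.strip():
            self.headlines.append(data.strip())


# Hypothetical markup standing in for a scraped front page.
sample_html = """
<html><body>
  <a href="./articles/abc123">Example headline one</a>
  <a href="/settings">Settings</a>
  <a href="./articles/def456">Example headline two</a>
</body></html>
"""

parser = HeadlineParser()
parser.feed(sample_html)
print(parser.headlines)
```

In the real script you would feed `parser.feed()` the HTML fetched with `urllib.request.urlopen(...).read().decode()` instead of the sample string; the same filtering could also be done with BeautifulSoup by reading `tag.text` instead of `tag.get("href")`.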
