Adds a headless browser scraper for dynamic navigation sites. #26

Open · wants to merge 1 commit into main

Conversation

mdurepos

Sites whose links are dynamically generated by JavaScript don't get scraped by the normal scraper. The new method is more resource-intensive because it uses a headless browser, but that option lets us scrape more dynamic websites.

This was tested manually on the docs.rust-lang.org/book site, which I couldn't previously get to convert. Automated tests were added both to verify the new functionality and to confirm that the original scraper was indeed failing on JS-rendered links.

@lapwat
Owner

lapwat commented Apr 1, 2025

Hello, thank you for the PR; this is a great feature that will improve papeer. All tests pass!

There are several locations where we trigger an HTML request on visited websites. All of them should be aware of, and handle, the useHeadless config property.

  1. When we get a webpage body

response, err := http.Get(url)

  2. When we get the links present on the page, during the get command

links, _, _, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, false)

  3. When we get the links present on the page, during the list command

links, path, home, err := book.GetLinks(base, listOpts.Selector[0], listOpts.limit, listOpts.offset, listOpts.reverse, listOpts.include)

Items 2 and 3 use GetLinks, and the HTML request is performed here:

c.Visit(url.String())

  4. This reference to GetLinks can be ignored, since it is in the tableOfContent() function, which is no longer used.

links, _, home, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, config.Include)

Solutions

We do not want to re-implement the logic of listing the chapters: filtering, ordering, limit, offset, etc.

We want the headless browser to only get the HTML content of the page. All other logic should be unchanged.

GetLinksWithHeadlessBrowser() should be something more like GetHTMLWithHeadlessBrowser(url) (htmlString string || reader io.Reader).

In the same way, we may need another function that does the same thing without the headless browser: GetHTML(url) (htmlString string || reader io.Reader).

Now that we have those 2 functions, we have 2 solutions:

  • check config.useHeadless / useHeadless every time we want to perform a GetHTML request,
  • or set config.getLinkFunction in the get / list commands and pass this function as a parameter to other functions, instead of passing the useHeadless bool.
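The second option can be sketched as follows. The names here (GetHTMLFunc, the stand-in link extraction, the limit handling) are illustrative assumptions, not actual papeer code; the point is that GetLinks receives a fetcher and stays indifferent to how the HTML was obtained, so filtering/ordering logic lives in one place:

```go
package main

import (
	"fmt"
	"strings"
)

// GetHTMLFunc is the type of a page fetcher: plain HTTP or headless.
type GetHTMLFunc func(url string) (string, error)

// GetLinks no longer cares how the HTML was obtained; it just calls
// the fetcher it was handed. (Link extraction is faked here as
// "one link per line" to keep the sketch self-contained.)
func GetLinks(getHTML GetHTMLFunc, url string, limit int) ([]string, error) {
	html, err := getHTML(url)
	if err != nil {
		return nil, err
	}
	links := strings.Split(strings.TrimSpace(html), "\n")
	if limit > 0 && limit < len(links) {
		links = links[:limit]
	}
	return links, nil
}

func main() {
	// In the get/list commands we would pick the fetcher once, based
	// on the useHeadless flag, and pass it down from there.
	fakeFetch := func(url string) (string, error) {
		return "/ch1\n/ch2\n/ch3", nil
	}
	links, _ := GetLinks(fakeFetch, "https://example.com", 2)
	fmt.Println(links) // prints [/ch1 /ch2]
}
```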

The tricky part is that the go-colly package does not allow us to "Visit()" an HTML string. We may need to use another package that can work on an HTML document directly, like goquery.
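To illustrate working from an HTML string rather than Visit()ing a URL, here is a deliberately naive, standard-library-only href extractor. This is a toy for the sketch only; a real implementation should use a proper parser such as goquery (github.com/PuerkitoBio/goquery), which builds a document from a string or io.Reader:

```go
package main

import (
	"fmt"
	"strings"
)

// extractHrefs pulls href="..." values out of an HTML string by plain
// string scanning. Toy code: it ignores attribute context, quoting
// variants, and entities, which a real parser like goquery handles.
func extractHrefs(html string) []string {
	var links []string
	rest := html
	for {
		i := strings.Index(rest, `href="`)
		if i < 0 {
			break
		}
		rest = rest[i+len(`href="`):]
		j := strings.Index(rest, `"`)
		if j < 0 {
			break
		}
		links = append(links, rest[:j])
		rest = rest[j+1:]
	}
	return links
}

func main() {
	html := `<ul><li><a href="/ch1">1</a></li><li><a href="/ch2">2</a></li></ul>`
	fmt.Println(extractHrefs(html)) // prints [/ch1 /ch2]
}
```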

I hope this makes sense.

@mdurepos
Author

mdurepos commented Apr 2, 2025

Hi,

Yes, it makes sense, though I'm not sure whether you are hoping for me to implement those additional changes or whether these were more notes to yourself for future development. For my part, the intent was mostly to contribute the code back after modifying it to suit my own purposes. Hopefully it can either be used as-is or serve to improve the software.

Cheers!
