Adds a headless browser scraper for dynamic navigation sites. #26

Open · wants to merge 1 commit into main

Conversation

mdurepos

Sites whose links are dynamically generated by JavaScript don't get scraped by the normal scraper. The new method is more resource-intensive because it uses a headless browser, but that option lets us scrape more dynamic websites.

This was tested manually on the docs.rust-lang.org/book site, which I couldn't previously get to convert. Automated tests were added both to verify the new functionality and to confirm that the original scraper was indeed failing on JS-rendered links.

@lapwat
Owner

lapwat commented Apr 1, 2025

Hello, thank you for the PR; this is a great feature that will improve papeer. All tests pass!

There are several locations where we trigger an HTML request on visited websites. All of them should be aware of, and handle, the useHeadless config property.

  1. When we get a webpage body

response, err := http.Get(url)

  2. When we get the links present on the page, during the get command

links, _, _, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, false)

  3. When we get the links present on the page, during the list command

links, path, home, err := book.GetLinks(base, listOpts.Selector[0], listOpts.limit, listOpts.offset, listOpts.reverse, listOpts.include)

Items 2 and 3 use GetLinks, and the HTML request is performed here:

c.Visit(url.String())

  4. This reference to GetLinks can be ignored, since it is in the tableOfContent() function, which is no longer used.

links, _, home, err := GetLinks(base, config.Selector, config.Limit, config.Offset, config.Reverse, config.Include)

Solutions

We do not want to re-implement the logic of listing the chapters: filtering, ordering, limit, offset, etc.

We want the headless browser to only get the HTML content of the page. All other logic should be unchanged.

GetLinksWithHeadlessBrowser() should be something more like GetHTMLWithHeadlessBrowser(url) (htmlString string || reader io.Reader).

In the same way, we may need another function that does the same thing without the headless browser: GetHTML(url) (htmlString string || reader io.Reader).

Now that we have those 2 functions, we have 2 solutions:

  • check config.useHeadless / useHeadless every time we want to perform a GetHTML request,
  • or set config.getLinkFunction in the get / list commands and pass this function as a parameter to other functions, instead of passing the useHeadless bool.
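The second option can be sketched as follows. The names here (GetHTMLFunc, the stand-in link extraction, the limit handling) are illustrative assumptions, not actual papeer code; the point is that GetLinks receives a fetcher and stays indifferent to how the HTML was obtained, so filtering/ordering logic lives in one place:

```go
package main

import (
	"fmt"
	"strings"
)

// GetHTMLFunc is the type of a page fetcher: plain HTTP or headless.
type GetHTMLFunc func(url string) (string, error)

// GetLinks no longer cares how the HTML was obtained; it just calls
// the fetcher it was handed. (Link extraction is faked here as
// "one link per line" to keep the sketch self-contained.)
func GetLinks(getHTML GetHTMLFunc, url string, limit int) ([]string, error) {
	html, err := getHTML(url)
	if err != nil {
		return nil, err
	}
	links := strings.Split(strings.TrimSpace(html), "\n")
	if limit > 0 && limit < len(links) {
		links = links[:limit]
	}
	return links, nil
}

func main() {
	// In the get/list commands we would pick the fetcher once, based
	// on the useHeadless flag, and pass it down from there.
	fakeFetch := func(url string) (string, error) {
		return "/ch1\n/ch2\n/ch3", nil
	}
	links, _ := GetLinks(fakeFetch, "https://example.com", 2)
	fmt.Println(links) // prints [/ch1 /ch2]
}
```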

The tricky part is that the go-colly package does not allow us to "Visit()" an HTML string. We may need to use another package that can work on an HTML document directly, like goquery.
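To illustrate working from an HTML string rather than Visit()ing a URL, here is a deliberately naive, standard-library-only href extractor. This is a toy for the sketch only; a real implementation should use a proper parser such as goquery (github.com/PuerkitoBio/goquery), which builds a document from a string or io.Reader:

```go
package main

import (
	"fmt"
	"strings"
)

// extractHrefs pulls href="..." values out of an HTML string by plain
// string scanning. Toy code: it ignores attribute context, quoting
// variants, and entities, which a real parser like goquery handles.
func extractHrefs(html string) []string {
	var links []string
	rest := html
	for {
		i := strings.Index(rest, `href="`)
		if i < 0 {
			break
		}
		rest = rest[i+len(`href="`):]
		j := strings.Index(rest, `"`)
		if j < 0 {
			break
		}
		links = append(links, rest[:j])
		rest = rest[j+1:]
	}
	return links
}

func main() {
	html := `<ul><li><a href="/ch1">1</a></li><li><a href="/ch2">2</a></li></ul>`
	fmt.Println(extractHrefs(html)) // prints [/ch1 /ch2]
}
```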

I hope this makes sense.

@mdurepos
Author

mdurepos commented Apr 2, 2025

Hi,

Yes, it makes sense, though I'm not sure whether you are hoping for me to implement those additional changes or whether these were more notes to yourself for future development. For my part, the intent was mostly to contribute the code back after modifying it to suit my own purposes. Hopefully it can either be used as-is or serve to improve the software.

Cheers!
