Puppeteering and integration with ArchiveBox #2
Comments
Hi! Thanks for the compliments! Reading your post I have a feeling you have a slightly different use case in mind compared to what I'm trying to do here. It seems to me, you want this:
Which is a valid use case, and I'm willing to help support it where it does not hurt pwebarc (the project), but it is not what this project is designed to do. Creation of self-sufficient website archives that can be shared with others is a very low priority for pwebarc, which mainly exists to
So, as noted in the FAQ, I feel like adding anti-bot-blocking and anti-browser-automation-detection to pWebArc itself would not help achieving my goals here.
So, in your use case, to make integration with a smart archiving server (like ArchiveBox) simpler, it could be useful to add a REST call to the dumping protocol so that pWebArc could notify the archiving server when the whole page finished loading. Alternatively, pWebArc will get in-browser local storage persistence in the next version (implementing it turned out to be much more annoying than I expected, even with all my pre-planning in the sources) and navigation tracking will be coming after that. Also, I'm toying with the idea of pWebArc storing its config on the archiving server, which could also help a bit.
Anyway, so, if I were you and I wanted to use pWebArc from the near future with ArchiveBox within my interpretation of your use case, I would put those 10K LOC+ of autopilot and anti-bot-blocking code into a separate extension (or a set of UserScripts, or some such) and then add two public in-browser
So, the integration with ArchiveBox then becomes this:
Ta-da! Thoughts?
As to DWeb Camp: it's on the other side of the planet, it will also require a US visa, and going through TSA, and then spending days in a forest, speaking to random people instead of doing productive stuff. No, thanks. :]
As to chatting: I consider synchronous communications like IM or voice/video calls to be poison to productive deep work, so I'm trying really hard to minimize the amount of those too. But I don't see why we can't discuss this topic publicly here.
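A rough sketch of the "notify the archiving server when the whole page finished loading" idea from the comment above. Nothing here is an existing pWebArc or ArchiveBox API: the `/pwebarc/page-loaded` path, the payload fields, and the names `PageLoadedNotice` and `buildPageLoadedRequest` are all hypothetical, picked only to show what such a REST call might carry.

```typescript
// Hypothetical payload; pWebArc's dumping protocol defines no such message.
interface PageLoadedNotice {
  url: string;        // URL of the page that finished loading
  tabId: number;      // browser tab the page was loaded in
  finishedAt: string; // ISO 8601 timestamp
}

interface PreparedRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the request without sending it, so its shape is easy to inspect.
function buildPageLoadedRequest(
  serverBase: string,
  notice: PageLoadedNotice,
): PreparedRequest {
  // Strip trailing slashes so the path joins cleanly.
  const base = serverBase.replace(/\/+$/, "");
  return {
    endpoint: `${base}/pwebarc/page-loaded`, // hypothetical path
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(notice),
    },
  };
}
```

The returned `endpoint` and `init` could then be handed to `fetch(endpoint, init)` from whatever component learns that the page is done; which component that should be is exactly the open question in this thread.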
Oh I wasn't intending to propose integrating with ArchiveBox at all! I just meant I have some code I can share that you might find useful for your own project (the in-browser page saving logic to create WARCs with more than just HTML+images, e.g. handling iframes, shadow DOMs, webrtc, etc). And no worries if DWEB is not your thing, just figured I'd share it in case you were nearby. Here is fine for discussion too, no need to call.
> (the in-browser page saving logic to create WARCs with more than just HTML+images, e.g. handling iframes, shadow DOMs, webrtc, etc)
Point by point:
- WARCs: pWebArc does not generate WARCs by design since WARCs are pretty restrictive in what they can represent (WARC generation might become an optional feature later, but I don't see how pWebArc could completely replace its own WRR file format with WARC, how would you capture search engine queries done via HTTP POST otherwise?);
- more than just HTML+images: pWebArc captures everything your browser fetches via webRequest and Chromium debugger APIs, and I mean everything;
- iframes: are handled transparently via the above APIs, and also in DOM snapshots;
- shadow DOMs: `snapshotTab` action of pWebArc takes non-shadow DOM snapshots, by design;
- webrtc: web video fetching is usually done via HTTP, which pWebArc captures, but it could be useful if there's a simple way to capture it, I suppose.
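To illustrate the HTTP POST point from the list above: two search queries sent via POST to the same URL differ only in their request bodies, so a capture has to record the body to tell them apart. The sketch below is not pWebArc's code and not its WRR format; `CapturedRequest`, `captureKey`, and `indexCaptures` are made-up names used only for illustration.

```typescript
// Hypothetical record of one captured request (not the WRR format).
interface CapturedRequest {
  method: string;
  url: string;
  body: string; // empty for requests without a body
}

// Key that distinguishes captures by method, URL, and request body.
function captureKey(r: CapturedRequest): string {
  return JSON.stringify([r.method.toUpperCase(), r.url, r.body]);
}

// Index captures so that distinct POST bodies stay distinct entries.
function indexCaptures(
  captures: CapturedRequest[],
): Map<string, CapturedRequest> {
  const index = new Map<string, CapturedRequest>();
  for (const c of captures) index.set(captureKey(c), c);
  return index;
}

const search = "https://example.org/search";
const captures: CapturedRequest[] = [
  { method: "POST", url: search, body: "q=web+archiving" },
  { method: "POST", url: search, body: "q=shadow+dom" },
];

// Keyed by URL alone these two would collide; with the body they do not.
console.log(indexCaptures(captures).size); // 2
```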
For me, at the moment, the most obvious things pWebArc (as it is in my local git repo) lacks are per-host profile inheritance and WebSockets capture.
But, for WebSockets Firefox provides no APIs, so I can't do anything there without patching it first, which is a very low priority ATM.
Also, why should pWebArc care about shadow DOM?
> I just meant I have some code I can share that you might find useful for your own project
In general, yes, I would at the very least read most of (and would not be opposed to borrowing some of) your code if you would publish those snippets under a license compatible with GPLv3+.
After all, I did read most of the source of archiveweb.page while making this (and then did almost everything almost completely differently, but, oh, well).
If I were to judge just from looking at your screenshot, I don't expect I would borrow much, since I don't think pWebArc should include most of the things I can see there, but every bit helps, I suppose.
Hi! I'm the ArchiveBox maintainer and I just found your project.
It looks pretty sweet, I've been dreaming about in-browser archiving for a while now and actually implemented my own puppeteer/CDP extension to do something very similar to yours. (it records live pages from within the browser extension context and saves into archivebox)
I have a ton of asset-extraction and browser-automation-detection-avoidance snippets (10k LOC+) to share if you're interested, maybe it could save you a lot of time with your work.
ArchiveBox's core is still focused on saving on a separate machine, but I'm happy to share my side-project work on in-browser archiving with other projects so it doesn't go to waste.
Would love to have a call/chat sometime if you're interested:
https://calendly.com/nicksweeting/choose-a-time or https://sweeting.me/#contact (click for email addr)
Also you should go to DWeb camp (https://dwebcamp.org/), it's the best archiving conference imo and it's not marketed very heavily but lots of great people attend including the Webrecorder team and Archive.org