Streams support (pt. 2) #177
Conversation
@WebReflection Looks like we have typing issues, maybe we should wait for #168 first before merging this one? (XMLDocument is exported from 'linkedom/types/xml/document'). Edit: Never mind, the correct path was 'linkedom/types/esm/xml/document'.
dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.
Fair enough. There is not much to do here anyway. For future reference, I think users will have to explicitly specify the argument type of the 'document' event callback:

const domStream = new DOMStream('text/xml', name => name === 'hotelName')
  .on('document', async (doc: XMLDocument) => {
    // ...
  })
code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.
@WebReflection I wrote more tests to cover the new feature. Is it good enough?
@iMrDJAi you can
@iMrDJAi coverage is down to 92% from 100% https://coveralls.io/builds/53677686 ... if there's some obvious case you think should not be covered you can use special comments around it, but as this is 100% new code I'd like to see it covered.
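The "special comments" mentioned here are presumably istanbul/nyc (or c8) coverage hints. A minimal sketch of both styles; `parseChunk` and `unreachableFallback` are made-up illustrative functions, not linkedom code:

```javascript
// Hypothetical helper, only here to demonstrate coverage hint comments.
function parseChunk(chunk) {
  // nyc/istanbul: exclude the defensive branch below from coverage.
  /* istanbul ignore if */
  if (chunk == null) return '';
  return String(chunk).trim();
}

// c8 (V8-native coverage, what coveralls may consume) has an
// equivalent line-count marker:
/* c8 ignore next 3 */
function unreachableFallback() {
  return null;
}

console.log(parseChunk('  <xml/>  ')); // '<xml/>'
```

Either marker tells the coverage tool to treat the annotated statement or lines as covered, which is what "should not be covered" cases call for.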
@WebReflection I don't have to write more tests; I'll do what you've suggested earlier.
will try to have a closer look locally soon, thanks for the refactoring though!
No problem!
a possible stopper #180 this module works in JS Workers and if the stream implementation in third-party libraries uses something not Web compatible, I am not sure (actually I am) this should move forward ... it would rather be a dedicated parser usable only in NodeJS envs ... thoughts?
the stream uses NodeJS specific modules, including emitter, so pushing this change will break everything out there based on Cloudflare Workers or just Web JS Workers ... this is not going to happen, if my understanding is correct ... feel free to fork and make this module a NodeJS only project, as I have no interest in doing that, because there are already consumers using this module not as a crawler and that's the promise behind it: this module is not a crawler and never wants to be one.
besides apologizing for my last comment, this PR made me realize I need to improve testing via puppeteer or similar and be sure the module works in browsers, not just in servers. I really did appreciate all the effort around this PR, it's just sneaky for the whole use case this module serves (not on purpose, but I'm glad I didn't just merge it)
@WebReflection That's actually a concern I had at the beginning (#174):
But there is definitely a solution for that. How about an env check? Polyfills? We may always include polyfills for Node. And regarding Cloudflare Workers, it would work even without including polyfills: https://developers.cloudflare.com/workers/wrangler/configuration/#node-compatibility. I'll try with it; in the end it's all about installing a rollup plugin.
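The "env check" could be as simple as runtime detection, so Node-only stream code is only selected where it exists. A sketch; the backend names below are hypothetical, not linkedom's actual layout:

```javascript
// Detect Node.js at runtime without assuming Node-only globals exist.
const isNode =
  typeof process !== 'undefined' &&
  process.versions != null &&
  process.versions.node != null;

// Hypothetical dispatch: prefer the Node 'stream'/'events' backed
// implementation under Node, fall back to Web-compatible code elsewhere
// (browsers, Service Workers, Cloudflare Workers).
const streamBackend = isNode ? 'node-stream-backend' : 'web-stream-backend';

console.log(streamBackend); // 'node-stream-backend' when run under Node.js
```

Workers and browsers would take the `web-stream-backend` path without ever touching Node-specific modules, which is the breakage the thread is worried about.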
It needs to be tested
@iMrDJAi have you tried using the WebStream API to achieve those 🔥 awesome 🔥 benchmarks? It is supported on all platforms:
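For context on the suggestion: the WHATWG Streams API (`TransformStream` and friends) is a global in browsers, Deno, Cloudflare Workers, and Node 18+, so it needs no `stream` or `events` import. A toy sketch of the pattern, with uppercasing standing in for an incremental parsing step:

```javascript
// A TransformStream that uppercases text chunks; a stand-in for a
// parser stage. No Node-specific modules are involved.
async function demo() {
  const upper = new TransformStream({
    transform(chunk, controller) {
      controller.enqueue(String(chunk).toUpperCase());
    }
  });

  const writer = upper.writable.getWriter();
  writer.write('<hotel');
  writer.write('Name/>');
  writer.close();

  // ReadableStream is async-iterable in Node and Deno.
  const out = [];
  for await (const chunk of upper.readable) out.push(chunk);
  return out.join('');
}

demo().then(console.log); // '<HOTELNAME/>'
```

The same code runs unchanged in a Worker, which is exactly the portability constraint discussed above.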
@chr1se Is the WebStream API considered a stable feature? Also,
This module works in WebWorkers too … worker.js is meant to provide SSR for Service Workers PWA … polyfills are ok only if tiny and battle tested, otherwise they are a burden to maintain and I’ve no capacity … is a fork targeting NodeJS only with easy upstream sync really off the table? I don’t want to ruin this project’s scope which is, once again, not crawling.
@WebReflection Not sure why you assume that I'm interested in crawling, this is by no means my intention; just because the person who opened the first issue (#98 (comment)) was trying to crawl doesn't mean that everyone else is doing the same. What I'm doing here is trying to achieve what
Sure it provided a huge improvement, but still, it doesn't scale well. ¯\_(ツ)_/¯ I'm aware that you're a busy man, and I respect that; however, it's unlikely that anything I've added would force you to take an action in the future or cause complexity in any way, and as for polyfills, the build step is automatic, so I'm not sure what you mean by "burden to maintain".
No, that would be overkill; most of the files are already classes, and classes are extendable?
like I've said, if the worker.js file works in browsers out of the box and the resulting size is not increased by too many bytes, I'd be OK, but it's a priority of this project to work everywhere and not only in NodeJS ... bun, deno, everywhere, like it does right now already. the polyfill part is in case they patch some breaking change and I get the blame, something unfortunately frequent in the npm world. P.S. not a crawler means it's not the goal of this project to consume the internet at light speed ... its parser is meant for templates, and the fact it works and doesn't throw or bail out on bigger templates is indeed part of the goal. The heavily benchmarked websites and pages are there to demo that this module won't crash ... parsing huge MBs of sites is not the main goal. If this is not clear in the README I can change it and make it more clear.
@WebReflection Hello, sorry for the delay. Since polyfills add around 200 kb to the 300 kb |
What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default
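A dedicated `linkedom/stream` export like the one proposed could be wired up with a conditional `"exports"` map in package.json. A sketch; the file paths are hypothetical, not the repo's actual layout:

```json
{
  "exports": {
    ".": "./esm/index.js",
    "./stream": {
      "node": "./node/stream.js",
      "default": "./esm/stream.js"
    }
  }
}
```

With a map like this, `import ... from 'linkedom/stream'` resolves to the Node-backed build under Node (via the standard `"node"` resolution condition) and a Web-compatible build everywhere else, while the main `linkedom` entry stays untouched and most code gets shared.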
Actually, that's a good idea! A mono repo with multiple npm packages would work. Would you like to proceed?
Closed since I got no confirmation from the maintainer to proceed.
Introducing DOMStream: an efficient way of parsing very large HTML/XML documents!

Usage:

Using DOMParser: Time: ~11mins, Memory: ~3GB (I got a "heap out of memory" error but I was able to resume from the JavaScript debug console).
Using DOMStream 🔥: Time: ~27s, Memory: ~13MB.