
Streams support (pt. 2) #177

Closed
wants to merge 14 commits

Conversation

iMrDJAi

@iMrDJAi iMrDJAi commented Oct 24, 2022

Introducing DOMStream: an efficient way to parse very large HTML/XML documents!

Usage:

import fs from 'node:fs'
import {join} from 'node:path'
import {DOMStream} from 'linkedom'

const src = fs.createReadStream(join(__dirname, './big.xml')) // ~80 MB!
const domStream = new DOMStream('text/xml', (name, _attributes) => {
  // From htmlparser2's `onopentag`
  return name === 'hotelName'
})

src.pipe(domStream).on('document', doc => {
  // Magic!
  console.log(doc.documentElement.outerHTML)
})

Using DOMParser: Time: ~11 mins, Memory: ~3 GB (I got a heap out of memory error, but I was able to resume from the JavaScript debug console).


Using DOMStream 🔥: Time: ~27s, Memory: ~13MB.


@iMrDJAi iMrDJAi mentioned this pull request Oct 24, 2022
@iMrDJAi
Author

iMrDJAi commented Oct 24, 2022

@WebReflection Looks like we have typing issues. Maybe we should wait for #168 before merging this one? (XMLDocument is exported from 'linkedom/types/xml/document'.)


Edit: Never mind, the correct path was 'linkedom/types/esm/xml/document'.

@WebReflection
Owner

dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.

@iMrDJAi
Author

iMrDJAi commented Oct 24, 2022

> dunno mate, I really don't have any of your (folks) problems with LinkeDOOM because I don't care a bit about TS ... but sure ... as that's meant to solve the problem you mentioned, let's wait.

Fair enough. There is not much to do here anyway; tsc doesn't support @property yet (microsoft/TypeScript#28730).

For future reference, I think users will have to explicitly specify the argument type of the 'document' event callback:

const domStream = new DOMStream('text/xml', name => name === 'hotelName')
  .on('document', async (doc: XMLDocument) => {
    //
  })

@WebReflection
Owner

code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.

@iMrDJAi
Author

iMrDJAi commented Oct 27, 2022

> code coverage is not happy ... the goal here is 100% or nothing so far, it'd be great to keep it as such, thanks.

@WebReflection I wrote more tests to cover the new feature. Is it good enough?

@WebReflection
Owner

@iMrDJAi you can run npm test locally and see the table with the code coverage score ... if it's not 100% across all rows and fields, it's not good enough; if it is, great 🥳

@WebReflection
Copy link
Owner

@iMrDJAi coverage down to 92% from 100% https://coveralls.io/builds/53677686 ... if there's some obvious case you think should not be covered, you can use special comments around it, but as this is 100% new code I'd like to see it covered.
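For reference, the "special comments" are coverage-ignore directives; the exact syntax depends on the coverage tool (istanbul/nyc use `istanbul ignore`, c8 uses `c8 ignore`). A minimal sketch, with an illustrative function name not taken from the PR:

```javascript
/* istanbul ignore next -- defensive branch, not reachable from tests */
function assertUnreachable() {
  // Branches like this can be excluded from the coverage report
  // rather than tested, when exercising them is impractical.
  throw new Error('unreachable');
}
```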

@iMrDJAi
Author

iMrDJAi commented Oct 28, 2022

> P.S. it is very possible that sharing code will also help with code coverage

@WebReflection I don't have to write more tests; I'll do what you've suggested earlier.

@WebReflection
Owner

will try to have a closer look locally soon, thanks for the refactoring though!

@iMrDJAi
Author

iMrDJAi commented Oct 29, 2022

> will try to have a closer look locally soon, thanks for the refactoring though!

No problem!

@WebReflection
Owner

WebReflection commented Nov 4, 2022

a possible stopper #180

this module works in JS Workers, and if the stream implementation in third-party libraries uses something not Web compatible, I am not sure (actually I am) this should move forward ... it would rather be a dedicated parser usable only in NodeJS envs ... thoughts?

@WebReflection
Owner

WebReflection commented Nov 4, 2022

the stream uses NodeJS-specific modules, including emitter, so pushing this change will break everything out there based on Cloudflare Workers or just Web JS Workers ... this is not going to happen, if my understanding is correct ... feel free to fork and make this module a NodeJS-only project, as I have no interest in doing that, because there are already consumers using this module not as a crawler, and that's the promise behind it: this module is not a crawler and never wants to be one.

@WebReflection
Owner

WebReflection commented Nov 4, 2022

besides apologizing for my last comment, this PR made me realize I need to improve testing via puppeteer or similar and be sure the module works in browsers, not just on servers. I really did appreciate all the effort around this PR; it's just sneaky for the whole use case this module serves (not on purpose, but I'm glad I didn't just merge it)

@iMrDJAi
Author

iMrDJAi commented Nov 5, 2022

@WebReflection That's actually a concern I had at the beginning (#174):

Worried about Node.js streams support for cross environments (browser/deno), including types.

But there is definitely a solution for that. How about an env check? Polyfills? We may always include polyfills for node stream (streams are instances of events) and string_decoder.

And regarding Cloudflare Workers, it would work even without including polyfills: https://developers.cloudflare.com/workers/wrangler/configuration/#node-compatibility.

I'll give it a try; it's all about installing a rollup plugin in the end.
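The env check mentioned above could look like this minimal sketch (names are illustrative, not from the PR), assuming we only need to distinguish Node-like runtimes from Web-like ones that expose Web Streams globals:

```javascript
// Minimal sketch of a runtime environment check. `process.versions.node`
// identifies Node-like runtimes; the global WritableStream constructor
// signals Web Streams support (browsers, workers, Deno, Node 18+).
const isNode =
  typeof process !== 'undefined' &&
  typeof process.versions?.node === 'string';

const hasWebStreams = typeof WritableStream === 'function';

// A loader could then pick the right implementation, e.g.:
// const impl = isNode ? nodeStreamImpl : webStreamImpl;
```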

@iMrDJAi
Author

iMrDJAi commented Nov 5, 2022

It needs to be tested

@chr1se

chr1se commented Nov 5, 2022

> polyfills? We may always include polyfills for node

@iMrDJAi have you tried using the Web Streams API to achieve those 🔥 awesome 🔥 benchmarks?

It is supported on all platforms:
- node
- deno-deploy
- cloudflare
- browser/webworkers
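A hedged sketch (not the PR's code) of what a DOMStream-like sink could look like on top of the Web Streams API, which is available in Node 18+, Deno, Cloudflare Workers, and browsers:

```javascript
// Sketch: a Web Streams sink that incrementally decodes incoming bytes
// and hands the text to a callback (where a streaming parser would sit).
function createHTMLSink(onText) {
  const decoder = new TextDecoder();
  return new WritableStream({
    write(chunk) {
      // Chunks arrive as Uint8Array from byte sources; decode incrementally
      // so multi-byte characters split across chunks are handled correctly.
      onText(decoder.decode(chunk, {stream: true}));
    },
    close() {
      onText(decoder.decode()); // flush any trailing bytes
    }
  });
}
```

In a browser the sink could be fed directly from a fetch() body via `response.body.pipeTo(sink)`.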

@iMrDJAi
Author

iMrDJAi commented Nov 5, 2022

@chr1se Is the Web Streams API considered a stable feature? Also, htmlparser2 depends on NodeJS streams, so a new adapter would be needed? Or we may just use both to achieve maximum compatibility (we can do a runtime environment check)?
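For illustration, bridging between the two stream worlds is possible with Node's built-in converters; this is a sketch under the assumption of Node 17+ (where Readable.toWeb exists, marked experimental), not the PR's code:

```javascript
import {Readable} from 'node:stream';

// Bridge a Node readable into a WHATWG ReadableStream, so the same
// Web-Streams-based sink could be fed either by Node file streams
// or by fetch() bodies in browsers/workers.
const nodeSrc = Readable.from(['<x>', 'hello', '</x>']);
const webSrc = Readable.toWeb(nodeSrc); // a WHATWG ReadableStream

let xml = '';
await webSrc.pipeTo(new WritableStream({
  write(chunk) { xml += chunk; } // string chunks pass through unchanged
}));
```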

@WebReflection
Owner

This module works in WebWorkers too … worker.js is meant to provide SSR for Service Workers PWA … polyfills are ok only if tiny and battle tested, otherwise they are a burden to maintain and I've no capacity … is a fork targeting NodeJS only, with easy upstream sync, really off the table? I don't want to ruin this project scope which is, once again, not crawling.
Maybe a plugin system to bring your own parser could be a better solution?

@iMrDJAi
Author

iMrDJAi commented Nov 5, 2022

> I don’t want to ruin this project scope which is, once again, not crawling.

@WebReflection Not sure why you assume that I'm interested in crawling; this is by no means my intention. Just because the person who opened the first issue (#98 (comment)) was trying to crawl doesn't mean that everyone else is doing the same. What I'm doing here is trying to achieve what linkedom has claimed to do:

> This structure also allows programs to avoid issues such as "Maximum call stack size exceeded" (basicHTML), or "JavaScript heap out of memory" crashes (JSDOM), thanks to its reduced usage of memory and zero stacks involved, hence scaling better from small to very big documents.

Sure, it provided a huge improvement, but still, it doesn't scale well. ¯\_(ツ)_/¯
The point is, this is within the scope of the library, according to the README.

I'm aware that you're a busy man, and I respect that. However, it's unlikely that anything I've added would force you to take action in the future or cause complexity in any way. As for polyfills, the build step is automatic; I'm not sure what you mean by "burden to maintain".

> Maybe a plugin system to bring your own parser could be a better solution?

No, that would be overkill. Most of the files are already classes, and classes are extendable?
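The "classes are extendable" point can be illustrated with a generic sketch; the class names here are hypothetical, not linkedom's API:

```javascript
// Hypothetical illustration: subclassing to swap parsing behavior
// instead of introducing a plugin system.
class BaseParser {
  parse(text) {
    return {source: text, mode: 'buffered'};
  }
}

class StreamingParser extends BaseParser {
  parse(text) {
    // Reuse the base behavior, override only what differs.
    return {...super.parse(text), mode: 'streaming'};
  }
}
```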

@WebReflection
Owner

WebReflection commented Nov 6, 2022

like I've said, if the worker.js file works in browsers out of the box and the resulting size is not increased by too many bytes, I'd be OK, but it's a priority of this project to work everywhere and not only in NodeJS ... bun, deno, everywhere, like it does right now already. the polyfill part is in case they patch some breaking change and I get the blame, something unfortunately frequent in the npm world.

P.S. not a crawler means it's not the goal of this project to consume the internet at light speed ... its parser is meant for templates, and the fact it works and doesn't throw or bail out on bigger templates is indeed part of the goal. The heavily benchmarked websites and pages are there to demo that this module won't crash ... parsing huge MBs of sites is not the main goal. If this is not clear in the README, I can change it to make it more clear.

@iMrDJAi
Author

iMrDJAi commented Nov 10, 2022

@WebReflection Hello, sorry for the delay. Since polyfills add around 200 kB to the 300 kB worker.js (less if minified), are you okay with implementing the Web Streams API instead? Otherwise, I'm okay with not merging and keeping my changes in my fork (unless you become interested or this feature gains more demand in the future; then I'll be happy to reopen the pull request).

@WebReflection
Owner

WebReflection commented Nov 10, 2022

What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default

@iMrDJAi
Author

iMrDJAi commented Nov 17, 2022

> What are the numbers once minified and gzipped? But yeah, 3/4th of the current size for polyfills doesn’t look too compelling to me … but what if I publish a branch myself with the streaming option as linkedom-stream so that both maintaining it and synchronizing it would be under this repo? I’d be ok with that, or even a dedicated linkedom/stream export so no branch dance is needed, and most code gets shared by default

Actually, that's a good idea! A monorepo with multiple npm packages would work. Would you like to proceed?

@iMrDJAi
Author

iMrDJAi commented May 21, 2023

Closed since I got no confirmation from the maintainer to proceed.

@iMrDJAi iMrDJAi closed this May 21, 2023