Parsing fails and there is raw html code in rendered html #200

Open
Appress opened this issue Mar 20, 2023 · 1 comment
Labels
3rd Party Issue from external dependencies

Comments

@Appress

Appress commented Mar 20, 2023

Hi there,
Parsing fails for some pages (e.g. this article).

To reproduce, open the generated HTML in a browser:

    const document = (new DOMParser).parseFromString(htmlFromTheArticle, 'text/html');
    const html = document.body.innerHTML;

Instead of the original page, the output now includes raw HTML code:

Διαβάστε το πλήρες κείμενο του σημειώματος του CEO της UBS στο&nbsp;<a href="https://www.newmoney.gr/roh/bloomberg/to-esoteriko-simioma-tou-ceo-tis-ubs-pros-tous-ergazomenous-meta-tin-exagora-tis-credit-suisse/" target="_blank" rel="noopener noreferrer">newmoney.gr</a><br> <br> <strong><a href="https://www.protothema.gr/oles-oi-eidiseis/" target="_blank" rel="noopener noreferrer">Ειδήσεις σήμερα:</a><br> <br> <a href="https://www.protothema.gr/greece/article/1351532/xanthi-ston-eisaggelea-simera-o-36hronos-pou-skotose-ton-45hrono-epeidi-ton-theorise-roufiano/" target="_blank" rel="noopener noreferrer">...
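Since this affects many documents, a rough automated check can help find them. The sketch below assumes the mis-parsed markup ends up in text nodes, which `innerHTML` serializes with escaped angle brackets (`&lt;` and `&gt;`). The helper name `countEscapedTags` is hypothetical and not part of any library mentioned here. Pages that legitimately display code samples will also match, so treat a non-zero count only as a hint.

```typescript
// Hypothetical helper, for illustration only.
// Counts escaped tag-like sequences such as "&lt;a href=...&gt;" in serialized HTML.
function countEscapedTags(serializedHtml: string): number {
  // "&lt;", an optional "/", a letter, then lazily anything up to the next "&gt;"
  const matches = serializedHtml.match(/&lt;\/?[a-zA-Z].*?&gt;/g);
  return matches === null ? 0 : matches.length;
}
```

It could be run over the `html` string from the snippet above, comparing the count between htmlparser2 versions.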

This has already happened with many HTML documents. The culprit is htmlparser2, if I downgrade to v6.1.0, it works properly.

I tried to debug it, and the problem originates in Tokenizer.ts. When I simply replace these lines

            if (this.isSpecial) {
                this.state = State.InSpecialTag;
                this.sequenceIndex = 0;
            } else {
                this.state = State.Text;
            } 

with

    this.state = State.Text;

It works properly. I'm not sure what the proper fix is, one that would not hurt the performance of htmlparser2, so I opened this issue instead.
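To illustrate why that branch could matter, here is a toy model of a tokenizer with a "special tag" state. It is not htmlparser2's Tokenizer: the function `toyTokenize`, the `honorSpecial` flag, and the set of special tag names are all assumptions made for illustration. The sketch assumes that, once in the special state, everything up to a matching closing tag is emitted as plain text. Whether this is what actually happens for the article above is not established here.

```typescript
// Toy model only: NOT htmlparser2's Tokenizer. All names are hypothetical.
type Token = { kind: "tag" | "text"; value: string };

// Assumed set of "special" tags, chosen for illustration.
const SPECIAL_TAGS = new Set(["script", "style", "title"]);

function toyTokenize(input: string, honorSpecial: boolean): Token[] {
  const tokens: Token[] = [];
  let i = 0;
  while (i < input.length) {
    if (input[i] === "<") {
      const end = input.indexOf(">", i);
      if (end === -1) {
        // Unterminated tag: treat the remainder as text.
        tokens.push({ kind: "text", value: input.slice(i) });
        break;
      }
      const tag = input.slice(i, end + 1);
      tokens.push({ kind: "tag", value: tag });
      i = end + 1;
      // Opening tag name, if any (closing tags do not match this pattern).
      const name = /^<([a-zA-Z][a-zA-Z0-9]*)/.exec(tag)?.[1]?.toLowerCase();
      if (honorSpecial && name !== undefined && SPECIAL_TAGS.has(name)) {
        // "Special" state: swallow everything up to the matching close tag as text.
        const close = input.toLowerCase().indexOf(`</${name}`, i);
        const stop = close === -1 ? input.length : close;
        if (stop > i) tokens.push({ kind: "text", value: input.slice(i, stop) });
        i = stop;
      }
    } else {
      // Ordinary text runs until the next "<".
      const next = input.indexOf("<", i);
      const stop = next === -1 ? input.length : next;
      tokens.push({ kind: "text", value: input.slice(i, stop) });
      i = stop;
    }
  }
  return tokens;
}
```

In this toy, an unclosed special tag followed by ordinary markup yields one large text token when `honorSpecial` is `true`, and separate tag and text tokens when it is `false`. That mirrors the symptom (markup showing up as text), but always falling back to the text state would presumably also change how genuine `<script>` or `<style>` content is handled, so it is more of a diagnostic than a fix.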

@WebReflection
Owner

The culprit is htmlparser2, if I downgrade to v6.1.0

So... this bug is in a library used by this repository? If that's the case, what are you expecting me to do here? 🤔

@WebReflection WebReflection added the 3rd Party Issue from external dependencies label Mar 20, 2023