Not encoding UTF-8 correctly #18

Open
shtse8 opened this issue Feb 27, 2017 · 11 comments

shtse8 commented Feb 27, 2017

Code to reproduce:

use \Wa72\HtmlPageDom\HtmlPageCrawler;
$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body>
</html>
EOF;
$document = new HtmlPageCrawler($html);
echo $document->saveHTML();

Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8"><title>&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;</title></head><body>
&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;
</body></html>

Expected Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body></html>

This is a known issue with PHP's DOMDocument. Here is the reference:
http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly

Author

shtse8 commented Feb 27, 2017

I also tried Symfony's DomCrawler, and it handles UTF-8 without any problem.

Code:

use Symfony\Component\DomCrawler\Crawler;
$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body>
</html>
EOF;
$crawler = new Crawler($html);
echo $crawler->html();

Result:

<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
</head>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body>

It turns out they have already fixed this bug:
symfony/dom-crawler#4

Author

shtse8 commented Feb 27, 2017

I am trying to fix this problem in #19.

@kukungkung

I think this may help you:
curl_setopt($ch, CURLOPT_ENCODING, 'UTF-8');

@yellow1912

Any update on this issue? Do we still have a problem with UTF-8? If it still exists, this is a huge problem, since most sites use UTF-8 anyway.


havran commented Mar 4, 2019

Same problem here, in v1.3.

Contributor

glensc commented Mar 4, 2019

I just decode entities after the save:

        $html = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

Contributor

glensc commented Mar 4, 2019

It seems the underlying problem is that symfony/dom-crawler converts the content to entities to avoid some other bugs:


havran commented Mar 4, 2019

> I just decode entities after the save:
>
>         $html = html_entity_decode($html, ENT_NOQUOTES, 'UTF-8');

I don't think this is a good idea, because it decodes all entities in the HTML (for example, I have a larger document that may legitimately use &gt; or &nbsp;).

I found a solution that works for me. I changed this line:

return $this->getDOMDocument()->saveHTML();

to:

return $this->getDOMDocument()->saveHTML($this->getDOMDocument()->documentElement);

Based on https://stackoverflow.com/a/20675396


havran commented Mar 4, 2019

However, for some reason my solution removes the DOCTYPE in this test script:

<?php

$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#">
  <head>
    <meta charset="UTF-8">
    <title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
  </head>
  <body>
    网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
  </body>
</html>
EOF;

use \Wa72\HtmlPageDom\HtmlPageCrawler;
$document = new HtmlPageCrawler($html);
echo "--- HtmlPageDom -----------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $document->saveHTML();
echo PHP_EOL;

use Symfony\Component\DomCrawler\Crawler;
$crawler = new Crawler($html);
echo "--- DomCrawler ------------------------------------------------------------" .PHP_EOL.PHP_EOL;
echo $crawler->html();

This is output:

vagrant@d8:/data/uniweb/uniweb-cms/cms3[feature/UCMS-313-content-migration-blogs *]$ drush @lv scr test.php
--- HtmlPageDom -----------------------------------------------------------

<html prefix="og: http://ogp.me/ns#">
<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
</head>
<body>
    网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
  </body>
</html>
--- DomCrawler ------------------------------------------------------------

<head>
<meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
</head>
<body>
    网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
  </body>
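
The missing DOCTYPE follows from how the call works: passing a node to saveHTML() serializes only that node, and the doctype is a sibling of documentElement, not a child of it. One possible way to get it back is sketched below. The helper name saveHtmlWithDoctype is hypothetical (it is not part of HtmlPageDom or DomCrawler), and the sketch assumes a simple HTML5-style doctype with no public or system identifier. Whether the characters come out as raw UTF-8 also depends on libxml2 recognizing the meta charset tag when the document is loaded.

```php
<?php
// Hypothetical helper, not part of any library: serialize the root element
// and re-attach a simple doctype in front of it.
function saveHtmlWithDoctype(DOMDocument $doc): string
{
    // Serializing a single node skips the doctype, which lives next to
    // documentElement in the tree rather than inside it.
    $html = $doc->saveHTML($doc->documentElement);

    // Assumption: an HTML5-style doctype (<!DOCTYPE html>) with no
    // public or system identifier to reproduce.
    if ($doc->doctype !== null) {
        $html = '<!DOCTYPE ' . $doc->doctype->name . ">\n" . $html;
    }

    return $html;
}

libxml_use_internal_errors(true); // keep parser warnings out of the output

$doc = new DOMDocument();
$doc->loadHTML(
    '<!DOCTYPE html><html><head><meta charset="UTF-8"><title>网友</title></head>'
    . '<body>网友</body></html>'
);

$out = saveHtmlWithDoctype($doc);
echo $out, PHP_EOL;
```

A document with a legacy doctype (for example XHTML 1.0 Transitional) would also need the publicId and systemId properties written out, which this sketch does not do.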

Contributor

glensc commented Mar 4, 2019

html_entity_decode is perfectly valid: if you want &amp; in the output, it should appear in the source document as &amp;amp;.
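
To make the two positions above concrete, here is a minimal sketch comparing a blanket html_entity_decode() call with a narrower alternative that decodes only decimal numeric entities above the ASCII range. The function name decodeNonAsciiNumericEntities is hypothetical and not part of any library. It assumes the unwanted entities are decimal references such as &#32593;, as in the output reported at the top of this issue.

```php
<?php
// Hypothetical helper: decode only decimal numeric entities for code points
// above 127, leaving named entities (&gt;, &nbsp;, &amp;) and ASCII numeric
// entities (&#60;) untouched.
function decodeNonAsciiNumericEntities(string $html): string
{
    $result = preg_replace_callback(
        '/&#(\d+);/',
        static function (array $m): string {
            return ((int) $m[1] > 127)
                ? html_entity_decode($m[0], ENT_NOQUOTES, 'UTF-8')
                : $m[0];
        },
        $html
    );

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}

$input = '<p>&#32593;&#21451; a &gt; b&nbsp;&amp;amp; &#60;</p>';

// Narrow decode: only the two Chinese characters change.
$decoded = decodeNonAsciiNumericEntities($input);

// Blanket decode: &gt;, &nbsp;, &#60; and one level of &amp; are decoded too.
$naive = html_entity_decode($input, ENT_NOQUOTES, 'UTF-8');

echo $decoded, PHP_EOL;
echo $naive, PHP_EOL;
```

Note that a numeric non-breaking space (&#160;) would still be decoded by the narrow version, since its code point is above 127.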


havran commented Mar 9, 2019

I made a small test script to compare various DOM parsers: https://github.com/havran/php-html-parsers-test

The old simplehtmldom still seems to be the best :-).
