spec: resolved encoding name lowercased for ISO-8859-x/KOI8-x/EUC-JP (Document.encoding casing) #97

@gaborbernat

Summary

Document.encoding returns the resolved encoding's label (ASCII-lowercased) instead of the exact-case canonical Name for ISO-8859-2, ISO-8859-5, ISO-8859-7, ISO-8859-15, KOI8-R, KOI8-U, and EUC-JP. The CJK/Unicode encodings already return the correct exact-case name (Shift_JIS, Big5, GBK, EUC-KR, UTF-8), which shows the lowercase ISO-8859/KOI8/EUC-JP entries are an oversight rather than deliberate normalization.

Spec

Encoding Standard §4.2 "Names and labels":

An encoding has a name and one or more labels...

The encodings table's Name column reads exactly ISO-8859-2, ISO-8859-5, ISO-8859-7, ISO-8859-15, KOI8-R, KOI8-U, EUC-JP (and Shift_JIS, Big5, GBK, EUC-KR, UTF-8). The spec further notes:

for each encoding, ASCII-lowercasing its name yields one of its labels

i.e. the lowercase form is a label, not the name.

DOM Standard §4.5: the characterSet/charset/inputEncoding getter steps return "this's ... encoding's name" — the exact-case Name column value, which is what Document.encoding surfaces.
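The name/label relationship described above can be sketched in a few lines. This is a minimal illustration, not turbohtml's implementation: `CANONICAL_NAMES`, `LABEL_TO_NAME`, and `name_for_label` are hypothetical names, and the table only covers the encodings listed in this issue.

```python
# Exact-case Names as quoted in this issue (a subset of the encodings table).
CANONICAL_NAMES = (
    "ISO-8859-2", "ISO-8859-5", "ISO-8859-7", "ISO-8859-15",
    "KOI8-R", "KOI8-U", "EUC-JP",
    "Shift_JIS", "Big5", "GBK", "EUC-KR", "UTF-8",
)

# ASCII-lowercasing a Name yields one of its labels, so the lowercase form
# can serve as a lookup key back to the Name. All Names here are pure ASCII,
# so str.lower() behaves as ASCII-lowercasing.
LABEL_TO_NAME = {name.lower(): name for name in CANONICAL_NAMES}


def name_for_label(label: str) -> str:
    """Return the exact-case Name for a label (hypothetical helper)."""
    return LABEL_TO_NAME[label.lower()]


print(name_for_label("iso-8859-2"))
print(name_for_label("shift_jis"))
```

Real labels include many aliases beyond the lowercased Name; this sketch deliberately ignores them.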

Repro

```python
import turbohtml

for c in ['iso-8859-2', 'iso-8859-5', 'iso-8859-7', 'iso-8859-15', 'koi8-r', 'koi8-u',
          'euc-jp', 'shift_jis', 'big5', 'gbk', 'euc-kr', 'utf-8']:
    # Declare the charset in a minimal document and print the resolved encoding name.
    print(c, '->', turbohtml.parse(('<meta charset="%s">\n\nx' % c).encode()).encoding)
```

Output:

```
iso-8859-2 -> iso-8859-2 # expected ISO-8859-2
iso-8859-5 -> iso-8859-5 # expected ISO-8859-5
iso-8859-7 -> iso-8859-7 # expected ISO-8859-7
iso-8859-15 -> iso-8859-15 # expected ISO-8859-15
koi8-r -> koi8-r # expected KOI8-R
koi8-u -> koi8-u # expected KOI8-U
euc-jp -> euc-jp # expected EUC-JP
shift_jis -> Shift_JIS # correct (exact-case Name)
big5 -> Big5 # correct
gbk -> GBK # correct
euc-kr -> EUC-KR # correct
utf-8 -> UTF-8 # correct
```

Expected vs actual

| charset label | spec Name (expected) | turbohtml actual |
| --- | --- | --- |
| iso-8859-2 | `ISO-8859-2` | `iso-8859-2` |
| iso-8859-5 | `ISO-8859-5` | `iso-8859-5` |
| iso-8859-7 | `ISO-8859-7` | `iso-8859-7` |
| iso-8859-15 | `ISO-8859-15` | `iso-8859-15` |
| koi8-r | `KOI8-R` | `koi8-r` |
| koi8-u | `KOI8-U` | `koi8-u` |
| euc-jp | `EUC-JP` | `euc-jp` |

The exact-case CJK/Unicode results indicate that the intent is to return the Name column, so the lowercase ISO-8859/KOI8/EUC-JP results are an inconsistency.
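A regression check for this comparison could look roughly like the sketch below. It is standalone and does not call turbohtml: `EXPECTED` holds the spec Names quoted in this issue, `REPORTED` is copied from the repro output, and `casing_mismatches` is a hypothetical helper name.

```python
# Spec Names keyed by the charset label used in the repro.
EXPECTED = {
    "iso-8859-2": "ISO-8859-2", "iso-8859-5": "ISO-8859-5",
    "iso-8859-7": "ISO-8859-7", "iso-8859-15": "ISO-8859-15",
    "koi8-r": "KOI8-R", "koi8-u": "KOI8-U", "euc-jp": "EUC-JP",
    "shift_jis": "Shift_JIS", "big5": "Big5", "gbk": "GBK",
    "euc-kr": "EUC-KR", "utf-8": "UTF-8",
}

# Values reported in the repro output of this issue.
REPORTED = {
    "iso-8859-2": "iso-8859-2", "iso-8859-5": "iso-8859-5",
    "iso-8859-7": "iso-8859-7", "iso-8859-15": "iso-8859-15",
    "koi8-r": "koi8-r", "koi8-u": "koi8-u", "euc-jp": "euc-jp",
    "shift_jis": "Shift_JIS", "big5": "Big5", "gbk": "GBK",
    "euc-kr": "EUC-KR", "utf-8": "UTF-8",
}


def casing_mismatches(reported, expected):
    """Map each label to (expected, actual) where the reported name differs."""
    return {
        label: (expected[label], actual)
        for label, actual in reported.items()
        if actual != expected[label]
    }


for label, (want, got) in casing_mismatches(REPORTED, EXPECTED).items():
    print(f"{label}: expected {want}, got {got}")
```

In a real test, `REPORTED` would presumably be built from the parser's output rather than hard-coded.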

html5lib (klass B — shared lag)

```python
import webencodings

for c in ['iso-8859-2', 'koi8-r', 'euc-jp', 'shift_jis', 'utf-8']:
    print(c, '->', webencodings.lookup(c).name)

# Output:
# iso-8859-2 -> iso-8859-2
# koi8-r -> koi8-r
# euc-jp -> euc-jp
# shift_jis -> shift_jis
# utf-8 -> utf-8
```

html5lib's webencodings ASCII-lowercases all names uniformly, so it diverges from the exact-case Name column for every encoding (a documented uniform normalization). Both implementations report a label where the spec requires the name, but turbohtml is also internally inconsistent: exact-case for CJK/Unicode, lowercase for ISO-8859/KOI8/EUC-JP.

Severity

Low — decoding itself is correct (the codec column is unaffected); only the reported Document.encoding name string casing is wrong.
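As a rough illustration of why casing alone does not affect decoding, the sketch below uses Python's standard `codecs` registry, which resolves encoding names case-insensitively. It says nothing about turbohtml's own codec table; `same_codec` is a hypothetical helper.

```python
import codecs


def same_codec(a: str, b: str) -> bool:
    """True if both spellings resolve to the same Python codec (hypothetical helper)."""
    return codecs.lookup(a).name == codecs.lookup(b).name


sample = bytes([0xB1])  # a single high byte, valid in the single-byte encodings below

for name in ("ISO-8859-2", "ISO-8859-5", "KOI8-R"):
    # Exact-case Name and lowercase label select the same codec...
    assert same_codec(name, name.lower())
    # ...and therefore decode the same bytes to the same text.
    assert sample.decode(name) == sample.decode(name.lower())

print("casing does not change the selected codec")
```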

Labels

bug