From 1e10bdb64b6fc9bc43005a6d07f0b2d1b98a27af Mon Sep 17 00:00:00 2001
From: "Michael[tm] Smith"
Date: Fri, 21 Aug 2020 12:26:03 +0900
Subject: [PATCH 1/2] Test the (meta) prescan algorithm
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This change adds a `preparsed` subdirectory in the `encoding` directory,
with tests for which the result of the *encoding sniffing algorithm* at
https://html.spec.whatwg.org/#encoding-sniffing-algorithm is the expected
result — that is, tests for which the expected result is the output of
running *only* the encoding sniffing algorithm (of which the main
sub-algorithm is the so-called “meta prescan”) — without also running the
tokenization state machine and tree-construction stage.

This change also adds a README file that explicitly documents what the
expected results for the encoding tests are, based on whether or not
they’re in the `preparsed` subdirectory.

Without those changes, it’s unclear whether the expected results shown in
the existing tests are for the output of fully parsing the test data —
through the tokenization state machine and tree-construction stage — or
instead just the output of the encoding sniffing algorithm only.

And without those changes, we also don’t have any tests a system can use
for testing only the output from the encoding sniffing algorithm.

Fixes https://github.com/html5lib/html5lib-tests/issues/28
---
 encoding/README.md            | 39 +++++++++++++++++++++++++++
 encoding/preparsed/tests1.dat | 51 +++++++++++++++++++++++++++++++++++
 2 files changed, 90 insertions(+)
 create mode 100644 encoding/README.md
 create mode 100644 encoding/preparsed/tests1.dat

diff --git a/encoding/README.md b/encoding/README.md
new file mode 100644
index 0000000..1641c49
--- /dev/null
+++ b/encoding/README.md
@@ -0,0 +1,39 @@
+Encoding Tests
+==============
+
+Each file containing encoding tests has any number of tests separated by
+two newlines (LF) and a single newline before the end of the file:
+
+    [TEST]LF
+    LF
+    [TEST]LF
+    LF
+    [TEST]LF
+
+...where [TEST] is the format documented below.
+
+Encoding test format
+====================
+
+Each test must begin with a string "\#data", followed by a newline (LF).
+All subsequent lines until a line that says "\#encoding" are the test data
+and must be passed to the system being tested unchanged, except with the
+final newline (on the last line) removed.
+
+Then there must be a line that says "\#encoding", followed by a newline
+(LF), followed by a string indicating an encoding name, followed by a newline
+(LF). The encoding name indicated is the expected character encoding for
+the output with the given test data as input.
+
+For the tests in the `preparsed` subdirectory, the encoding name indicated
+is the expected result of running the *encoding sniffing algorithm* at
+https://html.spec.whatwg.org/#encoding-sniffing-algorithm with the given
+test data as input; that is, it's the expected result of running *only* the
+*encoding sniffing algorithm* — without also running the tokenization state
+machine and tree-construction stage defined in the spec.
+
+For all tests outside the subdirectory named `preparsed`, the encoding name
+indicated is instead the expected character encoding for the output after
+fully parsing the given test data; that is, it's the expected character
+encoding for the output after running the tokenization state machine and
+tree-construction stage.
diff --git a/encoding/preparsed/tests1.dat b/encoding/preparsed/tests1.dat
new file mode 100644
index 0000000..2dd4801
--- /dev/null
+++ b/encoding/preparsed/tests1.dat
@@ -0,0 +1,51 @@
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+#encoding
+Windows-1252
+
+#data
+
+
+
+
+
+
+#encoding
+Windows-1252

From b6c4e3f21a1f8b5148cf9bda741ff7276f8cf574 Mon Sep 17 00:00:00 2001
From: "Michael[tm] Smith"
Date: Tue, 25 Aug 2020 02:08:38 +0900
Subject: [PATCH 2/2] Note preparsed encoding tests are first 1024 bytes

---
 encoding/README.md | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/encoding/README.md b/encoding/README.md
index 1641c49..6b4b186 100644
--- a/encoding/README.md
+++ b/encoding/README.md
@@ -30,7 +30,10 @@ is the expected result of running the *encoding sniffing algorithm* at
 https://html.spec.whatwg.org/#encoding-sniffing-algorithm with the given
 test data as input; that is, it's the expected result of running *only* the
 *encoding sniffing algorithm* — without also running the tokenization state
-machine and tree-construction stage defined in the spec.
+machine and tree-construction stage defined in the spec — and specifically,
+for running the *prescan the byte stream to determine its encoding*
+https://html.spec.whatwg.org/#prescan-a-byte-stream-to-determine-its-encoding
+algorithm on only the first 1024 bytes of the test data.

 For all tests outside the subdirectory named `preparsed`, the encoding name
 indicated is instead the expected character encoding for the output after
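
The test format documented in `encoding/README.md` could be read by a harness along the lines of the Python sketch below. It is not part of this patch series: the names `parse_encoding_tests` and `prescan_window` and the sample input are assumptions made for illustration, and the sketch assumes no line of test data is itself exactly `#data` or `#encoding`. It splits a `.dat` file into (test data, expected encoding name) pairs, dropping the final newline of the data as the README requires, and shows the 1024-byte window that PATCH 2/2 says the `preparsed` expectations are based on.

```python
from typing import List, Tuple

# Per the README note in PATCH 2/2, preparsed expectations are for the
# prescan run on only the first 1024 bytes of the test data.
PRESCAN_LIMIT = 1024


def parse_encoding_tests(raw: bytes) -> List[Tuple[bytes, str]]:
    """Split a .dat file into (test data, expected encoding name) pairs."""
    tests: List[Tuple[bytes, str]] = []
    lines = raw.split(b"\n")
    data_lines = None  # None until a "#data" line has been seen
    i = 0
    while i < len(lines):
        line = lines[i]
        if line == b"#data":
            data_lines = []
        elif line == b"#encoding" and data_lines is not None:
            if i + 1 >= len(lines):
                raise ValueError("#encoding line without an encoding name")
            # Joining with LF leaves no trailing newline, which matches
            # "except with the final newline (on the last line) removed".
            data = b"\n".join(data_lines)
            tests.append((data, lines[i + 1].decode("ascii")))
            data_lines = None
            i += 1  # skip the encoding-name line
        elif data_lines is not None:
            data_lines.append(line)
        i += 1
    return tests


def prescan_window(data: bytes) -> bytes:
    """Return the bytes a preparsed test expects the prescan to look at."""
    return data[:PRESCAN_LIMIT]


if __name__ == "__main__":
    # Hypothetical sample input, not taken from tests1.dat.
    sample = (
        b"#data\n<meta charset=utf-8>\n#encoding\nutf-8\n"
        b"\n"
        b"#data\nline one\nline two\n#encoding\nWindows-1252\n"
    )
    for data, expected in parse_encoding_tests(sample):
        print(expected, len(prescan_window(data)))
```

A harness would pass `prescan_window(data)` to the system under test for files under `preparsed`, pass the full `data` for all other encoding tests, and compare the reported encoding name with `expected`.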