This package offers a simple HTML parser, motivated by a desire to query the DOM and extract information from it.
The current parser is NOT spec compliant, and is not guaranteed to work on all HTML input. This may change.
package main

import html "../"
import "core:fmt"

main :: proc() {
	// Parse the input and free the document when main returns.
	doc := html.parse("<html><ul><li>one</li><li>two</li><li>three</li></ul></html>")
	defer html.document_delete(doc)

	// Walk every node depth-first and print it.
	iter := html.node_iterator_from_document(doc)
	for node in html.node_iterator_depth_first(&iter) {
		fmt.println(html.node_to_string(node))
	}
}
All strings on a Node are slices into the original input string. The dynamic arrays for the attributes and children are freed by [html.document_delete].
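
One consequence of node strings being slices into the input is that the input should stay alive for as long as the document is in use. The sketch below illustrates this with a heap-allocated input. It assumes that html.parse accepts a regular Odin string and does not copy it, which the text above suggests but does not state outright. The input contents are made up for illustration.

```odin
package main

import html "../"
import "core:fmt"
import "core:strings"

main :: proc() {
	// Heap-allocate the input so its lifetime is explicit (illustrative only).
	input := strings.clone("<html><p>hello</p></html>")
	// Deferred calls run in reverse order, so the document is deleted
	// first and the input it slices into is freed afterwards.
	defer delete(input)

	doc := html.parse(input)
	defer html.document_delete(doc)

	// Node strings are only safe to read while `input` is still alive.
	iter := html.node_iterator_from_document(doc)
	for node in html.node_iterator_depth_first(&iter) {
		fmt.println(html.node_to_string(node))
	}
}
```

If a node string needs to outlive the input, cloning it (for example with strings.clone) before freeing the input is one option.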
- record parse errors
- spec compliance
- respect content model: e.g. special handling for <script>, <pre>, etc.
- stream in source data with a reader
- support Unicode input instead of just ASCII
Special tags like <script> are not handled specially. Such a tag is expected to have its inner HTML treated as raw text, but this version of the parser parses script content as HTML.
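
A quick way to observe this limitation is to feed the parser a script body that contains markup-like characters and print the resulting nodes. This is a minimal sketch that uses only the calls shown in the example above. The input string is hypothetical, and the exact output depends on the parser version, so none is shown here.

```odin
package main

import html "../"
import "core:fmt"

main :: proc() {
	// Hypothetical input: the "<" inside the script body is the kind of
	// content a spec-compliant parser would keep as raw text.
	input := "<html><script>if (a < b) { run() }</script></html>"

	doc := html.parse(input)
	defer html.document_delete(doc)

	// Print every node to see how the script content was interpreted.
	iter := html.node_iterator_from_document(doc)
	for node in html.node_iterator_depth_first(&iter) {
		fmt.println(html.node_to_string(node))
	}
}
```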