This repository was archived by the owner on Oct 4, 2022. It is now read-only.

How does the tree parser work?

Hans-Christiaan Braun edited this page Jan 13, 2020 · 4 revisions

The tree parser takes a marked-up text and transforms it into a tree-like structure of nodes. Currently, only HTML is supported; support for Gutenberg blocks is planned. Until then, Gutenberg blocks can be parsed by converting them to HTML first and then building the tree.

Structure of the tree

A tree consists of nodes. Each node represents an element in the marked-up text, be it an article, section, heading, paragraph or other element. Nodes come in two types: structured nodes and leaf nodes.

Each element in the tree, including formatting elements, has a source code location. This marks the location of the element in the original source code.
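To make the two node types and the source code location concrete, here is a minimal sketch in Python. The class names, field names and the use of character offsets are assumptions made for illustration; they are not taken from the parser's actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SourceCodeLocation:
    """Character offsets of an element in the original source (assumed shape)."""
    start: int
    end: int


@dataclass
class Node:
    """Every node has a type and a source code location."""
    type: str
    location: SourceCodeLocation


@dataclass
class LeafNode(Node):
    """A node that only holds formatted text (heading, paragraph, list item)."""
    text: str = ""


@dataclass
class StructuredNode(Node):
    """A node that groups leaf nodes and other structured nodes."""
    children: List[Node] = field(default_factory=list)


source = "<section><p>Hello</p></section>"
paragraph = LeafNode("paragraph", SourceCodeLocation(9, 21), text="Hello")
section = StructuredNode("section", SourceCodeLocation(0, len(source)), [paragraph])

# The location lets us recover the exact markup a node came from.
loc = paragraph.location
print(source[loc.start:loc.end])
```

Keeping the location on every node means an analysis result can always be traced back to a position in the text the author wrote.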

An example

This example text consists of an article that contains a top-level heading and one section. The section, in turn, consists of a heading and a paragraph:

<article>
	<h1>The Stranger in the Night</h1>
	<section>
		<h2>Prelude</h2>
		<p>It was a <b>dark</b> and stormy night…</p>
	</section>
</article>

This ultimately leads to this tree structure:

[Figure: the resulting tree structure after parsing the above source code.]
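As a rough textual stand-in for the tree, the sketch below builds a nested structure from the example with Python's standard `html.parser` module and prints it as an indented outline. This is not the project's parser: the `root` node, the dictionary layout and the sets of structured and leaf tags are assumptions chosen for this example only.

```python
from html.parser import HTMLParser

STRUCTURED = {"article", "section"}                              # assumed structured tags
LEAVES = {"h1": "heading", "h2": "heading", "p": "paragraph"}    # assumed leaf tags


class OutlineBuilder(HTMLParser):
    """Builds a nested dict tree; formatting tags such as <b> create no node."""

    def __init__(self):
        super().__init__()
        self.root = {"type": "root", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURED:
            node = {"type": tag, "children": []}
        elif tag in LEAVES:
            node = {"type": LEAVES[tag], "text": ""}
        else:
            return  # formatting element: its text still reaches the open leaf
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in STRUCTURED or tag in LEAVES:
            self.stack.pop()

    def handle_data(self, data):
        current = self.stack[-1]
        if "text" in current:  # only leaf nodes collect text
            current["text"] += data


def outline(node, depth=0):
    """Return the tree as a list of indented lines."""
    label = node["type"]
    if "text" in node:
        label += ": " + node["text"]
    lines = ["  " * depth + label]
    for child in node.get("children", []):
        lines.extend(outline(child, depth + 1))
    return lines


source = """<article>
	<h1>The Stranger in the Night</h1>
	<section>
		<h2>Prelude</h2>
		<p>It was a <b>dark</b> and stormy night…</p>
	</section>
</article>"""

builder = OutlineBuilder()
builder.feed(source)
builder.close()
print("\n".join(outline(builder.root)))
```

The article and the section become structured nodes, while the two headings and the paragraph become leaf nodes that carry the text.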

Leaf nodes

Leaf nodes are nodes that may only contain formatted text. These include headings, paragraphs and list items.

The formatted text in leaf nodes is stored in a text container. This container separates the text from its formatting. This makes linguistic analysis of the text easier, since we want to focus on the contents and the meaning of the text, rather than its formatting.
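The separation of text and formatting can be sketched as follows. The example uses Python's standard `html.parser`; the class name and the `text_start` / `text_end` fields are hypothetical and only illustrate the idea of storing plain text next to a list of formatting elements that point into it.

```python
from html.parser import HTMLParser


class TextContainerBuilder(HTMLParser):
    """Collects plain text and, separately, the formatting applied to it."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.formatting = []   # finished formatting elements
        self._open = []        # formatting elements still waiting for their end tag

    def handle_starttag(self, tag, attrs):
        self._open.append({
            "type": tag,
            "attributes": dict(attrs),
            "text_start": len(self.text),  # position in the plain text
        })

    def handle_endtag(self, tag):
        # Close the most recently opened element of this type.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["type"] == tag:
                element = self._open.pop(i)
                element["text_end"] = len(self.text)
                self.formatting.append(element)
                break

    def handle_data(self, data):
        self.text += data


builder = TextContainerBuilder()
builder.feed("It was a <b>dark</b> and stormy night…")
builder.close()

bold = builder.formatting[0]
print(builder.text)
print(bold["type"], builder.text[bold["text_start"]:bold["text_end"]])
```

An analysis can now work on `builder.text` as one uninterrupted sentence and only consult the formatting list when it needs to know, for instance, which words are bold.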

Headings

A heading node represents a heading in the text. As a leaf node, it has a text container containing the node’s text and formatting elements. It also includes a level parameter.
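For HTML, one plausible way to fill the level parameter is to read it from the tag name, so that `<h2>` yields level 2. The helper below is a hypothetical sketch of that idea, not the parser's own function.

```python
import re
from typing import Optional


def heading_level(tag: str) -> Optional[int]:
    """Return 1-6 for the tags h1-h6, or None for any other tag."""
    match = re.fullmatch(r"h([1-6])", tag.lower())
    return int(match.group(1)) if match else None


print(heading_level("h2"))
print(heading_level("p"))
```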

List items

A list item represents an item in a list. List items are only allowed in list elements. As a leaf node, it has a text container containing the node’s text and formatting elements.

Paragraphs

A paragraph represents a paragraph in the text. Paragraphs can either be explicit or implicit.

Explicit paragraphs are those that are explicitly marked up as being paragraphs. In HTML, for example, these are texts that are enveloped in <p> tags.

Implicit paragraphs are texts that are not inside an explicitly defined heading, list item or paragraph. Since they do represent continuous pieces of text, we still want to be able to analyze them.
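The difference between explicit and implicit paragraphs can be sketched with Python's standard `html.parser`. The tag sets and the rule that any non-formatting tag ends an implicit paragraph are assumptions made for this example; the actual parser may decide this differently.

```python
from html.parser import HTMLParser

LEAF_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "li", "p"}
FORMATTING_TAGS = {"b", "i", "em", "strong", "a"}   # assumed subset


class ParagraphCollector(HTMLParser):
    """Collects paragraphs and labels each as explicit or implicit."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []         # [kind, text] pairs
        self._open_leaf = None       # leaf tag we are currently inside, if any
        self._implicit_open = False  # is the last paragraph an open implicit one?

    def handle_starttag(self, tag, attrs):
        if tag in FORMATTING_TAGS:
            return  # formatting never starts or ends a paragraph
        self._implicit_open = False
        if tag in LEAF_TAGS:
            self._open_leaf = tag
            if tag == "p":
                self.paragraphs.append(["explicit", ""])

    def handle_endtag(self, tag):
        if tag in FORMATTING_TAGS:
            return
        self._implicit_open = False
        if tag == self._open_leaf:
            self._open_leaf = None

    def handle_data(self, data):
        if self._open_leaf == "p":
            self.paragraphs[-1][1] += data
        elif self._open_leaf is None:
            if self._implicit_open:
                self.paragraphs[-1][1] += data
            elif data.strip():
                # Loose text outside any heading, list item or paragraph.
                self.paragraphs.append(["implicit", data])
                self._implicit_open = True


collector = ParagraphCollector()
collector.feed("<section>Some <b>loose</b> text.<p>A marked-up paragraph.</p></section>")
collector.close()

result = [(kind, text.strip()) for kind, text in collector.paragraphs]
print(result)
```

The loose text directly inside the section is kept as one implicit paragraph, including its bold word, so it can be analyzed just like the explicit one.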

Formatting elements

Formatting elements are elements that do not divide a text into parts, like leaf nodes or structured nodes do, but format it instead. They make, for example, some text bold, or they transform text into a link to another document.

We currently identify these formatting elements when parsing HTML.

Structured nodes

Structured nodes combine leaf nodes and other structured nodes to form the actual tree-like structure. Structured nodes are distinguished by their type. For example, an HTML <section> element has the type section. A structured node's child nodes are stored in its children property.

We distinguish one special type of structured node: a list. Lists can be either ordered or unordered, and may only contain list items.
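The rule that lists may only contain list items can be enforced when a child is added. The sketch below shows one way to do that; apart from the `type` and `children` properties mentioned above, all names (including the `ordered` flag and the `add` method) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ListItem:
    text: str


@dataclass
class Paragraph:
    text: str


@dataclass
class StructuredNode:
    type: str
    children: List[object] = field(default_factory=list)


@dataclass
class ListNode(StructuredNode):
    ordered: bool = False

    def add(self, child):
        """Append a child, rejecting anything that is not a list item."""
        if not isinstance(child, ListItem):
            raise TypeError("lists may only contain list items")
        self.children.append(child)


shopping = ListNode(type="list", ordered=False)
shopping.add(ListItem("Milk"))

try:
    shopping.add(Paragraph("Not allowed here"))
except TypeError as error:
    print(error)
```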

Ignored elements

Some elements that occur in the source text are purposely ignored and not added to the tree structure. These are elements that are either not shown on screen, like inline scripts or styling elements, or are not useful for linguistic analysis, like code snippets.
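Skipping such elements can be sketched with Python's standard `html.parser`. The set of ignored tags below is purely illustrative, based on the examples in the paragraph above (scripts, styling, code snippets); it is not the project's actual list.

```python
from html.parser import HTMLParser

IGNORED = {"script", "style", "code"}   # illustrative only, not the real list


class VisibleText(HTMLParser):
    """Collects text, skipping everything inside ignored elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._ignored_depth = 0   # > 0 while inside an ignored element

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self._ignored_depth += 1

    def handle_endtag(self, tag):
        if tag in IGNORED and self._ignored_depth > 0:
            self._ignored_depth -= 1

    def handle_data(self, data):
        if self._ignored_depth == 0:
            self.chunks.append(data)


parser = VisibleText()
parser.feed(
    "<p>Run <code>npm test</code> first.</p>"
    "<script>alert(1)</script><style>p{color:red}</style>"
)
parser.close()

# Collapse the gap left behind by the skipped <code> element.
text = " ".join("".join(parser.chunks).split())
print(text)
```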

Currently, these HTML elements are ignored.