HTML Parser in Go

This repository provides a minimal HTML parser written in Go without using any external libraries. The parser reads an HTML string, breaks it down into elements, and constructs a hierarchical structure of nodes, allowing you to inspect or manipulate the HTML content programmatically.

Features

Basic HTML Parsing: Parses HTML elements and text nodes.
Attributes Parsing: Handles element attributes.
Hierarchical Structure: Builds a tree structure representing the HTML document.
Console Output: Prints the HTML structure in a formatted way, resembling the original structure.
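
As an illustration of the attribute handling listed above, here is a minimal sketch of how the contents of an opening tag could be split into a tag name and an attribute map. The names splitTag, parseAttributes, and isSpace are hypothetical and are not taken from this repository. The sketch assumes attributes are written as name="value", name='value', name=value, or a bare name, with no spaces around the equals sign.

```go
package main

import (
	"fmt"
	"strings"
)

func isSpace(c byte) bool {
	return c == ' ' || c == '\t' || c == '\n' || c == '\r'
}

// parseAttributes scans a string such as ` class="a b" id=main hidden`
// and returns the attributes as a map. Bare names map to "".
func parseAttributes(s string) map[string]string {
	attrs := make(map[string]string)
	i := 0
	for i < len(s) {
		// Skip whitespace between attributes.
		for i < len(s) && isSpace(s[i]) {
			i++
		}
		// Read the attribute name up to '=' or whitespace.
		start := i
		for i < len(s) && s[i] != '=' && !isSpace(s[i]) {
			i++
		}
		name := s[start:i]
		if name == "" {
			if i < len(s) {
				i++ // skip a stray '=' so the loop always advances
			}
			continue
		}
		if i >= len(s) || s[i] != '=' {
			attrs[name] = "" // bare attribute without a value
			continue
		}
		i++ // consume '='
		if i < len(s) && (s[i] == '"' || s[i] == '\'') {
			quote := s[i]
			i++
			start = i
			for i < len(s) && s[i] != quote {
				i++
			}
			attrs[name] = s[start:i]
			if i < len(s) {
				i++ // consume the closing quote
			}
		} else {
			start = i
			for i < len(s) && !isSpace(s[i]) {
				i++
			}
			attrs[name] = s[start:i]
		}
	}
	return attrs
}

// splitTag takes the text between '<' and '>' of an opening tag and
// returns the tag name and its attributes.
func splitTag(raw string) (string, map[string]string) {
	raw = strings.TrimSpace(raw)
	i := strings.IndexAny(raw, " \t\r\n")
	if i < 0 {
		return raw, map[string]string{}
	}
	return raw[:i], parseAttributes(raw[i:])
}

func main() {
	name, attrs := splitTag(`a href="https://example.com" target=_blank hidden`)
	fmt.Println(name, attrs["href"], attrs["target"])
}
```

A map keeps lookups simple, at the cost of losing the original attribute order. A slice of name/value pairs would preserve it.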

Structure

The parser identifies two types of nodes:

ElementNode: Represents an HTML element with a tag (e.g., <div>, <p>) and optional attributes.
TextNode: Represents text content within an HTML element.
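
For illustration, the two node types could be modelled with a single struct and a type tag. This is a sketch only: ElementNode and TextNode are named above, but the NodeType type and the field names Type, Tag, Attrs, Text, and Children are assumptions and may differ from the actual code in main.go.

```go
package main

import "fmt"

// NodeType distinguishes the two kinds of node described above.
type NodeType int

const (
	ElementNode NodeType = iota // an HTML element such as <div> or <p>
	TextNode                    // text content inside an element
)

// Node is one entry in the parsed tree (field names are assumptions).
type Node struct {
	Type     NodeType
	Tag      string            // set for element nodes
	Attrs    map[string]string // optional attributes, element nodes only
	Text     string            // set for text nodes
	Children []*Node
}

// AppendChild adds child to the end of n's child list.
func (n *Node) AppendChild(child *Node) {
	n.Children = append(n.Children, child)
}

func main() {
	// Build <p class="intro">Hello</p> by hand.
	p := &Node{Type: ElementNode, Tag: "p", Attrs: map[string]string{"class": "intro"}}
	p.AppendChild(&Node{Type: TextNode, Text: "Hello"})
	fmt.Println(p.Tag, p.Attrs["class"], p.Children[0].Text)
}
```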

How It Works

The parser tokenizes the HTML input and recursively constructs a tree of nodes. Each node can have a list of child nodes, making it easy to visualize or traverse the document structure.
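
The approach can be sketched as a small recursive-descent parser. NewParser and Parse are the names used in this README, but everything else here is an assumption: the parseChildren helper, the Node fields, the error messages, and the decisions to ignore attributes and to skip whitespace-only text. The real implementation in main.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is an element (Tag set) or a text node (Tag empty, Text set).
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// Parser tracks the input and the current read position.
type Parser struct {
	input string
	pos   int
}

func NewParser(input string) *Parser { return &Parser{input: input} }

// Parse collects all top-level nodes under a synthetic <root> element.
func (p *Parser) Parse() (*Node, error) {
	children, err := p.parseChildren("")
	if err != nil {
		return nil, err
	}
	return &Node{Tag: "root", Children: children}, nil
}

// parseChildren reads sibling nodes until it consumes </closing>.
// An empty closing name means "read until the end of input".
func (p *Parser) parseChildren(closing string) ([]*Node, error) {
	var nodes []*Node
	for p.pos < len(p.input) {
		rest := p.input[p.pos:]
		if !strings.HasPrefix(rest, "<") {
			// Text token: everything up to the next '<'.
			end := strings.IndexByte(rest, '<')
			if end < 0 {
				end = len(rest)
			}
			if strings.TrimSpace(rest[:end]) != "" { // skip whitespace-only text
				nodes = append(nodes, &Node{Text: rest[:end]})
			}
			p.pos += end
			continue
		}
		// Tag token: everything between '<' and the next '>'.
		end := strings.IndexByte(rest, '>')
		if end < 0 {
			return nil, fmt.Errorf("unterminated tag at offset %d", p.pos)
		}
		if strings.HasPrefix(rest, "</") {
			name := strings.TrimSpace(rest[2:end])
			if closing == "" || name != closing {
				return nil, fmt.Errorf("unexpected </%s> at offset %d", name, p.pos)
			}
			p.pos += end + 1
			return nodes, nil
		}
		fields := strings.Fields(rest[1:end]) // attributes are ignored in this sketch
		if len(fields) == 0 {
			return nil, fmt.Errorf("empty tag at offset %d", p.pos)
		}
		p.pos += end + 1
		children, err := p.parseChildren(fields[0]) // recurse into the element
		if err != nil {
			return nil, err
		}
		nodes = append(nodes, &Node{Tag: fields[0], Children: children})
	}
	if closing != "" {
		return nil, fmt.Errorf("missing </%s>", closing)
	}
	return nodes, nil
}

func main() {
	root, err := NewParser("<html><body><h1>Title</h1></body></html>").Parse()
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	h1 := root.Children[0].Children[0].Children[0]
	fmt.Println(h1.Tag, h1.Children[0].Text) // prints "h1 Title"
}
```

Each call to parseChildren owns one nesting level, so the call stack mirrors the depth of the document. A mismatched or missing closing tag surfaces as an error rather than a silently malformed tree.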

Example

For the HTML input:

```
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a <b>simple</b> HTML parser in Go.</p>
  </body>
</html>
```

The output would look like:

```
<root>
  <html>
    <head>
      <title>
        Sample Page
      </title>
    </head>
    <body>
      <h1>
        Welcome to the Sample Page
      </h1>
      <p>
        This is a 
        <b>
          simple
        </b>
         HTML parser in Go.
      </p>
    </body>
  </html>
</root>
```

Code Overview

The code is broken down into several key components:

Node Struct: Represents each HTML element or text in the structure.
Parser Struct: Manages the parsing process, including the current position in the HTML string.
Parsing Functions: Functions to parse elements, tags, attributes, and text nodes.
Print Function: A recursive function to display the parsed HTML structure in a readable format.

Usage

To use the parser, include the code in your project, create a parser with NewParser, passing your HTML content as a string, and call its Parse method.

Running the Example

Clone this repository and navigate to the directory:

```
git clone https://github.com/your-username/html-parser-go.git
cd html-parser-go
```

Run the code:
```
go run main.go
```

The sample HTML included in main.go will be parsed, and the output structure will be printed to the console.

Code Snippets

Parsing HTML

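```
// Create a parser for the input string and build the node tree.
parser := NewParser("<html><body><h1>Title</h1></body></html>")
root, err := parser.Parse()
if err != nil {
    fmt.Println("Error:", err)
    return
}
// Print the parsed tree, starting from the root node.
printNode(root, 0)
```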
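
Once a tree is available, it can be inspected with ordinary recursive functions. The sketch below shows two common traversals: collecting all text under a node and finding every element with a given tag. The helpers textContent and findAll are hypothetical and not part of this repository, and the Node fields are assumed. The tree is built by hand so the example runs on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is an element (Tag set) or a text node (Tag empty, Text set).
// The field names are assumptions for this example.
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// textContent concatenates the text of n and all of its descendants.
func textContent(n *Node) string {
	if n.Tag == "" {
		return n.Text
	}
	var sb strings.Builder
	for _, child := range n.Children {
		sb.WriteString(textContent(child))
	}
	return sb.String()
}

// findAll returns every element named tag in document order,
// including n itself if it matches.
func findAll(n *Node, tag string) []*Node {
	var found []*Node
	if n.Tag == tag {
		found = append(found, n)
	}
	for _, child := range n.Children {
		found = append(found, findAll(child, tag)...)
	}
	return found
}

func main() {
	// Hand-built tree for <p>This is a <b>simple</b> HTML parser in Go.</p>
	p := &Node{Tag: "p", Children: []*Node{
		{Text: "This is a "},
		{Tag: "b", Children: []*Node{{Text: "simple"}}},
		{Text: " HTML parser in Go."},
	}}
	fmt.Println(textContent(p))
	fmt.Println(len(findAll(p, "b")))
}
```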

Output Format

The parser’s output displays the HTML elements and text nodes in a tree-like format, preserving the original HTML hierarchy.
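
A recursive print function that produces this kind of output might look like the sketch below. printNode is the name used in this README. The renderNode and writeNode helpers, the Node fields, and the two-space indent are assumptions, and attributes are not printed.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is an element (Tag set) or a text node (Tag empty, Text set).
type Node struct {
	Tag      string
	Text     string
	Children []*Node
}

// writeNode appends n and its descendants to sb, indenting each
// nesting level by two spaces.
func writeNode(sb *strings.Builder, n *Node, depth int) {
	indent := strings.Repeat("  ", depth)
	if n.Tag == "" {
		sb.WriteString(indent + n.Text + "\n")
		return
	}
	sb.WriteString(indent + "<" + n.Tag + ">\n")
	for _, child := range n.Children {
		writeNode(sb, child, depth+1)
	}
	sb.WriteString(indent + "</" + n.Tag + ">\n")
}

// renderNode returns the formatted tree as a string.
func renderNode(n *Node, depth int) string {
	var sb strings.Builder
	writeNode(&sb, n, depth)
	return sb.String()
}

// printNode writes the formatted tree to standard output.
func printNode(n *Node, depth int) {
	fmt.Print(renderNode(n, depth))
}

func main() {
	root := &Node{Tag: "root", Children: []*Node{
		{Tag: "h1", Children: []*Node{{Text: "Title"}}},
	}}
	printNode(root, 0)
}
```

Building the output in a strings.Builder first keeps the formatting logic separate from the console, which makes it easier to check in a test.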

Limitations

This parser is intended as a minimal example for learning purposes. It does not implement the full HTML specification. Unsupported features include:

Void elements like <img> or <br>, which have no closing tag, and self-closing syntax such as <br/>.
Nested structures in more complex HTML.
Advanced error handling for malformed HTML.
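
As a starting point for the first limitation, a parser could check each opening tag against the list of elements that never take a closing tag (HTML calls these void elements) and skip the recursive step for them. The sketch below is not part of this repository. The names voidElements, isVoid, and expectsChildren are hypothetical, and the check for a trailing slash is a simplification that would misread an unquoted attribute value ending in "/".

```go
package main

import (
	"fmt"
	"strings"
)

// voidElements lists the HTML elements that have no closing tag.
var voidElements = map[string]bool{
	"area": true, "base": true, "br": true, "col": true, "embed": true,
	"hr": true, "img": true, "input": true, "link": true, "meta": true,
	"source": true, "track": true, "wbr": true,
}

// isVoid reports whether tag is a void element, ignoring case.
func isVoid(tag string) bool {
	return voidElements[strings.ToLower(tag)]
}

// expectsChildren takes the text between '<' and '>' of an opening tag
// and reports whether the parser should go on to read child nodes and
// a matching closing tag.
func expectsChildren(raw string) bool {
	raw = strings.TrimSpace(raw)
	if strings.HasSuffix(raw, "/") {
		return false // self-closing syntax such as <br/>
	}
	fields := strings.Fields(raw)
	if len(fields) == 0 {
		return false
	}
	return !isVoid(fields[0])
}

func main() {
	for _, raw := range []string{`img src="a.png"`, "br/", `p class="x"`} {
		fmt.Printf("%-18s expects children: %v\n", raw, expectsChildren(raw))
	}
}
```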

Contributing

Feel free to open issues or submit pull requests if you’d like to improve this parser.

License

This project is licensed under the MIT License.
