⚡️ HTML Document Data Extractor

This project was inspired by the BeautifulSoup Python library and was used as an opportunity to learn the Tree-sitter parser API.

Project Components

The project can be broken into two main components:

  1. Tree-sitter API
  2. CPP Interface

The Tree-sitter API did not have a C++ language binding, as the library is written in C, so I wrote a C++ layer on top of this API. The Tree-sitter library does not have native HTML parsing capabilities, as it is meant to be a parser generator tool, but it does provide a query language interface which will be discussed later. This project uses an HTML grammar which can be found here: Tree-sitter-html
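
As a rough illustration of what such a layer wraps, the sketch below drives the raw Tree-sitter C API directly to parse an HTML string with the tree-sitter-html grammar; the HtmlParser class name is purely illustrative and is not the interface this project exposes.

#include <tree_sitter/api.h>
#include <string>

// The HTML grammar from tree-sitter-html is exposed as a plain C symbol.
extern "C" const TSLanguage *tree_sitter_html();

// Illustrative RAII wrapper over the Tree-sitter C API (names are hypothetical).
class HtmlParser {
public:
    HtmlParser() : parser_(ts_parser_new()) {
        ts_parser_set_language(parser_, tree_sitter_html());
    }
    ~HtmlParser() {
        if (tree_) ts_tree_delete(tree_);
        ts_parser_delete(parser_);
    }

    // Parse an HTML document and return the root node of the resulting syntax tree.
    TSNode parse(const std::string& html) {
        if (tree_) ts_tree_delete(tree_);
        tree_ = ts_parser_parse_string(parser_, nullptr, html.c_str(),
                                       static_cast<uint32_t>(html.size()));
        return ts_tree_root_node(tree_);
    }

private:
    TSParser* parser_ = nullptr;
    TSTree* tree_ = nullptr;
};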

Usage

The BeautifulSoup object is the only interface that is required to use the library.

BeautifulSoup soup("url");

When the BeautifulSoup object is created with a valid URL, it performs an HTTP request using the cURL library and parses the HTML document at that location. We now have an AST (Abstract Syntax Tree) which we can explore.
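
For context, the cURL step of that flow generally looks something like the sketch below; this is only an illustration of how a page can be downloaded with libcurl, not this project's actual internals, and fetchHtml is a hypothetical helper name.

#include <curl/curl.h>
#include <string>

// Hypothetical helper illustrating the cURL step: download the document
// body at `url` into a std::string so it can be handed to the parser.
static size_t writeToString(char* data, size_t size, size_t nmemb, void* userp) {
    auto* out = static_cast<std::string*>(userp);
    out->append(data, size * nmemb);
    return size * nmemb;
}

std::string fetchHtml(const std::string& url) {
    std::string body;
    CURL* curl = curl_easy_init();
    if (!curl) return body;
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeToString);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);
    curl_easy_perform(curl);          // ignoring the CURLcode for brevity
    curl_easy_cleanup(curl);
    return body;                      // raw HTML, ready for parsing
}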

The BeautifulSoup object provides two methods:

  1. find_all
  2. find

Find_all

The find_all method can be used to find all occurrences of a specific tag in the HTML document. This is done by first extracting all instances of tags whose tag name matches the one given as the parameter of the function call. We do this by performing a query on the AST, which Tree-sitter allows us to do through its own query language, which is very Scheme-like.

The query for finding the tag names looks something like this:

(start_tag (tag_name) @name)
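
To make the query step concrete, the snippet below shows roughly how such a pattern can be compiled and run with Tree-sitter's C query API over an already-parsed tree; it is an illustrative sketch rather than the library's internal code.

#include <tree_sitter/api.h>
#include <cstring>
#include <cstdio>

// `language` is the HTML grammar and `root` is the root node of a parsed tree;
// `source` is the original HTML text the tree was parsed from.
void printTagNames(const TSLanguage* language, TSNode root, const char* source) {
    const char* pattern = "(start_tag (tag_name) @name)";
    uint32_t error_offset = 0;
    TSQueryError error_type = TSQueryErrorNone;
    TSQuery* query = ts_query_new(language, pattern,
                                  static_cast<uint32_t>(strlen(pattern)),
                                  &error_offset, &error_type);

    TSQueryCursor* cursor = ts_query_cursor_new();
    ts_query_cursor_exec(cursor, query, root);

    TSQueryMatch match;
    while (ts_query_cursor_next_match(cursor, &match)) {
        for (uint32_t i = 0; i < match.capture_count; ++i) {
            TSNode node = match.captures[i].node;
            uint32_t start = ts_node_start_byte(node);
            uint32_t end = ts_node_end_byte(node);
            // Print the tag name captured by @name, e.g. "span" or "div".
            printf("%.*s\n", static_cast<int>(end - start), source + start);
        }
    }

    ts_query_cursor_delete(cursor);
    ts_query_delete(query);
}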

std::vector<TagNode> BeautifulSoup::find_all(const std::string& tagName)

BeautifulSoup soup("url");
auto docSpans = soup.find_all("span");

The above example will extract all nodes from the AST, starting from the root node, with the span tag name. We can further filter the search results (see the next example).

std::vector<TagNode> BeautifulSoup::find_all(const std::string& tagName, std::map<std::string,std::string> attrs)

BeautifulSoup soup("url");
auto docSpans = soup.find_all("span", {{"class", "social"}});

The above example will extract all nodes from the AST, starting from the root node, that have the span tag name and the social class.

std::vector<TagNode> BeautifulSoup::find_all(const Node& startNode, const std::string& tagName, std::map<std::string,std::string> attrs)

BeautifulSoup soup("url");
auto articles = soup.find_all("article", {{"class", "featured"}, {"data-value", "2"}});

for(auto& article : articles) {
    auto articleTitle = soup.find(article, "h1", {{"class", "title"}});
    std::cout << soup.getNodeText(articleTitle) << "\n";
}

The above example finds all articles in the HTML document and then prints the title of each article by finding, within that article, the first h1 tag with the title class and using the getNodeText helper function to extract the text from the source document.

Find

TagNode BeautifulSoup::find(const std::string& tagName)

BeautifulSoup soup("url");
auto docSpans = soup.find("span");

The above example will extract the first node from the AST, starting from the root node, with the span tag name. We can further filter the search results (see the next example).

TagNode BeautifulSoup::find(const std::string& tagName, std::map<std::string,std::string> attrs)

BeautifulSoup soup("url");
auto docSpan = soup.find("span", {{"class", "social"}});

The above example will extract the first node from the AST, starting from the root node, that has the span tag name and the social class.

TagNode BeautifulSoup::find(const Node& startNode, const std::string& tagName, std::map<std::string,std::string> attrs)

BeautifulSoup soup("url");
auto article = soup.find("article", {{"class", "featured"}, {"data-value", "2"}});
auto articleTitle = soup.find(article, "h1", {{"class", "title"}});
std::cout << soup.getNodeText(articleTitle) << "\n";

The above example finds the first article in the HTML document and then prints that article's title by finding the first h1 tag with the title class within it and using the getNodeText helper function to extract the text from the source document.
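
Putting the pieces together, a complete program using the library might look like the following; the header name beautifulsoup.hpp and the example URL are assumptions, since the README does not state them.

// Minimal end-to-end sketch combining find_all, find and getNodeText.
// The header name "beautifulsoup.hpp" is an assumption; adjust it to the
// header this repository actually provides.
#include "beautifulsoup.hpp"
#include <iostream>

int main() {
    // Fetch and parse the document at the given URL.
    BeautifulSoup soup("https://example.com");

    // Collect every featured article, then print each article's title.
    auto articles = soup.find_all("article", {{"class", "featured"}});
    for (auto& article : articles) {
        auto title = soup.find(article, "h1", {{"class", "title"}});
        std::cout << soup.getNodeText(title) << "\n";
    }
    return 0;
}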
