Parser

One of the issues with some web-scraping tools is their dependence on navigating the HTML hierarchy. Normally this is done with an XPath or CSS selector in Nokogiri. The issue is that getting to the actual data can be challenging and might require a significant amount of logic at each step of the process: find a wrapper element, find a data element within that wrapper, get an attribute from that data element, etc. This can be complicated if elements don't have identifiable class names or ids. Some page designs (specifically those using HTML tables) do not consistently represent the data within the design (A1 is a label and B1 is data, but A2 is a label and A3 is data).

This parser flattens the hierarchy into a one-dimensional array of hashes, with each hash holding the properties of the original Nokogiri element. This array can then be sliced, truncated, or filtered to remove any elements we don't care about. The remaining elements can then be 'chunked' into a multi-dimensional array, and each chunk can then be processed for the data that we are looking for.
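The flatten, filter, and chunk steps can be sketched roughly as follows. This is only an illustration of the idea, not the code in parser.rb: the `Node` struct, the `node` helper, and `flatten_tree` are hypothetical names, and a plain Ruby struct stands in for Nokogiri elements so the snippet runs without any gems.

```ruby
# Hypothetical stand-in for a parsed HTML element (the real parser reads Nokogiri nodes).
Node = Struct.new(:name, :attrs, :text, :children)

# Small helper to build the sample tree below (illustrative only).
def node(name, text = nil, attrs = {}, children = [])
  Node.new(name, attrs, text, children)
end

# Depth-first walk that turns the tree into a flat array of hashes,
# one hash per element, in document order.
def flatten_tree(node, depth = 0, out = [])
  out << { name: node.name, attrs: node.attrs, text: node.text, depth: depth }
  node.children.each { |child| flatten_tree(child, depth + 1, out) }
  out
end

# A table whose layout is inconsistent: the first label/value pair shares a row,
# the second pair is split across two rows.
tree = node('table', nil, {}, [
  node('tr', nil, {}, [node('td', 'Title'), node('td', 'First post')]),
  node('tr', nil, {}, [node('td', 'Username')]),
  node('tr', nil, {}, [node('td', 'alice')])
])

flat   = flatten_tree(tree)                          # flatten the hierarchy
cells  = flat.select { |el| el[:name] == 'td' }      # filter: keep only the cells
chunks = cells.each_slice(2).to_a                    # chunk: [label, value] pairs
record = chunks.to_h { |label, value| [label[:text], value[:text]] }
```

Once the row structure is flattened away, it no longer matters whether a label and its value sit in the same row or in different rows; only their order in the document is used.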

Testing

  ruby test_parser.rb

Example

  ruby example.rb

This will produce './data.json' containing a JSON array of objects: { title, url, username }

johnfogh/Parser
