scrape cli

It's a command-line tool to extract HTML elements using an XPath query or CSS3 selector.

It's based on the great and simple scraping tool written by Jeroen Janssens.

How does it work?
How to use it in Linux
Note on building it

How does it work?

A CSS selector query like this

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be 'table.wikitable > tbody > tr > td > b > a'

or an XPath query like this one:

curl -L 'https://en.wikipedia.org/wiki/List_of_sovereign_states' -s \
| scrape -be "//table[contains(@class, 'wikitable')]/tbody/tr/td/b/a"

gives you back:

<html>
 <head>
 </head>
 <body>
  <a href="/wiki/Afghanistan" title="Afghanistan">
   Afghanistan
  </a>
  <a href="/wiki/Albania" title="Albania">
   Albania
  </a>
  <a href="/wiki/Algeria" title="Algeria">
   Algeria
  </a>
  <a href="/wiki/Andorra" title="Andorra">
   Andorra
  </a>
  <a href="/wiki/Angola" title="Angola">
   Angola
  </a>
  <a href="/wiki/Antigua_and_Barbuda" title="Antigua and Barbuda">
   Antigua and Barbuda
  </a>
  <a href="/wiki/Argentina" title="Argentina">
   Argentina
  </a>
  <a href="/wiki/Armenia" title="Armenia">
   Armenia
  </a>
...
...
 </body>
</html>

Some notes on the options:

-e sets the query (an XPath expression or a CSS selector).
-b wraps the output in <html>, <head> and <body> tags.
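
To make the selection step concrete, here is a minimal sketch that uses only Python's standard library; it is not the code of scrape.py, which relies on lxml and cssselect. ElementTree supports just a subset of XPath, so an exact class match stands in for contains(). The SAMPLE markup and the select and render helpers are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the Wikipedia page.
SAMPLE = """
<html><body>
<table class="wikitable"><tbody>
<tr><td><b><a href="/wiki/Afghanistan" title="Afghanistan">Afghanistan</a></b></td></tr>
<tr><td><b><a href="/wiki/Albania" title="Albania">Albania</a></b></td></tr>
<tr><td><a href="/wiki/Other">not in bold, so not matched</a></td></tr>
</tbody></table>
</body></html>
"""


def select(markup, path):
    """Return the elements matched by an ElementTree path expression."""
    return ET.fromstring(markup).findall(path)


def render(elements, body=False):
    """Serialize the matches; body=True mimics the wrapping described for -b."""
    parts = [ET.tostring(el, encoding="unicode").strip() for el in elements]
    if body:
        parts = ["<html>", "<head>", "</head>", "<body>"] + parts + ["</body>", "</html>"]
    return "\n".join(parts)


# Exact class match stands in for contains(@class, 'wikitable').
QUERY = ".//table[@class='wikitable']/tbody/tr/td/b/a"
links = select(SAMPLE, QUERY)
print(render(links, body=True))
```

Only the two links nested inside a b element are selected, which mirrors the td > b > a step of the CSS selector.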

How to use it in Linux

# go, for example, to your home folder
cd ~
# download scrape-cli
wget "https://github.com/aborruso/scrape-cli/releases/download/v1.0/scrape"
# move it to a folder in your PATH, such as /usr/bin
sudo mv ./scrape /usr/bin
# give it execute permission
sudo chmod +x /usr/bin/scrape
# use it

Please note: it does not seem to work on OS X (#8).
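
Once the executable is on your PATH, it can also be called from a script. The sketch below is one possible wrapper, not part of the project: build_command and run_scrape are hypothetical helper names, and only the -e and -b options described in this README are used. If scrape is not found on the PATH, it returns None instead of failing.

```python
import shutil
import subprocess


def build_command(query, body=False, exe="scrape"):
    """Assemble the argument list using only the -e and -b options."""
    cmd = [exe]
    if body:
        cmd.append("-b")
    cmd += ["-e", query]
    return cmd


def run_scrape(html, query, body=False, exe="scrape"):
    """Pipe html to the executable; return its stdout, or None if it is not on PATH."""
    if shutil.which(exe) is None:
        return None
    result = subprocess.run(
        build_command(query, body, exe),
        input=html,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical inline page, used instead of downloading one with curl.
    page = (
        "<table class='wikitable'><tbody><tr><td><b>"
        "<a href='/wiki/Albania'>Albania</a>"
        "</b></td></tr></tbody></table>"
    )
    output = run_scrape(page, "td > b > a", body=True)
    print(output if output is not None else "scrape was not found on PATH")
```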

Note on building it

The original source is written in Python 2, so I built it in a Python 2 environment.
It requires two modules: in that environment, install cssselect first and then lxml, in this order (using pip).

I built it with pyinstaller, using this command: pyinstaller --onefile scrape.py.

Once built, it is a standalone executable that can be used in any environment, without a Python installation.
