Papeer is a powerful **ereader internet vacuum**. It can scrape any website, removing ads and keeping only the relevant content (formatted text and images). You can export the content to Markdown, HTML, EPUB or MOBI files.
<h4 align="center">Web scraper for ereaders</h4>
<p align="center">
  <a href="#features">Features</a> •
  <a href="#installation">Installation</a> •
  <a href="#how-to-use">How To Use</a>
</p>
<img src="terminal.gif" alt="Papeer">

## Features
* Scrape websites and RSS feeds
* Keep relevant content only
  - Formatted text (bold, italic, links)
  - Images
* Save websites as Markdown, HTML, EPUB or MOBI files
* Use it as an HTTP proxy
* Cross platform
  - Windows, MacOS and Linux ready
# Installation
## From source
```sh
go install github.com/lapwat/papeer@latest
```
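Go drops the compiled binary into its bin directory, so assuming `$HOME/go/bin` (or `$GOPATH/bin`) is on your `PATH`, you can check that the install worked:

```sh
# List papeer's commands and global flags to confirm the binary is reachable
papeer --help
```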
# How To Use

The `get` command lets you retrieve the content of any web page or RSS feed.
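For example, to turn a single web page into a clean, readable document (this URL comes from the command's built-in examples):

```sh
# Scrape one article, dropping ads and navigation, keeping text and images
papeer get https://www.eff.org/cyberspace-independence
```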
### Scrape a whole website recursively
**Display the table of contents**
Before scraping a whole website, it is a good idea to use the `list` command. This command is like a _dry run_, **which lets you visualize the content before retrieving it**.
You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
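For instance, a quick preview sketch (assuming `--limit` and `--reverse` follow the same long-flag syntax as `--selector` below) that trims the listing to the first three entries, in reverse order:

```sh
# Dry run: list only 3 chapters, reversed, without downloading any content
papeer list https://12factor.net/ --limit=3 --reverse
```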
The `selector` option should point to **`<a>` HTML tags**. If you don't specify it, the `selector` will be automatically determined based on the links present on the page. You can also chain this option to grab several levels of pages, with a different selector for each level.
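For instance, you can let `papeer` guess the selector on its own; the listing it produces will vary with the page's link structure:

```sh
# No --selector flag: papeer picks the most likely table-of-contents links
papeer list https://12factor.net/
```

To pin down exactly which links become chapters, pass an explicit CSS selector instead: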
```sh
papeer list https://12factor.net/ --selector='section.concrete>article>h2>a'
```

```
#    NAME                      URL
1    I. Codebase               https://12factor.net/codebase
2    II. Dependencies          https://12factor.net/dependencies
3    III. Config               https://12factor.net/config
4    IV. Backing services      https://12factor.net/backing-services
5    V. Build, release, run    https://12factor.net/build-release-run
6    VI. Processes             https://12factor.net/processes
7    VII. Port binding         https://12factor.net/port-binding
8    VIII. Concurrency         https://12factor.net/concurrency
9    IX. Disposability         https://12factor.net/disposability
10   X. Dev/prod parity        https://12factor.net/dev-prod-parity
11   XI. Logs                  https://12factor.net/logs
12   XII. Admin processes      https://12factor.net/admin-processes
```
**Scrape the content**
Once you are satisfied with the table of contents listed by the `list` command, you can scrape the content of those pages with the `get` command. You can use the same options that you specified for the `list` command.
```sh
papeer get https://12factor.net/ --selector='section.concrete>article>h2>a'
```
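On large sites you may also want to pace the downloads. The `get` command's help documents a `--delay` flag (time in milliseconds to wait before downloading the next chapter, meant for use with the `depth`/`selector` options), so a polite recursive scrape might look like this sketch; run `papeer get --help` to confirm the flags available in your build:

```sh
# Same scrape as above, but wait 500 ms between chapter downloads
papeer get https://12factor.net/ --selector='section.concrete>article>h2>a' --delay=500
```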