
Commit 6e55e9a

add proxy feature, update readme
1 parent 21d1727

7 files changed (+312, -214 lines)

README.md

Lines changed: 122 additions & 105 deletions
````diff
@@ -1,91 +1,98 @@
-# Papeer
+<h1 align="center">
+  <img src="logo.png" alt="Papeer">
+  <br>
+  Papeer
+</h1>
 
-Papeer is a powerful **ereader internet vacuum**. It can scrape any website, removing ads and keeping only the relevant content (formatted text and images). You can export the content to Markdown, HTML, EPUB or MOBI files.
+<h4 align="center">Web scraper for ereaders</h4>
 
-# Table of contents
+<p align="center">
+  <a href="#features">Features</a> •
+  <a href="#installation">Installation</a> •
+  <a href="#how-to-use">How To Use</a>
+</p>
 
-- [Usage](#usage)
-  * [Scrape a web page](#scrape-a-web-page)
-  * [Scrape a whole website](#scrape-a-whole-website)
-    + [`depth` option](#-depth--option)
-    + [`selector` option](#-selector--option)
-    + [Display the table of contents](#display-the-table-of-contents)
-    + [Scrape time](#scrape-time)
-- [Installation](#installation)
-  * [From source](#from-source)
-  * [From binary](#from-binary)
-    + [Linux / MacOS](#linux---macos)
-    + [Windows](#windows)
-  * [MOBI support](#mobi-support)
-- [Autocompletion](#autocompletion)
-- [Dependencies](#dependencies)
+<img src="terminal.gif" alt="Papeer">
 
-# Usage
 
-## Scrape a web page
+## Features
 
-The `get` command lets you retrieve the content of any web page or RSS feed.
+* Scrape websites and RSS feeds
+* Keep relevant content only
+  - Formatted text (bold, italic, links)
+  - Images
+* Save websites as Markdown, HTML, EPUB or MOBI files
+* Use it as an HTTP proxy
+* Cross platform
+  - Windows, MacOS and Linux ready
 
+# Installation
+
+## From source
+
+```sh
+go install github.com/lapwat/papeer@latest
 ```
-Scrape URL content
-
-Usage:
-  papeer get URL [flags]
-
-Examples:
-  papeer get https://www.eff.org/cyberspace-independence
-
-Flags:
-  -a, --author string      book author
-      --delay int          time in milliseconds to wait before downloading next chapter, use with depth/selector (default -1)
-  -d, --depth int          scraping depth
-  -f, --format string      file format [md, html, epub, mobi] (default "md")
-  -h, --help               help for get
-      --images             retrieve images only
-  -i, --include            include URL as first chapter, use with depth/selector
-  -l, --limit int          limit number of chapters, use with depth/selector (default -1)
-  -n, --name string        book name (default: page title)
-  -o, --offset int         skip first chapters, use with depth/selector
-      --output string      file name (default: book name)
-  -q, --quiet              hide progress bar
-  -r, --reverse            reverse chapter order
-  -s, --selector strings   table of contents CSS selector
-      --stdout             print to standard output
-  -t, --threads int        download concurrency, use with depth/selector (default -1)
-      --use-link-name      use link name for chapter title
+
+## From binary
+
+Download the [latest release](https://github.com/lapwat/papeer/releases/latest) for Windows, MacOS (darwin) and Linux.
+
+## MOBI support
+
+Install kindlegen to convert websites to MOBI (Linux only).
+
+```sh
+TMPDIR=$(mktemp -d -t papeer-XXXXX)
+curl -L https://github.com/lapwat/papeer/releases/download/kindlegen/kindlegen_linux_2.6_i386_v2_9.tar.gz > $TMPDIR/kindlegen.tar.gz
+tar xzvf $TMPDIR/kindlegen.tar.gz -C $TMPDIR
+chmod +x $TMPDIR/kindlegen
+sudo mv $TMPDIR/kindlegen /usr/local/bin
+rm -rf $TMPDIR
 ```
 
-## Scrape a whole website
+Now you can use `--format=mobi` in your `get` command.
 
-If a navigation menu is present on a website, you can scrape the content of each page.
+## How To Use
 
-You can activate this mode by using the `depth` or `selector` options.
+### Scrape a single page
 
-### `depth` option
+```sh
+papeer get URL
+```
 
-This option defaults to 0, in which case `papeer` grabs only the main page.
+The `get` command lets you retrieve the content of a web page.
 
-This option defaults to 1 if the `limit` option is specified.
+It removes ads and menus with `go-readability`, keeping only formatted text and images.
 
-If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
+You can chain URLs.
 
-> Using the `include` option will include all intermediary levels into the book.
+**Options**
 
-### `selector` option
+```sh
+-a, --author string   book author
+-f, --format string   file format [md, html, epub, mobi] (default "md")
+-h, --help            help for get
+    --images          retrieve images only
+-n, --name string     book name (default: page title)
+    --output string   file name (default: book name)
+    --stdout          print to standard output
+```
 
-If this option is not specified, `papeer` will grab only the main page.
 
-If this option is specified, `papeer` will select the links (`<a>` HTML tags) present on the main page, then grab each one of them.
+### Scrape a whole website recursively
 
-You can chain this option to grab several levels of pages with different selectors for each level.
+**Display the table of contents**
 
-### Display the table of contents
+Before scraping a whole website, it is a good idea to use the `list` command. This command is like a _dry run_, **which lets you visualize the content before retrieving it**.
 
-Before actually scraping a whole website, it is a good idea to use the `list` command. This command is like a **dry run**, which lets you visualize the content before actually retrieving it. You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
+You can use several options to customize the table of contents extraction, such as `selector`, `limit`, `offset`, `reverse` and `include`. Type `papeer list --help` for more information about those options.
+
+The selector option should point to **`<a>` HTML tags**. If you don't specify it, the `selector` will be automatically determined based on the links present on the page.
 
 ```sh
-papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
+papeer list https://12factor.net/ --selector='section.concrete>article>h2>a'
 ```
+
 ```
 #    NAME                     URL
 1    I. Codebase              https://12factor.net/codebase
````
````diff
@@ -102,27 +109,28 @@ papeer list https://12factor.net/ -s 'section.concrete>article>h2>a'
 12   XII. Admin processes     https://12factor.net/admin-processes
 ```
 
-### Scrape time
+**Scrape the content**
 
-Once you are satisfied with the table of contents listed by the `ls` command, you can actually scrape the content of those pages. You can use the same options that you specified for the `ls` command. You can specify `delay` and `threads` options when using `selector` or `depth` options.
+Once you are satisfied with the table of contents listed by the `list` command, you can scrape the content of those pages with the `get` command. You can use the same options that you specified for the `list` command.
 
 ```sh
 papeer get https://12factor.net/ --selector='section.concrete>article>h2>a'
 ```
+
 ```
-[======================================>-----------------------------] Chapters 7 / 12
-[====================================================================] 1. I. Codebase
-[====================================================================] 2. II. Dependencies
-[====================================================================] 3. III. Config
-[====================================================================] 4. IV. Backing services
-[====================================================================] 5. V. Build, release, run
-[====================================================================] 6. VI. Processes
-[====================================================================] 7. VII. Port binding
-[--------------------------------------------------------------------] 8. VIII. Concurrency
-[--------------------------------------------------------------------] 9. IX. Disposability
-[--------------------------------------------------------------------] 10. X. Dev/prod parity
-[--------------------------------------------------------------------] 11. XI. Logs
-[--------------------------------------------------------------------] 12. XII. Admin processes
+[===>-----------------------------] Chapters 7 / 12
+[=================================] 1. I. Codebase
+[=================================] 2. II. Dependencies
+[=================================] 3. III. Config
+[=================================] 4. IV. Backing services
+[=================================] 5. V. Build, release, run
+[=================================] 6. VI. Processes
+[=================================] 7. VII. Port binding
+[---------------------------------] 8. VIII. Concurrency
+[---------------------------------] 9. IX. Disposability
+[---------------------------------] 10. X. Dev/prod parity
+[---------------------------------] 11. XI. Logs
+[---------------------------------] 12. XII. Admin processes
 Markdown saved to "The_Twelve-Factor_App.md"
 ```
 
````
````diff
@@ -129,44 +137,53 @@
-# Installation
+**Recursive mode options**
 
-## From source
+If a navigation menu is present on a website, you can scrape the content of each subpage.
 
-```sh
-go install github.com/lapwat/papeer@latest
-```
+You can activate this mode by using the `depth` or `selector` options.
 
-## From binary
+**`depth`**
 
-### Linux / MacOS
+This option defaults to 0, in which case `papeer` grabs only the main page.
 
-```sh
-# use platform=darwin for MacOS
-platform=linux
-release=0.6.3
+This option defaults to 1 if the `limit` option is specified.
 
-# download and extract
-curl -L https://github.com/lapwat/papeer/releases/download/v$release/papeer-v$release-$platform-amd64.tar.gz > papeer.tar.gz
-tar xzvf papeer.tar.gz
-rm papeer.tar.gz
+If you specify a value greater than 0, `papeer` will grab pages as deep as the value you specify.
 
-# move to user binaries
-sudo mv papeer /usr/local/bin
-```
+**`selector`**
+
+If this option is not specified, `papeer` will grab only the main page.
 
-### Windows
+If this option is specified, `papeer` will select the links (`<a>` HTML tags) present on the main page, then grab each one of them.
+
+You can chain this option to grab several levels of pages with different selectors for each level.
 
-Download [latest release](https://github.com/lapwat/papeer/releases/download/v0.6.3/papeer-v0.6.3-windows-amd64.zip).
+**`include`**
 
-## MOBI support
+Using this option will include all intermediary levels into the book.
 
-Install kindlegen to convert websites, Linux only
+**`delay`, `threads`**
+
+By default, `papeer` grabs all pages asynchronously.
+
+Use these options to control the pace and concurrency of scrape requests.
+
+**Automatic table of contents extraction**
+
+If you use a `depth` greater than 1 with no `selector`, the selector will be determined automatically based on the links present on the parent page.
+
+# Proxy
+
+You can use the `proxy` command to make `papeer` act as a proxy. It can serve HTML or Markdown content based on the `--output` option.
 
 ```sh
-TMPDIR=$(mktemp -d -t papeer-XXXXX)
-curl -L https://github.com/lapwat/papeer/releases/download/kindlegen/kindlegen_linux_2.6_i386_v2_9.tar.gz > $TMPDIR/kindlegen.tar.gz
-tar xzvf $TMPDIR/kindlegen.tar.gz -C $TMPDIR
-chmod +x $TMPDIR/kindlegen
-sudo mv $TMPDIR/kindlegen /usr/local/bin
-rm -rf $TMPDIR
+papeer proxy --output=md
+# Proxy listening on port 8080...
+```
+
+You can call the endpoint with `curl` and the `--proxy` option.
+
+```sh
+curl --insecure --location --proxy localhost:8080 http://www.brainjar.com/java/host/test.html
+# This is a very simple HTML file.
 ```
 
 # Autocompletion
````
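The recursive-mode flags documented above combine on a single command line. A sketch assuming a hypothetical site at example.com with placeholder selectors; the flags themselves come from the `get` help shown in the old README:

```sh
# depth-based: follow links two levels deep and keep intermediary pages
papeer get https://example.com --depth=2 --include

# selector-based: one --selector per level, throttled to two concurrent
# downloads with a 500 ms pause between chapters
papeer get https://example.com \
  --selector='nav a' \
  --selector='article a' \
  --threads=2 --delay=500
```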

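For the new Proxy section, a hypothetical session on a non-default port; `--port` defaults to 8080 per cmd/proxy.go below, and the test URL is the one from the command's own example:

```sh
# serve Markdown through the proxy on port 8081
papeer proxy --port=8081 --output=md &

# --insecure is required because the proxy man-in-the-middles TLS connections
curl --insecure --location --proxy localhost:8081 https://www.eff.org/cyberspace-independence
```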
cmd/proxy.go

Lines changed: 86 additions & 0 deletions
```diff
@@ -0,0 +1,86 @@
+package cmd
+
+import (
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+	"strings"
+
+	md "github.com/JohannesKaufmann/html-to-markdown"
+	"github.com/elazarl/goproxy"
+	readability "github.com/go-shiori/go-readability"
+	"github.com/spf13/cobra"
+)
+
+type ProxyOptions struct {
+	port   int
+	output string
+}
+
+var proxyOpts *ProxyOptions
+
+func init() {
+	proxyOpts = &ProxyOptions{}
+
+	proxyCmd.Flags().IntVarP(&proxyOpts.port, "port", "p", 8080, "Port on which to start the proxy")
+	proxyCmd.Flags().StringVarP(&proxyOpts.output, "output", "o", "html", "response format [html, md]")
+
+	rootCmd.AddCommand(proxyCmd)
+}
+
+var proxyCmd = &cobra.Command{
+	Use:     "proxy",
+	Short:   "Start http proxy",
+	Example: "curl --insecure --location --proxy localhost:8080 https://www.eff.org/cyberspace-independence",
+	Args: func(cmd *cobra.Command, args []string) error {
+
+		// check provided output is in list
+		outputEnum := map[string]bool{
+			"html": true,
+			"md":   true,
+		}
+		if !outputEnum[proxyOpts.output] {
+			return fmt.Errorf("invalid output specified: %s", proxyOpts.output)
+		}
+
+		return nil
+	},
+	Run: func(cmd *cobra.Command, args []string) {
+		proxy := goproxy.NewProxyHttpServer()
+		// proxy.Verbose = true
+
+		proxy.OnRequest().HandleConnect(goproxy.AlwaysMitm)
+
+		proxy.OnResponse().DoFunc(func(resp *http.Response, ctx *goproxy.ProxyCtx) *http.Response {
+
+			// extract HTML body
+			article, err := readability.FromReader(resp.Body, ctx.Req.URL)
+			if err != nil {
+				log.Fatal(err)
+			}
+
+			content := article.Content
+
+			if proxyOpts.output == "md" {
+				// convert content to markdown
+				content, err = md.NewConverter("", true, nil).ConvertString(content)
+				if err != nil {
+					log.Fatal(err)
+				}
+			}
+
+			stringReader := strings.NewReader(content)
+			resp.Body = io.NopCloser(stringReader)
+
+			log.Printf("Serving %s", ctx.Req.URL)
+
+			return resp
+		})
+
+		log.Printf("Proxy listening on port %d...", proxyOpts.port)
+		log.Printf("Usage: curl --insecure --location --proxy localhost:%d https://www.eff.org/cyberspace-independence", proxyOpts.port)
+
+		log.Fatal(http.ListenAndServe(fmt.Sprintf(":%d", proxyOpts.port), proxy))
+	},
+}
```
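The `Args` validator above rejects any `--output` outside the html/md set before the server starts. A hypothetical failing invocation; the error string comes from the `fmt.Errorf` call, with whatever framing cobra adds:

```sh
papeer proxy --output=pdf
# Error: invalid output specified: pdf
```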

cmd/version.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -14,6 +14,6 @@ var versionCmd = &cobra.Command{
 	Use:   "version",
 	Short: "Print the version number of papeer",
 	Run: func(cmd *cobra.Command, args []string) {
-		fmt.Println("papeer v0.6.3")
+		fmt.Println("papeer v0.7.0")
 	},
 }
```
