
Commit 93db014

Merge pull request #140 from HolgerDoerner/feature/medium_article_scraper
Add medium article scraper
2 parents d9d262e + 99a7c10 commit 93db014

File tree

4 files changed (+122, -0 lines changed)


Python/medium_article_scraper/LICENCE

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Holger Dörner

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# mediumScraper

A scraper for Medium.com posts.

This script searches `Medium.com` for a supplied `Topic` and returns the results as `JSON`.

## Requirements & Dependencies

Recommended Python version:
- `3.5+`

The only external dependency is:
- `BeautifulSoup4`

## Installation

Install the required external dependencies with

```shell
$ pip3 install -r requirements.txt
```

Also, a `chmod +x medium_scraper.py` *can* be needed to make the script executable; otherwise you have to call it with `python3` (see [Usage examples](#usage-examples)).

## Usage examples

Getting help:
```shell
$ (python3) medium_scraper.py -h
```

Get posts for `python`:
```shell
$ (python3) medium_scraper.py python
```

Get a maximum of `100` posts for `software development`:
```shell
$ (python3) medium_scraper.py "software development" -c 100
```

Pretty-print the `json` output:
```shell
$ (python3) medium_scraper.py python -b
```
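The URL the script builds for the search above can be sketched with only the standard library (the query format shown is the one the script assumes Medium's search endpoint accepts):

```python
from urllib import parse

# URL-encode the topic so spaces survive in the query string.
topic = parse.quote("software development")
url = 'https://medium.com/search/posts?q={topic}&count={count}'.format_map(
    {'topic': topic, 'count': 100})
print(url)  # https://medium.com/search/posts?q=software%20development&count=100
```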
Lines changed: 59 additions & 0 deletions
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
#!/usr/bin/env python3

import argparse
import json
import sys
from urllib import parse

import requests
from bs4 import BeautifulSoup


def parseArgs():
    parser = argparse.ArgumentParser(description='Gets posts from Medium.com for a specific topic.')
    parser.add_argument('topic', metavar='TOPIC', type=str,
                        help='the topic to search for')
    parser.add_argument('-c', '--count', dest='count', action='store', type=int, default=15,
                        help='maximum number of posts')
    parser.add_argument('-b', '--beautify', dest='beautify', action='store_true',
                        help='beautify json output')

    return parser.parse_args()


def run(args):
    def parsePost(tag):
        # Extract title, description and link from a single search result.
        title = tag.find('h3', class_='graf')
        desc = tag.find('p')
        url = tag.find_all('a')[3]
        return {
            'title': title.text if title else '',
            'desc': desc.text if desc else '',
            # Drop the tracking query parameters from the post URL.
            'url': url.get('href').split('?')[0] if url else '',
        }

    urlParams = {
        'topic': parse.quote(args.topic),
        'count': args.count,
    }

    url = 'https://medium.com/search/posts?q={topic}&count={count}'.format_map(urlParams)

    posts = []

    response = requests.get(url)

    soup = BeautifulSoup(response.text, 'html.parser')
    rawPosts = soup.find_all('div', class_='postArticle')

    if len(rawPosts) > 0:
        for post in rawPosts:
            posts.append(parsePost(post))
    else:
        print('No posts found for "%s"...' % args.topic)
        sys.exit(0)

    print(json.dumps(posts, indent=(4 if args.beautify else None)))


if __name__ == '__main__':
    run(parseArgs())
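The parsing step can be exercised in isolation on a synthetic snippet. The markup below is only an assumption about Medium's search-result HTML (a `div.postArticle` whose fourth `<a>` carries the post link); the live page may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one search result (assumption, not real Medium HTML).
html = '''
<div class="postArticle">
  <a></a><a></a><a></a>
  <a href="https://medium.com/@author/post?source=search"></a>
  <h3 class="graf">Sample title</h3>
  <p>Sample description</p>
</div>
'''

tag = BeautifulSoup(html, 'html.parser').find('div', class_='postArticle')
title = tag.find('h3', class_='graf')
desc = tag.find('p')
url = tag.find_all('a')[3]

post = {
    'title': title.text if title else '',
    'desc': desc.text if desc else '',
    'url': url.get('href').split('?')[0] if url else '',  # strip tracking params
}
print(post['url'])  # https://medium.com/@author/post
```

Guarding each field with `if … else ''` keeps the script from crashing when a result block is missing a title or description.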
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
BeautifulSoup4
