PTT Crawler

PTT crawler is a scrapy project for crawling PTT articles and is ready for the deployment on Scrapinghub. The key feature makes it differ from other similiar projects is that some possible complex PTT article patterns are considered (e.g. signature files (簽名檔), edited articles) and some algorithms are applied to deal with important values that are missing in sources (e.g. exact time and date of comments), so that a better data quality can be guaranteed. Note that due to efficiency issue it currently retrieves only articles from yesterday, although with some minor adjustments in the ptt_spider.py it is capable to retrieve articles from any date.

Example Usage

Command pattern:

scrapy crawl ptt <-a argument=value> <-o outputfile.json>

Example 1: Crawl 5 articles from PTT Goossiping and dump the data into output.json

scrapy crawl ptt -a max_articles=5 -a board='Gossiping' -o output.json

Example 2: Crawl 5 articles that title contain 丹丹 from PTT Goossiping and dump the data into output.json

scrapy crawl ptt -a max_articles=5 -a board='Gossiping' -a keyword=丹丹 -o output.json

Example 3: Crawl an article from url (https://www.ptt.cc/bbs/WomenTalk/M.1494689998.A.2AA.html) and dump the data into output.json

scrapy crawl ptt -a test_url=https://www.ptt.cc/bbs/WomenTalk/M.1494689998.A.2AA.html -o output.json

Available Arguments

max_articles: Maximium articles to crawl. default=5
max_retry: Maximium retries during the process. default=5
board: PTT board to crawl. default='HatePolitics'
keyword: If specified, the spider will only retrieve articles that has the given keyword in its title. optional argument
test_url: If set, only the article in the given url will be crawled and all arguments above will be ignored. This argument is especially helpful when debugging. optional argument
get_content: If set False, content of articles will not be retrieved. This helps to reduce the size of dataset if you are not interested in them. default=True
get_comments: If set False, comments of articles will not be retrieved and article scores will not be calculated. This helps to reduce the size of dataset if you are not interested in them. default=True

TODOs

implement two arguments from and until which allow user to specify a querying time range.
implement an argument fields which allows user to determine desired the data schema

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
crawler_app		crawler_app
.gitignore		.gitignore
README.md		README.md
requirements.txt		requirements.txt
scrapy.cfg		scrapy.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PTT Crawler

Example Usage

Available Arguments

TODOs

References

About

Releases

Packages

Languages

TimJJTing/ptt_crawler

Folders and files

Latest commit

History

Repository files navigation

PTT Crawler

Example Usage

Available Arguments

TODOs

References

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages