PTTCrawler

MIT License

PTTCrawler is a crawler for posts on PTT boards. It is implemented in Java.

Features

  • Connects to PTT over telnet (via Apache commons-net) or SSH (via JSch).
  • Renders the VT100 terminal screen to crawl the original posts.
  • Connects to PTT using the UTF-8 character set.
  • Supports multi-threaded crawling of posts.
  • [API] Also supports downloading PTT posts from the web version.
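To illustrate why terminal output needs processing before the post text is usable, here is a minimal sketch that removes ANSI/VT100 CSI escape sequences (colors, cursor moves) from a raw string. It is not PTTCrawler's renderer: the class name `AnsiStripper` and the method `strip` are hypothetical, and stripping is a simpler stand-in for real screen rendering.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PTTCrawler.
final class AnsiStripper {
    // A CSI sequence is ESC (0x1B) followed by '[', then optional parameter
    // bytes (0x30-0x3F), optional intermediate bytes (0x20-0x2F), and one
    // final byte (0x40-0x7E).
    private static final Pattern CSI =
            Pattern.compile("\u001B\\[[0-?]*[ -/]*[@-~]");

    // Returns the input with every CSI sequence removed.
    static String strip(String raw) {
        return CSI.matcher(raw).replaceAll("");
    }
}
```

Stripping discards cursor-movement commands, so text that the server draws out of order stays in arrival order. Rendering the screen, as the feature list describes, avoids that limitation.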

How to use

To crawl all posts in the Gossiping board, run the following command:

java -jar PTTCrawler.jar -u Username -p Password -b Gossiping [-m]

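where Username and Password are the account name and password you use to log in to PTT.
Use the -m flag to enable multi-threaded crawling.
Note: On boards where post numbers exceed 100,000, such as Gossiping, enable the new simplified cursor in your personal settings so that the post number is not covered by the full-width cursor.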
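As a rough illustration of how the flags in the command above could be read, the sketch below parses `-u`, `-p`, `-b`, and `-m` from an argument array. The class `CrawlerOptions` and its fields are hypothetical, and treating `-u`, `-p`, and `-b` as required is an assumption; the actual parsing in PTTCrawler may differ.

```java
// Hypothetical option holder for illustration only; not PTTCrawler's own code.
final class CrawlerOptions {
    final String username;
    final String password;
    final String board;
    final boolean multiThread;

    private CrawlerOptions(String username, String password,
                           String board, boolean multiThread) {
        this.username = username;
        this.password = password;
        this.board = board;
        this.multiThread = multiThread;
    }

    // Parses "-u <user> -p <password> -b <board> [-m]" in any order.
    static CrawlerOptions parse(String[] args) {
        String user = null;
        String pass = null;
        String board = null;
        boolean multi = false;
        for (int i = 0; i < args.length; i++) {
            switch (args[i]) {
                case "-u": user = valueAt(args, ++i); break;
                case "-p": pass = valueAt(args, ++i); break;
                case "-b": board = valueAt(args, ++i); break;
                case "-m": multi = true; break;
                default:
                    throw new IllegalArgumentException("Unknown option: " + args[i]);
            }
        }
        // Assumption: all three valued options are mandatory.
        if (user == null || pass == null || board == null) {
            throw new IllegalArgumentException("-u, -p and -b are required");
        }
        return new CrawlerOptions(user, pass, board, multi);
    }

    // Returns args[i], or fails if the preceding flag has no value.
    private static String valueAt(String[] args, int i) {
        if (i >= args.length) {
            throw new IllegalArgumentException("Missing value after " + args[i - 1]);
        }
        return args[i];
    }
}
```

With this sketch, `CrawlerOptions.parse(new String[] {"-u", "Username", "-p", "Password", "-b", "Gossiping", "-m"})` yields an object whose `board` is `Gossiping` and whose `multiThread` is `true`.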

Version

0.9.7

TODO

  • Parse post content into structured data.
  • Support crawling a list of multiple boards.

License

MIT