
Pinjasaur/tfc




text file crawler (tfc)

To quench my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files: robots.txt, humans.txt, and security.txt.

Given a domains.txt file containing one domain per line, the Node.js script fires off a request for each of the files on every domain. Because network I/O is the bottleneck, this can take a while.
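As a rough illustration of that input step, the sketch below turns the contents of a domains.txt into a list of request URLs. The helper names (parseDomains, buildUrls), the use of plain HTTP, and the assumption that every file lives at the site root are all hypothetical; the actual script may do this differently.

```javascript
// Hypothetical helpers; not taken from the tfc source.
const FILES = ["robots.txt", "humans.txt", "security.txt"];

// Turn the raw contents of domains.txt into a clean list of domains,
// dropping blank lines and surrounding whitespace.
function parseDomains(text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

// Build one URL per (domain, file) pair.
// Assumption: plain HTTP and files served from the site root.
function buildUrls(domains) {
  return domains.flatMap((domain) =>
    FILES.map((file) => `http://${domain}/${file}`)
  );
}

const urls = buildUrls(parseDomains("example.com\n\nexample.org\n"));
console.log(urls);
```

With three files per domain, the request count grows as three times the number of domains, which is part of why a large list takes a while.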

NOTE: This script isn't particularly efficient in terms of memory usage. If you run out of memory, pass the --max-old-space-size flag like so: node --max-old-space-size=4096 tfc.

Redirects are capped at 20, and validity is based on the HTTP status code, the Content-Type header, and the first few values of the response data. Once the run completes, the statistics are printed. Valid text files are written to files/, which is created & wiped for you each time the script starts.
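To make the validity idea concrete, here is a minimal sketch of a check that combines those three signals. The function name looksLikeTextFile and the specific rules (status 200, a text/plain MIME type, and a body that does not open with markup) are assumptions for illustration, not the script's actual criteria.

```javascript
// Hypothetical validity check; the real script's rules may differ.
function looksLikeTextFile(status, contentType, body) {
  // Signal 1: only a plain 200 counts as a hit.
  if (status !== 200) return false;

  // Signal 2: accept only text/plain, ignoring parameters such as charset.
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  if (mime !== "text/plain") return false;

  // Signal 3: peek at the start of the body and reject anything that
  // opens like an HTML page (a common "soft 404").
  const head = body.slice(0, 64).trimStart();
  return head.length > 0 && !head.startsWith("<");
}

console.log(looksLikeTextFile(200, "text/plain; charset=utf-8", "User-agent: *"));
console.log(looksLikeTextFile(200, "text/plain", "<!doctype html><html>"));
```

Checking the body as well as the headers matters because some servers answer every path with a 200 and a generic page.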

If you're interested in a write-up about this along with the metrics, you should check out my article.

Usage

Create a domains.txt, either by writing your own or by symlinking one of the provided lists:

ln -s domains-faang.txt domains.txt

Then, grab the dependencies & start it up:

npm install && npm start

Some requests never receive a response and hang indefinitely. If it's been a while, just Ctrl + C the process, which will print out the stats before exiting.
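The Ctrl + C behavior can be sketched with a SIGINT handler along these lines. The stats object, its fields, and formatStats are hypothetical stand-ins; the real script tracks and prints its own metrics.

```javascript
// Hypothetical counters; the real script's statistics will differ.
const stats = { requested: 0, valid: 0 };

// Render the counters as a single summary line.
function formatStats(s) {
  return `requested: ${s.requested}, valid: ${s.valid}`;
}

// Ctrl + C sends SIGINT. Print whatever has been collected so far,
// then exit explicitly, since registering a handler replaces Node's
// default exit-on-SIGINT behavior.
process.on("SIGINT", () => {
  console.log(formatStats(stats));
  process.exit(0);
});
```

Printing on interrupt means a run with a few hung requests still yields usable numbers.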

Thanks

David. Jeff.

License

MIT.

About

Crawl {robots,humans,security}.txt files
