text file crawler (tfc)

To quench my curiosity, I wanted to gauge the usage & adoption of the following pseudo-standard text files:

Given a domains.txt file containing one domain per line, the Node.js script fires off a request for each of the files on every domain. Because network I/O is the bottleneck, this can take a while.
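As a rough illustration of the fan-out described above, here is a minimal sketch that turns the contents of a domains.txt into a list of request URLs. The helper name `buildUrls`, the `http://` scheme, and the placeholder file name `example.txt` are assumptions made for this example; they are not taken from tfc.js.

```javascript
// Hypothetical helper (not from tfc.js): expand each domain into one URL
// per text file of interest. The http:// scheme is an assumption.
function buildUrls(domainsText, fileNames) {
  const domains = domainsText
    .split(/\r?\n/)                      // one domain per line
    .map((line) => line.trim())
    .filter((line) => line.length > 0);  // skip blank lines

  const urls = [];
  for (const domain of domains) {
    for (const name of fileNames) {
      urls.push(`http://${domain}/${name}`);
    }
  }
  return urls;
}

// 'example.txt' is a placeholder; substitute the files you want to probe.
const urls = buildUrls('example.com\nexample.org\n', ['example.txt']);
console.log(urls);
```

The number of requests is the number of domains times the number of file names, which is why the larger domain lists take so long to finish.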

NOTE: This script isn't particularly memory-efficient. If you run out of memory, pass the --max-old-space-size flag like so: node --max-old-space-size=4096 tfc.

Redirects are capped at 20, and validity is based on the HTTP status code, the Content-Type, and the first few values of the response data. Once the crawl completes, the statistics are printed out. Valid text files are written to files/, which is created & wiped for you each time the script starts.
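To make the three validity signals concrete, here is a minimal sketch of such a check. The function name `looksLikeTextFile`, the 100-character window, and the specific rules (status 200, a `text/plain` Content-Type, no leading HTML markup) are assumptions for illustration; the actual checks in tfc.js may differ.

```javascript
// Hypothetical validity check (not from tfc.js); every rule here is an assumption.
function looksLikeTextFile(statusCode, contentType, body) {
  // 1. HTTP status code: only a plain success counts.
  if (statusCode !== 200) return false;

  // 2. Content-Type: accept text/plain, with or without a charset parameter.
  if (!/^text\/plain\b/i.test(contentType || '')) return false;

  // 3. Start of the body: reject HTML pages served in place of the file
  //    (for example, a "soft 404"), and reject empty responses.
  const head = String(body).slice(0, 100).trimStart().toLowerCase();
  if (head.startsWith('<!doctype') || head.startsWith('<html')) return false;
  return head.length > 0;
}

console.log(looksLikeTextFile(200, 'text/plain; charset=utf-8', 'User-agent: *'));
console.log(looksLikeTextFile(200, 'text/plain', '<!DOCTYPE html><html>'));
```

Checking the body as well as the headers matters because some servers answer every path with a 200 and an HTML page.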

If you're interested in a write-up about this along with the metrics, you should check out my article.

Usage

Create a domains.txt, either by writing your own or by symlinking one of the provided lists:

ln -s domains-faang.txt domains.txt

Then, grab the dependencies & start it up:

npm install && npm start

Some requests never receive a response & hang indefinitely. If it's been a while, just Ctrl + C the process, which will print out the stats before exiting.
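The Ctrl + C behavior can be sketched with a SIGINT handler that prints the stats and then exits. The `stats` object shape and the `formatStats` helper below are hypothetical; they only illustrate the pattern and are not copied from tfc.js.

```javascript
// Hypothetical counters; the real script's stats shape may differ.
const stats = { requested: 0, valid: 0 };

// Format the counters as a one-line summary with a rounded percentage.
function formatStats({ requested, valid }) {
  const pct = requested === 0 ? 0 : (valid / requested) * 100;
  return `${valid}/${requested} valid (${pct.toFixed(2)}%)`;
}

// Ctrl + C sends SIGINT; print what we have so far, then exit.
process.on('SIGINT', () => {
  console.log(formatStats(stats));
  process.exit(0);
});
```

Because the handler calls `process.exit` itself, any requests still hanging are abandoned once the summary has been printed.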

Thanks

David. Jeff.

License

MIT.

