This repository contains artifacts for our IMC 25 paper "Somesite I Used to Crawl". It includes code that parses robots.txt and data supporting our analysis of robots.txt adoption, web crawlers' behaviors, and active blocking.
robotstxt-parser/ contains our wrapper for Google's robots.txt parser and code for interpreting the parser's results.
historical-analysis/ contains code and data related to Section 3 of the paper (longitudinal analysis of the adoption of robots.txt).
respect-robots-txt/ contains code and data related to Section 5 of the paper (whether AI crawlers respect robots.txt), as well as a bonus experiment on Perplexity-User conducted after the paper was accepted.
active-blocking/ contains code and data related to Section 6 of the paper (deployment of active blocking).
Tested on Python 3.10.
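For readers unfamiliar with what a robots.txt check involves, here is a minimal sketch of asking whether a given crawler may fetch a URL. It is not the wrapper in robotstxt-parser/: it uses Python's standard-library urllib.robotparser rather than Google's parser, and the two may disagree on some inputs (for example, in how they resolve overlapping rules). The helper name is_allowed, the sample robots.txt contents, and the example.com URLs are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `user_agent` may fetch `url` under `robots_txt`.

    Uses the standard-library parser, NOT Google's parser, so results
    may differ from those produced by the code in robotstxt-parser/.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        for url in ("https://example.com/index.html",
                    "https://example.com/private/data.html"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Parsing from a string rather than fetching a live file keeps the sketch deterministic and avoids sending any requests to a real site.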
Please report any issues through GitHub Issues or email us directly :)
For automated interaction with ChatGPT's web interface, we used this repository. Other similar projects exist (e.g., this).
@inproceedings{liu2025somesite,
title={Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers},
author={Liu, Enze and Luo, Elisa and Shan, Shawn and Voelker, Geoffrey M and Zhao, Ben Y and Savage, Stefan},
booktitle={Proceedings of the 2025 ACM Internet Measurement Conference (IMC 25)},
year={2025}
}