This repo contains a web crawler for detecting websites' data collection and sharing practices at scale. The analysis logic is based on the codebase of Privacy Pioneer. However, it is not necessary to work with this codebase unless you would like to make changes to the privacy analysis or other parts of the extension for your crawl. The crawler makes use of Selenium for browser automation.
The code in this repo is developed and maintained by the Privacy Pioneer team.
1. Research Publications
2. Analyzing Websites from Different Geographic Locations
3. Instructions for Creating a New VM on Google Cloud
4. Instructions for Setting up the Crawler on Windows
5. Instructions for Running the Crawler
6. Changing the Extension for a Crawl
7. Porting Changes from Privacy Pioneer to the Privacy Pioneer Web Crawler
8. Known Issues
9. Thank You!
You can find a list of our research publications in the Privacy Pioneer repo.
We are running our crawler in different geographic locations with the goal of investigating how websites react to privacy laws from different countries and regions. To access sites from different locations it is generally possible to use a VPN, Web Proxy, or VM. We provide instructions for setting up the crawler on a VM using Google Cloud. Using a cloud setup can mitigate some of the challenges we encountered with VPNs. We will also outline the steps to install the crawler locally. If you are not planning to crawl on the cloud, feel free to skip to the crawler setup.
This section is only for non-lab members. Lab members do not have to create new VM instances as they have already been created.
This section will outline the necessary steps to create a VM on Google Cloud. You will need to create a project in the Google console. Unless otherwise specified, leave each setting at its default value. Click the triangles next to the step number to see an example of what you should see at each step.
2. Choose the name, region, and zone for your machine. Your decision should reflect what location you would like to crawl from.
3. Select the appropriate type of machine you would like to use. If you are on the fence about whether or not your machine will be powerful enough, it is better to overestimate. We have had issues with weaker machines where Selenium stopped working when a machine ran out of memory.
4. Change the server boot disk to Windows. In theory, there is no reason why you could not run this crawler on a Linux server. However, we have not tested this, and we recommend the Windows route because you have easy access to a GUI. This makes checking if the crawler is operating as expected significantly easier.
6. Now that you have your server, click on the triangle next to "RDP" and select "Set Windows Password." Be sure to save these credentials somewhere safe as they will not be shown again.
You should now have a working Google Cloud VM. To connect to the VM, use the Remote Desktop Connection app on Windows, which should be installed by default. Provide the external IP, username, and password. After connecting, you should see the server desktop. Next, you will need to go through the crawler setup instructions.
Please make sure to stop any new instances you start. If you keep them running, they will continue to charge the lab's account.
Note: When crawling with multiple locations, you can avoid the hassle of setting up each VM individually by using a machine image.
The previous steps prepared you to deploy the crawler on the cloud. Now, we will set up the crawler itself. This process is identical locally and on the cloud.
To install the browser and crawler do the following:
- Install Firefox Nightly.
Important Note: While downloading the latest version of Nightly does work, testing the crawler has revealed that certain versions of Firefox Nightly break the ability to add monetization labels (mostly version 130+). Therefore, we recommend downloading the version we have linked above and disabling automatic updates. This will also help achieve more consistent results across different runs.
Note: In addition to using a specific version of Firefox Nightly, we will also be disabling the Enhanced Tracking Protection that Firefox provides us with. Besides just providing us with additional data, this will also help ensure that Privacy Pioneer is operating as expected.
- Then, clone this repo with:
git clone https://github.com/privacy-tech-lab/privacy-pioneer-web-crawler.git
If you want to make changes to the Privacy Pioneer extension for the crawl, check out our guide for changing Privacy Pioneer. If you want to use the extension as is, you can skip the guide.
We are using a MySQL database to store analysis results. To install and set up MySQL server and MySQL shell do the following:
- Once installed, open the MySQL Shell and run the following commands:
\connect root@localhost
Enter your MySQL root password. If you have not set this up yet, the shell should prompt you to create one. You can use a simple password like abc if you are just working with a local MySQL instance. Next, switch the shell over from JS to SQL mode.
\sql
- To set up the crawler for accessing the database via your root account, run the following in the MySQL Shell:
ALTER USER 'root'@'localhost' IDENTIFIED WITH 'mysql_native_password' BY 'abc';
FLUSH PRIVILEGES;
- If you do not have a password for MySQL and the MySQL Shell did not prompt you to create one, you can run the following command in the MySQL Shell:
ALTER USER 'root'@'localhost' IDENTIFIED BY 'abc';
FLUSH PRIVILEGES;
Next, we will set up the MySQL database. This is important because we need a place to store the evidence that Privacy Pioneer will collect. Interactions with the database will be managed by the scripts located in the rest-api directory.
- First, in the MySQL shell, create the database:
CREATE DATABASE analysis;
- Then, access it:
USE analysis;
- Lastly, create two tables where any evidence that Privacy Pioneer collects will be stored. The entries table contains any evidence that Privacy Pioneer is able to find. The allEv table contains all requests that Privacy Pioneer examined while the crawler was running.
CREATE TABLE entries (id INTEGER PRIMARY KEY AUTO_INCREMENT, timestp varchar(255), permission varchar(255), rootUrl varchar(255), snippet varchar(4000), requestUrl mediumtext, typ varchar(255), ind varchar(255), firstPartyRoot varchar(255), parentCompany varchar(255), watchlistHash varchar(255), extraDetail varchar(255), cookie varchar(255), loc varchar(255));
CREATE TABLE allEv (id INTEGER PRIMARY KEY AUTO_INCREMENT, rootUrl varchar(255), request text(100000));
You can now exit the MySQL shell.
- In the rest-api folder, create a new file called .env, and save the following to that file:
DB_CONNECTION=mysql
DB_HOST=localhost
DB_DATABASE=analysis
DB_USERNAME=root
DB_PASSWORD=abc
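To illustrate the format of the .env file above, here is a minimal sketch of how such KEY=VALUE lines could be read into a configuration object. The helper `parseEnv` is hypothetical and not part of this repo; the scripts in rest-api may load the file differently (for example, through a library).

```javascript
// Hypothetical helper (not part of the repo): parse KEY=VALUE lines,
// such as those in rest-api/.env, into a plain object.
function parseEnv(text) {
  const config = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue; // skip blanks and comments
    const eq = line.indexOf("=");
    if (eq === -1) continue; // ignore lines that are not KEY=VALUE
    // Split on the first "=" only, so values may themselves contain "=".
    config[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return config;
}

// The values below are the ones from the setup instructions.
const sampleEnv = [
  "DB_CONNECTION=mysql",
  "DB_HOST=localhost",
  "DB_DATABASE=analysis",
  "DB_USERNAME=root",
  "DB_PASSWORD=abc",
].join("\n");

const dbConfig = parseEnv(sampleEnv);
console.log(dbConfig.DB_DATABASE);
```

Each setting must be on its own line for a parser like this to pick it up.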
Lastly, manually set the ZIP code and the GPS coordinates that you will be crawling from. While the Privacy Pioneer extension is able to automatically infer users' locations from their IP addresses, this approach proved to be error-prone when performing automated crawls from multiple locations.
You can make the change by opening the local crawler script local-crawler.js and modifying the following values:
const TARGET_LAT = 41.5569; // replace this value with your intended latitude
const TARGET_LONG = -72.6652; // replace this value with your intended longitude
const TARGET_ZIP = "06457"; // replace this value with your intended ZIP code (note that it must be a string)
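Before starting a long crawl, it may help to sanity-check these three values. The following is a sketch only; `validateLocation` is a hypothetical helper that does not exist in local-crawler.js, and it checks nothing beyond the coordinate ranges and the requirement that the ZIP code be a string.

```javascript
// Hypothetical helper (not part of the repo): returns a list of problems
// found in the hard-coded location values; an empty list means they look fine.
function validateLocation(lat, long, zip) {
  const problems = [];
  if (typeof lat !== "number" || !Number.isFinite(lat) || lat < -90 || lat > 90) {
    problems.push("latitude must be a number between -90 and 90");
  }
  if (typeof long !== "number" || !Number.isFinite(long) || long < -180 || long > 180) {
    problems.push("longitude must be a number between -180 and 180");
  }
  if (typeof zip !== "string" || zip.length === 0) {
    problems.push("ZIP code must be a non-empty string");
  }
  return problems;
}

// Example values taken from the instructions above.
const locationProblems = validateLocation(41.5569, -72.6652, "06457");
console.log(locationProblems.length === 0 ? "location values look fine" : locationProblems);
```

Keeping the ZIP code as a string matters because a numeric literal would drop the leading zero in a value such as "06457".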
Now, it is time to run the crawler:
- First, using the terminal, go to the privacy-pioneer-web-crawler/rest-api directory. Run either:
npm install
node index.js
or
npm install
npm start
- Second, in another instance of the terminal, go to the privacy-pioneer-web-crawler/selenium-crawler directory. Run either:
npm install
node local-crawler.js
or
npm install
npm start
The crawler should now be running. You will know the crawler is running when an instance of Firefox Nightly opens up and your desktop looks as follows:
Note: The sites the crawler will crawl are in the privacy-pioneer-web-crawler/selenium-crawler/crawl-lists directory of this repo.
We also provide the user with optional arguments to speed up certain tasks. These are extra arguments passed via the command line with examples below. All of these arguments can be combined or used by themselves.
- test - Launches the crawler in Test Mode. With this mode enabled, the crawler will only crawl one site (the first in the list) and stay on the site indefinitely. This version will also launch with the Firefox DevTools opened by default, thereby simplifying ground truth analysis.
npm start test
- state/country_name - Supplying a location name from the list provided in /selenium-crawler/constants.js will automatically select the correct hard-coded values for you.
npm start iowa
- site=[index] - The index from the crawl list that you'd like to start from.
npm start site=99 # This would start the crawl on the 100th site in the list
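The three crawler arguments above could be interpreted roughly as in the following sketch. This is not the parsing code from local-crawler.js; `parseCrawlerArgs` and the `knownLocations` list are hypothetical stand-ins (the real location names live in /selenium-crawler/constants.js), and the default start index of 0 is an assumption.

```javascript
// Hypothetical sketch (not the repo's code): interpret the optional
// command-line arguments described above.
function parseCrawlerArgs(args, knownLocations) {
  const result = { testMode: false, location: null, startIndex: 0 };
  for (const arg of args) {
    if (arg === "test") {
      result.testMode = true; // crawl only the first site and stay there
    } else if (arg.startsWith("site=")) {
      const index = Number.parseInt(arg.slice("site=".length), 10);
      // Indices are zero-based: site=99 starts at the 100th site.
      if (Number.isInteger(index) && index >= 0) result.startIndex = index;
    } else if (knownLocations.includes(arg)) {
      result.location = arg; // selects the hard-coded values for this location
    }
  }
  return result;
}

// In a real script, the arguments would come from process.argv.slice(2).
const parsedArgs = parseCrawlerArgs(["test", "iowa", "site=99"], ["iowa"]);
console.log(parsedArgs);
```

Because each argument is handled independently of the others, they can be combined in any order, which matches the behavior described above.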
Here are some helpful arguments you can pass when starting the backend.
- wipe - Truncates the SQL tables associated with the API. Useful when you want to quickly delete all data in your database without doing so manually.
npm start wipe
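As a rough picture of what the wipe argument amounts to, the sketch below builds the TRUNCATE statements for a list of tables. `buildWipeStatements` is a hypothetical helper, and the assumption that the affected tables are exactly entries and allEv (the two created during setup) is ours; the actual backend code may differ.

```javascript
// Hypothetical sketch (not the repo's code): build the SQL statements
// that a "wipe" argument would correspond to.
function buildWipeStatements(args, tables) {
  if (!args.includes("wipe")) return []; // nothing to do without the flag
  return tables.map((table) => `TRUNCATE TABLE ${table};`);
}

// Assumption: the tables are the two created in the database setup.
const wipeStatements = buildWipeStatements(["wipe"], ["entries", "allEv"]);
console.log(wipeStatements.join("\n"));
```

Truncating removes all rows but keeps the table definitions, so the crawler can write to the same tables on the next run.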
In case you should need it, here are the steps to make changes to Privacy Pioneer that will be reflected when you perform your crawl.
- Clone the Privacy Pioneer repo and make any changes that you would like to the local files.
Note: If you change Privacy Pioneer and make your own version of the crawler, then you will need to remember to enable "crawl mode" within the extension source code. The instructions for doing so can be found in the comments of Privacy Pioneer's background.js. The gist is that you will need to set the flag IS_CRAWLING to true. If you are testing changes to the crawler, you will also need to set the IS_CRAWLING_TESTING flag to true. IS_CRAWLING will enable posting to the entries table as well as enabling the aforementioned manual location override. IS_CRAWLING_TESTING will enable posting to the allEv table. These flags are necessary so that the functionality for setting the location data and recording crawl data is enabled.
- Once the changes have been made, run from within the privacy-pioneer directory:
npm run build
- Navigate to the newly made dev directory.
- In the manifest.json file, add the following code at the bottom (within the JSON). Firefox will not let you add an extension without this ID.
"browser_specific_settings": { "gecko": { "id": "{daf44bf7-a45e-4450-979c-91cf07434c3d}" } }
- Within the dev directory, send all the files to a zip file.
- Rename the file extension from .zip to .xpi. Functionally, these files will behave the same. The xpi format is the format that Firefox uses to load an extension.
- Place this new file into the selenium-crawler directory, and modify the crawler accordingly. Make sure that the aforementioned local-crawler.js file is looking for the correct extension, i.e., that
.addExtensions("ext.xpi");
is pointing to the right xpi file.
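The manifest.json step above can also be expressed programmatically. The following sketch adds the ID to a manifest object; `addGeckoId` is a hypothetical helper, and `sampleManifest` is a placeholder rather than the real Privacy Pioneer manifest. In practice you would read and write dev/manifest.json yourself.

```javascript
// The extension ID given in the instructions above.
const GECKO_ID = "{daf44bf7-a45e-4450-979c-91cf07434c3d}";

// Hypothetical helper (not part of the repo): return a copy of the manifest
// with browser_specific_settings.gecko.id set, preserving any other settings.
function addGeckoId(manifest, id) {
  const existing = manifest.browser_specific_settings;
  return {
    ...manifest,
    browser_specific_settings: {
      ...existing,
      gecko: { ...existing?.gecko, id },
    },
  };
}

// Placeholder manifest for illustration only.
const sampleManifest = { manifest_version: 2, name: "Example", version: "1.0" };
const patchedManifest = addGeckoId(sampleManifest, GECKO_ID);
console.log(JSON.stringify(patchedManifest, null, 2));
```

Placing the key at the top level of the JSON object is what "within the JSON" refers to; adding it outside the outermost braces would make the file invalid.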
The Privacy Pioneer Web Crawler makes use of Privacy Pioneer's extension codebase. Thus, when we make changes to the extension, it may be necessary to port those to the crawler. This is particularly true for changes to the privacy analysis functionality. Porting such functionality is a manual process. It requires manual recompilation of the extension and replacement of the old extension file, i.e., privacy-pioneer-web-crawler/selenium-crawler/ext.xpi. If you make any changes to the extension code that should be reflected in the crawler, follow the steps outlined below. Also, check the previous section in this readme for additional assistance.
- After you have made any changes to Privacy Pioneer's extension codebase that should be ported to the crawler, be sure to switch the appropriate Boolean flag for the desired crawl mode. These flags are located in /src/background/background.js.
  - If you only want to record all evidence data, then set IS_CRAWLING to true.
  - If you also want to record all requests that Privacy Pioneer looks at, set IS_CRAWLING_TESTING to true as well.
- Next, run the following command from within the privacy-pioneer folder:
npm run build
- Navigate to the newly created dev directory. Within this directory, there should be a new manifest.json file. Open this file and paste the following code (within the JSON). Firefox will not allow you to add an extension to the crawler without this ID.
"browser_specific_settings": { "gecko": { "id": "{daf44bf7-a45e-4450-979c-91cf07434c3d}" } }
- Take all of the files within the dev directory and add them to a zip file. You can either do this through a GUI or through the command line. If done through a GUI, make sure that you change the file format from .zip to .xpi. Functionally, these files will behave the same. The xpi format is the format that Firefox uses to load an extension. Here is the relevant code:
zip -r ext.xpi ./
- Now, to update the extension file in the crawler repo, place your new ext.xpi file into the privacy-pioneer-web-crawler/selenium-crawler/ directory and create a pull request as necessary.
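To summarize how the two flags from the first step relate to the database tables, here is a small sketch. `tablesForCrawlMode` is a hypothetical helper and not code from background.js; it only restates the mapping described in this readme (IS_CRAWLING posts to entries, IS_CRAWLING_TESTING posts to allEv), and how the extension behaves with only the testing flag set is not covered here.

```javascript
// Hypothetical sketch (not from background.js): which tables receive data
// for a given combination of the two crawl flags.
function tablesForCrawlMode(isCrawling, isCrawlingTesting) {
  const tables = [];
  if (isCrawling) tables.push("entries"); // evidence found by Privacy Pioneer
  if (isCrawlingTesting) tables.push("allEv"); // all requests examined
  return tables;
}

console.log(tablesForCrawlMode(true, false)); // evidence only
console.log(tablesForCrawlMode(true, true)); // evidence and all requests
```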
We are aware of various issues:
Our use of Google Cloud was primarily motivated by this issue. As described in the Privacy Pioneer repo, the extension is meant to find evidence of location elements being collected and shared. However, when using a VPN (or any service without a static IP), it becomes nearly impossible for Privacy Pioneer to find evidence of GPS locations and ZIP codes. Because of how Privacy Pioneer determines the user's location, there will almost certainly be a discrepancy between where Privacy Pioneer thinks the user is and where a website thinks the user is. Since these features are built into the extension, it would be difficult to make Privacy Pioneer work with a VPN crawl without significant changes to the architecture. Thus, we have opted to hard-code the latitude, longitude, and ZIP code for our crawls. For instructions on how to do this, check the crawler setup.
Currently, the only way to actually see the GUI is through the Remote Desktop Connection app on Windows.
If the crawler fails to start, simply try running it again. Firefox Nightly is updated often, which can cause it to be unstable and crash on the first boot-up. Try running the crawler in privacy-pioneer-web-crawler/selenium-crawler again.
If the crawler uses a browser that's in .cache/selenium/... instead of the binary specified in code, try to manually set firefoxBrowserPath = "C:/Program Files/Firefox Nightly/firefox.exe" at around line 614 in the createSession function in selenium-crawler/node_modules/selenium-webdriver/firefox.js.
If you encounter an issue that has not been described, try to identify if it is coming from Selenium. To accomplish this, look at any error messages in the terminal that is running in selenium-crawler. Make sure that you are connected to the Internet, both the crawler and extension are running, and that the crawler looks as shown above.
Feel free to open an issue or contact us via email at [email protected].
We would like to thank our supporters!
Major financial support provided by Google.
Additional financial support provided by Wesleyan University and the Anil Fernando Endowment.
Conclusions reached or positions taken are our own and not necessarily those of our financial supporters, their trustees, officers, or staff.