Update the Crawler Used in New Paper
This updated version of the crawler was used for the paper. We added length restrictions on rootUrl, requestUrl, and snippet before they are written to the SQL database, in order to prevent insert errors. We also updated the crawl lists to the versions used for this crawl.
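
The length restriction described above can be sketched as a small standalone helper. This is only an illustration, not the code in this commit: `truncateFields` and `MAX_LENGTHS` are hypothetical names, and the limits are inferred from the thresholds in the `rest-api/index.js` hunk (254 and 3999 characters), since the actual column sizes in the schema are not shown here.

```javascript
// Hypothetical limits, inferred from the thresholds used in rest-api/index.js.
// The real column definitions are an assumption; adjust to match the schema.
const MAX_LENGTHS = { rootUrl: 254, requestUrl: 3999, snippet: 3999 };

// Return a shallow copy of `entry` with over-long string fields cut down
// to their limit. Missing, null, and non-string fields are left untouched.
function truncateFields(entry, limits = MAX_LENGTHS) {
  const out = { ...entry };
  for (const [field, max] of Object.entries(limits)) {
    const value = out[field];
    if (typeof value === "string" && value.length > max) {
      out[field] = value.substring(0, max);
    }
  }
  return out;
}
```

Calling `truncateFields(e)` before building the INSERT parameters would apply all three limits in one place, so a new limited column only needs one more entry in the table.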
atlasharry committed Jan 10, 2025
1 parent e73bd17 commit 9de9bef
Showing 15 changed files with 6,325 additions and 1,589 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -86,7 +86,7 @@ To install the browser and crawler do the following:

1. Install [Firefox Nightly](http://ftp.mozilla.org/pub/firefox/nightly/2024/01/2024-01-01-23-15-40-mozilla-central/).

- **Important Note**: While downloading the [latest version](https://www.mozilla.org/en-US/firefox/channel/desktop/) of Nightly does work, testing the crawler has revealed that certain versions of Firefox Nightly break the ability to add monetization labels. We recommend downloading the version we have linked above and [disabling automatic updates](https://winaero.com/disable-updates-firefox-63-above/). This will also help achieve more consistent results across different runs.
+ **Important Note**: While downloading the [latest version](https://www.mozilla.org/en-US/firefox/channel/desktop/) of Nightly does work, testing the crawler has revealed that certain versions of Firefox Nightly break the ability to add monetization labels (mostly version 130+). Therefore, we recommend downloading the version we have linked above and [disabling automatic updates](https://winaero.com/disable-updates-firefox-63-above/). This will also help achieve more consistent results across different runs.

**Note**: In addition to using a specific version of Firefox Nightly, we will also be disabling the [Enhanced Tracking Protection](https://support.mozilla.org/en-US/kb/enhanced-tracking-protection-firefox-desktop) that Firefox provides us with. Besides just providing us with additional data, this will also help ensure that Privacy Pioneer is operating as expected.

11 changes: 11 additions & 0 deletions rest-api/index.js
@@ -82,6 +82,17 @@ async function rest(table) {
var extraDetail = e.extraDetail;
var cookie = e.cookie;
var loc = e.loc;

+ if (rootUrl && rootUrl.length >= 255) {
+   rootUrl = rootUrl.substring(0, 254);
+ }
+ if (requestUrl && requestUrl.length >= 4000) {
+   requestUrl = requestUrl.substring(0, 3999);
+ }
+ if (snippet && snippet.length >= 4000) {
+   snippet = snippet.substring(0, 3999);
+ }

// console.log("posting to analysis...");
connection.query(
"INSERT INTO ??.?? (timestp, permission, rootUrl, snippet, requestUrl, typ, ind, firstPartyRoot, parentCompany, watchlistHash, extraDetail, cookie, loc) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)",