
Test crawler performance #9

Open
SebastianZimmeck opened this issue Dec 1, 2023 · 70 comments
Labels: omnibus (An issue that covers multiple connected (smaller) sub-issues), testing (An issue related to testing)

@SebastianZimmeck
Member

Before we start the crawl, we need to test the crawler's performance. So, we need to compare the manually observed ground truth with the analysis results. We probably need a 100-site test set.

  • How do we select the test set sites given the different locations and states (issue Create Manually Curated List of Sites to Crawl #7) so that we have good test coverage?
  • One issue is that different loads of a site may lead to different trackers, etc., being detected. So, we need to look at the ground truth and analysis results for exactly the same site load. So, maybe, just load one site, get both ground truth and analysis results, and check?
  • We need to document all of that.

(@JoeChampeau and @jjeancharles feel free to participate here as well.)

@SebastianZimmeck added the testing label (An issue related to testing) on Dec 1, 2023
@SebastianZimmeck
Member Author

Where are we with the testing protocol, @danielgoldelman?

@danielgoldelman
Collaborator

danielgoldelman commented Jan 11, 2024

Preliminary testing protocol

  1. Run the crawl, collect the data.

  2. Separate the PP data from the entries data.

For PP data:

  1. Create a spreadsheet for each root URL.

  2. Log every piece of data into the spreadsheet with everything PP gives us, separated by PP data type.

For all HTTP request data:

  1. Create a spreadsheet for each root URL.

  2. Do the most generic string matching with the values we are looking for. Note: we will have lists of keywords per VPN, we can get the ipinfo location while using the VPN by going to their site, and we can find monetization labels within the HTTP requests. For example, if the ZIP code should be 10001, instead of a regex of \D10001\D, we look for just the string 10001. For every single key we could be looking for, we run it on the HTTP requests gathered. Collate these possible data-stealing requests (see the sketch after this list).

  3. Go through every HTTP request and label it, adding to the spreadsheet when necessary.
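To make the generic string-matching step concrete, here is a minimal JavaScript sketch; the data shapes and names (requests, keywords, findCandidateRequests) are illustrative assumptions, not our actual tooling:

// `requests` is assumed to be an array of { url, body } objects pulled from the crawl data,
// and `keywords` the per-VPN values we look for (e.g., the raw ZIP string "10001").
function findCandidateRequests(requests, keywords) {
  return requests.filter(({ url, body }) =>
    keywords.some((kw) => url.includes(kw) || (body || "").includes(kw))
  );
}

// Example: a request carrying the raw ZIP string is flagged even without a stricter regex.
const hits = findCandidateRequests(
  [{ url: "https://a.example/track?zip=10001", body: "" }],
  ["10001"]
);
console.log(hits.length); // 1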

Now to bring both together:


  1. We have the two spreadsheet documents now. Time to classify.

  2. Potentially in a new spreadsheet, place all HTTP requests that occur in both the PP data and the full HTTP request data first, then all that only occurred in PP, then all that only occurred in the HTTP requests.
  3. Perform classification.

@SebastianZimmeck
Member Author

@danielgoldelman, can you reformat and reorder your comment? The order is very hard to follow, there are multiple numbers 1 and 2 after each other, etc.

@danielgoldelman
Collaborator

@SebastianZimmeck sorry, the original comment was written on GitHub mobile, so formatting was hard to check. Changes made above.

@SebastianZimmeck
Member Author

@danielgoldelman and @dadak-dom, please carefully read sections 3.2.2 and 3.3 (in particular, 3.3.2) of our paper. We can re-use much of the approach there. I do not think that we need an annotation of the ground truth, but both of you should check the ground truth (for whatever our definition of ground truth is) and come to the same conclusion.

We have to create a testing protocol along the lines of the following:

  1. Select the set of analysis functionality that we are testing and how

    • By default all analysis functionalities
    • But how are we going to test for keywords, for example? How for email addresses, how for phone numbers, ...?
  2. Pick a set of websites to test

    • How many? Probably 100 to 200. We need some reasonable standard deviation. For example, it is meaningless to test a particular analysis functionality for just one site because a successful test would not allow us to extrapolate and claim that we are successful for, say, 1,000 sites in our crawl set with that functionality. So, we need, say, 10 sites successfully analyzed to make that claim. Can you solidify that? What is the statistical significance of 10 sites? @JoeChampeau can help with the statistics. We should have some statistical power along the lines of "with 95% confidence our analysis of latitude is within the bounds of a 10% error rate" (e.g., if we detect 1,000 sites having a latitude, with 95% confidence the real result is between 900 and 1,100 sites). (A quick numeric sketch follows at the end of this comment.)
    • Which sites to select? Again, the selected set should allow us to make the claim that if an analysis functionality works properly on the test set, it also works for the large set of sites that we crawl. So, we would need to pick a diverse set of sites covering every analysis functionality for each region that we cover. There should be no bias. For example, there will be no problems for monetization categories because they occur so frequently, but how do we ensure, e.g., that there is a meaningful number of sites that collect latitudes? Maybe, pick map sites from somewhere? How do we pick sites for keywords (assuming we are analyzing keywords)?
    • Are we using the same test set of sites for each country/state? Yes, no, is some overlap OK, is it harmful, is it good, ...?
    • How are we selecting sites randomly? Use random.org.
    • We can't select any sites that we used for preliminary testing, i.e., validation. So, which are the sites, if any, that need to be excluded? If we randomly select an excluded site, how do we pick a new one? Maybe, just the next one on a given list.
  3. Running the test

    • Are we testing one site at a time or running the complete test set? If we run the complete test set, we need to record all site data (and be absolutely sure that there are no errors and nothing omitted in the recording). We need to get both the analysis results and the ground truth data at the same time. The reason is that when we load a site multiple times, there is a good chance that not all trackers and other network connections are identical for both loads. So, the analysis results could diverge from the ground truth if the latter is based on a different load. We need to check the ground truth for the exact site load from which we got the analysis results. The alternative to a complete test set crawl is to do the analysis for one site at a time, i.e., visit a site, record the PP analysis results, use browser developer tools (and other tools, as necessary) to check the ground truth, record the evidence, record the ground truth evidence and result, then analyze the next site, and so on. So, we would be doing multiple crawls of one site.
    • We will also need to change the VPN for every different location.
    • Who is going to run the test? @JoeChampeau has the computer. Is it you, @dadak-dom or @danielgoldelman? Both the PP analysis results and the ground truth should be checked by two people independently. This seems easier if only one test set crawl is done as opposed to the site-by-site approach.
  4. Ground truth analysis

    • How do we analyze the ground truth? Per your comment above, @danielgoldelman, I take it that we do string matching in HTTP messages. Is that a reliable indicator? Maybe we would also need to look at, say, browser API usage for latitude, i.e., the browser prompting the user to allow location access. What are the criteria to reliably analyze the ground truth? This can be different for our different functionalities.

These questions cannot be answered in the abstract. @danielgoldelman and @dadak-dom, please play around with some sites for each analysis functionality and come up with a protocol to analyze it. For each functionality you need to be convinced that you can reliably identify true positives (and exclude false positives and false negatives). In other words, please do some validation tests.
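(Not part of the protocol itself, just to make the statistical-power point above concrete: a minimal JavaScript sketch of a normal-approximation confidence interval for a detection rate measured on n test sites; a proper analysis might instead use Wilson or exact binomial intervals.)

function confidenceInterval(successes, n, z = 1.96) { // z = 1.96 for ~95% confidence
  const p = successes / n;
  const margin = z * Math.sqrt((p * (1 - p)) / n);
  return [Math.max(0, p - margin), Math.min(1, p + margin)];
}

console.log(confidenceInterval(9, 10));   // ~[0.71, 1.00] -- 10 sites give a very wide interval
console.log(confidenceInterval(90, 100)); // ~[0.84, 0.96] -- 100 sites are much tighter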

@dadak-dom
Collaborator

Who is going to run the test?

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me? I thought it would make sense since that's the computer that we will use to run the actual crawl. That way, we could avoid any potential issues arising when switching between Windows and Mac. Just a thought.

@SebastianZimmeck , the way I understand it, we will end up with three different site lists for each country (please correct me if I'm wrong)

  1. Validation (what Daniel and I are doing now)
  2. Test set (what we're preparing for and will soon be running)
  3. The actual crawl list.

We cannot have any overlap between the validation and the test set, but can the test set (and/or the validation) be derived from the actual crawl list? I would need to know this before I start making any lists for the test set.

@SebastianZimmeck
Member Author

Would it make sense if @JoeChampeau runs the test, and then hands the data over to Daniel and me?

It certainly makes sense, but that would depend on if @JoeChampeau has time as the task was originally @danielgoldelman's. (Given our slow speed, the point may more or less resolve itself since we will be all back on campus soon anyways.)

(please correct me if I'm wrong)

All correct.

but can the test set (and/or the validation) be derived from the actual crawl list?

Yes, the validation and test set can be derived from the crawl list.

@dadak-dom
Collaborator

I have added my proposed crawl testing lists to the branch connected with this issue (issue-9). Here was my procedure:

  1. For each country that we will crawl, create a new .csv file.
  2. Go to random.org and have it generate a list of random integers from 1 to 525.
  3. Take the first six integers and find the matching URLs from the general list.
  4. Regenerate the random integers and find the six matching URLs from the country-specific list.
  5. If there seems to be a bias for one functionality, throw the list out and try again. (Or if there is any overlap with sites that were used for validation; luckily, this was never the case for me.)
  6. Repeat the process for each location we will crawl, so ten times total.

With point 5 I tried my best to include a fair share of sites that take locations, as monetization was easy to come by. @SebastianZimmeck let me know if any changes need to be made.
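To document the selection logic from the list above in a reproducible form, here is an illustrative JavaScript sketch (random.org was used in practice; the names and example URLs below are assumptions for illustration only):

// Draw `count` sites without replacement, skipping anything used for validation.
function pickSites(candidates, excluded, count) {
  const pool = candidates.filter((url) => !excluded.has(url));
  const picked = [];
  while (picked.length < count && pool.length > 0) {
    const i = Math.floor(Math.random() * pool.length);
    picked.push(pool.splice(i, 1)[0]);
  }
  return picked;
}

const testSites = pickSites(
  ["a.example", "b.example", "c.example", "d.example"],
  new Set(["b.example"]), // validation sites are excluded
  2
);
console.log(testSites);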

@SebastianZimmeck
Member Author

OK, sounds good!

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

With point 5 I tried my best to include a fair share of sites that take locations

How did you make the guess that a site takes locations?

@dadak-dom
Collaborator

So, our test set has a total of 120 sites? For each of the 10 countries/states 6 sites from the general list and 6 from the country-specific list.

Yes, 120 sites total.

How did you make the guess that a site takes locations?

A couple of ways, e.g. visiting the site and seeing if it requests the location from the browser, or if PP detects a location, or if I know from my own browsing that the site would take locations.

@SebastianZimmeck
Member Author

OK, sounds good!

Feel free to go ahead with that test set then. As we discussed yesterday, maybe the performance is good. Otherwise we call that set a validation set, pick a new test set, and repeat the test (after fixing any shortcomings with the crawler and/or extension).

One important point: the PP analysis needs to be set up exactly as it would be in the real crawl, i.e., with VPN and as a crawl, not just the extension. Though, it does not need to be on the crawl computer.

@dadak-dom
Collaborator

One more thing: I noticed this morning that there are a lot of sites in the general list that redirect to sites that are already on the list. Can't believe I didn't catch that sooner, so I'll fix that ASAP. Just to be safe, I'll also redo the general list part of the test set.

@SebastianZimmeck
Member Author

Great!

@dadak-dom
Collaborator

@SebastianZimmeck I'm compiling the first round of test data, but so far I'm not getting as many location requests found as I'd like. You mention in one of the comments above that it might be worthwhile to make a list of, say, map sites. If I were to make a test list of sites with the clear intention of finding location requests, how can I make it random? Would it be valid to find, for example, a list of 200 map sites (not necessarily from the lists that we have), and pick randomly from that? If not, what are some valid strategies?

@SebastianZimmeck
Member Author

what are some valid strategies?

Just map sites would probably be too narrow of a category. There may be techniques that are map site-specific. In that case our test set would only claim that we are good at identifying locations on map sites. So, we need more categories of sites, ideally, all categories of sites that typically get people's location.

Here is a starting point: Can you give some examples of websites that use geolocation to target local customers? So, the categories mentioned there, plus map sites, plus any other category of site that you found in your tests that collect location data. Maybe, there are generic lists (Tranco, BuiltWith, ...) that have categories of sites. Compile a list out of those and then randomly pick from them. That may be an option, but maybe you have a better idea.

So, maybe our test set consists of two parts:

  1. Location test set
  2. Monetization and Tracking test set

Maybe, it even has three parts if tracking pixel, browser fingerprinting, and/or IP address collection (the Tracking categories) are also rare. Then, we would also need to do a more intricate test set construction for the Tracking categories as well. I would expect no shortage of sites with Monetization.

There are no hard rules for testing. The overall question is:

What test would convince you that the crawl results are correct? (as to lat/lon, IP address, ... )

What arguments could someone make if they wanted to punch a hole in our claim that the analysis results demonstrate our crawl results are correct? Some I can think of: too small test set, not enough breadth in the test set, i.e., not covering all the techniques that we use or types of sites, sites not randomly selected, i.e., biased towards sites we know work ... (maybe there are more).

I would think we need at least 100 sites in the test set overall and generally not less than 10 sites for each practice we detect (lat/lon, tracking pixel, ...). Anything less has likely not enough statistical power and would not convince me.

@dadak-dom
Collaborator

I've just added the lists and data that we can use for the first go at testing. A couple things to note:

  1. I managed to get a set where PP detected at least 10 of nearly every analysis functionality we were looking for, except for Zip Code and Lat/Long. My theory is that using the VPN makes it harder for sites to take this information, and so there are no requests with this information for PP to find. Of course, this will only be verified after testing fully, but I wanted to raise the possibility that these two analysis functions may not be possible with the setup we are going with. Just from the number of sites, it's strange that none of them took lat/long, and yet many took region and city. I also did a quick test where I found a site I knew would take lat/long or zip code, and visited it without a VPN to make sure PP found those things. I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City. The good news is that Region and City seem to pop up quite a bit, so I believe we should have no problem testing for them.
  2. For documentation, here was my procedure for generating the lists:
  • I had two pools of sites: one was a mixture of the top sites of different categories, as well as sites that I had encountered in my personal browsing. The other set was the top 100 sites from the builtwith list.
  • For each list, generate a set of random integers. Use the corresponding row number for the site that you'll select
  • Take six sites from the mixture, and six sites from the pre-compiled list, for each country that we crawl.
  • In theory, you should have 12 sites. However, a few of them were bound to crash, and so as long as fewer than two crashed for a given crawl, I thought it was ok, since we'll still have over 100 sites. So you should have 10-12 sites per country/state.
  • If you generate a random integer that corresponds to a URL that was already taken, use the next available URL.

When crawling, I made sure that I was connected to the corresponding VPN for each list, i.e. when crawling using the South Africa list, I was connected to South Africa.

@SebastianZimmeck
Member Author

Good progress, @dadak-dom!

I then connected to the same site with a VPN, and PP wouldn't find lat/long or zip, but it still found Region and City.

Not having lat/long would be substantial. Can you try playing around with the Mullvad VPN settings?

Screenshot 2024-01-17 at 10 06 50 PM

Can you try allowing as much as possible? Our goal is to have the sites trigger as much of their tracking functionality as possible.

Also, while I assume that the issue is not related to Firefox settings, since you get lat/long with presumably the same settings in Firefox with VPN and Firefox without VPN, we should also set the Firefox settings to allow as much as possible.

Maybe, also try a different VPN. What happens with the Wesleyan VPN, for example?

The bottom line: Try to think of ways to get the lat/long to show up.

@dadak-dom
Collaborator

I messed around with the settings for both Firefox Nightly and Mullvad, no luck there.

I've tried crawling and regularly browsing with both Mullvad and the Wesleyan VPN. I was able to get Wesleyan VPN to show coarse location when browsing, but not when crawling. Under Mullvad, coarse/fine location never shows up.

However, when trying to figure this out, I noticed something that may be of interest. Per the Privacy Pioneer readme, the location value that PP uses to look for lat/long in HTTP requests is taken from the Geolocation API. Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.
Maybe something strange is going on with my machine, so to check what I did, I encourage anyone to try the following:

  1. Without a VPN connection, visit any website.
  2. Paste the following code into your developer console:

const options = {
  enableHighAccuracy: true,
  timeout: 5000,
  maximumAge: 0,
};

function success(pos) {
  const crd = pos.coords;

  console.log("Your current position is:");
  console.log(`Latitude : ${crd.latitude}`);
  console.log(`Longitude: ${crd.longitude}`);
  console.log(`More or less ${crd.accuracy} meters.`);
}

function error(err) {
  console.warn(`ERROR(${err.code}): ${err.message}`);
}

navigator.geolocation.getCurrentPosition(success, error, options);

  3. Compare this value to what ipinfo.io gives you by visiting ipinfo.io (without a VPN, they should be roughly the same)
  4. Now do steps 2 and 3 while connected to a VPN in a different country

When I do these steps, I end up with a different value for ipinfo, but the value from the Geolocation API stays the same (the above code should be set to not use a cached position, i.e., maximumAge: 0).
I then looked at location evidence that PP collected for crawls I did when connected to other countries. Sure enough, PP would find the region and city, because that info is provided by ipinfo. However, PP would miss the lat/long that was in the same request, most likely because the geolocation API is feeding it a different value, and so PP is looking for something else.

However, this doesn't explain why PP doesn't generate entries for coarse and fine location when crawling without a VPN. From looking at the ground truth of some small test crawls, there clearly are latitudes and longitudes of the user being sent, but for some reason PP doesn't flag them. @danielgoldelman , maybe you have some idea as to what is going on? This doesn't seem to be a VPN issue as I initially thought.

@danielgoldelman
Collaborator

Interesting. I was having different experiences, @dadak-dom ... lat and lng seemed to be accurately obtained when performing the crawls before. Have you modified the .ext file?

@dadak-dom
Collaborator

No, I didn't make any changes to the .ext file, @danielgoldelman . Was I supposed to?

@SebastianZimmeck
Member Author

Good progress, @dadak-dom!

Using the developer console, I noticed that this value doesn't change, regardless of the location that you are using for a VPN.

When I do these steps, I end up with a different value for ipinfo, but the value from geolocation API stays the same

Hm, is this even a larger issue not related to the VPN? In other words, even in the non-VPN scenario do we have a bug that the location is not properly updated? This is the first point we should check. (Maybe, going to a cafe or other place with WiFi can be used to get a second location to test.)

What is not clear to me is that when we crawled with different VPN locations for constructing our training/validation/test set, we got instances of all location types. So, I am not sure what has changed since then.

@danielgoldelman, can you look into that?

@dadak-dom
Collaborator

I forgot to use the hashtag in my most recent commit, but @danielgoldelman and I seem to have solved the lat/long issue. Apparently the browser that Selenium created did not have the geo.provider.network.url preference set, and so the extension wasn't able to evaluate a lat or long when crawling. My most recent commit to issue-9 should fix this, but this should be applied to the main crawler as well. Hopefully, this means that we can get started with gathering test data and testing.
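For reference, a sketch of what such a fix can look like, assuming the Node selenium-webdriver bindings; the provider URL shown is only an illustration of a network geolocation endpoint, not necessarily the value our crawler sets:

const { Builder } = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");

// The Selenium-created profile does not set geo.provider.network.url on its own,
// so the Geolocation API cannot resolve a position unless the preference is set explicitly.
const options = new firefox.Options();
options.setPreference(
  "geo.provider.network.url",
  "https://location.services.mozilla.com/v1/geolocate?key=%MOZILLA_API_KEY%"
);

async function buildDriver() {
  return new Builder().forBrowser("firefox").setFirefoxOptions(options).build();
}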

@danielgoldelman
Collaborator

Additionally, we have run the extension as if we were the computer and compared our results for lat/lng with what we would expect the crawl to reasonably find. This approach worked! We used the preliminary validation set we designated earlier on, so this claim should be confirmed via further testing when we perform the performance metric crawl, but on first inspection the crawl is working as intended for lat/lng.

@SebastianZimmeck
Member Author

Great! Once you think the crawler and analysis works as expected, feel free to move to the test set.

@SebastianZimmeck
Member Author

Interestingly, feeding the PyTorch model with the mislabeled snippets actually resulted in correct results for nearly all the snippets I tried. ... I've confirmed with Daniel that there was no performance drop in converting from PyTorch to TensorFlow, so the only other thing I could think of is the tensorflow/tfjs#8025 issue regarding the conversion from TF to TFJS.

That is a very good point!

It is possible that the performance drop, at least some of it, has to do with the conversion from PyTorch/TensorFlow (in Python) to TensorFlow (in JavaScript). However, why are we seeing a bigger drop than we saw earlier?

From our paper:

PyTorch/TensorFlow (in Python) results:

Screenshot 2024-08-03 at 1 13 53 PM

TensorFlow (in JavaScript) results:

Screenshot 2024-08-03 at 1 14 28 PM

For example, latitude dropped from a recall of 0.97 to 0.82. However, the current results above show a drop to a recall of 0.5 (I am assuming fineLocation is latitude or longitude. As a side note, can we also test for latitude and longitude and not fineLocations as results for latitude and longitude can differ?)

The only reason I can think of why that is the case is that the current test set, more specifically, the instances of the current test set for which we get the false negative location results, are different compared to the original test set and/or we have more of those instances now.

So, can we run the current test set on the version of the code at the time?

If we are able to replicate @danielgoldelman's testing procedure with the code at that time but with our current test set, we should be getting results identical or at least close to what we are getting now with the new code and the current test set (if the conversion is the issue).

I am making this suggestion as we discussed that it is difficult to replicate running the test set at the time on our new code. That would also be an option if possible. In that case the old code and new code should return identical or at least very similar results.

I think my next step will be to try replicating Daniel's conversion and seeing if I can get anywhere with that.

That is also worthwhile to get a better understanding in general. Maybe, there is something we are currently not thinking of.

@SebastianZimmeck
Member Author

SebastianZimmeck commented Sep 21, 2024

1. Current Status of the Model

It works! @atlasharry and @PattonYin made some great progress! As it turns out, there is nothing wrong with the model. Both the (1) PyTorch/TensorFlow Python and (2) TensorFlow.js versions perform in line with the test results reported in the paper. @atlasharry used the model from Hugging Face that corresponds to our GitHub served model and fixed a small issue in the conversion and also slightly re-tuned the parameters (which, however, did not make much of a difference).

There were a handful of incorrect classifications when testing on the 30+ additional test instances that @dadak-dom created. However, the validity of our original test set stands. The 30+ instances test could have just been an unlucky pick of test set sites or there was something different about those test instances as those were created manually and not according to the prior process.

@PattonYin and @atlasharry, please feel free to include additional information here, if there is anything important to add, to conclude this point.

2. Accuracy Testing

As we know that the model is classifying accurately, we can finally start testing the accuracy of the crawler (by "crawler" I mean the crawler including model, extension and VM).

First, here is the 100-site test set, and here is the methodology how the test set was constructed.

2.1 Test Set

Very important: @atlasharry, @PattonYin, and @ananafrida, if you do any testing on your own, please do not use any of the sites in the test set. The crawler is not allowed to see any test instances to guarantee an unbiased test. If you have used any sites from the test set, we need to randomly replace those sites with unseen ones per the described methodology.

2.2 Two Analyses

As I see it, we need to perform two analyses:

  1. Just do a normal evaluation. In other words, just as we ran the extension (with the model) on its own, we now run the crawler, checking the ground truth against the analysis data.
  2. Compare crawler (i.e., crawler including model, extension and VM) results with normal (i.e., just extension/model) results. The point of this exercise is to make a valid claim that our crawler results are actually similar to what a normal user experiences.

For both steps, we should calculate precision, recall, and F1.

Important: We should only calculate these scores from the positive instances and not the negative instances (i.e., not use weighted scores).
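For clarity, this is the intended computation, shown as a small JavaScript sketch over per-category TP/FP/FN counts (the numbers are placeholders):

function scores(tp, fp, fn) {
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  const f1 = (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

console.log(scores(90, 5, 10)); // { precision: ~0.95, recall: 0.9, f1: ~0.92 }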

2.3 How to Do the Crawler Analysis?

Now, while we have no problem performing the first analysis (just extension/model) on the same test set run, i.e., evaluating the ground truth and analysis data based on the same test set run, this is naturally not possible for the second analysis (i.e., running the crawler vs. running just the extension/model are necessarily two different runs). So, this brings us to the problem of how we can distinguish natural fluctuations in site loads across different runs from differences caused by adding our crawler infrastructure.

I think if we run each test, say, three times, we get a sense of the natural differences of each type of run and can distinguish those from the crawler-created differences. In other words, run just the extension/model three times on the test set and check their differences. Then, run the crawler three times and check their differences. Now, averaging out the extension/model runs and the crawler runs, do these averaged runs look very different from one another? So far the theory. I hope it works in practice. 😄
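A toy sketch of that comparison (the counts are made up; it just shows the intended arithmetic of comparing the between-group gap to the within-group spread):

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

const extensionRuns = [41, 44, 43]; // e.g., lat/long detections per extension/model-only run
const crawlerRuns = [40, 42, 41];   // same category, full crawler setup

const withinSpread = Math.max(...extensionRuns) - Math.min(...extensionRuns);
const betweenGap = Math.abs(mean(extensionRuns) - mean(crawlerRuns));
console.log({ withinSpread, betweenGap }); // a crawler effect is plausible only if the gap clearly exceeds the spread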

Now, next question: How can we get a good comparison? We would need to be physically in a place of a VM location and run the extension/model as a normal user. Since I am at a workshop in the Bay Area on November 7, I can do that. I can run just the extension/model (without crawler and VM) and collect the ground truth data (and extension data) as a real California user. Since that is still a few weeks out, the November 7 date should not stop us from already doing the crawl when we are ready. We probably have a sufficient intuitive understanding of whether the accuracy is good. Even if the November 7 test turns out to be not good, we can still re-crawl. 10K sites is "small big data" and can be done fairly quickly.

@SebastianZimmeck
Member Author

@atlasharry and @PattonYin found that while the model works as expected, there is still an issue with the analysis performed by the extension, i.e., the model results do not seem to be properly processed by the extension.

We discussed that @PattonYin and @atlasharry will:

  1. Run the analysis on the complete test set and compare the results against ground truth results to determine the magnitude of the issue. So far there has only been one instance that is not processed correctly. Are there more?
  2. The results from task 1 will tell us to what extent we should address the issue, if at all. If there are more instances of incorrect processing by the extension, the set of instances can help us to pinpoint the issue. E.g., what do these instances have in common?

@SebastianZimmeck
Member Author

@PattonYin and @atlasharry found that the incorrect results produced by the extension need to be addressed indeed. It is a bigger issue.

There are two points to note:

  1. The classification suffers from low recall; precision is not a problem
  2. Errors occur more when numbers are involved (latitude, longitude, ZIP code) as opposed to alphabetical character strings (region, city)

@atlasharry and @PattonYin will look into the extension code, e.g., logging the process at various stages. @PattonYin mentioned that the search functionality may not run correctly.

@SebastianZimmeck
Member Author

@atlasharry and @PattonYin have resolved the issue of the incorrect classifications.

The issue was caused by an incorrect implementation of escape sequences. In particular, there was one line of code in the pre-processing of the HTTP messages that added multiple backslashes to large parts of a message. This pre-processing happened right before a message was passed to the model for classification and caused mis-classifications.

It is not fully clear why this issue did not come up during the Privacy Pioneer extension testing for the PETS paper. One reason could be that the test instances did not have (as much) escaping as our current test set.

@PattonYin, can you link the file and line of code here?

@atlasharry and @PattonYin, please also add relevant details here, if any.

@SebastianZimmeck
Member Author

We will proceed as follows:

  1. Implement the pre-processing fix in both extension and crawler
  2. Create a new test set. @atlasharry will also add to the test set creation protocol how we selected the sites for the new test set
  3. Once we have the test set, we start with the first test phase (under "Just do a normal evaluation ..." above). For that, we need to crawl the sites and also manually check the underlying data of the test run to tell whether the classifications are correct. We should also save the data (i.e., HTTP messages) for later reference.

This week, we will do 1 and 2 and prepare 3. So that by Friday next week we can do 3.

@PattonYin
Member

PattonYin commented Oct 25, 2024

@PattonYin, can you link the file and line of code here?

@atlasharry and @PattonYin, please also add relevant details here, if any.

Sure, we added a preprocessing step to the input with the following line of code:
const input_cleaned = input.replace(/\\+\"/g, '\\"');

The key issue we identified is that the number of backslashes preceding the double quotation mark (") directly affects the model's performance. As illustrated in the example, the second snippet containing 2 backslashes + quotation mark (\\") is tokenized differently (adding token 1032) compared to the case with 1 backslash + quotation mark (\"), leading to different predictions. To address this, the most straightforward solution is to replace any number of backslashes + quotation mark with 1 backslash + quotation mark.

The explanation for this behavior lies in how the backslash is used as an escape character. When a quotation mark needs to appear within a string, the backslash signals that the quotation mark is part of the string rather than an indicator of its end. Additionally, when this string is saved to files like JSON, another backslash is added as part of the escape sequence. Therefore, two backslashes are used to represent a single backslash in the stored string, compounding the number of backslashes introduced.

Since these additional backslashes don't alter the meaning of the text, they can be safely removed during preprocessing.
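A quick demonstration of the cleanup on a hypothetical doubly-escaped snippet (the sample string is made up):

const raw = 'data={\\\\"city\\\\":\\\\"Los Angeles\\\\"}'; // two backslashes before each quote
const cleaned = raw.replace(/\\+\"/g, '\\"');
console.log(cleaned); // data={\"city\":\"Los Angeles\"} -- one backslash before each quote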

(Screenshots: the tokenized string and the string before tokenization.)

@atlasharry
Member

Here is the updated version with the input clean-up in Privacy Pioneer.

Adding to Patton's point, different numbers of backslashes result in different tokenizer outputs. For example:
For the string literals in the snippet, \" results in the tokenized input [1032 1000], while \\\" results in [1032 1032 1032 1000]. This difference, caused by the incorrect implementation of escape sequences, could shift the model's attention or change the importance it assigns to certain parts of the input.

@atlasharry
Member

This is the sheet that we use to track the second crawl analysis. We haven't calculated the scores yet, but some intermediate results/observations are:

Among the sites that collect the user's location, the crawler + extension performs quite well on prediction. The few cases that lead to a false negative are:

  • The location information the extension chooses to match (the call to the ipinfo API) does not always match the sites' location information. E.g., ipinfo identifies the VM's ZIP code as 90009, while some sites identify the ZIP code as 90060.
  • For sites that use more than one API to collect the user's location, the extension can only identify one of them. (We are still investigating why this happens.)

We will continue doing the score calculations and analysis.

@SebastianZimmeck
Member Author

Thanks, @atlasharry (and @PattonYin)!

A few suggestions:

  • Can you identify in the Sheet what the ground truth values are and what the analysis results are?
  • Can you include in the Sheet logic for calculating precision, recall, and F1?
  • Generally, it is not very clear what is what in the Sheet. Maybe, add a tab with explanations and/or more meaningful tab/column/row/etc. names.

Ideally, we want to have the complete results by Tuesday.

Also, if you can prepare the protocol for next Wednesday when I am in California for the second part of the test, that would be good. At that time, we have to be sure it works, which is why it would be good if you also test it yourself (from CT; can somebody who does not know the details of your testing follow it and run the test?).

@SebastianZimmeck
Member Author

SebastianZimmeck commented Nov 5, 2024

As there were still some incorrect results, @PattonYin and @atlasharry identified the following issue, as @PattonYin describes:

Sebastian Zimmeck, we checked the data flow in the extension and found that the LengthHeuristic and the reqURL are the main causes.

Length:
According to the existing code and paper, we won't analyze any HTTP message exceeding 100,000 characters. Our analysis revealed that, in many FN cases, the location is not identified because the message containing it exceeds the 100,000-character limit.
To fix this, we're thinking about changing the heuristic a bit by having it analyze the first 100,000 characters of the message.
But I'm a little worried about the "consistency" issue, since according to the paper, we directly skip such messages.

reqURL:
However, removing that heuristic doesn't resolve the issue. Although the snippets are now extracted, the model frequently predicts false.
We found this is because of the reqURL appended at the beginning of the text snippet. After removing the reqUrl, the model predicts correctly. (Please check image 1.)

Because of this, we're thinking about two potential fixes: 1. modify the lengthHeuristic so that it still analyzes the first 100,000 characters, and 2. remove the reqUrl at the beginning. Only when both are implemented does the model successfully identify the "region". (Please check image 2.)

Just want to confirm if these fixes are acceptable. If these 2 fixes are okay, we can run a crawler test immediately, and see if that improves the model performance.

Additional note: I believe we can safely remove the reqUrl because this string came from "JSON.stringify" (please check image 3), which has nothing to do with the snippet itself.

image

image

image

Thus, we decided to fix these two issues. (Notably, different from what I thought, responses with 100,000+ characters were dropped entirely instead of having their first 100,000 characters analyzed; the character limit is less of an issue for requests, as the code is specific to responses and there are not many requests with 100,000+ characters.)
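A minimal sketch of the adjusted length heuristic (the identifiers are illustrative, not the extension's actual names):

const MAX_ANALYZED_CHARS = 100000;

// Analyze the first 100,000 characters of a long response instead of skipping it entirely.
function toAnalyzableText(responseData) {
  if (typeof responseData !== "string") return "";
  return responseData.length > MAX_ANALYZED_CHARS
    ? responseData.slice(0, MAX_ANALYZED_CHARS)
    : responseData;
}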

@SebastianZimmeck
Member Author

Per @atlasharry, we make the following modification:

Hi Sebastian Zimmeck, we have done some initial crawls with these new changes, and we realized that removing the requestUrl would indeed cause some unintended consequences. For example, for a requestUrl "https://recsys-engine-prd.smiles.com.br/v2/recommendation?place=home_1&id=59fdc7db-8ab9-40bf-8645-74d9c7ed4591&lat=41.5551982&lon=-72.6519254&trace_id=82f83e85-5e39-42a6-92e8-c3f6ba9a0154&request_id=b65f013e-2aa0-411f-a92a-e66156dfb4fa", the URL itself contains the lon and lat. Removing the URL would skip some location information.
To fix that, we use a seemingly redundant but useful approach: we separately feed the requestUrl, requestBody, responseData, and a combination of all three (the combined data of all three is what the extension originally did) to the model. In this way, we fix the original issue and avoid unintended consequences by providing redundancy. After this change, we also tested on the websites where the extension originally failed to identify location information, and they are now all good.

image
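A rough sketch of the "feed each part separately" idea described above (classifySnippet stands in for the extension's model call; the names are illustrative assumptions, not the extension's actual API):

async function flagsLocation(classifySnippet, requestUrl, requestBody, responseData) {
  const parts = [requestUrl, requestBody, responseData, `${requestUrl} ${requestBody} ${responseData}`];
  const results = await Promise.all(parts.map((text) => classifySnippet(text)));
  return results.some(Boolean); // report a location if any part is classified positive
}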

Per @atlasharry, we also make the following improvement:

Another potential improvement is to update the model to the new model I trained at the beginning of the semester. The original model seems to slightly overfit the training set due to the training parameters and steps used. Therefore, in cases like the picture shows, if the snippet has some irrelevant data that contains mostly random characters (which it never sees in the training set), our original model would classify it as False. The new model does not have this problem. We tested more than 10 FN snippets which the original model failed to identify, and the new model correctly classifies all of them.

image

@SebastianZimmeck
Member Author

By next week @PattonYin and @atlasharry will perform the analysis as described under 2.2.1 above and present the performance results.

@atlasharry made the most excellent observation that we may not need to perform the analysis under 2.2.2 because PP's mechanism for identifying all location types (lat, ZIP, city, etc.) is reliant on the IP address, and that mechanism works the same way in every geographic location regardless of the specific IP address. Whether PP uses a VM IP address or a real IP address does not matter.

@SebastianZimmeck
Member Author

We are getting close to the end of the testing! Here is where we stand:

  • We do not need a new test set. The question came up because we made some modifications. However, those are modifications unrelated to the models. So we are not tuning our models to the data and, thus, do not need a new test set.
  • We do not need a geographical test (above "Compare crawler (i.e., crawler including model ...")). The reason is @atlasharry's point.
  • The classification performance for lat/long, city, and region is good. For ZIP code we have some discrepancies due to those not matching the ZIP code of the user. We accept this as an inherent limitation.

@PattonYin and @atlasharry will do the following by next week:

  • Analyze lat/long performance in one unified performance analysis as opposed to individual lat and long analyses.
  • Write a comment here with a table of the complete performance results for locations. This comment also includes links, e.g., to the Google Colab or other documents where the results are calculated, links to the testing protocol, ... , in short, everything related to this testing.
  • Perform tests on non-location items, that is, monetization and tracking items.

@atlasharry
Member

atlasharry commented Nov 29, 2024

We have completed our analysis! This screenshot includes our analysis results on the new 100-site test list using the VM and crawler.

All the related analysis files can be found in this drive. It includes the ground truth HAR files, the crawler results, the labels sheet used to calculate the performance results below, and some code we used to calculate the results.

Here is the Google Doc with the testing procedure.

| Category | TP | FN | FP | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Analytics | 44 | 2 | 0 | 1.00 | 0.96 | 0.98 |
| Advertising | 123 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| Social | 11 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| TrackingPixel | 41 | 1 | 0 | 1.00 | 0.98 | 0.99 |
| PossiblePixel | 4 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| Browser Fingerprinting (Our Own Fingerprinting List) | 2 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| Browser Fingerprinting (FingerprintingInvasive) | 19 | 0 | 0 | 1.00 | 1.00 | 1.00 |
| IPaddress | 31 | 1 | 0 | 1.00 | 0.97 | 0.98 |
| Region | 32 | 2 | 1 | 0.97 | 0.94 | 0.96 |
| City | 21 | 2 | 2 | 0.91 | 0.91 | 0.91 |
| Lng or Lat | 15 | 3 | 0 | 1.00 | 0.83 | 0.91 |
| ZipCode | 9 | 2 | 0 | 1.00 | 0.82 | 0.90 |
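As a quick sanity check on how the columns relate: for Region, precision = TP / (TP + FP) = 32 / 33 ≈ 0.97, recall = TP / (TP + FN) = 32 / 34 ≈ 0.94, and F1 = 2 · 0.97 · 0.94 / (0.97 + 0.94) ≈ 0.96, matching the table.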

@SebastianZimmeck
Member Author

There are two tasks remaining before starting the crawl next week:

1. Finalize Classification Performance Analysis

There are a few small points @atlasharry and @PattonYin will take care of:

  • Make the Labels Sheet (on which the analysis results above are based) clearer (e.g., headings, describe what is in the different tabs, etc.). Delete everything that is not needed and would otherwise confuse.
  • Calculate performance results without duplicate sites (e.g., Bristol University; not sure if there are others). How many total unique test sites do we have?
  • Include formulas in the Labels Sheet instead of manually tallying instances.
  • Add and modify any explanation in the testing procedure to reflect how the testing was performed.
  • I will read through the procedure and look at the analysis results and ask any remaining questions I have.

2. Estimate Crawl Performance and Modify Crawl Procedure/Settings as Necessary

@PattonYin and @atlasharry mentioned that we estimate 30% of sites do not analyze successfully. Reasons could be that a site requires more memory than the browser has or presents a human check. @atlasharry and @PattonYin will select a random subset of sites to estimate the percentage of crawlable sites and come up with some crawl procedure/settings modifications to improve performance (e.g., crawling sites repeatedly or increasing memory on the VM).

@atlasharry
Member

atlasharry commented Dec 6, 2024

We have done some small-scale crawls with 344 random sites from the tranco-1m list.

  • Total time spent around 8.1 hours, with an average speed of 43 sites per hour
  • 119 sites failed: 76 sites with the error page; 19 sites with InsecureCertificateError; 6 sites with HumanCheckError; 7 sites with TimeoutError; 10 sites with WebDriverError; and 1 site with NoSuchFrameError.

After a manual inspection of the 119 failed websites, we verified that 95 of them indeed cannot be opened even by a human (the websites' fault, not the crawler's).
After we excluded those websites, we calculated that our crawler had a success rate of 90.3%.

| Category | Details |
| --- | --- |
| Total sites crawled | 344 sites |
| Total time spent | 8.1 hours |
| Average speed | 43 sites/hour |
| Total failures | 119 sites |
| Failures by type: Error page | 76 sites |
| Failures by type: InsecureCertificateError | 19 sites |
| Failures by type: HumanCheckError | 6 sites |
| Failures by type: TimeoutError | 7 sites |
| Failures by type: WebDriverError | 10 sites |
| Failures by type: NoSuchFrameError | 1 site |

@PattonYin
Member

PattonYin commented Dec 6, 2024

  • We manually visited each website using our own browser and found that many sites are unavailable even without the crawler (i.e., accessed directly with our browser), so we excluded those in the computation of the success rate.
  • We tested on both the Los Angeles VM and the Sydney VM; here is the data from our investigation.

For LA

| Categories | # of errors raised | # of sites unavailable without crawler |
| --- | --- | --- |
| WebDriverError: Reached Error Page | 76 | 74 |
| InsecureCertificateError | 19 | 17 |
| HumanCheckError | 6 | 3 |
| TimeoutError | 7 | 2 |
| WebDriverError | 10 | 0 |
| NoSuchFrameError | 1 | 0 |

Number of true error instances: 24
Excluding the websites that could not be accessed, the number of remaining sites is 344 - 95 = 249.
Success rate: (249 - 24) / 249 = 90.36%

For Sydney

| Categories | # of errors raised | # of sites unavailable without crawler |
| --- | --- | --- |
| WebDriverError: Reached Error Page | 36 | 36 |
| InsecureCertificateError | 7 | 5 |
| HumanCheckError | 4 | 4 |
| TimeoutError | 4 | 1 |

Number of true error instances: 5
Excluding the websites with access errors, the number of remaining sites is 141 - 51 = 90.
Success rate: (90 - 5) / 90 = 94.44%

For more information about all of our tested sites (e.g., which sites loaded in the browser but failed in the crawler), please check the drive: https://drive.google.com/drive/u/1/folders/1U704n1Qne9RvSjr_be81L4oTkAtwj-s-

@SebastianZimmeck
Member Author

SebastianZimmeck commented Dec 7, 2024

Good results, @atlasharry and @PattonYin!

Are we logging the different errors, in particular, WebDriverError: Reached Error Page? I do not think we necessarily need to. I am just asking. We can use your results to extrapolate (or do a similar analysis once we have all the crawl data).

The testing procedure is good as well!

I will do some spot checks for the analysis results in the Labels Sheet and against the raw HAR data.

@SebastianZimmeck
Member Author

As discussed, @atlasharry and @PattonYin, it would be good if you can add in the Labels Sheet the count of unique sites with at least one positive ground truth instance.

Also, I spot-checked the following results, could you clarify?

  1. For amplitude

    Screenshot 2024-12-27 at 21 15 38

    I see one city (Los Angeles) TP and one city FN in the Labels sheet. The Privacy Pioneer Analysis results (entries 1-50.csv) contain only one city (row id 284) while the ground truth results (amplitude har file) contain three city entries (rows 50250, 66403, 67696). How do those results lead to one city TP and one city FN?

  2. For ninelineapparel

    Screenshot 2024-12-27 at 21 28 17

    I see one lat/long (34..., -118...) TP and one lat/long FN in the Labels sheet. The Privacy Pioneer Analysis results (entries last 50.csv) contain six lat/long rows (row ids 572, 573, 575, 576, 577, 578) while the ground truth results (ninelineapparel har file) contains two lat/long entries (rows 58209 and 68718). How do those results lead to one lat/long TP and one lat/long FN?

  3. For sites.toro

    Screenshot 2024-12-27 at 21 42 51

    I see one region (California) TP in the Labels sheet. The Privacy Pioneer Analysis results (entries last 50.csv) contain two region rows (ids 1216 and 1217) while the ground truth results (sites.toro har file) contain one California entry (row 9726); a third party request with the region California to termly. How do those results lead to one region TP?

For all three sites, I am trying to understand how you arrived at your Labels sheet results based on the counts in the ground truth and analysis data (e.g., are these instance counts vs site counts, how is first party vs third party playing into this, ...). If you could clarify how you arrived at the Labels sheet results in these three cases, that would be good.

@atlasharry
Member

For all three sites, I am trying to understand how you arrived at your Labels sheet results based on the counts in the ground truth and analysis data (e.g., are these instance counts vs. site counts, how is first party vs. third party playing into this, ...). If you could clarify how you arrived at the Labels sheet results in these three cases, that would be good.

Hi @SebastianZimmeck, I was in charge of analyzing the last site you mentioned, "sites.toro", and Patton was in charge of "ninelineapparel" and "amplitude".

  • For the "site.toro", I think PP only identifies one region row 1216 while 1217 row belongs to another website "bristol". And therefore I identified it as 1 TP. Can you take a look if that is the case on your end?
    image

@PattonYin
Member

Hi @SebastianZimmeck.

  • For the "amplitude.com", I'm sorry that this is a mistake I made. When I was labeling the groundtruth for this website, I mistakenly took the last appearacne of the city (api2.amplitude.com/2/httpapi) as an entry in the header. After checking, it is indeed coming from the postData. I'll add 1 unit to the FN column.

  • For the "ninelineapparel.com". For the rows with id 572, 573, 575, 576, they are referring to "region" and "city" location collection, only row 577 and row 578 are "fineLocation" and "courseLocation". And since row 577 and 578 are referring to the same snippet from the same third party, that is counted as 1 Positive case, which corresponds to the "1" in TP column.
    image
