Simulating iOS location services and analytics usage
- To what extent does the information collected by Apple allow for mass surveillance and individual targeting?
- How does Apple use and store this information?
In a previous report (WIP), we collected and analyzed, on an individual basis, the pattern of analytics and location service requests that Apple sends. However, to determine methods for deanonymization at the scale of Apple's device user base, we need a larger dataset.
Proposal: Use existing location datasets to simulate the behavior of an iPhone. This gives us real-world behavior data without redundant data collection of our own.
| Dataset | Individuals | Timespan | Anonymity | Privacy | Density | Diversity | Realism |
|---|---|---|---|---|---|---|---|
| YJMob100K by Yahoo | 100k | 90 days | High | Poor (Unethical) | Low (500x500m grids) | Medium (High #, Japan) | Low (Discretized) |
| Brightkite | 58k | 2 years | Low (PII) | Poor (Unethical) | Low (Voluntary check-in) | Unspecified | Low (Skewed data) |
| LBSLab | 467 | 11 days | Low | Good (Academic study) | High | Low (Students only) | Low (Skewed data) |
| Geolife by Microsoft | 182 | 3 years | Low (Unobfuscated) | Mixed (Unknown consent) | High | Low (Small user base) | Mixed (Outdated) |
| GeoLife+ | Variable | Variable | High (Simulated) | High (Privacy preserving) | Tunable | Low (Biased sample) | Mixed (Simulated) |
GeoLife+ is the chosen dataset because it preserves privacy, exposes tunable parameters, and can simulate a large number of agents. Unlike the other datasets, GeoLife+ is generated from patterns of life rather than real individual traces, which mitigates the privacy concerns of using real user data and avoids the ethical issues of datasets collected without informed consent or containing personally identifiable information. It may be biased by the smaller sample of real data it is derived from, but this is outweighed by the ethical considerations and the ability to simulate a large and diverse population. The tunable parameters also let us adjust for population density and landscape, making it suitable for simulating the challenges Apple faces in deanonymizing data across a large user base.
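To illustrate how simulated agent trajectories might be turned into iPhone-like location reports, the sketch below resamples a sparse track onto a fixed 10-second grid, matching the sampling interval this document notes for dense location points. The `(t_seconds, lat, lon)` tuple format and the `resample` helper are assumptions made for illustration and are not taken from GeoLife+ or from Apple's protocol. Linear interpolation in latitude and longitude is only a reasonable approximation over short segments.

```python
from bisect import bisect_right


def resample(track, interval_s=10):
    """Linearly interpolate a sparse track onto a fixed time grid.

    track: list of (t_seconds, lat, lon) tuples, sorted by strictly
           increasing time. This input format is hypothetical.
    Returns a list of (t_seconds, lat, lon), one every `interval_s` seconds.
    """
    if len(track) < 2:
        return list(track)

    times = [p[0] for p in track]
    out = []
    t = times[0]
    while t <= times[-1]:
        # Index of the last waypoint at or before t, clamped so that
        # a following waypoint always exists.
        i = min(bisect_right(times, t) - 1, len(track) - 2)
        t0, lat0, lon0 = track[i]
        t1, lat1, lon1 = track[i + 1]
        f = (t - t0) / (t1 - t0)  # fraction of the way along the segment
        out.append((t, lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)))
        t += interval_s
    return out


if __name__ == "__main__":
    # Two waypoints 30 seconds apart become four 10-second samples.
    for point in resample([(0, 0.0, 0.0), (30, 3.0, 6.0)]):
        print(point)
```

A fixed grid keeps every simulated agent on the same reporting cadence, which makes per-user density comparable across agents regardless of how sparse the underlying trajectory is.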
- Send multiple requests to Apple for the analytics data that was sent
- First, request by account. If no data is found, it implies they are not doing the processing needed to directly deanonymize the data.
- Next, request by the UUIDs (e.g. 92DC565E-327D-4062-877A-9A5DBB1AE8B2) included with all analytics requests. If the data is obtainable, it implies that data is kept and indexed. If the data cannot be found, it suggests that specific data is deleted after aggregation. If the request is unreasonably difficult to fulfil, it implies long-term but unindexed storage.
- 1.38 billion iPhone users worldwide
- Assuming everyone has analytics enabled, at ~3 analytics batches a day and a low estimate of ~100 kB per batch, storing the raw data requires about 414 terabytes per day (1.38 billion × 3 × 100 kB). That is not a feasible amount to retain indefinitely, so presumably some aggregation and processing is done. For example, `/pds/pd` sends very dense points sampled every 10 seconds. On a per-user basis, that could be used to determine significant locations and other information.
- Write simulation
- Do statistics
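The storage estimate above can be reproduced with a short calculation. This is a minimal sketch that uses only the figures quoted in the text; the decimal units (1 kB = 1000 bytes, 1 TB = 10^12 bytes) and the reading of the total as a per-day figure are assumptions.

```python
# Back-of-the-envelope storage estimate using the figures quoted above.
USERS = 1_380_000_000      # iPhone users worldwide (figure from the text)
BATCHES_PER_DAY = 3        # analytics batches per user per day (estimate from the text)
BYTES_PER_BATCH = 100_000  # ~100 kB per batch, decimal units (assumption)


def daily_bytes(users=USERS, batches=BATCHES_PER_DAY, batch_bytes=BYTES_PER_BATCH):
    """Total bytes of raw analytics uploaded per day."""
    return users * batches * batch_bytes


def to_terabytes(n_bytes):
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return n_bytes / 10**12


if __name__ == "__main__":
    per_day_tb = to_terabytes(daily_bytes())
    print(f"{per_day_tb:.0f} TB per day")                 # 414 TB per day
    print(f"{per_day_tb * 365 / 1000:.1f} PB per year")   # 151.1 PB per year
```

Changing any one constant shows how sensitive the total is to the per-batch size, which is the least certain of the three inputs.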