
Simulating iOS location services and analytics usage

Antonio Cheong edited this page Sep 18, 2025 · 3 revisions

What do we want to find out

  • To what extent does the information collected by Apple allow for mass surveillance and individual targeting?
  • How does Apple use and store this information?

How do we find out

We have already collected and analyzed the pattern of analytics and location-service requests that Apple sends, on a per-device basis, in a previous report (WIP). However, to determine whether deanonymization methods work at the scale of Apple's entire user base, we need a much larger dataset.

Proposal: use existing location datasets to simulate the behavior of an iPhone. This gives us real-world behavior data without having to run our own redundant data collection.

Available datasets:

| Dataset | Individuals | Timespan | Anonymity | Privacy | Density | Diversity | Realism |
| --- | --- | --- | --- | --- | --- | --- | --- |
| YJMob100K by Yahoo | 100k | 90 days | High | Poor (Unethical) | Low (500x500m grids) | Medium (High #, Japan) | Low (Discretized) |
| Brightkite | 58k | 2 years | Low (PII) | Poor (Unethical) | Low (Voluntary check-in) | Unspecified | Low (Skewed data) |
| LBSLab | 467 | 11 days | Low | Good (Academic study) | High | Low (Students only) | Low (Skewed data) |
| Geolife by Microsoft | 182 | 3 years | Low (Unobfuscated) | Mixed (Unknown consent) | High | Low (Small user base) | Mixed (Outdated) |
| GeoLife+ | Variable | Variable | High (Simulated) | High (Privacy preserving) | Tunable | Low (Biased sample) | Mixed (Simulated) |

Chosen dataset, reasoning, and ethics

GeoLife+ is the chosen dataset due to its privacy-preserving nature, tunable parameters, and the ability to simulate a large number of agents. Unlike the other datasets, GeoLife+ uses patterns of life rather than real individuals' traces, mitigating the privacy concerns associated with using real user data. While it may inherit bias from the smaller sample of real data it is derived from, this is outweighed by the ethical considerations and the ability to simulate a large and diverse population. The simulated nature of the data also avoids the ethical issues associated with datasets collected without informed consent or containing personally identifiable information. Furthermore, the tunable parameters allow us to adjust for population density and landscape, making it suitable for simulating the challenges Apple faces in deanonymizing data across a large user base.
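The simulation described above would resample GeoLife+-style trajectories into the kind of dense, per-device request stream an iPhone produces (e.g. /pds/pd's roughly 10-second sampling, per the storage estimate below). A minimal sketch, assuming a simple timestamped lat/lon trajectory format and a stable per-device identifier; the field names and `TrajectoryPoint` type are our own illustration, not a GeoLife+ schema:

```python
import uuid
from dataclasses import dataclass


@dataclass
class TrajectoryPoint:
    timestamp: float  # seconds since epoch
    lat: float
    lon: float


def simulate_analytics_requests(points, device_id=None, sample_interval=10.0):
    """Resample a trajectory into /pds/pd-like location events,
    at most one point per `sample_interval` seconds, each tagged
    with a stable per-device UUID (as seen in real analytics requests)."""
    device_id = device_id or str(uuid.uuid4())
    events, last_ts = [], float("-inf")
    for p in sorted(points, key=lambda p: p.timestamp):
        if p.timestamp - last_ts >= sample_interval:
            events.append({"uuid": device_id, "ts": p.timestamp,
                           "lat": p.lat, "lon": p.lon})
            last_ts = p.timestamp
    return events
```

Running one such generator per simulated agent would yield a population-scale request log to test deanonymization methods against.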

Using GDPR to infer data storage and indexing

  • Send multiple requests for analytics data sent
  • First, request by account. If no data is found, it implies Apple is not doing the processing needed to directly deanonymize the data.
  • Next, request by the UUIDs (e.g. 92DC565E-327D-4062-877A-9A5DBB1AE8B2) included with all analytics requests. If the data is obtainable, it implies data is kept and indexed per UUID. If it cannot be found, specific data is likely deleted after aggregation. If the request is unreasonably difficult to fulfil, it implies long-term but unindexed storage.
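The decision steps above form a small inference tree, sketched here to make the outcome-to-conclusion mapping explicit (the outcome labels and function are our own framing, not an established methodology):

```python
def infer_storage_model(found_by_account: bool, uuid_outcome: str) -> str:
    """Map GDPR request outcomes to inferences about Apple's storage model,
    following the decision steps above."""
    if found_by_account:
        # Account-level lookup succeeded: analytics data is already
        # linked to identities, i.e. effectively deanonymized.
        return "data is directly linked to accounts"
    if uuid_outcome == "found":
        return "per-UUID data is retained and indexed"
    if uuid_outcome == "not_found":
        return "specific data is deleted after aggregation"
    if uuid_outcome == "hard_to_fulfil":
        return "long-term storage, but unindexed"
    raise ValueError(f"unknown outcome: {uuid_outcome!r}")
```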

How much data would Apple have to store?

  • 1.38 billion iPhone users worldwide
  • Assuming everyone has analytics enabled, ~3 analytics batches a day, and a low estimate of ~100 KB per batch, this requires 414 terabytes of storage per day. That is not feasible to retain raw indefinitely, so some aggregation and processing must be done. For example, /pds/pd sends very dense points sampled every 10 seconds; on a per-user basis, those could be used to derive significant locations and other information.
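The back-of-envelope arithmetic above can be checked directly (all inputs are the estimates stated in the bullet, not measured values):

```python
# Estimate of raw daily analytics ingest volume, using the
# assumptions above: every user has analytics enabled.
users = 1.38e9             # iPhone users worldwide
batches_per_day = 3        # analytics batches per user per day
batch_size_bytes = 100e3   # low estimate: ~100 KB per batch

daily_bytes = users * batches_per_day * batch_size_bytes
daily_tb = daily_bytes / 1e12
print(f"~{daily_tb:.0f} TB per day")  # → ~414 TB per day
```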

TODO

  • Write simulation
  • Do statistics
