Simulating iOS location services and analytics usage

What do we want to find out

To what extent does the information collected by Apple allow for mass surveillance and individual targeting?
How does Apple use and store this information?

How do we find out

In a previous report (WIP), we collected and analyzed the pattern of analytics and location service requests that Apple devices send, on an individual basis. However, to evaluate deanonymization methods at the scale of Apple's user base, we need a much larger dataset.

Proposal: Use existing location datasets to simulate the behavior of an iPhone. This gives us real-world movement data without having to run our own, redundant data collection.
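As a rough starting point for the simulation, the sketch below resamples a recorded trajectory to a fixed reporting interval and groups the points into upload batches. Everything in it is an assumption made for illustration: the `(unix_seconds, lat, lon)` tuple format, the function names `resample` and `split_batches`, linear interpolation between fixes, the 10 second interval (taken from the `/pds/pd` observation later in this page), and the three batches per day. None of this is a confirmed model of what an iPhone actually does.

```python
from bisect import bisect_right
from collections import defaultdict


def resample(track, interval_s=10):
    """Linearly interpolate a track to one point every interval_s seconds.

    track: list of (unix_seconds, lat, lon), sorted by time (assumed format).
    """
    if len(track) < 2:
        return list(track)
    times = [p[0] for p in track]
    out = []
    t = times[0]
    while t <= times[-1]:
        # Index of the last recorded fix at or before time t.
        i = bisect_right(times, t) - 1
        if i >= len(track) - 1:
            out.append((t, track[-1][1], track[-1][2]))
        else:
            t0, lat0, lon0 = track[i]
            t1, lat1, lon1 = track[i + 1]
            f = (t - t0) / (t1 - t0)
            out.append((t, lat0 + f * (lat1 - lat0), lon0 + f * (lon1 - lon0)))
        t += interval_s
    return out


def split_batches(points, batches_per_day=3):
    """Group points into equal time windows (hypothetical batching rule)."""
    window = 86400 // batches_per_day
    groups = defaultdict(list)
    for p in points:
        groups[p[0] // window].append(p)
    return [groups[k] for k in sorted(groups)]


track = [(0, 0.0, 0.0), (20, 2.0, 4.0)]
points = resample(track)
batches = split_batches(points)
```

Linear interpolation is the simplest choice and will smooth over stops and turns; a dataset with dense fixes needs less of it than a check-in style dataset.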

Available datasets:

Dataset	Individuals	Timespan	Anonymity	Privacy	Density	Diversity	Realism
YJMob100K by Yahoo	100k	90 days	High	Poor (Unethical)	Low (500x500m grids)	Medium (High #, Japan)	Low (Discretized)
Brightkite	58k	2 years	Low (PII)	Poor (Unethical)	Low (Voluntary check-in)	Unspecified	Low (Skewed data)
LBSLab	467	11 days	Low	Good (Academic study)	High	Low (Students only)	Low (Skewed data)
Geolife by Microsoft	182	3 years	Low (Unobfuscated)	Mixed (Unknown consent)	High	Low (Small user base)	Mixed (Outdated)
GeoLife+	Variable	Variable	High (Simulated)	High (Privacy preserving)	Tunable	Low (Biased sample)	Mixed (Simulated)

Chosen dataset, reasoning, and ethics

We chose GeoLife+ for its privacy-preserving design, its tunable parameters, and its ability to simulate a large number of agents. Unlike the other datasets, GeoLife+ generates trajectories from patterns of life rather than publishing the records of real individuals, which mitigates the privacy concerns of working with real user data. It may inherit bias from the small sample of real data it is based on, but we consider this outweighed by the ethical benefits and by the ability to simulate a large and diverse population. Because the data is simulated, we also avoid the ethical issues of datasets that were collected without informed consent or that contain personally identifiable information. Finally, the tunable parameters let us adjust for population density and landscape, which makes the dataset suitable for simulating the challenges Apple would face in deanonymizing data across a large user base.

Using GDPR to infer data storage and indexing

Send multiple GDPR requests for the analytics data that our devices have sent.
First, request by account. If no data is found, it implies Apple is not doing the processing needed to directly deanonymize the data.
Next, request by the UUIDs (e.g. 92DC565E-327D-4062-877A-9A5DBB1AE8B2) that are included with all analytics requests. If the data is obtainable, it implies that the data is kept and indexed. If the data cannot be found, it suggests that the specific data is deleted after aggregation. If the request is rejected as unreasonably difficult to fulfil, it implies long-term storage without indexing.
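The decision logic above can be written down as a small lookup, which may help when recording the responses to several requests. This is only a sketch: the `Outcome` names and the `infer_storage` function are hypothetical, and the interpretation of a successful account-level request (data linked to the account) is our assumption, since only the "no data found" case is spelled out above.

```python
from enum import Enum


class Outcome(Enum):
    FOUND = "found"
    NOT_FOUND = "not_found"
    TOO_DIFFICULT = "too_difficult"


# Inferences for the UUID request, as stated in the text above.
UUID_INFERENCE = {
    Outcome.FOUND: "kept and indexed",
    Outcome.NOT_FOUND: "deleted after aggregation",
    Outcome.TOO_DIFFICULT: "long-term storage, unindexed",
}


def infer_storage(by_account, by_uuid):
    """Map the two request outcomes to the inferences listed above."""
    if by_account is Outcome.NOT_FOUND:
        account = "no direct deanonymization"
    elif by_account is Outcome.FOUND:
        # Assumption: not stated in the text, only implied.
        account = "data linked to account"
    else:
        account = "inconclusive"
    return {"account": account, "uuid": UUID_INFERENCE[by_uuid]}


result = infer_storage(Outcome.NOT_FOUND, Outcome.TOO_DIFFICULT)
```

These are inferences, not proof: a request can also fail for procedural reasons that say nothing about how the data is stored.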

How much data would Apple have to store?

There are about 1.38 billion iPhone users worldwide.
Assuming everyone has analytics enabled, ~3 analytics batches a day, and a low estimate of ~100 kB per batch, Apple would receive roughly 414 terabytes of new data per day (1.38 billion × 3 × 100 kB), or about 151 petabytes per year. Retaining raw data at that rate indefinitely seems impractical, so some aggregation and processing is presumably done. For example, requests to /pds/pd contain very dense points sampled every 10 seconds. On a per-user basis, these could be used to determine significant locations and other information.
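The estimate can be reproduced with a few lines of arithmetic. The sketch below uses only the figures quoted above and assumes decimal units (1 kB = 1,000 bytes, 1 TB = 10^12 bytes); the constant and function names are ours.

```python
USERS = 1.38e9         # iPhone users worldwide (figure from the text)
BATCHES_PER_DAY = 3    # approximate analytics batches per user per day
BATCH_BYTES = 100e3    # low estimate of 100 kB per batch

TB = 1e12
PB = 1e15


def daily_bytes(users=USERS, batches=BATCHES_PER_DAY, batch_bytes=BATCH_BYTES):
    """Raw bytes received per day if nothing is aggregated or dropped."""
    return users * batches * batch_bytes


per_day_tb = daily_bytes() / TB
per_year_pb = daily_bytes() * 365 / PB
print(f"{per_day_tb:.0f} TB/day, {per_year_pb:.0f} PB/year")
```

Changing any single assumption scales the result linearly, so a 1 MB batch size would mean roughly ten times these figures.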

TODO

Write simulation

Do statistics
