Random sampling

Explore this snippet with some demo data here.

Description

Taking a random sample of a table is often useful when the table is very large, and you are still working in an exploratory or investigative mode. A query against a small fraction of a table runs much more quickly, and consumes fewer resources. Here are some example queries to select a random or pseudo-random subset of rows.

Using rand()

The RAND function returns a value in [0, 1) (i.e. including zero but not 1). You can use it to sample tables, e.g.:

Return 1% of rows

-- one_percent
select * from spotify.spotify_daily_tracks where rand() < 0.01

day	track_id	rank	title	artist	streams
2018-01-09	6mrKP2jyIQmM0rw6fQryjr	9	Let You Down	NF	2610265
...	...	...	...	...	...

Return 10 rows

-- ten_rows
select * from spotify.spotify_daily_tracks order by rand() limit 10

day	track_id	rank	title	artist	streams
2018-01-09	6mrKP2jyIQmM0rw6fQryjr	9	Let You Down	NF	2610265
...	...	...	...	...	...
(only 10 rows returned)

Return approximately 10 rows (discouraged)

-- ten_rows_approx
select
  *
from spotify.spotify_daily_tracks
where rand() < 10 / (select count(*) from spotify.spotify_daily_tracks)

day	track_id	rank	title	artist	streams
2018-01-09	6mrKP2jyIQmM0rw6fQryjr	9	Let You Down	NF	2610265
...	...	...	...	...	...
(only 10 rows returned)

Using hashing

A hash is a deterministic mapping from one value to another, and so is not random, but can appear random 'enough' to produce a convincing sample. Use a supported hash function in BigQuery to produce a pseudo-random ordering of your table:

Farm fingerprint

-- farm_fingerprint
select * from spotify.spotify_artists order by farm_fingerprint(spotify_artists.artist_id) limit 10

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...
(only 10 rows returned)

MD5

-- md5
select * from spotify.spotify_artists order by md5(spotify_artists.artist_id) limit 10

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...
(only 10 rows returned)

SHA1

-- SHA1
select * from spotify.spotify_artists order by sha1(spotify_artists.artist_id) limit 10

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...
(only 10 rows returned)

SHA256

-- SHA256
select * from spotify.spotify_artists order by sha256(spotify_artists.artist_id) limit 10

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...
(only 10 rows returned)

SHA512

-- SHA512
select * from spotify.spotify_artists order by sha512(spotify_artists.artist_id) limit 10

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...
(only 10 rows returned)

Using TABLESAMPLE

The TABLESAMPLE clause is currently in preview status, but it has a major benefit - it doesn't require a full table scan, and therefore can be much cheaper and quicker than the methods above. The downside is that the percentage value must be a literal, so this query is less flexible than the ones above.

Tablesample

-- tablesample
select * from spotify.spotify_daily_tracks tablesample system (10 percent)

artist_id	name	popularity	followers	updated_at	url
5wPoxI5si3eJsYYwyXV4Wi	N.E.R.D	64	647718	2021-03-14	https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
...	...	...	...	...	...

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!