Skip to content

Latest commit

 

History

History
125 lines (102 loc) · 5.13 KB

random-sampling.md

File metadata and controls

125 lines (102 loc) · 5.13 KB

Random sampling

Explore this snippet with some demo data here.

Description

Taking a random sample of a table is often useful when the table is very large, and you are still working in an exploratory or investigative mode. A query against a small fraction of a table runs much more quickly, and consumes fewer resources. Here are some example queries to select a random or pseudo-random subset of rows.

Using rand()

The RAND function returns a value in [0, 1) (i.e. including zero but not 1). You can use it to sample tables, e.g.:

Return 1% of rows

-- one_percent
select * from spotify.spotify_daily_tracks where rand() < 0.01
day track_id rank title artist streams
2018-01-09 6mrKP2jyIQmM0rw6fQryjr 9 Let You Down NF 2610265
... ... ... ... ... ...

Return 10 rows

-- ten_rows
select * from spotify.spotify_daily_tracks order by rand() limit 10
day track_id rank title artist streams
2018-01-09 6mrKP2jyIQmM0rw6fQryjr 9 Let You Down NF 2610265
... ... ... ... ... ...
(only 10 rows returned)

Return approximately 10 rows (discouraged)

-- ten_rows_approx
select
  *
from spotify.spotify_daily_tracks
where rand() < 10 / (select count(*) from spotify.spotify_daily_tracks)
day track_id rank title artist streams
2018-01-09 6mrKP2jyIQmM0rw6fQryjr 9 Let You Down NF 2610265
... ... ... ... ... ...
(only 10 rows returned)

Using hashing

A hash is a deterministic mapping from one value to another, and so is not random, but can appear random 'enough' to produce a convincing sample. Use a supported hash function in BigQuery to produce a pseudo-random ordering of your table:

Farm fingerprint

-- farm_fingerprint
select * from spotify.spotify_artists order by farm_fingerprint(spotify_artists.artist_id) limit 10
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...
(only 10 rows returned)

MD5

-- md5
select * from spotify.spotify_artists order by md5(spotify_artists.artist_id) limit 10
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...
(only 10 rows returned)

SHA1

-- SHA1
select * from spotify.spotify_artists order by sha1(spotify_artists.artist_id) limit 10
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...
(only 10 rows returned)

SHA256

-- SHA256
select * from spotify.spotify_artists order by sha256(spotify_artists.artist_id) limit 10
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...
(only 10 rows returned)

SHA512

-- SHA512
select * from spotify.spotify_artists order by sha512(spotify_artists.artist_id) limit 10
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...
(only 10 rows returned)

Using TABLESAMPLE

The TABLESAMPLE clause is currently in preview status, but it has a major benefit - it doesn't require a full table scan, and therefore can be much cheaper and quicker than the methods above. The downside is that the percentage value must be a literal, so this query is less flexible than the ones above.

Tablesample

-- tablesample
select * from spotify.spotify_daily_tracks tablesample system (10 percent)
artist_id name popularity followers updated_at url
5wPoxI5si3eJsYYwyXV4Wi N.E.R.D 64 647718 2021-03-14 https://open.spotify.com/artist/5wPoxI5si3eJsYYwyXV4Wi
... ... ... ... ... ...