What is the fastest way to produce a reproducible random sample in Amazon Redshift?


I have an Amazon Redshift table with a billion rows, and I want to sample 100,000 of them at random. I've tried a query identical to this:

select browserid from pageviews pv group by browserid order by md5('seed' || browserid) limit 100000;

as described here, but it takes 2 or more hours to run because the sort operation dominates the pull.
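If it helps to verify, running EXPLAIN on the same query should show the sort step over every grouped browserid that dominates the plan:

explain select browserid from pageviews pv group by browserid order by md5('seed' || browserid) limit 100000;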

You can find out the distribution of unique combinations of the first N symbols of the hash in your dataset like this:

select substring(md5('seed' || browserid), 1, 2), count(1) from pageviews pv group by 1;

and use the relevant combination (or several combinations) in a WHERE clause to filter entries before the sorting happens. Because md5 output is hexadecimal and close to uniform, a 2-character prefix selects roughly 1/256 of the rows, so with a billion rows the final sort covers only a few million of them. For example, if you see that more than 100,000 hashes start with 'ab', do this:

select [columns] from pageviews pv where substring(md5('seed' || browserid), 1, 2) = 'ab' group by browserid order by md5('seed' || browserid) limit 100000;

Also, if you have many rows and want to run this sampling task repeatedly, it makes sense to materialize the hash in an additional table column once, so you won't have to calculate it every time.
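A minimal sketch of that materialization, assuming the table can be altered; the column name sample_hash is just a placeholder:

alter table pageviews add column sample_hash char(32);

-- one-time (and at this scale expensive) backfill of the hash
update pageviews set sample_hash = md5('seed' || browserid);

-- later samples filter and sort on the stored column instead of recomputing md5
select browserid from pageviews pv where substring(sample_hash, 1, 2) = 'ab' group by browserid, sample_hash order by sample_hash limit 100000;

Note that sample_hash has to appear in the GROUP BY here, since it is now a stored column rather than an expression over the grouping column browserid.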

