I have an Amazon Redshift table with a billion rows, and I want to sample 100,000 of them at random. I've tried a query like this:
select browserid from pageviews pv group by browserid order by md5('seed' || browserid) limit 100000;
as described here, but it takes two or more hours to run because the sort operation dominates the pull.
You can find out the distribution of unique combinations of the first N characters of the hash in your dataset like this:
select substring(md5('seed' || browserid), 1, 2), count(1) from pageviews pv group by 1;
and then use the relevant combination, or several combinations, in a WHERE clause to filter entries before the sort happens. For example, if you see that more than 100,000 hashes start with 'ab', run this:
select [columns] from pageviews pv where substring(md5('seed' || browserid), 1, 2) = 'ab' group by browserid order by md5('seed' || browserid) limit 100000;
Also, if you have many rows and want to run this sampling task repeatedly, you can materialize the hash in an additional table column once, so you won't have to calculate it on every query.
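A minimal sketch of that materialization, assuming the `pageviews` table from the question (the column name `browserid_hash` and the `char(32)` type are my own choices, not from the original):

-- Add a column to hold the precomputed hash (name is illustrative).
alter table pageviews add column browserid_hash char(32);

-- Populate it once; new rows would need the same treatment.
update pageviews set browserid_hash = md5('seed' || browserid);

-- Sampling queries can then filter and sort on the stored value directly.
-- The hash is deterministic per browserid, so adding it to GROUP BY does
-- not change the groups, and it lets ORDER BY reference the column.
select browserid
from pageviews
where substring(browserid_hash, 1, 2) = 'ab'
group by browserid, browserid_hash
order by browserid_hash
limit 100000;

This trades one full-table UPDATE (and some storage) for skipping the per-row md5 computation on every subsequent sampling run.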