i have subscription table looks following. there 1 million unique ids. table lists subscription status when user began subscription service indicated 'sub' , when unsubscribed indicated 'usub'.
id year month status 2014 1 2014 2 2014 3 2014 4 sub 2014 5 2014 6 2014 7 2014 8 usub 2014 9 2014 10 2014 11 2014 12 b 2014 1 b 2014 2 b 2014 3 b 2014 4 b 2014 5 usub b 2014 6 b 2014 7 b 2014 8 b 2014 9 sub b 2014 10 b 2014 11 b 2014 12 . . . . . . . . . . . .
i looking fill in gap between each status updates. desired output table following:
id year month status 2014 1 usub 2014 2 usub 2014 3 usub 2014 4 sub 2014 5 sub 2014 6 sub 2014 7 sub 2014 8 usub 2014 9 usub 2014 10 usub 2014 11 usub 2014 12 usub b 2014 1 sub b 2014 2 sub b 2014 3 sub b 2014 4 sub b 2014 5 usub b 2014 6 usub b 2014 7 usub b 2014 8 usub b 2014 9 sub b 2014 10 sub b 2014 11 sub b 2014 12 sub . . . . . . . . . . . .
although first table shows both users have more 1 status change, given id not have more 1 status change. however, each id have @ least 1 status. ie. user subscribed on 2014/8 , never unsubscribed have status= sub on 2014/8.
you can procedural code, reading data in order, , using variable store "state" of previous record (with reset on each new id).
but sql not procedural.
so what kind of procedural wrapper can standard single-node database (e.g. mysql) , distributed system (e.g. hive)?
- mysql has stored procedure language supports cursors iterate on rows select query (google large number of tutorials this one taken @ random)
- hive 2.0 has kind of procedural language extension, it's not part of hive -- must run procedural script in specific tool (and not has hive 2.0 anyway) -- also, hive runs distributed queries on distributed file system, "shuffling" rows on single client node iterating single-threaded tool sounds clumsy -- investigate @ own risk
- hive has complicated , cumbersome way run arbitrary procedural code in distributed way (in each mapper or each reducer),
transform
clause (a hive traduction of legacy , poorly-named "hadoop streaming" feature); unfortunately the documentation cryptic (you'd better old book e.g. "programming hive" o'reilly) , when can make work, well, performance drop significant - or, may develop custom hive udf in java little twist -- i.e. storing state previous row in java variables , using current row
now, what kind of procedural logic need implement in custom query , custom code?
- first, need input query sort key, date
- with hive
transform
or hive udf, need distribute key every key has history processed whole, in single node (otherwise results make absolutely no sense!) - define 2 variables "previousrowkey" , "previousrowstate"
- on 1st row, or on change of key, "previousrowstate = usub" , "previousrowkey = currentkey"
- on change of state "previousrowstate = currentstate" else force "currentstate = previousrowstate"
Comments
Post a Comment