mysql - Fill in the missing values in a column determined by the column itself and date -

i have subscription table looks following. there 1 million unique ids. table lists subscription status when user began subscription service indicated 'sub' , when unsubscribed indicated 'usub'.

id year month status  2014   1      2014   2      2014   3      2014   4    sub  2014   5      2014   6      2014   7      2014   8    usub   2014   9      2014  10      2014  11      2014  12     b  2014   1     b  2014   2     b  2014   3     b  2014   4     b  2014   5    usub     b  2014   6     b  2014   7     b  2014   8     b  2014   9    sub b  2014  10     b  2014  11     b  2014  12     .    .    .     . .    .    .     . .    .    .     .

i looking fill in gap between each status updates. desired output table following:

id year month status  2014   1    usub  2014   2    usub  2014   3    usub  2014   4    sub  2014   5    sub  2014   6    sub  2014   7    sub  2014   8    usub  2014   9    usub  2014  10    usub  2014  11    usub  2014  12    usub b  2014   1    sub b  2014   2    sub b  2014   3    sub b  2014   4    sub b  2014   5    usub b  2014   6    usub b  2014   7    usub b  2014   8    usub b  2014   9    sub b  2014  10    sub b  2014  11    sub b  2014  12    sub .    .    .     . .    .    .     . .    .    .     .

although first table shows both users have more 1 status change, given id not have more 1 status change. however, each id have @ least 1 status. ie. user subscribed on 2014/8 , never unsubscribed have status= sub on 2014/8.

you can procedural code, reading data in order, , using variable store "state" of previous record (with reset on each new id).
but sql not procedural.

so what kind of procedural wrapper can standard single-node database (e.g. mysql) , distributed system (e.g. hive)?

mysql has stored procedure language supports cursors iterate on rows select query (google large number of tutorials this one taken @ random)
hive 2.0 has kind of procedural language extension, it's not part of hive -- must run procedural script in specific tool (and not has hive 2.0 anyway) -- also, hive runs distributed queries on distributed file system, "shuffling" rows on single client node iterating single-threaded tool sounds clumsy -- investigate @ own risk
hive has complicated , cumbersome way run arbitrary procedural code in distributed way (in each mapper or each reducer), transformclause (a hive traduction of legacy , poorly-named "hadoop streaming" feature); unfortunately the documentation cryptic (you'd better old book e.g. "programming hive" o'reilly) , when can make work, well, performance drop significant
or, may develop custom hive udf in java little twist -- i.e. storing state previous row in java variables , using current row

now, what kind of procedural logic need implement in custom query , custom code?

first, need input query sort key, date
with hive transform or hive udf, need distribute key every key has history processed whole, in single node (otherwise results make absolutely no sense!)
define 2 variables "previousrowkey" , "previousrowstate"
on 1st row, or on change of key, "previousrowstate = usub" , "previousrowkey = currentkey"
on change of state "previousrowstate = currentstate" else force "currentstate = previousrowstate"

Trigger

Search This Blog

mysql - Fill in the missing values in a column determined by the column itself and date -

Comments

Post a Comment