mysql - Fill in the missing values in a column determined by the column itself and date -


i have subscription table looks following. there 1 million unique ids. table lists subscription status when user began subscription service indicated 'sub' , when unsubscribed indicated 'usub'.

id year month status  2014   1      2014   2      2014   3      2014   4    sub  2014   5      2014   6      2014   7      2014   8    usub   2014   9      2014  10      2014  11      2014  12     b  2014   1     b  2014   2     b  2014   3     b  2014   4     b  2014   5    usub     b  2014   6     b  2014   7     b  2014   8     b  2014   9    sub b  2014  10     b  2014  11     b  2014  12     .    .    .     . .    .    .     . .    .    .     . 

i looking fill in gap between each status updates. desired output table following:

id year month status  2014   1    usub  2014   2    usub  2014   3    usub  2014   4    sub  2014   5    sub  2014   6    sub  2014   7    sub  2014   8    usub  2014   9    usub  2014  10    usub  2014  11    usub  2014  12    usub b  2014   1    sub b  2014   2    sub b  2014   3    sub b  2014   4    sub b  2014   5    usub b  2014   6    usub b  2014   7    usub b  2014   8    usub b  2014   9    sub b  2014  10    sub b  2014  11    sub b  2014  12    sub .    .    .     . .    .    .     . .    .    .     . 

although first table shows both users have more 1 status change, given id not have more 1 status change. however, each id have @ least 1 status. ie. user subscribed on 2014/8 , never unsubscribed have status= sub on 2014/8.

you can procedural code, reading data in order, , using variable store "state" of previous record (with reset on each new id).
but sql not procedural.

so what kind of procedural wrapper can standard single-node database (e.g. mysql) , distributed system (e.g. hive)?

  • mysql has stored procedure language supports cursors iterate on rows select query (google large number of tutorials this one taken @ random)
  • hive 2.0 has kind of procedural language extension, it's not part of hive -- must run procedural script in specific tool (and not has hive 2.0 anyway) -- also, hive runs distributed queries on distributed file system, "shuffling" rows on single client node iterating single-threaded tool sounds clumsy -- investigate @ own risk
  • hive has complicated , cumbersome way run arbitrary procedural code in distributed way (in each mapper or each reducer), transformclause (a hive traduction of legacy , poorly-named "hadoop streaming" feature); unfortunately the documentation cryptic (you'd better old book e.g. "programming hive" o'reilly) , when can make work, well, performance drop significant
  • or, may develop custom hive udf in java little twist -- i.e. storing state previous row in java variables , using current row

now, what kind of procedural logic need implement in custom query , custom code?

  • first, need input query sort key, date
  • with hive transform or hive udf, need distribute key every key has history processed whole, in single node (otherwise results make absolutely no sense!)
  • define 2 variables "previousrowkey" , "previousrowstate"
  • on 1st row, or on change of key, "previousrowstate = usub" , "previousrowkey = currentkey"
  • on change of state "previousrowstate = currentstate" else force "currentstate = previousrowstate"

Comments