scala - Spark Streaming: multiple KMeans models with mapWithState


Hi, I'm planning a Spark deployment to do the heavy lifting of processing data incoming from Kafka and applying StreamingKMeans for outlier detection.

However, the data arriving on the Kafka topic comes from various sources that define different data structures, which require different KMeans models (states). Potentially, every entry in the incoming discretized stream should pass through its own KMeans model, based on a "key" field (basically I need single-event processing).

Can this type of processing be achieved with Spark? If yes, does it exploit Spark's parallelism in the end? I'm quite a newbie to Spark and Scala, and I feel like I'm missing something.

Thanks in advance.

Update:

I'm now looking at the mapWithState operator, which seems to solve the problem. The question is: can I directly save a StreamingKMeans model as the state? Otherwise I would have to save the centroids and instantiate a new model in the state update function, which seems expensive.

Can this type of processing be achieved with Spark? If yes, does it exploit Spark's parallelism in the end?

Theoretically this type of processing is possible, and it can benefit from distributed processing, but definitely not with the tools you want to use.

StreamingKMeans is a model designed to work on RDDs, and since Spark doesn't support nested transformations, you cannot use it inside stateful transformations.

If the set of keys has low cardinality and the values are known up front, you can split the RDDs by key and keep a separate model per key.
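For the low-cardinality case, the per-key split can be sketched roughly as follows. The key names, the cluster count, and the `stream` variable are illustrative assumptions; the sketch assumes the Kafka input has already been parsed into a keyed `DStream[(String, Vector)]`:

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Train one StreamingKMeans per known key, each on its own slice of the stream.
def trainPerKey(stream: DStream[(String, Vector)],
                knownKeys: Seq[String]): Map[String, StreamingKMeans] =
  knownKeys.map { key =>
    val model = new StreamingKMeans()
      .setK(3)                   // illustrative cluster count
      .setDecayFactor(0.9)       // forget old data gradually
      .setRandomCenters(2, 0.0)  // dim = 2, initial weight = 0.0
    // Each model only ever sees records carrying its key.
    model.trainOn(stream.filter { case (k, _) => k == key }.map(_._2))
    key -> model
  }.toMap
```

Each `trainOn` call registers a separate training job on the same streaming context, so the per-key models still update in parallel within each batch.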

If not, you can replace StreamingKMeans with a third-party local, serializable k-means model and use it in combination with mapWithState or updateStateByKey. In general this should be much more efficient than using the distributed version, without reducing the overall parallelism.
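A minimal sketch of such a local model, assuming the standard sequential (online) k-means update rule; the class and field names are illustrative, not from any particular library. Because it is a small serializable object holding only centroids and counts, it can be kept as the per-key state inside a mapWithState function:

```scala
// Hypothetical local k-means model: serializable, so it can live in
// mapWithState / updateStateByKey state and be updated one point at a time.
class LocalKMeansModel(val centroids: Array[Array[Double]],
                       val counts: Array[Long]) extends Serializable {

  private def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid nearest to the point.
  def predict(point: Array[Double]): Int =
    centroids.indices.minBy(i => sqDist(centroids(i), point))

  // Sequential k-means update: move the nearest centroid toward the
  // point by 1/n, where n is that centroid's updated assignment count.
  def update(point: Array[Double]): this.type = {
    val i = predict(point)
    counts(i) += 1
    val eta = 1.0 / counts(i)
    centroids(i) = centroids(i).zip(point).map { case (c, x) => c + eta * (x - c) }
    this
  }
}
```

In the mapWithState update function you would fetch the `LocalKMeansModel` from the state (or create one for a new key), call `update` with the incoming point, and write the model back, so no model re-instantiation is needed per event.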
