Scala - Spark: Randomly sampling with replacement a DataFrame with the same number of samples for each class


Despite the existence of a lot of seemingly similar questions, none of them answers my question.

I have a DataFrame processed in order to be fed to a DecisionTreeClassifier, and it contains a column label filled with either 0.0 or 1.0.

I need to bootstrap my data set, by randomly selecting with replacement the same number of rows for each value of the label column.

I've looked at the docs and found DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...), but the issue is that the number of samples retained is not guaranteed, and the second one doesn't allow replacement! This wouldn't be an issue on larger data sets, but in around 50% of my cases one of the label values will have fewer than a hundred rows, and I don't want skewed data.
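For illustration, this is roughly how sampleBy would be used on my data (the fractions here are hypothetical):

// sampleBy draws each row with the given per-class probability, WITHOUT
// replacement, so the resulting counts are only approximate and a class
// can never yield more rows than it actually contains.
val approxSample = inputDataFrame.stat.sampleBy("label", Map(0.0 -> 0.1, 1.0 -> 0.5), seed = 42L)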

Despite my best efforts, I was unable to find a clean solution to this problem, so I resolved it myself by collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seems highly inefficient and cumbersome; I would rather stay with DataFrames and keep the benefits coming from that structure.

Here is my current implementation for reference, so you know what I'd like to do:

val nbSamplePerClass = /* Int value ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()

val nbZeros = zeros.length
val nbOnes = ones.length

// Draws nbSamplePerClass indexes uniformly at random (with replacement)
def randomIndexes(maxIndex: Int) = {
  val rand = new scala.util.Random()
  (0 until nbSamplePerClass).map(_ => rand.nextInt(maxIndex)).toSeq
}

val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDF = sqlContext.createDataFrame(samples, inputDataFrame.schema)

Does anyone know how to implement such a sampling while working with DataFrames? I'm pretty sure it would significantly speed up my code! Thank you for your time.
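Edit: for what it's worth, one direction I've been considering is RDD.takeSample, which guarantees the exact number of rows and supports replacement. A minimal, untested sketch (it still brings the sampled rows to the driver, but only nbSamplePerClass of them per class rather than every row; it assumes sc and sqlContext are in scope as above):

import org.apache.spark.sql.{DataFrame, Row}

// Draws exactly nb rows from df, with replacement, as an Array[Row] on the driver
def sampleClass(df: DataFrame, nb: Int): Array[Row] =
  df.rdd.takeSample(withReplacement = true, num = nb)

val zerosSample = sampleClass(inputDataFrame.filter("label <= 0.0"), nbSamplePerClass)
val onesSample = sampleClass(inputDataFrame.filter("label > 0.0"), nbSamplePerClass)

// Rebuild a balanced DataFrame from the two class samples
val resDF = sqlContext.createDataFrame(sc.parallelize(zerosSample ++ onesSample), inputDataFrame.schema)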

