Spark: Randomly sampling with replacement a DataFrame with the same number of samples for each class
Despite the lot of seemingly similar questions out there, none answers my question. I have a DataFrame that I processed in order to feed it to a DecisionTreeClassifier, and it contains a column named "label" which is filled with either 0.0 or 1.0.
I need to bootstrap my data set, i.e. to randomly select, with replacement, the same number of rows for each value of the label column.
I've looked at the docs and found DataFrame.sample(...) and DataFrameStatFunctions.sampleBy(...), but the issue is that the number of samples retained by the first is not guaranteed, and the second one doesn't allow replacement! This wouldn't be an issue on a larger data set, but in around 50% of my cases one of the label values will have fewer than a hundred rows, and I don't want skewed data.
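For concreteness, this is roughly how I understand those two calls would be used (the fractions and seed here are placeholder values I made up):

// Fraction-based sampling with replacement: the number of rows
// returned is only approximate, not guaranteed.
val approxSample = inputDataFrame.sample(true, 0.5, 42L)

// Stratified sampling with per-class fractions keyed on the label
// column, but with no withReplacement option.
val stratifiedSample = inputDataFrame.stat.sampleBy("label", Map(0.0 -> 0.1, 1.0 -> 1.0), 42L)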
Despite my best efforts I was unable to find a clean solution to this problem, so I resolved it myself by collecting the whole DataFrame and doing the sampling "manually" in Scala before recreating a new DataFrame to train my DecisionTreeClassifier on. But this seems highly inefficient and cumbersome; I would rather stay with DataFrames and keep the benefits that come with that structure.
Here is my current implementation for reference, so you know what I'd like to do:
val nbSamplesPerClass = /* Int value ranging between 50 and 10000 */
val onesDataFrame = inputDataFrame.filter("label > 0.0")

// Collect both classes to the driver -- this is the inefficient part.
val zeros = inputDataFrame.except(onesDataFrame).collect()
val ones = onesDataFrame.collect()
val nbZeros = zeros.length
val nbOnes = ones.length

// Draw nbSamplesPerClass random indexes with replacement.
val rng = new scala.util.Random()
def randomIndexes(maxIndex: Int) = (0 until nbSamplesPerClass).map(_ => rng.nextInt(maxIndex)).toSeq

val zerosSample = randomIndexes(nbZeros).map(idx => zeros(idx))
val onesSample = randomIndexes(nbOnes).map(idx => ones(idx))
val samples = scala.collection.JavaConversions.seqAsJavaList(zerosSample ++ onesSample)
val resDF = sqlContext.createDataFrame(samples, inputDataFrame.schema)
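The closest DataFrame-only workaround I could come up with is to oversample each class with replacement and then trim the result with limit, sketched below (the 1.2 oversampling factor is an arbitrary value of mine, and the exact row count is still not strictly guaranteed if the sample undershoots), which is why I'm still looking for a proper solution:

import org.apache.spark.sql.DataFrame

// Sample roughly n rows with replacement from df, then trim to n.
// If the fraction-based sample undershoots n, fewer rows come back.
def sampleWithReplacement(df: DataFrame, n: Int): DataFrame = {
  val fraction = 1.2 * n / df.count()
  df.sample(true, fraction, 42L).limit(n)
}

val zerosDF = inputDataFrame.filter("label <= 0.0")
val balancedDF = sampleWithReplacement(zerosDF, nbSamplesPerClass)
  .unionAll(sampleWithReplacement(onesDataFrame, nbSamplesPerClass))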
Does anyone know how to implement such a sampling while working with DataFrames? I'm pretty sure it would greatly speed up the code! Thank you for your time.