I'm dealing with a dataset that has students' ratings of teachers. Some students rated the same teacher more than once. I would like to subset the data using the following criteria:
1) Keep unique student ids and their ratings.
2) In cases where a student rated a teacher twice, keep only 1 rating, and select which rating to keep randomly.
3) If possible, I'd like to be able to run the code in a munging script at the top of every analysis file and ensure that the dataset created is exactly the same for each analysis (set seed?).
    # data
    student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
    teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
    rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
    df <- data.frame(student.id,teacher.id,rating)
Thanks for any guidance on how to move forward.
Assuming that each student.id is only applied to 1 teacher, you could use the following method.
    # get a list containing one data.frame for each student
    mylist <- split(df, df$student.id)

    # take a sample of each data.frame if it has more than 1 observation,
    # or keep the single observation; bind the result into a data.frame
    set.seed(1234)
    do.call(rbind, lapply(mylist, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
This returns
      student.id teacher.id rating
    1          1          1    100
    2          2          1     89
    3          3          1     99
    4          4          2     87
    5          5          2     24
    6          6          2     52
    7          7          2     99
    8          8          2     79
    9          9          2     12
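To cover the reproducibility point in the question, the split-and-sample steps can be wrapped in a small helper that sets the seed itself, so every analysis file that calls it gets the same subset. This is only a sketch: the name `sample_one_rating` and its `seed` argument are made up for illustration and are not part of any package.

```r
# Example data from the question
df <- data.frame(
  student.id = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9),
  teacher.id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  rating     = c(100, 99, 89, 100, 99, 87, 24, 52, 100, 99, 89, 79, 12)
)

# Hypothetical helper: keep one randomly chosen rating per student
sample_one_rating <- function(data, seed = 1234) {
  set.seed(seed)  # fixing the seed makes the random choice repeatable
  pieces <- split(data, data$student.id)
  picked <- lapply(pieces, function(x) {
    if (nrow(x) > 1) x[sample(nrow(x), 1), ] else x
  })
  do.call(rbind, picked)
}

first  <- sample_one_rating(df)
second <- sample_one_rating(df)

identical(first, second)         # TRUE when both calls picked the same rows
anyDuplicated(first$student.id)  # 0 means each student appears once
```

Note that calling `set.seed()` inside the helper also resets the random number stream for whatever code runs afterwards. Also, R 3.6.0 changed the default algorithm behind `sample()`, so the same seed may select different rows on older and newer R versions.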
If the same student.id rates multiple teachers, this method requires the construction of a new variable with the interaction function:
    # create a new interaction variable
    df$stud.teach <- interaction(df$student.id, df$teacher.id)

    mylist <- split(df, df$stud.teach)
Then the remainder of the code is identical to the above.
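Put together, the student-by-teacher version might look like the sketch below. The data frame `df2` is invented here so that one student rates two different teachers, and `drop = TRUE` is an optional addition that leaves out student/teacher combinations that never occur.

```r
# Hypothetical data: student 1 rates teacher 1 twice and teacher 2 once
df2 <- data.frame(
  student.id = c(1, 1, 1, 2, 2, 3),
  teacher.id = c(1, 1, 2, 1, 1, 2),
  rating     = c(100, 99, 80, 89, 70, 95)
)

# One factor level per observed student/teacher pair
df2$stud.teach <- interaction(df2$student.id, df2$teacher.id, drop = TRUE)
mylist <- split(df2, df2$stud.teach)

# Sample one row from each pair that has more than one rating
set.seed(1234)
result <- do.call(rbind, lapply(mylist, function(x) {
  if (nrow(x) > 1) x[sample(nrow(x), 1), ] else x
}))

nrow(result)                      # one row per student/teacher pair
anyDuplicated(result$stud.teach)  # 0 means no pair appears twice
```

Student 1's rating of teacher 2 is kept alongside one of the two ratings of teacher 1, which splitting on student.id alone would not guarantee.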
A potentially faster method is to use the data.table library and rbindlist.
    library(data.table)

    # convert to data.table (by reference)
    setDT(df)

    mylist <- split(df, df$stud.teach)

    # put the data.frames together with rbindlist
    rbindlist(lapply(mylist, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
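Another option, which is not part of the answer above and is offered only as a sketch, is to skip the split step entirely in base R: shuffle the rows once, then keep the first row seen for each student/teacher pair. Because the shuffle is uniform, the surviving row within each pair is a uniformly random choice. The names `shuffled` and `result` are arbitrary.

```r
# Example data from the question
df <- data.frame(
  student.id = c(1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 8, 9),
  teacher.id = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
  rating     = c(100, 99, 89, 100, 99, 87, 24, 52, 100, 99, 89, 79, 12)
)

set.seed(1234)

# Put the rows in random order
shuffled <- df[sample(nrow(df)), ]

# Keep only the first occurrence of each student/teacher pair
keep   <- !duplicated(shuffled[c("student.id", "teacher.id")])
result <- shuffled[keep, ]

# Restore a readable ordering
result <- result[order(result$student.id, result$teacher.id), ]
result
```

This avoids building a list of small data.frames, which may matter on larger datasets, though whether it is actually faster for a given dataset would need to be timed.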