R - Select random rows from duplicate IDs


I'm dealing with a dataset that has students' ratings of teachers. Some students rated the same teacher more than once. I would like to subset the data according to the following criteria:

1) Keep any unique student IDs and their ratings.

2) In cases where a student rated a teacher more than once, keep only 1 rating, and select which rating to keep randomly.

3) If possible, I'd like to be able to run the code in a munging script at the top of every analysis file and ensure that the dataset created is exactly the same for each analysis (set a seed?).

# data
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id,teacher.id,rating)

Thanks for any guidance on how to move forward.

Assuming that each student.id is only applied to one teacher, you could use the following method.

# list containing the data.frames for each student
mylist <- split(df, df$student.id)

# take a sample of each data.frame if it has more than 1 observation, or keep the single observation
# bind the result into a data.frame
set.seed(1234)
do.call(rbind, lapply(mylist, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))

This returns

  student.id teacher.id rating
1          1          1    100
2          2          1     89
3          3          1     99
4          4          2     87
5          5          2     24
6          6          2     52
7          7          2     99
8          8          2     79
9          9          2     12
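On the reproducibility point in the question (criterion 3), here is a minimal sketch of how one might check that re-running the sampling with the same seed gives the same subset. The helper name `sample_one_per_student` is hypothetical and not part of the answer above; it only wraps the same `split`/`lapply`/`rbind` steps.

```r
# Example data from the question
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id, teacher.id, rating)

# Hypothetical helper (an assumption for illustration): set the seed,
# split by student, draw one row per student, and bind the rows back together.
sample_one_per_student <- function(data, seed) {
  set.seed(seed)
  pieces <- split(data, data$student.id)
  # sample(nrow(x), 1) returns 1 when there is a single row, so no branch is needed
  do.call(rbind, lapply(pieces, function(x) x[sample(nrow(x), 1), ]))
}

first  <- sample_one_per_student(df, seed = 1234)
second <- sample_one_per_student(df, seed = 1234)

identical(first, second)         # TRUE: same seed, same rows
anyDuplicated(first$student.id)  # 0: one row per student
```

One caveat: the default sampling algorithm changed in R 3.6.0, so the same seed can select different rows under different R versions. Within one R version, calling `set.seed` immediately before the sampling step should keep the subset stable across analysis files.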

If the same student.id rates multiple teachers, then this method requires the construction of a new variable with the interaction function:

# create new interaction variable
df$stud.teach <- interaction(df$student.id, df$teacher.id)

mylist <- split(df, df$stud.teach)

Then the remainder of the code is identical to the above.
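Put together, the multi-teacher variant might look like the sketch below. Two things here are assumptions rather than part of the original answer: the extra row (student 1 rating teacher 2 with a made-up rating of 50) is added purely to exercise the multi-teacher case, and `drop = TRUE` is passed to `interaction` so that student/teacher combinations that never occur do not produce empty groups.

```r
# Example data from the question
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id, teacher.id, rating)

# Illustrative row only (not in the original data): student 1 also rates teacher 2
df <- rbind(df, data.frame(student.id = 1, teacher.id = 2, rating = 50))

# One factor level per observed student/teacher pair; drop = TRUE removes unused combinations
df$stud.teach <- interaction(df$student.id, df$teacher.id, drop = TRUE)
mylist <- split(df, df$stud.teach)

# Same sampling step as before, now applied per student/teacher pair
set.seed(1234)
result <- do.call(rbind, lapply(mylist, function(x) {
  if (nrow(x) > 1) x[sample(nrow(x), 1), ] else x
}))

nrow(result)  # 10: one row for each of the 10 student/teacher pairs
```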


A potentially faster method is to use the data.table library and rbindlist.

library(data.table)
# convert to data.table
setDT(df)

mylist <- split(df, df$stud.teach)

# put the data.frames together with rbindlist
rbindlist(lapply(mylist, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
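As a variation on the data.table approach, the split/rbindlist steps can probably be replaced by a single grouped expression. This is a sketch of a swapped-in idiom, not the method given above: it assumes the data.table package is installed and uses `.SD` (the rows of the current group) and `.N` (the number of rows in that group) to draw one row per student/teacher pair.

```r
library(data.table)

# Example data from the question, built directly as a data.table
dt <- data.table(
  student.id = c(1,1,2,3,3,4,5,6,7,7,7,8,9),
  teacher.id = c(1,1,1,1,1,2,2,2,2,2,2,2,2),
  rating     = c(100,99,89,100,99,87,24,52,100,99,89,79,12)
)

# Seed set right before sampling so the subset is repeatable
set.seed(1234)

# For each student/teacher pair, keep one randomly chosen row
result <- dt[, .SD[sample(.N, 1L)], by = .(student.id, teacher.id)]

nrow(result)  # 9: one row per student/teacher pair
```

Grouping on both columns means no separate interaction variable is needed, and pairs with a single rating pass through unchanged because `sample(1, 1)` always returns 1.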
