I'm new to R, and I needed to find the indices of elements in a list that match particular values. I tried to speed up the operation using foreach and compared two ways of doing this. In the code below, 'a' is the list and 'b' holds the values whose indices I want to retrieve within 'a':
library("iterators")
library("foreach")
library("doParallel")

a <- 1:200000
b <- sample(1:200000, 100000, replace = TRUE)

registerDoParallel()
getDoParWorkers()  # to see the number of cores

system.time(unlist(lapply(b, function(x) which(a == x))))
system.time(foreach(i = iter(b), .combine = 'c') %dopar% { which(a == i) })
output:
Loading required package: parallel
[1] 32
   user  system elapsed
124.648   7.460 132.114
   user  system elapsed
402.076  59.164  55.260
I'm wondering:

1) Naively, why is this operation so slow? I haven't checked, but I'd think a scripting language could do the same thing faster.

2) Shouldn't the operation scale well in parallel? It still seems to take longer than expected, given that I have 32 cores available.

3) In reality I am iterating over the rows of a matrix, i.e. foreach(i = iter(b, by = 'row'), .combine = 'c') %dopar% { #stuff }. My understanding is that this approach is best since it does not send the entire matrix to each core. Is there a way to confirm this by checking what data each core is receiving?
You are doing 1e5 * 2e5 comparisons. It's not surprising that this takes time.
Each individual which(a == x) is not slow, but you have a lot of parallelization overhead if you send each iteration to the workers separately. It is better to send bunches of iterations. On a non-Windows system you can use mclapply:
a <- 1:20000
b <- sample(1:20000, 10000, replace = TRUE)

library(parallel)

system.time(res1 <- unlist(lapply(b, function(x) which(a == x))))
#  user  system elapsed
# 0.597   0.178   0.789

system.time(res2 <- unlist(mclapply(b, function(x) which(a == x),
                                    mc.preschedule = TRUE, mc.cores = 3)))
#  user  system elapsed
# 0.004   0.022   0.325

all.equal(res1, res2)
#[1] TRUE
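The same chunking idea carries over to your foreach version: instead of one task per element of b, hand each worker one large slice of b. A minimal sketch, assuming the itertools package is installed for its isplitVector iterator:

```r
library(foreach)
library(doParallel)
library(itertools)  # provides isplitVector

a <- 1:20000
b <- sample(1:20000, 10000, replace = TRUE)

registerDoParallel(cores = 3)

# One chunk of b per worker: each task now performs many comparisons,
# which amortizes the per-task communication overhead.
res <- foreach(bchunk = isplitVector(b, chunks = 3),
               .combine = 'c') %dopar% {
  unlist(lapply(bchunk, function(x) which(a == x)))
}

all.equal(res, unlist(lapply(b, function(x) which(a == x))))
```

This keeps the result order intact because .combine = 'c' concatenates the chunk results in the order the chunks were produced.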
Regarding 3): there might be better ways to parallelize this (matrix algebra would be best), but that depends on #stuff.
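As an aside: when the values in a are unique, as in your example where a is 1:200000, the lookup does not need a loop or parallelization at all; base R's match is vectorized and returns, for each element of b, the index of its first occurrence in a. A minimal sketch under that uniqueness assumption:

```r
a <- 1:20000
b <- sample(1:20000, 10000, replace = TRUE)

# With no duplicates in a, the index of the first occurrence is the only
# occurrence, so match(b, a) reproduces the lapply/which result exactly.
res_match <- match(b, a)
res_loop  <- unlist(lapply(b, function(x) which(a == x)))
all.equal(res_match, res_loop)
```

If a can contain duplicates, match only finds the first hit, so the which-based loop (or a grouped lookup such as split(seq_along(a), a)) is needed instead.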