r - Count occurrences around a genomic region given in a data frame -

i need count mutations in genome occur @ spots or rather ranges. mutations have genomic position (chromosome , basepair, e.g. chr1, 10658324). range or spot, respectively, defined 10000 basepairs up- , downstream (+-) of given position in genome. both, positions of mutations , position of "spots" stored in data frames.

example:

set.seed(1)  chr <- 1 pos <- as.integer(runif(5000 , 0, 1e8)) mutations <- data.frame(pos, chr)  chr <- 1 pos <- as.integer(runif(50 , 0, 1e8)) spots <- data.frame(pos, chr)

so question asking is: how many mutations present +-10k basepairs around positions given in "spots". (e.g. if spot 100k, range 90k-110k) real data of course contain 24 chromosomes, sake of simplicity can focus on 1 chromosome now. final data should contain "spot" , number of mutations in it's vicinity, ideally in data frame or matrix.

many in advance suggestions or help!

here's first attempt, pretty shure there way more elegant way of doing it.

w <- 10000  #setting range 10k basepairs loop <- spots$pos  #creating vector of positions loop through out <- data.frame(0,0) colnames(out) <- c("pos", "count")  (l in loop) {   temp <- nrow(filter(mutations, pos>=l-w, pos<=l+w))   temp2 <- cbind(l,temp)   colnames(temp2) <- c("pos", "count")   out <- rbind(out, temp2) }  out <- out[-1,]

using data.table foverlaps, aggregate:

library(data.table) #set flank myflank <- 100000  #convert ranges flank spotsrange <- data.table(   chr = spots$chr,   start = spots$pos - myflank,   end = spots$pos + myflank,   posspot = spots$pos,   key = c("chr", "start", "end"))  #convert ranges start end same pos mutationsrange <- data.table(   chr = mutations$chr,   start = mutations$pos,   end = mutations$pos,   key = c("chr", "start", "end"))  #merge overlap res <- foverlaps(mutationsrange, spotsrange, nomatch = 0)  #count mutations rescnt <- data.frame(table(res$posspot)) colnames(rescnt) <- c("pos", "mutationcount") merge(spots, rescnt, = "pos") #         pos chr mutationcount # 1   3439618   1            10 # 2   3549952   1            15 # 3   4375314   1            11 # 4   7337370   1            13 # ...

Trigger

Search This Blog

r - Count occurrences around a genomic region given in a data frame -

Comments

Post a Comment