-->

Combining IRanges objects and maintaining mcols

Posted 2019-08-24 00:21

Question:

I'll start with an example, and then describe the logic I'm trying to use.

I have two normal IRanges objects that span the same total range, but may do so in a different number of ranges. Each IRanges has one mcol, but that mcol is different across IRanges.

a
#IRanges object with 1 range and 1 metadata column:
#          start       end     width | on_betalac
#      <integer> <integer> <integer> |  <logical>
#  [1]         1       167       167 |      FALSE
b
#IRanges object with 3 ranges and 1 metadata column:
#          start       end     width |  on_other
#      <integer> <integer> <integer> | <logical>
#  [1]         1       107       107 |     FALSE
#  [2]       108       112         5 |      TRUE
#  [3]       113       167        55 |     FALSE

You can see both of these IRanges span 1 to 167, but a has one range and b has three. I would like to combine them to get output like this:

my_great_function(a, b)
#IRanges object with 3 ranges and 2 metadata columns:
#          start       end     width | on_betalac  on_other
#      <integer> <integer> <integer> |  <logical> <logical>
#  [1]         1       107       107 |     FALSE     FALSE
#  [2]       108       112         5 |     FALSE      TRUE
#  [3]       113       167        55 |     FALSE     FALSE

The output is like a disjoin of the inputs, but it keeps the mcols and spreads them, so that each output range carries the mcol values of the input ranges it came from.

Answer 1:

Option 1: Using IRanges::findOverlaps

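# For each range in b, find the range in a that it overlaps
m <- findOverlaps(b, a)
c <- b[queryHits(m)]
# Append the metadata of the matching a range to that of b
mcols(c) <- cbind(mcols(c), mcols(a[subjectHits(m)]))
c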
#IRanges object with 3 ranges and 2 metadata columns:
#          start       end     width |  on_other on_betacalc
#      <integer> <integer> <integer> | <logical>   <logical>
#  [1]         1       107       107 |     FALSE       FALSE
#  [2]       108       112         5 |      TRUE       FALSE
#  [3]       113       167        55 |     FALSE       FALSE

The resulting object c is an IRanges object with two metadata columns.
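
Option 1 works here because every range in b lies inside a range of a. As a rough sketch for the more general case, where neither input refines the other, each overlapping pair can be intersected with pintersect. The function name combine_two is hypothetical, and the sample objects are rebuilt inside the block using the question's column names.

```r
library(IRanges)

# Hypothetical helper: intersect every overlapping pair of ranges from x and y
# and carry over the metadata columns of both inputs.
combine_two <- function(x, y) {
  hits <- findOverlaps(x, y)
  qi <- queryHits(hits)
  si <- subjectHits(hits)
  # Overlapping pairs always have a non-empty intersection
  out <- pintersect(x[qi], y[si])
  mcols(out) <- cbind(mcols(x)[qi, , drop = FALSE],
                      mcols(y)[si, , drop = FALSE])
  out
}

a <- IRanges(1, 167)
mcols(a)$on_betalac <- FALSE

b <- IRanges(c(1, 108, 113), c(107, 112, 167))
mcols(b)$on_other <- c(FALSE, TRUE, FALSE)

res <- combine_two(a, b)
res
```

Positions covered by only one of the two inputs are dropped by this sketch, since it only keeps overlapping pairs.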

Option 2: Using IRanges::mergeByOverlaps

c <- mergeByOverlaps(b, a)
c
#DataFrame with 3 rows and 4 columns
#          b  on_other         a on_betacalc
#  <IRanges> <logical> <IRanges>   <logical>
#1     1-107     FALSE     1-167       FALSE
#2   108-112      TRUE     1-167       FALSE
#3   113-167     FALSE     1-167       FALSE

The result is a DataFrame that holds the ranges of b and a as IRanges columns, with the original metadata columns alongside them.

Option 3: Using data.table::foverlaps

library(data.table)
a.dt <- as.data.table(cbind.data.frame(a, mcols(a)))[, width := NULL]
b.dt <- as.data.table(cbind.data.frame(b, mcols(b)))[, width := NULL]

setkey(b.dt, start, end)
foverlaps(a.dt, b.dt, type = "any")[, `:=`(i.start = NULL, i.end = NULL)][]
#   start end on_other on_betacalc
#1:     1 107    FALSE       FALSE
#2:   108 112     TRUE       FALSE
#3:   113 167    FALSE       FALSE

The resulting object is a data.table.

Option 4: Using fuzzyjoin::interval_left_join

library(fuzzyjoin)
a.df <- cbind.data.frame(a, mcols(a))
b.df <- cbind.data.frame(b, mcols(b))
interval_left_join(b.df, a.df, by = c("start", "end"))
#  start.x end.x width.x on_other start.y end.y width.y on_betacalc
#1       1   107     107    FALSE       1   167     167       FALSE
#2     108   112       5     TRUE       1   167     167       FALSE
#3     113   167      55    FALSE       1   167     167       FALSE

The resulting object is a data.frame.


Sample data

library(IRanges)
a <- IRanges(1, 167)
mcols(a)$on_betacalc <- FALSE

b <- IRanges(c(1, 108, 113), c(107, 112, 167))
mcols(b)$on_other <- c(FALSE, TRUE, FALSE)


Answer 2:

Here's what I've been able to come up with. It's not as elegant as MauritsEvers's answer, but it may be useful to others in some way.

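combine_exposures <- function(...) {

  # Concatenate all inputs; mcols missing from an input are filled with NA
  cd <- c(...)
  mc <- mcols(cd)
  # Disjoint ranges; revmap lists, for each one, the rows of cd that cover it
  dj <- disjoin(x = cd, with.revmap = TRUE)
  r <- mcols(dj)$revmap

  d <- as.data.frame(matrix(nrow = length(dj), ncol = ncol(mc)))
  names(d) <- names(mc)

  # Assumes the j-th entry of each revmap element is the row that carries
  # column j: one mcol per input, inputs in column order, each spanning the full range
  for (i in seq_along(dj)) {
    d[i, ] <- sapply(X = seq_len(ncol(mc)), FUN = function(j) { mc[r[[i]][j], j] })
  }

  mcols(dj) <- d
  return(dj)
}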

Here is the output of dput(c(e1, e2, e3, e4)), where e1, e2, e3, and e4 are example IRanges that each span 1 to 167:

new("IRanges", start = c(1L, 1L, 108L, 113L, 1L, 1L), width = c(167L, 
107L, 5L, 55L, 167L, 167L), NAMES = NULL, elementType = "ANY", 
    elementMetadata = new("DataFrame", rownames = NULL, nrows = 6L, 
        listData = list(on_betalac = c(FALSE, NA, NA, NA, NA, 
        NA), on_other = c(NA, FALSE, TRUE, FALSE, NA, NA), on_pen = c(NA, 
        NA, NA, NA, FALSE, NA), on_quin = c(NA, NA, NA, NA, NA, 
        FALSE)), elementType = "ANY", elementMetadata = NULL, 
        metadata = list()), metadata = list())
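
As a possible variation on the function above, the loop over revmap can be swapped for findOverlaps between the disjoint ranges and the combined object, which avoids relying on the order of the revmap entries. This is only a sketch: fill_mcols is a hypothetical name, the combined object is rebuilt by hand from the dput output, and if several inputs supply a non-NA value for the same column and range, the last one wins.

```r
library(IRanges)

# Combined object, rebuilt from the dput() output shown above
cd <- IRanges(start = c(1L, 1L, 108L, 113L, 1L, 1L),
              width = c(167L, 107L, 5L, 55L, 167L, 167L))
mcols(cd) <- DataFrame(
  on_betalac = c(FALSE, NA, NA, NA, NA, NA),
  on_other   = c(NA, FALSE, TRUE, FALSE, NA, NA),
  on_pen     = c(NA, NA, NA, NA, FALSE, NA),
  on_quin    = c(NA, NA, NA, NA, NA, FALSE)
)

# Hypothetical helper: for each disjoint range, take every column's
# non-NA value from whichever input range overlaps it.
fill_mcols <- function(cd) {
  mc <- as.data.frame(mcols(cd))
  dj <- disjoin(cd)
  hits <- findOverlaps(dj, cd)
  filled <- lapply(mc, function(col) {
    vals <- col[subjectHits(hits)]
    keep <- !is.na(vals)
    res <- rep(NA, length(dj))
    res[queryHits(hits)[keep]] <- vals[keep]
    res
  })
  mcols(dj) <- as.data.frame(filled)
  dj
}

res <- fill_mcols(cd)
res
```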


Tags: r iranges