foverlaps.RdA fast binary-search based overlap join of two data.tables.
This is very much inspired by findOverlaps function from the Bioconductor
package IRanges (see link below under See Also).
Usually, x is a very large data.table with small interval ranges, and
y is much smaller keyed data.table with relatively larger
interval spans. For a usage in genomics, see the examples section.
NOTE: This is still under development, meaning it is stable, but some features are yet to be implemented. Also, some arguments and/or the function name itself could be changed.
foverlaps(x, y, by.x = if (!is.null(key(x))) key(x) else key(y), by.y = key(y), maxgap = 0L, minoverlap = 1L, type = c("any", "within", "start", "end", "equal"), mult = c("all", "first", "last"), nomatch = getOption("datatable.nomatch", NA), which = FALSE, verbose = getOption("datatable.verbose"))
| x, y |
|
|---|---|
| by.x, by.y | A vector of column names (or numbers) to compute the overlap
joins. The last two columns in both |
| maxgap | It should be a non-negative integer value, >= 0. Default is 0 (no
gap). For intervals |
| minoverlap | It should be a positive integer value, > 0. Default is 1. For
intervals |
| type | Default value is The types shown here are identical in functionality to the function
NB: |
| mult | When multiple rows in |
| nomatch | When a row (with interval say, |
| which | When |
| verbose |
|
Very briefly, foverlaps() collapses the two-column interval in y
to one-column of unique values to generate a lookup table, and
then performs the join depending on the type of overlap, using the
already available binary search feature of data.table. The time
(and space) required to generate the lookup is therefore proportional
to the number of unique values present in the interval columns of y
when combined together.
Overlap joins takes advantage of the fact that y is sorted to speed-up
finding overlaps. Therefore y has to be keyed (see ?setkey)
prior to running foverlaps(). A key on x is not necessary,
although it might speed things further. The columns in by.x
argument should correspond to the columns specified in by.y. The last
two columns should be the interval columns in both by.x and
by.y. The first interval column in by.x should always be <= the
second interval column in by.x, and likewise for by.y. The
storage.mode of the interval columns must be either double
or integer. It therefore works with bit64::integer64 type as well.
The lookup generation step could be quite time consuming if the number
of unique values in y are too large (ex: in the order of tens of millions).
There might be improvements possible by constructing lookup using RLE, which is
a pending feature request. However most scenarios will not have too many unique
values for y.
A new data.table by joining over the interval columns (along with other
additional identifier columns) specified in by.x and by.y.
NB: When which=TRUE: a) mult="first" or "last" returns a
vector of matching row numbers in y, and b) when
mult="all" returns a data.table with two columns with the first
containing row numbers of x and the second column with corresponding
row numbers of y.
nomatch=NA or 0 also influences whether non-matching rows are returned
or not, as explained above.
require(data.table) ## simple example: x = data.table(start=c(5,31,22,16), end=c(8,50,25,18), val2 = 7:10) y = data.table(start=c(10, 20, 30), end=c(15, 35, 45), val1 = 1:3) setkey(y, start, end) foverlaps(x, y, type="any", which=TRUE) ## return overlap indices#> xid yid #> 1: 1 NA #> 2: 2 2 #> 3: 2 3 #> 4: 3 2 #> 5: 4 NAfoverlaps(x, y, type="any") ## return overlap join#> start end val1 i.start i.end val2 #> 1: NA NA NA 5 8 7 #> 2: 20 35 2 31 50 8 #> 3: 30 45 3 31 50 8 #> 4: 20 35 2 22 25 9 #> 5: NA NA NA 16 18 10foverlaps(x, y, type="any", mult="first") ## returns only first match#> start end val1 i.start i.end val2 #> 1: NA NA NA 5 8 7 #> 2: 20 35 2 31 50 8 #> 3: 20 35 2 22 25 9 #> 4: NA NA NA 16 18 10foverlaps(x, y, type="within") ## matches iff 'x' is within 'y'#> start end val1 i.start i.end val2 #> 1: NA NA NA 5 8 7 #> 2: NA NA NA 31 50 8 #> 3: 20 35 2 22 25 9 #> 4: NA NA NA 16 18 10## with extra identifiers (ex: in genomics) x = data.table(chr=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60)) y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), end=c(4, 18, 55), geneid=letters[1:3]) setkey(y, chr, start, end) foverlaps(x, y, type="any", which=TRUE)#> xid yid #> 1: 1 NA #> 2: 2 2 #> 3: 3 3 #> 4: 4 3 #> 5: 5 3foverlaps(x, y, type="any")#> chr start end geneid i.start i.end #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 15 18 b 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 1 55 c 25 52 #> 5: Chr2 1 55 c 50 60foverlaps(x, y, type="any", nomatch=NULL)#> chr start end geneid i.start i.end #> 1: Chr1 15 18 b 10 20 #> 2: Chr2 1 55 c 1 4 #> 3: Chr2 1 55 c 25 52 #> 4: Chr2 1 55 c 50 60foverlaps(x, y, type="within", which=TRUE)#> xid yid #> 1: 1 NA #> 2: 2 NA #> 3: 3 3 #> 4: 4 3 #> 5: 5 NAfoverlaps(x, y, type="within")#> chr start end geneid i.start i.end #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 NA NA <NA> 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 1 55 c 25 52 #> 5: Chr2 NA NA <NA> 50 60foverlaps(x, y, type="start")#> chr start end geneid i.start i.end #> 1: Chr1 NA NA <NA> 5 11 #> 2: Chr1 NA NA <NA> 10 20 #> 3: Chr2 1 55 c 1 4 #> 4: Chr2 NA NA <NA> 25 52 #> 5: Chr2 NA NA <NA> 50 60## x and y have different column names - specify by.x x = data.table(seq=c("Chr1", "Chr1", "Chr2", "Chr2", "Chr2"), start=c(5,10, 1, 25, 50), end=c(11,20,4,52,60)) y = data.table(chr=c("Chr1", "Chr1", "Chr2"), start=c(1, 15,1), end=c(4, 18, 55), geneid=letters[1:3]) setkey(y, chr, start, end) foverlaps(x, y, by.x=c("seq", "start", "end"), type="any", which=TRUE)#> xid yid #> 1: 1 NA #> 2: 2 2 #> 3: 3 3 #> 4: 4 3 #> 5: 5 3