Merge two data.tables

Fast merge of two data.tables. The data.table method behaves very similarly to that of data.frames except that, by default, it attempts to merge

at first based on the shared key columns, and if there are none,
then based on key columns of the first argument x, and if there are none,
then based on the common columns between the two data.tables.

Set the by, or by.x and by.y arguments explicitly to override this default.

# S3 method for data.table
merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE,
all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE,
allow.cartesian=getOption("datatable.allow.cartesian"),  # default FALSE
...)

Arguments

x, y	`data table`s. `y` is coerced to a `data.table` if it isn't one already.
by	A vector of shared column names in `x` and `y` to merge on. This defaults to the shared key columns between the two tables. If `y` has no key columns, this defaults to the key of `x`.
by.x, by.y	Vectors of column names in `x` and `y` to merge on.
all	logical; `all = TRUE` is shorthand to save setting both `all.x = TRUE` and `all.y = TRUE`.
all.x	logical; if `TRUE`, then extra rows will be added to the output, one for each row in `x` that has no matching row in `y`. These rows will have 'NA's in those columns that are usually filled with values from `y`. The default is `FALSE`, so that only rows with data from both `x` and `y` are included in the output.
all.y	logical; analogous to `all.x` above.
sort	logical. If `TRUE` (default), the merged `data.table` is sorted by setting the key to the `by / by.x` columns. If `FALSE`, the result is not sorted.
suffixes	A `character(2)` specifying the suffixes to be used for making non-`by` column names unique. The suffix behaviour works in a similar fashion as the `merge.data.frame` method does.
no.dups	logical indicating that `suffixes` are also appended to non-`by.y` column names in `y` when they have the same column name as any `by.x`.
allow.cartesian	See `allow.cartesian` in `[.data.table`.
...	Not used at this time.

Details

merge is a generic function in base R. It dispatches to either the merge.data.frame method or merge.data.table method depending on the class of its first argument. Note that, unlike SQL, NA is matched against NA (and NaN against NaN) while merging.

In versions <= v1.9.4, if the specified columns in by were not the key (or head of the key) of x or y, then a copy is first re-keyed prior to performing the merge. This was less performant as well as memory inefficient. The concept of secondary keys (implemented in v1.9.4) was used to overcome this limitation from v1.9.6+. No deep copies are made any more, thereby improving performance and memory efficiency. Also, there is better control for providing the columns to merge on with the help of the newly implemented by.x and by.y arguments.

For a more data.table-centric way of merging two data.tables, see [.data.table; e.g., x[y, ...]. See FAQ 1.11 for a detailed comparison of merge and x[y, ...].

If any column names provided to by.x also occur in names(y) but not in by.y, then this data.table method will add the suffixes to those column names. As of R v3.4.3, the data.frame method will not (leading to duplicate column names in the result) but a patch has been proposed (see r-devel thread here) which is looking likely to be accepted for a future version of R.

Value

A new data.table based on the merged data tables, and sorted by the columns set (or inferred for) the by argument if argument sort is set to TRUE.

Examples

(dt1 <- data.table(A = letters[1:10], X = 1:10, key = "A"))
#>     A  X
#>  1: a  1
#>  2: b  2
#>  3: c  3
#>  4: d  4
#>  5: e  5
#>  6: f  6
#>  7: g  7
#>  8: h  8
#>  9: i  9
#> 10: j 10
(dt2 <- data.table(A = letters[5:14], Y = 1:10, key = "A"))
#>     A  Y
#>  1: e  1
#>  2: f  2
#>  3: g  3
#>  4: h  4
#>  5: i  5
#>  6: j  6
#>  7: k  7
#>  8: l  8
#>  9: m  9
#> 10: n 10
merge(dt1, dt2)
#>    A  X Y
#> 1: e  5 1
#> 2: f  6 2
#> 3: g  7 3
#> 4: h  8 4
#> 5: i  9 5
#> 6: j 10 6
merge(dt1, dt2, all = TRUE)
#>     A  X  Y
#>  1: a  1 NA
#>  2: b  2 NA
#>  3: c  3 NA
#>  4: d  4 NA
#>  5: e  5  1
#>  6: f  6  2
#>  7: g  7  3
#>  8: h  8  4
#>  9: i  9  5
#> 10: j 10  6
#> 11: k NA  7
#> 12: l NA  8
#> 13: m NA  9
#> 14: n NA 10

(dt1 <- data.table(A = letters[rep(1:3, 2)], X = 1:6, key = "A"))
#>    A X
#> 1: a 1
#> 2: a 4
#> 3: b 2
#> 4: b 5
#> 5: c 3
#> 6: c 6
(dt2 <- data.table(A = letters[rep(2:4, 2)], Y = 6:1, key = "A"))
#>    A Y
#> 1: b 6
#> 2: b 3
#> 3: c 5
#> 4: c 2
#> 5: d 4
#> 6: d 1
merge(dt1, dt2, allow.cartesian=TRUE)
#>    A X Y
#> 1: b 2 6
#> 2: b 2 3
#> 3: b 5 6
#> 4: b 5 3
#> 5: c 3 5
#> 6: c 3 2
#> 7: c 6 5
#> 8: c 6 2

(dt1 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(1:3, 2)], X = 1:6, key = "A,B"))
#>    A B X
#> 1: 1 a 1
#> 2: 1 a 4
#> 3: 1 b 2
#> 4: 1 b 5
#> 5: 1 c 3
#> 6: 2 c 6
(dt2 <- data.table(A = c(rep(1L, 5), 2L), B = letters[rep(2:4, 2)], Y = 6:1, key = "A,B"))
#>    A B Y
#> 1: 1 b 6
#> 2: 1 b 3
#> 3: 1 c 5
#> 4: 1 c 2
#> 5: 1 d 4
#> 6: 2 d 1
merge(dt1, dt2)
#>    A B X Y
#> 1: 1 b 2 6
#> 2: 1 b 2 3
#> 3: 1 b 5 6
#> 4: 1 b 5 3
#> 5: 1 c 3 5
#> 6: 1 c 3 2
merge(dt1, dt2, by="B", allow.cartesian=TRUE)
#>    B A.x X A.y Y
#> 1: b   1 2   1 6
#> 2: b   1 2   1 3
#> 3: b   1 5   1 6
#> 4: b   1 5   1 3
#> 5: c   1 3   1 5
#> 6: c   1 3   1 2
#> 7: c   2 6   1 5
#> 8: c   2 6   1 2

# test it more:
d1 <- data.table(a=rep(1:2,each=3), b=1:6, key="a,b")
d2 <- data.table(a=0:1, bb=10:11, key="a")
d3 <- data.table(a=0:1, key="a")
d4 <- data.table(a=0:1, b=0:1, key="a,b")

merge(d1, d2)
#>    a b bb
#> 1: 1 1 11
#> 2: 1 2 11
#> 3: 1 3 11
merge(d2, d1)
#>    a bb b
#> 1: 1 11 1
#> 2: 1 11 2
#> 3: 1 11 3
merge(d1, d2, all=TRUE)
#>    a  b bb
#> 1: 0 NA 10
#> 2: 1  1 11
#> 3: 1  2 11
#> 4: 1  3 11
#> 5: 2  4 NA
#> 6: 2  5 NA
#> 7: 2  6 NA
merge(d2, d1, all=TRUE)
#>    a bb  b
#> 1: 0 10 NA
#> 2: 1 11  1
#> 3: 1 11  2
#> 4: 1 11  3
#> 5: 2 NA  4
#> 6: 2 NA  5
#> 7: 2 NA  6

merge(d3, d1)
#>    a b
#> 1: 1 1
#> 2: 1 2
#> 3: 1 3
merge(d1, d3)
#>    a b
#> 1: 1 1
#> 2: 1 2
#> 3: 1 3
merge(d1, d3, all=TRUE)
#>    a  b
#> 1: 0 NA
#> 2: 1  1
#> 3: 1  2
#> 4: 1  3
#> 5: 2  4
#> 6: 2  5
#> 7: 2  6
merge(d3, d1, all=TRUE)
#>    a  b
#> 1: 0 NA
#> 2: 1  1
#> 3: 1  2
#> 4: 1  3
#> 5: 2  4
#> 6: 2  5
#> 7: 2  6

merge(d1, d4)
#>    a b
#> 1: 1 1
merge(d1, d4, by="a", suffixes=c(".d1", ".d4"))
#>    a b.d1 b.d4
#> 1: 1    1    1
#> 2: 1    2    1
#> 3: 1    3    1
merge(d4, d1)
#>    a b
#> 1: 1 1
merge(d1, d4, all=TRUE)
#>    a b
#> 1: 0 0
#> 2: 1 1
#> 3: 1 2
#> 4: 1 3
#> 5: 2 4
#> 6: 2 5
#> 7: 2 6
merge(d4, d1, all=TRUE)
#>    a b
#> 1: 0 0
#> 2: 1 1
#> 3: 1 2
#> 4: 1 3
#> 5: 2 4
#> 6: 2 5
#> 7: 2 6

# new feature, no need to set keys anymore
set.seed(1L)
d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6)
d2 <- data.table(a=2:0, z=10:12)
merge(d1, d2, by="a")
#>    a z.x z.y
#> 1: 1   1  11
#> 2: 1   5  11
#> 3: 2   2  10
#> 4: 2   3  10
merge(d1, d2, by="a", all=TRUE)
#>    a z.x z.y
#> 1: 0  NA  12
#> 2: 1   1  11
#> 3: 1   5  11
#> 4: 2   2  10
#> 5: 2   3  10
#> 6: 3   4  NA
#> 7: 3   6  NA

# new feature, using by.x and by.y arguments
setnames(d2, "a", "b")
merge(d1, d2, by.x="a", by.y="b")
#>    a z.x z.y
#> 1: 1   1  11
#> 2: 1   5  11
#> 3: 2   2  10
#> 4: 2   3  10
merge(d1, d2, by.x="a", by.y="b", all=TRUE)
#>    a z.x z.y
#> 1: 0  NA  12
#> 2: 1   1  11
#> 3: 1   5  11
#> 4: 2   2  10
#> 5: 2   3  10
#> 6: 3   4  NA
#> 7: 3   6  NA
merge(d2, d1, by.x="b", by.y="a")
#>    b z.x z.y
#> 1: 1  11   1
#> 2: 1  11   5
#> 3: 2  10   2
#> 4: 2  10   3

Arguments

Details

Value

See also

Examples

Contents