merge.RdFast merge of two data.tables. The data.table method behaves
very similarly to that of data.frames except that, by default, it attempts to merge
at first based on the shared key columns, and if there are none,
then based on key columns of the first argument x, and if there
are none,
then based on the common columns between the two data.tables.
Set the by, or by.x and by.y arguments explicitly to override this default.
# S3 method for data.table merge(x, y, by = NULL, by.x = NULL, by.y = NULL, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x", ".y"), no.dups = TRUE, allow.cartesian=getOption("datatable.allow.cartesian"), # default FALSE ...)
| x, y |
|
|---|---|
| by | A vector of shared column names in |
| by.x, by.y | Vectors of column names in |
| all | logical; |
| all.x | logical; if |
| all.y | logical; analogous to |
| sort | logical. If |
| suffixes | A |
| no.dups | logical indicating that |
| allow.cartesian | See |
| ... | Not used at this time. |
merge is a generic function in base R. It dispatches to either the
merge.data.frame method or merge.data.table method depending on
the class of its first argument. Note that, unlike SQL, NA is
matched against NA (and NaN against NaN) while merging.
In versions <= v1.9.4, if the specified columns in by were not the
key (or head of the key) of x or y, then a copy is
first re-keyed prior to performing the merge. This was less performant as well as memory
inefficient. The concept of secondary keys (implemented in v1.9.4) was
used to overcome this limitation from v1.9.6+. No deep copies are made
any more, thereby improving performance and memory efficiency. Also, there is better
control for providing the columns to merge on with the help of the newly implemented
by.x and by.y arguments.
For a more data.table-centric way of merging two data.tables, see
[.data.table; e.g., x[y, ...]. See FAQ 1.11 for a detailed
comparison of merge and x[y, ...].
If any column names provided to by.x also occur in names(y) but not in by.y,
then this data.table method will add the suffixes to those column names. As of
R v3.4.3, the data.frame method will not (leading to duplicate column names in the result) but a patch has
been proposed (see r-devel thread here)
which is looking likely to be accepted for a future version of R.
A new data.table based on the merged data tables, and sorted by the
columns set (or inferred for) the by argument if argument sort is
set to TRUE.
#> A X #> 1: a 1 #> 2: b 2 #> 3: c 3 #> 4: d 4 #> 5: e 5 #> 6: f 6 #> 7: g 7 #> 8: h 8 #> 9: i 9 #> 10: j 10#> A Y #> 1: e 1 #> 2: f 2 #> 3: g 3 #> 4: h 4 #> 5: i 5 #> 6: j 6 #> 7: k 7 #> 8: l 8 #> 9: m 9 #> 10: n 10merge(dt1, dt2)#> A X Y #> 1: e 5 1 #> 2: f 6 2 #> 3: g 7 3 #> 4: h 8 4 #> 5: i 9 5 #> 6: j 10 6merge(dt1, dt2, all = TRUE)#> A X Y #> 1: a 1 NA #> 2: b 2 NA #> 3: c 3 NA #> 4: d 4 NA #> 5: e 5 1 #> 6: f 6 2 #> 7: g 7 3 #> 8: h 8 4 #> 9: i 9 5 #> 10: j 10 6 #> 11: k NA 7 #> 12: l NA 8 #> 13: m NA 9 #> 14: n NA 10#> A X #> 1: a 1 #> 2: a 4 #> 3: b 2 #> 4: b 5 #> 5: c 3 #> 6: c 6#> A Y #> 1: b 6 #> 2: b 3 #> 3: c 5 #> 4: c 2 #> 5: d 4 #> 6: d 1merge(dt1, dt2, allow.cartesian=TRUE)#> A X Y #> 1: b 2 6 #> 2: b 2 3 #> 3: b 5 6 #> 4: b 5 3 #> 5: c 3 5 #> 6: c 3 2 #> 7: c 6 5 #> 8: c 6 2#> A B X #> 1: 1 a 1 #> 2: 1 a 4 #> 3: 1 b 2 #> 4: 1 b 5 #> 5: 1 c 3 #> 6: 2 c 6#> A B Y #> 1: 1 b 6 #> 2: 1 b 3 #> 3: 1 c 5 #> 4: 1 c 2 #> 5: 1 d 4 #> 6: 2 d 1merge(dt1, dt2)#> A B X Y #> 1: 1 b 2 6 #> 2: 1 b 2 3 #> 3: 1 b 5 6 #> 4: 1 b 5 3 #> 5: 1 c 3 5 #> 6: 1 c 3 2merge(dt1, dt2, by="B", allow.cartesian=TRUE)#> B A.x X A.y Y #> 1: b 1 2 1 6 #> 2: b 1 2 1 3 #> 3: b 1 5 1 6 #> 4: b 1 5 1 3 #> 5: c 1 3 1 5 #> 6: c 1 3 1 2 #> 7: c 2 6 1 5 #> 8: c 2 6 1 2# test it more: d1 <- data.table(a=rep(1:2,each=3), b=1:6, key="a,b") d2 <- data.table(a=0:1, bb=10:11, key="a") d3 <- data.table(a=0:1, key="a") d4 <- data.table(a=0:1, b=0:1, key="a,b") merge(d1, d2)#> a b bb #> 1: 1 1 11 #> 2: 1 2 11 #> 3: 1 3 11merge(d2, d1)#> a bb b #> 1: 1 11 1 #> 2: 1 11 2 #> 3: 1 11 3merge(d1, d2, all=TRUE)#> a b bb #> 1: 0 NA 10 #> 2: 1 1 11 #> 3: 1 2 11 #> 4: 1 3 11 #> 5: 2 4 NA #> 6: 2 5 NA #> 7: 2 6 NAmerge(d2, d1, all=TRUE)#> a bb b #> 1: 0 10 NA #> 2: 1 11 1 #> 3: 1 11 2 #> 4: 1 11 3 #> 5: 2 NA 4 #> 6: 2 NA 5 #> 7: 2 NA 6merge(d3, d1)#> a b #> 1: 1 1 #> 2: 1 2 #> 3: 1 3merge(d1, d3)#> a b #> 1: 1 1 #> 2: 1 2 #> 3: 1 3merge(d1, d3, all=TRUE)#> a b #> 1: 0 NA #> 2: 1 1 #> 3: 1 2 #> 4: 1 3 #> 5: 2 4 #> 6: 2 5 #> 7: 2 6merge(d3, d1, all=TRUE)#> a b #> 1: 0 NA #> 2: 1 1 #> 3: 1 2 #> 4: 1 3 #> 5: 2 4 #> 6: 2 5 #> 7: 2 6merge(d1, d4)#> a b #> 1: 1 1#> a b.d1 b.d4 #> 1: 1 1 1 #> 2: 1 2 1 #> 3: 1 3 1merge(d4, d1)#> a b #> 1: 1 1merge(d1, d4, all=TRUE)#> a b #> 1: 0 0 #> 2: 1 1 #> 3: 1 2 #> 4: 1 3 #> 5: 2 4 #> 6: 2 5 #> 7: 2 6merge(d4, d1, all=TRUE)#> a b #> 1: 0 0 #> 2: 1 1 #> 3: 1 2 #> 4: 1 3 #> 5: 2 4 #> 6: 2 5 #> 7: 2 6# new feature, no need to set keys anymore set.seed(1L) d1 <- data.table(a=sample(rep(1:3,each=2)), z=1:6) d2 <- data.table(a=2:0, z=10:12) merge(d1, d2, by="a")#> a z.x z.y #> 1: 1 1 11 #> 2: 1 5 11 #> 3: 2 2 10 #> 4: 2 3 10merge(d1, d2, by="a", all=TRUE)#> a z.x z.y #> 1: 0 NA 12 #> 2: 1 1 11 #> 3: 1 5 11 #> 4: 2 2 10 #> 5: 2 3 10 #> 6: 3 4 NA #> 7: 3 6 NA# new feature, using by.x and by.y arguments setnames(d2, "a", "b") merge(d1, d2, by.x="a", by.y="b")#> a z.x z.y #> 1: 1 1 11 #> 2: 1 5 11 #> 3: 2 2 10 #> 4: 2 3 10merge(d1, d2, by.x="a", by.y="b", all=TRUE)#> a z.x z.y #> 1: 0 NA 12 #> 2: 1 1 11 #> 3: 1 5 11 #> 4: 2 2 10 #> 5: 2 3 10 #> 6: 3 4 NA #> 7: 3 6 NAmerge(d2, d1, by.x="b", by.y="a")#> b z.x z.y #> 1: 1 11 1 #> 2: 1 11 5 #> 3: 2 10 2 #> 4: 2 10 3