duplicated returns a logical vector indicating which rows of a
data.table are duplicates of a row with smaller subscripts.
unique returns a data.table with duplicated rows removed, where
duplication is judged on the columns specified in the by argument. When by
is not supplied, rows duplicated across all columns are removed.
anyDuplicated returns the index i of the first duplicated
entry if there is one, and 0 otherwise.
uniqueN is equivalent to length(unique(x)) when x is an
atomic vector, and nrow(unique(x)) when x is a data.frame
or data.table. The number of unique rows is computed directly without
materialising the intermediate unique data.table, and is therefore faster and
more memory efficient.
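The equivalences described above can be checked with a small sketch. The vector x and the table DT below are made-up illustration data, not taken from the package examples.

```r
library(data.table)

# Atomic vector: uniqueN(x) should match length(unique(x))
x <- c(3L, 1L, 3L, 2L, 1L)
n_vec      <- uniqueN(x)
n_vec_base <- length(unique(x))

# data.table: uniqueN(DT) should match nrow(unique(DT))
DT <- data.table(a = c(1L, 1L, 2L, 2L), b = c("u", "u", "v", "w"))
n_rows      <- uniqueN(DT)
n_rows_base <- nrow(unique(DT))

# Restricting the comparison to one column with by
n_a <- uniqueN(DT, by = "a")
```

Here n_vec and n_rows are both 3, and n_a is 2 because column a holds only two distinct values.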
    # S3 method for data.table
    duplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)

    # S3 method for data.table
    unique(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)

    # S3 method for data.table
    anyDuplicated(x, incomparables=FALSE, fromLast=FALSE, by=seq_along(x), ...)

    uniqueN(x, by=if (is.list(x)) seq_along(x) else NULL, na.rm=FALSE)

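| Argument | Description |
|---|---|
| x | A data.table. uniqueN also accepts an atomic vector or a data.frame. |
| ... | Not used at this time. |
| incomparables | Not used. Here for S3 method consistency. |
| fromLast | Logical indicating whether duplication should be considered from the reverse side, i.e., the last (or rightmost) of identical elements would correspond to duplicated=FALSE. |
| by | The columns of x to use for the uniqueness check. Defaults to all columns, seq_along(x). |
| na.rm | Logical (default is FALSE). If TRUE, missing values (NA and NaN) are not counted by uniqueN. |
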
Because data.tables are usually sorted by key, tests for duplication are
especially quick when only the keyed columns are considered. Unlike
unique.data.frame, paste is not used to ensure
equality of floating point data. Equality is instead tested directly, which is
quite fast. data.table provides setNumericRounding to
handle cases where limitations in floating point representation are undesirable.
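A minimal sketch of how setNumericRounding can affect duplicate detection, assuming the rounding setting is honoured by uniqueN in the installed version of data.table. The values 0.1 + 0.2 and 0.3 are chosen for illustration because they differ only in the last bit of their double representation; getNumericRounding is used to save and restore the session setting.

```r
library(data.table)

old <- getNumericRounding()      # remember the current session setting

DT <- data.table(a = c(0.1 + 0.2, 0.3), b = 1L)

setNumericRounding(0)            # compare doubles exactly
n_exact <- uniqueN(DT)           # the two values of a are treated as distinct

setNumericRounding(2)            # ignore the last 2 bytes of the significand
n_rounded <- uniqueN(DT)         # the two rows now compare as equal

setNumericRounding(old)          # restore the previous setting
```

Under these assumptions n_exact is 2 and n_rounded is 1. Restoring the old value matters because the setting is global to the session.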
v1.9.4 introduced an anyDuplicated method for data.tables that is
similar in functionality to the base method. It also implemented the logical
argument fromLast for all three functions, with default value FALSE.
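The effect of fromLast, and the relationship between anyDuplicated and duplicated, can be sketched as follows. The table DT and its columns id and v are hypothetical illustration data.

```r
library(data.table)

DT <- data.table(id = c("a", "b", "a", "c", "b"), v = 1:5)

# Default: the first occurrence of each id is the non-duplicate
dup_first <- duplicated(DT, by = "id")

# fromLast=TRUE: the last occurrence of each id is the non-duplicate
dup_last <- duplicated(DT, by = "id", fromLast = TRUE)

# anyDuplicated gives the position of the first TRUE in dup_first
first_dup <- anyDuplicated(DT, by = "id")

# unique with fromLast=TRUE keeps the last row seen for each id
kept_last <- unique(DT, by = "id", fromLast = TRUE)
```

With this data, dup_first flags rows 3 and 5, dup_last flags rows 1 and 2, first_dup is 3, and kept_last holds the rows with v equal to 3, 4 and 5.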
duplicated returns a logical vector of length nrow(x)
indicating which rows are duplicates.
unique returns a data.table with duplicated rows removed.
anyDuplicated returns an integer value with the index of the first duplicate.
If none exists, 0L is returned.
uniqueN returns the number of unique elements in the vector,
data.frame or data.table.
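The return values listed above can be confirmed with a short sketch. DT, DT_nodup and x are made-up illustration data.

```r
library(data.table)

DT <- data.table(a = c(1L, 2L, 1L), b = c("x", "y", "x"))

d <- duplicated(DT)              # logical vector, one element per row
u <- unique(DT)                  # data.table without the duplicated row
i <- anyDuplicated(DT)           # index of the first duplicated row

# A table with no duplicated rows yields 0L
DT_nodup <- data.table(a = 1:3)
none <- anyDuplicated(DT_nodup)

# uniqueN on an atomic vector, with and without missing values counted
x <- c(1L, NA, 1L, 2L)
n_all   <- uniqueN(x)
n_no_na <- uniqueN(x, na.rm = TRUE)
```

Here d has length nrow(DT), u has two rows, i is 3, none is 0L, and the two counts are 3 and 2 respectively.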
setNumericRounding, data.table,
duplicated, unique, all.equal,
fsetdiff, funion, fintersect,
fsetequal
    DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
                     C = rep(1:2, 6), key = "A,B")
    duplicated(DT)
    #>  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
    unique(DT)
    #>     A B C
    #>  1: 1 1 1
    #>  2: 1 1 2
    #>  3: 1 2 2
    #>  4: 2 2 1
    #>  5: 2 2 2
    #>  6: 2 3 1
    #>  7: 2 3 2
    #>  8: 3 3 1
    #>  9: 3 4 2
    #> 10: 3 4 1

    duplicated(DT, by="B")
    #>  [1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
    unique(DT, by="B")
    #>    A B C
    #> 1: 1 1 1
    #> 2: 1 2 2
    #> 3: 2 3 1
    #> 4: 3 4 2

    #>  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
    #>    A B C
    #> 1: 1 1 1
    #> 2: 1 1 2
    #> 3: 2 2 1
    #> 4: 2 2 2
    #> 5: 3 3 1
    #> 6: 3 4 2

    DT = data.table(a=c(2L,1L,2L), b=c(1L,2L,1L))   # no key
    unique(DT)                   # rows 1 and 2 (row 3 is a duplicate of row 1)
    #>    a b
    #> 1: 2 1
    #> 2: 1 2

    #>        a b
    #> 1: 3.142 1
    #> 2: 4.200 1
    #> 3: 1.223 1

    DT = data.table(a=tan(pi*(1/4 + 1:10)), b=rep(1,10))   # example from ?all.equal
    length(unique(DT$a))         # 10 strictly unique floating point values
    #> [1] 10
    #> [1] TRUE
    #> [1] 10
    #> [1] FALSE
    #> [1] FALSE

    # fromLast=TRUE
    DT <- data.table(A = rep(1:3, each=4), B = rep(1:4, each=3),
                     C = rep(1:2, 6), key = "A,B")
    duplicated(DT, by="B", fromLast=TRUE)
    #>  [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
    unique(DT, by="B", fromLast=TRUE)
    #>    A B C
    #> 1: 1 1 1
    #> 2: 2 2 2
    #> 3: 3 3 1
    #> 4: 3 4 2

    #> [1] 2
    #> [1] TRUE
    #> [1] 6

    # uniqueN, unique rows on all columns
    uniqueN(DT)
    #> [1] 10

    # uniqueN while grouped by "A"
    DT[, .(uN=uniqueN(.SD)), by=A]
    #>    A uN
    #> 1: 1  3
    #> 2: 2  4
    #> 3: 3  3

    # uniqueN's na.rm=TRUE
    x = sample(c(NA, NaN, runif(3)), 10, TRUE)
    uniqueN(x, na.rm = FALSE)    # 5, default
    #> [1] 4
    uniqueN(x, na.rm=TRUE)       # 3
    #> [1] 3