dcast.data.table.Rd
dcast.data.table
is data.table
's long-to-wide reshaping tool. In the spirit of data.table
, it is very fast and memory efficient, making it well-suited to handling large data sets in RAM. More importantly, it is capable of handling very large data quite efficiently in terms of memory usage. dcast.data.table
can also cast multiple value.var
columns and accepts multiple functions to fun.aggregate
. See Examples for more.
# S3 method for data.table dcast(data, formula, fun.aggregate = NULL, sep = "_", ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess(data), verbose = getOption("datatable.verbose"))
data | A |
---|---|
formula | A formula of the form LHS ~ RHS to cast, see Details. |
fun.aggregate | Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to To use multiple aggregation functions, pass a |
sep | Character vector of length 1, indicating the separating character in variable names generated during casting. Default is |
... | Any other arguments that may be passed to the aggregating function. |
margins | Not implemented yet. Should take variable names to compute margins on. A value of |
subset | Specified if casting should be done on a subset of the data. Ex: |
fill | Value with which to fill missing cells. If |
drop |
|
value.var | Name of the column whose values will be filled to cast. Function Cast multiple |
verbose | Not used yet. May be dropped in the future or used to provide informative messages through the console. |
The cast formula takes the form LHS ~ RHS
, ex: var1 + var2 ~ var3
. The order of entries in the formula is essential. There are two special variables: .
represents no variable, while ...
represents all variables not otherwise mentioned in formula
; see Examples.
When not all combinations of LHS & RHS values are present in the data, some or all (in accordance with drop
) missing combinations will replaced with the value specified by fill
. Note that fill
will be converted to the class of value.var
; see Examples.
dcast
also allows value.var
columns of type list
.
When variable combinations in formula
don't identify a unique value, fun.aggregate
will have to be specified, which defaults to length
. For the formula var1 ~ var2
, this means there are some (var1, var2)
combinations in the data corresponding to multiple rows (i.e. x
is not unique by (var1, var2)
.
The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var
is a list, the function should be able to handle a list input and provide a single value or list of length one as output.
If the formula's LHS contains the same column more than once, ex: dcast(DT, x+x~ y)
, then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique
so that key can be set without issues.
Names for columns that are being cast are generated in the same order (separated by an underscore, _
) from the (unique) values in each column mentioned in the formula RHS.
From v1.9.4
, dcast
tries to preserve attributes wherever possible.
From v1.9.6
, it is possible to cast multiple value.var
columns and also cast by providing multiple fun.aggregate
functions. Multiple fun.aggregate
functions should be provided as a list
, for e.g., list(mean, sum, function(x) paste(x, collapse="")
. value.var
can be either a character vector or list of length one, or a list of length equal to length(fun.aggregate)
. When value.var
is a character vector or a list of length one, each function mentioned under fun.aggregate
is applied to every column specified under value.var
column. When value.var
is a list of length equal to length(fun.aggregate)
each element of fun.aggregate
is applied to each element of value.var
column.
Historical note: dcast.data.table
was originally designed as an enhancement to reshape2::dcast
in terms of computing and memory efficiency. reshape2
has since been deprecated, and dcast
has had a generic defined within data.table
since v1.9.6
in 2015, at which point the dependency between the packages became more etymological than programmatic. We thank the reshape2
authors for the inspiration.
A keyed data.table
that has been cast. The key columns are equal to the variables in the formula
LHS in the same order.
ChickWeight = as.data.table(ChickWeight) setnames(ChickWeight, tolower(names(ChickWeight))) DT <- melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table # dcast is an S3 method in data.table from v1.9.6 dcast(DT, time ~ variable, fun=mean) # using partial matching of argument#> time weight #> 1: 0 41.06000 #> 2: 2 49.22000 #> 3: 4 59.95918 #> 4: 6 74.30612 #> 5: 8 91.24490 #> 6: 10 107.83673 #> 7: 12 129.24490 #> 8: 14 143.81250 #> 9: 16 168.08511 #> 10: 18 190.19149 #> 11: 20 209.71739 #> 12: 21 218.68889dcast(DT, diet ~ variable, fun=mean)#> diet weight #> 1: 1 102.6455 #> 2: 2 122.6167 #> 3: 3 142.9500 #> 4: 4 135.2627dcast(DT, diet+chick ~ time, drop=FALSE)#> diet chick 0 2 4 6 8 10 12 14 16 18 20 21 #> 1: 1 18 39 35 NA NA NA NA NA NA NA NA NA NA #> 2: 1 16 41 45 49 51 57 51 54 NA NA NA NA NA #> 3: 1 15 41 49 56 64 68 68 67 68 NA NA NA NA #> 4: 1 13 41 48 53 60 65 67 71 70 71 81 91 96 #> 5: 1 9 42 51 59 68 85 96 90 92 93 100 100 98 #> --- #> 196: 4 49 40 53 64 85 108 128 152 166 184 203 233 237 #> 197: 4 46 40 52 62 82 101 120 144 156 173 210 231 238 #> 198: 4 50 41 54 67 84 105 122 155 175 205 234 264 264 #> 199: 4 42 42 49 63 84 103 126 160 174 204 234 269 281 #> 200: 4 48 39 50 62 80 104 125 154 170 222 261 303 322dcast(DT, diet+chick ~ time, drop=FALSE, fill=0)#> diet chick 0 2 4 6 8 10 12 14 16 18 20 21 #> 1: 1 18 39 35 0 0 0 0 0 0 0 0 0 0 #> 2: 1 16 41 45 49 51 57 51 54 0 0 0 0 0 #> 3: 1 15 41 49 56 64 68 68 67 68 0 0 0 0 #> 4: 1 13 41 48 53 60 65 67 71 70 71 81 91 96 #> 5: 1 9 42 51 59 68 85 96 90 92 93 100 100 98 #> --- #> 196: 4 49 40 53 64 85 108 128 152 166 184 203 233 237 #> 197: 4 46 40 52 62 82 101 120 144 156 173 210 231 238 #> 198: 4 50 41 54 67 84 105 122 155 175 205 234 264 264 #> 199: 4 42 42 49 63 84 103 126 160 174 204 234 269 281 #> 200: 4 48 39 50 62 80 104 125 154 170 222 261 303 322# using subset dcast(DT, chick ~ time, fun=mean, subset=.(time < 10 & chick < 20))#> chick 0 2 4 6 8 #> 1: 18 39 35 NaN NaN NaN #> 2: 16 41 45 49 51 57 #> 3: 15 41 49 56 64 68 #> 4: 13 41 48 53 60 65 #> 5: 9 42 51 59 68 85# drop argument, #1512 DT <- data.table(v1 = c(1.1, 1.1, 1.1, 2.2, 2.2, 2.2), v2 = factor(c(1L, 1L, 1L, 3L, 3L, 3L), levels=1:3), v3 = factor(c(2L, 3L, 5L, 1L, 2L, 6L), levels=1:6), v4 = c(3L, 2L, 2L, 5L, 4L, 3L)) # drop=TRUE dcast(DT, v1 + v2 ~ v3) # default is drop=TRUE#>#> v1 v2 1 2 3 5 6 #> 1: 1.1 1 NA 3 2 2 NA #> 2: 2.2 3 5 4 NA NA 3dcast(DT, v1 + v2 ~ v3, drop=FALSE) # all missing combinations of both LHS and RHS#>#> v1 v2 1 2 3 4 5 6 #> 1: 1.1 1 NA 3 2 NA 2 NA #> 2: 1.1 2 NA NA NA NA NA NA #> 3: 1.1 3 NA NA NA NA NA NA #> 4: 2.2 1 NA NA NA NA NA NA #> 5: 2.2 2 NA NA NA NA NA NA #> 6: 2.2 3 5 4 NA NA NA 3#>#> v1 v2 1 2 3 5 6 #> 1: 1.1 1 NA 3 2 2 NA #> 2: 1.1 2 NA NA NA NA NA #> 3: 1.1 3 NA NA NA NA NA #> 4: 2.2 1 NA NA NA NA NA #> 5: 2.2 2 NA NA NA NA NA #> 6: 2.2 3 5 4 NA NA 3#>#> v1 v2 1 2 3 4 5 6 #> 1: 1.1 1 NA 3 2 NA 2 NA #> 2: 2.2 3 5 4 NA NA NA 3# using . and ... DT <- data.table(v1 = rep(1:2, each = 6), v2 = rep(rep(1:3, 2), each = 2), v3 = rep(1:2, 6), v4 = rnorm(6)) dcast(DT, ... ~ v3, value.var = "v4") #same as v1 + v2 ~ v3, value.var = "v4"#> v1 v2 1 2 #> 1: 1 1 -0.2713771 -0.1364510 #> 2: 1 2 0.2413600 -0.9840585 #> 3: 1 3 -0.5286298 -1.4844445 #> 4: 2 1 -0.2713771 -0.1364510 #> 5: 2 2 0.2413600 -0.9840585 #> 6: 2 3 -0.5286298 -1.4844445dcast(DT, v1 + v2 + v3 ~ ., value.var = "v4")#> v1 v2 v3 . #> 1: 1 1 1 -0.2713771 #> 2: 1 1 2 -0.1364510 #> 3: 1 2 1 0.2413600 #> 4: 1 2 2 -0.9840585 #> 5: 1 3 1 -0.5286298 #> 6: 1 3 2 -1.4844445 #> 7: 2 1 1 -0.2713771 #> 8: 2 1 2 -0.1364510 #> 9: 2 2 1 0.2413600 #> 10: 2 2 2 -0.9840585 #> 11: 2 3 1 -0.5286298 #> 12: 2 3 2 -1.4844445## for each combination of (v1, v2), add up all values of v4 dcast(DT, v1 + v2 ~ ., value.var = "v4", fun.aggregate = sum)#> v1 v2 . #> 1: 1 1 -0.4078281 #> 2: 1 2 -0.7426985 #> 3: 1 3 -2.0130743 #> 4: 2 1 -0.4078281 #> 5: 2 2 -0.7426985 #> 6: 2 3 -2.0130743# fill and types dcast(DT, v2 ~ v3, value.var = 'v1', fill = 0L) # 0L --> 0#>#> v2 1 2 #> 1: 1 2 2 #> 2: 2 2 2 #> 3: 3 2 2dcast(DT, v2 ~ v3, value.var = 'v4', fill = 1.1) # 1.1 --> 1L#>#> v2 1 2 #> 1: 1 2 2 #> 2: 2 2 2 #> 3: 3 2 2# multiple value.var and multiple fun.aggregate DT = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE), z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L) # multiple value.var dcast(DT, x + y ~ z, fun=sum, value.var=c("d1","d2"))#> x y d1_a d1_b d2_a d2_b #> 1: 1 1 1.43682427 0.9854205 2 1 #> 2: 1 2 0.70465202 0.1695072 1 1 #> 3: 2 1 0.87082789 0.0000000 1 0 #> 4: 2 2 0.00000000 0.0742473 0 1 #> 5: 3 1 0.49326962 0.2193234 2 2 #> 6: 3 2 0.08148138 0.0000000 1 0 #> 7: 4 1 1.07577703 1.1836827 2 2 #> 8: 4 2 0.67587545 0.7127355 1 1 #> 9: 5 2 0.57778167 0.5192365 1 1#> x y d1_sum_a d1_sum_b d1_mean_a d1_mean_b #> 1: 1 1 1.43682427 0.9854205 0.71841214 0.9854205 #> 2: 1 2 0.70465202 0.1695072 0.70465202 0.1695072 #> 3: 2 1 0.87082789 0.0000000 0.87082789 NaN #> 4: 2 2 0.00000000 0.0742473 NaN 0.0742473 #> 5: 3 1 0.49326962 0.2193234 0.24663481 0.1096617 #> 6: 3 2 0.08148138 0.0000000 0.08148138 NaN #> 7: 4 1 1.07577703 1.1836827 0.53788852 0.5918414 #> 8: 4 2 0.67587545 0.7127355 0.67587545 0.7127355 #> 9: 5 2 0.57778167 0.5192365 0.57778167 0.5192365# multiple fun.agg and value.var (all combinations) dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=c("d1", "d2"))#> x y d1_sum_a d1_sum_b d2_sum_a d2_sum_b d1_mean_a d1_mean_b d2_mean_a #> 1: 1 1 1.43682427 0.9854205 2 1 0.71841214 0.9854205 1 #> 2: 1 2 0.70465202 0.1695072 1 1 0.70465202 0.1695072 1 #> 3: 2 1 0.87082789 0.0000000 1 0 0.87082789 NaN 1 #> 4: 2 2 0.00000000 0.0742473 0 1 NaN 0.0742473 NaN #> 5: 3 1 0.49326962 0.2193234 2 2 0.24663481 0.1096617 1 #> 6: 3 2 0.08148138 0.0000000 1 0 0.08148138 NaN 1 #> 7: 4 1 1.07577703 1.1836827 2 2 0.53788852 0.5918414 1 #> 8: 4 2 0.67587545 0.7127355 1 1 0.67587545 0.7127355 1 #> 9: 5 2 0.57778167 0.5192365 1 1 0.57778167 0.5192365 1 #> d2_mean_b #> 1: 1 #> 2: 1 #> 3: NaN #> 4: 1 #> 5: 1 #> 6: NaN #> 7: 1 #> 8: 1 #> 9: 1# multiple fun.agg and value.var (one-to-one) dcast(DT, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))#> x y d1_sum_a d1_sum_b d2_mean_a d2_mean_b #> 1: 1 1 1.43682427 0.9854205 1 1 #> 2: 1 2 0.70465202 0.1695072 1 1 #> 3: 2 1 0.87082789 0.0000000 1 NaN #> 4: 2 2 0.00000000 0.0742473 NaN 1 #> 5: 3 1 0.49326962 0.2193234 1 1 #> 6: 3 2 0.08148138 0.0000000 1 NaN #> 7: 4 1 1.07577703 1.1836827 1 1 #> 8: 4 2 0.67587545 0.7127355 1 1 #> 9: 5 2 0.57778167 0.5192365 1 1