Split method for data.table. Faster and more flexible. Be aware that processing a list of data.tables will generally be much slower than manipulating a single data.table by group using the by argument; read more on data.table.
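As a minimal sketch of the trade-off described above, the snippet below computes the same per-group total twice: once by group inside a single data.table, and once over a list produced by `split`. The table `DT` and its columns `g` and `v` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data: two groups, four rows
DT = data.table(g = c("a", "b", "a", "b"), v = 1:4)

# Grouped aggregation inside a single data.table
res_by = DT[, .(total = sum(v)), by = g]

# Same aggregation over a list of data.tables, then recombined
parts = split(DT, by = "g")
res_split = rbindlist(lapply(parts, function(d) d[, .(g = g[1L], total = sum(v))]))

# Both approaches give the same table
all.equal(res_by, res_split)
```

On data this small the difference is negligible; the by-group form avoids building and recombining one data.table per group, which matters as the number of groups grows.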
# S3 method for data.table
split(x, f, drop = FALSE, by, sorted = FALSE, keep.by = TRUE, flatten = TRUE, ..., verbose = getOption("datatable.verbose"))
| Argument | Description |
|---|---|
| x | data.table |
| f | factor or list of factors. Same as |
| drop | logical. Default `FALSE`. |
| by | character vector. Column names on which split should be made. For |
| sorted | When default `FALSE`, groups keep the order in which they appear in the data. |
| keep.by | logical, default `TRUE`. |
| flatten | logical, default `TRUE`. |
| ... | passed to data.frame way of processing when using |
| verbose | logical, default `getOption("datatable.verbose")`. |
Argument f exists only for consistency with the data.frame method. Using the by argument instead is recommended: it is faster, more flexible, and by default preserves the order of groups as they appear in the data.
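The ordering difference between `f` and `by` can be illustrated with a small sketch. The table `DT` and its columns are hypothetical; only the `f`, `by`, and `sorted` arguments come from the usage shown above.

```r
library(data.table)

# Hypothetical toy data where group "b" appears before group "a"
DT = data.table(g = c("b", "a", "b"), v = 1:3)

# by: list elements follow the order of first appearance in the data
by_names = names(split(DT, by = "g"))

# by with sorted = TRUE: list elements are sorted by group value
sorted_names = names(split(DT, by = "g", sorted = TRUE))

# f: data.frame-style splitting, elements follow the factor levels
f_names = names(split(DT, f = DT$g))

by_names      # "b" "a"
sorted_names  # "a" "b"
f_names       # "a" "b"
```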
A list of data.tables. If flatten is FALSE and length(by) > 1L, the result is a recursively nested list with data.tables as leaves, grouped according to the by argument.
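A short sketch of the two return shapes may help here. The data below is invented for illustration; the flat element names are joined with a dot, as in the example output further down.

```r
library(data.table)

# Hypothetical toy data with two grouping columns
DT = data.table(x1 = c("a", "a", "b"), x2 = c("c", "d", "c"), y = 1:3)

# Default flatten = TRUE: one flat list, names combine both grouping values
flat = split(DT, by = c("x1", "x2"))
names(flat)              # "a.c" "a.d" "b.c"

# flatten = FALSE: one nesting level per column in `by`
nested = split(DT, by = c("x1", "x2"), flatten = FALSE)
names(nested)            # "a" "b"
names(nested$a)          # "c" "d"

# Leaves are data.tables, reachable by walking the levels
leaf = nested[["a"]][["d"]]
leaf$y                   # 2
```

The nested form suits code that recurses level by level; the flat form is easier to pass to `lapply` or `rbindlist`.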
library(data.table)
set.seed(123)
DT = data.table(x1 = rep(letters[1:2], 6),
                x2 = rep(letters[3:5], 4),
                x3 = rep(letters[5:8], 3),
                y = rnorm(12))
DT = DT[sample(.N)]
DF = as.data.frame(DT)

# split consistency with data.frame: `x, f, drop`
all.equal(
    split(DT, list(DT$x1, DT$x2)),
    lapply(split(DF, list(DF$x1, DF$x2)), setDT)
)
#> [1] TRUE

# flat list split on two columns (call inferred from the output below)
split(DT, by=c("x1", "x2"))
#> $a.e
#>    x1 x2 x3          y
#> 1:  a  e  g  1.5587083
#> 2:  a  e  e -0.6868529
#>
#> $b.d
#>    x1 x2 x3          y
#> 1:  b  d  h -1.2650612
#> 2:  b  d  f -0.2301775
#>
#> $b.c
#>    x1 x2 x3           y
#> 1:  b  c  f -0.44566197
#> 2:  b  c  h  0.07050839
#>
#> $a.c
#>    x1 x2 x3          y
#> 1:  a  c  g  0.4609162
#> 2:  a  c  e -0.5604756
#>
#> $b.e
#>    x1 x2 x3         y
#> 1:  b  e  f 1.7150650
#> 2:  b  e  h 0.3598138
#>
#> $a.d
#>    x1 x2 x3         y
#> 1:  a  d  g 1.2240818
#> 2:  a  d  e 0.1292877
#>

# nested list with flatten = FALSE (call inferred from the output below)
split(DT, by=c("x1", "x2"), flatten=FALSE)
#> $a
#> $a$e
#>    x1 x2 x3          y
#> 1:  a  e  g  1.5587083
#> 2:  a  e  e -0.6868529
#>
#> $a$c
#>    x1 x2 x3          y
#> 1:  a  c  g  0.4609162
#> 2:  a  c  e -0.5604756
#>
#> $a$d
#>    x1 x2 x3         y
#> 1:  a  d  g 1.2240818
#> 2:  a  d  e 0.1292877
#>
#>
#> $b
#> $b$d
#>    x1 x2 x3          y
#> 1:  b  d  h -1.2650612
#> 2:  b  d  f -0.2301775
#>
#> $b$c
#>    x1 x2 x3           y
#> 1:  b  c  f -0.44566197
#> 2:  b  c  h  0.07050839
#>
#> $b$e
#>    x1 x2 x3         y
#> 1:  b  e  f 1.7150650
#> 2:  b  e  h 0.3598138
#>
#>

# dealing with factors
fdt = DT[, c(lapply(.SD, as.factor), list(y=y)), .SDcols=x1:x3]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE

# factors having unused levels, drop FALSE, TRUE
fdt = DT[, .(x1 = as.factor(c(as.character(x1), "c"))[-13L],
             x2 = as.factor(c("a", as.character(x2)))[-1L],
             x3 = as.factor(c("a", as.character(x3), "z"))[c(-1L,-14L)],
             y = y)]
fdf = as.data.frame(fdt)
sdf = split(fdf, list(fdf$x1, fdf$x2))
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE
sdf = split(fdf, list(fdf$x1, fdf$x2), drop=TRUE)
all.equal(
    split(fdt, by=c("x1", "x2"), sorted=TRUE, drop=TRUE),
    lapply(sdf[sort(names(sdf))], setDT)
)
#> [1] TRUE