This is a general purpose complement to the specialised manipulation functions filter(), select(), mutate(), summarise() and arrange(). You can use do() to perform arbitrary computation, returning either a data frame or arbitrary objects which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do() and then flexibly extract components with either another do() or summarise().
For an empty data frame, the expressions will be evaluated once, even in the presence of a grouping. This makes sure that the format of the resulting data frame is the same for both empty and non-empty input.
do(.data, ...)
Argument | Description |
---|---|
.data | A tbl. |
... | Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, should return a data frame. You can use . to refer to the current group. |
do() always returns a data frame. The first columns in the data frame will be the labels, the others will be computed from the ... arguments. Named arguments become list-columns, with one element for each group; unnamed elements must be data frames and labels will be duplicated accordingly.
Groups are preserved for a single unnamed input. This is different to summarise() because do() generally does not reduce the complexity of the data; it just expresses it in a special way. For multiple named inputs, the output is grouped by row with rowwise(). This allows other verbs to work in an intuitive way.
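The difference between unnamed and named inputs can be sketched as follows. This is a minimal illustration, assuming dplyr is installed and using the built-in mtcars data; the object names top2, models and rsq are chosen here for illustration only.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Single unnamed input: each group's result is a data frame, the pieces are
# row-bound, and the grouping by cyl is kept.
top2 <- do(by_cyl, head(., 2))
group_vars(top2)

# Named input: one row per group, with the fitted model stored in a
# list-column called mod.
models <- do(by_cyl, mod = lm(mpg ~ disp, data = .))
class(models$mod)

# Because the result is grouped by row, summarise() sees one model at a time.
rsq <- summarise(models, rsq = summary(mod)$r.squared)
rsq
```

The by-row grouping is what lets summary(mod) receive a single lm object rather than the whole list-column.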
do() is marked as questioning as of dplyr 0.8.0, and may be advantageously replaced by group_map().
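As a rough sketch of that replacement, the per-group model fit could be written with group_map(). This assumes dplyr 0.8.1 or later, where group_map() returns a list with one element per group (in 0.8.0 it returned a data frame instead); the names fits, keys and result are illustrative.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# group_map() calls the function once per group. In the formula shorthand,
# .x is the group's rows and .y is a one-row tibble holding the group key.
fits <- group_map(by_cyl, ~ lm(mpg ~ disp, data = .x))

# group_keys() returns the group labels in the same order as the list.
keys <- group_keys(by_cyl)

# Pull one number out of each model and attach it to the labels.
result <- mutate(
  keys,
  rsq = vapply(fits, function(m) summary(m)$r.squared, numeric(1))
)
result
```

Unlike do() with a named argument, the models live in a plain list here, so the group labels have to be joined back explicitly.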
If you're familiar with plyr, do() with named arguments is basically equivalent to plyr::dlply(), and do() with a single unnamed argument is basically equivalent to plyr::ldply(). However, instead of storing labels in a separate attribute, the result is always a data frame. This means that summarise() applied to the result of do() can act like ldply().
by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))
#> # A tibble: 6 x 11
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#> 4  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#> 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#> 6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4

models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
models
#> Source: local data frame [3 x 2]
#> Groups: <by row>
#>
#> # A tibble: 3 x 2
#>     cyl mod
#> * <dbl> <list>
#> 1     4 <lm>
#> 2     6 <lm>
#> 3     8 <lm>

summarise(models, rsq = summary(mod)$r.squared)
#> # A tibble: 3 x 1
#>      rsq
#>    <dbl>
#> 1 0.648
#> 2 0.0106
#> 3 0.270

models %>% do(data.frame(coef = coef(.$mod)))
#> Source: local data frame [6 x 1]
#> Groups: <by row>
#>
#> # A tibble: 6 x 1
#>       coef
#> *    <dbl>
#> 1 40.9
#> 2 -0.135
#> 3 19.1
#> 4  0.00361
#> 5 22.0
#> 6 -0.0196

models %>% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod)))
)
#> Source: local data frame [6 x 5]
#> Groups: <by row>
#>
#> # A tibble: 6 x 5
#>   var         Estimate Std..Error t.value   Pr...t..
#> * <fct>          <dbl>      <dbl>   <dbl>      <dbl>
#> 1 (Intercept) 40.9        3.59     11.4   0.00000120
#> 2 disp        -0.135      0.0332   -4.07  0.00278
#> 3 (Intercept) 19.1        2.91      6.55  0.00124
#> 4 disp         0.00361    0.0156    0.232 0.826
#> 5 (Intercept) 22.0        3.35      6.59  0.0000259
#> 6 disp        -0.0196     0.00932  -2.11  0.0568

models <- by_cyl %>% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#>
#> # A tibble: 3 x 3
#>     cyl mod_linear mod_quad
#> * <dbl> <list>     <list>
#> 1     4 <lm>       <lm>
#> 2     6 <lm>       <lm>
#> 3     8 <lm>       <lm>

compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
# compare %>% summarise(p.value = aov$`Pr(>F)`)

if (require("nycflights13")) {
  # You can use it to do any arbitrary computation, like fitting a linear
  # model. Let's explore how carrier departure delays vary over time
  carriers <- group_by(flights, carrier)
  group_size(carriers)

  mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
  mods %>% do(as.data.frame(coef(.$mod)))
  mods %>% summarise(rsq = summary(mod)$r.squared)

  if (FALSE) {
    # This longer example shows the progress bar in action
    by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
    library(mgcv)
    by_dest %>% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
  }
}