Do anything: do() (dplyr)

This is a general-purpose complement to the specialised manipulation functions filter(), select(), mutate(), summarise() and arrange(). You can use do() to perform arbitrary computation, returning either a data frame or arbitrary objects, which will be stored in a list. This is particularly useful when working with models: you can fit models per group with do() and then flexibly extract components with either another do() or summarise().

For an empty data frame, the expressions will be evaluated once, even in the presence of a grouping. This makes sure that the format of the resulting data frame is the same for both empty and non-empty input.

do(.data, ...)

Arguments

.data	a tbl
...	Expressions to apply to each group. If named, results will be stored in a new column. If unnamed, the expression should return a data frame. You can use `.` to refer to the current group. You cannot mix named and unnamed arguments.

Value

do() always returns a data frame. The first columns in the data frame will be the labels, the others will be computed from .... Named arguments become list-columns, with one element for each group; unnamed elements must be data frames and labels will be duplicated accordingly.

Groups are preserved for a single unnamed input. This is different to summarise() because do() generally does not reduce the complexity of the data, it just expresses it in a special way. For multiple named inputs, the output is grouped by row with rowwise(). This allows other verbs to work in an intuitive way.
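
The difference between the two return shapes can be checked directly. The following is a minimal sketch using the built-in mtcars data; the object names unnamed and named are illustrative only, and the class check assumes that the row-wise result carries the "rowwise_df" class, as it does in the dplyr versions this page describes.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# A single unnamed expression returns a data frame that keeps the grouping.
unnamed <- do(by_cyl, head(., 2))
group_vars(unnamed)
#> [1] "cyl"

# A named expression produces a list-column, grouped by row.
named <- do(by_cyl, mod = lm(mpg ~ disp, data = .))
inherits(named, "rowwise_df")
is.list(named$mod)
```

Because the named result is grouped by row, a following summarise() or do() sees one model at a time.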

Details

Alternative

do() has been marked as questioning since dplyr 0.8.0; group_map() is often a better alternative.
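
As a rough sketch of what such a replacement might look like, assuming dplyr 0.8.1 or later (where group_map() returns a list and the data-frame-returning variant is called group_modify()), the two common uses of do() could be rewritten as follows. The object names first_two and fits are illustrative only.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Data frame per group: group_modify() plays the role of an unnamed do().
# `.x` holds the rows of the current group, without the grouping column.
first_two <- by_cyl %>% group_modify(~ head(.x, 2))

# Arbitrary object per group: group_map() returns a plain list,
# with one element for each group.
fits <- by_cyl %>% group_map(~ lm(mpg ~ disp, data = .x))
```

Unlike do() with a named argument, group_map() does not attach the group labels to the result; group_keys(by_cyl) gives them in the matching order.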

Connection to plyr

If you're familiar with plyr, do() with named arguments is basically equivalent to plyr::dlply(), and do() with a single unnamed argument is basically equivalent to plyr::ldply(). However, instead of storing labels in a separate attribute, the result is always a data frame. This means that summarise() applied to the result of do() can act like ldply().

Examples

library(dplyr)

by_cyl <- group_by(mtcars, cyl)
do(by_cyl, head(., 2))
#> # A tibble: 6 x 11
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#> 4  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#> 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#> 6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4

models <- by_cyl %>% do(mod = lm(mpg ~ disp, data = .))
models
#> Source: local data frame [3 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 2
#>     cyl mod   
#> * <dbl> <list>
#> 1     4 <lm>  
#> 2     6 <lm>  
#> 3     8 <lm>  

summarise(models, rsq = summary(mod)$r.squared)
#> # A tibble: 3 x 1
#>      rsq
#>    <dbl>
#> 1 0.648 
#> 2 0.0106
#> 3 0.270 
models %>% do(data.frame(coef = coef(.$mod)))
#> Source: local data frame [6 x 1]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 1
#>       coef
#> *    <dbl>
#> 1 40.9    
#> 2 -0.135  
#> 3 19.1    
#> 4  0.00361
#> 5 22.0    
#> 6 -0.0196 
models %>% do(data.frame(
  var = names(coef(.$mod)),
  coef(summary(.$mod)))
)
#> Source: local data frame [6 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 5
#>   var         Estimate Std..Error t.value   Pr...t..
#> * <fct>          <dbl>      <dbl>   <dbl>      <dbl>
#> 1 (Intercept) 40.9        3.59     11.4   0.00000120
#> 2 disp        -0.135      0.0332   -4.07  0.00278   
#> 3 (Intercept) 19.1        2.91      6.55  0.00124   
#> 4 disp         0.00361    0.0156    0.232 0.826     
#> 5 (Intercept) 22.0        3.35      6.59  0.0000259 
#> 6 disp        -0.0196     0.00932  -2.11  0.0568    

models <- by_cyl %>% do(
  mod_linear = lm(mpg ~ disp, data = .),
  mod_quad = lm(mpg ~ poly(disp, 2), data = .)
)
models
#> Source: local data frame [3 x 3]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 3
#>     cyl mod_linear mod_quad
#> * <dbl> <list>     <list>  
#> 1     4 <lm>       <lm>    
#> 2     6 <lm>       <lm>    
#> 3     8 <lm>       <lm>    
compare <- models %>% do(aov = anova(.$mod_linear, .$mod_quad))
# The first row of each anova table has no p-value (NA), so take the second
compare %>% summarise(p.value = aov$`Pr(>F)`[2])

if (require("nycflights13")) {
  # You can use do() for any arbitrary computation, such as fitting a linear
  # model. Let's explore how arrival delays vary with departure time for
  # each carrier.
  carriers <- group_by(flights, carrier)
  group_size(carriers)

  mods <- do(carriers, mod = lm(arr_delay ~ dep_time, data = .))
  mods %>% do(as.data.frame(coef(.$mod)))
  mods %>% summarise(rsq = summary(mod)$r.squared)

  if (FALSE) {
    # This longer example shows the progress bar in action
    by_dest <- flights %>% group_by(dest) %>% filter(n() > 100)
    library(mgcv)
    by_dest %>% do(smooth = gam(arr_delay ~ s(dep_time) + month, data = .))
  }
}
#> Loading required package: nycflights13