The work on treating factors more respectfully originates from issue #341, which had been waiting for attention for four years. #341 identified the need to take care of empty groups.
Empty groups can arise from two situations:
- when one of the grouping variables in group_by() is a factor and one of its levels has no data, e.g.
tibble(
x = 1:2,
f = factor(c("a", "b"), levels = c("a", "b", "c"))
) %>%
group_by(f)
#> # A tibble: 2 x 2
#> # Groups: f [3]
#>       x f
#>   <int> <fct>
#> 1     1 a
#> 2     2 b
The factor f has 3 levels, but only two are present in the data.
- when rows are filter()ed out, e.g.
tibble(
  x = 1:3,
  f = factor(c("a", "b", "c"))
) %>%
group_by(f) %>%
filter(x < 2)
#> # A tibble: 1 x 2
#> # Groups: f [3]
#>       x f
#>   <int> <fct>
#> 1     1 a
In that case, the grouped data before the filter has one row per level of f, and the filter only keeps the first one, creating 2 empty groups.
Older versions of dplyr did not make empty groups, because:
- group_by() built the grouping metadata only from the rows of the data, i.e. ignoring the conceptual grouping structure.
- filter() made a lazily grouped tibble, recording only the names of the grouping variables without producing the metadata, which was rebuilt by a subsequent group_by() whenever it was needed.
A new grouping algorithm, inspired by tidyr::complete(), is used in dplyr 0.8.0 to solve the first issue. The algorithm recursively goes through the grouping variables. When a grouping variable is a factor, the groups are made from its levels. For a variable of any other type (character, integer, …), the groups are made from its unique values.
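To make the recursion concrete, here is a rough R sketch of the idea. It is a hypothetical illustration only (group_keys_sketch() and its .size column are made-up names, and the real implementation is C++ code that also tracks row indices), but it shows how factors contribute their levels while other vectors contribute their unique values, and where the sentinel NAs come from:

library(tibble)
library(dplyr)

group_keys_sketch <- function(df, vars) {
  if (length(vars) == 0L) {
    # no grouping variable left: this slice is a single group
    return(tibble(.size = nrow(df)))
  }
  v <- df[[vars[[1L]]]]
  # a factor contributes all of its levels, anything else its unique values
  keys <- if (is.factor(v)) levels(v) else unique(v)
  pieces <- lapply(keys, function(k) {
    rest <- df[!is.na(v) & v == k, , drop = FALSE]
    if (nrow(rest) == 0L && length(vars) > 1L) {
      # empty group: the remaining key columns become sentinel NAs
      sub <- tibble(.size = 0L)
      for (nm in vars[-1L]) sub[[nm]] <- NA
    } else {
      sub <- group_keys_sketch(rest, vars[-1L])
    }
    sub[[vars[[1L]]]] <- k
    sub
  })
  bind_rows(pieces)[, c(vars, ".size")]
}

With the df defined below, group_keys_sketch(df, c("f", "x")) would give the three groups (a, 1), (b, 2) and (c, NA), the last one empty.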
Let’s have a look at some examples. We’ll use tally() to reveal the grouping structure and the count of each group:
df <- tibble(
x = c(1,2,1,2),
f = factor(c("a", "b", "a", "b"), levels = c("a", "b", "c"))
)
df %>%
group_by(f) %>%
tally()
#> # A tibble: 3 x 2
#>   f         n
#>   <fct> <int>
#> 1 a         2
#> 2 b         2
#> 3 c         0
In this first example, we group by a factor, so we get as many groups as the factor has levels, including an empty group for the unused level “c”.
df %>%
group_by(f, x) %>%
tally()
#> # A tibble: 3 x 3
#> # Groups: f [3]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 a         1     2
#> 2 b         2     2
#> 3 c        NA     0
Here we group by the factor f and the numeric vector x. Again we get 3 groups, because for each of the levels “a” and “b” of f there is only one value of x. The third group, associated with the level “c”, sets the value of x to NA out of thin air. We call this a sentinel NA, and we might later make it obvious that this is not the same as having a missing value in the data.
df %>%
group_by(x, f) %>%
tally()
#> # A tibble: 6 x 3
#> # Groups: x [2]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1     1 a         2
#> 2     1 b         0
#> 3     1 c         0
#> 4     2 a         0
#> 5     2 b         2
#> 6     2 c         0
In this case, we get more groups, and consequently more empty groups, because of the recursive slicing: first we find 2 unique values for the variable x (1 and 2), then within each of them we group by the factor f, which gives 3 groups (one per level), for a total of 6.
filter() has been reworked to respect the grouping structure, and gains the .preserve argument to control which groups to keep.
When .preserve
is set to TRUE
(the default) the groups of the filtered tibble are the same as the groups of the original tibble.
df %>%
group_by(x, f) %>%
filter(x == 1) %>%
tally()
#> # A tibble: 6 x 3
#> # Groups: x [2]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1     1 a         2
#> 2     1 b         0
#> 3     1 c         0
#> 4     2 a         0
#> 5     2 b         0
#> 6     2 c         0
df %>%
group_by(f, x) %>%
filter(x == 1) %>%
tally()
#> # A tibble: 3 x 3
#> # Groups: f [3]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 a         1     2
#> 2 b         2     0
#> 3 c        NA     0
When .preserve
is set to FALSE
the grouping structure is recalculated after the filtering.
df %>%
group_by(x, f) %>%
filter(x == 1, .preserve = FALSE) %>%
tally()
#> # A tibble: 3 x 3
#> # Groups: x [1]
#>       x f         n
#>   <dbl> <fct> <int>
#> 1     1 a         2
#> 2     1 b         0
#> 3     1 c         0
Here we only get 3 groups, from the 3 levels of f within the single remaining value of x.
df %>%
group_by(f, x) %>%
filter(x == 1, .preserve = FALSE) %>%
tally()
#> # A tibble: 3 x 3
#> # Groups: f [3]
#>   f         x     n
#>   <fct> <dbl> <int>
#> 1 a         1     2
#> 2 b        NA     0
#> 3 c        NA     0
In that case, we also get 3 groups, but the values of x are slightly different: the value of x associated with the level “b” is now a sentinel NA, because its group is empty after the filter.
Previous versions of dplyr used a messy collection of attributes on the “grouped_df” class, which did not make it easy to reason about. dplyr 0.8.0 structures all the grouping information in a tibble with n+1 columns (where n is the number of grouping variables), stored in the “groups” attribute.
df %>%
group_by(f, x) %>%
attr("groups")
#> # A tibble: 3 x 3
#>   f         x .rows
#>   <fct> <dbl> <list>
#> 1 a         1 <int [2]>
#> 2 b         2 <int [2]>
#> 3 c        NA <int [0]>
The first n columns identify each group, one row per group. This is equivalent to the “labels” attribute used in previous versions of dplyr.
The last column, always called .rows, is a list column of integer vectors (possibly of length 0 for empty groups) identifying the indices of the rows of the data that belong to each group. This is equivalent to the “indices” attribute used in previous versions.
This grouping structure tibble (maybe a gribble) can be retrieved by accessing the groups attribute, or preferably by using the group_data() generic, which has methods for ungrouped and rowwise data too.
group_data(df)
#> # A tibble: 1 x 1
#> .rows
#> <list>
#> 1 <int [4]>
group_data(group_by(df, f))
#> # A tibble: 3 x 2
#>   f     .rows
#>   <fct> <list>
#> 1 a     <int [2]>
#> 2 b     <int [2]>
#> 3 c     <int [0]>
group_data(rowwise(df))
#> # A tibble: 4 x 1
#> .rows
#> <list>
#> 1 <int [1]>
#> 2 <int [1]>
#> 3 <int [1]>
#> 4 <int [1]>
Similarly, the indices themselves can be retrieved using group_rows()
:
group_rows(df)
#> [[1]]
#> [1] 1 2 3 4
group_rows(group_by(df, f))
#> [[1]]
#> [1] 1 3
#>
#> [[2]]
#> [1] 2 4
#>
#> [[3]]
#> integer(0)
group_rows(rowwise(df))
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 2
#>
#> [[3]]
#> [1] 3
#>
#> [[4]]
#> [1] 4
Having a consistent representation of the grouping structure makes it easier to reason about, and might open opportunities to discuss alternative ways of grouping.
The initial goal for hybrid evaluation was to bypass potentially expensive R evaluation, and replace it with more efficient C++ code. Unfortunately, there are situations where hybrid evaluation creates problems.
There are two forms of hybrid evaluation in dplyr at the moment: full hybrid evaluation and hybrid folding.
When the entire summarise() or mutate() expression can be handled, e.g. in group_by(...) %>% summarise(m = mean(x)), the mean hybrid handler takes care of everything, i.e. it calculates the mean of x for each group and structures the results into a numeric vector.
This does not need to allocate memory for each subset of x or for each result of mean(x). In addition, because it is dispatched internally, it does not pay the expensive price of S3 dispatch of the mean generic function from base.
This is where hybrid evaluation really makes a difference. Currently this is driven by a set of C++ classes inheriting from the virtual class Result
, which is used for summary functions (such as mean
) and window functions (such as lead
).
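To make this concrete, here is a rough R sketch (ours, not dplyr’s C++ code) of what full hybrid evaluation computes for m = mean(x), reusing the df and group_rows() from above: the result is allocated once with the right type and size, the method is resolved once, and each slot is filled from the group’s row indices.

g   <- group_by(df, f)
idx <- group_rows(g)          # one integer vector of row indices per group
m   <- numeric(length(idx))   # result allocated once, right type and size
fn  <- base::mean.default     # method resolved once, no repeated S3 dispatch
for (i in seq_along(idx)) {
  m[i] <- fn(g$x[idx[[i]]])   # fill one slot per group
}
m
#> [1]   1   2 NaN

The NaN comes from the empty group for the level “c”, whose subset of x has length 0.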
The proposal here is to rebase hybrid handlers on two virtual classes (maybe templates) instead of one:
- template <int RTYPE> Window<RTYPE> would give a vector of type RTYPE of the right size.
- template <int RTYPE> Summary<RTYPE> would summarise into a value of the right type.
mutate() and summarise() would recognise expressions that are hybridable, and use that information to allocate the result, then iterate through the groups to fill it.
This needs careful refactoring. We believe this will make the code much simpler, with the consequence that it will be easier to write new hybrid handlers; e.g. we can imagine an expression like x == 2 being handled hybridly in filter() by a class deriving from Window<LGLSXP>.
This is where hybrid evaluation creates problems, because it is sometimes too eager, and generally cannot faithfully mimic standard R evaluation. The original idea was to handle parts of the expression with hybrid handlers, e.g. in group_by(...) %>% summarise(m = 1 + mean(x)) we would handle mean(x) with the hybrid handler for mean, fold the result into the expression, and then fall back to R evaluation once nothing more can be hybrid evaluated.
Folding cannot be done once for all groups: it is performed, including walking the expression, for each group, which has a price; we have to end with an R evaluation anyway; and even then we still have no idea what the result will be, so we have to collect and coerce it with care.
This has been the source of most of the “surprises”, and it also comes at a huge cost in terms of code complexity, and therefore maintainability.
The proposal here is to abandon hybrid folding entirely and replace it with an approach based on regular R evaluation. Expressions would be evaluated in an environment in which the names of the columns are mapped to their subsets for the current group, and where functions such as n() and row_number() produce the desired result.
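As a minimal base-R sketch of that idea (hypothetical code, not the actual implementation, which has to be much more careful about environments and performance), evaluating an expression for one group could look like this:

dat  <- tibble(x = c(1, 5, 2, 8), g = c(1, 1, 2, 2))
rows <- which(dat$g == 1)                       # rows of the current group
mask <- new.env(parent = globalenv())
mask$x <- dat$x[rows]                           # column name mapped to its subset
mask$n <- function() length(rows)               # n() knows the group size
mask$row_number <- function() seq_along(rows)   # row_number() too
eval(quote(1 + mean(x) / n()), mask)
#> [1] 2.5

Because the expression is evaluated by R itself, anything R can evaluate works unchanged, which is exactly what hybrid folding failed to guarantee.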
Letting go of hybrid folding and making it easier to implement full hybrid handlers will make hybrid evaluation simpler, more robust and less surprising.