overview

This is a complete redesign of how we evaluate expressions in dplyr. We no longer attempt to evaluate part of an expression. Instead, we now either:

  • recognize the entire expression, e.g. n() or mean(x), and use C++ code to evaluate it (this is what we currently call hybrid evaluation, although another term might be better), or
  • otherwise, use standard evaluation in a suitable environment.

data mask

When used internally in the C++ code, a tibble becomes one of three classes: GroupedDataFrame, RowwiseDataFrame or NaturalDataFrame. Most internal code is templated on these classes, the implementation of summarise being one example.
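To illustrate what templating on the tibble class buys, here is a minimal sketch. It is not dplyr's code: the three structs are hypothetical, stripped-down stand-ins that only expose a `slices()` method (an assumed name), and `summarise_sum` is an invented helper that sums a column within each group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-ins for the three classes named above.
// Each one only answers: which row indices belong to each group?
using Slice = std::vector<int>;

struct NaturalDataFrame {
  int nrows;
  // An ungrouped tibble behaves like a single group holding every row.
  std::vector<Slice> slices() const {
    Slice all;
    for (int i = 0; i < nrows; i++) all.push_back(i);
    std::vector<Slice> out;
    out.push_back(all);
    return out;
  }
};

struct RowwiseDataFrame {
  int nrows;
  // A rowwise tibble has one group per row.
  std::vector<Slice> slices() const {
    std::vector<Slice> out;
    for (int i = 0; i < nrows; i++) out.push_back(Slice(1, i));
    return out;
  }
};

struct GroupedDataFrame {
  std::vector<Slice> groups;
  std::vector<Slice> slices() const { return groups; }
};

// One template serves all three kinds of tibble: sum a column per group.
template <typename SlicedTibble>
std::vector<double> summarise_sum(const SlicedTibble& data,
                                  const std::vector<double>& column) {
  std::vector<double> out;
  for (const Slice& slice : data.slices()) {
    double total = 0.0;
    for (int i : slice) {
      assert(i >= 0 && static_cast<std::size_t>(i) < column.size());
      total += column[i];
    }
    out.push_back(total);
  }
  return out;
}
```

The loop body is written once, and the compiler generates a specialised version for each tibble class, so no run-time dispatch is needed inside the per-group loop.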

The DataMask<SlicedTibble> template class is used by both hybrid and standard evaluation to extract the relevant information from the columns (original columns or columns that have just been made by mutate() or summarise()).

standard evaluation

meta information about the groups

The functions n(), row_number() and group_indices(), when called without arguments, lack contextual information, i.e. the current group size and index, so they look for that information in a special environment.

The DataMask class is responsible for updating the variables ..group_size and ..group_number.

All other functions can just be called with standard evaluation in the data mask.
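The idea of context variables can be sketched as follows. This is only an illustration: `Context` is a hypothetical stand-in for the special environment (a plain map), `n` and `group_number` are toy counterparts of the argument-less functions, and the 1-based group number is an assumption made to mirror R's indexing.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the special environment: a name -> value map.
struct Context {
  std::map<std::string, int> vars;
};

// Toy counterparts of functions called without arguments: they have no
// data of their own, so they read what the mask stored in the context.
int n(const Context& ctx) { return ctx.vars.at("..group_size"); }
int group_number(const Context& ctx) { return ctx.vars.at("..group_number"); }

// Plays the role of the data mask: before evaluating on each group, it
// refreshes ..group_size and ..group_number, then "evaluates" n().
std::vector<int> group_sizes_via_context(
    const std::vector<std::vector<int> >& groups) {
  Context ctx;
  std::vector<int> out;
  for (std::size_t g = 0; g < groups.size(); g++) {
    ctx.vars["..group_size"] = static_cast<int>(groups[g].size());
    ctx.vars["..group_number"] = static_cast<int>(g) + 1;  // assumed 1-based
    assert(group_number(ctx) == static_cast<int>(g) + 1);
    out.push_back(n(ctx));
  }
  return out;
}
```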

active and resolved bindings

When doing standard evaluation, we need to install a data mask that resolves the symbols from the data to the relevant subset. The simple solution would be to update the data mask at each iteration with subsets for all the variables, but that would be potentially expensive and wasteful, as we might not need all of the variables at a given time, e.g. in this case:

iris %>% group_by(Species) %>% summarise(Sepal.Length = +mean(Sepal.Length))

We only need to materialize Sepal.Length; the other variables are not needed.

DataMask installs an active binding for each variable in the top environment of the data mask ancestry. The active binding function is generated so that it holds an index and a pointer to the data mask in its enclosure.

When hit, the active binding calls the materialize_binding function.

The DataMask<>::materialize(idx) method returns the materialized subset, but also:

  • installs the result in the bottom environment of the data mask, so that it masks the active binding. The point is to call the active binding only once.
  • remembers that the binding at position idx has been materialized, so that before evaluating the same expression in the next group, it is proactively materialized, because it is very likely that we need the same variables for all groups.

When we move to the next expression to evaluate, DataMask forgets about the materialized bindings so that the active binding can be triggered again as needed.

use case of the DataMask class

  • before evaluating expressions, construct a DataMask from a tibble
  • before evaluating a new expression, we need to rechain(parent_env) to prepare the data mask to evaluate an expression with a given parent environment. This “forgets” about the materialized bindings.
  • before evaluating the expression on a new group, the indices are updated, which includes rematerializing the already materialized bindings
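The life cycle above, together with the lazy materialization of bindings, can be sketched with a toy class. Everything here is hypothetical: `ToyDataMask` uses maps in place of R environments, `get()` stands in for symbol lookup (a miss plays the role of the active binding firing), and `rechain()` takes no parent environment. It is meant to show the bookkeeping, not the real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<double>;
using Slice = std::vector<int>;

// Toy data mask: maps replace R environments (an assumption of this sketch).
class ToyDataMask {
public:
  explicit ToyDataMask(std::map<std::string, Column> columns)
      : columns_(std::move(columns)), hits_(0) {}

  // New expression: forget which bindings were materialized.
  void rechain() {
    resolved_.clear();
    tracked_.clear();
  }

  // New group: update indices and proactively rematerialize the bindings
  // that this expression has already used.
  void update(const Slice& indices) {
    indices_ = indices;
    resolved_.clear();
    for (const std::string& name : tracked_) resolved_[name] = subset(name);
  }

  // Symbol lookup: use the resolved binding if present; otherwise the
  // "active binding" fires once, and its result masks it from then on.
  const Column& get(const std::string& name) {
    std::map<std::string, Column>::iterator it = resolved_.find(name);
    if (it != resolved_.end()) return it->second;
    hits_++;
    tracked_.push_back(name);
    return resolved_[name] = subset(name);
  }

  int active_binding_hits() const { return hits_; }
  std::size_t n_resolved() const { return resolved_.size(); }

private:
  Column subset(const std::string& name) const {
    const Column& full = columns_.at(name);
    Column out;
    for (int i : indices_) {
      assert(i >= 0);
      out.push_back(full.at(static_cast<std::size_t>(i)));
    }
    return out;
  }

  std::map<std::string, Column> columns_;
  std::map<std::string, Column> resolved_;
  std::vector<std::string> tracked_;
  Slice indices_;
  int hits_;
};

// Toy version of evaluating mean(<name>) group by group.
std::vector<double> grouped_mean(ToyDataMask& mask, const std::string& name,
                                 const std::vector<Slice>& groups) {
  mask.rechain();
  std::vector<double> out;
  for (const Slice& g : groups) {
    mask.update(g);
    const Column& x = mask.get(name);
    double total = 0.0;
    for (double v : x) total += v;
    out.push_back(x.empty() ? 0.0 : total / static_cast<double>(x.size()));
  }
  return out;
}
```

In this sketch the lookup miss happens only for the first group. For later groups `update()` has already refreshed the binding, and columns that the expression never mentions are never subset at all.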

hybrid evaluation

Expression

When attempting to evaluate an expression with the hybrid evaluator, we first construct an Expression object. This class has methods to quickly check whether the expression can be handled.

For example, a check can verify that the call matches sum(<column>) or base::sum(<column>), where <column> is a column from the data mask.

In that example, the Expression class checks that:

  • the first argument is not named
  • the first argument is a column from the data

Otherwise, the expression is one that we can’t handle, so we return R_UnboundValue. This is how hybrid evaluation signals that it gives up on the expression and that it should be evaluated with standard evaluation instead.
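A rough sketch of this match-or-give-up logic, under stated assumptions: `Call` and `Arg` are hypothetical plain structs standing in for an R call, `make_call` is a helper invented for the example, and a `false` return value plays the role that R_UnboundValue plays in the real code.

```cpp
#include <cassert>
#include <map>
#include <numeric>
#include <string>
#include <vector>

using Column = std::vector<double>;

// Hypothetical representation of a call with symbol arguments only.
struct Arg {
  std::string name;    // empty when the argument is not named
  std::string symbol;  // the symbol passed as the argument
};

struct Call {
  std::string pkg;  // empty for an unqualified call such as sum(x)
  std::string fun;
  std::vector<Arg> args;
};

// Helper invented for this example: builds a one-argument call.
Call make_call(const std::string& pkg, const std::string& fun,
               const std::string& arg_name, const std::string& arg_symbol) {
  Call call;
  call.pkg = pkg;
  call.fun = fun;
  Arg arg;
  arg.name = arg_name;
  arg.symbol = arg_symbol;
  call.args.push_back(arg);
  return call;
}

// Returns true and fills `out` when the call matches sum(<column>) or
// base::sum(<column>). Returns false to give up, which is this sketch's
// stand-in for returning R_UnboundValue.
bool hybrid_sum(const Call& call, const std::map<std::string, Column>& data,
                double& out) {
  bool pkg_ok = call.pkg.empty() || call.pkg == "base";
  if (!pkg_ok || call.fun != "sum" || call.args.size() != 1) return false;

  const Arg& first = call.args[0];
  if (!first.name.empty()) return false;  // first argument must not be named

  std::map<std::string, Column>::const_iterator it = data.find(first.symbol);
  if (it == data.end()) return false;  // first argument must be a column

  out = std::accumulate(it->second.begin(), it->second.end(), 0.0);
  return true;
}
```

A caller would try `hybrid_sum` first and fall back to the general evaluator only when it returns `false`, which mirrors the two-path design described in the overview.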

Expression has the following methods:

  • inline bool is_fun(SEXP symbol, SEXP pkg, SEXP ns) : are we calling fun? If so, does fun currently resolve to the function we intend? It might not if the function is masked, which allows a user-defined function of the same name to take precedence.
  • bool is_valid() const : is the expression valid? The Expression constructor rules out a few expressions that have no chance of being handled, such as pkg::fun() when pkg is none of dplyr, stats or base.
  • SEXP value(int i) const : the expression at position i.
  • bool is_named(int i, SEXP symbol) const : is the i’th argument named symbol?
  • bool is_scalar_logical(int i, bool& test) const : is the i’th argument a scalar logical? We need this for handling e.g. na.rm = TRUE.
  • bool is_scalar_int(int i, int& out) const : is the i’th argument a scalar int? We need this for n = <int>.
  • bool is_column(int i, Column& column) const : is the i’th argument a column?

hybrid_do

The hybrid_do function uses methods from Expression to quickly assess whether it can handle the expression, and then calls the relevant function from dplyr::hybrid:: to create the result at once.

The functions in the C++ dplyr::hybrid:: namespace create objects whose classes hold:

  • the type of output they create
  • the information they need (e.g. the column, the value of na.rm, …)

These classes all have these methods:

  • summarise() to return a result of the same size as the number of groups. This is used when op is a Summary. It returns R_UnboundValue to give up when the class can’t do that, e.g. the classes behind lag.
  • window() to return a result of the same size as the number of rows in the original data set.

The classes typically don’t provide these methods directly, but rather inherit, via CRTP, from one of:

  • HybridVectorScalarResult, so that the class only has to provide a process method, as the Count class does. HybridVectorScalarResult uses the result of process in both summarise() and window().
  • HybridVectorVectorResult, which expects a fill method. For example, the implementation of ntile(n=<int>) uses a class that derives from HybridVectorVectorResult. The result of fill is only used in window(). The summarise() method simply returns R_UnboundValue to give up.
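The CRTP arrangement can be sketched as below. The names `ScalarResult`, `VectorResult` and `RowNumber` are hypothetical simplifications, results are plain vectors of doubles, and a `false` return stands in for R_UnboundValue. That `window()` repeats each group's scalar across the rows of that group is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Slice = std::vector<int>;

// Toy analogue of HybridVectorScalarResult: Impl only provides process().
template <typename Impl>
struct ScalarResult {
  const Impl& self() const { return static_cast<const Impl&>(*this); }

  // One value per group.
  bool summarise(const std::vector<Slice>& groups,
                 std::vector<double>& out) const {
    out.clear();
    for (const Slice& g : groups) out.push_back(self().process(g));
    return true;
  }

  // One value per row: the group's value repeated on each of its rows
  // (an assumption of this sketch).
  bool window(const std::vector<Slice>& groups, std::size_t nrows,
              std::vector<double>& out) const {
    out.assign(nrows, 0.0);
    for (const Slice& g : groups) {
      double value = self().process(g);
      for (int i : g) {
        assert(i >= 0);
        out.at(static_cast<std::size_t>(i)) = value;
      }
    }
    return true;
  }
};

// Toy analogue of HybridVectorVectorResult: Impl only provides fill().
template <typename Impl>
struct VectorResult {
  const Impl& self() const { return static_cast<const Impl&>(*this); }

  // Gives up: false stands in for returning R_UnboundValue.
  bool summarise(const std::vector<Slice>&, std::vector<double>&) const {
    return false;
  }

  bool window(const std::vector<Slice>& groups, std::size_t nrows,
              std::vector<double>& out) const {
    out.assign(nrows, 0.0);
    for (const Slice& g : groups) self().fill(g, out);
    return true;
  }
};

// Group size, in the spirit of the Count class mentioned above.
struct Count : ScalarResult<Count> {
  double process(const Slice& g) const { return static_cast<double>(g.size()); }
};

// Hypothetical vector-valued example: position of each row in its group.
struct RowNumber : VectorResult<RowNumber> {
  void fill(const Slice& g, std::vector<double>& out) const {
    for (std::size_t k = 0; k < g.size(); k++) {
      out.at(static_cast<std::size_t>(g[k])) = static_cast<double>(k + 1);
    }
  }
};
```

Because the base class calls `process` or `fill` through a `static_cast` to the derived type, the call is resolved at compile time, so there is no virtual call inside the per-group loop.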