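This is a complete redesign of how we evaluate expressions in dplyr. We no longer attempt to evaluate part of an expression. We now either handle the entire expression with hybrid evaluation, or give up and evaluate it with standard evaluation in the data mask.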
When used internally in the C++ code, a tibble becomes one of three classes: GroupedDataFrame, RowwiseDataFrame or NaturalDataFrame. Most internal code is templated by these classes, e.g. summarise is dispatched like this:
// [[Rcpp::export]]
SEXP summarise_impl(DataFrame df, QuosureList dots) {
  check_valid_colnames(df);
  if (is<RowwiseDataFrame>(df)) {
    return summarise_grouped<RowwiseDataFrame>(df, dots);
  } else if (is<GroupedDataFrame>(df)) {
    return summarise_grouped<GroupedDataFrame>(df, dots);
  } else {
    return summarise_grouped<NaturalDataFrame>(df, dots);
  }
}
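The same "check once at run time, then call a templated implementation" pattern can be illustrated without Rcpp. The following is a minimal sketch, not dplyr's code: ToyData, the three view structs, describe and describe_impl are all hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a data frame (not a dplyr type).
struct ToyData {
  std::size_t nrows;
  std::vector<std::vector<int> > groups;  // row indices per group; empty if ungrouped
  bool rowwise;
};

// Three hypothetical views sharing one interface: ngroups() and name().
struct NaturalView {
  const ToyData& d;
  std::size_t ngroups() const { return 1; }  // one group holding all rows
  static std::string name() { return "natural"; }
};
struct GroupedView {
  const ToyData& d;
  std::size_t ngroups() const { return d.groups.size(); }
  static std::string name() { return "grouped"; }
};
struct RowwiseView {
  const ToyData& d;
  std::size_t ngroups() const { return d.nrows; }  // every row is its own group
  static std::string name() { return "rowwise"; }
};

// The algorithm is written once, against the shared interface.
template <typename View>
std::string describe(const ToyData& d) {
  View v{d};
  return View::name() + ":" + std::to_string(v.ngroups());
}

// Run-time check once, then static dispatch to the templated implementation.
std::string describe_impl(const ToyData& d) {
  assert(!(d.rowwise && !d.groups.empty()));  // toy data is rowwise or grouped, not both
  if (d.rowwise) return describe<RowwiseView>(d);
  if (!d.groups.empty()) return describe<GroupedView>(d);
  return describe<NaturalView>(d);
}
```

Calling describe_impl on an ungrouped five-row ToyData takes the NaturalView branch, while the same data with rowwise set takes the RowwiseView branch. The body of describe never needs to know which one it got.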
The DataMask<SlicedTibble> template class is used by both hybrid and standard evaluation to extract the relevant information from the columns (original columns or columns that have just been made by mutate() or summarise()).
The functions n(), row_number() and group_indices(), when called without arguments, lack contextual information, i.e. the current group size and index, so they look for that information in a special context environment.
The DataMask class is responsible for updating the variables ..group_size and ..group_number:
// update the data context variables, these are used by n(), ...
get_context_env()["..group_size"] = indices.size();
get_context_env()["..group_number"] = indices.group() + 1;
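To make the role of these two variables concrete, here is a small sketch in plain C++. It is an illustration under assumptions, not dplyr's implementation: the context environment is modelled as a std::map, and context_env, update_context, toy_n and toy_group_number are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the context environment: a name -> value map.
std::map<std::string, int>& context_env() {
  static std::map<std::string, int> env;
  return env;
}

// Called once per group before evaluation, mirroring the update shown above.
// group_index is 0-based; the stored group number is 1-based, as in R.
void update_context(const std::vector<int>& indices, int group_index) {
  context_env()["..group_size"] = static_cast<int>(indices.size());
  context_env()["..group_number"] = group_index + 1;
}

// Toy counterparts of functions called without arguments: they receive no
// data and read everything they need from the context.
int toy_n() {
  assert(context_env().count("..group_size") == 1);  // only valid inside a group
  return context_env()["..group_size"];
}

int toy_group_number() {
  assert(context_env().count("..group_number") == 1);
  return context_env()["..group_number"];
}
```

After update_context has run for a group, toy_n() and toy_group_number() return that group's size and 1-based position without being passed any argument. This is the property the zero-argument functions rely on.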
All other functions can just be called with standard evaluation in the data mask.
When doing standard evaluation, we need to install a data mask that resolves the symbols from the data to the relevant subset. The simple solution would be to update the data mask at each iteration with subsets for all the variables, but that would be potentially expensive and wasteful, as we might not need all of the variables at a given time. For example, for an expression that only refers to Sepal.Length, we only need to materialize Sepal.Length; we don’t need the other variables.
DataMask installs an active binding for each variable in one of the environments (the top one) in the data mask ancestry. The active binding function is generated by the following function, so that it holds an index and a pointer to the data mask in its enclosure:
.make_active_binding_fun <- function(index, subsets){
  function() {
    materialize_binding(index, subsets)
  }
}
When hit, the active binding calls the materialize_binding function:
// [[Rcpp::export]]
SEXP materialize_binding(int idx, XPtr<DataMaskBase> mask) {
  return mask->materialize(idx);
}
The DataMask<>::materialize(idx) method returns the materialized subset, but also:
- installs the result in the bottom environment of the data mask, so that it masks the active binding. The point is to call the active binding only once.
- remembers that the binding at position idx has been materialized, so that before evaluating the same expression in the next group, it is proactively materialized, because it is very likely that we need the same variables for all groups.
When we move to the next expression to evaluate, DataMask forgets about the materialized bindings so that the active binding can be triggered again as needed.
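The life cycle described above (materialize on first use, cache so the lookup is not repeated, proactively rebuild for the next group, forget when moving to the next expression) can be sketched in plain C++. This is a toy model under assumptions, not dplyr's DataMask: columns are integer vectors, a std::map plays the role of the bottom environment, and ToyMask with its methods is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Toy data mask over integer columns. All names here are hypothetical.
class ToyMask {
public:
  explicit ToyMask(const std::vector<std::vector<int> >& columns)
    : materialize_calls(0), columns_(columns) {}

  // Move to a new expression: forget which columns were materialized.
  void rechain() {
    used_.clear();
    cache_.clear();
  }

  // Move to a new group: drop stale subsets, then proactively rebuild the
  // ones the current expression has already asked for.
  void update(const std::vector<int>& indices) {
    indices_ = indices;
    cache_.clear();
    for (std::set<std::size_t>::const_iterator it = used_.begin(); it != used_.end(); ++it) {
      materialize(*it);
    }
  }

  // Plays the role of the symbol lookup: a cached subset masks the
  // "active binding", which therefore fires at most once per group.
  const std::vector<int>& get(std::size_t idx) {
    std::map<std::size_t, std::vector<int> >::const_iterator it = cache_.find(idx);
    if (it != cache_.end()) return it->second;
    return materialize(idx);
  }

  int materialize_calls;  // how many subsets were actually built

private:
  const std::vector<int>& materialize(std::size_t idx) {
    assert(idx < columns_.size());
    std::vector<int> subset;
    for (std::size_t k = 0; k < indices_.size(); ++k) {
      subset.push_back(columns_[idx][indices_[k]]);
    }
    ++materialize_calls;
    used_.insert(idx);  // remember it for the next group
    cache_[idx] = subset;
    return cache_[idx];
  }

  std::vector<std::vector<int> > columns_;
  std::vector<int> indices_;
  std::map<std::size_t, std::vector<int> > cache_;
  std::set<std::size_t> used_;
};
```

With two columns and an expression that only calls get(0), materialize_calls grows by one per group and column 1 is never subset. After rechain(), a call to update() builds nothing until a column is requested again.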
Use case of the DataMask class:
- rechain(parent_env) prepares the data mask to evaluate an expression with a given parent environment. This “forgets” about the materialized bindings.
Hybrid evaluation also uses the DataMask<> class, but it only needs to quickly retrieve the data for an entire column. This is what the maybe_get_subset_binding method does.
// returns a pointer to the ColumnBinding if it exists
// this is mostly used by the hybrid evaluation
const ColumnBinding<SlicedTibble>* maybe_get_subset_binding(const SymbolString& symbol) const {
  int pos = symbol_map.find(symbol);
  if (pos >= 0) {
    return &column_bindings[pos];
  } else {
    return 0;
  }
}
When the symbol map contains the binding, we get a ColumnBinding<SlicedTibble>*. These objects hold these fields:
// is this a summary binding, i.e. does it come from summarise
bool summary;
// symbol of the binding
SEXP symbol;
// data. it is owned either by the original data frame or by the
// accumulator, so no need for additional protection here
SEXP data;
Hybrid evaluation only needs summary and data.
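The "pointer or null" lookup can be sketched without R types. The following is an illustration only: ToyBinding and ToyBindings are hypothetical, a std::map replaces the symbol map, and the data is a plain vector rather than a SEXP.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified binding: the fields mirror the ones listed above.
struct ToyBinding {
  bool summary;              // does the column come from summarise()?
  std::string symbol;        // name of the column
  std::vector<double> data;  // the whole column, not a per-group subset
};

class ToyBindings {
public:
  // Note: pointers returned by maybe_get() are only valid until the next add().
  void add(const ToyBinding& b) {
    positions_[b.symbol] = bindings_.size();
    bindings_.push_back(b);
  }

  // A null pointer means "not a column of the data"; the caller can then
  // decide that the fast path does not apply.
  const ToyBinding* maybe_get(const std::string& symbol) const {
    std::map<std::string, std::size_t>::const_iterator it = positions_.find(symbol);
    if (it == positions_.end()) return 0;
    assert(it->second < bindings_.size());
    return &bindings_[it->second];
  }

private:
  std::map<std::string, std::size_t> positions_;
  std::vector<ToyBinding> bindings_;
};
```

A caller tests the returned pointer before using it. When it is non-null, the summary flag and the data are available directly, with no per-group subsetting involved.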
When attempting to evaluate an expression with the hybrid evaluator, we first construct an Expression object. This class has methods to quickly check if the expression can be managed, e.g.
// sum( <column> ) and base::sum( <column> )
if (expression.is_fun(s_sum, s_base, ns_base)) {
  Column x;
  if (expression.is_unnamed(0) && expression.is_column(0, x)) {
    return sum_(data, x, /* na.rm = */ false, op);
  } else {
    return R_UnboundValue;
  }
}
This checks that the call matches sum(<column>) or base::sum(<column>), where <column> is a column from the data mask.
In that example, the Expression class checks that:
- the first argument is not named
- the first argument is a column from the data
Otherwise, it is an expression that we can’t handle, so we return R_UnboundValue, which is the hybrid evaluation way to signal that it gives up on handling the expression and that it should be evaluated with standard evaluation.
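The "try the fast path, give up explicitly, fall back" protocol can be sketched in plain C++. This is a toy under assumptions: a boolean return value plays the role of R_UnboundValue, calls have a single argument, and ToyCall, ToyColumn, try_hybrid and evaluate are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy call: a function name, an optional argument name
// ("" when the argument is unnamed) and the argument itself.
struct ToyCall {
  std::string fun;
  std::string arg_name;
  std::string arg;
};

struct ToyColumn {
  std::string name;
  std::vector<double> values;
};

// Fast path: only recognises sum(<column>) with an unnamed argument.
// Returning false means "give up", the role R_UnboundValue plays above.
bool try_hybrid(const ToyCall& call, const std::vector<ToyColumn>& data, double& out) {
  if (call.fun != "sum" || !call.arg_name.empty()) return false;
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (data[i].name == call.arg) {
      out = 0.0;
      for (std::size_t k = 0; k < data[i].values.size(); ++k) out += data[i].values[k];
      return true;
    }
  }
  return false;  // the argument is not a column of the data
}

// Caller: try the fast path first, otherwise fall back to the general one.
// The fallback is only reported here; a real one would evaluate the call.
std::string evaluate(const ToyCall& call, const std::vector<ToyColumn>& data, double& out) {
  assert(!call.fun.empty());  // a call always has a function name
  if (try_hybrid(call, data, out)) return "hybrid";
  return "standard";
}
```

The fast path never handles part of a call. It either produces the full result or reports that it gives up, which keeps the fallback decision in a single place.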
Expression has the following methods:
- inline bool is_fun(SEXP symbol, SEXP pkg, SEXP ns): are we calling fun? If so, does fun currently resolve to the function we intend (it might not if the function is masked)?
- bool is_valid() const: is the expression valid? The Expression constructor rules out a few expressions that have no chance of being handled, such as pkg::fun() when pkg is none of dplyr, stats or base.
- SEXP value(int i) const: the expression at position i.
- bool is_named(int i, SEXP symbol) const: is the i’th argument named symbol?
- bool is_scalar_logical(int i, bool& test) const: is the i’th argument a scalar logical? We need this for handling e.g. na.rm = TRUE.
- bool is_scalar_int(int i, int& out) const: is the i’th argument a scalar int? We need this for n = <int>.
- bool is_column(int i, Column& column) const: is the i’th argument a column?
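The masking check behind is_fun can be illustrated with a toy scope chain. This sketch is an assumption about the general idea (look the name up as scoping would, then compare identities) and not a description of how Expression implements it; ToyScope and resolves_to are hypothetical, and function identities are plain integers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical scope chain: index 0 is the innermost scope. Each scope maps
// a function name to an opaque identity (an int id in this toy).
typedef std::map<std::string, int> ToyScope;

// Look the name up from the innermost scope outwards and report whether the
// first hit is the function we intended to recognise.
bool resolves_to(const std::vector<ToyScope>& chain, const std::string& name, int intended_id) {
  assert(!name.empty());
  for (std::size_t i = 0; i < chain.size(); ++i) {
    ToyScope::const_iterator it = chain[i].find(name);
    if (it != chain[i].end()) {
      return it->second == intended_id;  // first hit wins, as with scoping
    }
  }
  return false;  // not found at all
}
```

If an inner scope defines its own sum, resolves_to reports false for the intended one, and a recogniser built on it would decline to handle the call.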
The hybrid_do function uses methods from Expression to quickly assess if it can handle the expression and then calls the relevant function from dplyr::hybrid:: to create the result at once:
if (expression.is_fun(s_sum, s_base, ns_base)) {
  // sum( <column> ) and base::sum( <column> )
  Column x;
  if (expression.is_unnamed(0) && expression.is_column(0, x)) {
    return sum_(data, x, /* na.rm = */ false, op);
  }
} else if (expression.is_fun(s_mean, s_base, ns_base)) {
  // mean( <column> ) and base::mean( <column> )
  Column x;
  if (expression.is_unnamed(0) && expression.is_column(0, x)) {
    return mean_(data, x, /* na.rm = */ false, op);
  }
} else if ...
The functions in the C++ dplyr::hybrid:: namespace create objects whose classes hold:
- the type of output they create
- the information they need (e.g. the column, the value of na.rm, …)
These classes all have these methods:
- summarise() to return a result of the same size as the number of groups. This is used when op is a Summary. This returns R_UnboundValue to give up when the class can’t do that, e.g. the classes behind lag.
- window() to return a result of the same size as the number of rows in the original data set.
The classes typically don’t provide these methods directly, but rather inherit, via CRTP, from one of two base classes. With HybridVectorScalarResult, the class only has to provide a process method, for example the Count class:
template <typename SlicedTibble>
class Count : public HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > {
public:
  typedef HybridVectorScalarResult<INTSXP, SlicedTibble, Count<SlicedTibble> > Parent;
  Count(const SlicedTibble& data) : Parent(data) {}
  int process(const typename SlicedTibble::slicing_index& indices) const {
    return indices.size();
  }
};
HybridVectorScalarResult uses the result of process in both summarise() and window().
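How a CRTP base can derive both summarise() and window() from a single process() method can be sketched in plain C++. This is a simplified illustration, not the actual HybridVectorScalarResult: ToyGroups, ToyScalarResult and ToyCount are hypothetical, and results are integer vectors rather than R vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical grouped data: each group is a list of 0-based row indices.
struct ToyGroups {
  std::size_t nrows;
  std::vector<std::vector<int> > groups;
};

// CRTP base: Derived supplies process(); the base turns it into both a
// per-group result (summarise) and a per-row result (window).
template <typename Derived>
class ToyScalarResult {
public:
  explicit ToyScalarResult(const ToyGroups& data) : data_(data) {}

  // One value per group.
  std::vector<int> summarise() const {
    std::vector<int> out;
    for (std::size_t g = 0; g < data_.groups.size(); ++g) {
      out.push_back(self().process(data_.groups[g]));
    }
    return out;
  }

  // One value per row: the group's scalar is recycled over its rows.
  std::vector<int> window() const {
    std::vector<int> out(data_.nrows, 0);
    for (std::size_t g = 0; g < data_.groups.size(); ++g) {
      const std::vector<int>& idx = data_.groups[g];
      int value = self().process(idx);
      for (std::size_t k = 0; k < idx.size(); ++k) {
        assert(idx[k] >= 0 && static_cast<std::size_t>(idx[k]) < data_.nrows);
        out[idx[k]] = value;
      }
    }
    return out;
  }

private:
  const Derived& self() const { return static_cast<const Derived&>(*this); }
  const ToyGroups& data_;
};

// Toy counterpart of Count: the scalar is the group size.
class ToyCount : public ToyScalarResult<ToyCount> {
public:
  explicit ToyCount(const ToyGroups& data) : ToyScalarResult<ToyCount>(data) {}
  int process(const std::vector<int>& indices) const {
    return static_cast<int>(indices.size());
  }
};
```

For five rows split into groups {0, 2} and {1, 3, 4}, ToyCount::summarise() yields one value per group (2 and 3), while window() spreads those values back over the rows. The call to process() is resolved at compile time, so no virtual dispatch is involved.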
HybridVectorVectorResult expects a fill method instead. For example, the implementation of ntile(n = <int>) uses this class, which derives from HybridVectorVectorResult:
template <typename SlicedTibble>
class Ntile1 : public HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1<SlicedTibble> > {
public:
  typedef HybridVectorVectorResult<INTSXP, SlicedTibble, Ntile1> Parent;
  Ntile1(const SlicedTibble& data, int ntiles_): Parent(data), ntiles(ntiles_) {}
  void fill(const typename SlicedTibble::slicing_index& indices, Rcpp::IntegerVector& out) const {
    int m = indices.size();
    for (int j = m - 1; j >= 0; j--) {
      out[ indices[j] ] = (int)floor((ntiles * j) / m) + 1;
    }
  }
private:
  int ntiles;
};
The result of fill is only used in window(). The summarise() method simply returns R_UnboundValue to give up.
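The vector-valued case can be sketched the same way as a worked example of the ntile formula. Again this is a toy under assumptions, not the actual HybridVectorVectorResult: ToyPartition, ToyVectorResult and ToyNtile are hypothetical, and a boolean return from summarise() stands in for R_UnboundValue.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical grouped data: each group is a list of 0-based row indices.
struct ToyPartition {
  std::size_t nrows;
  std::vector<std::vector<int> > groups;
};

// CRTP base for results that have one value per row and no per-group scalar.
template <typename Derived>
class ToyVectorResult {
public:
  explicit ToyVectorResult(const ToyPartition& data) : data_(data) {}

  // Each group fills its own slots of the shared output vector.
  std::vector<int> window() const {
    std::vector<int> out(data_.nrows, 0);
    for (std::size_t g = 0; g < data_.groups.size(); ++g) {
      static_cast<const Derived&>(*this).fill(data_.groups[g], out);
    }
    return out;
  }

  // No per-group scalar exists, so always report "give up" to the caller.
  bool summarise(std::vector<int>&) const { return false; }

private:
  const ToyPartition& data_;
};

// Toy counterpart of Ntile1, using the same formula as fill() above.
class ToyNtile : public ToyVectorResult<ToyNtile> {
public:
  ToyNtile(const ToyPartition& data, int ntiles)
    : ToyVectorResult<ToyNtile>(data), ntiles_(ntiles) {
    assert(ntiles > 0);
  }

  void fill(const std::vector<int>& indices, std::vector<int>& out) const {
    int m = static_cast<int>(indices.size());
    for (int j = 0; j < m; ++j) {
      out[indices[j]] = (ntiles_ * j) / m + 1;  // integer division floors here
    }
  }

private:
  int ntiles_;
};
```

For a group of five rows and two tiles, the formula gives 1, 1, 1, 2, 2. For a group of two rows it gives 1, 2.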