Thank you for your interest in contributing to broom! This document is a work in progress describing the conventions that you should follow when adding tidiers to broom.
General guidelines:
- Check test coverage with covr::report().
- tidy, glance and augment methods must return tibbles.
- Update NEWS.md to reflect the changes you’ve made.
- Use the styler package to reformat your code according to the tidyverse style conventions, and the lintr package to check that your code meets them.
- Prefer dplyr and tidyr over older packages such as plyr and reshape2.
- It is better for users to need to tidyr::spread than tidyr::gather data after it’s been tidied.
- DESCRIPTION
If you are just getting into open source development, broom is an excellent place to get started and we are more than happy to help. We recommend you start contributing by improving the documentation, writing issues with reproducible errors, or taking on issues tagged beginner-friendly.
Ideally, tidying methods should live in the packages of their associated modelling functions. That is, if you have some object my_object produced by my_package, the functions tidy.my_object, glance.my_object and augment.my_object should live in my_package, provided there are sensible ways to define these tidiers for my_object.
We are currently working on an appropriate way to split tidiers into several domain-specific tidying packages. For now, if you don’t own my_package, you should add the tidiers to broom. There are some exceptions:
- broom.mixed
- tidytext
- broomstick
- biobroom
We will keep you updated as we work towards a final solution.
NOTE: It is okay to write tidyverse code to tidy a model and wrap it in a function. This is encouraged, in fact.
We encourage you to develop new tidiers using your favorite tidyverse tools. Pipes are welcome, as is any code that you might write for tidyverse-style interactive data manipulation.
If you are implementing a new tidier, we recommend taking a look at the internals of the tidying methods for betareg and rq objects and using those as a starting point.
You should also be aware of the following helper functions:
- finish_glance()
- augment_columns()
- fix_data_frame()
- validate_augment_input()
All new tidiers should be fully documented following the tidyverse code documentation guidelines. Documentation should use full sentences with appropriate punctuation. Documentation should also contain at least one, and potentially several, examples of how the tidiers can be used.
Documentation should be written in R Markdown as much as possible.
There’ll be a major overhaul of documentation later this summer, at which point this portion of the vignette will also get some major updates.
Your tests should include:
The tests for augment are rapidly evolving at the moment, and we’ll follow up with more details on them soon.
If any of your tests use random number generation, you should call set.seed() in the body of the test.
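As a minimal sketch of that advice, the test below seeds the generator inside the test body rather than at the top of the file. The simulated data, the seed value and the expectation are all illustrative assumptions, with lm standing in for a model you would actually tidy.

```r
library(testthat)

test_that("a test that relies on simulated data is reproducible", {
  # Seed inside the test body so the test does not depend on the order
  # in which other tests run.
  set.seed(27)
  x <- rnorm(20)
  y <- 2 * x + rnorm(20)

  # Stand-in for a model whose tidiers are under test.
  fit <- lm(y ~ x)

  expect_equal(length(coef(fit)), 2)
})
```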
In general, we prefer informative errors to magical behaviors or untested success.
- Run the tests with devtools::test().
- Add my_package to the Suggests section of broom’s DESCRIPTION.
- Call skip_if_not_installed("my_package") at the beginning of any test that uses my_package.
- Install the development version of broom together with its dependencies using devtools::install_github("tidyverse/broom", dependencies = TRUE).
You should test new tidiers on a representative set of my_object objects. At a minimum, you should have a test for each distinct type of fit that appears in the examples for a particular model (for example, if we are working with stats::arima models, the tidiers should work for seasonal and non-seasonal models).
It’s important to test your tidiers for fits estimated with different algorithms (e.g. a stats::arima tidier should be tested for method = "CSS-ML", method = "ML" and method = "CSS"). As another example, good tests for glm tidying methods would test tidiers on glm objects fit for all acceptable values of family.
In short: be sure that you’ve tested your tidiers on models fit with all the major modelling options (both statistical options, and estimation options).
- devtools::check()
- devtools::spell_check()
- goodpractice::gp()
broom doesn’t currently pass all of these. If you are adding new tidiers at the moment, it’s enough for these to throw no warnings for the files you’ve changed.
The big picture:
- glance should provide a summary of model-level information as a tibble with exactly one row. This includes goodness of fit measures such as deviance, AIC, BIC, etc.
- augment should provide a summary of observation-level information as a tibble with one row per observation. This summary should preserve the observations. Additional information might include leverage, cluster assignments or fitted values.
- tidy should provide a summary of component-level information as a tibble with one row for each model component. Examples of model components include: regression coefficients, cluster centers, etc.
Oftentimes it doesn’t make sense to define one or more of these methods for a particular model. In this case, just implement the methods that do make sense.
glance
The glance(x, ...) method accepts a model object x and returns a tibble with exactly one row containing model-level summary information.
Output should not include the name of the modelling function or any arguments given to the modelling function. For example, glance(glm_object) does not contain a family column.
In some cases, you may wish to provide model-level diagnostics not returned by the original object. If these are easy to compute, feel free to add them. However, broom is not an appropriate place to implement complex or time-consuming calculations.
glance should always return the same columns in the same order for an object x of class my_object. If a summary metric such as AIC is not defined in certain circumstances, use NA.
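The sketch below illustrates these rules under stated assumptions: the my_object class, its fields (sigma, aic, nobs) and the toy values are all hypothetical, and the method is called directly so that the example does not depend on broom being loaded. It is one possible shape for a glance method, not broom's implementation.

```r
library(tibble)

# Hypothetical fitted objects; the field names and values are invented for
# illustration. In the second object the AIC is assumed to be undefined.
fit_ok <- structure(list(sigma = 1.2, aic = 34.5, nobs = 10L), class = "my_object")
fit_no_aic <- structure(list(sigma = 0.9, aic = NULL, nobs = 10L), class = "my_object")

# Always one row, always the same columns in the same order,
# with NA where a metric is not defined.
glance.my_object <- function(x, ...) {
  tibble(
    sigma = x$sigma,
    AIC = if (is.null(x$aic)) NA_real_ else x$aic,
    nobs = x$nobs
  )
}

ok <- glance.my_object(fit_ok)
no_aic <- glance.my_object(fit_no_aic)
```

Both results have the same three columns; only the AIC value differs, and it is NA for the second object.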
augment
The augment(x, data = NULL, ...) method accepts a model object and optionally a data frame data and adds columns of observation-level information to data. augment returns a tibble with the same number of rows as data.
The data argument can be any of the following:
- a data.frame containing both the original predictors and the original responses
- a tibble containing both the original predictors and the original responses
- NULL (the default): if no data argument is specified, augment should try to reconstruct the original data as much as possible from the model object. This may not always be possible, and often it will not be possible to recover columns not used by the model.
Any other inputs should result in an error. This will eventually be checked by the validate_augment_input() function.
Many augment methods will also provide an optional newdata argument that should also default to NULL. Users should only ever specify one of data or newdata. Providing both data and newdata should result in an error. newdata should accept both data.frames and tibbles and should be tested with both.
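One way to enforce these rules is an up-front argument check. The helper below is a hypothetical sketch: check_augment_args is not a broom function, and its name and error messages are assumptions made for illustration.

```r
# Hypothetical helper, not part of broom: validates the data and newdata
# arguments of an augment method before any work is done.
check_augment_args <- function(data = NULL, newdata = NULL) {
  if (!is.null(data) && !is.null(newdata)) {
    stop("Specify only one of `data` or `newdata`, not both.", call. = FALSE)
  }
  for (arg in list(data, newdata)) {
    # is.data.frame() is TRUE for both data.frames and tibbles.
    if (!is.null(arg) && !is.data.frame(arg)) {
      stop("`data` and `newdata` must be data frames or tibbles.", call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Passing one argument is fine; passing both is an error.
check_augment_args(data = cars)
both <- tryCatch(
  check_augment_args(data = cars, newdata = cars),
  error = function(e) conditionMessage(e)
)
```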
Data given to the data argument must have both the original predictors and the original response. Data given to the newdata argument only needs to have the original predictors. This is important because there may be important information associated with training data that is not associated with test data, for example, leverages (.hat below) in the case of linear regression:
model <- lm(speed ~ dist, data = cars)
augment(model, data = cars)
#> # A tibble: 50 x 9
#> speed dist .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 2 8.62 0.844 -4.62 0.0716 3.11 0.0888 -1.52
#> 2 4 10 9.94 0.729 -5.94 0.0534 3.06 0.106 -1.93
#> 3 7 4 8.95 0.815 -1.95 0.0667 3.18 0.0146 -0.638
#> 4 7 22 11.9 0.578 -4.93 0.0335 3.10 0.0437 -1.59
#> 5 8 16 10.9 0.650 -2.93 0.0424 3.16 0.0200 -0.950
#> 6 9 10 9.94 0.729 -0.940 0.0534 3.19 0.00264 -0.306
#> 7 10 18 11.3 0.625 -1.26 0.0392 3.18 0.00340 -0.409
#> 8 10 26 12.6 0.536 -2.59 0.0289 3.17 0.0103 -0.832
#> 9 10 34 13.9 0.473 -3.91 0.0225 3.14 0.0181 -1.25
#> 10 11 17 11.1 0.637 -0.0986 0.0407 3.19 0.0000216 -0.0319
#> # … with 40 more rows
augment(model, newdata = cars)
#> # A tibble: 50 x 4
#> speed dist .fitted .se.fit
#> <dbl> <dbl> <dbl> <dbl>
#> 1 4 2 8.62 0.844
#> 2 4 10 9.94 0.729
#> 3 7 4 8.95 0.815
#> 4 7 22 11.9 0.578
#> 5 8 16 10.9 0.650
#> 6 9 10 9.94 0.729
#> 7 10 18 11.3 0.625
#> 8 10 26 12.6 0.536
#> 9 10 34 13.9 0.473
#> 10 11 17 11.1 0.637
#> # … with 40 more rows
This means that augment(model, data = original_data) should provide .fitted and .resid columns in most cases, whereas augment(model, newdata = test_data) only needs to provide a .fitted column, even if the response is present in test_data.
If the data or newdata is specified as a data.frame with rownames, augment should return them in a column called .rownames.
For observations where no fitted values or summaries are available (where there’s missing data, for example), return NA.
Added column names should begin with . to avoid overwriting columns in the original data.
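The three rules above can be seen together in a simplified sketch. The function augment_sketch is hypothetical and only handles lm fits with an explicit data argument; it is not how broom implements augment.

```r
library(tibble)

# Simplified, hypothetical augment for lm fits.
augment_sketch <- function(x, data, ...) {
  # Keep meaningful row names in a .rownames column.
  out <- if (has_rownames(data)) {
    as_tibble(data, rownames = ".rownames")
  } else {
    as_tibble(data)
  }
  # predict.lm() passes NAs through by default, so rows with missing
  # predictors get NA instead of being dropped.
  out$.fitted <- unname(predict(x, newdata = data))
  out
}

d <- mtcars
d$wt[2] <- NA                   # introduce a missing predictor value
fit <- lm(mpg ~ wt, data = d)   # lm() drops the incomplete row when fitting
aug <- augment_sketch(fit, d)
```

The result has one row per row of d, a .rownames column holding the car names, and NA in .fitted for the row with the missing predictor.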
tidy
The tidy(x, ...) method accepts a model object x and returns a tibble with one row per model component. A model component might be a single term in a regression, a single test, or one cluster/class. Exactly what a component is varies across models but is usually self-evident.
Sometimes a model will have different types of components. For example, in mixed models, there is different information associated with fixed effects and random effects. Since this information doesn’t have the same interpretation, it doesn’t make sense to summarize the fixed and random effects in the same table. In cases like this you should add an argument that allows the user to specify which type of information they want. For example, you might implement an interface along the lines of:
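As a hedged sketch of such an interface, the example below uses a hypothetical my_mixed class and a hypothetical effects argument. Neither name is prescribed by this document, and the stored estimates are toy values.

```r
library(tibble)

# Hypothetical fitted object that stores fixed and random effects separately.
fit <- structure(
  list(
    fixed = c(`(Intercept)` = 1.5, x = 0.3),
    random = c(group_a = -0.2, group_b = 0.2)
  ),
  class = "my_mixed"
)

# The effects argument selects which kind of component to summarize;
# match.arg() rejects anything other than "fixed" or "random".
tidy.my_mixed <- function(x, effects = c("fixed", "random"), ...) {
  effects <- match.arg(effects)
  est <- x[[effects]]
  tibble(term = names(est), estimate = unname(est))
}

fixed <- tidy.my_mixed(fit, effects = "fixed")
random <- tidy.my_mixed(fit, effects = "random")
```

Each call returns one table for one kind of component, so fixed and random effects are never mixed in the same output.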
Common arguments to tidy methods:
- conf.int: logical indicating whether or not to calculate confidence/credible intervals. Should default to FALSE.
- conf.level: the confidence level to use for the interval when conf.int = TRUE.
- exponentiate: logical indicating whether or not model terms should be presented on an exponential scale (typical for logistic regression).
- quick: logical indicating whether to use a faster tidy method that returns less information about each component, typically only term and estimate columns.
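To show how some of these arguments might fit together, here is a simplified sketch for glm fits. The function tidy_sketch is hypothetical, the 0.95 default for conf.level is an assumption, and the intervals are Wald intervals from stats::confint.default, chosen only to keep the example short.

```r
library(tibble)

# Hypothetical, simplified tidier for glm fits.
tidy_sketch <- function(x, conf.int = FALSE, conf.level = 0.95,
                        exponentiate = FALSE, ...) {
  co <- coef(summary(x))
  out <- tibble(
    term = rownames(co),
    estimate = unname(co[, "Estimate"]),
    std.error = unname(co[, "Std. Error"])
  )
  if (conf.int) {
    # Wald intervals on the scale of the linear predictor.
    ci <- confint.default(x, level = conf.level)
    out$conf.low <- unname(ci[, 1])
    out$conf.high <- unname(ci[, 2])
  }
  if (exponentiate) {
    # Estimates and interval bounds are exponentiated; std.error is left
    # on the original scale in this sketch.
    out$estimate <- exp(out$estimate)
    if (conf.int) {
      out$conf.low <- exp(out$conf.low)
      out$conf.high <- exp(out$conf.high)
    }
  }
  out
}

fit <- glm(am ~ wt, data = mtcars, family = binomial())
plain <- tidy_sketch(fit)
full <- tidy_sketch(fit, conf.int = TRUE, exponentiate = TRUE)
```

With the defaults, only term, estimate and std.error are returned; the interval columns appear only when conf.int = TRUE.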