Select distinct/unique rows

Retain only unique/distinct rows from an input tbl. This is similar to unique.data.frame(), but considerably faster.

distinct(.data, ..., .keep_all = FALSE)
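
The comparison with `unique.data.frame()` in the description can be checked on a small data frame. This is only a sketch: the data frame `df` and its values are made up for illustration, and it shows only that both calls return the same rows, not the speed difference.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 are exact duplicates
df <- data.frame(x = c(1, 1, 2, 2), y = c("a", "a", "b", "c"))

by_distinct <- distinct(df)
by_unique <- unique(df)  # dispatches to unique.data.frame()

# Both keep the first occurrence of each row, in the original order
identical(by_distinct$x, by_unique$x)
identical(by_distinct$y, by_unique$y)
```

Row names may differ between the two results, so the comparison is done column by column.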

Arguments

.data	a tbl
...	Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, all variables are used.
.keep_all	If `TRUE`, keep all variables in `.data`. If a combination of `...` is not distinct, this keeps the first row of values.
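
A small deterministic sketch of the "first row is preserved" rule described for `...` and `.keep_all`. The tibble `df` and its column names `id` and `value` are hypothetical, chosen so that the retained row is easy to spot.

```r
library(dplyr)

# Hypothetical data: id 1 appears twice with different values
df <- tibble(
  id = c(1, 1, 2),
  value = c("first", "second", "third")
)

# By default only the variables named in `...` are returned
ids_only <- distinct(df, id)
names(ids_only)

# With .keep_all = TRUE, the first row for each id is kept in full,
# so "second" is dropped
kept <- distinct(df, id, .keep_all = TRUE)
kept$value
```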

Details

Comparing list columns is not fully supported. Elements in list columns are compared by reference. A warning will be given when trying to include list columns in the computation. This behavior is kept for compatibility reasons and may change in a future version. See examples.
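
Where list elements need to be compared by value rather than by reference, one possible workaround is to fall back on base R's `duplicated()`, which compares list elements by value. This is a sketch of an alternative technique, not something `distinct()` does itself; the tibble `df` is illustrative.

```r
library(dplyr)

# A list column whose first two elements are equal in value
df <- tibble(a = as.list(c(1, 1, 2)))

# base::duplicated() flags the second element as a repeat of the first,
# so subsetting on its negation keeps one row per distinct value
deduped <- df[!duplicated(df$a), ]
nrow(deduped)
```

This handles a single list column only; deduplicating on several columns at once would need a different approach.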

Examples

df <- tibble(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)
nrow(df)
#> [1] 100
nrow(distinct(df))
#> [1] 65
nrow(distinct(df, x, y))
#> [1] 65

distinct(df, x)
#> # A tibble: 10 x 1
#>        x
#>    <int>
#>  1     6
#>  2     7
#>  3    10
#>  4     5
#>  5     8
#>  6     2
#>  7     1
#>  8     9
#>  9     4
#> 10     3
distinct(df, y)
#> # A tibble: 10 x 1
#>        y
#>    <int>
#>  1     5
#>  2     1
#>  3     9
#>  4    10
#>  5     4
#>  6     6
#>  7     3
#>  8     8
#>  9     2
#> 10     7

# Can choose to keep all other variables as well
distinct(df, x, .keep_all = TRUE)
#> # A tibble: 10 x 2
#>        x     y
#>    <int> <int>
#>  1     6     5
#>  2     7     5
#>  3    10     5
#>  4     5    10
#>  5     8     4
#>  6     2     6
#>  7     1     5
#>  8     9     9
#>  9     4     8
#> 10     3     6
distinct(df, y, .keep_all = TRUE)
#> # A tibble: 10 x 2
#>        x     y
#>    <int> <int>
#>  1     6     5
#>  2     7     1
#>  3     6     9
#>  4     5    10
#>  5     8     4
#>  6     2     6
#>  7     8     3
#>  8     6     8
#>  9     7     2
#> 10     9     7

# You can also use distinct on computed variables
distinct(df, diff = abs(x - y))
#> # A tibble: 10 x 1
#>     diff
#>    <int>
#>  1     1
#>  2     2
#>  3     6
#>  4     5
#>  5     3
#>  6     4
#>  7     0
#>  8     9
#>  9     8
#> 10     7

# The same behaviour applies to grouped data frames,
# except that the grouping variables are always included
df <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, 1, 2, 1)
) %>% group_by(g)
df %>% distinct()
#> # A tibble: 3 x 2
#> # Groups:   g [2]
#>       g     x
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2
#> 3     2     1
df %>% distinct(x)
#> # A tibble: 3 x 2
#> # Groups:   g [2]
#>       x     g
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2
#> 3     1     2

# Values in list columns are compared by reference; this can lead to
# surprising results
tibble(a = as.list(c(1, 1, 2))) %>% glimpse() %>% distinct()
#> Observations: 3
#> Variables: 1
#> $ a <list> [1, 1, 2]
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `a`
#> # A tibble: 3 x 1
#>   a        
#>   <list>   
#> 1 <dbl [1]>
#> 2 <dbl [1]>
#> 3 <dbl [1]>
tibble(a = as.list(1:2)[c(1, 1, 2)]) %>% glimpse() %>% distinct()
#> Observations: 3
#> Variables: 1
#> $ a <list> [1, 1, 2]
#> Warning: distinct() does not fully support columns of type `list`.
#> List elements are compared by reference, see ?distinct for details.
#> This affects the following columns:
#> - `a`
#> # A tibble: 2 x 1
#>   a        
#>   <list>   
#> 1 <int [1]>
#> 2 <int [1]>
