Manual Dplyr
URL https://dplyr.tidyverse.org,
https://github.com/tidyverse/dplyr
BugReports https://github.com/tidyverse/dplyr/issues
Depends R (>= 3.3.0)
Imports ellipsis,
generics,
glue (>= 1.3.2),
lifecycle (>= 1.0.0),
magrittr (>= 1.5),
methods,
R6,
rlang (>= 0.4.10),
tibble (>= 2.1.3),
tidyselect (>= 1.1.0),
utils,
vctrs (>= 0.3.5)
Suggests bench,
broom,
callr,
covr,
DBI,
dbplyr (>= 1.4.3),
knitr,
Lahman,
lobstr,
microbenchmark,
nycflights13,
purrr,
rmarkdown,
RMySQL,
RPostgreSQL,
RSQLite,
testthat (>= 3.0.2),
tidyr,
withr
VignetteBuilder knitr
Encoding UTF-8
LazyData true
Roxygen list(markdown = TRUE)
RoxygenNote 7.1.1
Config/testthat/edition 3
R topics documented:
across
all_vars
arrange
arrange_all
auto_copy
band_members
between
bind
case_when
coalesce
compute
context
copy_to
count
cumall
c_across
desc
distinct
distinct_all
explain
filter
filter-joins
filter_all
group_by
group_by_all
group_cols
group_map
group_split
group_trim
ident
if_else
lead-lag
mutate
mutate-joins
mutate_all
na_if
near
nest_join
nth
n_distinct
order_by
pull
ranking
recode
relocate
rename
rows
rowwise
scoped
select
setops
slice
sql
starwars
storms
summarise
summarise_all
tbl
vars
with_groups
Index
Description
across() makes it easy to apply the same transformation to multiple columns, allowing
you to use select() semantics inside "data-masking" functions like summarise() and
mutate(). See vignette("colwise") for more details.
if_any() and if_all() apply the same predicate function to a selection of columns and
combine the results into a single logical vector.
across() supersedes the family of "scoped variants" like summarise_at(), summarise_if(),
and summarise_all().
Usage
across(.cols = everything(), .fns = NULL, ..., .names = NULL)
Arguments
.fns Functions to apply to each of the selected columns. Possible values are:
NULL, to return the columns untransformed.
A function, e.g. mean.
A purrr-style lambda, e.g. ~ mean(.x, na.rm = TRUE)
A list of functions/lambdas, e.g. list(mean = mean, n_miss = ~ sum(is.na(.x)))
Within these functions you can use cur_column() and cur_group() to
access the current column and grouping keys respectively.
... Additional arguments for the function calls in .fns.
.names A glue specification that describes how to name the output columns.
This can use {.col} to stand for the selected column name, and {.fn}
to stand for the name of the function being applied. The default (NULL)
is equivalent to "{.col}" for the single function case and "{.col}_{.fn}"
for the case where a list is used for .fns.
.cols, cols <tidy-select> Columns to transform. Because across() is used within
functions like summarise() and mutate(), you can’t select or compute
upon grouping variables.
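As a brief illustration of the .names glue specification, the sketch below puts the function name before the column name. The pattern "{.fn}_{.col}" and the object name sepal_stats are illustrative choices, not something prescribed by the package.

```r
library(dplyr)

# Summarise both Sepal columns with two named functions and
# place the function name first in each output column name.
sepal_stats <- iris %>%
  group_by(Species) %>%
  summarise(across(
    starts_with("Sepal"),
    list(mean = mean, sd = sd),
    .names = "{.fn}_{.col}"
  ))

names(sepal_stats)
```

Output columns are generated column by column, with every function applied to one column before moving to the next.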
Value
across() returns a tibble with one column for each column in .cols and each function in
.fns.
if_any() and if_all() return a logical vector.
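A minimal sketch of how the logical vector returned by if_any() and if_all() is typically consumed inside filter(). The tibble signs and its columns are hypothetical, and the example assumes a dplyr version that provides these helpers (1.0.4 or later).

```r
library(dplyr)

# Hypothetical data: two numeric columns with mixed signs.
signs <- tibble(a = c(1, -1, 3), b = c(2, 5, -4))

# Keep rows where every selected column is positive.
all_pos <- filter(signs, if_all(c(a, b), ~ .x > 0))

# Keep rows where at least one selected column is negative.
any_neg <- filter(signs, if_any(c(a, b), ~ .x < 0))
```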
See Also
c_across() for a function that returns a vector.
Examples
# across() -----------------------------------------------------------------
# Different ways to select the same set of columns
# See <https://tidyselect.r-lib.org/articles/syntax.html> for details
iris %>%
  as_tibble() %>%
  mutate(across(c(Sepal.Length, Sepal.Width), round))
iris %>%
  as_tibble() %>%
  mutate(across(c(1, 2), round))
iris %>%
  as_tibble() %>%
  mutate(across(1:Sepal.Width, round))
iris %>%
  as_tibble() %>%
  mutate(across(where(is.double) & !c(Petal.Length, Petal.Width), round))
# A purrr-style formula
iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), ~ mean(.x, na.rm = TRUE)))
iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), list(mean = mean, sd = sd)))
# When the list is not named, .fn is replaced by the function's position
iris %>%
  group_by(Species) %>%
  summarise(across(starts_with("Sepal"), list(mean, sd), .names = "{.col}.fn{.fn}"))
Description
[Superseded]
all_vars() and any_vars() were only needed for the scoped verbs, which have been superseded
by the use of across() in an existing verb. See vignette("colwise") for details.
These quoting functions signal to scoped filtering verbs (e.g. filter_if() or filter_all())
that a predicate expression should be applied to all relevant variables. The all_vars()
variant takes the intersection of the predicate expressions with & while the any_vars()
variant takes the union with |.
Usage
all_vars(expr)
any_vars(expr)
Arguments
expr <data-masking> An expression that returns a logical vector, using . to
refer to the "current" variable.
See Also
vars() for other quoting functions that you can use with scoped verbs.
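The sketch below shows one way to translate the superseded quoting helpers into the if_all() and if_any() helpers described under across(). The tibble scores is hypothetical, and the translation assumes a dplyr version in which if_all() and if_any() are available.

```r
library(dplyr)

# Hypothetical data with two numeric columns.
scores <- tibble(x = c(1, 5, 9), y = c(8, 5, 2))

# Intersection of the predicate over all columns (&).
old_all <- filter_all(scores, all_vars(. > 3))
new_all <- filter(scores, if_all(everything(), ~ .x > 3))

# Union of the predicate over all columns (|).
old_any <- filter_all(scores, any_vars(. > 7))
new_any <- filter(scores, if_any(everything(), ~ .x > 7))
```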
Description
arrange() orders the rows of a data frame by the values of selected columns.
Unlike other dplyr verbs, arrange() largely ignores grouping; you need to explicitly mention
grouping variables (or use .by_group = TRUE) in order to group by them, and functions of
variables are evaluated once per data frame, not once per group.
Usage
arrange(.data, ..., .by_group = FALSE)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <data-masking> Variables, or functions of variables. Use desc() to sort a
variable in descending order.
.by_group If TRUE, will sort first by grouping variable. Applies to grouped data
frames only.
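To make the effect of .by_group concrete, here is a small sketch on the built-in mtcars data. The object names are illustrative only.

```r
library(dplyr)

by_cyl <- group_by(mtcars, cyl)

# Grouping is ignored: rows are ordered by weight across the whole table.
ignoring_groups <- arrange(by_cyl, desc(wt))

# Grouping is honoured: rows are ordered by cyl first, then by weight.
within_groups <- arrange(by_cyl, desc(wt), .by_group = TRUE)
```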
Details
Locales:
The sort order for character vectors will depend on the collating sequence of the locale in
use: see locales().
Missing values:
Unlike base sorting with sort(), NA are:
always sorted to the end for local data, even when wrapped with desc().
treated differently for remote data, depending on the backend.
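A minimal sketch of the missing-value rule for local data, using a hypothetical one-column tibble:

```r
library(dplyr)

vals <- tibble(x = c(2, NA, 1, 3))

# NA is placed last in ascending order ...
asc <- arrange(vals, x)$x

# ... and also last when the column is wrapped in desc().
dsc <- arrange(vals, desc(x))$x
```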
Value
An object of the same type as .data. The output has the following properties:
All rows appear in the output, but (usually) in a different place.
Columns are not modified.
Groups are not modified.
Data frame attributes are preserved.
Methods
This function is a generic, which means that packages can provide implementations (methods)
for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
See Also
Other single table verbs: filter(), mutate(), rename(), select(), slice(), summarise()
Examples
arrange(mtcars, cyl, disp)
arrange(mtcars, desc(disp))
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
These scoped variants of arrange() sort a data frame by a selection of variables. Like
arrange(), you can modify the variables before ordering with the .funs argument.
Usage
arrange_all(.tbl, .funs = list(), ..., .by_group = FALSE)
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
.by_group If TRUE, will sort first by grouping variable. Applies to grouped data
frames only.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports quosure-style
lambda functions and strings representing function names.
Grouping variables
The grouping variables that are part of the selection participate in the sorting of the data
frame.
Examples
df <- as_tibble(mtcars)
arrange_all(df)
# ->
arrange(df, across())
arrange_all(df, desc)
# ->
arrange(df, across(everything(), desc))
Description
Usage
Arguments
Description
These data sets describe band members of the Beatles and Rolling Stones. They are toy
data sets that can be displayed in their entirety on a slide (e.g. to demonstrate a join).
Usage
band_members
band_instruments
band_instruments2
Format
Each is a tibble with two variables and three observations.
Details
band_instruments and band_instruments2 contain the same data but use different column
names for the first column of the data set. band_instruments uses name, which matches the
name of the key column of band_members; band_instruments2 uses artist, which does not.
Examples
band_members
band_instruments
band_instruments2
Description
This is a shortcut for x >= left & x <= right, implemented efficiently in C++ for local
values, and translated to the appropriate SQL for remote tables.
Usage
between(x, left, right)
Arguments
x A numeric vector of values
left, right Boundary values (must be scalars).
Examples
between(1:12, 7, 9)
x <- rnorm(1e2)
x[between(x, -1, 1)]
Description
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or
do.call(cbind, dfs) for binding many data frames into one.
Usage
bind_rows(..., .id = NULL)
bind_cols(
  ...,
  .name_repair = c("unique", "universal", "check_unique", "minimal")
)
Arguments
... Data frames to combine.
Each argument can either be a data frame, a list that could be a data
frame, or a list of data frames.
When row-binding, columns are matched by name, and any missing columns
will be filled with NA.
When column-binding, rows are matched by position, so all data frames
must have the same number of rows. To match by value, not position,
see mutate-joins.
.id Data frame identifier.
When .id is supplied, a new column of identifiers is created to link each
row to its original data frame. The labels are taken from the named
arguments to bind_rows(). When a list of data frames is supplied, the
labels are taken from the names of the list. If no names are found a
numeric sequence is used instead.
.name_repair One of "unique", "universal", or "check_unique". See vctrs::vec_as_names()
for the meaning of these options.
Details
The output of bind_rows() will contain a column if that column appears in any of the
inputs.
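The column-matching and .id behaviour described above can be sketched with two small hypothetical tibbles (first and second) that share only the column x:

```r
library(dplyr)

first <- tibble(x = 1:2, y = c("a", "b"))
second <- tibble(x = 3L, z = TRUE)

# Columns are matched by name; y and z are filled with NA where absent.
# The argument names supply the labels stored in the "source" column.
combined <- bind_rows(first = first, second = second, .id = "source")
combined
```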
Value
bind_rows() and bind_cols() return the same type as the first input, either a data frame,
tbl_df, or grouped_df.
Examples
one <- starwars[1:4, ]
two <- starwars[9:12, ]
# When you supply a column name with the `.id` argument, a new
# column is created to link each row to its original data frame
bind_rows(list(one, two), .id = "id")
bind_rows(list(a = one, b = two), .id = "id")
bind_rows("group 1" = one, "group 2" = two, .id = "groups")
bind_cols(one, two)
bind_cols(list(one, two))
Description
This function allows you to vectorise multiple if_else() statements. It is an R equivalent
of the SQL CASE WHEN statement. If no cases match, NA is returned.
Usage
case_when(...)
Arguments
... <dynamic-dots> A sequence of two-sided formulas. The left hand side
(LHS) determines which values match this case. The right hand side
(RHS) provides the replacement value.
The LHS must evaluate to a logical vector. The RHS does not need to be
logical, but all RHSs must evaluate to the same type of vector.
Both LHS and RHS may have the same length of either 1 or n. The value
of n must be consistent across all cases. The case of n == 0 is treated as
a variant of n != 1.
NULL inputs are ignored.
Value
A vector of length 1 or n, matching the length of the logical input or output vectors, with
the type (and attributes) of the first RHS. Inconsistent lengths or types will generate an
error.
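Because NULL inputs are ignored, a case can be switched on or off with a plain if expression. The sketch below illustrates this; the wrapper label_size() and its flag_zero argument are hypothetical names introduced for the example.

```r
library(dplyr)

# Hypothetical helper: the first case is only present when flag_zero is TRUE.
# When flag_zero is FALSE the `if` yields NULL, which case_when() ignores.
label_size <- function(x, flag_zero = TRUE) {
  case_when(
    if (flag_zero) x == 0 ~ "zero",
    x > 10 ~ "large",
    TRUE ~ "small"
  )
}

with_zero <- label_size(c(0, 5, 20))
without_zero <- label_size(c(0, 5, 20), flag_zero = FALSE)
```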
Examples
x <- 1:50
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
# If none of the cases match, NA is used:
case_when(
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  x %% 35 == 0 ~ "fizz buzz"
)
# Note that NA values in the vector x do not get special treatment. If you want
# to explicitly handle NA values you can use the `is.na` function:
x[2:4] <- NA_real_
case_when(
  x %% 35 == 0 ~ "fizz buzz",
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  is.na(x) ~ "nope",
  TRUE ~ as.character(x)
)
# All RHS values need to be of the same type. Inconsistent types will throw an error.
# This applies also to NA values used in RHS: NA is logical, use
# typed values like NA_real_, NA_complex_, NA_character_, NA_integer_ as appropriate.
case_when(
  x %% 35 == 0 ~ NA_character_,
  x %% 5 == 0 ~ "fizz",
  x %% 7 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
case_when(
  x %% 35 == 0 ~ 35,
  x %% 5 == 0 ~ 5,
  x %% 7 == 0 ~ 7,
  TRUE ~ NA_real_
)
# Assumes case_character_type(), a user-defined wrapper around case_when(),
# has already been defined.
starwars %>%
  mutate(type = case_character_type(height, mass, species, robots = FALSE)) %>%
  pull(type)
Description
Given a set of vectors, coalesce() finds the first non-missing value at each position. This
is inspired by the SQL COALESCE function which does the same thing for NULLs.
Usage
coalesce(...)
Arguments
... <dynamic-dots> Vectors. Inputs should be recyclable (either be length 1
or same length as the longest vector) and coercible to a common type. If
data frames, they are coalesced column by column.
Value
A vector the same length as the first ... argument with missing values replaced by the first
non-missing value.
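A short sketch of the position-wise behaviour with two hypothetical vectors that complement each other:

```r
library(dplyr)

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)

# At each position, take the first value that is not missing.
filled <- coalesce(y, z)
filled
```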
See Also
na_if() to replace specified values with a NA. tidyr::replace_na() to replace NA with a
value.
Examples
# Use a single value to replace all missing values
x <- sample(c(1:5, NA, NA, NA))
coalesce(x, 0L)
Description
compute() stores results in a remote temporary table. collect() retrieves data into a local
tibble. collapse() is slightly different: it doesn’t force computation, but instead forces
generation of the SQL query. This is sometimes needed to work around bugs in dplyr’s
SQL generation.
All functions preserve grouping and ordering.
Usage
compute(x, name = random_table_name(), ...)
collect(x, ...)
collapse(x, ...)
Arguments
x A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
name Name of temporary table on database.
... Other arguments passed on to methods
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
Methods available in currently loaded packages:
See Also
copy_to(), the opposite of collect(): it takes a local data frame and uploads it to the
remote source.
Examples
if (require(dbplyr)) {
  mtcars2 <- src_memdb() %>%
    copy_to(mtcars, name = "mtcars2-cc", overwrite = TRUE)
}
Description
These functions return information about the "current" group or "current" variable, so only
work inside specific contexts like summarise() and mutate().
n() gives the current group size.
cur_data() gives the current data for the current group (excluding grouping variables).
cur_data_all() gives the current data for the current group (including grouping variables).
cur_group() gives the group keys, a tibble with one row and one column for each
grouping variable.
cur_group_id() gives a unique numeric identifier for the current group.
cur_group_rows() gives the row indices for the current group.
cur_column() gives the name of the current column (in across() only).
See group_data() for equivalent functions that return values for all groups.
Usage
n()
cur_data()
cur_data_all()
cur_group()
cur_group_id()
cur_group_rows()
cur_column()
data.table
If you’re familiar with data.table:
cur_data() <-> .SD
cur_group_id() <-> .GRP
cur_group() <-> .BY
cur_group_rows() <-> .I
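As a rough sketch of two of these helpers in use, the code below tags each row of a hypothetical grouped tibble (obs) with its group identifier and group size:

```r
library(dplyr)

obs <- tibble(g = c("a", "a", "b"), x = 1:3)

# Both helpers are evaluated once per group inside mutate().
annotated <- obs %>%
  group_by(g) %>%
  mutate(group_id = cur_group_id(), group_size = n()) %>%
  ungroup()

annotated
```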
Examples
df <- tibble(
  g = sample(rep(letters[1:3], 1:3)),
  x = runif(6),
  y = runif(6)
)
gf <- df %>% group_by(g)
Description
This function uploads a local data frame into a remote data source, creating the table
definition as needed. Wherever possible, the new object will be temporary, limited to the
current connection to the source.
Usage
copy_to(dest, df, name = deparse(substitute(df)), overwrite = FALSE, ...)
Arguments
dest remote data source
df local data frame
name name for new remote table.
overwrite If TRUE, will overwrite an existing table with name name. If FALSE, will
throw an error if name already exists.
... other parameters passed to methods.
Value
a tbl object in the remote source
See Also
collect() for the opposite action: downloading remote data into a local tbl.
Examples
## Not run:
iris2 <- dbplyr::src_memdb() %>% copy_to(iris, overwrite = TRUE)
iris2
## End(Not run)
Description
count() lets you quickly count the unique values of one or more variables: df %>% count(a, b)
is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n()). count() is paired
with tally(), a lower-level helper that is equivalent to df %>% summarise(n = n()). Supply
wt to perform weighted counts, switching the summary from n = n() to n = sum(wt).
add_count() and add_tally() are equivalents to count() and tally() but use mutate()
instead of summarise() so that they add a new column with group-wise counts.
Usage
count(x, ..., wt = NULL, sort = FALSE, name = NULL)
Arguments
x A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr).
... <data-masking> Variables to group by.
wt <data-masking> Frequency weights. Can be NULL or a variable:
If NULL (the default), counts the number of rows in each group.
If a variable, computes sum(wt) for each group.
sort If TRUE, will show the largest groups at the top.
name The name of the new column in the output.
If omitted, it will default to n. If there’s already a column called n, it will
error, and require you to specify the name.
.drop For count(): if FALSE will include counts for empty groups (i.e. for levels
of factors that don’t exist in the data). Deprecated in add_count() since
it didn’t actually affect the output.
Value
An object of the same type as .data. count() and add_count() group transiently, so the
output has the same groups as the input.
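The equivalences stated in the Description, and the effect of wt, can be checked on a small table. The tibble runners and its values are hypothetical and serve only as an illustration.

```r
library(dplyr)

# Hypothetical data: one row per person, with a number of runs each.
runners <- tibble(
  name = c("Max", "Sandra", "Susan"),
  gender = c("male", "female", "female"),
  runs = c(10, 1, 4)
)

# Unweighted: number of rows per gender.
plain <- count(runners, gender)

# Weighted: sum of runs per gender.
weighted <- count(runners, gender, wt = runs)

# Roughly equivalent long form of the weighted count.
manual <- runners %>%
  group_by(gender) %>%
  summarise(n = sum(runs))

# add_count() keeps every row and adds the group-wise total as a column.
with_total <- add_count(runners, gender, wt = runs)
```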
Examples
# count() is a convenient way to get a sense of the distribution of
# values in a dataset
starwars %>% count(species)
starwars %>% count(species, sort = TRUE)
starwars %>% count(sex, gender, sort = TRUE)
starwars %>% count(birth_decade = round(birth_year, -1))
# both count() and tally() have add_ variants that work like
# mutate() instead of summarise
df %>% add_count(gender, wt = runs)
df %>% add_tally(wt = runs)
Description
dplyr provides cumall(), cumany(), and cummean() to complete R’s set of cumulative
functions.
Usage
cumall(x)
cumany(x)
cummean(x)
Arguments
x For cumall() and cumany(), a logical vector; for cummean() an integer or
numeric vector.
Value
A vector the same length as x.
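A minimal sketch of the two logical variants on a hypothetical numeric vector: cumall() stays TRUE until the first FALSE, and cumany() stays FALSE until the first TRUE.

```r
library(dplyr)

x <- c(1, 2, 5, 3, 6)

# TRUE for the leading run of values below 5, FALSE from the first failure on.
leading_small <- cumall(x < 5)

# FALSE until a value of at least 5 has been seen, TRUE from then on.
seen_large <- cumany(x >= 5)
```

Inside filter(), these can be used to keep all rows up to, or from, the first row that meets a condition.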
Examples
# `cummean()` returns a numeric/integer vector of the same length
# as the input vector.
x <- c(1, 3, 5, 2, 2)
cummean(x)
cumsum(x) / seq_along(x)
cumany(x == 3)
Description
c_across() is designed to work with rowwise() to make it easy to perform row-wise
aggregations. It has two differences from c():
It uses tidy select semantics so you can easily select multiple variables. See vignette("rowwise")
for more details.
It uses vctrs::vec_c() in order to give safer outputs.
Usage
c_across(cols = everything())
Arguments
cols <tidy-select> Columns to transform. Because across() is used within
functions like summarise() and mutate(), you can’t select or compute
upon grouping variables.
See Also
across() for a function that returns a tibble.
Examples
df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))
df %>%
  rowwise() %>%
  mutate(
    sum = sum(c_across(w:z)),
    sd = sd(c_across(w:z))
  )
Description
Transform a vector into a format that will be sorted in descending order. This is useful
within arrange().
Usage
desc(x)
Arguments
x vector to transform
Examples
desc(1:10)
desc(factor(letters))
Description
Select only unique/distinct rows from a data frame. This is similar to unique.data.frame()
but considerably faster.
Usage
distinct(.data, ..., .keep_all = FALSE)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <data-masking> Optional variables to use when determining uniqueness.
If there are multiple rows for a given combination of inputs, only the first
row will be preserved. If omitted, will use all variables.
.keep_all If TRUE, keep all variables in .data. If a combination of ... is not distinct,
this keeps the first row of values.
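A short sketch of the difference .keep_all makes, using a hypothetical tibble (pairs) in which the value 1 appears twice in x:

```r
library(dplyr)

pairs <- tibble(x = c(1, 1, 2), y = c("a", "b", "c"))

# Only the column used to determine uniqueness is returned.
only_x <- distinct(pairs, x)

# All columns are kept; for x == 1 the first row (y == "a") wins.
kept <- distinct(pairs, x, .keep_all = TRUE)
```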
Value
An object of the same type as .data. The output has the following properties:
Rows are a subset of the input but appear in the same order.
Columns are not modified if ... is empty or .keep_all is TRUE. Otherwise, distinct()
first calls mutate() to create new columns.
Groups are not modified.
Data frame attributes are preserved.
Methods
This function is a generic, which means that packages can provide implementations (methods)
for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Examples
df <- tibble(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)
)
nrow(df)
nrow(distinct(df))
nrow(distinct(df, x, y))
distinct(df, x)
distinct(df, y)
# Grouping -------------------------------------------------
# The same behaviour applies for grouped data frames,
# except that the grouping variables are always included
df <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, 1, 2, 1)
) %>% group_by(g)
df %>% distinct(x)
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
These scoped variants of distinct() extract distinct rows by a selection of variables. Like
distinct(), you can modify the variables before ordering with the .funs argument.
Usage
distinct_all(.tbl, .funs = list(), ..., .keep_all = FALSE)
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
.keep_all If TRUE, keep all variables in .data. If a combination of ... is not distinct,
this keeps the first row of values.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports quosure-style
lambda functions and strings representing function names.
Grouping variables
The grouping variables that are part of the selection are taken into account to determine
distinct rows.
Examples
df <- tibble(x = rep(2:5, each = 2) / 2, y = rep(2:3, each = 4) / 2)
distinct_all(df)
# ->
distinct(df, across())
distinct_at(df, vars(x,y))
# ->
distinct(df, across(c(x, y)))
distinct_if(df, is.numeric)
# ->
distinct(df, across(where(is.numeric)))
# You can supply a function that will be applied before extracting the distinct values
# The variables of the sorted tibble keep their original values.
distinct_all(df, round)
# ->
distinct(df, across(everything(), round))
Description
This is a generic function which gives more details about an object than print(), and is
more focused on human readable output than str().
Usage
explain(x, ...)
show_query(x, ...)
Arguments
x An object to explain
... Other parameters possibly used by generic
Value
The first argument, invisibly.
Databases
Explaining a tbl_sql will run the SQL EXPLAIN command which will describe the query
plan. This requires a little bit of knowledge about how EXPLAIN works for your database,
but is very useful for diagnosing performance problems.
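A minimal sketch of both generics on a lazy table, assuming the dbplyr and RSQLite packages are installed. The in-memory table created with dbplyr::memdb_frame() and its columns are hypothetical; the output of explain() depends on the database backend.

```r
library(dplyr)

# Run only when the optional database packages are available.
if (requireNamespace("dbplyr", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {
  # Hypothetical remote table held in an in-memory SQLite database.
  remote <- dbplyr::memdb_frame(g = c("a", "b", "a"), x = 1:3)
  query <- remote %>% filter(x > 1)

  # Print the SQL that dplyr would send to the database.
  show_query(query)

  # Print the SQL together with the backend's query plan.
  explain(query)
}
```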
Examples
if (require("dbplyr")) {
Description
The filter() function is used to subset a data frame, retaining all rows that satisfy your
conditions. To be retained, the row must produce a value of TRUE for all conditions. Note
that when a condition evaluates to NA the row will be dropped, unlike base subsetting with
[.
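The difference in NA handling can be sketched with a hypothetical one-column data frame:

```r
library(dplyr)

measurements <- data.frame(x = c(1, NA, 3))

# filter() drops the row where the condition evaluates to NA.
kept_by_filter <- filter(measurements, x > 1)

# Base subsetting with [ keeps it as a row of NA.
kept_by_bracket <- measurements[measurements$x > 1, , drop = FALSE]
```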
Usage
filter(.data, ..., .preserve = FALSE)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <data-masking> Expressions that return a logical value, and are defined
in terms of the variables in .data. If multiple expressions are included,
they are combined with the & operator. Only rows for which all conditions
evaluate to TRUE are kept.
.preserve Relevant when the .data input is grouped. If .preserve = FALSE (the
default), the grouping structure is recalculated based on the resulting
data, otherwise the grouping is kept as is.
Details
The filter() function is used to subset the rows of .data, applying the expressions in ...
to the column values to determine which rows should be retained. It can be applied to
both grouped and ungrouped data (see group_by() and ungroup()). However, dplyr is not
yet smart enough to optimise the filtering operation on grouped datasets that do not need
grouped calculations. For this reason, filtering is often considerably faster on ungrouped
data.
Value
An object of the same type as .data. The output has the following properties:
Rows are a subset of the input, but appear in the same order.
Columns are not modified.
The number of groups may be reduced (if .preserve is not TRUE).
Data frame attributes are preserved.
Grouped tibbles
Because filtering expressions are computed within groups, they may yield different results
on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking
function is involved. Compare filtering starwars on mass > mean(mass, na.rm = TRUE) without
grouping to the same filter applied after group_by(gender).
In the ungrouped version, filter() compares the value of mass in each row to the global
average (taken over the whole data set), keeping only the rows with mass greater than this
global average. In contrast, the grouped version calculates the average mass separately
for each gender group, and keeps rows with mass greater than the relevant within-gender
average.
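The comparison described above can be sketched as follows, using the starwars data shipped with dplyr. The object names are illustrative.

```r
library(dplyr)

# Ungrouped: each mass is compared with the global average.
above_global <- starwars %>%
  filter(mass > mean(mass, na.rm = TRUE))

# Grouped: each mass is compared with the average of its own gender group.
above_group <- starwars %>%
  group_by(gender) %>%
  filter(mass > mean(mass, na.rm = TRUE))
```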
Methods
This function is a generic, which means that packages can provide implementations (methods)
for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
See Also
Other single table verbs: arrange(), mutate(), rename(), select(), slice(), summarise()
Examples
# Filtering by one criterion
filter(starwars, species == "Human")
filter(starwars, mass > 1000)
# When multiple expressions are used, they are combined using &
filter(starwars, hair_color == "none", eye_color == "black")
#
# The following filters rows where `mass` is greater than the
# global average:
starwars %>% filter(mass > mean(mass, na.rm = TRUE))
# Whereas this keeps rows with `mass` greater than the gender
# average:
starwars %>% group_by(gender) %>% filter(mass > mean(mass, na.rm = TRUE))
# To refer to column names that are stored as strings, use the `.data` pronoun:
vars <- c("mass", "height")
cond <- c(80, 150)
starwars %>%
  filter(
    .data[[vars[[1]]]] > cond[[1]],
    .data[[vars[[2]]]] > cond[[2]]
  )
# Learn more in ?dplyr_data_masking
Description
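Filtering joins filter rows from x based on the presence or absence of matches in y:
semi_join() returns all rows from x with a match in y.
anti_join() returns all rows from x without a match in y.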
Usage
semi_join(x, y, by = NULL, copy = FALSE, ...)
Arguments
x, y A pair of data frames, data frame extensions (e.g. a tibble), or lazy data
frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
by A character vector of variables to join by.
If NULL, the default, *_join() will perform a natural join, using all variables
in common across x and y. A message lists the variables so that you can
check they’re correct; suppress the message by supplying by explicitly.
To join by different variables on x and y, use a named vector. For example,
by = c("a" = "b") will match x$a to y$b.
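The following sketch illustrates the named-vector form of by with the two filtering joins. The tibbles x and y and their columns id, val and key are made up for illustration.

```r
library(dplyr)

# Hypothetical inputs: the join key is called `id` in x and `key` in y
x <- tibble(id = c(1, 2, 3), val = c("a", "b", "c"))
y <- tibble(key = c(2, 3, 3, 4))

# Rows of x that have at least one match in y; x rows are never duplicated,
# even though key 3 appears twice in y
kept <- semi_join(x, y, by = c("id" = "key"))

# Rows of x that have no match in y
dropped <- anti_join(x, y, by = c("id" = "key"))

kept$id     # 2 3
dropped$id  # 1
```

In both results only the columns of x are returned; y is used purely as a filter.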
Value
An object of the same type as x. The output has the following properties:
Rows are a subset of the input, but appear in the same order.
Columns are not modified.
Data frame attributes are preserved.
Groups are taken from x. The number of groups may be reduced.
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
Methods available in currently loaded packages:
See Also
Other joins: mutate-joins, nest_join()
Examples
# "Filtering" joins keep cases from the LHS
band_members %>% semi_join(band_instruments)
band_members %>% anti_join(band_instruments)
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
These scoped filtering verbs apply a predicate expression to a selection of variables. The
predicate expression should be quoted with all_vars() or any_vars() and should mention
the pronoun . to refer to variables.
Usage
filter_all(.tbl, .vars_predicate, .preserve = FALSE)
Arguments
.tbl A tbl object.
.vars_predicate
A quoted predicate expression as returned by all_vars() or any_vars().
Can also be a function or purrr-like formula. In this case, the intersection
of the results is taken by default and there’s currently no way to request
the union.
.preserve when FALSE (the default), the grouping structure is recalculated based on
the resulting data, otherwise it is kept as is.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports
quosure-style lambda functions and strings representing function names.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
Grouping variables
The grouping variables that are part of the selection are taken into account to determine
filtered rows.
Examples
# While filter() accepts expressions with specific variables, the
# scoped filter verbs take an expression with the pronoun `.` and
# replicate it over all variables. This expression should be quoted
# with all_vars() or any_vars():
all_vars(is.na(.))
any_vars(is.na(.))
# Or the union:
filter_all(mtcars, any_vars(. > 150))
# ->
filter(mtcars, if_any(everything(), ~ . > 150))
Description
Most data operations are done on groups defined by variables. group_by() takes an
existing tbl and converts it into a grouped tbl where operations are performed "by group".
ungroup() removes grouping.
Usage
group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))
ungroup(x, ...)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... In group_by(), variables or computations to group by. In ungroup(),
variables to remove from the grouping.
.add When FALSE, the default, group_by() will override existing groups. To
add to the existing groups, use .add = TRUE.
This argument was previously called add, but that prevented creating a
new grouping variable called add, and conflicts with our naming conven-
tions.
.drop Drop groups formed by factor levels that don’t appear in the data? The
default is TRUE except when .data has been previously grouped with .drop
= FALSE. See group_by_drop_default() for details.
x A tbl()
Value
A grouped data frame with class grouped_df, unless the combination of ... and .add yields
an empty set of grouping columns, in which case a tibble will be returned.
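A short sketch of how the .add argument changes the resulting grouping, using the mtcars data set. The object names overridden and added are illustrative only.

```r
library(dplyr)

by_cyl <- mtcars %>% group_by(cyl)

# Default (.add = FALSE): the new grouping replaces the existing one
overridden <- by_cyl %>% group_by(am)

# .add = TRUE: the new variable is appended to the existing grouping
added <- by_cyl %>% group_by(am, .add = TRUE)

group_vars(overridden)  # "am"
group_vars(added)       # "cyl" "am"
```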
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
Methods available in currently loaded packages:
group_by(): no methods found.
ungroup(): no methods found.
See Also
Other grouping functions: group_map(), group_nest(), group_split(), group_trim()
Examples
by_cyl <- mtcars %>% group_by(cyl)
# grouping doesn't change how the data looks (apart from listing
# how it's grouped):
by_cyl
# when factors are involved and .drop = FALSE, groups can be empty
tbl <- tibble(
  x = 1:10,
  y = factor(rep(c("a", "c"), each = 5), levels = c("a", "b", "c"))
)
tbl %>%
  group_by(y, .drop = FALSE) %>%
  group_rows()
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
These scoped variants of group_by() group a data frame by a selection of variables. Like
group_by(), they have optional mutate semantics.
Usage
group_by_all(
.tbl,
.funs = list(),
...,
.add = FALSE,
.drop = group_by_drop_default(.tbl)
)
group_by_at(
.tbl,
.vars,
.funs = list(),
...,
.add = FALSE,
.drop = group_by_drop_default(.tbl)
)
group_by_if(
.tbl,
.predicate,
.funs = list(),
...,
.add = FALSE,
.drop = group_by_drop_default(.tbl)
)
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
.add See group_by()
.drop Drop groups formed by factor levels that don’t appear in the data? The
default is TRUE except when .data has been previously grouped with .drop
= FALSE. See group_by_drop_default() for details.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports
quosure-style lambda functions and strings representing function names.
Grouping variables
Existing grouping variables are maintained, even if not included in the selection.
Examples
# Group a data frame by all variables:
group_by_all(mtcars)
# ->
mtcars %>% group_by(across())
Description
This selection helper matches grouping variables. It can be used in select() or vars()
selections.
Usage
group_cols(vars = NULL, data = NULL)
Arguments
vars Deprecated; please use data instead.
data For advanced use only. The default NULL automatically finds the "current"
data frames.
See Also
groups() and group_vars() for retrieving the grouping variables outside selection contexts.
Examples
gdf <- iris %>% group_by(Species)
gdf %>% select(group_cols())
Description
[Experimental]
group_map(), group_modify() and group_walk() are purrr-style functions that can be used
to iterate on grouped tibbles.
Usage
group_map(.data, .f, ..., .keep = FALSE)
Arguments
.data A grouped tibble
.f A function or formula to apply to each group.
If a function, it is used as is. It should have at least 2 formal arguments.
If a formula, e.g. ~ head(.x), it is converted to a function.
In the formula, you can use
. or .x to refer to the subset of rows of .tbl for the given group
.y to refer to the key, a one row tibble with one column per grouping
variable that identifies the group
... Additional arguments passed on to .f
.keep are the grouping variables kept in .x
Details
Use group_modify() when summarize() is too limited, in terms of what you need to do and
return for each group. group_modify() is good for "data frame in, data frame out". If that
is too limited, you need to use a nested or split workflow. group_modify() is an evolution
of do(), if you have used that before.
Each conceptual group of the data frame is exposed to the function .f with two pieces of
information:
The subset of the data for the group, exposed as .x.
The key, a tibble with exactly one row and columns for each grouping variable, exposed
as .y.
For completeness, group_modify(), group_map() and group_walk() also work on ungrouped
data frames; in that case the function is applied to the entire data frame (exposed as .x),
and .y is a one row tibble with no column, consistently with group_keys().
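To make the two pieces of information passed to .f concrete, here is a minimal sketch that writes .f as an ordinary two-argument function. The helper name summarise_group and its output columns are hypothetical.

```r
library(dplyr)

# .x holds the rows of the current group, .y is its one-row key tibble
summarise_group <- function(.x, .y) {
  tibble(
    n_rows   = nrow(.x),  # number of rows in this group
    key_cols = ncol(.y),  # one column per grouping variable
    key_rows = nrow(.y)   # the key always has exactly one row
  )
}

res <- mtcars %>%
  group_by(cyl) %>%
  group_modify(summarise_group)

res$cyl     # 4 6 8
res$n_rows  # 11 7 14
```

Because .f returns a data frame, group_modify() can bind the per-group results back together and restore the grouping column.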
Value
group_modify() returns a grouped tibble. In that case .f must return a data frame.
group_map() returns a list of results from calling .f on each group.
group_walk() calls .f for side effects and returns the input .tbl, invisibly.
See Also
Other grouping functions: group_by(), group_nest(), group_split(), group_trim()
Examples
# return a list
mtcars %>%
  group_by(cyl) %>%
  group_map(~ head(.x, 2L))
# a list of vectors
iris %>%
  group_by(Species) %>%
  group_map(~ quantile(.x$Petal.Length, probs = c(0.25, 0.5, 0.75)))
iris %>%
  group_by(Species) %>%
  group_modify(~ {
    .x %>%
      purrr::map_dfc(fivenum) %>%
      mutate(nms = c("min", "Q1", "median", "Q3", "max"))
  })
Description
[Experimental] group_split() works like base::split() but:
it uses the grouping structure from group_by() and therefore is subject to the data
mask;
it does not name the elements of the list based on the grouping as this typically loses
information and is confusing.
group_keys() explains the grouping structure, by returning a data frame that has one row
per group and one column per grouping variable.
Usage
Arguments
.tbl A tbl
... Grouping specification, forwarded to group_by()
.keep Should the grouping columns be kept
Value
group_split() returns a list of tibbles. Each tibble contains the rows of .tbl for the
associated group and all the columns, including the grouping variables.
group_keys() returns a tibble with one row per group, and one column per grouping
variable.
The primary use case for group_split() is with already grouped data frames, typically a
result of group_by(). In this case group_split() only uses the first argument, the grouped
tibble, and warns when ... is used.
Because some of these groups may be empty, it is best paired with group_keys() which
identifies the representatives of each grouping variable for the group.
When used on ungrouped data frames, group_split() and group_keys() forward the ...
to group_by() before the split, therefore the ... are subject to the data mask.
Using these functions on an ungrouped data frame only makes sense if you need only one
or the other, because otherwise the grouping algorithm is performed each time.
On rowwise data frames, group_split() returns a list of one-row tibbles, and the ... are
ignored and warned against.
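Since the list returned by group_split() is unnamed, the keys are what tell you which element belongs to which group. The sketch below pairs the two on a made-up grouped tibble (gdf, g and v are illustrative names).

```r
library(dplyr)

# Hypothetical grouped tibble; the groups appear out of order in the data
gdf <- tibble(g = c("b", "a", "b"), v = 1:3) %>%
  group_by(g)

pieces <- group_split(gdf)  # unnamed list, one tibble per group
keys   <- group_keys(gdf)   # one row per group, in the same order

keys$g         # "a" "b"
pieces[[2]]$v  # 1 3  (the rows of group "b")
```

The i-th row of keys describes the i-th element of pieces, so the two can be iterated over side by side.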
See Also
Other grouping functions: group_by(), group_map(), group_nest(), group_trim()
Examples
# ----- use case 1 : on an already grouped tibble
ir <- iris %>%
  group_by(Species)
group_split(ir)
group_keys(ir)
# this can be useful if the grouped data has been altered before the split
ir <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > mean(Sepal.Length))
group_split(ir)
group_keys(ir)
iris %>%
  group_keys(Species)
Description
[Experimental] Drop unused levels of all factors that are used as grouping variables, then
recalculates the grouping structure.
group_trim() is particularly useful after a filter() that is intended to select a subset of
groups.
Usage
group_trim(.tbl, .drop = group_by_drop_default(.tbl))
Arguments
.tbl A grouped data frame
.drop See group_by()
Value
A grouped data frame
See Also
Other grouping functions: group_by(), group_map(), group_nest(), group_split()
Examples
iris %>%
  group_by(Species) %>%
  filter(Species == "setosa", .preserve = TRUE) %>%
  group_trim()
Description
ident() takes unquoted strings and flags them as identifiers. ident_q() assumes its input
has already been quoted, and ensures it does not get quoted again. This is currently used
only for schema.table.
Usage
ident(...)
Arguments
... A character vector, or name-value pairs
Examples
# Identifiers are escaped with "
if (requireNamespace("dbplyr", quietly = TRUE)) {
  ident("x")
}
if_else Vectorised if
Description
Compared to the base ifelse(), this function is more strict. It checks that true and false
are the same type. This strictness makes the output type more predictable, and makes it
somewhat faster.
Usage
if_else(condition, true, false, missing = NULL)
Arguments
condition Logical vector
true, false Values to use for TRUE and FALSE values of condition. They must be
either the same length as condition, or length 1. They must also be
the same type: if_else() checks that they have the same type and same
class. All other attributes are taken from true.
missing If not NULL, will be used to replace missing values.
Value
Where condition is TRUE, the matching value from true, where it’s FALSE, the matching
value from false, otherwise NA.
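The sketch below illustrates the two behaviours described here: the missing argument and the type check on true and false. The vector x and the result names are illustrative, and the exact wording of the error is not shown because it may differ between dplyr versions.

```r
library(dplyr)

x <- c(-2L, 0L, 3L, NA)

# `missing` supplies the value used where the condition is NA
labels <- if_else(x < 0L, "negative", "non-negative", missing = "unknown")
labels  # "negative" "non-negative" "non-negative" "unknown"

# A character `true` combined with an integer `false` is rejected
mismatch <- tryCatch(
  if_else(x < 0L, "negative", 0L),
  error = function(e) "type error"
)
mismatch  # "type error"
```

Base ifelse() would silently coerce the second call to character; if_else() treats it as a mistake instead.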
Examples
x <- c(-5:5, NA)
if_else(x < 0, NA_integer_, x)
if_else(x < 0, "negative", "positive", "missing")
Description
Find the "previous" (lag()) or "next" (lead()) values in a vector. Useful for comparing
values behind or ahead of the current values.
Usage
lag(x, n = 1L, default = NA, order_by = NULL, ...)
Arguments
x Vector of values
n Positive integer of length 1, giving the number of positions to lead or lag
by
default Value used for non-existent rows. Defaults to NA.
order_by Override the default ordering to use another vector or column
... Needed for compatibility with lag generic.
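A small sketch of the default and order_by arguments. The vectors year and value are made up, and dplyr::lag() is called with its namespace to avoid any confusion with stats::lag().

```r
library(dplyr)

# Hypothetical observations that are not stored in chronological order
year  <- c(2003, 2001, 2002)
value <- c(30, 10, 20)

# Lag along `year` rather than along the storage order: each element
# receives the value of the preceding year
prev <- dplyr::lag(value, order_by = year)
prev  # 20 NA 10

# `default` fills the positions that have no neighbour
nxt <- dplyr::lead(1:3, default = 0L)
nxt   # 2 3 0
```

The result of order_by is returned in the original positions, so it can be used directly as a new column.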
Examples
lag(1:5)
lead(1:5)
x <- 1:5
tibble(behind = lag(x), x, ahead = lead(x))
lead(1:5, n = 1)
lead(1:5, n = 2)
lead(1:5)
lead(1:5, default = 6)
Description
mutate() adds new variables and preserves existing ones; transmute() adds new variables
and drops existing ones. New variables overwrite existing variables of the same name.
Variables can be removed by setting their value to NULL.
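The three behaviours in this description (overwriting, removing with NULL, and dropping with transmute()) can be sketched on a made-up tibble; df and its columns a, b and c are illustrative names.

```r
library(dplyr)

df <- tibble(a = 1:3, b = 4:6)

# `a` is overwritten, `b` is removed, and `c` already sees the new `a`
changed <- mutate(df, a = a * 10L, b = NULL, c = a + 1L)
names(changed)  # "a" "c"
changed$c       # 11 21 31

# transmute() keeps only the columns it creates
only_new <- transmute(df, total = a + b)
names(only_new)  # "total"
```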
Usage
mutate(.data, ...)
transmute(.data, ...)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <data-masking> Name-value pairs. The name gives the name of the
column in the output.
The value can be:
A vector of length 1, which will be recycled to the correct length.
A vector the same length as the current group (or the whole data
frame if ungrouped).
Value
An object of the same type as .data. The output has the following properties:
Grouped tibbles
Because mutating expressions are computed within groups, they may yield different results
on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking
function is involved. Compare this ungrouped mutate:
starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:
starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the
averages within species levels.
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
Methods available in currently loaded packages:
See Also
Other single table verbs: arrange(), filter(), rename(), select(), slice(), summarise()
Examples
# Newly created variables are available immediately
starwars %>%
  select(name, mass) %>%
  mutate(
    mass2 = mass * 2,
    mass2_squared = mass2 * mass2
  )
# Grouping ----------------------------------------
# The mutate operation may yield different results on grouped
# tibbles because the expressions are computed within groups.
# The following normalises `mass` by the global average:
starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
# Indirection ----------------------------------------
# Refer to column names stored as strings with the `.data` pronoun:
vars <- c("mass", "height")
mutate(starwars, prod = .data[[vars[[1]]]] * .data[[vars[[2]]]])
# Learn more in ?dplyr_data_masking
Description
The mutating joins add columns from y to x, matching rows based on the keys:
If a row in x matches multiple rows in y, all the rows in y will be returned once for each
matching row in x.
Usage
inner_join(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = FALSE
)
left_join(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = FALSE
)
right_join(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = FALSE
)
full_join(
x,
y,
by = NULL,
copy = FALSE,
suffix = c(".x", ".y"),
...,
keep = FALSE
)
Arguments
x, y A pair of data frames, data frame extensions (e.g. a tibble), or lazy data
frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
by A character vector of variables to join by.
If NULL, the default, *_join() will perform a natural join, using all variables
in common across x and y. A message lists the variables so that you can
check they’re correct; suppress the message by supplying by explicitly.
To join by different variables on x and y, use a named vector. For example,
by = c("a" = "b") will match x$a to y$b.
To join by multiple variables, use a vector with length > 1. For example,
by = c("a","b") will match x$a to y$a and x$b to y$b. Use a named
vector to match different variables in x and y. For example, by = c("a" =
"b","c" = "d") will match x$a to y$b and x$c to y$d.
Value
An object of the same type as x. The order of the rows and columns of x is preserved as
much as possible. The output has the following properties:
For inner_join(), a subset of x rows. For left_join(), all x rows. For right_join(), a
subset of x rows, followed by unmatched y rows. For full_join(), all x rows, followed
by unmatched y rows.
For all joins, rows will be duplicated if one or more rows in x matches multiple rows
in y.
Output columns include all x columns and all y columns. If columns in x and y have
the same name (and aren’t included in by), suffixes are added to disambiguate.
Output columns included in by are coerced to common type across x and y.
Groups are taken from x.
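The row rules listed above can be checked with a small sketch. The tibbles x and y and their columns are made up; id 2 matches two rows of y, id 1 and id 3 have no match, and key 4 has no match in x.

```r
library(dplyr)

# Hypothetical inputs with differently named key columns
x <- tibble(id = c(1, 2, 3), vx = c("x1", "x2", "x3"))
y <- tibble(key = c(2, 2, 4), vy = c("y1", "y2", "y3"))
by <- c("id" = "key")

n_rows <- c(
  inner = nrow(inner_join(x, y, by = by)),  # only id 2, duplicated
  left  = nrow(left_join(x, y, by = by)),   # all x rows, id 2 duplicated
  right = nrow(right_join(x, y, by = by)),  # matched x rows, then key 4
  full  = nrow(full_join(x, y, by = by))    # all x rows, then key 4
)
n_rows  # inner 2, left 4, right 3, full 5

full_join(x, y, by = by)$id  # 1 2 2 3 4
```

With a named by vector the key column in the output takes its name from x, here id.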
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra argu-
ments and differences in behaviour.
Methods available in currently loaded packages:
See Also
Other joins: filter-joins, nest_join()
Examples
band_members %>% inner_join(band_instruments)
band_members %>% left_join(band_instruments)
band_members %>% right_join(band_instruments)
band_members %>% full_join(band_instruments)
# If a row in `x` matches multiple rows in `y`, all the rows in `y` will be
# returned once for each matching row in `x`
df1 <- tibble(x = 1:3)
df2 <- tibble(x = c(1, 1, 2), y = c("first", "second", "third"))
df1 %>% left_join(df2)
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
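The scoped variants of mutate() and transmute() make it easy to apply the same
transformation to multiple variables. There are three variants:
_all affects every variable
_at affects variables selected with a character vector or vars()
_if affects variables selected with a predicate function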
Usage
mutate_all(.tbl, .funs, ...)
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports
quosure-style lambda functions and strings representing function names.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.cols This argument has been renamed to .vars to fit dplyr’s terminology and
is deprecated.
Value
A data frame. By default, the newly created columns have the shortest names needed to
uniquely identify the output. To force inclusion of a name, even when not needed, name
the input (see examples for details).
Grouping variables
If applied on a grouped tibble, these operations are not applied to the grouping variables.
The behaviour depends on whether the selection is implicit (_all and _if selections) or
explicit (_at selections).
Grouping variables covered by explicit selections in mutate_at() and transmute_at()
are always an error. Add -group_cols() to the vars() selection to avoid this:
data %>% mutate_at(vars(-group_cols(), ...), myoperation)
Or remove group_vars() from the character vector of column names:
nms <- setdiff(nms, group_vars(data))
data %>% mutate_at(nms, myoperation)
Grouping variables covered by implicit selections are ignored by mutate_all(), transmute_all(),
mutate_if(), and transmute_if().
Naming
The names of the new columns are derived from the names of the input variables and the
names of the functions.
if there is only one unnamed function (i.e. if .funs is an unnamed list of length one),
the names of the input variables are used to name the new columns;
for _at functions, if there is only one unnamed variable (i.e., if .vars is of the form
vars(a_single_column)) and .funs has length greater than one, the names of the
functions are used to name the new columns;
otherwise, the new names are created by concatenating the names of the input variables
and the names of the functions, separated with an underscore "_".
The .funs argument can be a named or unnamed list. If a function is unnamed and the
name cannot be derived automatically, a name of the form "fn#" is used. Similarly, vars()
accepts named and unnamed arguments. If a variable in .vars is named, a new column by
that name will be created.
Name collisions in the new columns are disambiguated using a unique suffix.
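A minimal sketch of two of the naming rules above, using a made-up tibble df. The function labels half and dbl are illustrative, and only the set of resulting names is shown, not their order.

```r
library(dplyr)

df <- tibble(x = c(2, 4), y = c(6, 8))

# One unnamed function: the input names are reused, so columns are replaced
one_fun <- mutate_all(df, ~ . / 2)
names(one_fun)  # "x" "y"

# Several named functions on several variables: new columns are named
# <variable>_<function> and the original columns are kept
two_funs <- mutate_all(df, list(half = ~ . / 2, dbl = ~ . * 2))
sort(names(two_funs))
```

The second call returns six columns: x, y, x_half, y_half, x_dbl and y_dbl.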
Life cycle
The functions are maturing, because the naming scheme and the disambiguation algorithm
are subject to change in dplyr 0.9.0.
See Also
The other scoped verbs, vars()
Examples
iris <- as_tibble(iris)
# You can also supply selection helpers to _at() functions but you have
# to quote them with vars():
iris %>% mutate_at(vars(matches("Sepal")), log)
iris %>% mutate(across(matches("Sepal"), log))
na_if Convert values to NA
Description
This is a translation of the SQL command NULLIF. It is useful if you want to convert an
annoying value to NA.
Usage
na_if(x, y)
Arguments
x Vector to modify
y Value to replace with NA
Value
A modified version of x that replaces any values that are equal to y with NA.
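Two short illustrations of the replacement rule, one with a single value of y and one where y is compared element by element; the input vectors are made up.

```r
library(dplyr)

# A single y: every empty string becomes NA
cleaned <- na_if(c("a", "", "b"), "")
cleaned   # "a" NA "b"

# A y of the same length as x: only positions where x equals y become NA
pairwise <- na_if(1:5, 5:1)
pairwise  # 1 2 NA 4 5
```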
See Also
coalesce() to replace missing values with a specified value.
tidyr::replace_na() to replace NA with a value.
recode() to more generally replace values.
Examples
na_if(1:5, 5:1)
Description
This is a safe way of comparing if two vectors of floating point numbers are (pairwise) equal.
This is safer than using ==, because it has a built-in tolerance.
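The classic floating point pitfall shows the difference; the variable names exact and tolerant are illustrative.

```r
library(dplyr)

# 0.1 + 0.2 is not stored as exactly 0.3 in binary floating point
exact    <- (0.1 + 0.2) == 0.3
tolerant <- near(0.1 + 0.2, 0.3)

exact     # FALSE
tolerant  # TRUE

# near() is vectorised, so it compares pairwise
near(c(1, sqrt(2)^2), c(1, 2))  # TRUE TRUE
```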
Usage
Arguments
Examples
sqrt(2) ^ 2 == 2
near(sqrt(2) ^ 2, 2)
Description
nest_join() returns all rows and columns in x with a new nested-df column that contains
all matches from y. When there is no match, the list column is a 0-row tibble.
Usage
nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, ...)
Arguments
x, y A pair of data frames, data frame extensions (e.g. a tibble), or lazy data
frames (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
by A character vector of variables to join by.
If NULL, the default, *_join() will perform a natural join, using all variables
in common across x and y. A message lists the variables so that you can
check they’re correct; suppress the message by supplying by explicitly.
To join by different variables on x and y, use a named vector. For example,
by = c("a" = "b") will match x$a to y$b.
To join by multiple variables, use a vector with length > 1. For example,
by = c("a","b") will match x$a to y$a and x$b to y$b. Use a named
vector to match different variables in x and y. For example, by = c("a" =
"b","c" = "d") will match x$a to y$b and x$c to y$d.
To perform a cross-join, generating all combinations of x and y, use by =
character().
copy If x and y are not from the same data source, and copy is TRUE, then y
will be copied into the same src as x. This allows you to join tables across
srcs, but it is a potentially expensive operation so you must opt into it.
keep Should the join keys from both x and y be preserved in the output?
name The name of the list column nesting joins create. If NULL the name of y
is used.
... Other parameters passed onto methods.
Details
In some sense, a nest_join() is the most fundamental join since you can recreate the other
joins from it:
inner_join() is a nest_join() plus tidyr::unnest().
left_join() is a nest_join() plus tidyr::unnest(.drop = FALSE).
semi_join() is a nest_join() plus a filter() where you check that every element of
data has at least one row.
anti_join() is a nest_join() plus a filter() where you check every element has zero
rows.
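As a sketch of the last two equivalences, the code below rebuilds semi_join() and anti_join() from nest_join() on the band_members and band_instruments data sets that ship with dplyr. The list-column name instruments and the helper n_matches are illustrative choices.

```r
library(dplyr)

nested <- band_members %>%
  nest_join(band_instruments, by = "name", name = "instruments")

# Number of matching rows stored in each element of the list column
n_matches <- vapply(nested$instruments, nrow, integer(1))

# Keep rows with at least one match (semi join) or with none (anti join)
semi_like <- nested[n_matches > 0, ] %>% select(-instruments)
anti_like <- nested[n_matches == 0, ] %>% select(-instruments)

semi_like$name  # same names as semi_join(band_members, band_instruments)
anti_like$name  # same names as anti_join(band_members, band_instruments)
```

This is meant only to illustrate the relationship; the dedicated join functions remain the idiomatic choice.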
Methods
This function is a generic, which means that packages can provide implementations (meth-
ods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
See Also
Other joins: filter-joins, mutate-joins
Examples
band_members %>% nest_join(band_instruments)
Description
These are straightforward wrappers around [[. The main advantage is that you can provide
an optional secondary vector that defines the ordering, and provide a default value to use
when the input is shorter than expected.
Usage
nth(x, n, order_by = NULL, default = default_missing(x))
Arguments
x A vector
n For nth(), a single integer specifying the position. Negative integers index
from the end (i.e. -1L will return the last value in the vector).
If a double is supplied, it will be silently truncated.
order_by An optional vector used to determine the order
default A default value to use if the position does not exist in the input. This is
guessed by default for base vectors, where a missing value of the
appropriate type is returned, and for lists, where a NULL is returned.
For more complicated objects, you’ll need to supply this value. Make sure
it is the same type as x.
Value
A single value. [[ is used to do the subsetting.
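The arguments described above can be sketched on a short made-up vector x.

```r
library(dplyr)

x <- c(10, 20, 30)

nth(x, 2)    # 20
nth(x, -1)   # 30, negative positions count from the end

# Position 5 does not exist: the guessed default is a missing value
nth(x, 5)               # NA
nth(x, 5, default = 0)  # 0

# order_by reorders x before the position is looked up
first(x, order_by = c(3, 1, 2))  # 20
```

Supplying default keeps the result type-stable when a group may be shorter than expected.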
Examples
x <- 1:10
y <- 10:1
first(x)
last(y)
nth(x, 1)
nth(x, 5)
nth(x, -2)
nth(x, 11)
last(x)
# Second argument provides optional ordering
last(x, y)
Description
This is a faster and more concise equivalent of length(unique(x))
Usage
n_distinct(..., na.rm = FALSE)
Arguments
... vectors of values
na.rm if TRUE missing values don’t count
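A brief illustration of the na.rm argument on a made-up vector v: by default a missing value counts as one distinct value.

```r
library(dplyr)

v <- c(1, 2, 2, NA)

with_na    <- n_distinct(v)                # 1, 2 and NA are counted
without_na <- n_distinct(v, na.rm = TRUE)  # only 1 and 2 are counted

with_na     # 3
without_na  # 2
```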
Examples
x <- sample(1:10, 1e5, rep = TRUE)
length(unique(x))
n_distinct(x)
Description
This function makes it possible to control the ordering of window functions in R that don’t
have a specific ordering parameter. When translated to SQL it will modify the order clause
of the OVER function.
Usage
order_by(order_by, call)
Arguments
order_by a vector to order by
call a function call to a window function, where the first argument is the vector
being operated on
Details
This function works by changing the call to instead call with_order() with the appropriate
arguments.
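To illustrate the rewriting described above, the sketch below compares order_by() with the equivalent explicit call to with_order(), which dplyr also exports. The object names are illustrative.

```r
library(dplyr)

# order_by() form: the window function call is written naturally
via_order_by <- order_by(10:1, cumsum(1:10))

# Equivalent explicit form: ordering vector, function, then its input
via_with_order <- with_order(10:1, cumsum, 1:10)

identical(via_order_by, via_with_order)  # TRUE

# Because the ordering vector is decreasing, the running total starts at
# the last element and accumulates backwards
via_order_by[10]  # 10
via_order_by[1]   # 55
```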
Examples
order_by(10:1, cumsum(1:10))
x <- 10:1
y <- 1:10
order_by(x, cumsum(y))
Description
pull() is similar to $. It’s mostly useful because it looks a little nicer in pipes, it also works
with remote data frames, and it can optionally name the output.
Usage
pull(.data, var = -1, name = NULL, ...)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
var A variable specified as:
a literal variable name
a positive integer, giving the position counting from the left
a negative integer, giving the position counting from the right.
The default returns the last column (on the assumption that’s the column
you’ve created most recently).
This argument is taken by expression and supports quasiquotation (you
can unquote column names and column locations).
name An optional parameter that specifies the column to be used as names for
a named vector. Specified in a similar manner as var.
... For use by methods.
Value
A vector the same size as .data.
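A short sketch of the name argument on a made-up tibble scores: the column given as name supplies the names of the extracted vector.

```r
library(dplyr)

scores <- tibble(id = c("a", "b"), value = c(1, 2))

plain <- pull(scores, value)             # unnamed vector
named <- pull(scores, value, name = id)  # named by the `id` column

plain  # 1 2
named  # a = 1, b = 2
```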
Methods
This function is a generic, which means that packages can provide implementations (meth-
ods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Examples
mtcars %>% pull(-1)
mtcars %>% pull(1)
mtcars %>% pull(cyl)
Description
Six variations on ranking functions, mimicking the ranking functions described in SQL2003.
They are currently implemented using the built in rank function, and are provided mainly
as a convenience when converting between R and SQL. All ranking functions map smallest
inputs to smallest outputs. Use desc() to reverse the direction.
Usage
row_number(x)
ntile(x = row_number(), n)
min_rank(x)
dense_rank(x)
percent_rank(x)
cume_dist(x)
Arguments
x a vector of values to rank. Missing values are left as is. If you want to
treat them as the smallest or largest values, replace with Inf or -Inf before
ranking.
n number of groups to split up into.
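The ranking functions differ mainly in how they treat ties. The sketch below applies them to a made-up vector with one tie, and then shows that missing values are left as is.

```r
library(dplyr)

x <- c(10, 20, 20, 30)

row_number(x)    # 1 2 3 4    ties broken by position
min_rank(x)      # 1 2 2 4    ties share the lowest rank, then a gap
dense_rank(x)    # 1 2 2 3    ties share a rank, no gap
percent_rank(x)  # 0, 1/3, 1/3, 1    (min_rank - 1) / (n - 1)
cume_dist(x)     # 0.25 0.75 0.75 1  proportion of values <= each value

# Missing values are not ranked
min_rank(c(2, NA, 1))  # 2 NA 1
```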
Details
Examples
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
min_rank(x)
dense_rank(x)
percent_rank(x)
cume_dist(x)
ntile(x, 2)
ntile(1:8, 3)
Description
This is a vectorised version of switch(): you can replace numeric values based on their
position or their name, and character or factor values only by their name. This is an S3
generic: dplyr provides methods for numeric, character, and factors. For logical vectors,
use if_else(). For more complicated criteria, use case_when().
You can use recode() directly with factors; it will preserve the existing order of levels while
changing the values. Alternatively, you can use recode_factor(), which will change the
order of levels to match the order of replacements. See the forcats package for more tools
for working with factors and their levels.
[Questioning] recode() is questioning because the arguments are in the wrong order. We
have new <- old, mutate(df, new = old), and rename(df, new = old) but recode(x, old = new).
We don’t yet know how to fix this problem, but it’s likely to involve creating a new function
then retiring or deprecating recode().
Usage
Arguments
.x A vector to modify
... <dynamic-dots> Replacements. For character and factor .x, these should
be named and replacement is based only on their name. For numeric .x,
these can be named or not. If not named, the replacement is done based
on position i.e. .x represents positions to look for in replacements. See
examples.
When named, the argument names should be the current values to be re-
placed, and the argument values should be the new (replacement) values.
All replacements must be the same type, and must have either length one
or the same length as .x.
.default If supplied, all values not otherwise matched will be given this value. If
not supplied and if the replacements are the same type as the original
values in .x, unmatched values are not changed. If not supplied and if
the replacements are not compatible, unmatched values are replaced with
NA.
.default must be either length 1 or the same length as .x.
.missing If supplied, any missing values in .x will be replaced by this value. Must
be either length 1 or the same length as .x.
.ordered If TRUE, recode_factor() creates an ordered factor.
Value
A vector the same length as .x, and the same type as the first of ..., .default, or .missing.
recode_factor() returns a factor whose levels are in the same order as in .... The levels
in .default and .missing come last.
See Also
na_if() to replace specified values with NA.
coalesce() to replace missing values with a specified value.
tidyr::replace_na() to replace NA with a value.
Examples
# For character values, recode values with named arguments only. Unmatched
# values are unchanged.
char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)
recode(char_vec, a = "Apple")
recode(char_vec, a = "Apple", b = "Banana")
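The .default and .missing arguments and the level ordering of recode_factor() described above can be sketched as follows. The input vectors and result names (kept, filled, fct) are made up for illustration.

```r
library(dplyr)

char_vec <- c("a", "b", "c", NA)

# Unmatched values are kept because input and replacements are both character
kept <- recode(char_vec, a = "Apple", b = "Banana")

# .default catches unmatched values; .missing replaces NA
filled <- recode(char_vec, a = "Apple", .default = "Other", .missing = "Unknown")

# recode_factor() orders the levels by the order of the replacements
fct <- recode_factor(c("a", "b", "c"), b = "Banana", a = "Apple")
levels(fct)
```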
Description
Use relocate() to change column positions, using the same syntax as select() to make it
easy to move blocks of columns at once.
Usage
relocate(.data, ..., .before = NULL, .after = NULL)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <tidy-select> Columns to move.
.before, .after
<tidy-select> Destination of columns selected by .... Supplying neither
will move columns to the left-hand side; specifying both is an error.
Value
An object of the same type as .data. The output has the following properties:
Methods
This function is a generic, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Examples
df <- tibble(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")
df %>% relocate(f)
df %>% relocate(a, .after = c)
df %>% relocate(f, .before = b)
df %>% relocate(a, .after = last_col())
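Because relocate() uses the select() syntax, whole blocks of columns can be moved with a selection helper. A minimal sketch, with illustrative result names chars_first and nums_last:

```r
library(dplyr)

df <- tibble(a = 1, b = 1, c = 1, d = "a", e = "a", f = "a")

# With neither .before nor .after, the selected block moves to the front
chars_first <- df %>% relocate(where(is.character))
names(chars_first)

# Move all numeric columns after the last column
nums_last <- df %>% relocate(where(is.numeric), .after = last_col())
names(nums_last)
```

In both calls the resulting column order is d, e, f, a, b, c.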
Description
rename() changes the names of individual variables using new_name = old_name syntax;
rename_with() renames columns using a function.
Usage
rename(.data, ...)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... For rename(): <tidy-select> Use new_name = old_name to rename selected
variables.
For rename_with(): additional arguments passed onto .fn.
.fn A function used to transform the selected .cols. Should return a character
vector the same length as the input.
.cols <tidy-select> Columns to rename; defaults to all columns.
Value
An object of the same type as .data. The output has the following properties:
Methods
This function is a generic, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
See Also
Other single table verbs: arrange(), filter(), mutate(), select(), slice(), summarise()
Examples
iris <- as_tibble(iris) # so it prints a little nicer
rename(iris, petal_length = Petal.Length)
rename_with(iris, toupper)
rename_with(iris, toupper, starts_with("Petal"))
rename_with(iris, ~ tolower(gsub(".", "_", .x, fixed = TRUE)))
Description
[Experimental]
These functions provide a framework for modifying rows in a table using a second table of
data. The two tables are matched by a set of key variables whose values must uniquely
identify each row. The functions are inspired by SQL’s INSERT, UPDATE, and DELETE, and
can optionally modify in place for selected backends.
rows_insert() adds new rows (like INSERT); key values in y must not occur in x.
rows_update() modifies existing rows (like UPDATE); key values in y must occur in x.
rows_patch() works like rows_update() but only overwrites NA values.
rows_upsert() inserts or updates depending on whether or not the key value in y
already exists in x.
rows_delete() deletes rows (like DELETE); key values in y must exist in x.
Usage
rows_insert(x, y, by = NULL, ..., copy = FALSE, in_place = FALSE)
Arguments
x, y A pair of data frames or data frame extensions (e.g. a tibble). y must
have the same columns as x or a subset.
by An unnamed character vector giving the key columns. The key values
must uniquely identify each row (i.e. each combination of key values
occurs at most once), and the key columns must exist in both x and y.
By default, we use the first column in y, since the first column is a reasonable
place to put an identifier variable.
... Other parameters passed onto methods.
copy If x and y are not from the same data source, and copy is TRUE, then y
will be copied into the same src as x. This allows you to join tables across
srcs, but it is a potentially expensive operation so you must opt into it.
in_place Should x be modified in place? This argument is only relevant for mutable
backends (e.g. databases, data.tables).
When TRUE, a modified version of x is returned invisibly; when FALSE, a
new object representing the resulting changes is returned.
Value
An object of the same type as x. The order of the rows and columns of x is preserved as
much as possible. The output has the following properties:
rows_update() preserves rows as is; rows_insert() and rows_upsert() return all existing
rows and potentially new rows; rows_delete() returns a subset of the rows.
Columns are not added, removed, or relocated, though the data may be updated.
Groups are taken from x.
Data frame attributes are taken from x.
If in_place = TRUE, the result will be returned invisibly.
Examples
data <- tibble(a = 1:3, b = letters[c(1:2, NA)], c = 0.5 + 0:2)
data
# Insert
rows_insert(data, tibble(a = 4, b = "z"))
try(rows_insert(data, tibble(a = 3, b = "z")))
# Update
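The remaining verbs can be sketched on the same three-row tibble. This is a minimal illustration rather than the manual's own example: the result names are made up, and by = "a" is passed explicitly instead of relying on the default key column.

```r
library(dplyr)

data <- tibble(a = 1:3, b = letters[c(1:2, NA)], c = 0.5 + 0:2)

# Update: overwrite b for the keys 2 and 3
updated <- rows_update(data, tibble(a = 2:3, b = "z"), by = "a")

# Patch: like update, but only NA values in x are overwritten
patched <- rows_patch(data, tibble(a = 2:3, b = "z"), by = "a")

# Upsert: update keys 2 and 3, insert key 4 (c is NA for the new row)
upserted <- rows_upsert(data, tibble(a = 2:4, b = "z"), by = "a")

# Delete: drop the rows whose keys appear in y
deleted <- rows_delete(data, tibble(a = 2:3), by = "a")
```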
Description
rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful
when a vectorised function doesn’t exist.
Most dplyr verbs preserve row-wise grouping. The exception is summarise(), which returns
a grouped_df. You can explicitly ungroup with ungroup() or as_tibble(), or convert to a
grouped_df with group_by().
Usage
rowwise(data, ...)
Arguments
data Input data frame.
... <tidy-select> Variables to be preserved when calling summarise(). This
is typically a set of variables whose combination uniquely identifies each
row.
NB: unlike group_by() you cannot create new variables here but instead
you can select multiple variables with (e.g.) everything().
Value
A row-wise data frame with class rowwise_df. Note that a rowwise_df is implicitly grouped
by row, but is not a grouped_df.
List-columns
Because a rowwise data frame has exactly one row per group, it offers a small convenience for working
with list-columns. Normally, summarise() and mutate() extract a group's worth of data
with [. But when you index a list in this way, you get back another list. When you’re
working with a rowwise tibble, then dplyr will use [[ instead of [ to make your life a little
easier.
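The difference between [ and [[ can be seen in a small sketch. The tibble and the column name len are illustrative.

```r
library(dplyr)

df <- tibble(id = 1:3, x = list(1, 2:3, 4:6))

# Without rowwise(), x is the whole list-column, so length(x) is 3
plain <- df %>% mutate(len = length(x))

# With rowwise(), x is the element for the current row
by_row <- df %>% rowwise() %>% mutate(len = length(x))
```

Here plain$len is 3 in every row, while by_row$len holds the per-row lengths 1, 2 and 3.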
See Also
nest_by() for a convenient way of creating rowwise data frames with nested data.
Examples
df <- tibble(x = runif(6), y = runif(6), z = runif(6))
# Compute the mean of x, y, z in each row
df %>% rowwise() %>% mutate(m = mean(c(x, y, z)))
# use c_across() to more easily select many variables
df %>% rowwise() %>% mutate(m = mean(c_across(x:z)))
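# If you want one row per simulation, put the results in a list().
# params is not defined in this excerpt; the values below are illustrative.
params <- tibble(sim = 1:3, n = 1:3, mean = c(1, 2, -1), sd = c(1, 4, 2))
params %>%
  rowwise(sim) %>%
  summarise(z = list(rnorm(n, mean, sd)))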
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
The variants suffixed with _if, _at or _all apply an expression (sometimes several) to all
variables within a specified subset. This subset can contain all variables (_all variants), a
vars() selection (_at variants), or variables selected with a predicate (_if variants).
The verbs with scoped variants are:
mutate(), transmute() and summarise(). See summarise_all().
filter(). See filter_all().
group_by(). See group_by_all().
rename() and select(). See select_all().
arrange(). See arrange_all().
There are three kinds of scoped variants. They differ in the scope of the variable selection
on which operations are applied:
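A minimal sketch of the three kinds of variants next to their across() equivalents, on a small made-up tibble (all object names are illustrative):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3), y = c(10, 20, 30))
num <- df %>% select(-g)

# _all: every variable
all_old <- num %>% summarise_all(mean)
all_new <- num %>% summarise(across(everything(), mean))

# _at: an explicit vars() selection
at_old <- df %>% summarise_at(vars(x, y), mean)
at_new <- df %>% summarise(across(c(x, y), mean))

# _if: variables selected with a predicate
if_old <- df %>% summarise_if(is.numeric, mean)
if_new <- df %>% summarise(across(where(is.numeric), mean))
```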
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports quosure-style
lambda functions and strings representing function names.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
Grouping variables
Most of these operations also apply on the grouping variables when they are part of the
selection. This includes:
This is not the case for summarising and mutating variants where operations are not applied
on grouping variables. The behaviour depends on whether the selection is implicit (_all and
_if selections) or explicit (_at selections). Grouping variables covered by explicit selections
(with summarise_at(), mutate_at(), and transmute_at()) are always an error. For implicit
selections, the grouping variables are always ignored. In this case, the level of verbosity
depends on the kind of operation:
Summarising operations (summarise_all() and summarise_if()) ignore grouping variables
silently because it is obvious that operations are not applied on grouping variables.
On the other hand it isn’t as obvious in the case of mutating operations (mutate_all(),
mutate_if(), transmute_all(), and transmute_if()). For this reason, they issue a
message indicating which grouping variables are ignored.
Description
Select (and optionally rename) variables in a data frame, using a concise mini-language
that makes it easy to refer to variables based on their name (e.g. a:f selects all columns
from a on the left to f on the right). You can also use predicate functions like is.numeric
to select variables based on their properties.
Usage
select(.data, ...)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <tidy-select> One or more unquoted expressions separated by commas.
Variable names can be used as if they were positions in the data frame,
so expressions like x:y can be used to select a range of variables.
Value
An object of the same type as .data. The output has the following properties:
Rows are not affected.
Output columns are a subset of input columns, potentially with a different order.
Columns will be renamed if new_name = old_name form is used.
Data frame attributes are preserved.
Groups are maintained; you can’t select off grouping variables.
Methods
This function is a generic, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
Examples
Here we show the usage for the basic selection operators. See the specific help pages to
learn about helpers like starts_with().
The selection language can be used in functions like dplyr::select() or tidyr::pivot_longer().
Let’s first attach the tidyverse:
library(tidyverse)
Select multiple variables by separating them with commas. Note how the order of columns
is determined by the order of inputs:
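A minimal sketch of such a selection, using the starwars data that ships with dplyr (the name picked is illustrative):

```r
library(dplyr)

# Columns come back in the order they were requested
picked <- starwars %>% select(homeworld, height, mass)
names(picked)
```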
Functions like tidyr::pivot_longer() don’t take variables with dots. In this case use c()
to select multiple variables:
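For instance, assuming the tidyr package is installed, a sketch on the built-in iris data could look like this (the name long is illustrative):

```r
library(dplyr)
library(tidyr)

# c() bundles the columns to pivot into a single selection
long <- iris %>% pivot_longer(c(Sepal.Length, Petal.Length))
names(long)
```

The two selected columns are stacked into the default name and value columns, so the 150 input rows become 300.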
Operators:
The : operator selects a range of consecutive variables:
starwars %>% select(name:mass)
#> # A tibble: 87 x 3
#> name height mass
#> <chr> <int> <dbl>
#> 1 Luke Skywalker 172 77
#> 2 C-3PO 167 75
#> 3 R2-D2 96 32
#> 4 Darth Vader 202 136
#> # ... with 83 more rows
The ! operator negates a selection:
starwars %>% select(!(name:mass))
#> # A tibble: 87 x 11
#> hair_color skin_color eye_color birth_year sex gender homeworld species films vehicles
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 blond fair blue 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [
#> 2 <NA> gold yellow 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [
#> 3 <NA> white, blue red 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [
#> 4 none white yellow 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr
#> # ... with 83 more rows
To take the difference between two selections, combine the & and ! operators:
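A minimal sketch on the built-in iris data, with an illustrative result name:

```r
library(dplyr)

# Columns starting with "Petal", minus those ending in "Width"
petal_not_width <- iris %>%
  select(starts_with("Petal") & !ends_with("Width"))
names(petal_not_width)
```

Only Petal.Length remains in the result.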
See Also
Other single table verbs: arrange(), filter(), mutate(), rename(), slice(), summarise()
Description
These functions override the set functions provided in base to make them generic so that
efficient versions for data frames and other tables can be provided. The default methods call
the base versions. Beware that intersect(), union() and setdiff() remove duplicates.
Usage
union_all(x, y, ...)
Arguments
Examples
mtcars$model <- rownames(mtcars)
first <- mtcars[1:20, ]
second <- mtcars[10:32, ]
intersect(first, second)
union(first, second)
setdiff(first, second)
setdiff(second, first)
union_all(first, second)
setequal(mtcars, mtcars[32:1, ])
# Handling of duplicates:
a <- data.frame(column = c(1:10, 10))
b <- data.frame(column = c(1:5, 5))
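The effect on duplicates can be sketched by counting rows for the two data frames defined above. The dplyr:: prefix is used here only to make explicit which generics are called.

```r
library(dplyr)

a <- data.frame(column = c(1:10, 10))
b <- data.frame(column = c(1:5, 5))

# intersect(), union() and setdiff() drop duplicated rows
n_intersect <- nrow(dplyr::intersect(a, b))  # values 1 to 5
n_union <- nrow(dplyr::union(a, b))          # values 1 to 10
n_setdiff <- nrow(dplyr::setdiff(a, b))      # values 6 to 10

# union_all() keeps every row from both inputs
n_union_all <- nrow(dplyr::union_all(a, b))  # 11 + 6 rows
```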
Description
slice() lets you index rows by their (integer) locations. It allows you to select, remove,
and duplicate rows. It is accompanied by a number of helpers for common use cases:
slice_head() and slice_tail() select the first or last rows.
slice_sample() randomly selects rows.
slice_min() and slice_max() select rows with the lowest or highest values of a variable.
If .data is a grouped_df, the operation will be performed on each group, so that (e.g.)
slice_head(df, n = 5) will select the first five rows in each group.
Usage
slice(.data, ..., .preserve = FALSE)
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... For slice(): <data-masking> Integer row values.
Provide either positive values to keep, or negative values to drop. The
values provided must be either all positive or all negative. Indices beyond
the number of rows in the input are silently ignored.
For slice_helpers(), these arguments are passed on to methods.
.preserve Relevant when the .data input is grouped. If .preserve = FALSE (the
default), the grouping structure is recalculated based on the resulting
data, otherwise the grouping is kept as is.
n, prop Provide either n, the number of rows, or prop, the proportion of rows to
select. If neither are supplied, n = 1 will be used.
If n is greater than the number of rows in the group (or prop > 1), the
result will be silently truncated to the group size. If the proportion of a
group size is not an integer, it is rounded down.
order_by Variable or function of variables to order by.
with_ties Should ties be kept together? The default, TRUE, may return more rows
than you request. Use FALSE to ignore ties, and return the first n rows.
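A minimal sketch of how n, prop and with_ties interact, using the built-in mtcars data (32 rows, 11 of which share the lowest cyl value). The result names are illustrative.

```r
library(dplyr)

# with_ties = TRUE (the default) keeps every row tied for the minimum
lowest_with_ties <- mtcars %>% slice_min(cyl, n = 1)

# with_ties = FALSE returns exactly n rows
lowest_no_ties <- mtcars %>% slice_min(cyl, n = 1, with_ties = FALSE)

# prop is rounded down: 10% of 32 rows gives 3 rows
head_prop <- mtcars %>% slice_head(prop = 0.1)

# n larger than the data is silently truncated to the available rows
head_all <- mtcars %>% slice_head(n = 100)
```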
Details
Slice does not work with relational databases because they have no intrinsic notion of row
order. If you want to perform the equivalent operation, use filter() and row_number().
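On a local tibble the two forms give the same rows, as the following sketch shows. It only illustrates the filter() and row_number() idiom; on a database backend an explicit ordering would also be needed, which is not shown here.

```r
library(dplyr)

cars <- as_tibble(mtcars)

by_slice <- cars %>% slice(1:3)
by_filter <- cars %>% filter(row_number() <= 3)
```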
Value
An object of the same type as .data. The output has the following properties:
Each row may appear 0, 1, or many times in the output.
Columns are not modified.
Groups are not modified.
Data frame attributes are preserved.
Methods
These functions are generics, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
Methods available in currently loaded packages:
slice(): no methods found.
slice_head(): no methods found.
slice_tail(): no methods found.
slice_min(): no methods found.
slice_max(): no methods found.
slice_sample(): no methods found.
See Also
Other single table verbs: arrange(), filter(), mutate(), rename(), select(), summarise()
Examples
mtcars %>% slice(1L)
# Similar to tail(mtcars, 1):
mtcars %>% slice(n())
mtcars %>% slice(5:n())
# Rows can be dropped with negative indices:
slice(mtcars, -(1:4))
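# All slice helpers operate per group, silently truncating to the group
# size, so the following code works without error.
# df is not defined in this excerpt; an illustrative definition:
df <- data.frame(group = rep(c("a", "b", "c"), c(1, 2, 4)), x = runif(7))
df %>% group_by(group) %>% slice_head(n = 2)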
Description
These functions are critical when writing functions that translate R functions to sql
functions. Typically a conversion function should escape all its inputs and return an sql object.
Usage
sql(...)
Arguments
... Character vectors that will be combined into a single SQL expression.
Description
The original data, from SWAPI, the Star Wars API, https://swapi.dev/, has been revised
to reflect additional research into gender and sex determinations of characters.
Usage
starwars
Format
A tibble with 87 rows and 14 variables:
Examples
starwars
Description
This data is a subset of the NOAA Atlantic hurricane database best track data,
https://www.nhc.noaa.gov/data/#hurdat. The data includes the positions and attributes of 198
tropical storms, measured every six hours during the lifetime of a storm.
Usage
storms
Format
See Also
Examples
storms
Description
summarise() creates a new data frame. It will have one (or more) rows for each combination
of grouping variables; if there are no grouping variables, the output will have a single row
summarising all observations in the input. It will contain one column for each grouping
variable and one column for each of the summary statistics that you have specified.
summarise() and summarize() are synonyms.
Usage
Arguments
.data A data frame, data frame extension (e.g. a tibble), or a lazy data frame
(e.g. from dbplyr or dtplyr). See Methods, below, for more details.
... <data-masking> Name-value pairs of summary functions. The name will
be the name of the variable in the result.
The value can be:
A vector of length 1, e.g. min(x), n(), or sum(is.na(y)).
A vector of length n, e.g. quantile().
A data frame, to add multiple columns from a single expression.
.groups [Experimental] Grouping structure of the result.
"drop_last": dropping the last level of grouping. This was the only
supported option before version 1.0.0.
"drop": All levels of grouping are dropped.
"keep": Same grouping structure as .data.
"rowwise": Each row is its own group.
When .groups is not specified, it is chosen based on the number of rows
of the results:
If all the results have 1 row, you get "drop_last".
If the number of rows varies, you get "keep".
In addition, a message informs you of that choice, unless the result is
ungrouped, the option "dplyr.summarise.inform" is set to FALSE, or when
summarise() is called from a function in a package.
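The four .groups options can be compared in a short sketch on mtcars grouped by two variables. The object names are illustrative.

```r
library(dplyr)

by_cyl_vs <- mtcars %>% group_by(cyl, vs)

drop_last <- by_cyl_vs %>% summarise(n = n(), .groups = "drop_last")
dropped <- by_cyl_vs %>% summarise(n = n(), .groups = "drop")
kept <- by_cyl_vs %>% summarise(n = n(), .groups = "keep")
row_wise <- by_cyl_vs %>% summarise(n = n(), .groups = "rowwise")

group_vars(drop_last)  # only cyl remains
group_vars(dropped)    # no grouping variables
group_vars(kept)       # cyl and vs
class(row_wise)        # includes "rowwise_df"
```

Setting .groups explicitly also avoids the informational message described above.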
Value
Useful functions
Backend variations
The data frame backend supports creating a variable and using it in the same summary. This
means that previously created summary variables can be further transformed or combined
within the summary, as in mutate(). However, it also means that summary variables with
the same names as previous variables overwrite them, making those variables unavailable
to later summary variables.
This behaviour may not be supported in other backends. To avoid unexpected results,
consider using new names for your summary variables, especially when creating multiple
summaries.
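A minimal sketch of the pitfall on the data frame backend, and of the suggested fix of using new names. The object and column names are illustrative.

```r
library(dplyr)

# sd() is computed on the already-summarised disp (one value per group),
# so the result is NA for every group
overwritten <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp = mean(disp), sd = sd(disp))

# Using new names keeps the original disp column available
distinct_names <- mtcars %>%
  group_by(cyl) %>%
  summarise(disp_mean = mean(disp), disp_sd = sd(disp))
```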
Methods
This function is a generic, which means that packages can provide implementations
(methods) for other classes. See the documentation of individual methods for extra arguments
and differences in behaviour.
The following methods are currently available in loaded packages: no methods found.
See Also
Other single table verbs: arrange(), filter(), mutate(), rename(), select(), slice()
Examples
# A summary applied to ungrouped tbl returns a single row
mtcars %>%
summarise(mean = mean(disp), n = n())
# You use a data frame to create multiple columns so you can wrap
# this up into a function:
my_quantile <- function(x, probs) {
tibble(x = quantile(x, probs), probs = probs)
}
mtcars %>%
group_by(cyl) %>%
summarise(my_quantile(disp, c(0.25, 0.75)))
# Each summary call removes one grouping level (since that group
# is now just a single row)
mtcars %>%
group_by(cyl, vs) %>%
summarise(cyl_n = n()) %>%
group_vars()
# BEWARE: sd() below sees the already-summarised disp, not the original column
mtcars %>%
group_by(cyl) %>%
summarise(disp = mean(disp), sd = sd(disp))
Description
[Superseded]
Scoped verbs (_if, _at, _all) have been superseded by the use of across() in an existing
verb. See vignette("colwise") for details.
The scoped variants of summarise() make it easy to apply the same transformation to
multiple variables. There are three variants.
summarise_all() affects every variable
summarise_at() affects variables selected with a character vector or vars()
summarise_if() affects variables selected with a predicate function
Usage
summarise_all(.tbl, .funs, ...)
Arguments
.tbl A tbl object.
.funs A function fun, a quosure style lambda ~ fun(.) or a list of either form.
... Additional arguments for the function calls in .funs. These are evaluated
only once, with tidy dots support.
.predicate A predicate function to be applied to the columns or a logical vector.
The variables for which .predicate is or returns TRUE are selected. This
argument is passed to rlang::as_function() and thus supports quosure-style
lambda functions and strings representing function names.
.vars A list of columns generated by vars(), a character vector of column
names, a numeric vector of column positions, or NULL.
.cols This argument has been renamed to .vars to fit dplyr’s terminology and
is deprecated.
Value
A data frame. By default, the newly created columns have the shortest names needed to
uniquely identify the output. To force inclusion of a name, even when not needed, name
the input (see examples for details).
Grouping variables
If applied on a grouped tibble, these operations are not applied to the grouping variables.
The behaviour depends on whether the selection is implicit (_all and _if selections) or
explicit (_at selections).
Grouping variables covered by explicit selections in summarise_at() are always an error.
Add -group_cols() to the vars() selection to avoid this:
data %>%
summarise_at(vars(-group_cols(), ...), myoperation)
Grouping variables covered by implicit selections are silently ignored by summarise_all()
and summarise_if().
Naming
The names of the new columns are derived from the names of the input variables and the
names of the functions.
if there is only one unnamed function (i.e. if .funs is an unnamed list of length one),
the names of the input variables are used to name the new columns;
for _at functions, if there is only one unnamed variable (i.e., if .vars is of the form
vars(a_single_column)) and .funs has length greater than one, the names of the
functions are used to name the new columns;
otherwise, the new names are created by concatenating the names of the input variables
and the names of the functions, separated with an underscore "_".
The .funs argument can be a named or unnamed list. If a function is unnamed and the
name cannot be derived automatically, a name of the form "fn#" is used. Similarly, vars()
accepts named and unnamed arguments. If a variable in .vars is named, a new column by
that name will be created.
Name collisions in the new columns are disambiguated using a unique suffix.
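The three naming rules can be sketched on a small made-up tibble (object names are illustrative):

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3), y = c(4, 5, 6))

# One unnamed function: columns keep the variable names
one_fun <- df %>% summarise_at(vars(x, y), mean)

# One unnamed variable, several functions: columns take the function names
one_var <- df %>% summarise_at(vars(x), list(mean = mean, sd = sd))

# Several variables and functions: names combine both, e.g. x_mean
many <- df %>% summarise_at(vars(x, y), list(mean = mean, sd = sd))

names(one_fun)
names(one_var)
names(many)
```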
Life cycle
The functions are maturing, because the naming scheme and the disambiguation algorithm
are subject to change in dplyr 0.9.0.
See Also
The other scoped verbs, vars()
Examples
# The _at() variants directly support strings:
starwars %>%
summarise_at(c("height", "mass"), mean, na.rm = TRUE)
# ->
starwars %>% summarise(across(c("height", "mass"), ~ mean(.x, na.rm = TRUE)))
# You can also supply selection helpers to _at() functions but you have
# to quote them with vars():
starwars %>%
summarise_at(vars(height:mass), mean, na.rm = TRUE)
# ->
starwars %>%
summarise(across(height:mass, ~ mean(.x, na.rm = TRUE)))
Description
This is a generic method that dispatches based on the first argument.
Usage
tbl(src, ...)
is.tbl(x)
Arguments
src A data source
... Other arguments passed on to the individual methods
x Any object
Description
vars() was only needed for the scoped verbs, which have been superseded by the use of
across() in an existing verb. See vignette("colwise") for details.
This helper is intended to provide equivalent semantics to select(). It is used for instance
in scoped summarising and mutating verbs (mutate_at() and summarise_at()).
Note that verbs accepting a vars() specification also accept a numeric vector of positions
or a character vector of column names.
Usage
vars(...)
Arguments
... <tidy-select> Variables to include/exclude in mutate/summarise. You
can use the same specifications as in select(). If missing, defaults to all
non-grouping variables.
See Also
all_vars() and any_vars() for other quoting functions that you can use with scoped verbs.
Description
[Experimental]
This is an experimental new function that allows you to modify the grouping variables for
a single operation.
Usage
with_groups(.data, .groups, .f, ...)
Arguments
.data A data frame
.groups <tidy-select> One or more variables to group by. Unlike group_by(), you
can only group by existing variables, and you can use tidy-select syntax
like c(x, y, z) to select multiple variables.
Use NULL to temporarily ungroup.
.f Function to apply to regrouped data. Supports purrr-style ~ syntax.
... Additional arguments passed on to .f.
Examples
df <- tibble(g = c(1, 1, 2, 2, 3), x = runif(5))
df %>%
with_groups(g, mutate, x_mean = mean(x))
df %>%
with_groups(g, ~ mutate(.x, x1 = first(x)))
df %>%
group_by(g) %>%
with_groups(NULL, mutate, x_mean = mean(x))
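A deterministic variant of the examples above makes the temporary nature of the grouping easier to check. The fixed x values and result names are assumptions for illustration.

```r
library(dplyr)

df <- tibble(g = c(1, 1, 2, 2, 3), x = c(1, 3, 5, 7, 9))

# Group by g only for this one mutate(); the input grouping (none) is restored
per_group <- df %>% with_groups(g, mutate, x_mean = mean(x))

# Temporarily ungroup a grouped input; the grouping by g is restored afterwards
overall <- df %>%
  group_by(g) %>%
  with_groups(NULL, mutate, x_mean = mean(x))

per_group$x_mean
overall$x_mean
```

In per_group the mean is computed within each value of g, while in overall it is the mean of all five x values.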