
BIO360 Biometrics I, Fall 2007 H. Wagner, Biology UTM

R cheat sheet Modified from: P. Dalgaard (2002). Introductory Statistics with R. Springer, New York.

1. Basics

Commands
objects()    List of objects in workspace
ls()    Same
rm(object)    Delete ‘object’

Assignments
<-    Assign value to a variable
=    Same

Getting help
help(fun)    Display help file for function fun()
args(fun)    List arguments of function fun()

Libraries / packages
library(pkg)    Open package (library) ‘pkg’
library(help=pkg)    Display description of package ‘pkg’

2. Vectors and data types

Generating
seq(-4,4,0.1)    Sequence: -4.0, -3.9, -3.8, ..., 3.9, 4.0
2:7    Same as seq(2,7,1)
c(5,7,9,1:3)    Concatenation (vector): 5 7 9 1 2 3
rep(1,5)    1 1 1 1 1
rep(4:6,1:3)    4 5 5 6 6 6
gl(3,2,12)    Factor with 3 levels, repeat each level in blocks of 2, up to length 12 (1 1 2 2 3 3 1 1 2 2 3 3)

Coercion
as.numeric(x)    Convert to numeric
as.character(x)    Convert to text string
as.logical(x)    Convert to logical
factor(x)    Create factor from vector x
unlist(x)    Convert list, result from table() etc. to vector

3. Data frames

Accessing data
data.frame(height, weight)    Collect vectors ‘height’ and ‘weight’ into data frame
dfr$var    Select vector ‘var’ in data frame ‘dfr’
attach(dfr)    Put data frame in search path
detach()    - and remove it from the path

Editing
dfr2 <- edit(dfr)    Open data frame ‘dfr’ in spreadsheet, write changed version into new data frame ‘dfr2’
fix(dfr)    Open data frame ‘dfr’ in spreadsheet, changes will overwrite entries in ‘dfr’

Summary
dim(dfr)    Number of rows and columns in data frame ‘dfr’, works also for matrices and arrays
summary(dfr)    Summary statistics for each variable in ‘dfr’

4. Input and export of data

General
data(name)    Built-in data set
read.table("file.txt")    Read from external ASCII file

Arguments to read.table()
header = TRUE    First line has variable names
row.names = 1    First column has row names
sep = ","    Data are separated by commas
sep = "\t"    Data are separated by tabs
dec = ","    Decimal point is comma
na.strings = "."    Missing value is dot

Variants of read.table()
read.csv("file.csv")    Comma separated
read.delim("file.txt")    Tab delimited text file

Export
write.table()    see help(write.table) for details

Adding names
names()    Column names for data frame or list only
dimnames()    Row and column names, also for matrix

5. Indexing / selection / sorting

Vectors
x[1]    First element
x[1:5]    Subvector containing the first five elements
x[c(2,3,5)]    Elements nos. 2, 3, and 5
x[y <= 30]    Selection by logical expression
x[sex == "male"]    Selection by factor variable
i <- c(2,3,5); x[i]    Selection by numerical variable
k <- (y <= 30); x[k]    Selection by logical variable
length(x)    Returns length of vector x

Matrices, data frames
m[4, ]    Fourth row
m[ ,3]    Third column
dfr[dfr$var <= 30, ]    Partial data frame (not for matrices)
subset(dfr, var <= 30)    Same, often simpler (not for matrices)
m[m[ ,3] <= 30, ]    Partial matrix (also for data frames)

Sorting
sort(c(7,9,10,6))    Returns the sorted values: 6, 7, 9, 10
order(c(7,9,10,6))    Returns the element number in order of ascending values: 4, 1, 2, 3
order(c(7,9,10,6), decreasing = TRUE)    Same, but in order of decreasing values: 3, 2, 1, 4
rank(c(7,9,10,6))    Returns the ranks in order of ascending values: 2, 3, 4, 1
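
The entries on generating, indexing and sorting vectors can be tried out with a short sketch like the one below. The vectors `age` and `sex` are made-up example data, not part of the course material; the values in the comments follow the outputs listed in the tables.

```r
# Generating vectors (section 2)
s <- seq(-4, 4, 0.1)          # 81 values from -4.0 to 4.0
r <- rep(4:6, 1:3)            # 4 5 5 6 6 6
g <- gl(3, 2, 12)             # factor: 1 1 2 2 3 3 1 1 2 2 3 3

# Indexing (section 5); 'age' and 'sex' are hypothetical example data
age <- c(25, 41, 30, 19, 52)
sex <- factor(c("male", "female", "male", "female", "male"))
young <- age[age <= 30]       # selection by logical expression: 25 30 19
males <- age[sex == "male"]   # selection by factor variable: 25 30 52

# Sorting
v <- c(7, 9, 10, 6)
sorted <- sort(v)                       # 6 7 9 10
ord    <- order(v)                      # 4 1 2 3
ord_d  <- order(v, decreasing = TRUE)   # 3 2 1 4
rk     <- rank(v)                       # 2 3 4 1
same   <- v[ord]                        # indexing by order() reproduces sort(v)
```

Note that `order()` returns positions while `rank()` returns ranks, so `v[order(v)]` gives the sorted vector but `v[rank(v)]` generally does not.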


6. Missing values

Functions
is.na(x)    Logical vector. TRUE where x has NA
complete.cases(x1,x2,...)    Neither missing in x1, nor x2, nor ...

Arguments to other functions
na.rm =    In statistical functions: remove missing values if TRUE, return NA if FALSE
na.last =    In ‘sort’: TRUE, FALSE and NA mean “last”, “first”, and “discard”
na.action =    In ‘lm()’, etc.: values na.fail, na.omit, na.exclude
na.print =    In ‘summary()’ and ‘print()’: how to represent NA in output
na.strings =    In ‘read.table()’: code(s) for NA in input

8. Programming

Conditional execution
if(p < 0.5) print("Hooray")    Print “Hooray” if condition is true
if(p < 0.5) { print("Hooray"); i = i + 1 }    If condition is true, perform all commands within the curly brackets { }
if(p < 0.5) { print("Hooray") } else { i = i + 1 }    Conditional execution with an alternative

Loop
for(i in 1:10) { print(i) }    Go through loop 10 times
i <- 1; while(i <= 10) { print(i); i = i + 1 }    Same, but more complicated

User-defined function
fun <- function(a, b, doit = FALSE) { if(doit) {a + b} else 0 }    Defines a function ‘fun’ that returns the sum of a and b if the argument ‘doit’ is set to TRUE, or zero, if ‘doit’ is FALSE
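
A minimal sketch of how the missing-value arguments and the user-defined function from the tables behave. The vectors `x` and `y` are invented for illustration.

```r
# Hypothetical data with one missing value each
x <- c(4, NA, 7, 1)
y <- c(1, 2, NA, 4)

miss   <- is.na(x)                   # FALSE TRUE FALSE FALSE
m_na   <- mean(x)                    # NA, because na.rm defaults to FALSE
m_ok   <- mean(x, na.rm = TRUE)      # 4
s_last <- sort(x, na.last = TRUE)    # 1 4 7 NA
s_drop <- sort(x)                    # default na.last = NA discards NA: 1 4 7
both   <- complete.cases(x, y)       # TRUE FALSE FALSE TRUE
n_ok   <- sum(complete.cases(x))     # 3 non-missing elements

# The user-defined function from section 8
fun <- function(a, b, doit = FALSE) {
  if (doit) a + b else 0
}
r0 <- fun(2, 3)                # 0
r1 <- fun(2, 3, doit = TRUE)   # 5
```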

7. Numerical functions

Mathematical
log(x)    Natural logarithm of x
log(x, 10)    Base 10 logarithm of x
exp(x)    Exponential function e^x
sin(x)    Sine
cos(x)    Cosine
tan(x)    Tangent
asin(x)    Arcsin (inverse sine)
min(x)    Smallest value in vector
min(x1, x2, ...)    Minimum over several vectors
max(x)    Largest value in vector
range(x)    Like c(min(x), max(x))
pmin(x1, x2, ...)    Parallel (elementwise) minimum over multiple equally long vectors
length(x)    Number of elements in vector
sum(x)    Sum of values in vector
cumsum(x)    Cumulative sum of values in vector
sum(complete.cases(x))    Number of non-missing elements

Statistical
mean(x)    Average
median(x)    Median
quantile(x, p)    Quantiles: median = quantile(x, 0.5)
var(x)    Variance
sd(x)    Standard deviation
cor(x, y)    Pearson correlation
cor(x, y, method = "spearman")    Spearman rank correlation

9. Operators

Arithmetic
+    Addition
-    Subtraction
*    Multiplication
/    Division
^    Raise to the power of
%/%    Integer division: 5 %/% 3 = 1
%%    Remainder from integer division: 5 %% 3 = 2

Logical or relational
==    Equal to
!=    Not equal to
<    Less than
>    Greater than
<=    Less than or equal to
>=    Greater than or equal to
is.na(x)    Missing?
&    Logical AND
|    Logical OR
!    Logical NOT
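
The difference between `min()` and `pmin()`, and the integer-division operators, are easiest to see on small vectors. The following is a sketch with made-up values `a` and `b`.

```r
# Integer division and remainder
q <- 5 %/% 3    # 1
r <- 5 %% 3     # 2

# Hypothetical example vectors of equal length
a <- c(3, 8, 1)
b <- c(5, 2, 4)

overall  <- min(a, b)     # 1: single minimum over both vectors
parallel <- pmin(a, b)    # 3 2 1: elementwise minimum
rng      <- range(a)      # 1 8
cs       <- cumsum(a)     # 3 11 12
med      <- unname(quantile(a, 0.5))   # 3, same as median(a)

# Logical operators work elementwise on vectors
both   <- (a > 2) & (b > 2)   # TRUE FALSE FALSE
either <- (a > 2) | (b > 2)   # TRUE TRUE TRUE
neg    <- !(a > 2)            # FALSE FALSE TRUE
```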

10. Tabulation, grouping, recoding

General
table(x)    Frequency table of vector (factor) x
table(x, y)    Crosstabulation of x and y
xtabs(~ x + y)    Formula interface for crosstabulation: use summary() for chi-square test
factor(x)    Convert vector to factor
cut(x, breaks)    Groups from cutpoints for continuous variable, breaks is a vector of cutpoints

Arguments to factor()
levels = c()    Values of x to code. Use if some values are not present in data, or if the order would be wrong.
labels = c()    Values associated with factor levels
exclude = c()    Values to exclude. Default NA. Set to NULL to have missing values included as a level.

Arguments to cut()
breaks = c()    Cutpoints. Note that values of x outside of ‘breaks’ give NA. Can also be a single number, the number of intervals.
labels = c()    Names for groups. Default is 1, 2, ...

Factor recoding
levels(f) <- names    New level names
factor(newcodes[f])    Combining levels: ‘newcodes’, e.g., c(1,1,1,2,3) to amalgamate the first 3 of 5 groups of factor f

12. Statistical standard methods

Parametric tests, continuous data
t.test    One- and two-sample t-test
pairwise.t.test    Pairwise comparison of means
cor.test    Significance test for correlation coeff.
var.test    Comparison of two variances (F-test)
lm(y ~ x)    Regression analysis
lm(y ~ f)    One-way analysis of variance
lm(y ~ x1 + x2 + x3)    Multiple regression
lm(y ~ f1 * f2)    Two-way analysis of variance

Non-parametric
wilcox.test    One- and two-sample Wilcoxon test
kruskal.test    Kruskal-Wallis test
friedman.test    Friedman’s two-way analysis of variance
cor.test variant method = "spearman"    Spearman rank correlation

Discrete response
binom.test    Binomial test (incl. sign test)
prop.test    Comparison of proportions
fisher.test    Exact test in 2 x 2 tables
chisq.test    Chi-square test of independence
glm(y ~ x1+x2, binomial)    Logistic regression
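
Grouping with `cut()` and amalgamating levels with `factor(newcodes[f])` can be sketched as follows. The data and the new level names ("abc", "d", "e") are invented for illustration.

```r
# Grouping a continuous variable; 30 lies outside the breaks and becomes NA
x   <- c(1, 5, 12, 30)
grp <- cut(x, breaks = c(0, 10, 20))   # (0,10] (0,10] (10,20] <NA>
tab <- table(grp)                      # counts: 2 and 1 (NA is not counted)

# Recoding: amalgamate the first 3 of 5 levels of factor f
f        <- factor(c("a", "b", "c", "d", "e", "a"))
newcodes <- c(1, 1, 1, 2, 3)
g        <- factor(newcodes[f])        # indexing by a factor uses its integer codes
levels(g) <- c("abc", "d", "e")        # new level names
tab_g    <- table(g)                   # abc: 4, d: 1, e: 1
```

The recoding works because a factor used as an index is treated as its integer codes, so each observation picks up the new code of its old level.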

11. Manipulations of matrices and lists

Matrix algebra
m1 %*% m2    Matrix product
t(m)    Matrix transpose
m[lower.tri(m)]    Returns the values from the lower triangle of matrix m as a vector
diag(m)    Returns the diagonal elements of matrix m
matrix(x, dim1, dim2)    Fill the values of vector x into a new matrix with dim1 rows and dim2 columns

Marginal operations etc.
apply(m, dim, fun)    Applies the function ‘fun’ to each row (dim = 1) or column (dim = 2) of matrix m
tapply(m, list(f1, f2), fun)    Can be used to aggregate columns or rows within matrix m as defined by f1, f2, using the function ‘fun’ (e.g., mean, max)
split(x, f)    Split vector, matrix or data frame by factor f. Different results for matrix and data frame! The result is a list with one object for each level of f.
sapply(list, fun)    Applies the function ‘fun’ to each object in a list, e.g. as created by the split function
sapply(split(x,f), fun)    Same, applied to the result of split()

13. Statistical distributions

Normal distribution
dnorm(x)    Density function
pnorm(x)    Cumulative distribution function P(X<=x)
qnorm(p)    p-quantile, returns x in: P(X<=x) = p
rnorm(n)    n random normally distributed numbers

Distributions
pnorm(x, mean, sd)    Normal
plnorm(x, meanlog, sdlog)    Lognormal
pt(x, df)    Student’s t distribution
pf(x, n1, n2)    F distribution
pchisq(x, df)    Chi-square distribution
pbinom(x, n, p)    Binomial
ppois(x, lambda)    Poisson
punif(x, min, max)    Uniform
pexp(x, rate)    Exponential
pgamma(x, shape, scale = scale)    Gamma
pbeta(x, a, b)    Beta
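
A small sketch of the marginal operations and the normal distribution functions listed above. The matrix `m`, vector `x` and factor `f` are made-up example objects.

```r
# matrix() fills column by column: rows are (1 3 5) and (2 4 6)
m        <- matrix(1:6, 2, 3)
row_sums <- apply(m, 1, sum)   # 9 12
col_sums <- apply(m, 2, sum)   # 3 7 11
mp       <- m %*% t(m)         # 2 x 2 matrix product

# Split a vector by a factor and summarise each part
x      <- c(10, 20, 30, 40, 50, 60)
f      <- factor(c("a", "b", "a", "b", "a", "b"))
parts  <- split(x, f)              # list with elements a and b
means  <- sapply(parts, mean)      # a = 30, b = 40
means2 <- tapply(x, f, mean)       # same values in one step

# pnorm() and qnorm() are inverse to each other
p_half <- pnorm(0)                 # 0.5
q_half <- qnorm(0.5)               # 0
back   <- pnorm(qnorm(0.975))      # 0.975
```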


14. Models

Model formulas
~    As explained by
+    Additive effects
:    Interaction
*    Main effects + interaction: a*b = a+b+a:b
-1    Remove intercept

Linear models
lm.out <- lm(y ~ x)    Fit model and save results as ‘lm.out’
summary(lm.out)    Coefficients etc.
anova(lm.out)    Analysis of variance table
fitted(lm.out)    Fitted values
resid(lm.out)    Residuals
predict(lm.out,newdata)    Predictions for a new data frame

Other models
glm(y ~ x, binomial)    Logistic regression
glm(y ~ x, poisson)    Poisson regression
gam(y ~ s(x))    Generalized additive model for non-linear regression with smoothing. Package: gam
tree(y ~ x1+x2+x3)    Classification (y = factor) or regression (y = numeric) tree. Package: tree

Diagnostics
rstudent(lm.out)    Studentized residuals
dfbetas(lm.out)    Change in standardized regression coefficients beta if observation removed
dffits(lm.out)    Change in fit if observation removed

Survival analysis
S <- Surv(time,ev)    Create survival object. Package: survival
survfit(S ~ 1)    Kaplan-Meier estimate
plot(survfit(S ~ 1))    Survival curve
survdiff(S ~ g)    (Log-rank) test for equal survival curves
coxph(S ~ x1 + x2)    Cox’s proportional hazards model

Multivariate
dist()    Calculate Euclidean or other distances
hclust()    Hierarchical cluster analysis
kmeans()    k-means cluster analysis
rda()    Perform principal component analysis PCA or redundancy analysis RDA. Package: ‘vegan’
cca()    Perform (canonical) correspondence analysis, CA / CCA. Package: ‘vegan’
diversity()    Calculate diversity indices. Package: ‘vegan’

15. Graphics

Standard plots
plot(x, y)    Scatterplot (or other type of plot if x and y are not numeric vectors)
plot(f, y)    Set of boxplots for each level of factor f
hist()    Histogram
boxplot()    Boxplot
barplot()    Bar diagram
dotchart()    Dot diagram
pie()    Pie chart
interaction.plot()    Interaction plot (analysis of variance)

Plotting elements (adding to a plot)
lines()    Lines
abline()    Regression line
points()    Points
arrows()    Arrows (NB: angle = 90 for error bars)
box()    Frame around plot
title()    Title (above plot)
text()    Text in plot
mtext()    Text in margin
legend()    List of symbols

Graphical parameters (arguments to par())
pch    Symbol (see below)
mfrow, mfcol    Several plots in one (multiframe)
xlim, ylim    Plot limits
lty, lwd    Line type / width (see below)
col    Color for lines or symbols (see below)

Point symbols (pch): 0 to 19 [figure]
Line types (lty): 1 to 6 [figure]
Colors (col): 1 black, 2 red, 3 green, 4 blue, 5 light blue, 6 purple, 7 yellow, 8 grey
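
The linear-model functions above can be tried on a built-in data set. This sketch uses `cars` (columns `speed` and `dist`), which ships with R; the choice of data set and the new speeds in `new` are assumptions made for illustration only.

```r
# Fit model and save results as 'lm.out'
lm.out <- lm(dist ~ speed, data = cars)

coefs <- coef(lm.out)       # intercept and slope
tab   <- anova(lm.out)      # analysis of variance table
fit   <- fitted(lm.out)     # fitted values, one per observation
res   <- resid(lm.out)      # residuals: observed minus fitted

# Predictions for a new data frame (hypothetical speeds)
new  <- data.frame(speed = c(10, 20))
pred <- predict(lm.out, newdata = new)

# '-1' in the formula removes the intercept
lm0 <- lm(dist ~ speed - 1, data = cars)
```

Calling `summary(lm.out)` prints coefficients with standard errors and the R-squared value, and `plot(cars); abline(lm.out)` adds the regression line to a scatterplot.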
Base R Cheat Sheet

Getting Help

Accessing the help files
?mean    Get help of a particular function.
help.search('weighted mean')    Search the help files for a word or phrase.
help(package = 'dplyr')    Find help for a package.

More about an object
str(iris)    Get a summary of an object's structure.
class(iris)    Find the class an object belongs to.

Using Libraries
install.packages('dplyr')    Download and install a package from CRAN.
library(dplyr)    Load the package into the session, making all its functions available to use.
dplyr::select    Use a particular function from a package.
data(iris)    Load a built-in dataset into the environment.

Working Directory
getwd()    Find the current working directory (where inputs are found and outputs are sent).
setwd('C://file/path')    Change the current working directory.
Use projects in RStudio to set the working directory to the folder you are working in.

Vectors

Creating Vectors
c(2, 4, 6)    2 4 6    Join elements into a vector
2:6    2 3 4 5 6    An integer sequence
seq(2, 3, by=0.5)    2.0 2.5 3.0    A complex sequence
rep(1:2, times=3)    1 2 1 2 1 2    Repeat a vector
rep(1:2, each=3)    1 1 1 2 2 2    Repeat elements of a vector

Vector Functions
sort(x)    Return x sorted.
rev(x)    Return x reversed.
table(x)    See counts of values.
unique(x)    See unique values.

Selecting Vector Elements

By Position
x[4]    The fourth element.
x[-4]    All but the fourth.
x[2:4]    Elements two to four.
x[-(2:4)]    All elements except two to four.
x[c(1, 5)]    Elements one and five.

By Value
x[x == 10]    Elements which are equal to 10.
x[x < 0]    All elements less than zero.
x[x %in% c(1, 2, 5)]    Elements in the set 1, 2, 5.

Named Vectors
x['apple']    Element with name 'apple'.

Programming

For Loop
for (variable in sequence){
  Do something
}

Example
for (i in 1:4){
  j <- i + 10
  print(j)
}

While Loop
while (condition){
  Do something
}

Example
while (i < 5){
  print(i)
  i <- i + 1
}

If Statements
if (condition){
  Do something
} else {
  Do something different
}

Example
if (i > 3){
  print('Yes')
} else {
  print('No')
}

Functions
function_name <- function(var){
  Do something
  return(new_variable)
}

Example
square <- function(x){
  squared <- x*x
  return(squared)
}

Reading and Writing Data
Input    Output    Description
df <- read.table('file.txt')    write.table(df, 'file.txt')    Read and write a delimited text file.
df <- read.csv('file.csv')    write.csv(df, 'file.csv')    Read and write a comma separated value file. This is a special case of read.table/write.table.
load('file.RData')    save(df, file = 'file.RData')    Read and write an R data file, a file type special for R.

Conditions
a == b    Are equal
a != b    Not equal
a > b    Greater than
a < b    Less than
a >= b    Greater than or equal to
a <= b    Less than or equal to
is.na(a)    Is missing
is.null(a)    Is null
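
The selection rules and the programming constructs of this page can be combined in one short sketch. The named vector `x` is made-up example data; `square()` is the function from the Functions example.

```r
# A hypothetical named vector
x <- c(apple = 3, pear = 10, fig = -2, plum = 5, kiwi = 10)

fourth   <- x[4]                   # plum = 5
not4     <- x[-4]                  # all but the fourth
drop24   <- x[-(2:4)]              # apple and kiwi
tens     <- x[x == 10]             # pear and kiwi
negative <- x[x < 0]               # fig
in_set   <- x[x %in% c(1, 2, 5)]   # plum
by_name  <- x['apple']             # apple = 3

# Function from the cheat sheet
square <- function(x){
  squared <- x*x
  return(squared)
}

# For loop filling a pre-allocated vector
squares <- numeric(4)
for (i in 1:4){
  squares[i] <- square(i)
}

# While loop: count how many times the body runs
i <- 1
n_runs <- 0
while (i < 5){
  n_runs <- n_runs + 1
  i <- i + 1
}

# If statement: i is 5 after the loop
answer <- if (i > 3) 'Yes' else 'No'
```

Because R is vectorized, `square(1:4)` returns the same values as the loop without any explicit iteration.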

RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • Updated: 3/15
Types
Converting between common data types in R. Can always go from a higher value in the table to a lower value.
as.logical    TRUE, FALSE, TRUE    Boolean values (TRUE or FALSE).
as.numeric    1, 0, 1    Integers or floating point numbers.
as.character    '1', '0', '1'    Character strings. Generally preferred to factors.
as.factor    '1', '0', '1', levels: '1', '0'    Character strings with preset levels. Needed for some statistical models.

Matrixes
m <- matrix(x, nrow = 3, ncol = 3)    Create a matrix from x.
m[2, ]    Select a row.
m[ , 1]    Select a column.
m[2, 3]    Select an element.
t(m)    Transpose.
m %*% n    Matrix multiplication.
solve(m, n)    Find x in: m %*% x = n.

Strings    Also see the stringr library.
paste(x, y, sep = ' ')    Join multiple vectors together.
paste(x, collapse = ' ')    Join elements of a vector together.
grep(pattern, x)    Find regular expression matches in x.
gsub(pattern, replace, x)    Replace matches in x with a string.
toupper(x)    Convert to uppercase.
tolower(x)    Convert to lowercase.
nchar(x)    Number of characters in a string.

Lists Factors
l <- list(x = 1:5, y = c('a', 'b')) factor(x) cut(x, breaks = 4)
Maths Functions A list is a collection of elements which can be of different types. Turn a vector into a factor. Can
set the levels of the factor and
Turn a numeric vector into a
factor by ‘cutting’ into
log(x) Natural log. sum(x) Sum. l[[2]] l[1] l$x l['y'] the order. sections.
New list with New list with
exp(x) Exponential. mean(x) Mean. Second element Element named
only the first only element
max(x) Largest element. median(x) Median.
of l.
element.
x.
named y. Statistics
min(x) Smallest element. quantile(x) Percentage
lm(x ~ y, data=df) prop.test
Also see the t.test(x, y)
quantiles.
dplyr library. Data Frames Linear model. Perform a t-test for Test for a
round(x, n) Round to n decimal rank(x) Rank of elements. difference
difference between
places. glm(x ~ y, data=df) between
df <- data.frame(x = 1:3, y = c('a', 'b', 'c')) means.
Generalised linear model. proportions.
signif(x, n) Round to n var(x) The variance. A special case of a list where all elements are the same length.
significant figures. pairwise.t.test
List subsetting summary aov
Perform t-tests for
cor(x, y) Correlation. sd(x) The standard x y Get more detailed information Analysis of
pairwise group comparisons.
deviation. out of a model. variance.
df$x df[[2]]
1 a
Variable Assignment Distributions
2 b Understanding a data frame
> a <- 'apple' Random Density Cumulative
Quantile
> a See the full data Variates Function Distribution
3 c View(df)
[1] 'apple' frame. Normal rnorm dnorm pnorm qnorm
See the first 6
Matrix subsetting head(df) Poisson rpois dpois ppois qpois
rows.
The Environment Binomial rbinom dbinom pbinom qbinom
df[ , 2]
ls() List all variables in the nrow(df) cbind - Bind columns. Uniform runif dunif punif qunif
environment. Number of rows.

rm(x) Remove x from the


environment. df[2, ]
ncol(df)
Number of
Plotting Also see the ggplot2 library.

columns.
rm(list = ls()) Remove all variables from the rbind - Bind rows. plot(x) plot(x, y) hist(x)
environment. Values of x in Values of x Histogram of
dim(df)
Number of order. against y. x.
You can use the environment panel in RStudio to
df[2, 2] columns and
browse variables in your environment. rows.
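
The list and data frame subsetting forms on this page differ mainly in what they return: double brackets and `$` extract the element itself, single brackets return a smaller list or data frame. A minimal sketch using the objects `l` and `df` defined on this page (the extra row `extra` is made up):

```r
l <- list(x = 1:5, y = c('a', 'b'))
second  <- l[[2]]    # the element itself: 'a' 'b'
first_l <- l[1]      # a new list with only the first element
by_name <- l$x       # the element named x: 1 2 3 4 5
y_list  <- l['y']    # a new list with only the element named y

df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
col_x  <- df$x       # column x as a vector
col_2  <- df[[2]]    # second column as a vector
row_2  <- df[2, ]    # second row, still a data frame
cell   <- df[2, 2]   # single value from row 2, column 2
shape  <- dim(df)    # 3 2: number of rows, then number of columns

# rbind binds rows, cbind binds columns
extra  <- data.frame(x = 4L, y = 'd')   # hypothetical extra row
longer <- rbind(df, extra)              # 4 rows
wider  <- cbind(df, z = c(TRUE, FALSE, TRUE))   # 3 columns
```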
Dates See the lubridate library.

Data Structures
R Programming Cheat Sheet Vector
• Group of elements of the SAME type
data.frame while using single-square brackets, use
‘drop’: df1[, 'col1', drop = FALSE]
just the basics • R is a vectorized language, operations are applied to
data.table
each element of the vector automatically
• R has no concept of column vectors or row vectors What is a data.table
Created By: Arianne Colton and Sean Chen • Special vectors: letters and LETTERS, that contain • Extends and enhances the functionality of data.frames
lower-case and upper-case letters Differences: data.table vs. data.frame
Create Vector v1 <- c(1, 2, 3) • By default data.frame turns character data into factors,
General Manipulating Strings Get Length length(v1) while data.table does not
Check if All or Any is True all(v1); any(v1) • When you print data.frame data, all data prints to the
• R version 3.0 and greater adds support for 64 bit paste('string1', 'string2', sep Integer Indexing v1[1:3]; v1[c(1,6)]
console, with a data.table, it intelligently prints the first
integers = '/') and last five rows
Boolean Indexing v1[is.na(v1)] <- 0
• R is case sensitive Putting # separator ('sep') is a space by default • Key Difference: Data.tables are fast because
Together c(first = 'a', ..)or
• R index starts from 1 paste(c('1', '2'), collapse = Naming they have an index like a database.
Strings names(v1) <- c('first', ..)
'/')
i.e., this search, dt1$col1 > number, does a
HELP # returns '1/2' Factor sequential scan (vector scan). After you create a key
stringr::str_split(string = v1, for this, it will be much faster via binary search.
• as.factor(v1) gets you the levels which is the
help(functionName) or ?functionName Split String pattern = '-')
number of unique values Create data.table from data.frame data.table(df1)
# returns a list
Help Home Page help.start() stringr::str_sub(string = v1, • Factors can reduce the size of a variable because they dt1[, 'col1', with
Get Substring start = 1, end = 3) only store unique values, but could be buggy if not Index by Column(s)* = FALSE] or
Special Character Help help('[') isJohnFound <- stringr::str_ used properly dt1[, list(col1)]
Search Help help.search(..)or ??.. detect(string = df1$col1, Show info for each data.table in tables()
Search Function - with pattern = ignore.case('John')) list memory (i.e., size, ...)
apropos('mea') Match String
Partial Name # returns True/False if John was found Store any number of items of ANY type Show Keys in data.table key(dt1)
See Example(s) example(topic) df1[isJohnFound, c('col1', Create index for col1 and setkey(dt1, col1)
...)] Create List list1 <- list(first = 'a', ...) reorder data according to col1
vector(mode = 'list', length dt1[c('col1Value1',
Objects in current environment Create Empty List = 3) Use Key to Select Data
'col1Value2'), ]
Get Element list1[[1]] or list1[['first']] Multiple Key Select dt1[J('1', c('2', '3')), ]
Display Object Name
Remove Object
objects() or ls()
rm(object1, object2,..)
Data Types Append Using
Numeric Index
list1[[6]] <- 2 dt1[, list(col1 =
mean(col1)), by =
Append Using Name list1[['newElement']] <- 2 col2]
Aggregation ** dt1[, list(col1 =
Notes: Check data type: class(variable)
mean(col1), col2Sum
Note: repeatedly appending to list, vector, data.frame
1. .name starting with a period are accessible but Four Basic Data Types etc. is expensive, it is best to create a list of a certain
= sum(col2)), by =
list(col3, col4)]
invisible, so they will not be found by ‘ls’ 1. Numeric - includes float/double, int, etc. size, then fill it.
2. To guarantee memory removal, use ‘gc’, releasing * Accessing columns must be done via list of actual
unused memory to the OS. R performs automatic ‘gc’
is.numeric(variable)
data.frame names, not as characters. If column names are
periodically 2. Character(string) • Each column is a variable, each row is an observation characters, then "with" argument should be set to
• Internally, each column is a vector FALSE.
nchar(variable) # length of a character or numeric
Symbol Name Environment • idata.frame is a data structure that creates a reference ** Aggregate and d*ply functions will work, but built-in
3. Date/POSIXct to a data.frame, therefore, no copying is performed aggregation functionality of data table is faster
• If multiple packages use the same function name the • Date: stores just a date. In numeric form, number
df1 <- data.frame(col1 = v1,
function that the package loaded the last will get called. of days since 1/1/1970 (see below). Create Data Frame col2 = v2, v3) Matrix
date1 <- as.Date('2012-06-28'), Dimension nrow(df1); ncol(df1); dim(df1) • Similar to data.frame except every element must be
• To avoid this precede the function with the name of the as.numeric(date1) Get/Set Column names(df1) the SAME type, most commonly all numerics
package. e.g. packageName::functionName(..) Names names(df1) <- c(...) • Functions that work with data.frame should work with
• POSIXct: stores a date and time. In numeric
form, number of seconds since 1/1/1970. Get/Set Row rownames(df1) matrix as well
Names rownames(df1) <- c(...)
Library date2 <- as.POSIXct('2012-06-28 18:00') Preview head(df1, n = 10); tail(...) Create Matrix matrix1 <- matrix(1:10, nrow = 5), # fills
rows 1 to 5, column 1 with 1:5, and column 2 with 6:10
Only trust reliable R packages i.e., 'ggplot2' for plotting, Get Data Type class(df1) # is data.frame Matrix matrix1 %*% t(matrix2)
'sp' for dealing spatial data, 'reshape2', 'survival', etc. Note: Use 'lubridate' and 'chron' packages to work df1['col1']or df1[1];† Multiplication # where t() is transpose
with Dates Index by Column(s) df1[c('col1', 'col3')] or
library(packageName)or
df1[c(1, 3)] Array
Load Package 4. Logical Index by Rows and df1[c(1, 3), 2:3] # returns data • Multidimensional vector of the SAME type
require(packageName) Columns from row 1 & 3, columns 2 to 3
Unload Package detach(packageName) • (TRUE = 1, FALSE = 0) • array1 <- array(1:12, dim = c(2, 3, 2))
• Use ==/!= to test equality and inequality † Index method: df1$col1 or df1[, 'col1'] or • Using arrays is not recommended
Note: require() returns the status(True/False) df1[, 1] returns as a vector. To return single column • Matrices are restricted to two dimensions while array
as.numeric(TRUE) => 1 can have any dimension
Data Munging Functions and Controls Data Reshaping
Apply (apply, tapply, lapply, mapply) group_by(), sample_n() say_hello <- function(first,
Create Function last = 'hola') { } Rearrange
• Apply - most restrictive. Must be used on a matrix, all • Chain functions reshape2::melt(df1, id.vars =
Call Function say_hello(first = 'hello')
elements must be the same type df1 %>% group_by(year, month) %>% Melt Data - from c('col1', 'col2'), variable.
• If used on some other object, such as a data.frame, it
select(col1, col2) %>% summarise(col1mean • R automatically returns the value of the last line of column to row name = 'newCol1', value.name =
= mean(col1)) code in a function. This is bad practice. Use return() 'newCol2')
will be converted to a matrix first reshape2::dcast(df1, col1 +
explicitly instead. Cast Data - from col2 ~ newCol1, value.var =
apply(matrix1, 1 - rows or 2 - columns, • Much faster than plyr, with four types of easy-to-use row to column 'newCol2')
function to apply) joins (inner, left, semi, anti) • do.call() - specify the name of a function either as
# if rows, then pass each row as input to the function
string (i.e. 'mean') or as object (i.e. mean) and provide
• Abstracts the way data is stored so you can work with arguments as a list. If df1 has 3 more columns, col3 to col5, 'melting' creates
• By default, computation on NA (missing data) always data frames, data tables, and remote databases with a new df that has 3 rows for each combination of col1
returns NA, so if a matrix contains NAs, you can the same set of functions do.call(mean, args = list(first = '1st')) and col2, with the values coming from the respective col3
ignore them (use na.rm = TRUE in the apply(..) Helper functions to col5.
which doesn’t pass NAs to your function) if /else /else if /switch
each() - supply multiple functions to a function like aggregate Combine (multiple sets into one)
lapply if { } else ifelse
aggregate(price ~ cut, diamonds, each(mean, 1. cbind - bind by columns
Applies a function to each element of a list and returns median)) Works with Vectorized Argument No Yes
the results as a list Most Efficient for Non-Vectorized Argument Yes No data.frame from two vectors cbind(v1, v2)
sapply data.frame combining df1 and
Works with NA * No Yes cbind(df1, df2)
Same as lapply except return the results as a vector Data Use &&, || **† Yes No
df2 columns

2. rbind - similar to cbind but for rows, you can assign


Note: lapply & sapply can both take a vector as input, a Use &, | ***† No Yes
new column names to vectors in cbind
vector is technically a form of list Load Data from CSV
cbind(col1 = v1, ...)
Aggregate (SQL groupby) • Read csv * NA == 1 result is NA, thus if won’t work, it’ll be an
read.table(file = url or filepath, header = error. For ifelse, NA will return instead 3. Joins - (merge, join, data.table) using common keys
• aggregate(formulas, data, function)
TRUE, sep = ',') ** &&, || is best used in if, since it only compares the 3.1 Merge
• Formulas: y ~ x, y represents a variable that we first element of vector from each side
• “stringsAsFactors” argument defaults to TRUE in R versions before 4.0.0, set it to • by.x and by.y specify the key columns used in the
want to make a calculation on, x represents one or FALSE to prevent converting columns to factors. This
more variables we want to group the calculation by *** &, | is necessary for ifelse, as it compares every join() operation
saves computation time and maintains character data element of vector from each side
• Can only use one function in aggregate(). To apply • Other useful arguments are "quote" and "colClasses", • Merge can be much slower than the alternatives
more than one function, use the plyr() package specifying the character used for enclosing cells and † &&, || are similar to if in that they don’t work with
vectors, where ifelse, &, | work with vectors merge(x = df1, y = df2, by.x = c('col1',
In the example below diamonds is a data.frame; price, the data type for each column. 'col3'), by.y = c('col3', 'col6'))
cut, color etc. are columns of diamonds. • If cell separator has been used inside a cell, then use • Similar to C++/Java, for &, |, both sides of operator
read.csv2() or read.delim2() instead of read. 3.2 Join
aggregate(price ~ cut, diamonds, mean)
table()
are always checked. For &&, ||, if left side fails, no • Join in plyr() package works similar to merge but
# get the average price of different cuts for the diamonds need to check the right side. much faster, drawback is key columns in each
aggregate(price ~ cut + color, diamonds, Database • } else, else must be on the same line as } table must have the same name
mean) # group by cut and color Connect to
aggregate(cbind(price, carat) ~ cut, Database
db1 <- RODBC::odbcConnect('conStr') • join() has an argument for specifying left, right,
diamonds, mean) # get the average price and average Query df1 <- RODBC::sqlQuery(db1, 'SELECT inner joins
carat of different cuts Database
Close
..', stringsAsFactors = FALSE) Graphics join(x = df1, y = df2, by = c('col1',
Plyr ('split-apply-combine') Connection
RODBC::odbcClose(db1) 'col3'))

• ddply(), llply(), ldply(), etc. (1st letter = the type of • Only one connection may be open at a time. The Default basic graphic 3.3 data.table
input, 2nd = the type of output connection automatically closes if R closes or another
connection is opened. hist(df1$col1, main = 'title', xlab = 'x dt1 <- data.table(df1, key = c('1',
• plyr can be slow, most of the functionality in plyr axis label')
can be accomplished using base function or other • If table name has space, use [ ] to surround the table '2')), dt2 <- ...‡
packages, but plyr is easier to use name in the SQL string. plot(col2 ~ col1, data = df1),
aka y ~ x or plot(x, y) • Left Join
ddply • which() in R is similar to ‘where’ in SQL
Takes a data.frame, splits it according to some Included Data lattice and ggplot2 (more popular) dt1[dt2]
variable(s), performs a desired action on it and returns a R and some packages come with data included.
data.frame • Initialize the object and add layers (points, lines, ‡ Data table join requires specifying the keys for the data
List Available Datasets data() histograms) using +, map variable in the data to an
List Available Datasets in data(package = tables
llply
a Specific Package 'ggplot2')
axis or aesthetic using ‘aes’
• Can use this instead of lapply ggplot(data = df1) + geom_histogram(aes(x
• For sapply, can use laply (‘a’ is array/vector/matrix), Missing Data (NA and NULL) = col1)) Created by Arianne Colton and Sean Chen
however, laply result does not include the names. NULL is not missing, it’s nothingness. NULL is atomical
and cannot exist within a vector. If used inside a vector, it • Normalized histogram (pdf, not relative frequency
DPLYR (for data.frame ONLY) simply disappears. histogram) Based on content from
• Basic functions: filter(), slice(), arrange(), select(), ggplot(data = df1) + geom_density(aes(x = 'R for Everyone' by Jared Lander
Check Missing Data is.na()
rename(), distinct(), mutate(), summarise(), col1), fill = 'grey50')
Avoid Using is.null() Updated: December 2, 2015
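
The contrast between `if` and `ifelse`, the formula interface of `aggregate()`, and `do.call()` can be sketched with base R only. The cheat sheet's own examples use the `diamonds` data from ggplot2; this sketch substitutes the built-in `mtcars` data so that it runs without extra packages, and the vector `v` is made up.

```r
# ifelse() is vectorized and passes NA through
v   <- c(3, NA, 8)
lab <- ifelse(v > 5, 'big', 'small')   # 'small' NA 'big'

# if() needs a single TRUE/FALSE; && stops at the first FALSE
first_lab <- if (!is.na(v[1]) && v[1] > 5) 'big' else 'small'   # 'small'

# aggregate(): y ~ x means "summarise y within groups of x"
avg  <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg2 <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)

# lapply returns a list, sapply simplifies to a vector
per_list <- lapply(list(a = 1:3, b = 4:6), sum)   # list: a = 6, b = 15
per_vec  <- sapply(list(a = 1:3, b = 4:6), sum)   # named vector: 6 15

# do.call(): function given as object or as string, arguments as a list
total  <- do.call(sum, args = list(1, 2, 3))      # 6
total2 <- do.call('sum', args = list(1, 2, 3))    # 6
```

Writing `if (v[2] > 5)` here would stop with an error, because the condition evaluates to NA; that is the case the `is.na()` guard protects against.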
Data import with the tidyverse : : CHEAT SHEET
Read Tabular Data with readr
read_*(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale, n_max = Inf, One of the first steps of a project is to import OTHER TYPES OF DATA
skip = 0, na = c("", "NA"), guess_max = min(1000, n_max), show_col_types = TRUE) See ?read_delim outside data into R. Data is often stored in Try one of the following
tabular formats, like csv files or spreadsheets. packages to import other types of files:
A|B|C
A B C read_delim("file.txt", delim = "|") Read files with any delimiter. If no The front page of this sheet shows • haven - SPSS, Stata, and SAS files
1 2 3 delimiter is specified, it will automatically guess. how to import and save text files into • DBI - databases
1|2|3 4 5 NA To make file.txt, run: write_file("A|B|C\n1|2|3\n4|5|NA", file = "file.txt")
4|5|NA R using readr. • jsonlite - json
The back page shows how to import • xml2 - XML
A B C read_csv("file.csv") Read a comma delimited file with period • httr - Web APIs
A,B,C spreadsheet data from Excel files
1 2 3 decimal marks. • rvest - HTML (Web Scraping)
1,2,3 4 5 NA write_file("A,B,C\n1,2,3\n4,5,NA", file = "file.csv") using readxl or Google Sheets using
4,5,NA googlesheets4. • readr::read_lines() - text data

read_csv2("file2.csv") Read semicolon delimited files with comma


Column Specification with readr
A B C
A;B;C
1.5 2 3 decimal marks.
1,5;2;3 4.5 5 NA write_file("A;B;C\n1,5;2;3\n4,5;5;NA", file = "file2.csv")
4,5;5;NA Column specifications define what data type each
USEFUL COLUMN ARGUMENTS
column of a file will be imported as. By default
A B C read_tsv("file.tsv") Read a tab delimited file. Also read_table(). readr will generate a column spec when a file is Hide col spec message
A B C read_*(file, show_col_types = FALSE)
1 2 3 read_fwf("file.tsv", fwf_widths(c(2, 2, NA))) Read a fixed width file. read and output a summary.
1 2 3 4 5 NA write_file("A\tB\tC\n1\t2\t3\n4\t5\tNA\n", file = "file.tsv")
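The read functions above can be tried end to end by first creating the small example files with write_file(), as the sheet suggests. The sketch below assumes readr 2.x is installed; writing into tempdir() and the object names x and x2 are choices made here for illustration.

```r
library(readr)

# Create the sheet's example files in a temporary directory.
dir <- tempdir()
csv_path  <- file.path(dir, "file.csv")
csv2_path <- file.path(dir, "file2.csv")
write_file("A,B,C\n1,2,3\n4,5,NA", file = csv_path)
write_file("A;B;C\n1,5;2;3\n4,5;5;NA", file = csv2_path)

# Comma delimited, period decimal marks. The string "NA" is read as missing.
x <- read_csv(csv_path, show_col_types = FALSE)

# Semicolon delimited, comma decimal marks: "1,5" is parsed as 1.5.
x2 <- read_csv2(csv2_path, show_col_types = FALSE)

x
x2
```

Both calls return a tibble with two rows and three columns; only the delimiter and decimal-mark conventions differ between the two files.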
spec(x) Extract the full column specification for the given imported data frame.
spec(x)
# cols(
#   age = col_integer(),
#   edu = col_character(),
#   earn = col_double()
# )
Here age is an integer, edu is a character, and earn is a double (numeric).

Select columns to import
Use names, position, or selection helpers.
read_*(file, col_select = c(age, earn))

Guess column types
To guess a column type, read_*() looks at the first 1000 rows of data. Increase with guess_max.
read_*(file, guess_max = Inf)

USEFUL READ ARGUMENTS
No header
read_csv("file.csv", col_names = FALSE)
Provide header
read_csv("file.csv", col_names = c("x", "y", "z"))
Read multiple files into a single table
read_csv(c("f1.csv", "f2.csv", "f3.csv"), id = "origin_file")
Skip lines
read_csv("file.csv", skip = 1)
Read a subset of lines
read_csv("file.csv", n_max = 1)
Read values as missing
read_csv("file.csv", na = c("1"))
Specify decimal marks
read_delim("file2.csv", locale = locale(decimal_mark = ","))

COLUMN TYPES
Each column type has a function and corresponding string abbreviation.
• col_logical() - "l"
• col_integer() - "i"
• col_double() - "d"
• col_number() - "n"
• col_character() - "c"
• col_factor(levels, ordered = FALSE) - "f"
• col_datetime(format = "") - "T"
• col_date(format = "") - "D"
• col_time(format = "") - "t"
• col_skip() - "-", "_"
• col_guess() - "?"

DEFINE COLUMN SPECIFICATION
Set a default type
read_csv(
  file,
  col_types = list(.default = col_double())
)
Use column type or string abbreviation
read_csv(
  file,
  col_types = list(x = col_double(), y = "l", z = "_")
)
Use a single string of abbreviations
# col types: skip, guess, integer, logical, character
read_csv(
  file,
  col_types = "_?ilc"
)

Save Data with readr
write_*(x, file, na = "NA", append, col_names, quote, escape, eol, num_threads, progress)
write_delim(x, file, delim = " ") Write files with any delimiter.
write_csv(x, file) Write a comma delimited file.
write_csv2(x, file) Write a semicolon delimited file.
write_tsv(x, file) Write a tab delimited file.
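A column specification and a save step can be combined in a short round trip. The sketch below assumes readr 2.x, which accepts literal data wrapped in I(); the inline data and the names csv_text, typed, and round_trip are made up for illustration.

```r
library(readr)

# Literal CSV data; I() tells readr this is content, not a file path.
csv_text <- I("x,y,z\n1,TRUE,a\n2,FALSE,b")

# Mix a column type function with string abbreviations; "_" skips column z.
typed <- read_csv(
  csv_text,
  col_types = list(x = col_double(), y = "l", z = "_")
)

# Inspect the specification that was actually used.
spec(typed)

# Save the result, then read it back with a single string of abbreviations.
out <- tempfile(fileext = ".csv")
write_csv(typed, out)
round_trip <- read_csv(out, col_types = "dl")
```

Stating the types explicitly avoids the guessing step, which is useful when a column's first rows are not representative of the rest of the file.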

CC BY SA Posit Software, PBC • [email protected] • posit.co • readr.tidyverse.org • readxl.tidyverse.org and googlesheets4.tidyverse.org • readr 2.1.4 • readxl 1.4.2 • googlesheets4 1.1.0 • Updated: 2023-05









Import Spreadsheets

with readxl

READ EXCEL FILES
read_excel(path, sheet = NULL, range = NULL) Read a .xls or .xlsx file based on the file extension. See front page for more read arguments. Also read_xls() and read_xlsx().
read_excel("excel_file.xlsx")

READ SHEETS
read_excel(path, sheet = NULL) Specify which sheet to read by position or name.
read_excel(path, sheet = 1)
read_excel(path, sheet = "s1")

excel_sheets(path) Get a vector of sheet names.
excel_sheets("excel_file.xlsx")

To read multiple sheets:
1. Get a vector of sheet names from the file path.
2. Set the vector names to be the sheet names.
3. Use purrr::map_dfr() to read multiple files into one data frame.
path <- "your_file_path.xlsx"
path |> excel_sheets() |>
  set_names() |>
  map_dfr(read_excel, path = path)

OTHER USEFUL EXCEL PACKAGES
For functions to write data to Excel files, see:
• openxlsx
• writexl
For working with non-tabular Excel data, see:
• tidyxl

READXL COLUMN SPECIFICATION
Column specifications define what data type each column of a file will be imported as.
Use the col_types argument of read_excel() to set the column specification.

Guess column types
To guess a column type, read_excel() looks at the first 1000 rows of data. Increase with the guess_max argument.
read_excel(path, guess_max = Inf)

Set all columns to same type, e.g. character
read_excel(path, col_types = "text")

Set each column individually
read_excel(
  path,
  col_types = c("text", "guess", "guess", "numeric")
)

COLUMN TYPES
logical  numeric  text   date        list
TRUE     2        hello  1947-01-08  hello
FALSE    3.45     world  1956-10-21  1
• skip
• guess
• logical
• numeric
• text
• date
• list
Use list for columns that include multiple data types. See tidyr and purrr for list-column data.

with googlesheets4

READ SHEETS
read_sheet(ss, sheet = NULL, range = NULL) Read a sheet from a URL, a Sheet ID, or a dribble from the googledrive package. See front page for more read arguments. Same as range_read().

SHEETS METADATA
URLs are in the form:
https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit#gid=SHEET_ID
gs4_get(ss) Get spreadsheet meta data.
gs4_find(...) Get data on all spreadsheet files.
sheet_properties(ss) Get a tibble of properties for each worksheet. Also sheet_names().

WRITE SHEETS
write_sheet(data, ss = NULL, sheet = NULL) Write a data frame into a new or existing Sheet.
gs4_create(name, ..., sheets = NULL) Create a new Sheet with a vector of names, a data frame, or a (named) list of data frames.
sheet_append(ss, data, sheet = 1) Add rows to the end of a worksheet.

GOOGLESHEETS4 COLUMN SPECIFICATION
Column specifications define what data type each column of a file will be imported as.
Use the col_types argument of read_sheet()/range_read() to set the column specification.

Guess column types
To guess a column type, read_sheet()/range_read() looks at the first 1000 rows of data. Increase with guess_max.
read_sheet(path, guess_max = Inf)

Set all columns to same type, e.g. character
read_sheet(path, col_types = "c")

Set each column individually
# col types: skip, guess, integer, logical, character
read_sheet(ss, col_types = "_?ilc")

COLUMN TYPES
l      n     c      D           L
TRUE   2     hello  1947-01-08  hello
FALSE  3.45  world  1956-10-21  1
• skip - "_" or "-"
• guess - "?"
• logical - "l"
• integer - "i"
• double - "d"
• numeric - "n"
• date - "D"
• datetime - "T"
• character - "c"
• list-column - "L"
• cell - "C" Returns list of raw cell data.
Use list for columns that include multiple data types. See tidyr and purrr for list-column data.

CELL SPECIFICATION FOR READXL AND GOOGLESHEETS4
Use the range argument of readxl::read_excel() or googlesheets4::read_sheet() to read a subset of cells from a sheet.
read_excel(path, range = "Sheet1!B1:D2")
read_sheet(ss, range = "B1:D2")
Also use the range argument with cell specification functions cell_limits(), cell_rows(), cell_cols(), and anchored().

FILE LEVEL OPERATIONS
googlesheets4 also offers ways to modify other aspects of Sheets (e.g. freeze rows, set column width, manage (work)sheets). Go to googlesheets4.tidyverse.org to read more.
For whole-file operations (e.g. renaming, sharing, placing within a folder), see the tidyverse package googledrive at googledrive.tidyverse.org.
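The multi-sheet recipe for readxl can be run without any file of your own by using a workbook that ships with the package. This is a sketch under assumptions: readxl, purrr, and dplyr (needed by map_dfr()) are installed, and the bundled datasets.xlsx workbook stands in for your_file_path.xlsx; the .id column name "sheet" is chosen here for illustration.

```r
library(readxl)
library(purrr)

# readxl_example() returns the path to a workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")

# Step 1: get the vector of sheet names.
sheets <- excel_sheets(path)

# Steps 2 and 3: name the vector by itself, read every sheet, and row-bind.
# .id = "sheet" records which sheet each row came from.
all_sheets <- path |>
  excel_sheets() |>
  set_names() |>
  map_dfr(read_excel, path = path, .id = "sheet")

# Read a single sheet by name, restricted to a rectangular cell range.
top_left <- read_excel(path, sheet = sheets[1], range = "A1:C4")
```

Since path is passed by name, each sheet name supplied by map_dfr() falls into the next free argument of read_excel(), which is sheet. Columns that exist in only some sheets are filled with NA in the combined table.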

Data transformation with dplyr : : CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data:
• Each variable is in its own column
• Each observation, or case, is in its own row
• pipes: x |> f(y) becomes f(x, y)

Summarize Cases
Apply summary functions to columns to create a new table of summary statistics. Summary functions take vectors as input and return one value (see back).

summarize(.data, …) Compute table of summaries.
mtcars |> summarize(avg = mean(mpg))

count(.data, …, wt = NULL, sort = FALSE, name = NULL) Count number of rows in each group defined by the variables in … Also tally(), add_count(), add_tally().
mtcars |> count(cyl)

Group Cases
Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table grouped by columns in ... dplyr functions will manipulate each "group" separately and combine the results.
mtcars |>
  group_by(cyl) |>
  summarize(avg = mean(mpg))

Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute results for each row. Also apply functions to list-columns. See tidyr cheat sheet for list-column workflow.
starwars |>
  rowwise() |>
  mutate(film_count = length(films))

ungroup(x, …) Returns ungrouped copy of table.
g_mtcars <- mtcars |> group_by(cyl)
ungroup(g_mtcars)

Manipulate Cases

EXTRACT CASES
Row functions return a subset of rows as a new table.

filter(.data, …, .preserve = FALSE) Extract rows that meet logical criteria.
mtcars |> filter(mpg > 20)

distinct(.data, …, .keep_all = FALSE) Remove rows with duplicate values.
mtcars |> distinct(gear)

slice(.data, …, .preserve = FALSE) Select rows by position.
mtcars |> slice(10:15)

slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE) Randomly select rows. Use n to select a number of rows and prop to select a fraction of rows.
mtcars |> slice_sample(n = 5, replace = TRUE)

slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max() Select rows with the lowest and highest values.
mtcars |> slice_min(mpg, prop = 0.25)

slice_head(.data, …, n, prop) and slice_tail() Select the first or last rows.
mtcars |> slice_head(n = 5)

Logical and boolean operators to use with filter()
==   <   <=   is.na()    %in%   |   xor()
!=   >   >=   !is.na()   !      &
See ?base::Logic and ?Comparison for help.

ARRANGE CASES
arrange(.data, …, .by_group = FALSE) Order rows by values of a column or columns (low to high), use with desc() to order from high to low.
mtcars |> arrange(mpg)
mtcars |> arrange(desc(mpg))

ADD CASES
add_row(.data, …, .before = NULL, .after = NULL) Add one or more rows to a table.
cars |> add_row(speed = 1, dist = 1)

Manipulate Variables

EXTRACT VARIABLES
Column functions return a set of columns as a new vector or table.

pull(.data, var = -1, name = NULL, …) Extract column values as a vector, by name or index.
mtcars |> pull(wt)

select(.data, …) Extract columns as a table.
mtcars |> select(mpg, wt)

relocate(.data, …, .before = NULL, .after = NULL) Move columns to new position.
mtcars |> relocate(mpg, cyl, .after = last_col())

Use these helpers with select() and across()
e.g. mtcars |> select(mpg:cyl)
contains(match)      num_range(prefix, range)       :, e.g., mpg:cyl
ends_with(match)     all_of(x)/any_of(x, …, vars)   !, e.g., !gear
starts_with(match)   matches(match)                 everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE
df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))

across(.cols, .funs, …, .names = NULL) Summarize or mutate multiple columns in the same way.
df |> summarize(across(everything(), mean))

c_across(.cols) Compute across columns in row-wise data.
df |>
  rowwise() |>
  mutate(x_total = sum(c_across(1:2)))

MAKE NEW VARIABLES
Apply vectorized functions to columns. Vectorized functions take vectors as input and return vectors of the same length as output (see back).

mutate(.data, …, .keep = "all", .before = NULL, .after = NULL) Compute new column(s). Also add_column().
mtcars |> mutate(gpm = 1 / mpg)
mtcars |> mutate(gpm = 1 / mpg, .keep = "none")

rename(.data, …) Rename columns. Use rename_with() to rename with a function.
mtcars |> rename(miles_per_gallon = mpg)
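The verbs on this page are designed to be chained with the pipe. The sketch below strings several of them together on mtcars and on the small df tibble defined in the sheet; it assumes dplyr 1.1 or later, and the result names by_cyl, col_means, and row_totals are illustrative.

```r
library(dplyr)

# Extract cases, make a new variable, then summarize by group.
by_cyl <- mtcars |>
  filter(mpg > 20) |>
  mutate(gpm = 1 / mpg) |>
  group_by(cyl) |>
  summarize(n = n(), avg_mpg = mean(mpg), avg_gpm = mean(gpm))

# The example tibble from the sheet.
df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))

# across(): apply one function to several columns.
# Column means are x_1 = 1.5, x_2 = 3.5, y = 4.5.
col_means <- df |> summarize(across(everything(), mean))

# c_across(): combine values across columns within each row.
# x_total is 1 + 3 = 4 for the first row and 2 + 4 = 6 for the second.
row_totals <- df |>
  rowwise() |>
  mutate(x_total = sum(c_across(1:2))) |>
  ungroup()
```

Calling ungroup() after a rowwise() step returns an ordinary tibble, so later verbs operate on whole columns again rather than one row at a time.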

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at dplyr.tidyverse.org • dplyr 1.1.2 • Updated: 2023-05
Vectorized Functions

TO USE WITH MUTATE()
mutate() applies vectorized functions to columns to create new columns. Vectorized functions take vectors as input and return vectors of the same length as output.

OFFSET
dplyr::lag() - offset elements by 1
dplyr::lead() - offset elements by -1

CUMULATIVE AGGREGATE
dplyr::cumall() - cumulative all()
dplyr::cumany() - cumulative any()
cummax() - cumulative max()
dplyr::cummean() - cumulative mean()
cummin() - cumulative min()
cumprod() - cumulative prod()
cumsum() - cumulative sum()

RANKING
dplyr::cume_dist() - proportion of all values <=
dplyr::dense_rank() - rank w ties = min, no gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"

MATH
+, - , *, /, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
dplyr::between() - x >= left & x <= right
dplyr::near() - safe == for floating point numbers

MISCELLANEOUS
dplyr::case_when() - multi-case if_else()
starwars |>
  mutate(type = case_when(
    height > 200 | mass > 200 ~ "large",
    species == "Droid" ~ "robot",
    TRUE ~ "other")
  )
dplyr::coalesce() - first non-NA values by element across a set of vectors
dplyr::if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()

Summary Functions

TO USE WITH SUMMARIZE()
summarize() applies summary functions to columns to create a new table. Summary functions take vectors as input and return single values as output.

COUNT
dplyr::n() - number of values/rows
dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NAs

POSITION
mean() - mean, also mean(!is.na())
median() - median

LOGICAL
mean() - proportion of TRUEs
sum() - # of TRUEs

ORDER
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector

RANK
quantile() - nth quantile
min() - minimum value
max() - maximum value

SPREAD
IQR() - Inter-Quartile Range
mad() - median absolute deviation
sd() - standard deviation
var() - variance

Row Names
Tidy data does not use rownames, which store a variable outside of the columns. To work with the rownames, first move them into a column.

tibble::rownames_to_column() Move row names into col.
a <- mtcars |>
  rownames_to_column(var = "C")

tibble::column_to_rownames() Move col into row names.
a |> column_to_rownames(var = "C")

Also tibble::has_rownames() and tibble::remove_rownames().

Combine Tables

The diagrams in this section use two example tables: x has columns A, B, C with rows (a, t, 1), (b, u, 2), (c, v, 3); y has columns A, B, D with rows (a, t, 3), (b, u, 2), (d, w, 1).

COMBINE VARIABLES
bind_cols(…, .name_repair) Returns tables placed side by side as a single table. Column lengths must be equal. Columns will NOT be matched by id (to do that look at Relational Data below), so be sure to check that both tables are ordered the way you want before binding.

COMBINE CASES
bind_rows(…, .id = NULL) Returns tables one on top of the other as a single table. Set .id to a column name to add a column of the original table names (as pictured).

RELATIONAL DATA
Use a "Mutating Join" to join one table to columns from another, matching values with the rows that they correspond to. Each join retains a different combination of values from the tables.

left_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from y to x.

right_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join matching values from x to y.

inner_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain only rows with matches.

full_join(x, y, by = NULL, copy = FALSE, suffix = c(".x", ".y"), …, keep = FALSE, na_matches = "na") Join data. Retain all values, all rows.

Use a "Filtering Join" to filter one table against the rows of another.

semi_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that have a match in y. Use to see what will be included in a join.

anti_join(x, y, by = NULL, copy = FALSE, …, na_matches = "na") Return rows of x that do not have a match in y. Use to see what will not be included in a join.

Use a "Nest Join" to inner join one table to another into a nested data frame.

nest_join(x, y, by = NULL, copy = FALSE, keep = FALSE, name = NULL, …) Join data, nesting matches from y in a single new data frame column.

COLUMN MATCHING FOR JOINS
Use by = c("col1", "col2", …) to specify one or more common columns to match on.
left_join(x, y, by = "A")
Result columns: A, B.x, C, B.y, D

Use a named vector, by = c("col1" = "col2"), to match on columns that have different names in each table.
left_join(x, y, by = c("C" = "D"))
Result columns: A.x, B.x, C, A.y, B.y

Use suffix to specify the suffix to give to unmatched columns that have the same name in both tables.
left_join(x, y, by = c("C" = "D"), suffix = c("1", "2"))
Result columns: A1, B1, C, A2, B2

SET OPERATIONS
intersect(x, y, …) Rows that appear in both x and y.
setdiff(x, y, …) Rows that appear in x but not y.
union(x, y, …) Rows that appear in x or y (duplicates removed). union_all() retains duplicates.
Use setequal() to test whether two data sets contain the exact same rows (in any order).
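The join behaviours can be compared side by side on the two small tables from the diagrams. The sketch below rebuilds x and y as tibbles (the exact column types are an assumption made here) and runs each join; it assumes dplyr 1.1 or later.

```r
library(dplyr)

# The example tables used in the join diagrams.
x <- tibble(A = c("a", "b", "c"), B = c("t", "u", "v"), C = c(1, 2, 3))
y <- tibble(A = c("a", "b", "d"), B = c("t", "u", "w"), D = c(3, 2, 1))

# Mutating joins: add columns from y to x.
left  <- left_join(x, y, by = c("A", "B"))   # all rows of x; D is NA for "c"
right <- right_join(x, y, by = c("A", "B"))  # all rows of y; C is NA for "d"
inner <- inner_join(x, y, by = c("A", "B"))  # only rows a and b
full  <- full_join(x, y, by = c("A", "B"))   # rows a, b, c, and d

# Filtering joins: keep or drop rows of x, no new columns.
semi <- semi_join(x, y, by = c("A", "B"))    # rows a and b
anti <- anti_join(x, y, by = c("A", "B"))    # row c

# Matching on A alone leaves B in both tables, so it gets suffixes.
suffixed <- left_join(x, y, by = "A")
names(suffixed)  # "A" "B.x" "C" "B.y" "D"
```

Running anti_join() before a mutating join is a quick way to see which rows would end up without a match.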

Data visualization with ggplot2 : : CHEAT SHEET

Basics
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same components: a data set, a coordinate system, and geoms, which are visual marks that represent data points.

To display values, map variables in the data to visual properties of the geom (aesthetics) like size, color, and x and y locations.

Complete the template below to build a graph. The ggplot() call and the geom function with its mappings are required; the remaining components are not required, and sensible defaults are supplied.
ggplot (data = <DATA> ) +
  <GEOM_FUNCTION> (mapping = aes( <MAPPINGS> ),
    stat = <STAT> , position = <POSITION> ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION> +
  <SCALE_FUNCTION> +
  <THEME_FUNCTION>

ggplot(data = mpg, aes(x = cty, y = hwy)) Begins a plot that you finish by adding layers to. Add one geom function per layer.

last_plot() Returns the last plot.

ggsave("plot.png", width = 5, height = 5) Saves last plot as a 5 in x 5 in file named "plot.png" in working directory. Matches file type to file extension.

Aes
Common aesthetic values.
color and fill - string ("red", "#RRGGBB")
linetype - integer or string (0 = "blank", 1 = "solid", 2 = "dashed", 3 = "dotted", 4 = "dotdash", 5 = "longdash", 6 = "twodash")
lineend - string ("round", "butt", or "square")
linejoin - string ("round", "mitre", or "bevel")
size - integer (line width in mm)
shape - integer/shape name or a single character ("a")

Geoms
Use a geom function to represent data points, use the geom’s aesthetic properties to represent variables. Each function returns a layer.

GRAPHICAL PRIMITIVES
a <- ggplot(economics, aes(date, unemploy))
b <- ggplot(seals, aes(x = long, y = lat))

a + geom_blank() and a + expand_limits() Ensure limits include values across all plots.
b + geom_curve(aes(yend = lat + 1, xend = long + 1), curvature = 1) - x, xend, y, yend, alpha, angle, color, curvature, linetype, size
a + geom_path(lineend = "butt", linejoin = "round", linemitre = 1) - x, y, alpha, color, group, linetype, size
a + geom_polygon(aes(alpha = 50)) - x, y, alpha, color, fill, group, subgroup, linetype, size
b + geom_rect(aes(xmin = long, ymin = lat, xmax = long + 1, ymax = lat + 1)) - xmax, xmin, ymax, ymin, alpha, color, fill, linetype, size
a + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900)) - x, ymax, ymin, alpha, color, fill, group, linetype, size

LINE SEGMENTS
common aesthetics: x, y, alpha, color, linetype, size
b + geom_abline(aes(intercept = 0, slope = 1))
b + geom_hline(aes(yintercept = lat))
b + geom_vline(aes(xintercept = long))
b + geom_segment(aes(yend = lat + 1, xend = long + 1))
b + geom_spoke(aes(angle = 1:1155, radius = 1))

ONE VARIABLE
continuous
c <- ggplot(mpg, aes(hwy)); c2 <- ggplot(mpg)
c + geom_area(stat = "bin") - x, y, alpha, color, fill, linetype, size
c + geom_density(kernel = "gaussian") - x, y, alpha, color, fill, group, linetype, size, weight
c + geom_dotplot() - x, y, alpha, color, fill
c + geom_freqpoly() - x, y, alpha, color, group, linetype, size
c + geom_histogram(binwidth = 5) - x, y, alpha, color, fill, linetype, size, weight
c2 + geom_qq(aes(sample = hwy)) - x, y, alpha, color, fill, linetype, size, weight

discrete
d <- ggplot(mpg, aes(fl))
d + geom_bar() - x, alpha, color, fill, linetype, size, weight

TWO VARIABLES
both continuous
e <- ggplot(mpg, aes(cty, hwy))
e + geom_label(aes(label = cty), nudge_x = 1, nudge_y = 1) - x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust
e + geom_point() - x, y, alpha, color, fill, shape, size, stroke
e + geom_quantile() - x, y, alpha, color, group, linetype, size, weight
e + geom_rug(sides = "bl") - x, y, alpha, color, linetype, size
e + geom_smooth(method = lm) - x, y, alpha, color, fill, group, linetype, size, weight
e + geom_text(aes(label = cty), nudge_x = 1, nudge_y = 1) - x, y, label, alpha, angle, color, family, fontface, hjust, lineheight, size, vjust

one discrete, one continuous
f <- ggplot(mpg, aes(class, hwy))
f + geom_col() - x, y, alpha, color, fill, group, linetype, size
f + geom_boxplot() - x, y, lower, middle, upper, ymax, ymin, alpha, color, fill, group, linetype, shape, size, weight
f + geom_dotplot(binaxis = "y", stackdir = "center") - x, y, alpha, color, fill, group
f + geom_violin(scale = "area") - x, y, alpha, color, fill, group, linetype, size, weight

both discrete
g <- ggplot(diamonds, aes(cut, color))
g + geom_count() - x, y, alpha, color, fill, shape, size, stroke
e + geom_jitter(height = 2, width = 2) - x, y, alpha, color, fill, shape, size

continuous bivariate distribution
h <- ggplot(diamonds, aes(carat, price))
h + geom_bin2d(binwidth = c(0.25, 500)) - x, y, alpha, color, fill, linetype, size, weight
h + geom_density_2d() - x, y, alpha, color, group, linetype, size
h + geom_hex() - x, y, alpha, color, fill, size

continuous function
i <- ggplot(economics, aes(date, unemploy))
i + geom_area() - x, y, alpha, color, fill, linetype, size
i + geom_line() - x, y, alpha, color, group, linetype, size
i + geom_step(direction = "hv") - x, y, alpha, color, group, linetype, size

visualizing error
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
j <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))
j + geom_crossbar(fatten = 2) - x, y, ymax, ymin, alpha, color, fill, group, linetype, size
j + geom_errorbar() - x, ymax, ymin, alpha, color, group, linetype, size, width Also geom_errorbarh().
j + geom_linerange() - x, ymin, ymax, alpha, color, group, linetype, size
j + geom_pointrange() - x, y, ymin, ymax, alpha, color, fill, group, linetype, shape, size

maps
data <- data.frame(murder = USArrests$Murder,
  state = tolower(rownames(USArrests)))
map <- map_data("state")
k <- ggplot(data, aes(fill = murder))
k + geom_map(aes(map_id = state), map = map)
  + expand_limits(x = map$long, y = map$lat) - map_id, alpha, color, fill, linetype, size

THREE VARIABLES
seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2)); l <- ggplot(seals, aes(long, lat))
l + geom_contour(aes(z = z)) - x, y, z, alpha, color, group, linetype, size, weight
l + geom_contour_filled(aes(fill = z)) - x, y, alpha, color, fill, group, linetype, size, subgroup
l + geom_raster(aes(fill = z), hjust = 0.5, vjust = 0.5, interpolate = FALSE) - x, y, alpha, fill
l + geom_tile(aes(fill = z)) - x, y, alpha, color, fill, linetype, size, width
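The graph template can be filled in with the mpg data set that the sheet uses throughout. The sketch below assumes ggplot2 is installed and a PNG graphics device is available; mapping color to drv and writing to a temporary file are choices made here for illustration.

```r
library(ggplot2)

# Required parts of the template: data, a geom function, and mappings.
p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point(aes(color = drv)) +                            # layer 1: points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)   # layer 2: linear fit

# Optional parts: a facet function and a theme function.
p_faceted <- p + facet_wrap(vars(drv)) + theme_minimal()

# Save a specific plot as a 5 x 5 inch PNG; the type follows the extension.
out <- tempfile(fileext = ".png")
ggsave(out, plot = p_faceted, width = 5, height = 5)
```

Passing plot = explicitly to ggsave() avoids relying on last_plot(), which matters when several plots are built before any of them is displayed.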

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08







Stats An alternative way to build a layer. Scales Override defaults with scales package. Coordinate Systems Faceting
A stat builds new variables to plot (e.g., count, prop). Scales map data values to the visual values of an r <- d + geom_bar() Facets divide a plot into
fl cty cyl aesthetic. To change a mapping, add a new scale. r + coord_cartesian(xlim = c(0, 5)) - xlim, ylim subplots based on the
n <- d + geom_bar(aes(fill = fl)) The default cartesian coordinate system. values of one or more

+ =
x ..count..
discrete variables.
aesthetic prepackaged scale-specific r + coord_fixed(ratio = 1/2)
scale_ to adjust scale to use arguments ratio, xlim, ylim - Cartesian coordinates with t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
data stat geom coordinate plot
x=x· system n + scale_fill_manual( fixed aspect ratio between x and y units.
y = ..count.. values = c("skyblue", "royalblue", "blue", "navy"), t + facet_grid(cols = vars(fl))
Visualize a stat by changing the default stat of a geom limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", “r"), ggplot(mpg, aes(y = fl)) + geom_bar() Facet into columns based on fl.
name = "fuel", labels = c("D", "E", "P", "R")) Flip cartesian coordinates by switching
function, geom_bar(stat="count") or by using a stat
x and y aesthetic mappings. t + facet_grid(rows = vars(year))
function, stat_count(geom="bar"), which calls a default range of title to use in labels to use breaks to use in
values to include legend/axis in legend/axis legend/axis Facet into rows based on year.
geom to make a layer (equivalent to a geom function). in mapping
Use ..name.. syntax to map stat variables to aesthetics. r + coord_polar(theta = "x", direction=1)
theta, start, direction - Polar coordinates. t + facet_grid(rows = vars(year), cols = vars(fl))
GENERAL PURPOSE SCALES Facet into both rows and columns.
geom to use stat function geommappings r + coord_trans(y = “sqrt") - x, y, xlim, ylim t + facet_wrap(vars(fl))
Use with most aesthetics Transformed cartesian coordinates. Set xtrans
i + stat_density_2d(aes(fill = ..level..), Wrap facets into a rectangular layout.
scale_*_continuous() - Map cont’ values to visual ones. and ytrans to the name of a window function.
geom = "polygon")
variable created by stat scale_*_discrete() - Map discrete values to visual ones. Set scales to let axis limits vary across facets.
scale_*_binned() - Map continuous values to discrete bins. π + coord_quickmap()
60
π + coord_map(projection = "ortho", orientation t + facet_grid(rows = vars(drv), cols = vars(fl),
c + stat_bin(binwidth = 1, boundary = 10) scale_*_identity() - Use data values as visual ones. = c(41, -74, 0)) - projection, xlim, ylim scales = "free")

lat
x, y | ..count.., ..ncount.., ..density.., ..ndensity.. scale_*_manual(values = c()) - Map discrete values to Map projections from the mapproj package x and y axis limits adjust to individual facets:
manually chosen visual ones.
c + stat_count(width = 1) x, y | ..count.., ..prop.. long
(mercator (default), azequalarea, lagrange, etc.). "free_x" - x axis limits adjust
scale_*_date(date_labels = "%m/%d"), "free_y" - y axis limits adjust
c + stat_density(adjust = 1, kernel = "gaussian") date_breaks = "2 weeks") - Treat data values as dates.
x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = T)
scale_*_datetime() - Treat data values as date times.
Same as scale_*_date(). See ?strptime for label formats.
Position Adjustments Set labeller to adjust facet label:
t + facet_grid(cols = vars(fl), labeller = label_both)
x, y, fill | ..count.., ..density.. Position adjustments determine how to arrange geoms fl: c fl: d fl: e fl: p fl: r
X & Y LOCATION SCALES that would otherwise occupy the same space.
e + stat_bin_hex(bins = 30) x, y, fill | ..count.., ..density.. t + facet_grid(rows = vars(fl),
Use with x or y aesthetics (x shown here) s <- ggplot(mpg, aes(fl, fill = drv)) labeller = label_bquote(alpha ^ .(fl)))
e + stat_density_2d(contour = TRUE, n = 100)
x, y, color, size | ..level.. scale_x_log10() - Plot x on log10 scale. ↵c ↵d ↵e ↵p ↵r
scale_x_reverse() - Reverse the direction of the x axis. s + geom_bar(position = "dodge")
e + stat_ellipse(level = 0.95, segments = 51, type = "t") scale_x_sqrt() - Plot x on square root scale. Arrange elements side by side.
l + stat_contour(aes(z = z)) x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max)
    x, y, z, fill | ..value..
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean)
    x, y, z, fill | ..value..
f + stat_boxplot(coef = 1.5)
    x, y | ..lower.., ..middle.., ..upper.., ..width.., ..ymin.., ..ymax..
f + stat_ydensity(kernel = "gaussian", scale = "area")
    x, y | ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width..
e + stat_ecdf(n = 40)
    x, y | ..x.., ..y..
e + stat_quantile(quantiles = c(0.1, 0.9), formula = y ~ log(x), method = "rq")
    x, y | ..quantile..
e + stat_smooth(method = "lm", formula = y ~ x, se = T, level = 0.95)
    x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + xlim(-5, 5) + stat_function(fun = dnorm, n = 20, geom = "point")
    x | ..x.., ..y..
ggplot() + stat_qq(aes(sample = 1:100))
    x, y, sample | ..sample.., ..theoretical..
e + stat_sum()
    x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun = "mean", geom = "bar")
e + stat_identity()
e + stat_unique()

COLOR AND FILL SCALES (DISCRETE)
n + scale_fill_brewer(palette = "Blues")
    For palette choices: RColorBrewer::display.brewer.all()
n + scale_fill_grey(start = 0.2, end = 0.8, na.value = "red")

COLOR AND FILL SCALES (CONTINUOUS)
o <- c + geom_dotplot(aes(fill = ..x..))
o + scale_fill_distiller(palette = "Blues")
o + scale_fill_gradient(low = "red", high = "yellow")
o + scale_fill_gradient2(low = "red", high = "blue", mid = "white", midpoint = 25)
o + scale_fill_gradientn(colors = topo.colors(6))
    Also: rainbow(), heat.colors(), terrain.colors(), cm.colors(), RColorBrewer::brewer.pal()

SHAPE AND SIZE SCALES
p <- e + geom_point(aes(shape = fl, size = cyl))
p + scale_shape() + scale_size()
p + scale_shape_manual(values = c(3:7))
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6)

s + geom_bar(position = "fill")
    Stack elements on top of one another, normalize height.
e + geom_point(position = "jitter")
    Add random noise to X and Y position of each element to avoid overplotting.
e + geom_label(position = "nudge")
    Nudge labels away from points.
s + geom_bar(position = "stack")
    Stack elements on top of one another.
Each position adjustment can be recast as a function with manual width and height arguments:
s + geom_bar(position = position_dodge(width = 1))

Themes
r + theme_bw() White background with grid lines.
r + theme_gray() Grey background (default theme).
r + theme_dark() Dark for contrast.
r + theme_classic()
r + theme_light()
r + theme_linedraw()
r + theme_minimal() Minimal theme.
r + theme_void() Empty theme.
r + theme() Customize aspects of the theme such as axis, legend, panel, and facet properties.
r + ggtitle("Title") + theme(plot.title.position = "plot")
r + theme(panel.background = element_rect(fill = "blue"))

Labels and Legends
Use labs() to label the elements of your plot.
t + labs(x = "New x axis label", y = "New y axis label",
    title = "Add a title above the plot",
    subtitle = "Add a subtitle below title",
    caption = "Add a caption below plot",
    alt = "Add alt text to the plot",
    <aes> = "New <aes> legend title")
t + annotate(geom = "text", x = 8, y = 9, label = "A")
    Places a geom with manually selected aesthetics.
p + guides(x = guide_axis(n.dodge = 2))
    Avoid crowded or overlapping labels with guide_axis(n.dodge or angle).
n + guides(fill = "none")
    Set legend type for each aesthetic: colorbar, legend, or none (no legend).
n + theme(legend.position = "bottom")
    Place legend at "bottom", "top", "left", or "right".
n + scale_fill_discrete(name = "Title", labels = c("A", "B", "C", "D", "E"))
    Set legend title and labels with a scale function.

Zooming
Without clipping (preferred):
t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
With clipping (removes unseen data points):
t + xlim(0, 100) + ylim(10, 20)
t + scale_x_continuous(limits = c(0, 100)) + scale_y_continuous(limits = c(0, 100))
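The two zooming approaches can be compared side by side. The following is a minimal sketch, assuming the ggplot2 package is installed and using its bundled mpg data set; the object names (base_plot, zoomed, clipped, styled) are illustrative only and do not come from the cheat sheet.

```r
library(ggplot2)

# Base scatterplot on the mpg data set that ships with ggplot2.
base_plot <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  labs(x = "City mpg", y = "Highway mpg", title = "Zooming demo")

# Zoom without clipping: every row stays available to stats and geoms,
# only the visible window changes.
zoomed <- base_plot + coord_cartesian(xlim = c(10, 20), ylim = c(15, 30))

# Zoom with clipping: values outside the scale limits are turned into NA
# before anything is drawn, so they also drop out of fitted summaries.
clipped <- base_plot + xlim(10, 20) + ylim(15, 30)

# Restyle the zoomed plot with a complete theme plus one theme() tweak.
styled <- zoomed + theme_minimal() + theme(legend.position = "bottom")
```

Printing zoomed and clipped in an interactive session shows the difference; the clipped version typically warns about removed rows.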

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08

Data tidying with tidyr : : CHEAT SHEET

Tidy data is a way to organize tabular data in a consistent data structure across packages.
A table is tidy if:
• Each variable is in its own column
• Each observation, or case, is in its own row
Tidy data lets you access variables as vectors and preserves cases in vectorized operations.

Tibbles
AN ENHANCED DATA FRAME
Tibbles are a table format provided by the tibble package. They inherit the data frame class, but have improved behaviors:
• Subset a new tibble with [, a vector with [[ and $.
• No partial matching when subsetting columns.
• Display concise views of the data on one screen.
options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf) Control default display settings.
View() or glimpse() View the entire data set.

CONSTRUCT A TIBBLE
tibble(…) Construct by columns.
    tibble(x = 1:3, y = c("a", "b", "c"))
tribble(…) Construct by rows.
    tribble(~x, ~y,
             1, "a",
             2, "b",
             3, "c")
Both make this tibble:
    A tibble: 3 × 2
          x y
      <int> <chr>
    1     1 a
    2     2 b
    3     3 c
as_tibble(x, …) Convert a data frame to a tibble.
enframe(x, name = "name", value = "value") Convert a named vector to a tibble. Also deframe().
is_tibble(x) Test whether x is a tibble.

Reshape Data - Pivot data to reorganize values into a new layout.
pivot_longer(data, cols, names_to = "name", values_to = "value", values_drop_na = FALSE)
"Lengthen" data by collapsing several columns into two. Column names move to a new names_to column and values to a new values_to column.
    pivot_longer(table4a, cols = 2:3, names_to = "year", values_to = "cases")

    table4a                      result
    country  1999  2000          country  year  cases
    A        0.7K  2K            A        1999  0.7K
    B        37K   80K           B        1999  37K
    C        212K  213K          C        1999  212K
                                 A        2000  2K
                                 B        2000  80K
                                 C        2000  213K

pivot_wider(data, names_from = "name", values_from = "value")
The inverse of pivot_longer(). "Widen" data by expanding two columns into several. One column provides the new column names, the other the values.
    pivot_wider(table2, names_from = type, values_from = count)

    table2                            result
    country  year  type   count       country  year  cases  pop
    A        1999  cases  0.7K        A        1999  0.7K   19M
    A        1999  pop    19M         A        2000  2K     20M
    A        2000  cases  2K          B        1999  37K    172M
    A        2000  pop    20M         B        2000  80K    174M
    B        1999  cases  37K         C        1999  212K   1T
    B        1999  pop    172M        C        2000  213K   1T
    B        2000  cases  80K
    B        2000  pop    174M
    C        1999  cases  212K
    C        1999  pop    1T
    C        2000  cases  213K
    C        2000  pop    1T

Split Cells - Use these functions to split or combine cells into individual, isolated values.
unite(data, col, …, sep = "_", remove = TRUE, na.rm = FALSE) Collapse cells across several columns into a single column.
    unite(table5, century, year, col = "year", sep = "")

    table5                       result
    country  century  year       country  year
    A        19       99         A        1999
    A        20       00         A        2000
    B        19       99         B        1999
    B        20       00         B        2000

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", …) Separate each cell in a column into several columns. Also extract().
    separate(table3, rate, sep = "/", into = c("cases", "pop"))

    table3                       result
    country  year  rate          country  year  cases  pop
    A        1999  0.7K/19M      A        1999  0.7K   19M
    A        2000  2K/20M        A        2000  2K     20M
    B        1999  37K/172M      B        1999  37K    172M
    B        2000  80K/174M      B        2000  80K    174M

separate_rows(data, …, sep = "[^[:alnum:].]+", convert = FALSE) Separate each cell in a column into several rows.
    separate_rows(table3, rate, sep = "/")

    result
    country  year  rate
    A        1999  0.7K
    A        1999  19M
    A        2000  2K
    A        2000  20M
    B        1999  37K
    B        1999  172M
    B        2000  80K
    B        2000  174M

Expand Tables
Create new combinations of variables or identify implicit missing values (combinations of variables not present in the data).
expand(data, …) Create a new tibble with all possible combinations of the values of the variables listed in … Drop other variables.
    expand(mtcars, cyl, gear, carb)

    x                  result
    x1  x2  x3         x1  x2
    A   1   3          A   1
    B   1   4          A   2
    B   2   3          B   1
                       B   2

complete(data, …, fill = list()) Add missing possible combinations of values of variables listed in … Fill remaining variables with NA.
    complete(mtcars, cyl, gear, carb)

    x                  result
    x1  x2  x3         x1  x2  x3
    A   1   3          A   1   3
    B   1   4          A   2   NA
    B   2   3          B   1   4
                       B   2   3

Handle Missing Values
Drop or replace explicit missing values (NA).
drop_na(data, …) Drop rows containing NA's in … columns.
    drop_na(x, x2)

    x            result
    x1  x2       x1  x2
    A   1        A   1
    B   NA       D   3
    C   NA
    D   3
    E   NA

fill(data, …, .direction = "down") Fill in NA's in … columns using the next or previous value.
    fill(x, x2)

    result
    x1  x2
    A   1
    B   1
    C   1
    D   3
    E   3

replace_na(data, replace) Specify a value to replace NA in selected columns.
    replace_na(x, list(x2 = 2))

    result
    x1  x2
    A   1
    B   2
    C   2
    D   3
    E   2
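The reshaping and missing-value helpers can be tried on a small hand-built table. This is a sketch under stated assumptions: tidyr and tibble are installed, and the data (cases_wide, gaps) are made-up stand-ins rather than the table4a data set shown in the diagrams.

```r
library(tidyr)

# Hypothetical wide table: one column of case counts per year.
cases_wide <- tibble::tibble(
  country = c("A", "B", "C"),
  `1999`  = c(700, 37000, 212000),
  `2000`  = c(2000, 80000, 213000)
)

# Lengthen: the year column names move into "year", the counts into "cases".
cases_long <- pivot_longer(cases_wide, cols = 2:3,
                           names_to = "year", values_to = "cases")

# Widen again: pivot_wider() reverses the step above.
cases_back <- pivot_wider(cases_long, names_from = year, values_from = cases)

# Hypothetical table with explicit missing values.
gaps <- tibble::tibble(x1 = c("A", "B", "C", "D", "E"),
                       x2 = c(1, NA, NA, 3, NA))

dropped  <- drop_na(gaps, x2)                # keeps rows A and D
filled   <- fill(gaps, x2)                   # carries the last value downward
replaced <- replace_na(gaps, list(x2 = 2))   # substitutes 2 for every NA
```

Note that year ends up as a character column after pivot_longer(), because it is built from column names.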

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at tidyr.tidyverse.org • tibble 3.2.1 • tidyr 1.3.0 • Updated: 2023–05


Nested Data
A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types.
Use a nested data frame to:
• Preserve relationships between observations and subsets of data. Preserve the type of the variables being nested (factors and datetimes aren't coerced to character).
• Manipulate many sub-tables at once with purrr functions like map(), map2(), or pmap() or with dplyr rowwise() grouping.
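
As a rough illustration of what a nested data frame looks like in practice, the sketch below nests the built-in mtcars data by cylinder count. It assumes dplyr and tidyr are installed and R 4.1 or later for the |> pipe; the names by_cyl, first_group, and flat are illustrative.

```r
library(dplyr)
library(tidyr)

# One row per cyl value; the remaining columns are packed into a
# list-column named "data" that holds one tibble per group.
by_cyl <- mtcars |>
  group_by(cyl) |>
  nest()

# Each "cell" of the list-column is itself a tibble; index it with [[ ]].
first_group <- by_cyl$data[[1]]

# unnest() flattens the list-column back into regular columns.
flat <- by_cyl |> unnest(data)
```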

CREATE NESTED DATA
nest(data, …) Moves groups of cells into a list-column of a data frame. Use alone or with dplyr::group_by():
1. Group the data frame with group_by() and use nest() to move the groups into a list-column.
    n_storms <- storms |>
      group_by(name) |>
      nest()
2. Use nest(new_col = c(x, y)) to specify the columns to group using dplyr::select() syntax.
    n_storms <- storms |>
      nest(data = c(year:long))
Index list-columns with [[]].
    n_storms$data[[1]]

CREATE TIBBLES WITH LIST-COLUMNS
tibble::tribble(…) Makes list-columns when needed.
    tribble( ~max, ~seq,
                3,  1:3,
                4,  1:4,
                5,  1:5)

    max  seq
    3    <int [3]>
    4    <int [4]>
    5    <int [5]>
tibble::tibble(…) Saves list input as list-columns.
    tibble(max = c(3, 4, 5), seq = list(1:3, 1:4, 1:5))
tibble::enframe(x, name="name", value="value") Converts multi-level list to a tibble with list-cols.
    enframe(list('3'=1:3, '4'=1:4, '5'=1:5), 'max', 'seq')

OUTPUT LIST-COLUMNS FROM OTHER FUNCTIONS
dplyr::mutate(), transmute(), and summarise() will output list-columns if they return a list.
    mtcars |>
      group_by(cyl) |>
      summarise(q = list(quantile(mpg)))

RESHAPE NESTED DATA
unnest(data, cols, ..., keep_empty = FALSE) Flatten nested columns back to regular columns. The inverse of nest().
    n_storms |> unnest(data)
unnest_longer(data, col, values_to = NULL, indices_to = NULL) Turn each element of a list-column into a row.
    starwars |>
      select(name, films) |>
      unnest_longer(films)
unnest_wider(data, col) Turn each element of a list-column into a regular column.
    starwars |>
      select(name, films) |>
      unnest_wider(films, names_sep = "_")
hoist(.data, .col, ..., .remove = TRUE) Selectively pull list components out into their own top-level columns. Uses purrr::pluck() syntax for selecting from lists.
    starwars |>
      select(name, films) |>
      hoist(films, first_film = 1, second_film = 2)

TRANSFORM NESTED DATA
A vectorized function takes a vector, transforms each element in parallel, and returns a vector of the same length. By themselves vectorized functions cannot work with lists, such as list-columns.
dplyr::rowwise(.data, …) Group data so that each row is one group, and within the groups, elements of list-columns appear directly (accessed with [[ ), not as lists of length one. When you use rowwise(), dplyr functions will seem to apply functions to list-columns in a vectorized fashion.

Apply a function to a list-column and create a new list-column.
    n_storms |>
      rowwise() |>
      mutate(n = list(dim(data)))
    dim() returns two values per row; wrap with list to tell mutate to create a list-column.
Apply a function to a list-column and create a regular column.
    n_storms |>
      rowwise() |>
      mutate(n = nrow(data))
    nrow() returns one integer per row.
Collapse multiple list-columns into a single list-column.
    starwars |>
      rowwise() |>
      mutate(transport = list(append(vehicles, starships)))
    append() returns a list for each row, so col type must be list.
Apply a function to multiple list-columns.
    starwars |>
      rowwise() |>
      mutate(n_transports = length(c(vehicles, starships)))
    length() returns one integer per row.
See purrr package for more list functions.
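
The rowwise() pattern can be sketched on mtcars instead of the storms data. This assumes dplyr and tidyr are installed; per_group and the column names n and dims are illustrative choices, not part of the original sheet.

```r
library(dplyr)
library(tidyr)

# Nest everything except cyl into a list-column called "data".
nested <- mtcars |> nest(data = -cyl)

per_group <- nested |>
  rowwise() |>
  mutate(
    n    = nrow(data),       # one integer per row, so a regular column
    dims = list(dim(data))   # two values per row, so wrap in list()
  ) |>
  ungroup()                  # drop the rowwise grouping when done
```

Without rowwise(), nrow(data) would be applied to the whole list-column at once rather than to each nested tibble.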

Data Science in Spark with sparklyr : : CHEAT SHEET

Intro
sparklyr is an R interface for Apache Spark™. It enables us to write all of our analysis code in R, but have the actual processing happen inside Spark clusters. Easily manipulate and model large-scale data using R and Spark via sparklyr.

Workflow (after R for Data Science, Grolemund & Wickham):
Import: • From R (copy_to()) • Read a file (spark_read_) • Read Hive table (tbl())
Wrangle: • dplyr verb • tidyr commands • Feature transformer (ft_) • Direct Spark SQL (DBI)
Model: • Spark MLlib (ml_) • H2O Extension
Visualize: • Collect result, plot in R
Communicate: Collect results into R, share using RMarkdown

Import
Import data into Spark, not R

READ A FILE INTO SPARK
Arguments that apply to all functions:
sc, name, path, options=list(), repartition=0, memory=TRUE, overwrite=TRUE
CSV      spark_read_csv(header = TRUE, columns=NULL, infer_schema=TRUE, delimiter = ",", quote= "\"", escape = "\\", charset = "UTF-8", null_value = NULL)
JSON     spark_read_json()
PARQUET  spark_read_parquet()
TEXT     spark_read_text()
ORC      spark_read_orc()
LIBSVM   spark_read_libsvm()
DELTA    spark_read_delta()
AVRO     spark_read_avro()

R DATA FRAME INTO SPARK
dplyr::copy_to(dest, df, name)
Apache Arrow accelerates data transfer between R and Spark. To use, simply load the library
    library(sparklyr)
    library(arrow)

FROM A TABLE IN HIVE
dplyr::tbl(src, …) - Creates a reference to the table without loading it into memory

Wrangle
DPLYR VERBS
Translates into Spark SQL statements
    copy_to(sc, mtcars) %>%
      mutate(trm = ifelse(am == 0, "auto", "man")) %>%
      group_by(trm) %>%
      summarise_all(mean)

TIDYR
pivot_longer() - Collapse several columns into two.
pivot_wider() - Expand two columns into several.
nest() / unnest() - Convert groups of cells into list-columns, and vice versa.
unite() / separate() - Split a single column into several columns, and vice versa.
fill() - Fill NA with the previous value

FEATURE TRANSFORMERS
ft_binarizer() - Assigned values based on threshold
ft_bucketizer() - Numeric column to discretized column
ft_count_vectorizer() - Extracts a vocabulary from document
ft_discrete_cosine_transform() - 1D discrete cosine transform of a real vector
ft_elementwise_product() - Element-wise product between 2 cols
ft_hashing_tf() - Maps a sequence of terms to their term frequencies using the hashing trick.
ft_idf() - Compute the Inverse Document Frequency (IDF) given a collection of documents.
ft_imputer() - Imputation estimator for completing missing values, uses the mean or the median of the columns.
ft_index_to_string() - Index labels back to label as strings
ft_interaction() - Takes in Double and Vector columns and outputs a flattened vector of their feature interactions.
ft_max_abs_scaler() - Rescale each feature individually to range [-1, 1]
ft_min_max_scaler() - Rescale each feature to a common range [min, max] linearly
ft_ngram() - Converts the input array of strings into an array of n-grams
ft_bucketed_random_projection_lsh() | ft_minhash_lsh() - Locality Sensitive Hashing functions for Euclidean distance and Jaccard distance (MinHash)
ft_normalizer() - Normalize a vector to have unit norm using the given p-norm
ft_one_hot_encoder() - Continuous to binary vectors
ft_pca() - Project vectors to a lower dimensional space of top k principal components.
ft_quantile_discretizer() - Continuous to binned categorical values.
ft_regex_tokenizer() - Extracts tokens either by using the provided regex pattern to split the text.
ft_robust_scaler() - Removes the median and scales according to standard scale: (x - median) / (p75 - p25)
ft_standard_scaler() - Removes the mean and scaling to unit variance using column summary statistics
ft_stop_words_remover() - Filters out stop words from input
ft_string_indexer() - Column of labels into a column of label indices.
ft_tokenizer() - Converts to lowercase and then splits it by white spaces
ft_vector_assembler() - Combine vectors into single row-vector
ft_vector_indexer() - Indexing categorical feature columns in a dataset of Vector
ft_vector_slicer() - Takes a feature vector and outputs a new feature vector with a subarray of the original features
ft_word2vec() - Word2Vec transforms a word into a code

Visualize
DPLYR + GGPLOT2
Summarize in Spark, collect the results in R, then create the plot:
    copy_to(sc, mtcars) %>%
      group_by(cyl) %>%
      summarise(mpg_m = mean(mpg)) %>%
      collect() %>%
      ggplot() +
      geom_col(aes(cyl, mpg_m))
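
Because sparklyr translates dplyr verbs into Spark SQL, a pipeline can be prototyped on a local data frame before it is pointed at a cluster. The sketch below is only that local prototype, assuming dplyr is installed; it uses across() in place of summarise_all(), and local_summary is an illustrative name. With a live connection sc, the assumption is that mtcars would be replaced by copy_to(sc, mtcars) and the result pulled back with collect().

```r
library(dplyr)

# Local prototype of the cheat sheet's wrangling pipeline.
local_summary <- mtcars %>%
  mutate(trm = ifelse(am == 0, "auto", "man")) %>%   # label the transmission type
  group_by(trm) %>%
  summarise(across(everything(), mean))              # mean of every other column

# Summarise per cylinder count, the shape that would be handed to ggplot().
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mpg_m = mean(mpg))
```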

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at spark.rstudio.com • sparklyr 1.7.5 • Updated: 2022-02
Modeling
REGRESSION
ml_linear_regression() - Linear regression.
ml_aft_survival_regression() - Parametric survival regression model named accelerated failure time (AFT) model
ml_generalized_linear_regression() - GLM
ml_isotonic_regression() - Currently implemented using parallelized pool adjacent violators algorithm. Only univariate (single feature) algorithm supported
ml_random_forest_regressor() - Regression using random forests.

CLASSIFICATION
ml_linear_svc() - Classification using linear support vector machines
ml_logistic_regression() - Logistic regression
ml_multilayer_perceptron_classifier() - Classification model based on the Multilayer Perceptron.
ml_naive_bayes() - It supports Multinomial NB which can handle finitely supported discrete data
ml_one_vs_rest() - Reduction of Multiclass Classification to Binary Classification. Performs reduction using one against all strategy.

TREE
ml_decision_tree_classifier() | ml_decision_tree() | ml_decision_tree_regressor() - Classification and regression using decision trees
ml_gbt_classifier() | ml_gradient_boosted_trees() | ml_gbt_regressor() - Binary classification and regression using gradient boosted trees
ml_random_forest_classifier() - Classification and regression using random forests.
ml_feature_importances() | ml_tree_feature_importance() - Feature Importance for Tree Models

CLUSTERING
ml_bisecting_kmeans() - A bisecting k-means algorithm based on the paper
ml_lda() | ml_describe_topics() | ml_log_likelihood() | ml_log_perplexity() | ml_topics_matrix() - LDA topic model designed for text documents.
ml_gaussian_mixture() - Expectation maximization for multivariate Gaussian Mixture Models (GMMs)
ml_kmeans() | ml_compute_cost() | ml_compute_silhouette_measure() - Clustering with support for k-means
ml_power_iteration() - For clustering vertices of a graph given pairwise similarities as edge properties.

FEATURE
ml_chisquare_test(x,features,label) - Pearson's independence test for every feature against the label
ml_default_stop_words() - Loads the default stop words for the given language

STATS
ml_summary() - Extracts a metric from the summary object of a Spark ML model
ml_corr() - Compute correlation matrix

RECOMMENDATION
ml_als() | ml_recommend() - Recommendation using Alternating Least Squares matrix factorization

EVALUATION
ml_clustering_evaluator() - Evaluator for clustering
ml_evaluate() - Compute performance metrics
ml_binary_classification_evaluator() | ml_binary_classification_eval() | ml_classification_eval() - A set of functions to calculate performance metrics for prediction models.

FREQUENT PATTERN
ml_fpgrowth() | ml_association_rules() | ml_freq_itemsets() - A parallel FP-growth algorithm to mine frequent itemsets.
ml_freq_seq_patterns() | ml_prefixspan() - PrefixSpan algorithm for mining frequent itemsets.

UTILITIES
ml_call_constructor() - Identifies the associated sparklyr ML constructor for the JVM
ml_model_data() - Extracts data associated with a Spark ML model
ml_standardize_formula() - Generates a formula string from user inputs, to be used in `ml_model` constructor
ml_uid() - Extracts the UID of an ML object.

ML Pipelines
Easily create formal Spark Pipeline models using R. Save the Pipeline in native Scala. The saved model will have no dependencies on R.

INITIALIZE AND TRAIN
ml_pipeline() - Initializes a new Spark Pipeline
ml_fit() - Trains the model, outputs a Spark Pipeline Model.

SAVE AND RETRIEVE
ml_save() - Saves into a format that can be read by Scala and PySpark.
ml_read() - Reads Spark object into sparklyr.

SQL AND DPLYR
ft_sql_transformer() - Creates a Pipeline step based on the SQL statement passed to the command.
ft_dplyr_transformer() - Creates a Pipeline step based on one or several dplyr commands.

Example pipeline: ml_pipeline() → ft_dplyr_transformer() → ft_bucketizer() → ml_linear_regression() → ml_fit() → ml_save()
More Info: spark.rstudio.com/guides/pipelines

Sessions
YARN CLIENT
1. Install RStudio Server on an edge node
2. Locate path to the cluster's Spark Home Directory, it normally is "/usr/lib/spark"
3. Basic configuration example
    conf <- spark_config()
    conf$spark.executor.memory <- "300M"
    conf$spark.executor.cores <- 2
    conf$spark.executor.instances <- 3
    conf$spark.dynamicAllocation.enabled <- "false"
4. Open a connection
    sc <- spark_connect(master = "yarn",
      spark_home = "/usr/lib/spark/",
      version = "2.1.0", config = conf)

YARN CLUSTER
1. Make sure to have copies of the yarn-site.xml and hive-site.xml files in the RStudio Server
2. Point environment variables to the correct paths
    Sys.setenv(JAVA_HOME="[Path]")
    Sys.setenv(SPARK_HOME ="[Path]")
    Sys.setenv(YARN_CONF_DIR ="[Path]")
3. Open a connection
    sc <- spark_connect(master = "yarn-cluster")

STANDALONE CLUSTER
1. Install RStudio Server on one of the existing nodes or a server in the same LAN
2. Open a connection
    spark_connect(master="spark://host:port",
      version = "2.0.1",
      spark_home = [path to Spark])

LOCAL MODE
No cluster required. Use for learning purposes only
1. Install a local version of Spark: spark_install()
2. Open a connection
    sc <- spark_connect(master="local")

KUBERNETES
1. Use the following to obtain the Host and Port
    system2("kubectl", "cluster-info")
2. Open a connection
    sc <- spark_connect(config =
      spark_config_kubernetes(
        "k8s://https://[HOST]:[PORT]",
        account = "default",
        image = "docker.io/owner/repo:version"))

CLOUD
Databricks - spark_connect(method = "databricks")
Qubole - spark_connect(method = "qubole")

spark.rstudio.com therinspark.com
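
The YARN client steps can be collected into one script. This is a sketch, assuming the sparklyr package is installed; the wrapper connect_yarn_client() is a hypothetical helper introduced here for illustration, and it is only defined, not called, because it needs a reachable cluster. The memory, core, and instance values are the sheet's example numbers, not recommendations.

```r
library(sparklyr)

# Build the configuration object first; this step needs no running cluster.
conf <- spark_config()
conf$spark.executor.memory <- "300M"
conf$spark.executor.cores <- 2
conf$spark.executor.instances <- 3
conf$spark.dynamicAllocation.enabled <- "false"

# Hypothetical helper: opens a YARN client connection with the config above.
# spark_home defaults to the usual location named in the sheet and is an
# assumption about the cluster layout.
connect_yarn_client <- function(config, spark_home = "/usr/lib/spark/") {
  spark_connect(master = "yarn", spark_home = spark_home, config = config)
}

# Typical use on an edge node (not run here):
# sc <- connect_yarn_client(conf)
# spark_disconnect(sc)
```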
String manipulation with stringr : : CHEAT SHEET
The stringr package provides a set of internally consistent tools for working with character strings, i.e. sequences of characters surrounded by quotation marks.

Detect Matches
str_detect(string, pattern, negate = FALSE) Detect the presence of a pattern match in a string. Also str_like().
    str_detect(fruit, "a")
str_starts(string, pattern, negate = FALSE) Detect the presence of a pattern match at the beginning of a string. Also str_ends().
    str_starts(fruit, "a")
str_which(string, pattern, negate = FALSE) Find the indexes of strings that contain a pattern match.
    str_which(fruit, "a")
str_locate(string, pattern) Locate the positions of pattern matches in a string. Also str_locate_all().
    str_locate(fruit, "a")
str_count(string, pattern) Count the number of matches in a string.
    str_count(fruit, "a")

Subset Strings
str_sub(string, start = 1L, end = -1L) Extract substrings from a character vector.
    str_sub(fruit, 1, 3); str_sub(fruit, -2)
str_subset(string, pattern, negate = FALSE) Return only the strings that contain a pattern match.
    str_subset(fruit, "p")
str_extract(string, pattern) Return the first pattern match found in each string, as a vector. Also str_extract_all() to return every pattern match.
    str_extract(fruit, "[aeiou]")
str_match(string, pattern) Return the first pattern match found in each string, as a matrix with a column for each ( ) group in pattern. Also str_match_all().
    str_match(sentences, "(a|the) ([^ +])")

Manage Lengths
str_length(string) The width of strings (i.e. number of code points, which generally equals the number of characters).
    str_length(fruit)
str_pad(string, width, side = c("left", "right", "both"), pad = " ") Pad strings to constant width.
    str_pad(fruit, 17)
str_trunc(string, width, side = c("right", "left", "center"), ellipsis = "...") Truncate the width of strings, replacing content with ellipsis.
    str_trunc(sentences, 6)
str_trim(string, side = c("both", "left", "right")) Trim whitespace from the start and/or end of a string.
    str_trim(str_pad(fruit, 17))
str_squish(string) Trim whitespace from each end and collapse multiple spaces into single spaces.
    str_squish(str_pad(fruit, 17, "both"))

Mutate Strings
str_sub() <- value. Replace substrings by identifying the substrings with str_sub() and assigning into the results.
    str_sub(fruit, 1, 3) <- "str"
str_replace(string, pattern, replacement) Replace the first matched pattern in each string. Also str_remove().
    str_replace(fruit, "p", "-")
str_replace_all(string, pattern, replacement) Replace all matched patterns in each string. Also str_remove_all().
    str_replace_all(fruit, "p", "-")
str_to_lower(string, locale = "en") [1] Convert strings to lower case.
    str_to_lower(sentences)
str_to_upper(string, locale = "en") [1] Convert strings to upper case.
    str_to_upper(sentences)
str_to_title(string, locale = "en") [1] Convert strings to title case. Also str_to_sentence().
    str_to_title(sentences)

Join and Split
str_c(..., sep = "", collapse = NULL) Join multiple strings into a single string.
    str_c(letters, LETTERS)
str_flatten(string, collapse = "") Combines into a single string, separated by collapse.
    str_flatten(fruit, ", ")
str_dup(string, times) Repeat strings times times. Also str_unique() to remove duplicates.
    str_dup(fruit, times = 2)
str_split_fixed(string, pattern, n) Split a vector of strings into a matrix of substrings (splitting at occurrences of a pattern match). Also str_split() to return a list of substrings and str_split_i() to return the ith substring.
    str_split_fixed(sentences, " ", n=3)
str_glue(…, .sep = "", .envir = parent.frame()) Create a string from strings and {expressions} to evaluate.
    str_glue("Pi is {pi}")
str_glue_data(.x, ..., .sep = "", .envir = parent.frame(), .na = "NA") Use a data frame, list, or environment to create a string from strings and {expressions} to evaluate.
    str_glue_data(mtcars, "{rownames(mtcars)} has {hp} hp")

Order Strings
str_order(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) [1] Return the vector of indexes that sorts a character vector.
    fruit[str_order(fruit)]
str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) [1] Sort a character vector.
    str_sort(fruit)

Helpers
str_conv(string, encoding) Override the encoding of a string.
    str_conv(fruit,"ISO-8859-1")
str_view_all(string, pattern, match = NA) View HTML rendering of all regex matches. Also str_view() to see only the first match.
    str_view_all(sentences, "[aeiou]")
str_equal(x, y, locale = "en", ignore_case = FALSE, ...) [1] Determine if two strings are equivalent.
    str_equal(c("a", "b"), c("a", "c"))
str_wrap(string, width = 80, indent = 0, exdent = 0) Wrap strings into nicely formatted paragraphs.
    str_wrap(sentences, 20)

[1] See bit.ly/ISO639-1 for a complete list of locales.
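
To make the detect, subset, and mutate families concrete, here is a short sketch on a hand-written vector. It assumes stringr is installed; snacks is a made-up vector used instead of the fruit data set so the results are easy to check by eye.

```r
library(stringr)

snacks <- c("apple", "banana", "kiwi", "pear")

has_a     <- str_detect(snacks, "a")            # TRUE TRUE FALSE TRUE
which_a   <- str_which(snacks, "a")             # 1 2 4
count_a   <- str_count(snacks, "a")             # 1 3 0 1
first3    <- str_sub(snacks, 1, 3)              # "app" "ban" "kiw" "pea"
only_p    <- str_subset(snacks, "p")            # "apple" "pear"
first_rep <- str_replace(snacks, "p", "-")      # "a-ple" "banana" "kiwi" "-ear"
all_rep   <- str_replace_all(snacks, "p", "-")  # "a--le" "banana" "kiwi" "-ear"
padded    <- str_pad(snacks, 6)                 # left-pads to width 6
joined    <- str_flatten(snacks, ", ")          # "apple, banana, kiwi, pear"
```

str_replace() touches only the first match in each string, which is why "apple" keeps its second "p" until str_replace_all() is used.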

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor on Twitter • stringr 1.5.0 • Updated: 2023-05

Need to Know Regular Expressions - Regular expressions, or regexps, are a concise language for
describing patterns in strings.
[:space:]
new line
Pattern arguments in stringr are interpreted as MATCH CHARACTERS see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx)
regular expressions a er any special characters [:blank:] .
have been parsed. string regexp matches example
(type this) (to mean this) (which matches this) space
In R, you write regular expressions as strings, a (etc.) a (etc.) see("a") abc ABC 123 .!?\(){} tab
sequences of characters surrounded by quotes \\. \. . see("\\.") abc ABC 123 .!?\(){}
("") or single quotes('').
\\! \! ! see("\\!") abc ABC 123 .!?\(){} [:graph:]
Some characters cannot be represented directly \\? \? ? see("\\?") abc ABC 123 .!?\(){}
in an R string . These must be represented as \\\\ \\ \ see("\\\\") abc ABC 123 .!?\(){} [:punct:] [:symbol:]
special characters, sequences of characters that \\( \( ( see("\\(") abc ABC 123 .!?\(){}
have a specific meaning., e.g. . , : ; ? ! / *@# | ` = + ^
\\) \) ) see("\\)") abc ABC 123 .!?\(){}
Special Character Represents \\{ \{ { see("\\{") abc ABC 123 .!?\(){} - _ " ' [ ] { } ( ) ~ < > $
\\ \ \\} \} } see( "\\}") abc ABC 123 .!?\(){}
\" " \\n \n new line (return) see("\\n") abc ABC 123 .!?\(){} [:alnum:]
\n new line \\t \t tab see("\\t") abc ABC 123 .!?\(){}
Run ?"'" to see a complete list \\s \s any whitespace (\S for non-whitespaces) see("\\s") abc ABC 123 .!?\(){} [:digit:]
\\d \d any digit (\D for non-digits) see("\\d") abc ABC 123 .!?\(){}
0 1 2 3 4 5 6 7 8 9
Because of this, whenever a \ appears in a regular \\w \w any word character (\W for non-word chars) see("\\w") abc ABC 123 .!?\(){}
expression, you must write it as \\ in the string \\b \b word boundaries see("\\b") abc ABC 123 .!?\(){}
that represents the regular expression. [:digit:]
1
digits see("[:digit:]") abc ABC 123 .!?\(){} [:alpha:]
1
Use writeLines() to see how R views your string [:alpha:] letters see("[:alpha:]") abc ABC 123 .!?\(){} [:lower:] [:upper:]
1
a er all special characters have been parsed. [:lower:] lowercase letters see("[:lower:]") abc ABC 123 .!?\(){}
[:upper:]
1
uppercase letters see("[:upper:]") abc ABC 123 .!?\(){} a b c d e f A B C D E F
writeLines("\\.") 1
# \. [:alnum:] letters and numbers see("[:alnum:]") abc ABC 123 .!?\(){} g h i j k l GH I J K L
[:punct:] 1 punctuation see("[:punct:]") abc ABC 123 .!?\(){}
mn o p q r MNOPQR
writeLines("\\ is a backslash") [:graph:]1 letters, numbers, and punctuation see("[:graph:]") abc ABC 123 .!?\(){}
# \ is a backslash
[:space:]1 space characters (i.e. \s) see("[:space:]") abc ABC 123 .!?\(){} s t u v w x S T U VWX
[:blank:]1 space and tab (but not new line) see("[:blank:]") abc ABC 123 .!?\(){} y z Y Z
. every character except a new line see(".") abc ABC 123 .!?\(){}
1 Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]

INTERPRETATION
Patterns in stringr are interpreted as regexs. To change this default, wrap the pattern in one of:

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE, dotall = FALSE, ...)
    Modifies a regex to ignore cases, match the end of lines as well as the end of strings, allow R comments within regexes, and/or to have . match everything including \n.
    str_detect("I", regex("i", TRUE))

fixed() Matches raw bytes but will miss some characters that can be represented in multiple ways (fast).
    str_detect("\u0130", fixed("i"))

coll() Matches raw bytes and will use locale specific collation rules to recognize characters that can be represented in multiple ways (slow).
    str_detect("\u0130", coll("i", TRUE, locale = "tr"))

boundary() Matches boundaries between characters, line_breaks, sentences, or words.
    str_split(sentences, boundary("word"))

ALTERNATES    alt <- function(rx) str_view_all("abcde", rx)
regexp      matches         example
ab|d        or              alt("ab|d")
[abe]       one of          alt("[abe]")
[^abe]      anything but    alt("[^abe]")
[a-c]       range           alt("[a-c]")

ANCHORS    anchor <- function(rx) str_view_all("aaa", rx)
regexp      matches            example
^a          start of string    anchor("^a")
a$          end of string      anchor("a$")

LOOK AROUNDS    look <- function(rx) str_view_all("bacad", rx)
regexp      matches            example
a(?=c)      followed by        look("a(?=c)")
a(?!c)      not followed by    look("a(?!c)")
(?<=b)a     preceded by        look("(?<=b)a")
(?<!b)a     not preceded by    look("(?<!b)a")

QUANTIFIERS    quant <- function(rx) str_view_all(".a.aa.aaa", rx)
regexp      matches              example
a?          zero or one          quant("a?")
a*          zero or more         quant("a*")
a+          one or more          quant("a+")
a{n}        exactly n            quant("a{2}")
a{n, }      n or more            quant("a{2,}")
a{n, m}     between n and m      quant("a{2,4}")

GROUPS    ref <- function(rx) str_view_all("abbaab", rx)
Use parentheses to set precedence (order of evaluation) and create groups.
regexp      matches            example
(ab|d)e     sets precedence    alt("(ab|d)e")

Use an escaped number to refer to and duplicate parentheses groups that occur earlier in a pattern. Refer to each group by its order of appearance.
string (type this)    regexp (to mean this)    matches (which matches this)    example (the result is the same as ref("abba"))
\\1                   \1 (etc.)                first () group, etc.            ref("(a)(b)\\2\\1")
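The quantifier, look-around, and back-reference rows above can be tried directly. The following is a minimal sketch, assuming the stringr package is installed; the variable names are illustrative only and do not come from the cheat sheet.

```r
library(stringr)

# Quantifier: runs of two or more "a"
doubles <- str_extract_all(".a.aa.aaa", "a{2,}")[[1]]

# Look-ahead: an "a" that is followed by "c" (the "c" is not part of the match)
before_c <- str_extract("bacad", "a(?=c)")
before_c_pos <- str_locate("bacad", "a(?=c)")

# Back-references: \\2 and \\1 repeat groups 2 and 1, so the pattern spells "abba"
mirror <- str_extract("abbaab", "(a)(b)\\2\\1")

# regex() changes how a pattern is interpreted; fixed() compares it literally
any_case <- str_detect("I", regex("i", ignore_case = TRUE))
literal_dot <- str_detect("a.b", fixed("."))
```

Extracting rather than viewing the matches makes the results easy to inspect at the console or to check in a script.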

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor on Twitter • stringr 1.5.0 • Updated: 2023-05
Dates and times with lubridate : : CHEAT SHEET


Date-times
A date-time is a point on the timeline, stored as the number of seconds since 1970-01-01 00:00:00 UTC.
dt <- as_datetime(1511870400)
## "2017-11-28 12:00:00 UTC"

A date is a day stored as the number of days since 1970-01-01.
d <- as_date(17498)
## "2017-11-28"

An hms is a time stored as the number of seconds since 00:00:00.
t <- hms::as_hms(85)
## 00:01:25

PARSE DATE-TIMES (Convert strings or numbers to date-times)
1. Identify the order of the year (y), month (m), day (d), hour (h), minute (m) and second (s) elements in your data.
2. Use the function below whose name replicates the order. Each accepts a tz argument to set the time zone, e.g. ymd(x, tz = "UTC").

2017-11-28T14:02:00    ymd_hms(), ymd_hm(), ymd_h().    ymd_hms("2017-11-28T14:02:00")
2017-22-12 10:00:00    ydm_hms(), ydm_hm(), ydm_h().    ydm_hms("2017-22-12 10:00:00")
11/28/2017 1:02:03     mdy_hms(), mdy_hm(), mdy_h().    mdy_hms("11/28/2017 1:02:03")
1 Jan 2017 23:59:59    dmy_hms(), dmy_hm(), dmy_h().    dmy_hms("1 Jan 2017 23:59:59")
20170131               ymd(), ydm().                    ymd(20170131)
July 4th, 2000         mdy(), myd().                    mdy("July 4th, 2000")
4th of July '99        dmy(), dym().                    dmy("4th of July '99")
2001: Q3               yq() Q for quarter.              yq("2001: Q3")
07-2020                my(), ym().                      my("07-2020")
2:01                   hms::hms() Also lubridate::hms(), hm() and ms(), which return periods.*    hms::hms(seconds = 0, minutes = 1, hours = 2)
2017.5                 date_decimal(decimal, tz = "UTC")    date_decimal(2017.5)

now(tzone = "") Current time in tz (defaults to system tz). now()
today(tzone = "") Current date in a tz (defaults to system tz). today()
fast_strptime() Faster strptime. fast_strptime("9/1/01", "%y/%m/%d")
parse_date_time() Easier strptime. parse_date_time("09-01-01", "ymd")

GET AND SET COMPONENTS
Use an accessor function to get a component. Assign into an accessor function to change a component in place.
d ## "2017-11-28"
day(d) ## 28
day(d) <- 1
d ## "2017-11-01"

date(x) Date component. date(dt)
year(x) Year. year(dt)
isoyear(x) The ISO 8601 year.
epiyear(x) Epidemiological year.
month(x, label, abbr) Month. month(dt)
day(x) Day of month. day(dt)
wday(x, label, abbr) Day of week.
qday(x) Day of quarter.
hour(x) Hour. hour(dt)
minute(x) Minutes. minute(dt)
second(x) Seconds. second(dt)
tz(x) Time zone. tz(dt)
week(x) Week of the year. week(dt)
isoweek() ISO 8601 week.
epiweek() Epidemiological week.
quarter(x) Quarter. quarter(dt)
semester(x, with_year = FALSE) Semester. semester(dt)
am(x) Is it in the am? am(dt)
pm(x) Is it in the pm? pm(dt)
dst(x) Is it daylight savings? dst(d)
leap_year(x) Is it a leap year? leap_year(d)
update(object, ..., simple = FALSE) update(dt, mday = 2, hour = 1)

Round Date-times
floor_date(x, unit = "second") Round down to nearest unit. floor_date(dt, unit = "month")
round_date(x, unit = "second") Round to nearest unit. round_date(dt, unit = "month")
ceiling_date(x, unit = "second", change_on_boundary = NULL) Round up to nearest unit. ceiling_date(dt, unit = "month")
Valid units are second, minute, hour, day, week, month, bimonth, quarter, season, halfyear and year.
rollback(dates, roll_to_first = FALSE, preserve_hms = TRUE) Roll back to last day of previous month. Also rollforward(). rollback(dt)

Stamp Date-times
stamp() Derive a template from an example string and return a new function that will apply the template to date-times. Also stamp_date() and stamp_time().
1. Derive a template, create a function
sf <- stamp("Created Sunday, Jan 17, 1999 3:34")
2. Apply the template to dates
sf(ymd("2010-04-05"))
## [1] "Created Monday, Apr 05, 2010 00:00"
Tip: use a date with day > 12

Time Zones
R recognizes ~600 time zones. Each encodes the time zone, Daylight Savings Time, and historical calendar variations for an area. R assigns one time zone per vector.
Use the UTC time zone to avoid Daylight Savings.
OlsonNames() Returns a list of valid time zone names. OlsonNames()
Sys.timezone() Gets current time zone.
with_tz(time, tzone = "") Get the same date-time in a new time zone (a new clock time). Also local_time(dt, tz, units). with_tz(dt, "US/Pacific")
force_tz(time, tzone = "") Get the same clock time in a new time zone (a new date-time). Also force_tzs(). force_tz(dt, "US/Pacific")
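To see parsing, accessors, rounding, and the difference between with_tz() and force_tz() side by side, here is a minimal sketch assuming lubridate is installed. The variable names are illustrative, and "America/Los_Angeles" is used as an assumed stand-in for the "US/Pacific" zone shown on the sheet.

```r
library(lubridate)

# Parse: the function name mirrors the order of the components in the string
dt <- ymd_hms("2017-11-28T14:02:00")   # POSIXct, UTC by default
d  <- mdy("July 4th, 2000")            # Date

# Get components with accessors, set them by assigning into the accessor
yr <- year(dt)
hr <- hour(dt)
day(d) <- 1                            # d is now the first of July 2000

# Round to a unit
month_start <- floor_date(dt, unit = "month")
month_end   <- ceiling_date(dt, unit = "month")

# Time zones: with_tz() keeps the instant, force_tz() keeps the clock time
noon_utc <- as_datetime(1511870400)                    # 2017-11-28 12:00:00 UTC
same_instant <- with_tz(noon_utc, "America/Los_Angeles")
same_clock   <- force_tz(noon_utc, "America/Los_Angeles")
```

with_tz() only changes how the instant is displayed, whereas force_tz() yields a different point on the timeline, so the two results differ by the zone's UTC offset.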

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at lubridate.tidyverse.org • lubridate 1.9.2 • Updated: 2023-05


Math with Date-times
Lubridate provides three classes of timespans to facilitate math with dates and date-times.

Math with date-times relies on the timeline, which behaves inconsistently. Consider how the timeline behaves during:

A normal day
nor <- ymd_hms("2018-01-01 01:30:00",tz="US/Eastern")
The start of daylight savings (spring forward)
gap <- ymd_hms("2018-03-11 01:30:00",tz="US/Eastern")
The end of daylight savings (fall back)
lap <- ymd_hms("2018-11-04 00:30:00",tz="US/Eastern")
Leap years and leap seconds
leap <- ymd("2019-03-01")

Periods track changes in clock times, which ignore time line irregularities.
nor + minutes(90)
gap + minutes(90)
lap + minutes(90)
leap + years(1)

Durations track the passage of physical time, which deviates from clock time when irregularities occur.
nor + dminutes(90)
gap + dminutes(90)
lap + dminutes(90)
leap + dyears(1)

Intervals represent specific intervals of the timeline, bounded by start and end date-times.
interval(nor, nor + minutes(90))
interval(gap, gap + minutes(90))
interval(lap, lap + minutes(90))
interval(leap, leap + years(1))

Not all years are 365 days due to leap days. Not all minutes are 60 seconds due to leap seconds.

It is possible to create an imaginary date by adding months, e.g. February 31st
jan31 <- ymd(20180131)
jan31 + months(1)
## NA
%m+% and %m-% will roll imaginary dates to the last day of the previous month.
jan31 %m+% months(1)
## "2018-02-28"
add_with_rollback(e1, e2, roll_to_first = TRUE) will roll imaginary dates to the first day of the new month.
add_with_rollback(jan31, months(1), roll_to_first = TRUE)
## "2018-03-01"

PERIODS
Add or subtract periods to model events that happen at specific clock times, like the NYSE opening bell.
Make a period with the name of a time unit pluralized, e.g.
p <- months(3) + days(12)
p
"3m 12d 0H 0M 0S"    (number of months, number of days, etc.)

years(x = 1) x years.
months(x) x months.
weeks(x = 1) x weeks.
days(x = 1) x days.
hours(x = 1) x hours.
minutes(x = 1) x minutes.
seconds(x = 1) x seconds.
milliseconds(x = 1) x milliseconds.
microseconds(x = 1) x microseconds.
nanoseconds(x = 1) x nanoseconds.
picoseconds(x = 1) x picoseconds.

period(num = NULL, units = "second", ...) An automation friendly period constructor. period(5, unit = "years")
as.period(x, unit) Coerce a timespan to a period, optionally in the specified units. Also is.period(). as.period(p)
period_to_seconds(x) Convert a period to the "standard" number of seconds implied by the period. Also seconds_to_period(). period_to_seconds(p)

DURATIONS
Add or subtract durations to model physical processes, like battery life. Durations are stored as seconds, the only time unit with a consistent length. Difftimes are a class of durations found in base R.
Make a duration with the name of a period prefixed with a d, e.g.
dd <- ddays(14)
dd
"1209600s (~2 weeks)"    (exact length in seconds, equivalent in common units)

dyears(x = 1) 31557600x seconds (365.25 days, consistent with dmonths()).
dmonths(x = 1) 2629800x seconds.
dweeks(x = 1) 604800x seconds.
ddays(x = 1) 86400x seconds.
dhours(x = 1) 3600x seconds.
dminutes(x = 1) 60x seconds.
dseconds(x = 1) x seconds.
dmilliseconds(x = 1) x x 10^-3 seconds.
dmicroseconds(x = 1) x x 10^-6 seconds.
dnanoseconds(x = 1) x x 10^-9 seconds.
dpicoseconds(x = 1) x x 10^-12 seconds.

duration(num = NULL, units = "second", …) An automation friendly duration constructor. duration(5, unit = "years")
as.duration(x, …) Coerce a timespan to a duration. Also is.duration(), is.difftime(). as.duration(i)
make_difftime(x) Make difftime with the specified number of units. make_difftime(99999)

INTERVALS
Divide an interval by a duration to determine its physical length, divide an interval by a period to determine its implied length in clock time.
Make an interval with interval() or %--%, e.g.
i <- interval(ymd("2017-01-01"), d) ## 2017-01-01 UTC--2017-11-28 UTC
j <- d %--% ymd("2017-12-31") ## 2017-11-28 UTC--2017-12-31 UTC

a %within% b Does interval or date-time a fall within interval b? now() %within% i
int_start(int) Access/set the start date-time of an interval. Also int_end(). int_start(i) <- now(); int_start(i)
int_aligns(int1, int2) Do two intervals share a boundary? Also int_overlaps(). int_aligns(i, j)
int_diff(times) Make the intervals that occur between the date-times in a vector. v <-c(dt, dt + 100, dt + 1000); int_diff(v)
int_flip(int) Reverse the direction of an interval. Also int_standardize(). int_flip(i)
int_length(int) Length in seconds. int_length(i)
int_shift(int, by) Shifts an interval up or down the timeline by a timespan. int_shift(i, days(-1))
as.interval(x, start, …) Coerce a timespan to an interval with the start date-time. Also is.interval(). as.interval(days(1), start = now())
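The contrast between the three timespan classes is easiest to see with whole days and months. This is a minimal sketch assuming lubridate is installed; the dates come from the examples above, while the variable names are illustrative.

```r
library(lubridate)

# Periods follow the calendar; adding a month to Jan 31 gives an imaginary date
jan31 <- ymd(20180131)
imaginary  <- jan31 + months(1)          # NA
rolled     <- jan31 %m+% months(1)       # last day of February
rolled_fwd <- add_with_rollback(jan31, months(1), roll_to_first = TRUE)

# Durations are a fixed number of seconds
dd <- ddays(14)
dd_seconds <- as.numeric(dd)

# Across a leap day the two disagree: one calendar year vs. 365 physical days
leap <- ymd("2019-03-01")
by_period   <- leap + years(1)
by_duration <- leap + ddays(365)

# Intervals are anchored to the timeline and can be measured either way
i <- interval(ymd("2017-01-01"), ymd("2017-11-28"))
len_seconds <- int_length(i)
len_days    <- i / ddays(1)
inside      <- ymd("2017-06-15") %within% i
```

Using ddays(365) rather than dyears(1) keeps the duration a whole number of days, which makes the one-day gap around 29 February 2020 easy to see.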
RStudio IDE : : CHEAT SHEET


Documents and Apps
Open Shiny, R Markdown, knitr, Sweave, LaTeX, .Rd files and more in Source Pane.
• Toolbar: Check spelling; Render output; Choose output format; Configure render options; Insert code chunk; Publish to server
• Jump to previous chunk; Jump to next chunk; Run code; Show file outline; Visual Editor (reverse side)
• Jump to section or chunk; Set knitr chunk options; Run this and all previous code chunks; Run this code chunk
Access markdown guide at Help > Markdown Quick Reference
See reverse side for more on Visual Editor
RStudio recognizes that files named app.R, server.R, ui.R, and global.R belong to a shiny app
• Run app; Choose location to view app; Publish to shinyapps.io or server; Manage publish accounts

Source Editor
• Navigate backwards/forwards; Open in new window; Save; Find and replace; Compile as notebook; Run selected code
• Re-run previous code; Source with or w/out Echo or as a Local Job; Show file outline
Multiple cursors/column selection with Alt + mouse drag.
Code diagnostics that appear in the margin. Hover over diagnostic symbols for details.
Syntax highlighting based on your file's extension
Tab completion to finish function names, file paths, arguments, and more.
Multi-language code snippets to quickly use common blocks of code.
• Jump to function in file; Change file type
• Working Directory; Run scripts in separate sessions; Maximize, minimize panes
• Ctrl/Cmd + arrow-up to see history; R Markdown Build Log; Drag pane boundaries

Tab Panes
• Import data with wizard; History of past commands to run/copy; Manage external databases; View memory usage; R tutorials
• Load workspace; Save workspace; Clear R workspace; Search inside environment
• Choose environment to display from list of parent environments; Display objects as list or grid
• Displays saved objects by type with short description; View in data viewer; View function source code
• More file options; Create folder; Delete file; Rename file; Change directory; Path to displayed directory
A File browser keyed to your working directory. Click on file or directory name to open.
RStudio opens plots in a dedicated Plots pane
• Navigate recent plots; Open in window; Export plot; Delete plot; Delete all plots
GUI Package manager lists every installed package
• Install Packages; Update Packages; Browse package site
• Click to load package with library(). Unclick to detach package with detach().
• Package version installed; Delete from library
RStudio opens documentation in a dedicated Help pane
• Home page of helpful links; Search within help file; Search for help file
Viewer pane displays HTML content, such as Shiny apps, RMarkdown reports, and interactive visualizations
• Stop Shiny app; Publish to shinyapps.io, rpubs, RSConnect, …; Refresh
View(<data>) opens spreadsheet like view of data set
• Filter rows by value or value range; Sort by values; Search for value

Version Control
Turn on at Tools > Project Options > Git/SVN
• File status: A Added; M Modified; D Deleted; R Renamed; ? Untracked
• Stage files; Commit staged files; Push/Pull to remote; View History; Current branch
• Show file diff to view file differences; Open shell to type commands

Debug Mode
Use debug(), browser(), or a breakpoint and execute your code to open the debugger mode.
• Launch debugger mode from origin of error
• Open traceback to examine the functions that R called before the error occurred
• Click next to line number to add/remove a breakpoint.
• Highlighted line shows where execution has paused
• Run commands in environment where execution has paused; Examine variables in executing environment; Select function in traceback to debug
• Step through code one line at a time; Step into and out of functions to run; Resume execution; Quit debug mode

Package Development
Create a new package with File > New Project > New Directory > R Package
Enable roxygen documentation with Tools > Project Options > Build Tools
Roxygen guide at Help > Roxygen Quick Reference
See package information in the Build Tab
• Install package and restart R; Run devtools::load_all() and reload changes; Clear output and rebuild
• Run R CMD check; Customize package build options; Run package tests
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at rstudio.com • Font Awesome 5.15.3 • RStudio IDE 1.4.1717 • Updated: 2021-07




Keyboard Shortcuts

RUN CODE                              Windows/Linux     Mac
Search command history                Ctrl+arrow-up     Cmd+arrow-up
Interrupt current command             Esc               Esc
Clear console                         Ctrl+L            Ctrl+L

NAVIGATE CODE
Go to File/Function                   Ctrl+.            Ctrl+.

WRITE CODE
Attempt completion                    Tab or Ctrl+Space Tab or Ctrl+Space
Insert <- (assignment operator)       Alt+-             Option+-
Insert %>% (pipe operator)            Ctrl+Shift+M      Cmd+Shift+M
(Un)Comment selection                 Ctrl+Shift+C      Cmd+Shift+C

MAKE PACKAGES                         Windows/Linux     Mac
Load All (devtools)                   Ctrl+Shift+L      Cmd+Shift+L
Test Package (Desktop)                Ctrl+Shift+T      Cmd+Shift+T
Document Package                      Ctrl+Shift+D      Cmd+Shift+D

DOCUMENTS AND APPS
Knit Document (knitr)                 Ctrl+Shift+K      Cmd+Shift+K
Insert chunk (Sweave & Knitr)         Ctrl+Alt+I        Cmd+Option+I
Run from start to current line        Ctrl+Alt+B        Cmd+Option+B

MORE KEYBOARD SHORTCUTS
Keyboard Shortcuts Help               Alt+Shift+K       Option+Shift+K
Show Command Palette                  Ctrl+Shift+P      Cmd+Shift+P
View the Keyboard Shortcut Quick Reference with Tools > Keyboard Shortcuts or Alt/Option + Shift + K
Search for keyboard shortcuts with Tools > Show Command Palette or Ctrl/Cmd + Shift + P.

RStudio Workbench
WHY RSTUDIO WORKBENCH?
Extend the open source server with a commercial license, support, and more:
• open and run multiple R sessions at once
• tune your resources to improve performance
• administrative tools for managing user sessions
• collaborate real-time with others in shared projects
• switch easily from one version of R to a different version
• integrate with your authentication, authorization, and audit practices
• work in the RStudio IDE, JupyterLab, Jupyter Notebooks, or VS Code
Download a free 45 day evaluation at www.rstudio.com/products/workbench/evaluation/

Share Projects
File > New Project
RStudio saves the call history, workspace, and working directory associated with a project. It reloads each when you re-open a project.
• Start new R Session in current project; Close R Session in project
• Active shared collaborators; Name of current project; Share Project with Collaborators; Select R Version

Visual Editor
• Toolbar: Check spelling; Render output; Choose output format; Choose output location; Insert code chunk; Jump to previous chunk; Jump to next chunk; Run selected lines; Publish to server; Show file outline
• Back to Source Editor (front page)
• Block format; Lists and block quotes; Links; Citations; Images; More formatting; Clear formatting
• Insert blocks, citations, equations, and special characters; Insert and edit tables; Insert verbatim code
• File outline; Add/Edit attributes
• Jump to chunk or header; Set knitr chunk options; Run this and all previous code chunks; Run this code chunk

Run Remote Jobs
Run R on remote clusters (Kubernetes/Slurm) via the Job Launcher
• Monitor launcher jobs; Launch a job; Run launcher jobs remotely

Factors with forcats : : CHEAT SHEET


The forcats package provides tools for working with factors, which are R's data structure for categorical data.

Factors
R represents categorical data with factors. A factor is an integer vector with a levels attribute that stores a set of mappings between integers and categorical values. When you view a factor, R displays not the integers, but the levels associated with them.

Inspect Factors
fct_count(f, sort = FALSE, prop = FALSE) Count the number of values with each level. fct_count(f)
fct_match(f, lvls) Check for lvls in f. fct_match(f, "a")
fct_unique(f) Return the unique values, removing duplicates. fct_unique(f)

Combine Factors
fct_c(…) Combine factors with different levels. Also fct_cross().
f1 <- factor(c("a", "c"))
f2 <- factor(c("b", "a"))
fct_c(f1, f2)
fct_unify(fs, levels = lvls_union(fs)) Standardize levels across a list of factors. fct_unify(list(f2, f1))

Change the order of levels
fct_relevel(.f, ..., after = 0L) Manually reorder factor levels. fct_relevel(f, c("b", "c", "a"))
fct_infreq(f, ordered = NA) Reorder levels by the frequency in which they appear in the data (highest frequency first). Also fct_inseq().
f3 <- factor(c("c", "c", "a"))
fct_infreq(f3)
fct_inorder(f, ordered = NA) Reorder levels by order in which they appear in the data. fct_inorder(f2)
fct_rev(f) Reverse level order.
f4 <- factor(c("a","b","c"))
fct_rev(f4)
fct_shift(f) Shift levels to left or right, wrapping around end. fct_shift(f4)
fct_shuffle(f, n = 1L) Randomly permute order of factor levels. fct_shuffle(f4)
fct_reorder(.f, .x, .fun = median, ..., .desc = FALSE) Reorder levels by their relationship with another variable.
boxplot(weight ~ fct_reorder(group, weight), data = PlantGrowth)
fct_reorder2(.f, .x, .y, .fun = last2, ..., .desc = TRUE) Reorder levels by their final values when plotted with two other variables.
ggplot(diamonds, aes(carat, price, color = fct_reorder2(color, carat, price))) +
  geom_smooth()

Change the value of levels
fct_recode(.f, ...) Manually change levels. Also fct_relabel() which obeys purrr::map syntax to apply a function or expression to each level.
fct_recode(f, v = "a", x = "b", z = "c")
fct_relabel(f, ~ paste0("x", .x))
fct_anon(f, prefix = "") Anonymize levels with random integers. fct_anon(f)
fct_collapse(.f, …, other_level = NULL) Collapse levels into manually defined groups. fct_collapse(f, x = c("a", "b"))
fct_lump_min(f, min, w = NULL, other_level = "Other") Lumps together levels that appear fewer than min times. Also fct_lump_n(), fct_lump_prop(), and fct_lump_lowfreq(). fct_lump_min(f, min = 2)
fct_other(f, keep, drop, other_level = "Other") Replace levels with "other." fct_other(f, keep = c("a", "b"))

Add or drop levels
fct_drop(f, only) Drop unused levels.
f5 <- factor(c("a","b"),c("a","b","x"))
f6 <- fct_drop(f5)
fct_expand(f, …) Add levels to a factor. fct_expand(f6, "x")
fct_na_value_to_level(f, level = "(Missing)") Assigns a level to NAs to ensure they appear in plots, etc.
f7 <- factor(c("a", "b", NA))
fct_na_value_to_level(f7, level = "(Missing)")
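Because the diagrams that showed each function's effect do not survive as text, inspecting levels() after each call is a practical substitute. This is a minimal sketch assuming forcats is installed; the example factor f and the variable names are illustrative and not taken from the sheet.

```r
library(forcats)

f <- factor(c("a", "c", "b", "a"))   # levels default to alphabetical: a, b, c

counts    <- fct_count(f)                                   # columns f and n
by_freq   <- fct_infreq(factor(c("c", "c", "a")))           # most frequent first
by_order  <- fct_inorder(factor(c("b", "a")))               # order of appearance
reversed  <- fct_rev(f)
recoded   <- fct_recode(f, v = "a", x = "b", z = "c")       # new name = old name
collapsed <- fct_collapse(f, x = c("a", "b"))
lumped    <- fct_lump_min(f, min = 2)                       # rare levels become "Other"
combined  <- fct_c(factor(c("a", "c")), factor(c("b", "a")))
dropped   <- fct_drop(factor(c("a", "b"), levels = c("a", "b", "x")))

levels(combined)
```

Only the levels attribute changes in the reordering calls; the underlying values stay the same, which is why comparing levels() before and after is usually enough to confirm the effect.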

CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at forcats.tidyverse.org • Diagrams inspired by @LVaudor on Twitter • forcats 1.0.0 • Updated: 2023-05










Shiny : : CHEAT SHEET


Building an App
To generate the template, type shinyapp and press Tab in the RStudio IDE or go to File > New Project > New Directory > Shiny Web Application

A Shiny app is a web page (ui) connected to a computer running a live R session (server).
Users can manipulate the UI, which will cause the server to update the UI’s displays (by running R code).

# app.R
library(shiny)

ui <- fluidPage(
  numericInput(inputId = "n", "Sample size", value = 25),
  plotOutput(outputId = "hist")
)

server <- function(input, output, session) {
  output$hist <- renderPlot({
    hist(rnorm(input$n))
  })
}

shinyApp(ui = ui, server = server)

• In ui nest R functions to build an HTML interface
• Customize the UI with Layout Functions
• Add Inputs with *Input() functions
• Add Outputs with *Output() functions
• Tell the server how to render outputs and respond to inputs with R
• Wrap code in render*() functions before saving to output
• Refer to UI inputs with input$<id> and outputs with output$<id>
• Call shinyApp() to combine ui and server into an interactive app!

Save your template as app.R. Keep your app in a directory along with optional extra files.
app-name/              The directory name is the app name
  app.R
  DESCRIPTION, README  (optional) used in showcase mode
  R/                   (optional) directory of supplemental .R files that are sourced automatically, must be named "R"
  www/                 (optional) directory of files to share with web browsers (images, CSS, .js, etc.), must be named "www"

See annotated examples of Shiny apps by running runExample(<example name>). Run runExample() with no arguments for a list of example names.
Launch apps stored in a directory with runApp(<path to directory>).

Share
Share your app in three ways:
1. Host it on shinyapps.io, a cloud based service from RStudio. To deploy Shiny apps:
   • Create a free or professional account at shinyapps.io
   • Click the Publish icon in RStudio IDE, or run: rsconnect::deployApp("<path to directory>")
2. Purchase RStudio Connect, a publishing platform for R and Python. rstudio.com/products/connect/
3. Build your own Shiny Server rstudio.com/products/shiny/shiny-server/

Inputs
Collect values from the user.
Access the current value of an input object with input$<inputId>. Input values are reactive.
actionButton(inputId, label, icon, width, …)
actionLink(inputId, label, icon, …)
checkboxGroupInput(inputId, label, choices, selected, inline, width, choiceNames, choiceValues)
checkboxInput(inputId, label, value, width)
dateInput(inputId, label, value, min, max, format, startview, weekstart, language, width, autoclose, datesdisabled, daysofweekdisabled)
dateRangeInput(inputId, label, start, end, min, max, format, startview, weekstart, language, separator, width, autoclose)
fileInput(inputId, label, multiple, accept, width, buttonLabel, placeholder)
numericInput(inputId, label, value, min, max, step, width)
passwordInput(inputId, label, value, width, placeholder)
radioButtons(inputId, label, choices, selected, inline, width, choiceNames, choiceValues)
selectInput(inputId, label, choices, selected, multiple, selectize, width, size) Also selectizeInput()
sliderInput(inputId, label, min, max, value, step, round, format, locale, ticks, animate, width, sep, pre, post, timeFormat, timezone, dragRange)
submitButton(text, icon, width) (Prevent reactions for entire app)
textInput(inputId, label, value, width, placeholder) Also textAreaInput()

Outputs
render*() and *Output() functions work together to add R output to the UI.
DT::renderDataTable(expr, options, searchDelay, callback, escape, env, quoted, outputArgs)    dataTableOutput(outputId)
renderImage(expr, env, quoted, deleteFile, outputArgs)    imageOutput(outputId, width, height, click, dblclick, hover, brush, inline)
renderPlot(expr, width, height, res, …, alt, env, quoted, execOnResize, outputArgs)    plotOutput(outputId, width, height, click, dblclick, hover, brush, inline)
renderPrint(expr, env, quoted, width, outputArgs)    verbatimTextOutput(outputId, placeholder)
renderTable(expr, striped, hover, bordered, spacing, width, align, rownames, colnames, digits, na, …, env, quoted, outputArgs)    tableOutput(outputId)
renderText(expr, env, quoted, outputArgs, sep)    textOutput(outputId, container, inline)
renderUI(expr, env, quoted, outputArgs)    uiOutput(outputId, inline, container, …)    htmlOutput(outputId, inline, container, …)
These are the core output types. See htmlwidgets.org for many more options.
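A render*() and *Output() pair can be checked without opening a browser. The sketch below assumes shiny 1.5.0 or later, where testServer() is available for exercising a server function; the output id "msg" and the message text are made up for illustration.

```r
library(shiny)

# textOutput("msg") in the UI is filled by renderText() assigned to output$msg
ui <- fluidPage(
  numericInput(inputId = "n", label = "Sample size", value = 25),
  textOutput(outputId = "msg")
)

server <- function(input, output, session) {
  output$msg <- renderText({
    paste("Sample size:", input$n)
  })
}

app <- shinyApp(ui = ui, server = server)

# Drive the server with a simulated input and read the rendered output
testServer(server, {
  session$setInputs(n = 10)
  print(output$msg)
})
```

Running the app object interactively (for example by printing app at the console) launches it in a browser; testServer() is only a convenience for confirming the wiring between the input and the output.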
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at shiny.rstudio.com • shiny 1.7.4 • Updated: 2023-05






Reactivity UI - An app’s UI is an HTML document. Layouts
Reactive values work together with reactive functions. Call a reactive value from within the arguments of one Use Shiny’s functions to assemble this HTML with R. Combine multiple elements
of these functions to avoid the error Operation not allowed without an active reactive context. fluidPage( into a "single element" that
textInput("a","") Returns has its own properties with a
) HTML panel function, e.g.
## <div class="container-fluid"> wellPanel(
## <div class="form-group shiny-input-container"> dateInput("a", ""),
## <label for="a"></label> submitButton()
## <input id="a" type="text" )
## class="form-control" value=""/>
## </div> absolutePanel() navlistPanel()
## </div> conditionalPanel() sidebarPanel()
fixedPanel() tabPanel()
headerPanel() tabsetPanel()
Add static HTML elements with tags, a list inputPanel() titlePanel()
of functions that parallel common HTML mainPanel() wellPanel()
tags, e.g. tags$a(). Unnamed arguments
will be passed into the tag; named Organize panels and elements into a layout with a
arguments will become tag attributes. layout function. Add elements as arguments of the
layout functions.
Run names(tags) for a complete list. sidebarLayout()
tags$h1("Header") -> <h1>Header</h1> ui <- fluidPage(
sidebarLayout(
The most common tags have wrapper functions. You side main sidebarPanel(),
do not need to prefix their names with tags$ panel
panel mainPanel()
CREATE YOUR OWN REACTIVE VALUES RENDER REACTIVE OUTPUT )
*Input() functions library(shiny) render*() functions ui <- fluidPage( )
# *Input() example
(see front page) ui <- fluidPage(
(see front page) h1("Header 1"),
fluidRow()
ui <- fluidPage( hr(),
textInput("a","","A"),
textInput("a","","A") Each input function creates textOutput("b") br(), ui <- fluidPage(
)
a reactive value stored as ) Builds an object to p(strong("bold")), row
column col fluidRow(column(width = 4),
input$<inputId>. display. Will rerun code in p(em("italic")), column(width = 2, o set = 3)),
server <- body to rebuild the object p(code("code")), fluidRow(column(width = 12))
#reactiveValues example
function(input,output){
whenever a reactive value a(href="", "link"), column
reactiveValues(…) output$b <- )
server <- renderText({ in the code changes. HTML("<p>Raw html</p>")
input$a )
function(input,output){ Creates a list of reactive }) Also flowLayout(), splitLayout(), verticalLayout(),
rv <- reactiveValues() Save the results to
rv$number <- 5 values whose values you } fixedPage(), and fixedRow().
} can set. output$<outputId>.
shinyApp(ui, server)

CREATE REACTIVE EXPRESSIONS
reactive(x, env, quoted, label, domain)

library(shiny)
ui <- fluidPage(
  textInput("a","","A"),
  textInput("z","","Z"),
  textOutput("b"))
server <- function(input,output){
  re <- reactive({
    paste(input$a,input$z)})
  output$b <- renderText({
    re()
  })
}
shinyApp(ui, server)

Reactive expressions:
• cache their value to reduce computation
• can be called elsewhere
• notify dependencies when invalidated
Call the expression with function syntax, e.g. re().

PERFORM SIDE EFFECTS
observeEvent(eventExpr, handlerExpr, event.env, event.quoted, handler.env,
  handler.quoted, ..., label, suspended, priority, domain, autoDestroy,
  ignoreNULL, ignoreInit, once)

library(shiny)
ui <- fluidPage(
  textInput("a","","A"),
  actionButton("go","Go")
)
server <- function(input,output){
  observeEvent(input$go,{
    print(input$a)
  })
}
shinyApp(ui, server)

Runs code in 2nd argument when reactive values in 1st argument change. See observe() for alternative.

REACT BASED ON EVENT
eventReactive(eventExpr, valueExpr, event.env, event.quoted, value.env,
  value.quoted, ..., label, domain, ignoreNULL, ignoreInit)

library(shiny)
ui <- fluidPage(
  textInput("a","","A"),
  actionButton("go","Go"),
  textOutput("b")
)
server <- function(input,output){
  re <- eventReactive(
    input$go,{input$a})
  output$b <- renderText({
    re()
  })
}
shinyApp(ui, server)

Creates reactive expression with code in 2nd argument that only invalidates when reactive values in 1st argument change.

REMOVE REACTIVITY
isolate(expr)

library(shiny)
ui <- fluidPage(
  textInput("a","","A"),
  textOutput("b")
)
server <- function(input,output){
  output$b <- renderText({
    isolate({input$a})
  })
}
shinyApp(ui, server)

Runs a code block. Returns a non-reactive copy of the results.

To include a CSS file, use includeCSS(), or
1. Place the file in the www subdirectory
2. Link to it with:
tags$head(tags$link(rel = "stylesheet",
  type = "text/css", href = "<file name>"))

To include JavaScript, use includeScript() or
1. Place the file in the www subdirectory
2. Link to it with:
tags$head(tags$script(src = "<file name>"))

IMAGES
To include an image:
1. Place the file in the www subdirectory
2. Link to it with img(src="<file name>")

Layer tabPanels on top of each other, and navigate between them, with:

ui <- fluidPage( tabsetPanel(
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents")))

ui <- fluidPage( navlistPanel(
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents")))

ui <- navbarPage(title = "Page",
  tabPanel("tab 1", "contents"),
  tabPanel("tab 2", "contents"),
  tabPanel("tab 3", "contents"))

Themes

Use the bslib package to add existing themes to your Shiny app ui, or make your own.

library(bslib)
ui <- fluidPage(
  theme = bs_theme(
    bootswatch = "darkly",
    ...
  )
)

bootswatch_themes() Get a list of themes.

Build your own theme by customizing individual arguments.

bs_theme(bg = "#558AC5",
  fg = "#F9B02D",
  ...)

?bs_theme for a full list of arguments.

bs_themer() Place within the server function to use the interactive theming widget.
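
The caching and invalidation behaviour of reactive() and the role of isolate() described above can be tried outside a running app. The following is a minimal console sketch, assuming the shiny package is installed; reactiveVal() stands in for input$a, and the counter n_runs is a hypothetical helper added only to make the caching visible.

```r
library(shiny)

n_runs <- 0                 # hypothetical counter: how often the body runs
a <- reactiveVal("A")       # a reactive value standing in for input$a

re <- reactive({
  n_runs <<- n_runs + 1
  paste(a(), "Z")
})

first  <- isolate(re())     # body runs, result is cached
second <- isolate(re())     # served from the cache, body does not run
a("B")                      # changing the dependency invalidates re
third  <- isolate(re())     # body runs again with the new value

print(c(first, second, third))
print(n_runs)
```

isolate() is used here only because reactive values cannot be read at the console without a reactive context; inside a server function the same re() call would normally sit in a render function, as in the examples above.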
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at shiny.rstudio.com • shiny 1.7.4 • Updated: 2023-05

Creating Survival Plots Informative and Elegant with survminer

Survival Curves

The ggsurvplot() function creates ggplot2 plots from survfit objects.

library("survival")
fit <- survfit(Surv(time,status) ~ sex, data = lung)
class(fit)
## [1] "survfit"

library("survminer")
ggsurvplot(fit, data = lung)

[Figure: survival curves for strata sex=1 and sex=2; x-axis: Time (0 to 1000), y-axis: Survival probability (0.00 to 1.00)]

Use the fun argument to set the transformation of the survival curve. E.g. "event" for cumulative events, "cumhaz" for the cumulative hazard function or "pct" for survival probability in percentage.

ggsurvplot(fit, data = lung, fun = "event")
ggsurvplot(fit, data = lung, fun = "cumhaz")

[Figure: two panels by strata (sex=1, sex=2); y-axes: Cumulative event and Cumulative hazard, x-axis: Time]

With lots of graphical parameters you have full control over look and feel of the survival plots; position and content of the legend; additional annotations like p-value, title, subtitle.

ggsurvplot(fit, data = lung,
  conf.int = TRUE,
  pval = TRUE,
  fun = "pct",
  risk.table = TRUE,
  size = 1,
  linetype = "strata",
  palette = c("#E7B800", "#2E9FDF"),
  legend = "bottom",
  legend.title = "Sex",
  legend.labs = c("Male", "Female"))

[Figure: Survival probability (%) against Time for Sex = Male / Female, annotated p = 0.0013, with risk table]
Number at risk
Time      0    250   500   750   1000
Male      138  62    20    7     2
Female    90   53    21    3     0

Diagnostics of Cox Model

The function cox.zph() from survival package may be used to test the proportional hazards assumption for a Cox regression model fit. The graphical verification of this assumption may be performed with the function ggcoxzph() from the survminer package. For each covariate it produces plots with scaled Schoenfeld residuals against the time.

library("survival")
fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
ftest <- cox.zph(fit)
ftest
##            rho   chisq     p
## sex     0.1236   2.452 0.117
## age    -0.0275   0.129 0.719
## GLOBAL      NA   2.651 0.266

library("survminer")
ggcoxzph(ftest)

[Figure: Beta(t) for sex and Beta(t) for age against Time. Global Schoenfeld Test p: 0.2656; Schoenfeld Individual Test p: 0.1174 (sex), 0.7192 (age)]

The function ggcoxdiagnostics() plots different types of residuals as a function of time, linear predictor or observation id. The type of residual is selected with type argument. Possible values are "martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas", and "scaledsch".
The ox.scale argument defines what shall be plotted on the OX axis. Possible values are "linear.predictions", "observation.id", "time".
Logical arguments hline and sline may be used to add horizontal line or smooth line to the plot.

library("survival")
library("survminer")
fit <- coxph(Surv(time, status) ~ sex + age, data = lung)

ggcoxdiagnostics(fit,
  type = "deviance",
  ox.scale = "linear.predictions")

[Figure: Residuals (type = deviance) against Linear Predictions]

ggcoxdiagnostics(fit,
  type = "schoenfeld",
  ox.scale = "time")

[Figure: Residuals (type = schoenfeld) against time; panels: age, ecog.ps, rx]

Summary of Cox Model

The function ggforest() from the survminer package creates a forest plot for a Cox regression model fit. Hazard ratio estimates along with confidence intervals and p-values are plotted for each variable.

library("survival")
library("survminer")
lung$age <- ifelse(lung$age > 70, ">70","<= 70")
fit <- coxph( Surv(time, status) ~ sex + ph.ecog + age, data = lung)
fit
## Call:
## coxph(formula = Surv(time, status) ~ sex+ph.ecog+age, data=lung)
##
##           coef exp(coef) se(coef)     z       p
## sex     -0.567     0.567    0.168 -3.37 0.00075
## ph.ecog  0.470     1.600    0.113  4.16 3.1e-05
## age>70   0.307     1.359    0.187  1.64 0.10175
##
## Likelihood ratio test=31.6 on
## n= 227, number of events= 164

ggforest(fit)

[Figure: forest plot; x-axis: Hazard ratio]
variable   N      hazard ratio (confidence interval)   p
sex        228    0.57 (0.41 - 0.79)                   <0.001 ***
ph.ecog    228    1.60 (1.28 - 2.00)                   <0.001 ***
age        228    1.36 (0.94 - 1.96)                   0.102
n.events: 164, p.value.log: 6.4e-07, AIC: 1463.37, concordance: 0.64

The function ggadjustedcurves() from the survminer package plots Adjusted Survival Curves for Cox Proportional Hazards Model. Adjusted Survival Curves show how a selected factor influences survival estimated from a Cox model.
Note that these curves differ from Kaplan Meier estimates since they present expected survival based on given Cox model.

library("survival")
library("survminer")
lung$sex <- ifelse(lung$sex == 1, "Male", "Female")

fit <- coxph(Surv(time, status) ~
  sex + ph.ecog + age +
  strata(sex),
  data = lung)

ggcoxadjustedcurves(fit, data=lung)

[Figure: Survival rate against time; variable: Male, Female]

Note that it is not necessary to include the grouping factor in the Cox model. Survival curves are estimated from Cox model for each group defined by the factor independently.

lung$age3 <- cut(lung$age,
  c(35,55,65,85))

ggcoxadjustedcurves(fit, data=lung,
  variable="age3")

[Figure: Survival rate against time; variable: (35,55], (55,65], (65,85]]
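
The cox.zph() test described under "Diagnostics of Cox Model" can also be inspected programmatically. The sketch below uses only the survival package; the 0.05 threshold and the names zph_table and violations are illustrative assumptions, not part of the cheat sheet. The column layout of the table differs between survival versions (older versions print rho, newer ones df), so the p-value is taken from the last column rather than by a fixed position.

```r
library(survival)  # the lung data set ships with this package

# Cox model on the lung data, as in the cheat sheet.
fit <- coxph(Surv(time, status) ~ sex + age, data = lung)

# Test the proportional hazards assumption.
ftest <- cox.zph(fit)

# Results are stored as a matrix: one row per covariate plus "GLOBAL".
zph_table <- ftest$table
p_col <- ncol(zph_table)  # the p-value is the last column

# Covariates whose test rejects proportional hazards at the assumed 0.05 level.
violations <- rownames(zph_table)[zph_table[, p_col] < 0.05]

print(zph_table)
print(violations)
```

An empty violations vector means no covariate shows evidence against proportional hazards at the chosen level; ggcoxzph(ftest) then gives the graphical check.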

This onepager presents the survminer package [Alboukadel Kassambara, Marcin Kosinski 2017] in version 0.4.0.999. CC BY Przemysław Biecek http://github.com/pbiecek
See https://github.com/kassambara/survminer/ for more details. https://creativecommons.org/licenses/by/4.0/
rmarkdown : : CHEAT SHEET

What is rmarkdown?

.Rmd files · Develop your code and ideas side-by-side in a single document. Run code as individual chunks or as an entire document.
Dynamic Documents · Knit together plots, tables, and results with narrative text. Render to a variety of formats like HTML, PDF, MS Word, or MS Powerpoint.
Reproducible Research · Upload, link to, or attach your report to share. Anyone can read or run your code to reproduce your work.

Workflow

1. Open a new .Rmd file in the RStudio IDE by going to File > New File > R Markdown.
2. Embed code in chunks. Run code by line, by chunk, or all at once.
3. Write text and add tables, figures, images, and citations. Format with Markdown syntax or the RStudio Visual Markdown Editor.
4. Set output format(s) and options in the YAML header. Customize themes or add parameters to execute or add interactivity with Shiny.
5. Save and render the whole document. Knit periodically to preview your work as you write.
6. Share your work!

[Figure: annotated RStudio screenshots of the SOURCE EDITOR, VISUAL EDITOR and RENDERED OUTPUT. Callouts: 1. New File; 2. Embed Code; 3. Write Text; 4. Set Output Format(s) and Options; 5. Save and Render; 6. Share; set preview location; insert code chunk; go to code chunk; run code chunk(s); run all previous chunks; run current chunk; modify chunk options; show outline; find in document; reload document; file path to output document; publish to rpubs.com, shinyapps.io, RStudio Connect; insert citations; style options; add/edit attributes]

Embed Code with knitr

CODE CHUNKS
Surround code chunks with ```{r} and ``` or use the Insert Code Chunk button. Add a chunk label and/or chunk options inside the curly braces after r.

```{r chunk-label, include=FALSE}
summary(mtcars)
```

SET GLOBAL OPTIONS
Set options for the entire document in the first chunk.

```{r include=FALSE}
knitr::opts_chunk$set(message = FALSE)
```

INLINE CODE
Insert `r <code>` into text sections. Code is evaluated at render and results appear as text.
"Built with `r getRversion()`" --> "Built with 4.1.0"

OPTION       DEFAULT     EFFECTS
echo         TRUE        display code in output document
error        FALSE       TRUE (display error messages in doc)
                         FALSE (stop render when error occurs)
eval         TRUE        run code in chunk
include      TRUE        include chunk in doc after running
message      TRUE        display code messages in document
warning      TRUE        display code warnings in document
results      "markup"    "asis" (passthrough results)
                         "hide" (don't display results)
                         "hold" (put all results below all code)
fig.align    "default"   "left", "right", or "center"
fig.alt      NULL        alt text for a figure
fig.cap      NULL        figure caption as a character string
fig.path     "figure/"   prefix for generating figure file paths
fig.width &
fig.height   7           plot dimensions in inches
out.width                rescales output width, e.g. "75%", "300px"
collapse     FALSE       collapse all sources & output into a single block
comment      "##"        prefix for each line of results
child        NULL        file(s) to knit and then include
purl         TRUE        include or exclude a code chunk when extracting source code with knitr::purl()

See more options and defaults by running str(knitr::opts_chunk$get())

Insert Citations

Create citations from a bibliography file, a Zotero library, or from DOI references.

BUILD YOUR BIBLIOGRAPHY
• Add BibTeX or CSL bibliographies to the YAML header.
---
title: "My Document"
bibliography: references.bib
link-citations: TRUE
---
• If Zotero is installed locally, your main library will automatically be available.
• Add citations by DOI by searching "from DOI" in the Insert Citation dialog.

INSERT CITATIONS
• Access the Insert Citations dialog in the Visual Editor by clicking the @ symbol in the toolbar or by clicking Insert > Citation.
• Add citations with markdown syntax by typing [@cite] or @cite.

Insert Tables

Output data frames as tables using kable(data, caption).

```{r}
data <- faithful[1:4, ]
knitr::kable(data,
  caption = "Table with kable")
```

Other table packages include flextable, gt, and kableExtra.

Write with Markdown

The syntax below renders as formatted output in the rendered document.

Plain text.
End a line with two spaces to
start a new paragraph.
Also end with a backslash\
to make a new line.
*italics* and **bold**
superscript^2^/subscript~2~
~~strikethrough~~
escaped: \* \_ \\ (renders as: * _ \)
endash: --, emdash: --- (render as an en dash and an em dash)
# Header 1
## Header 2
...
###### Header 6
- unordered list
- item 2
    - item 2a (indent 1 tab)
    - item 2b
1. ordered list
2. item 2
    - item 2a (indent 1 tab)
    - item 2b
<link url> (renders as a clickable URL, e.g. http://www.rstudio.com/)
[This is a link.](link url)
[This is another link][id].
At the end of the document:
[id]: link url
![Caption](image.png)
or ![Caption][id2]
At the end of the document:
[id2]: image.png
`verbatim code`
```
multiple lines
of verbatim code
```
> block quotes
equation: $e^{i \pi} + 1 = 0$
equation block:
$$E = mc^{2}$$
horizontal rule:
---
| Right | Left | Default | Center |
|-------:|:------|-----------|:---------:|
| 12 | 12 | 12 | 12 |
| 123 | 123 | 123 | 123 |
|1|1|1|1|

HTML Tabsets
# Results {.tabset}
## Plots
text
## Tables
more text
(renders as a "Results" section with the tabs "Plots" and "Tables")

RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at rmarkdown.rstudio.com • rmarkdown 2.9.4 • Updated: 2021-08















Set Output Formats and their Options in YAML

Use the document's YAML header to set an output format and customize it with output options.

---
title: "My Document"
author: "Author Name"
output:
  html_document:
    toc: TRUE
---
Indent format 2 characters, indent options 4 characters.

OUTPUT FORMAT             CREATES
html_document             .html
pdf_document*             .pdf
word_document             Microsoft Word (.docx)
powerpoint_presentation   Microsoft Powerpoint (.pptx)
odt_document              OpenDocument Text
rtf_document              Rich Text Format
md_document               Markdown
github_document           Markdown for Github
ioslides_presentation     ioslides HTML slides
slidy_presentation        Slidy HTML slides
beamer_presentation*      Beamer slides
* Requires LaTeX, use tinytex::install_tinytex()
Also see flexdashboard, bookdown, distill, and blogdown.

IMPORTANT OPTIONS     DESCRIPTION [formats the option applies to]
anchor_sections       Show section anchors on mouse hover (TRUE or FALSE) [HTML]
citation_package      The LaTeX package to process citations ("default", "natbib", "biblatex") [PDF]
code_download         Give readers an option to download the .Rmd source code (TRUE or FALSE) [HTML]
code_folding          Let readers toggle the display of R code ("none", "hide", or "show") [HTML]
css                   CSS or SCSS file to use to style document (e.g. "style.css") [HTML]
dev                   Graphics device to use for figure output (e.g. "png", "pdf") [HTML, PDF]
df_print              Method for printing data frames ("default", "kable", "tibble", "paged") [HTML, PDF, MS Word, MS PPT]
fig_caption           Should figures be rendered with captions (TRUE or FALSE) [HTML, PDF, MS Word, MS PPT]
highlight             Syntax highlighting ("tango", "pygments", "kate", "zenburn", "textmate") [HTML, PDF, MS Word]
includes              File of content to place in doc ("in_header", "before_body", "after_body") [HTML, PDF]
keep_md               Keep the Markdown .md file generated by knitting (TRUE or FALSE) [HTML, PDF, MS Word, MS PPT]
keep_tex              Keep the intermediate TEX file used to convert to PDF (TRUE or FALSE) [PDF]
latex_engine          LaTeX engine for producing PDF output ("pdflatex", "xelatex", or "lualatex") [PDF]
reference_docx/_doc   docx/pptx file containing styles to copy in the output (e.g. "file.docx", "file.pptx") [MS Word, MS PPT]
theme                 Theme options (see Bootswatch and Custom Themes below) [HTML]
toc                   Add a table of contents at start of document (TRUE or FALSE) [HTML, PDF, MS Word, MS PPT]
toc_depth             The lowest level of headings to add to table of contents (e.g. 2, 3) [HTML, PDF, MS Word, MS PPT]
toc_float             Float the table of contents to the left of the main document content (TRUE or FALSE) [HTML]

Use ?<output format> to see all of a format's options, e.g. ?html_document

Render

When you render a document, rmarkdown:
1. Runs the code and embeds results and text into an .md file with knitr.
2. Converts the .md file into the output format with Pandoc.

[Diagram: .Rmd -> knitr -> .md -> pandoc -> HTML / PDF / DOC]

Save, then Knit to preview the document output. The resulting HTML/PDF/MS Word/etc. document will be created and saved in the same directory as the .Rmd file.
Use rmarkdown::render() to render/knit in the R console. See ?render for available options.

Share

Publish on RStudio Connect to share R Markdown documents securely, schedule automatic updates, and interact with parameters in real time.
rstudio.com/products/connect/

More Header Options

PARAMETERS
Parameterize your documents to reuse with new inputs (e.g., data, values, etc.).
1. Add parameters in the header as sub-values of params.
---
params:
  state: "hawaii"
---
2. Call parameters in code using params$<name>.
```{r}
data <- df[, params$state]
summary(data)
```
3. Set parameters with Knit with Parameters or the params argument of render().

REUSABLE TEMPLATES
1. Create a new package with an inst/rmarkdown/templates directory.
2. Add a folder containing template.yaml (below) and skeleton.Rmd (template contents).
---
name: "My Template"
---
3. Install the package to access template by going to File > New R Markdown > From Template.

BOOTSWATCH THEMES
Customize HTML documents with Bootswatch themes from the bslib package using the theme output option.
Use bslib::bootswatch_themes() to list available themes.
---
title: "Document Title"
author: "Author Name"
output:
  html_document:
    theme:
      bootswatch: solar
---

CUSTOM THEMES
Customize individual HTML elements using bslib variables. Use ?bs_theme to see more variables.
---
output:
  html_document:
    theme:
      bg: "#121212"
      fg: "#E4E4E4"
      base_font:
        google: "Prompt"
---
More on bslib at pkgs.rstudio.com/bslib/.

STYLING WITH CSS AND SCSS
Add CSS and SCSS to your document by adding a path to a file with the css option in the YAML header.
---
title: "My Document"
author: "Author Name"
output:
  html_document:
    css: "style.css"
---
Apply CSS styling by writing HTML tags directly or:
• Use markdown to apply style attributes inline.
  Bracketed Span
  A [green]{.my-color} word.
  Fenced Div
  ::: {.my-color}
  All of these words
  are green.
  :::
• Use the Visual Editor. Go to Format > Div/Span and add CSS styling directly with Edit Attributes.

INTERACTIVITY
Turn your report into an interactive Shiny document in 4 steps:
1. Add runtime: shiny to the YAML header.
2. Call Shiny input functions to embed input objects.
3. Call Shiny render functions to embed reactive output.
4. Render with rmarkdown::run() or click Run Document in RStudio IDE.
---
output: html_document
runtime: shiny
---
```{r, echo = FALSE}
numericInput("n",
  "How many cars?", 5)

renderTable({
  head(cars, input$n)
})
```
Also see Shiny Prerendered for better performance. rmarkdown.rstudio.com/authoring_shiny_prerendered
Embed a complete app into your document with shiny::shinyAppDir(). More at bookdown.org/yihui/rmarkdown/shiny-embedded.html.
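
The PARAMETERS steps and the rmarkdown::render() call mentioned under Render can be combined in a short console sketch. It assumes the rmarkdown package and Pandoc are available and skips rendering otherwise; the file contents and the value "alaska" are illustrative only.

```r
# Write a tiny parameterized .Rmd to a temporary file (illustrative content).
rmd_lines <- c(
  "---",
  "title: \"My Document\"",
  "output: html_document",
  "params:",
  "  state: \"hawaii\"",
  "---",
  "",
  "The selected state is `r params$state`."
)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Render only when rmarkdown and Pandoc are installed; override the default
# parameter value through the params argument of render().
rendered <- NULL
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rendered <- rmarkdown::render(
    rmd_file,
    params = list(state = "alaska"),
    quiet = TRUE
  )
}
print(rendered)
```

render() returns the path of the output file, which by default sits next to the .Rmd file, matching the note above that the result is saved in the same directory.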

Data Visualization in R: ggvis Cheat Sheet
by shanly3011 via cheatography.com/20988/cs/3865/

ggvis Grammar

The 4 essential components are -
1. Data
2. Coordinate system
3. Marks
4. Properties

Syntax Example-
faithful %>%
  ggvis(~waiting, ~eruptions) %>%
  layer_points() %>%
  add_axis("x", title = "Waiting period",
    values = c(1,2,3,4,5,6,7), subdivide = 9)

We use the piping operator '%>%' for our syntaxes.

Mapping Vs Setting properties

Mapping                                    Setting
= maps property to a data value            := sets property to a specific size/colour/width
Used for visualization of data             Used for customizing the appearance of plots
ggvis scales the values to a pre-defined   ggvis sends the colour value to vega - a javascript
scale of colour/sizes                      package for further processing

Properties for points

The properties for points are - fill, x, y, stroke, strokeWidth, strokeOpacity, opacity, fillOpacity, shape, size

Sample code:
faithful %>%
  ggvis(~waiting, ~eruptions, fillOpacity = ~eruptions,
    size := 100, fill := "red", stroke := "red", shape := "cross") %>%
  layer_points()

Properties for lines

The properties for lines include - x, y, fill, fillOpacity, opacity, stroke, strokeDash, strokeOpacity, and strokeWidth

Transformations

compute_smooth()                              compute_bin()
It transforms the data to generate a new      It transforms the data to generate a new
dataframe.                                    dataframe.
It returns a dataset with 2 variables, one    It returns a dataset with variables such as xmin_,
named pred_ and the other resp_               xmax_ and count_, which supply the 4 rect
                                              properties x, x2, y, y2.

Syntax: compute_smooth()
Long way: mtcars %>% compute_smooth(mpg ~ wt) %>% ggvis(~pred_, ~resp_) %>% layer_lines()
In-built: mtcars %>% ggvis(~wt, ~mpg) %>% layer_smooths()

Syntax: compute_bin()
Long way: faithful %>% compute_bin(~waiting, width = 5) %>% ggvis(x = ~xmin_, x2 = ~xmax_, y = 0, y2 = ~count_) %>% layer_rects()
In-built: faithful %>% ggvis(~waiting) %>% layer_histograms(width = 5)

Similarly, we have compute_count() or the in-built function layer_bars()

Transformations (cont)

compute_density()
A density plot uses a line to display the density of a variable at each point in its range.

Syntax: compute_density()
Long way: faithful %>% compute_density(~waiting) %>% ggvis(~pred_, ~resp_) %>% layer_lines()
In-built: faithful %>% ggvis(~waiting, fill := "green") %>% layer_densities()

It returns a data frame with two columns: pred_, the x values of the variable's density line, and resp_, the y values of the variable's density line.
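
To make the pred_/resp_ idea concrete without installing ggvis, the sketch below builds a comparable two-column data frame with base R's density(). This is an illustration of the concept only, not ggvis's implementation; the column names are copied from the description above and dens_df is a hypothetical name.

```r
# Kernel density estimate of the waiting times (base R, stats package).
dens <- density(faithful$waiting)

# Two columns mirroring the description above:
# pred_ = x positions along the range, resp_ = estimated density at each position.
dens_df <- data.frame(pred_ = dens$x, resp_ = dens$y)

print(head(dens_df))
print(nrow(dens_df))  # density() evaluates at 512 points by default
```

Drawing the line with plot(resp_ ~ pred_, data = dens_df, type = "l") gives the base-graphics counterpart of layer_lines() applied to the transformed data.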

By shanly3011 Published 9th April, 2015. Sponsored by CrosswordCheats.com


Last updated 12th May, 2016. Learn to solve cryptic crosswords!
Page 1 of 1. http://crosswordcheats.com

cheatography.com/shanly3011/
R color cheatsheet
Melanie Frazier

Finding a good color scheme for presenting data can be challenging. This color cheatsheet will help!

R uses hexadecimal to represent colors
Hexadecimal is a base-16 number system used to describe color. Red, green, and blue are each represented by two characters (#rrggbb). Each character has 16 possible symbols: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F:
"00" can be interpreted as 0.0 and "FF" as 1.0
i.e., red = #FF0000, black = #000000, white = #FFFFFF
Two additional characters (with the same scale) can be added to the end to describe transparency (#rrggbbaa)

R has 657 built in color names
Example: peachpuff4
To see a list of names:
colors()
These colors are displayed on P. 3.

R translates various color models to hex, e.g.:
• RGB (red, green, blue): The default intensity scale in R ranges from 0-1; but another commonly used scale is 0-255. This is obtained in R using maxColorValue=255. alpha is an optional argument for transparency, with the same intensity scale.
  rgb(r, g, b, maxColorValue=255, alpha=255)
• HSV (hue, saturation, value): values range from 0-1, with optional alpha argument
  hsv(h, s, v, alpha)
• HCL (hue, chroma, luminance): hue describes the color and ranges from 0-360; 0 = red, 120 = green, blue = 240, etc. Range of chroma and luminance depend on hue and each other
  hcl(h, c, l, alpha)

A few notes on HSV/HCL
HSV is a better model for how humans perceive color. HCL can be thought of as a perceptually based version of the HSV model….blah blah blah…
Without delving into color theory: color schemes based on HSV/HCL models generally just look good.

R Color Palettes
This is for all of you who don't know anything about color theory, and don't care but want some nice colors on your map or figure….NOW!
TIP: When it comes to selecting a color palette, DO NOT try to handpick individual colors! You will waste a lot of time and the result will probably not be all that great. R has some good packages for color palettes. Here are some of the options.

Packages: grDevices and colorRamps
grDevices comes with the base installation and colorRamps must be installed. Each palette's function has an argument for the number of colors and transparency (alpha):
heat.colors(4, alpha=1)
> "#FF0000FF" "#FF8000FF" "#FFFF00FF" "#FFFF80FF"
grDevices palettes: cm.colors, topo.colors, terrain.colors, heat.colors, rainbow (see P. 4 for options)

For the rainbow palette you can also select start/end color (red = 0, yellow = 1/6, green = 2/6, cyan = 3/6, blue = 4/6 and magenta = 5/6) and saturation (s) and value (v):
rainbow(n, s = 1, v = 1, start = 0, end = max(1, n - 1)/n, alpha = 1)

Package: RColorBrewer
This function has an argument for the number of colors and the color palette (see P. 4 for options).
brewer.pal(4, "Set3")
> "#8DD3C7" "#FFFFB3" "#BEBADA" "#FB8072"
To view colorbrewer palettes in R: display.brewer.all(5)
There is also a very nice interactive viewer:
http://colorbrewer2.org/

## My Recommendation ##
Package: colorspace
These color palettes are based on HCL and HSV color models. The results can be very aesthetically pleasing. There are some default palettes:
colorspace default palettes: diverge_hcl, diverge_hsv, terrain_hcl, sequential_hcl, rainbow_hcl
rainbow_hcl(4)
"#E495A5" "#ABB065" "#39BEB1" "#ACA4E2"

However, all palettes are fully customizable:
diverge_hcl(7, h = c(246, 40), c = 96, l = c(65, 90))

Choosing the values would be daunting. But there are some recommended palettes in the colorspace documentation. There is also an interactive tool that can be used to obtain a customized palette. To start the tool:
pal <- choose_palette()

R can translate colors to rgb (this is handy for matching colors in other programs)
col2rgb(c("#FF0000", "blue"))
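
The conversions described above can be checked with base R alone (grDevices is attached by default). This is a small illustrative sketch; the variable names are arbitrary.

```r
# RGB on the 0-255 scale to a hex string.
red_hex <- rgb(255, 0, 0, maxColorValue = 255)

# The same colour with about 50% transparency: two extra hex digits (#rrggbbaa).
red_half <- rgb(255, 0, 0, alpha = 128, maxColorValue = 255)

# HSV with hue 0, full saturation and full value is also pure red.
red_hsv <- hsv(0, 1, 1)

# Hex codes and colour names back to an RGB matrix (rows: red, green, blue).
rgb_matrix <- col2rgb(c("#FF0000", "blue"))

print(c(red_hex, red_half, red_hsv))  # "#FF0000" "#FF000080" "#FF0000"
print(rgb_matrix)
print(length(colors()))               # number of built-in colour names
```

The alpha value 128 becomes the hex pair 80, which is why the transparent red ends in "80"; on the default 0-1 scale the equivalent call would use alpha = 128/255.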
Overview of colorspace palette selector

library("colorspace")
pal <- choose_palette()

• Select the type of color scheme based on the type of data.
• Default color schemes: can be used "as is" or as a starting point for modification.
• Interactively select:
  • hue: color
  • chroma: low chroma = gray
  • luminance: high luminance = pastel
  • power: how the color changes along a gradient
• Select # of colors in palette.
• Save palette for future R sessions:
  • txt file with hex codes
  • .R file with a function describing how to generate the palette. source can be used to import the function into R; but one complication is that you have to open the .R file and name the function to use it.
  • Copy values into relevant colorspace functions.
    Diverging color schemes: diverge_hcl(7, h = c(260, 0), c = 100, l = c(28, 90), power = 1.5)
    Sequential color schemes: sequential_hcl(n, h, c. = c(), l = c(), power)
    Qualitative color schemes: rainbow_hcl(n, c, l, start, end) (for qualitative schemes; start/end refer to the H1/H2 hue values)
• Display color scheme with different plot types.

When "OK" is selected, the color palette will be saved in the R session. To return 7 hex color codes from the selected palette:
pal <- choose_palette()
pal(7)
[NOTE: These values are not saved if you don't save the session]

How to use hex codes to define color using the plot function

Discrete variables

Option 1
If you don't need to control which colors are associated with each level of a variable:
plot(Sepal.Length ~ Sepal.Width,
  col=rainbow_hcl(3)[c(Species)],
  data=iris, pch=16)
legend("topleft", pch=16, col=rainbow_hcl(3),
  legend=unique(iris$Species))

Option 2
If you want to control which colors are associated with the levels of a variable, I find it easiest to create a variable in the data:
iris$color <- factor(iris$Species,
  levels=c("virginica", "versicolor", "setosa"),
  labels=rainbow_hcl(3))
plot(Sepal.Length ~ Sepal.Width,
  col=as.character(color), pch=16, data=iris)

Continuous variables

Option 1
Break into categories and assign colors:
iris2 <- subset(iris, Species=="setosa")
color <- cut(iris2$Petal.Length,
  breaks=c(0,1.3,1.5,2), labels=sequential_hcl(3))
Or, break by quantiles (be sure to include 0 & 1; five break points give four categories, so four colors are needed, and include.lowest=TRUE keeps the minimum value):
color <- cut(iris2$Petal.Length,
  breaks=quantile(iris2$Petal.Length, c(0, 0.25, 0.5, 0.75, 1)),
  labels=sequential_hcl(4), include.lowest=TRUE)
plot(Sepal.Width ~ Sepal.Length, pch=16,
  col=as.character(color), data=iris2)

Option 2
Fully continuous gradient:
data <- data.frame("a"=runif(10000),
  "b"=runif(10000))
color=diverge_hcl(length(data$a))[rank(data$a)]
plot(a~b, col=color, pch=16, data=data)

For ggplot2, I think the most flexible color scales are:
scale_colour_manual
scale_colour_gradient
for discrete and continuous variables, respectively
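
The quantile-based binning under "Continuous variables" can be written as a self-contained base-R sketch. As a labelled substitution, hcl.colors() from grDevices (R 3.6 or later) replaces colorspace's sequential_hcl() so that no extra package is needed; the names q, bins and point_col are illustrative.

```r
# Setosa flowers only, as in the cheat sheet example.
iris2 <- subset(iris, Species == "setosa")

# Five break points (0%, 25%, 50%, 75%, 100%) define four bins.
q <- quantile(iris2$Petal.Length, c(0, 0.25, 0.5, 0.75, 1))

# One colour per bin; include.lowest = TRUE keeps the minimum in the first bin.
bins <- cut(iris2$Petal.Length, breaks = q,
            labels = hcl.colors(4), include.lowest = TRUE)

# cut() returns a factor; plot() needs the hex strings, not the integer codes.
point_col <- as.character(bins)

print(table(bins))
```

The colours can then be passed to plot(Sepal.Width ~ Sepal.Length, pch = 16, col = point_col, data = iris2). Passing the factor itself as col would make plot() use its integer codes and silently fall back to the default palette.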
[Figure: chart of R's built-in named colors. Code to produce R color chart from: http://www.biecek.pl/R/R.pdf and http://bc.bojanorama.pl/2013/04/r-color-reference-sheet]

colorRamps and grDevices
[Figure: colorRamps and grDevices color palettes, display from: http://bc.bojanorama.pl/2013/04/r-color-reference-sheet/]

colorspace
To begin interactive color selector: pal <- choose_palette()

RColorBrewer
[Figure: RColorBrewer palettes in three groups: Sequential, Qualitative, Diverging]
To display RColorBrewer palette: display.brewer.all()
For interactive color selector: http://colorbrewer2.org/

Useful Resources:
A larger color chart of R named colors:
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf

Nice overview of color in R:
http://research.stowers-institute.org/efg/Report/UsingColorInR.pdf
http://students.washignton.edu/mclarkso/documents/colors Ver2.pdf

A color theory reference:
Zeileis, A., K. Hornik, P. Murrell. 2009. Escaping RGBland: selecting colors for statistical graphics. Computational Statistics & Data Analysis 53:3259-3270