R Cheat Sheet Merged
R cheat sheet Modified from: P. Dalgaard (2002). Introductory Statistics with R. Springer, New York.
BIO360 Biometrics I, Fall 2007 H. Wagner, Biology UTM
Arguments to cut():
  breaks = c()   Cutpoints. Values of x outside 'breaks' give NA. Can also be a single
                 number, the number of cutpoints.
  labels = c()   Names for groups. Default is 1, 2, ...

Discrete response:
  binom.test                 Binomial test (incl. sign test)
  prop.test                  Comparison of proportions
  fisher.test                Exact test in 2 x 2 tables
  chisq.test                 Chi-square test of independence
  glm(y ~ x1+x2, binomial)   Logistic regression

Factor recoding:
  levels(f) <- names    New level names
  factor(newcodes[f])   Combining levels: 'newcodes', e.g., c(1,1,1,2,3) to amalgamate
                        the first 3 of 5 groups of factor f
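A minimal base-R sketch of the entries above: bin a numeric vector with cut() (note the breaks and labels arguments), then run a binomial test and a chi-square test of independence. The data values are made up for illustration.

```r
# Bin a numeric vector into intervals; values outside 'breaks' would give NA.
x <- c(1, 7, 12, 18, 25, 31, 38, 44)
g <- cut(x, breaks = c(0, 15, 30, 45))        # 3 intervals, default labels
table(g)                                       # counts per interval
labs <- cut(x, breaks = c(0, 15, 30, 45), labels = c("low", "mid", "high"))

binom.test(7, 20, p = 0.5)                     # binomial test: 7 successes in 20 trials
chisq.test(table(labs, rep(c("a", "b"), 4)))   # chi-square test of independence
```

With so few observations chisq.test() warns that the approximation may be inaccurate; fisher.test() on the same table avoids that.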
rda()         PCA or redundancy analysis (RDA). Package: 'vegan'
cca()         Perform (canonical) correspondence analysis, CA/CCA. Package: 'vegan'
diversity()   Calculate diversity indices. Package: 'vegan'

Default plot colours (col = 1 to 8):
  1 black, 2 red, 3 green, 4 blue, 5 light blue, 6 purple, 7 yellow, 8 grey
Base R Cheat Sheet

Vectors

Creating Vectors:
  c(2, 4, 6)   Join elements into a vector: 2 4 6
  2:6          An integer sequence: 2 3 4 5 6

Programming

For Loop:
  for (variable in sequence) {
    Do something
  }

While Loop:
  while (condition) {
    Do something
  }
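A minimal sketch filling in the loop templates above with concrete bodies:

```r
# For loop: sum the integer sequence 2 3 4 5 6.
total <- 0
for (i in 2:6) {
  total <- total + i
}

# While loop: double n until it reaches at least 100.
n <- 1
while (n < 100) {
  n <- n * 2
}

total  # 20
n      # 128
```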
RStudio® is a trademark of RStudio, Inc. • CC BY Mhairi McNeill • [email protected] Learn more at web page or vignette • package version • Updated: 3/15
Types
Converting between common data types in R. You can always go from a higher value in the table to a lower value.
  as.logical     TRUE, FALSE, TRUE                 Boolean values (TRUE or FALSE).
  as.numeric     1, 0, 1                           Numbers.
  as.character   '1', '0', '1'                     Character strings. Generally preferred to factors.
  as.factor      '1', '0', '1', levels: '1', '0'   Character strings with preset levels.
                                                   Needed for some statistical models.

Matrixes
  m <- matrix(x, nrow = 3, ncol = 3)   Create a matrix from x.
  m[2, ]    Select a row
  m[ , 1]   Select a column
  t(m)      Transpose
  m %*% n   Matrix multiplication

Strings (also see the stringr library)
  paste(x, y, sep = ' ')      Join multiple vectors together.
  paste(x, collapse = ' ')    Join elements of a vector together.
  grep(pattern, x)            Find regular expression matches in x.
  gsub(pattern, replace, x)   Replace matches in x with a string.
  nchar(x)                    Number of characters in a string.
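A quick base-R sketch of the conversions and string helpers listed above, with small made-up inputs:

```r
v <- c("1", "0", "1")
as.logical(c(1, 0, 1))        # TRUE FALSE TRUE
as.numeric(v)                 # 1 0 1
f <- as.factor(v)             # factor with levels "0" "1"

paste("a", "b", sep = "-")    # "a-b"
paste(v, collapse = "")       # "101"
grep("b", c("abc", "xyz"))    # 1  (index of the matching element)
gsub("0", "9", "1011")        # "1911"
nchar("hello")                # 5
```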
Lists
  l <- list(x = 1:5, y = c('a', 'b'))   A list is a collection of elements which can be of different types.
  l[[2]]   Second element of l.
  l[1]     New list with only the first element.
  l$x      Element named x.
  l['y']   New list with only element named y.

Factors
  factor(x)            Turn a vector into a factor. Can set the levels of the factor and the order.
  cut(x, breaks = 4)   Turn a numeric vector into a factor by 'cutting' into sections.

Maths Functions
  log(x)        Natural log.                      sum(x)        Sum.
  exp(x)        Exponential.                      mean(x)       Mean.
  max(x)        Largest element.                  median(x)     Median.
  min(x)        Smallest element.                 quantile(x)   Percentage quantiles.
  round(x, n)   Round to n decimal places.        rank(x)       Rank of elements.
  signif(x, n)  Round to n significant figures.   var(x)        The variance.
  cor(x, y)     Correlation.                      sd(x)         The standard deviation.

Data Frames (also see the dplyr library)
  df <- data.frame(x = 1:3, y = c('a', 'b', 'c'))
  A special case of a list where all elements are the same length.
  List subsetting:    df$x   df[[2]]
  Matrix subsetting:  df[ , 2]   df[2, 2]
  cbind - Bind columns.   rbind - Bind rows.

Understanding a data frame:
  View(df)   See the full data frame.
  head(df)   See the first 6 rows.
  nrow(df)   Number of rows.
  dim(df)    Number of columns and rows.

Variable Assignment
  > a <- 'apple'
  > a
  [1] 'apple'

The Environment
  ls()              List all variables in the environment.
  rm(list = ls())   Remove all variables from the environment.
  You can use the environment panel in RStudio to browse variables in your environment.

Statistics
  lm(y ~ x, data = df)    Linear model.
  glm(y ~ x, data = df)   Generalised linear model.
  t.test(x, y)            Perform a t-test for a difference between means.
  pairwise.t.test         Perform a t-test for paired data.
  prop.test               Test for a difference between proportions.
  aov                     Analysis of variance.
  summary                 Get more detailed information out of a model.

Distributions
              Random     Density    Cumulative     Quantile
              Variates   Function   Distribution
  Normal      rnorm      dnorm      pnorm          qnorm
  Poisson     rpois      dpois      ppois          qpois
  Binomial    rbinom     dbinom     pbinom         qbinom
  Uniform     runif      dunif      punif          qunif

Plotting
  plot(x)      Values of x in order.
  plot(x, y)   Values of x against y.
  hist(x)      Histogram of x.
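A short sketch tying the distribution table to the statistics functions: the d/p/q/r quartet for the normal distribution, then a t-test on two simulated samples (the means 0 and 2 are arbitrary illustration values).

```r
pnorm(1.96)               # cumulative probability, ~0.975
qnorm(0.975)              # quantile, ~1.96 (inverse of pnorm)
dnorm(0)                  # density at 0, ~0.399

set.seed(1)               # make the simulated data reproducible
x <- rnorm(30, mean = 0)  # random variates
y <- rnorm(30, mean = 2)
t.test(x, y)              # test for a difference between means
```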
Dates: see the lubridate library.
R Programming Cheat Sheet - just the basics
Created By: Arianne Colton and Sean Chen

General
  • R version 3.0 and greater adds support for 64-bit integers
  • R is case sensitive
  • R indexing starts from 1

Help
  help(functionName) or ?functionName
  Help home page:                          help.start()
  Special character help:                  help('[')
  Search help:                             help.search(..) or ??..
  Search function with partial name:       apropos('mea')
  See example(s):                          example(topic)

Objects in current environment
  Display object names:   objects() or ls()
  Remove object:          rm(object1, object2, ..)
  Notes:
  1. Names starting with a period are accessible but invisible, so they will not be found by 'ls'
  2. To guarantee memory removal, use 'gc', releasing unused memory to the OS. R performs
     automatic 'gc' periodically.

Symbol Name Environment
  • If multiple packages use the same function name, the function from the package loaded
    last gets called.
  • To avoid this, precede the function with the name of the package,
    e.g. packageName::functionName(..)

Library
  Only trust reliable R packages, e.g. 'ggplot2' for plotting, 'sp' for dealing with
  spatial data, 'reshape2', 'survival', etc.
  Load package:     library(packageName) or require(packageName)
  Unload package:   detach(packageName)
  Note: require() returns the status (TRUE/FALSE)

Data Types
  Check data type: class(variable)
  Four basic data types:
  1. Numeric - includes float/double, int, etc.
     is.numeric(variable)
  2. Character (string)
     nchar(variable)  # length of a character or numeric
  3. Date/POSIXct
     • Date: stores just a date. In numeric form, the number of days since 1970-01-01.
       date1 <- as.Date('2012-06-28'); as.numeric(date1)
     • POSIXct: stores a date and time. In numeric form, the number of seconds since 1970-01-01.
       date2 <- as.POSIXct('2012-06-28 18:00')
     Note: use the 'lubridate' and 'chron' packages to work with dates.
  4. Logical
     • TRUE = 1, FALSE = 0
     • Use == / != to test equality and inequality
     • as.numeric(TRUE) => 1

Manipulating Strings
  Putting together:  paste('string1', 'string2', sep = '/')
                     # separator ('sep') is a space by default
                     paste(c('1', '2'), collapse = '/')  # returns '1/2'
  Split string:      stringr::str_split(string = v1, pattern = '-')  # returns a list
  Get substring:     stringr::str_sub(string = v1, start = 1, end = 3)
  Match string:      isJohnFound <- stringr::str_detect(string = df1$col1,
                       pattern = ignore.case('John'))
                     # returns TRUE/FALSE if John was found

Data Structures

Vector
  • Group of elements of the SAME type
  • R is a vectorized language: operations are applied to each element of the vector automatically
  • R has no concept of column vectors or row vectors
  • Special vectors: letters and LETTERS, which contain the lower-case and upper-case letters
  Create vector:                v1 <- c(1, 2, 3)
  Get length:                   length(v1)
  Check if all or any is TRUE:  all(v1); any(v1)
  Integer indexing:             v1[1:3]; v1[c(1,6)]
  Boolean indexing:             v1[is.na(v1)] <- 0
  Naming:                       c(first = 'a', ..) or names(v1) <- c('first', ..)

Factor
  • as.factor(v1) gets you the levels, i.e. the unique values
  • Factors can reduce the size of a variable because they only store unique values,
    but can be buggy if not used properly

List
  Stores any number of items of ANY type
  Create list:                list1 <- list(first = 'a', ...)
  Create empty list:          vector(mode = 'list', length = 3)
  Get element:                list1[[1]] or list1[['first']]
  Append using numeric index: list1[[6]] <- 2
  Append using name:          list1[['newElement']] <- 2
  Note: repeatedly appending to a list, vector, data.frame, etc. is expensive; it is best
  to create a list of a certain size, then fill it.

data.frame
  • Each column is a variable, each row is an observation
  • Internally, each column is a vector
  • idata.frame is a data structure that creates a reference to a data.frame, so no copying
    is performed
  Create data frame:        df1 <- data.frame(col1 = v1, col2 = v2, v3)
  Dimension:                nrow(df1); ncol(df1); dim(df1)
  Get/set column names:     names(df1); names(df1) <- c(...)
  Get/set row names:        rownames(df1); rownames(df1) <- c(...)
  Preview:                  head(df1, n = 10); tail(...)
  Get data type:            class(df1)  # is data.frame
  Index by column(s):       df1['col1'] or df1[1]; df1[c('col1', 'col3')] or df1[c(1, 3)]
  Index by rows and columns: df1[c(1, 3), 2:3]  # returns data from rows 1 & 3, columns 2 to 3
  Index method: df1$col1 or df1[, 'col1'] or df1[, 1] returns a vector. To return a
  single-column data.frame while using single square brackets, use 'drop':
  df1[, 'col1', drop = FALSE]

Matrix
  • Similar to data.frame except every element must be the SAME type, most commonly all numeric
  • Functions that work with a data.frame should work with a matrix as well
  Create matrix:          matrix1 <- matrix(1:10, nrow = 5)
                          # fills rows 1 to 5, column 1 with 1:5, and column 2 with 6:10
  Matrix multiplication:  matrix1 %*% t(matrix2)  # where t() is transpose

Array
  • Multidimensional vector of the SAME type
  • array1 <- array(1:12, dim = c(2, 3, 2))
  • Using arrays is not recommended
  • Matrices are restricted to two dimensions, while arrays can have any dimension

data.table
  What is a data.table?
  • Extends and enhances the functionality of data.frames
  Differences: data.table vs. data.frame
  • By default data.frame turns character data into factors, while data.table does not
  • When you print data.frame data, all data prints to the console; with a data.table, it
    intelligently prints the first and last five rows
  • Key difference: data.tables are fast because they have an index like a database.
    That is, the search dt1$col1 > number does a sequential scan (vector scan). After you
    create a key for this, it will be much faster via binary search.
  Create data.table from data.frame:  data.table(df1)
  Index by column(s)*:                dt1[, 'col1', with = FALSE] or dt1[, list(col1)]
  Show info for each data.table in memory (i.e., size, ...):  tables()
  Show keys in data.table:            key(dt1)
  Create index for col1 and reorder data according to col1:  setkey(dt1, col1)
  Use key to select data:             dt1[c('col1Value1', 'col1Value2'), ]
  Multiple key select:                dt1[J('1', c('2', '3')), ]
  Aggregation**:                      dt1[, list(col1 = mean(col1)), by = col2]
                                      dt1[, list(col1 = mean(col1),
                                          col2Sum = sum(col2)), by = list(col3, col4)]
  *  Accessing columns must be done via a list of actual names, not as characters. If column
     names are characters, then the "with" argument should be set to FALSE.
  ** aggregate and d*ply functions will work, but the built-in aggregation functionality of
     data.table is faster
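A base-R sketch of the Date/POSIXct numeric forms and logical coercion described above (tz = "UTC" is added so the seconds count is unambiguous):

```r
date1 <- as.Date("2012-06-28")
as.numeric(date1)          # days since 1970-01-01
class(date1)               # "Date"

date2 <- as.POSIXct("2012-06-28 18:00", tz = "UTC")
as.numeric(date2)          # seconds since 1970-01-01

as.numeric(TRUE)           # 1  (logical coerces to numeric)
is.numeric(3.14)           # TRUE
```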
Data Munging

Apply (apply, tapply, lapply, mapply)
  • apply - most restrictive: must be used on a matrix; all elements must be the same type
  • If used on some other object, such as a data.frame, it will be converted to a matrix first
    apply(matrix1, 1, function)   # 1 = rows, 2 = columns
    # if rows, each row is passed as input to the function
  • By default, computation on NA (missing data) always returns NA, so if a matrix contains
    NAs you can ignore them (use na.rm = TRUE in apply(..), which doesn't pass NAs to your
    function)

lapply
  Applies a function to each element of a list and returns the results as a list
sapply
  Same as lapply except it returns the results as a vector

plyr
  • ddply(), llply(), ldply(), etc. (1st letter = the type of input, 2nd = the type of output)
  • plyr can be slow; most of its functionality can be accomplished using base functions or
    other packages, but plyr is easier to use
  ddply - takes a data.frame, splits it according to some variable(s), performs a desired
    action on it, and returns a data.frame
  llply - can be used instead of lapply
  • For sapply, you can use laply ('a' is array/vector/matrix); however, the laply result
    does not include the names
  each() - supply multiple functions to a function like aggregate:
    aggregate(price ~ cut, diamonds, each(mean, median))

dplyr (for data.frame ONLY)
  • Basic functions: filter(), slice(), arrange(), select(), rename(), distinct(),
    mutate(), summarise(), group_by(), sample_n()
  • Chain functions:
    df1 %>% group_by(year, month) %>% select(col1, col2) %>%
      summarise(col1mean = mean(col1))
  • Much faster than plyr, with four types of easy-to-use joins (inner, left, semi, anti)
  • Abstracts the way data is stored, so you can work with data frames, data tables, and
    remote databases with the same set of functions
  • Helper functions: which() in R is similar to 'where' in SQL

Functions and Controls
  Create function:  say_hello <- function(first, last = 'hola') { }
  Call function:    say_hello(first = 'hello')
  • R automatically returns the value of the last line of code in a function. This is bad
    practice; use return() explicitly instead.
  • do.call() - specify the name of a function either as a string (i.e. 'mean') or as an
    object (i.e. mean) and provide arguments as a list:
    do.call(mean, args = list(first = '1st'))

if / else / else if / switch / ifelse
                                              if { } else   ifelse
  Works with vectorized argument              No            Yes
  Most efficient for non-vectorized argument  Yes           No
  Works with NA                               No            Yes
  Use &&, ||                                  Yes           No

Data Reshaping
  Melt data - from column to row:
    reshape2.melt(df1, id.vars = c('col1', 'col2'),
      variable.name = 'newCol1', value.name = 'newCol2')
  Cast data - from row to column:
    reshape2.dcast(df1, col1 + col2 ~ newCol1, value.var = 'newCol2')
  If df1 has 3 more columns, col3 to col5, 'melting' creates a new df that has 3 rows for
  each combination of col1 and col2, with the values coming from the respective col3 to col5.

Combine (multiple sets into one)
  1. cbind - bind by columns
     data.frame from two vectors:           cbind(v1, v2)
     data.frame combining df1 and df2 columns:  cbind(df1, df2)
  2. rbind - bind by rows

Data (databases)
  • Only one connection may be open at a time. The connection automatically closes if R
    closes or another connection is opened.
  • If a table name has a space, use [ ] to surround the table name in the SQL string.

Included Data
  R and some packages come with data included.
  List available datasets:                        data()
  List available datasets in a specific package:  data(package = 'ggplot2')

Missing Data (NA and NULL)
  NULL is not missing, it's nothingness. NULL is atomical and cannot exist within a vector.
  If used inside a vector, it simply disappears.
  Check missing data:  is.na()
  Avoid using is.null()

Graphics
  Default basic graphics:
    hist(df1$col1, main = 'title', xlab = 'x axis label')
    plot(col2 ~ col1, data = df1)   # aka y ~ x, or plot(x, y)
  lattice and ggplot2 (more popular):
  • Initialize the object and add layers (points, lines, histograms) using +; map variables
    in the data to an axis or aesthetic using 'aes'
    ggplot(data = df1) + geom_histogram(aes(x = col1))
  • Normalized histogram (pdf, not relative frequency histogram):
    ggplot(data = df1) + geom_density(aes(x = col1), fill = 'grey50')

data.table joins
  dt1 <- data.table(df1, key = c('1', '2')); dt2 <- ...‡
  Left join:  dt1[dt2]
  ‡ A data.table join requires specifying the keys for the data tables

Created by Arianne Colton and Sean Chen ([email protected])
Based on content from 'R for Everyone' by Jared Lander
Updated: December 2, 2015
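The apply-family entries above can be sketched on a small matrix and list (base R only; the values are arbitrary):

```r
m <- matrix(1:6, nrow = 2)    # 2 x 3, filled column-wise: rows (1,3,5) and (2,4,6)
apply(m, 1, sum)              # row sums: 9 12
apply(m, 2, sum)              # column sums: 3 7 11

l <- list(a = 1:3, b = 4:6)
lapply(l, mean)               # returns a list
sapply(l, mean)               # same result, simplified to a named vector: a=2, b=5
```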
Data import with the tidyverse :: CHEAT SHEET

One of the first steps of a project is to import outside data into R. Data is often stored
in tabular formats, like csv files or spreadsheets. The front page of this sheet shows how
to import and save text files into R using readr. The back page shows how to import
spreadsheet data from Excel files using readxl or Google Sheets using googlesheets4.

Read Tabular Data with readr
  read_*(file, col_names = TRUE, col_types = NULL, col_select = NULL, id = NULL, locale,
    n_max = Inf, skip = 0, na = c("", "NA"), guess_max = min(1000, n_max),
    show_col_types = TRUE)
  See ?read_delim

  read_delim("file.txt", delim = "|")   Read files with any delimiter. If no delimiter is
    specified, it will automatically guess.
    To make file.txt, run: write_file("A|B|C\n1|2|3\n4|5|NA", file = "file.txt")

  read_csv("file.csv")   Read a comma delimited file with period decimal marks.
    To make file.csv, run: write_file("A,B,C\n1,2,3\n4,5,NA", file = "file.csv")

OTHER TYPES OF DATA
  Try one of the following packages to import other types of files:
  • haven - SPSS, Stata, and SAS files
  • DBI - databases
  • jsonlite - json
  • xml2 - XML
  • httr - Web APIs
  • rvest - HTML (Web Scraping)
  • readr::read_lines() - text data
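The sheet demonstrates the round trip with readr::write_file()/read_delim(); the same idea can be sketched with base R's read.delim() so it runs without the tidyverse installed. The file contents match the "file.txt" example above.

```r
# Write a pipe-delimited file, then read it back (base-R analogue of read_delim).
tf <- tempfile(fileext = ".txt")
writeLines("A|B|C\n1|2|3\n4|5|NA", tf)

df <- read.delim(tf, sep = "|", na.strings = "NA")
df               # 2 rows, columns A B C; the "NA" string becomes a missing value
is.na(df$C[2])   # TRUE
```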
CC BY SA Posit Software, PBC • [email protected] • posit.co • readr.tidyverse.org • readxl.tidyverse.org and googlesheets4.tidyverse.org • readr 2.1.4 • readxl 1.4.2 • googlesheets4 1.1.0 • Updated: 2023-05
Import Spreadsheets

with readxl

READ EXCEL FILES
  read_excel(path, sheet = NULL, range = NULL)
  Read a .xls or .xlsx file based on the file extension. See front page for more read
  arguments. Also read_xls() and read_xlsx().
  read_excel("excel_file.xlsx")

READ SHEETS
  read_excel(path, sheet = NULL)   Specify which sheet to read by position or name.
  read_excel(path, sheet = 1)
  read_excel(path, sheet = "s1")
  excel_sheets(path)   Get a vector of sheet names.
  excel_sheets("excel_file.xlsx")

  To read multiple sheets:
  1. Get a vector of sheet names from the file path.
  2. Set the vector names to be the sheet names.
  3. Use purrr::map_dfr() to read multiple files into one data frame.
  path <- "your_file_path.xlsx"
  path |> excel_sheets() |> set_names() |> map_dfr(read_excel, path = path)

READXL COLUMN SPECIFICATION
  Column specifications define what data type each column of a file will be imported as.
  Use the col_types argument of read_excel() to set the column specification.
  Guess column types: to guess a column type, read_excel() looks at the first 1000 rows of
  data. Increase with the guess_max argument.
    read_excel(path, guess_max = Inf)
  Set all columns to the same type, e.g. character:
    read_excel(path, col_types = "text")
  Set each column individually:
    read_excel(path, col_types = c("text", "guess", "guess", "numeric"))
  Column types: skip, guess, logical, numeric, text, date, list
  Use list for columns that include multiple data types. See tidyr and purrr for
  list-column data.

with googlesheets4

READ SHEETS
  read_sheet(ss, sheet = NULL, range = NULL)
  Read a sheet from a URL, a Sheet ID, or a dribble from the googledrive package. See front
  page for more read arguments. Same as range_read().

SHEETS METADATA
  URLs are in the form:
  https://docs.google.com/spreadsheets/d/SPREADSHEET_ID/edit#gid=SHEET_ID
  gs4_get(ss)            Get spreadsheet meta data.
  gs4_find(...)          Get data on all spreadsheet files.
  sheet_properties(ss)   Get a tibble of properties for each worksheet.
                         Also sheet_names().

GOOGLESHEETS4 COLUMN SPECIFICATION
  Column specifications define what data type each column of a file will be imported as.
  Use the col_types argument of read_sheet()/range_read() to set the column specification.
  Guess column types: to guess a column type, read_sheet()/range_read() looks at the first
  1000 rows of data. Increase with guess_max.
    read_sheet(path, guess_max = Inf)
  Set all columns to the same type, e.g. character:
    read_sheet(path, col_types = "c")
  Set each column individually:
    # col types: skip, guess, integer, logical, character
    read_sheets(ss, col_types = "_?ilc")
  Column types:
    • skip - "_" or "-"   • date - "D"
    • guess - "?"         • datetime - "T"
    • logical - "l"       • character - "c"
    • integer - "i"       • list-column - "L"
    • double - "d"        • cell - "C"  Returns list of raw cell data.
    • numeric - "n"
  Use list for columns that include multiple data types. See tidyr and purrr for
  list-column data.

WRITE SHEETS
  write_sheet(data, ss = NULL, sheet = NULL)   Write a data frame into a new or existing
  Sheet.
  gs4_create(name, ..., sheets = NULL)   Create a new Sheet with a vector of names, a data
  frame, or a (named) list of data frames.
  sheet_append(ss, data, sheet = 1)   Add rows to the end of a worksheet.

CELL SPECIFICATION FOR READXL AND GOOGLESHEETS4
  Use the range argument of readxl::read_excel() or googlesheets4::read_sheet() to read a
  subset of cells from a sheet.
    read_excel(path, range = "Sheet1!B1:D2")
    read_sheet(ss, range = "B1:D2")
  Also use the range argument with cell specification functions cell_limits(),
  cell_rows(), cell_cols(), and anchored().

FILE LEVEL OPERATIONS
  googlesheets4 also offers ways to modify other aspects of Sheets (e.g. freeze rows, set
  column width, manage (work)sheets). Go to googlesheets4.tidyverse.org to read more.
  For whole-file operations (e.g. renaming, sharing, placing within a folder), see the
  tidyverse package googledrive at googledrive.tidyverse.org.

OTHER USEFUL EXCEL PACKAGES
  For functions to write data to Excel files, see: openxlsx, writexl
  For working with non-tabular Excel data, see: tidyxl
Data transformation with dplyr :: CHEAT SHEET

dplyr functions work with pipes and expect tidy data. In tidy data:
  • Each variable is in its own column
  • Each observation, or case, is in its own row
Pipes: x |> f(y) becomes f(x, y)

Summarize Cases
  Apply summary functions to columns to create a new table of summary statistics. Summary
  functions take vectors as input and return one value (see back).
  summarize(.data, …)   Compute table of summaries.
    mtcars |> summarize(avg = mean(mpg))
  count(.data, …, wt = NULL, sort = FALSE, name = NULL)   Count number of rows in each
  group defined by the variables in … Also tally(), add_count(), add_tally().
    mtcars |> count(cyl)

Group Cases
  Use group_by(.data, …, .add = FALSE, .drop = TRUE) to create a "grouped" copy of a table
  grouped by columns in ... dplyr functions will manipulate each "group" separately and
  combine the results.
    mtcars |> group_by(cyl) |> summarize(avg = mean(mpg))
  Use rowwise(.data, …) to group data into individual rows. dplyr functions will compute
  results for each row. Also apply functions to list-columns. See the tidyr cheat sheet
  for list-column workflow.
    starwars |> rowwise() |> mutate(film_count = length(films))
  ungroup(x, …)   Returns ungrouped copy of table.
    g_mtcars <- mtcars |> group_by(cyl)
    ungroup(g_mtcars)

Manipulate Cases

EXTRACT CASES
  Row functions return a subset of rows as a new table.
  filter(.data, …, .preserve = FALSE)   Extract rows that meet logical criteria.
    mtcars |> filter(mpg > 20)
  distinct(.data, …)   Remove rows with duplicate values.
    mtcars |> distinct(gear)
  slice(.data, …, .preserve = FALSE)   Select rows by position.
    mtcars |> slice(10:15)
  slice_sample(.data, …, n, prop, weight_by = NULL, replace = FALSE)   Randomly select
  rows. Use n to select a number of rows and prop to select a fraction of rows.
    mtcars |> slice_sample(n = 5, replace = TRUE)
  slice_min(.data, order_by, …, n, prop, with_ties = TRUE) and slice_max()   Select rows
  with the lowest and highest values.
    mtcars |> slice_min(mpg, prop = 0.25)
  slice_head(.data, …, n, prop) and slice_tail()   Select the first or last rows.
    mtcars |> slice_head(n = 5)

  Logical and boolean operators to use with filter():
    ==  <  <=  is.na()   %in%  |  xor()
    !=  >  >=  !is.na()  !     &
  See ?base::Logic and ?Comparison for help.

ARRANGE CASES
  arrange(.data, …, .by_group = FALSE)   Order rows by values of a column or columns (low
  to high); use with desc() to order from high to low.
    mtcars |> arrange(mpg)
    mtcars |> arrange(desc(mpg))

ADD CASES
  add_row(.data, …, .before = NULL, .after = NULL)   Add one or more rows to a table.
    cars |> add_row(speed = 1, dist = 1)

Manipulate Variables

EXTRACT VARIABLES
  Column functions return a set of columns as a new vector or table.
  pull(.data, var = -1, name = NULL, …)   Extract column values as a vector, by name or
  index.
    mtcars |> pull(wt)
  select(.data, …)   Extract columns as a table.
    mtcars |> select(mpg, wt)
  relocate(.data, …, .before = NULL, .after = NULL)   Move columns to new position.
    mtcars |> relocate(mpg, cyl, .after = last_col())

  Use these helpers with select() and across(), e.g. mtcars |> select(mpg:cyl)
    contains(match)      num_range(prefix, range)       :, e.g., mpg:cyl
    ends_with(match)     all_of(x)/any_of(x, …, vars)   !, e.g., !gear
    starts_with(match)   matches(match)                 everything()

MANIPULATE MULTIPLE VARIABLES AT ONCE
  df <- tibble(x_1 = c(1, 2), x_2 = c(3, 4), y = c(4, 5))
  across(.cols, .funs, …, .names = NULL)   Summarize or mutate multiple columns in the
  same way.
    df |> summarize(across(everything(), mean))
  c_across(.cols)   Compute across columns in row-wise data.
    df |> rowwise() |> mutate(x_total = sum(c_across(1:2)))

MAKE NEW VARIABLES
  Apply vectorized functions to columns. Vectorized functions take vectors as input and
  return vectors of the same length as output (see back).
  mutate(.data, …, .keep = "all", .before = NULL, .after = NULL)   Compute new column(s).
  Also add_column().
    mtcars |> mutate(gpm = 1 / mpg)
    mtcars |> mutate(gpm = 1 / mpg, .keep = "none")
  rename(.data, …)   Rename columns. Use rename_with() to rename with a function.
    mtcars |> rename(miles_per_gallon = mpg)
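The group-and-summarize pattern above (mtcars |> group_by(cyl) |> summarize(avg = mean(mpg))) can be sketched in base R with aggregate(), which runs without dplyr installed; mtcars ships with R.

```r
# Base-R analogue of group_by(cyl) + summarize(avg = mean(mpg)).
res <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
res   # one row per cyl value (4, 6, 8) with the mean mpg for each group
```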
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at dplyr.tidyverse.org • dplyr 1.1.2 • Updated: 2023-05
Vectorized Functions
  TO USE WITH MUTATE(): mutate() applies vectorized functions to columns to create new
  columns.
  dplyr::coalesce() - first non-NA values by element across a set of vectors
  dplyr::if_else() - element-wise if() + else()
  dplyr::na_if() - replace specific values with NA

Summary Functions
  TO USE WITH SUMMARIZE(): summarize() applies summary functions to columns to create a
  new table.

Row Names
  Tidy data does not use rownames, which store a variable outside of the columns. To work
  with the rownames, first move them into a column.
  tibble::rownames_to_column()   Move row names into a column.

Combine Tables

COMBINE VARIABLES
  left_join(x, y, by = "A")   Join matching values from y to x.
  Use by = c("col1", "col2", …) to specify one or more common columns to match on.

COMBINE CASES
  intersect(x, y, …)   Rows that appear in both x and y.
  setdiff(x, y, …)     Rows that appear in x but not y.
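The join and set operations above can be sketched with base R (merge() as the analogue of left_join), so the example runs without dplyr; the two small data frames are made up for illustration.

```r
x <- data.frame(A = c("a", "b", "c"), B = c("t", "u", "v"), C = 1:3,
                stringsAsFactors = FALSE)
y <- data.frame(A = c("a", "b", "d"), D = c(3, 2, 1),
                stringsAsFactors = FALSE)

lj <- merge(x, y, by = "A", all.x = TRUE)  # left join: keeps all rows of x,
lj                                         # NA in D where y has no match ("c")

intersect(x$A, y$A)                        # "a" "b"
setdiff(x$A, y$A)                          # "c"
```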
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
Stats An alternative way to build a layer. Scales Override defaults with scales package. Coordinate Systems Faceting
A stat builds new variables to plot (e.g., count, prop). Scales map data values to the visual values of an r <- d + geom_bar() Facets divide a plot into
fl cty cyl aesthetic. To change a mapping, add a new scale. r + coord_cartesian(xlim = c(0, 5)) - xlim, ylim subplots based on the
n <- d + geom_bar(aes(fill = fl)) The default cartesian coordinate system. values of one or more
+ =
x ..count..
discrete variables.
aesthetic prepackaged scale-specific r + coord_fixed(ratio = 1/2)
scale_ to adjust scale to use arguments ratio, xlim, ylim - Cartesian coordinates with t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
data stat geom coordinate plot
x=x· system n + scale_fill_manual( fixed aspect ratio between x and y units.
y = ..count.. values = c("skyblue", "royalblue", "blue", "navy"), t + facet_grid(cols = vars(fl))
Visualize a stat by changing the default stat of a geom limits = c("d", "e", "p", "r"), breaks =c("d", "e", "p", “r"), ggplot(mpg, aes(y = fl)) + geom_bar() Facet into columns based on fl.
name = "fuel", labels = c("D", "E", "P", "R")) Flip cartesian coordinates by switching
function, geom_bar(stat="count") or by using a stat
x and y aesthetic mappings. t + facet_grid(rows = vars(year))
function, stat_count(geom="bar"), which calls a default range of title to use in labels to use breaks to use in
values to include legend/axis in legend/axis legend/axis Facet into rows based on year.
geom to make a layer (equivalent to a geom function). in mapping
Use ..name.. syntax to map stat variables to aesthetics. r + coord_polar(theta = "x", direction=1)
theta, start, direction - Polar coordinates. t + facet_grid(rows = vars(year), cols = vars(fl))
GENERAL PURPOSE SCALES Facet into both rows and columns.
geom to use stat function geommappings r + coord_trans(y = “sqrt") - x, y, xlim, ylim t + facet_wrap(vars(fl))
Use with most aesthetics Transformed cartesian coordinates. Set xtrans
i + stat_density_2d(aes(fill = ..level..), Wrap facets into a rectangular layout.
scale_*_continuous() - Map cont’ values to visual ones. and ytrans to the name of a window function.
geom = "polygon")
variable created by stat scale_*_discrete() - Map discrete values to visual ones. Set scales to let axis limits vary across facets.
scale_*_binned() - Map continuous values to discrete bins. π + coord_quickmap()
60
π + coord_map(projection = "ortho", orientation t + facet_grid(rows = vars(drv), cols = vars(fl),
c + stat_bin(binwidth = 1, boundary = 10) scale_*_identity() - Use data values as visual ones. = c(41, -74, 0)) - projection, xlim, ylim scales = "free")
lat
x, y | ..count.., ..ncount.., ..density.., ..ndensity.. scale_*_manual(values = c()) - Map discrete values to Map projections from the mapproj package x and y axis limits adjust to individual facets:
manually chosen visual ones.
c + stat_count(width = 1) x, y | ..count.., ..prop.. long
(mercator (default), azequalarea, lagrange, etc.). "free_x" - x axis limits adjust
scale_*_date(date_labels = "%m/%d"), "free_y" - y axis limits adjust
c + stat_density(adjust = 1, kernel = "gaussian") date_breaks = "2 weeks") - Treat data values as dates.
x, y | ..count.., ..density.., ..scaled..
e + stat_bin_2d(bins = 30, drop = T)
scale_*_datetime() - Treat data values as date times.
Same as scale_*_date(). See ?strptime for label formats.
Position Adjustments Set labeller to adjust facet label:
t + facet_grid(cols = vars(fl), labeller = label_both)
x, y, fill | ..count.., ..density.. Position adjustments determine how to arrange geoms fl: c fl: d fl: e fl: p fl: r
X & Y LOCATION SCALES that would otherwise occupy the same space.
e + stat_bin_hex(bins = 30) x, y, fill | ..count.., ..density.. t + facet_grid(rows = vars(fl),
Use with x or y aesthetics (x shown here) s <- ggplot(mpg, aes(fl, fill = drv)) labeller = label_bquote(alpha ^ .(fl)))
e + stat_density_2d(contour = TRUE, n = 100)
x, y, color, size | ..level.. scale_x_log10() - Plot x on log10 scale. ↵c ↵d ↵e ↵p ↵r
scale_x_reverse() - Reverse the direction of the x axis. s + geom_bar(position = "dodge")
e + stat_ellipse(level = 0.95, segments = 51, type = "t") scale_x_sqrt() - Plot x on square root scale. Arrange elements side by side.
l + stat_contour(aes(z = z)) x, y, z, order | ..level..
l + stat_summary_hex(aes(z = z), bins = 30, fun = max) COLOR AND FILL SCALES (DISCRETE)
s + geom_bar(position = "fill")
Stack elements on top of one
Labels and Legends
x, y, z, fill | ..value.. another, normalize height. Use labs() to label the elements of your plot.
n + scale_fill_brewer(palette = "Blues")
l + stat_summary_2d(aes(z = z), bins = 30, fun = mean) For palette choices: e + geom_point(position = "jitter") t + labs(x = "New x axis label", y = "New y axis label",
x, y, z, fill | ..value.. RColorBrewer::display.brewer.all() Add random noise to X and Y position of title ="Add a title above the plot",
each element to avoid overplotting. subtitle = "Add a subtitle below title",
f + stat_boxplot(coef = 1.5) n + scale_fill_grey(start = 0.2, A caption = "Add a caption below plot",
x, y | ..lower.., ..middle.., ..upper.., ..width.. , ..ymin.., ..ymax.. end = 0.8, na.value = "red") e + geom_label(position = "nudge") alt = "Add alt text to the plot",
B
Nudge labels away from points. <aes> = "New <aes>
<AES> <AES> legend title")
f + stat_ydensity(kernel = "gaussian", scale = "area") x, y
| ..density.., ..scaled.., ..count.., ..n.., ..violinwidth.., ..width.. COLOR AND FILL SCALES (CONTINUOUS) s + geom_bar(position = "stack") t + annotate(geom = "text", x = 8, y = 9, label = "A")
Stack elements on top of one another. Places a geom with manually selected aesthetics.
e + stat_ecdf(n = 40) x, y | ..x.., ..y.. o <- c + geom_dotplot(aes(fill = ..x..))
e + stat_quantile(quantiles = c(0.1, 0.9), Each position adjustment can be recast as a function p + guides(x = guide_axis(n.dodge = 2)) Avoid crowded
o + scale_fill_distiller(palette = "Blues") with manual width and height arguments: or overlapping labels with guide_axis(n.dodge or angle).
formula = y ~ log(x), method = "rq") x, y | ..quantile..
s + geom_bar(position = position_dodge(width = 1)) n + guides(fill = "none") Set legend type for each
e + stat_smooth(method = "lm", formula = y ~ x, se = T, o + scale_fill_gradient(low = "red", high = "yellow") aesthetic: colorbar, legend, or none (no legend).
level = 0.95) x, y | ..se.., ..x.., ..y.., ..ymin.., ..ymax..
ggplot() + xlim(-5, 5) + stat_function(fun = dnorm,
o + scale_fill_gradient2(low = "red", high = "blue",
mid = "white", midpoint = 25) Themes n + theme(legend.position = "bottom")
Place legend at "bottom", "top", "left", or "right".
n = 20, geom = "point") x | ..x.., ..y.. n + scale_fill_discrete(name = "Title",
ggplot() + stat_qq(aes(sample = 1:100)) o + scale_fill_gradientn(colors = topo.colors(6)) r + theme_bw() r + theme_classic() labels = c("A", "B", "C", "D", "E"))
x, y, sample | ..sample.., ..theoretical.. Also: rainbow(), heat.colors(), terrain.colors(), White background Set legend title and labels with a scale function.
cm.colors(), RColorBrewer::brewer.pal() with grid lines. r + theme_light()
e + stat_sum() x, y, size | ..n.., ..prop..
e + stat_summary(fun.data = "mean_cl_boot")
h + stat_summary_bin(fun = "mean", geom = "bar")
SHAPE AND SIZE SCALES
r + theme_gray()
Grey background
r + theme_linedraw()
r + theme_minimal()
Zooming
p <- e + geom_point(aes(shape = fl, size = cyl)) (default theme). Minimal theme. Without clipping (preferred):
e + stat_identity() p + scale_shape() + scale_size() r + theme_dark() r + theme_void() t + coord_cartesian(xlim = c(0, 100), ylim = c(10, 20))
e + stat_unique() p + scale_shape_manual(values = c(3:7)) Dark for contrast. Empty theme.
With clipping (removes unseen data points):
r + theme() Customize aspects of the theme such
as axis, legend, panel, and facet properties. t + xlim(0, 100) + ylim(10, 20)
p + scale_radius(range = c(1,6))
p + scale_size_area(max_size = 6) r + ggtitle("Title") + theme(plot.title.position = "plot") t + scale_x_continuous(limits = c(0, 100)) +
r + theme(panel.background = element_rect(fill = "blue")) scale_y_continuous(limits = c(0, 100))
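The scale, theme, and label functions on this sheet all compose with `+` into one plot specification. A minimal sketch combining pieces shown above, using the built-in mpg data (column names fl and drv come from that dataset):

```r
library(ggplot2)

# Dodged bar chart of fuel type, colored by drive train,
# with a Brewer palette, axis labels, and a white-background theme.
p <- ggplot(mpg, aes(fl, fill = drv)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Blues") +
  labs(x = "Fuel type", y = "Count", title = "Fuel type by drive train") +
  theme_bw() +
  theme(legend.position = "bottom")
p  # printing the object renders the plot
```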
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at ggplot2.tidyverse.org • ggplot2 3.3.5 • Updated: 2021-08
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at tidyr.tidyverse.org • tibble 3.2.1 • tidyr 1.3.0 • Updated: 2023-05
Nested Data
A nested data frame stores individual tables as a list-column of data frames within a larger organizing data frame. List-columns can also be lists of vectors or lists of varying data types.
Use a nested data frame to:
• Preserve relationships between observations and subsets of data. Preserve the type of the variables being nested (factors and datetimes aren't coerced to character).
• Manipulate many sub-tables at once with purrr functions like map(), map2(), or pmap() or with dplyr rowwise() grouping.
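A minimal sketch of nesting and unnesting, using the built-in mtcars data (tidyr and dplyr assumed loaded):

```r
library(tidyr)
library(dplyr)

# Nest mtcars by cylinder count: each row of the 'data' list-column
# holds the sub-table for one cyl value.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  nest()

by_cyl                        # tibble with columns: cyl, data
unnest(by_cyl, cols = data)   # restores the original rows
```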
Data Science in Spark with sparklyr : : CHEAT SHEET
Intro Import Wrangle
Visualize
• Collect result, plot in R Communicate
• From R (copy_to()) • dplyr verb
sparklyr is an R interface for Apache Spark™.
• Read a file (spark_read_) • tidyr commands Collect results into R
It enables us to write all of our analysis code in R, Model
• Read Hive table (tbl()) • Feature transformer (ft_) share using RMarkdown
but have the actual processing happen inside • Spark MLlib (ml_)
• Direct Spark SQL (DBI)
Spark clusters. Easily manipulate and model • H2O Extension R for Data Science,
large-scale data using R and Spark via sparklyr. Grolemund & Wickham
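A typical session follows the workflow above: connect, import, wrangle with dplyr verbs, collect results into R. A sketch assuming a local Spark installation (the master and data are illustrative; it will not run without Spark installed):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")  # open a connection
mtcars_tbl <- copy_to(sc, mtcars)      # import data into Spark, not R

result <- mtcars_tbl %>%               # dplyr verbs translate to Spark SQL
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                            # collect results back into R

spark_disconnect(sc)
```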
Import Wrangle
(x - median) /
DPLYR VERBS ft_idf() - Compute the Inverse Document (p75 - p25) ft_robust_scaler() - Removes the median
Push Translates into Spark SQL statements Frequency (IDF) given a collection of and scales according to standard scale.
Compute
copy_to(sc, mtcars) %>%
documents.
mutate(trm = ifelse(am == 0,
𝞼=x 𝞼= 0 ft_standard_scaler() - Removes the mean
Import ft_imputer() - Imputation estimator for and scaling to unit variance using column
"auto", "man")) %>%
completing missing values, uses the mean summary statistics
Collect Source group_by(trm) %>%
or the median of the columns.
Results summarise_all(mean)
ft_stop_words_remover() - Filters out stop
no
Import data into Spark, not R 0 a ft_index_to_string() - Index labels back to words from input
TIDYR 1 c
1 c label as strings
READ A FILE INTO SPARK a 0 ft_string_indexer() - Column of labels into
pivot_longer() - Collapse c 1
Arguments that apply to all functions: X A 1 ft_interaction() - Takes in Double and a column of label indices.
X B 2 several columns into two. c 1
sc, name, path, options=list(), repartition=0, Y A 3 2,3 4,2 8,6 Vector columns and outputs a flattened
memory=TRUE, overwrite=TRUE Y B 4
A B pivot_wider() - Expand two vector of their feature interactions. a b
ft_tokenizer() - Converts to lowercase and
AB
X 1 2 then splits it by white spaces
CSV spark_read_csv( header = TRUE, Y 3 4 columns into several.
0 ft_max_abs_scaler() - Rescale each
columns=NULL, infer_schema=TRUE, -1 0 a 0,a ft_vector_assembler() - Combine vectors
delimiter = ",", quote= "\"", escape = "\\", nest() / unnest() - Convert groups of cells 1 feature individually to range [-1, 1] 1 a 1,a
into list-columns, and vice versa. 1 b 1,b into single row-vector
charset = "UTF-8", null_value = NULL)
ft_min_max_scaler() - Rescale each feature
1-4 2
0 a 0a unite() / separate() - Split a single column 1 to a common range [min, max] linearly 0 a 0,0 ft_vector_indexer() - Indexing categorical
JSON spark_read_json() 1 a 1a 4
1 a 1,0
1 b 1b into several columns, and vice versa. 1 b 1,1 feature columns in a dataset of Vector
PARQUET spark_read_parquet() ft_ngram() - Converts the input array of
NA fill() - Fill NA with the previous value
strings into an array of n-grams ft_vector_slicer() - Takes a feature vector
TEXT spark_read_text() NA 0,a a
and outputs a new feature vector with a
1,a a
1,b b
ORC spark_read_orc() ft_bucketed_random_projection_lsh() subarray of the original features
FEATURE TRANSFORMERS ft_minhash_lsh() - Locality Sensitive
LIBSVM spark_read_libsvm() boo ft_word2vec() - Word2Vec transforms a
Hashing functions for Euclidean distance too
DELTA spark_read_delta() 0 ft_binarizer() - Assigned values based on and Jaccard distance (MinHash) next word into a code
0
AVRO spark_read_avro() 1 threshold p=x p=2
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at spark.rstudio.com • sparklyr 1.7.5 • Updated: 2022-02
Data Science in Spark with sparklyr : : CHEAT SHEET
Modeling Sessions
REGRESSION CLUSTERING UTILITIES YARN CLIENT
ml_linear_regression() - Linear regression. ml_bisecting_kmeans() - A bisecting k-means ml_call_constructor() - Identifies the associated 1. Install RStudio Server on an edge node
algorithm based on the paper sparklyr ML constructor for the JVM 2. Locate the path to the cluster's Spark Home Directory; it
ml_aft_survival_regression() - Parametric normally is “/usr/lib/spark”
survival regression model named accelerated ml_lda() | ml_describe_topics() | ml_log_likelihood() ml_model_data() - Extracts data associated with a
3. Basic configuration example
failure time (AFT) model | ml_log_perplexity() | ml_topics_matrix() - LDA topic Spark ML model
conf <- spark_config()
model designed for text documents. ml_standardize_formula() - Generates a formula conf$spark.executor.memory <- "300M"
ml_generalized_linear_regression() - GLM conf$spark.executor.cores <- 2
ml_gaussian_mixture() - Expectation maximization string from user inputs, to be used in `ml_model`
conf$spark.executor.instances <- 3
ml_isotonic_regression() - Currently for multivariate Gaussian Mixture Models (GMMs) constructor conf$spark.dynamicAllocation.enabled<-"false"
implemented using parallelized pool adjacent ml_kmeans() | ml_compute_cost() ml_uid() - Extracts the UID of an ML object. 4. Open a connection
sc <- spark_connect(master = "yarn",
violators algorithm. Only univariate (single |ml_compute_silhouette_measure() - Clustering with spark_home = "/usr/lib/spark/",
feature) algorithm supported support for k-means ML Pipelines version = "2.1.0", config = conf)
ml_one_vs_rest() - Reduction of Multiclass Classification to Binary Classification. Performs ml_als() | ml_recommend() - Recommendation using ft_sql_transformer() - Creates a Pipeline step based LOCAL MODE
reduction using one against all strategy. Alternating Least Squares matrix factorization on the SQL statement passed to the command. No cluster required. Use for learning purposes only
TREE EVALUATION ft_dplyr_transformer() - Creates a Pipeline step 1. Install a local version of Spark: spark_install()
ml_decision_tree_classifier()|ml_decision_tree() ml_clustering_evaluator() - Evaluator for clustering based on one or several dplyr commands. 2. Open a connection
sc <- spark_connect(master="local")
|ml_decision_tree_regressor() - Classification ml_evaluate() - Compute performance metrics ft_dplyr_transformer() ml_linear_regression()
and regression using decision trees ml_pipeline() ml_fit() ml_save() KUBERNETES
ml_binary_classification_evaluator() |
ml_gbt_classifier()|ml_gradient_boosted_trees() ml_binary_classification_eval() | 1. Use the following to obtain the Host and Port
ft_bucketizer()
| ml_gbt_regressor() - Binary classification and system2("kubectl", "cluster-info")
ml_classification_eval() - A set of functions to
spark.rstudio.com/guides/pipelines 2. Open a connection
regression using gradient boosted trees calculate performance metrics for prediction models. sc <- spark_connect(config =
ml_random_forest_classifier() - Classification More Info spark_config_kubernetes(
FREQUENT PATTERN
and regression using random forests. "k8s://https://[HOST]>:[PORT]",
ml_fpgrowth() | ml_association_rules() | account = "default",
ml_feature_importances() | ml_freq_itemsets() - A parallel FP-growth algorithm image = "docker.io/owner/repo:version"))
ml_tree_feature_importance() - Feature to mine frequent itemsets.
Importance for Tree Models CLOUD
ml_freq_seq_patterns() | ml_prefixspan() - Databricks - spark_connect(method = "databricks")
PrefixSpan algorithm for mining frequent itemsets. Qubole- spark_connect(method = "qubole")
spark.rstudio.com therinspark.com
String manipulation with stringr : : CHEAT SHEET
The stringr package provides a set of internally consistent tools for working with character strings, i.e. sequences of characters surrounded by quotation marks.
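A few of the core verbs in action, on a small character vector (a sketch; the vector name is illustrative):

```r
library(stringr)

fruit <- c("apple", "banana", "pear")

str_detect(fruit, "an")       # does each string contain "an"?
str_count(fruit, "a")         # how many "a"s in each string?
str_replace(fruit, "a", "-")  # replace the first "a" in each string
```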
2 4 str_locate(string, pattern) Locate the str_match(string, pattern) Return the str_trim(string, side = c("both", "left", "right"))
4 7
NA NA
positions of pattern matches in a string. NA NA
first pattern match found in each string, as Trim whitespace from the start and/or end of
3 4 Also str_locate_all(). str_locate(fruit, "a") a matrix with a column for each ( ) group in a string. str_trim(str_pad(fruit, 17))
pattern. Also str_match_all().
0 str_count(string, pattern) Count the number str_match(sentences, "(a|the) ([^ +])") str_squish(string) Trim whitespace from each
3
1
of matches in a string. str_count(fruit, "a") end and collapse multiple spaces into single
2 spaces. str_squish(str_pad(fruit, 17, "both"))
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at stringr.tidyverse.org • Diagrams from @LVaudor on Twitter • stringr 1.5.0 • Updated: 2023-05
Need to Know Regular Expressions - Regular expressions, or regexps, are a concise language for
describing patterns in strings.
[:space:]
new line
Pattern arguments in stringr are interpreted as MATCH CHARACTERS see <- function(rx) str_view_all("abc ABC 123\t.!?\\(){}\n", rx)
regular expressions after any special characters [:blank:] .
have been parsed. string regexp matches example
(type this) (to mean this) (which matches this) space
In R, you write regular expressions as strings, a (etc.) a (etc.) see("a") abc ABC 123 .!?\(){} tab
sequences of characters surrounded by quotes \\. \. . see("\\.") abc ABC 123 .!?\(){}
("") or single quotes('').
\\! \! ! see("\\!") abc ABC 123 .!?\(){} [:graph:]
Some characters cannot be represented directly \\? \? ? see("\\?") abc ABC 123 .!?\(){}
in an R string . These must be represented as \\\\ \\ \ see("\\\\") abc ABC 123 .!?\(){} [:punct:] [:symbol:]
special characters, sequences of characters that \\( \( ( see("\\(") abc ABC 123 .!?\(){}
have a specific meaning, e.g. . , : ; ? ! / *@# | ` = + ^
\\) \) ) see("\\)") abc ABC 123 .!?\(){}
Special Character Represents \\{ \{ { see("\\{") abc ABC 123 .!?\(){} - _ " ' [ ] { } ( ) ~ < > $
\\ \ \\} \} } see( "\\}") abc ABC 123 .!?\(){}
\" " \\n \n new line (return) see("\\n") abc ABC 123 .!?\(){} [:alnum:]
\n new line \\t \t tab see("\\t") abc ABC 123 .!?\(){}
Run ?"'" to see a complete list \\s \s any whitespace (\S for non-whitespaces) see("\\s") abc ABC 123 .!?\(){} [:digit:]
\\d \d any digit (\D for non-digits) see("\\d") abc ABC 123 .!?\(){}
0 1 2 3 4 5 6 7 8 9
Because of this, whenever a \ appears in a regular \\w \w any word character (\W for non-word chars) see("\\w") abc ABC 123 .!?\(){}
expression, you must write it as \\ in the string \\b \b word boundaries see("\\b") abc ABC 123 .!?\(){}
that represents the regular expression. [:digit:]
digits see("[:digit:]") abc ABC 123 .!?\(){} [:alpha:]
Use writeLines() to see how R views your string [:alpha:] letters see("[:alpha:]") abc ABC 123 .!?\(){} [:lower:] [:upper:]
after all special characters have been parsed. [:lower:] lowercase letters see("[:lower:]") abc ABC 123 .!?\(){}
[:upper:]
uppercase letters see("[:upper:]") abc ABC 123 .!?\(){} a b c d e f A B C D E F
writeLines("\\.") 1
# \. [:alnum:] letters and numbers see("[:alnum:]") abc ABC 123 .!?\(){} g h i j k l GH I J K L
[:punct:] 1 punctuation see("[:punct:]") abc ABC 123 .!?\(){}
mn o p q r MNOPQR
writeLines("\\ is a backslash") [:graph:]1 letters, numbers, and punctuation see("[:graph:]") abc ABC 123 .!?\(){}
# \ is a backslash
[:space:]1 space characters (i.e. \s) see("[:space:]") abc ABC 123 .!?\(){} s t u v w x S T U VWX
[:blank:]1 space and tab (but not new line) see("[:blank:]") abc ABC 123 .!?\(){} y z Y Z
. every character except a new line see(".") abc ABC 123 .!?\(){}
INTERPRETATION 1 Many base R functions require classes to be wrapped in a second set of [ ], e.g. [[:digit:]]
Patterns in stringr are interpreted as regexs. To
change this default, wrap the pattern in one of:
ALTERNATES alt <- function(rx) str_view_all("abcde", rx) QUANTIFIERS quant <- function(rx) str_view_all(".a.aa.aaa", rx)
regex(pattern, ignore_case = FALSE, multiline = example example
regexp matches regexp matches
FALSE, comments = FALSE, dotall = FALSE, ...)
Modifies a regex to ignore cases, match end of ab|d or alt("ab|d") abcde a? zero or one quant("a?") .a.aa.aaa
lines as well as ends of strings, allow R comments [abe] one of alt("[abe]") abcde a* zero or more quant("a*") .a.aa.aaa
within regexes, and/or to have . match everything a+ one or more quant("a+") .a.aa.aaa
including \n. [^abe] anything but alt("[^abe]") abcde
str_detect("I", regex("i", TRUE)) [a-c] range alt("[a-c]") abcde 1 2 ... n a{n} exactly n quant("a{2}") .a.aa.aaa
1 2 ... n a{n, } n or more quant("a{2,}") .a.aa.aaa
fixed() Matches raw bytes but will miss some n ... m a{n, m} between n and m quant("a{2,4}") .a.aa.aaa
characters that can be represented in multiple ANCHORS anchor <- function(rx) str_view_all("aaa", rx)
ways (fast). str_detect("\u0130", fixed("i")) regexp matches example
^a start of string anchor("^a") aaa GROUPS ref <- function(rx) str_view_all("abbaab", rx)
coll() Matches raw bytes and will use locale
specific collation rules to recognize characters a$ end of string anchor("a$") aaa Use parentheses to set precedent (order of evaluation) and create groups
that can be represented in multiple ways (slow).
regexp matches example
str_detect("\u0130", coll("i", TRUE, locale = "tr"))
(ab|d)e sets precedence alt("(ab|d)e") abcde
LOOK AROUNDS look <- function(rx) str_view_all("bacad", rx)
boundary() Matches boundaries between
characters, line_breaks, sentences, or words. regexp matches example Use an escaped number to refer to and duplicate parentheses groups that occur
str_split(sentences, boundary("word")) a(?=c) followed by look("a(?=c)") bacad earlier in a pattern. Refer to each group by its order of appearance
a(?!c) not followed by look("a(?!c)") bacad string regexp matches example
(?<=b)a preceded by look("(?<=b)a") bacad (type this) (to mean this) (which matches this) (the result is the same as ref("abba"))
(?<!b)a not preceded by look("(?<!b)a") bacad \\1 \1 (etc.) first () group, etc. ref("(a)(b)\\2\\1") abbaab
2. Use the function below whose name replicates the order. Each component in place. d ## "2017-11-01" Valid units are second, minute, hour, day, week, month, bimonth,
accepts a tz argument to set the time zone, e.g. ymd(x, tz = "UTC"). quarter, season, halfyear and year.
ymd_hms(), ymd_hm(), ymd_h(). 2018-01-31 11:59:59 date(x) Date component. date(dt) rollback(dates, roll_to_first = FALSE, preserve_hms = TRUE) Roll back to
2017-11-28T14:02:00 ymd_hms("2017-11-28T14:02:00") last day of previous month. Also rollforward(). rollback(dt)
year(x) Year. year(dt)
2017-22-12 10:00:00
ydm_hms(), ydm_hm(), ydm_h().
ydm_hms("2017-22-12 10:00:00")
2018-01-31 11:59:59 isoyear(x) The ISO 8601 year.
epiyear(x) Epidemiological year. Stamp Date-times
mdy_hms(), mdy_hm(), mdy_h(). stamp() Derive a template from an example string and return a new
11/28/2017 1:02:03 2018-01-31 11:59:59 month(x, label, abbr) Month. function that will apply the template to date-times. Also
mdy_hms("11/28/2017 1:02:03") month(dt) stamp_date() and stamp_time().
dmy_hms(), dmy_hm(), dmy_h(). day(x) Day of month. day(dt) 1. Derive a template, create a function
1 Jan 2017 23:59:59 dmy_hms("1 Jan 2017 23:59:59") Tip: use a
2018-01-31 11:59:59 wday(x, label, abbr) Day of week. sf <- stamp("Created Sunday, Jan 17, 1999 3:34") date with
ymd(), ydm(). ymd(20170131) qday(x) Day of quarter. day > 12
20170131 2. Apply the template to dates
sf(ymd("2010-04-05"))
mdy(), myd(). mdy("July 4th, 2000") 2018-01-31 11:59:59 hour(x) Hour. hour(dt) ## [1] "Created Monday, Apr 05, 2010 00:00"
July 4th, 2000
dmy(), dym(). dmy("4th of July '99") 2018-01-31 11:59:59 minute(x) Minutes. minute(dt)
4th of July '99
2001: Q3 yq() Q for quarter. yq("2001: Q3") 2018-01-31 11:59:59 second(x) Seconds. second(dt) Time Zones
R recognizes ~600 time zones. Each encodes the time zone, Daylight
07-2020 my(), ym(). my("07-2020") 2018-01-31 11:59:59 UTC tz(x) Time zone. tz(dt) Savings Time, and historical calendar variations for an area. R assigns
one time zone per vector.
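The two tz operations are easy to mix up: with_tz() keeps the instant and changes the clock reading, force_tz() keeps the clock reading and changes the instant. A small sketch:

```r
library(lubridate)

dt <- ymd_hms("2018-01-31 11:59:59", tz = "UTC")

with_tz(dt, "US/Pacific")   # same instant, new clock time (03:59:59 PST)
force_tz(dt, "US/Pacific")  # same clock time, new instant (11:59:59 PST)
```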
2:01 hms::hms() Also lubridate::hms(), week(x) Week of the year. week(dt)
hm() and ms(), which return x
J F M A M J isoweek() ISO 8601 week. Use the UTC time zone to avoid Daylight Savings.
periods.* hms::hms(seconds = 0, J A S O N D epiweek() Epidemiological week.
minutes = 1, hours = 2) OlsonNames() Returns a list of valid time zone names. OlsonNames()
x
J F M A M J quarter(x) Quarter. quarter(dt) Sys.timezone() Gets current time zone.
J A S O N D
2017.5 date_decimal(decimal, tz = "UTC") 5:00 6:00
semester(x, with_year = FALSE)
date_decimal(2017.5)
x
J F M A M J Semester. semester(dt) 4:00 Mountain Central 7:00 with_tz(time, tzone = "") Get
the same date-time in a new
now(tzone = "") Current time in tz J A S O N D Pacific Eastern time zone (a new clock time).
(defaults to system tz). now() am(x) Is it in the am? am(dt) Also local_time(dt, tz, units).
pm(x) Is it in the pm? pm(dt) with_tz(dt, "US/Pacific")
today(tzone = "") Current date in a PT
MT
January
CT ET
xxxxx dst(x) Is it daylight savings? dst(d)
xxx tz (defaults to system tz). today()
force_tz(time, tzone = "") Get
fast_strptime() Faster strptime. leap_year(x) Is it a leap year? the same clock time in a new
leap_year(d) 7:00 7:00
fast_strptime("9/1/01", "%y/%m/%d") Pacific Eastern time zone (a new date-time).
Also force_tzs().
parse_date_time() Easier strptime. update(object, ..., simple = FALSE) 7:00 7:00 force_tz(dt, "US/Pacific")
parse_date_time("09-01-01", "ymd") update(dt, mday = 2, hour = 1) Mountain Central
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at lubridate.tidyverse.org • lubridate 1.9.2 • Updated: 2023-05
Math with Date-times — Lubridate provides three classes of timespans to facilitate math with dates and date-times.
Math with date-times relies on the timeline, Periods track changes in clock times, Durations track the passage of Intervals represent specific intervals Not all years
which behaves inconsistently. Consider how which ignore time line irregularities. physical time, which deviates from of the timeline, bounded by start and are 365 days
the timeline behaves during: clock time when irregularities occur. end date-times. due to leap days.
A normal day nor + minutes(90) nor + dminutes(90) interval(nor, nor + minutes(90)) Not all minutes
nor <- ymd_hms("2018-01-01 01:30:00",tz="US/Eastern") are 60 seconds due to
leap seconds.
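The period-vs-duration distinction is easiest to see across a Daylight Savings gap, where clock time and physical time disagree. A sketch (US/Eastern clocks spring forward at 2:00 on 2018-03-11):

```r
library(lubridate)

gap <- ymd_hms("2018-03-11 01:30:00", tz = "US/Eastern")

gap + hours(1)   # period: clock time 02:30 does not exist, so NA
gap + dhours(1)  # duration: 60 physical minutes later, 03:30 EDT
```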
forwards
Open in new Save Find and
backwards/ window replace
Compile as Run
notebook selected
code
Import data History of past
with wizard commands to
run/copy
Manage
external
View
memory
databases usage
R tutorials
Control
and more in Source Pane Turn on at Tools > Project Options > Git/SVN
A• Added M• Modified
Check Render Choose Configure Insert D• Deleted R• Renamed
?• Untracked
Package Development
Click next to line number to Highlighted line shows where
RStudio opens plots in a dedicated Plots pane RStudio opens documentation in a dedicated Help pane add/remove a breakpoint. execution has paused
Create a new package with
File > New Project > New Directory > R Package
Navigate Open in Export Delete Delete
Enable roxygen documentation with recent plots window plot plot all plots Home page of Search within Search for
Tools > Project Options > Build Tools helpful links help file help file
Roxygen guide at Help > Roxygen Quick Reference
See package information in the Build Tab Viewer pane displays HTML content, such as Shiny
apps, RMarkdown reports, and interactive visualizations
GUI Package manager lists every installed package
Install package Run devtools::load_all()
and restart R and reload changes
Stop Shiny Publish to shinyapps.io, Refresh
Install Update Browse app rpubs, RSConnect, … Run commands in Examine variables Select function
Packages Packages package site environment where in executing in traceback to
Clear output execution has paused environment debug
Run R CMD and rebuild
check View(<data>) opens spreadsheet like view of data set
Customize Run Click to load package with Package Delete
package build package library(). Unclick to detach version from
options tests package with detach(). installed library
Filter rows by value Sort by Search Step through Step into and Resume Quit debug
or value range values for value code one line out of functions execution mode
at a time to run
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at rstudio.com • Font Awesome 5.15.3 • RStudio IDE 1.4.1717 • Updated: 2021-07
Visual Editor
workspace, and working Start new R Session Close R Session
Choose Choose Insert Jump to Jump Run directory associated with a in current project in project
Check Render output output code previous to next selected Publish Show file project. It reloads each when
spelling output format location chunk chunk chunk lines to server outline you re-open a project.
T H J
Back to
Source Editor
Block (front page) Active shared
format collaborators
Name of
current
Lists and Links Citations Images File outline project
Insert blocks, Select
block
citations, Insert and Share Project R Version
quotes More
formatting equations, and edit tables with Collaborators
Clear special
formatting characters
Insert
verbatim
code
Run Remote Jobs
Run R on remote clusters
(Kubernetes/Slurm) via the
Job Launcher
Add/Edit
attributes Monitor Launch a job
launcher jobs
Run launcher
jobs remotely
Factors stored displayed Change the order of levels Change the value of levels
R represents categorical integer
1 1= a a 1= a
data with factors. A factor vector 3 23 == bc c 23 == bc a 1= a a 1= b fct_relevel(.f, ..., a er = 0L) a 1= a v 1= v
2= x
fct_recode(.f, ...) Manually change
is an integer vector with a 2 b c 2= b c 2= c Manually reorder factor levels. c 2= b
z levels. Also fct_relabel() which obeys
3= c 3= a fct_relevel(f, c("b", "c", "a")) 3= c 3= z purrr::map syntax to apply a function
levels attribute that stores levels 1 a b b b x
a set of mappings between or expression to each level.
a a a v fct_recode(f, v = "a", x = "b", z = "c")
integers and categorical values. When you view a factor, R fct_infreq(f, ordered = NA) Reorder
displays not the integers, but the levels associated with them. fct_relabel(f, ~ paste0("x", .x))
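The integer-plus-levels representation described above can be seen directly; a minimal sketch (the vector is illustrative):

```r
library(forcats)

f <- factor(c("a", "c", "b", "a"))

levels(f)            # the categorical values: "a" "b" "c"
as.integer(f)        # the stored integer codes: 1 3 2 1
fct_relevel(f, "c")  # move "c" to the front of the level order
```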
levels by the frequency
c 1= a c 1= c in which they appear in the
= c 2= c c 2= a data (highest frequency first).
a a 1= a Also fct_inseq(). a 1= a 2 1=2 fct_anon(f, prefix = "")
c c 2= b a a c 2= b 2=1 Anonymize levels with random
3= c
f3 <- factor(c("c", "c", "a")) 3= c 1 3=3
integers.
b b fct_infreq(f3) b 3 fct_anon(f)
a a a 2
b 1= a b 1= b fct_inorder(f, ordered = NA)
a 2= b a 2= a Reorder levels by order in which
they appear in the data. a 1= a x 1= x fct_collapse(.f, …, other_level = NULL)
a 1= a a
c 2= b c 2= c Collapse levels into manually defined
c 2= b b fct_inorder(f2) 3= c groups.
3= c c b x fct_collapse(f, x = c("a", "b"))
b a x
a a 1= a a 1= c fct_rev(f) Reverse level order.
2= b 2= b f4 <- factor(c("a","b","c"))
b 3= c b 3= a
c c fct_rev(f4) fct_lump_min(f, min, w = NULL,
Inspect Factors a
c
1= a
2= b
3= c
a
Other
1= a
2 = Other
other_level = "Other") Lumps together
factors that appear fewer than min
times. Also fct_lump_n(),
a 1= a f n fct_count(f, sort = FALSE, a 1= a a 1= c fct_shift(f) Shift levels to left or b Other
fct_lump_prop(), and
c 2= b
a 2 prop = FALSE) Count the 2= b 2= a right, wrapping around end. a a
3= c number of values with each b 3= c b 3= b
fct_lump_lowfreq().
b b 1 c c fct_shift(f4) fct_lump_min(f, min = 2)
level. fct_count(f)
a c 1
fct_match(f, lvls) Check for
lvls in f. fct_match(f, "a") a 1= a a 1= a fct_shuffle(f, n = 1L) Randomly a 1= a a 1= a fct_other(f, keep, drop, other_level =
2= b 2= c permute order of factor levels. 2= b 2= b "Other") Replace levels with "other."
a 1= a a 1= a fct_unique(f) Return the b 3= c b 3= b
c 3= c
Other
3 = Other
c c fct_shuffle(f4) fct_other(f, keep = c("a", "b"))
b 2= b
b 2= b unique values, removing b b
a duplicates. fct_unique(f) a a
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at forcats.tidyverse.org • Diagrams inspired by @LVaudor on Twitter • forcats 1.0.0 • Updated: 2023-05
Share Outputs render*() and *Output() functions work together to add R output to the UI. numericInput(inputId, label, value,
min, max, step, width)
Share your app in three ways: DT::renderDataTable(expr, options, dataTableOutput(outputId)
searchDelay, callback, escape, env, quoted, passwordInput(inputId, label, value,
1. Host it on shinyapps.io, a cloud based outputArgs) width, placeholder)
service from RStudio. To deploy Shiny apps:
renderImage(expr, env, quoted, deleteFile, imageOutput(outputId, width, height, radioButtons(inputId, label,
Create a free or professional outputArgs) click, dblclick, hover, brush, inline) choices, selected, inline, width,
account at shinyapps.io choiceNames, choiceValues)
renderPlot(expr, width, height, res, …, alt, env, plotOutput(outputId, width, height, click,
Click the Publish icon in RStudio IDE, or run: quoted, execOnResize, outputArgs) dblclick, hover, brush, inline) selectInput(inputId, label, choices,
rsconnect::deployApp("<path to directory>") selected, multiple, selectize, width, size)
Also selectizeInput()
renderPrint(expr, env, quoted, width, verbatimTextOutput(outputId,
2. Purchase RStudio Connect, a outputArgs) placeholder)
publishing platform for R and Python. sliderInput(inputId, label, min, max,
value, step, round, format, locale, ticks,
rstudio.com/products/connect/ renderTable(expr, striped, hover, bordered, tableOutput(outputId) animate, width, sep, pre, post,
spacing, width, align, rownames, colnames, timeFormat, timezone, dragRange)
digits, na, …, env, quoted, outputArgs)
3. Build your own Shiny Server
rstudio.com/products/shiny/shiny-server/ renderText(expr, env, quoted, outputArgs, sep) textOutput(outputId, container, inline) submitButton(text, icon, width)
(Prevent reactions for entire app)
renderUI(expr, env, quoted, outputArgs) uiOutput(outputId, inline, container, …)
htmlOutput(outputId, inline, container, …) textInput(inputId, label, value, width,
placeholder) Also textAreaInput()
These are the core output types. See htmlwidgets.org for many more options.
CC BY SA Posit Software, PBC • [email protected] • posit.co • Learn more at shiny.rstudio.com • shiny 1.7.4 • Updated: 2023-05
Reactivity

Reactive values work together with reactive functions. Call a reactive value from within the arguments of one of these functions to avoid the error: Operation not allowed without an active reactive context.

CREATE YOUR OWN REACTIVE VALUES

*Input() functions (see front page) — Each input function creates a reactive value stored as input$<inputId>.

    # *Input() example
    ui <- fluidPage(
      textInput("a", "", "A")
    )

reactiveValues(…) — Creates a list of reactive values whose values you can set.

    # reactiveValues example
    server <- function(input, output){
      rv <- reactiveValues()
      rv$number <- 5
    }

RENDER REACTIVE OUTPUT

render*() functions (see front page) — Build an object to display. Will rerun the code in the body to rebuild the object whenever a reactive value in the code changes. Save the results to output$<outputId>.

    library(shiny)
    ui <- fluidPage(
      textInput("a", "", "A"),
      textOutput("b")
    )
    server <- function(input, output){
      output$b <- renderText({
        input$a
      })
    }
    shinyApp(ui, server)

CREATE REACTIVE EXPRESSIONS

reactive(x, env, quoted, label, domain) — Reactive expressions:
• cache their value to reduce computation
• can be called elsewhere
• notify dependencies when invalidated
Call the expression with function syntax, e.g. re().

    library(shiny)
    ui <- fluidPage(
      textInput("a", "", "A"),
      textInput("z", "", "Z"),
      textOutput("b"))
    server <- function(input, output){
      re <- reactive({
        paste(input$a, input$z)})
      output$b <- renderText({
        re()
      })
    }
    shinyApp(ui, server)

REACT BASED ON EVENT

eventReactive(eventExpr, valueExpr, event.env, event.quoted, value.env, value.quoted, ..., label, domain, ignoreNULL, ignoreInit) — Creates a reactive expression with code in the 2nd argument that only invalidates when reactive values in the 1st argument change.

    library(shiny)
    ui <- fluidPage(
      textInput("a", "", "A"),
      actionButton("go", "Go"),
      textOutput("b")
    )
    server <- function(input, output){
      re <- eventReactive(
        input$go, {input$a})
      output$b <- renderText({
        re()
      })
    }
    shinyApp(ui, server)

PERFORM SIDE EFFECTS

observeEvent(eventExpr, handlerExpr, event.env, event.quoted, handler.env, handler.quoted, ..., label, suspended, priority, domain, autoDestroy, ignoreNULL, ignoreInit, once) — Runs code in the 2nd argument when reactive values in the 1st argument change. See observe() for an alternative.

    library(shiny)
    ui <- fluidPage(
      textInput("a", "", "A"),
      actionButton("go", "Go")
    )
    server <- function(input, output){
      observeEvent(input$go, {
        print(input$a)
      })
    }
    shinyApp(ui, server)

REMOVE REACTIVITY

isolate(expr) — Runs a code block. Returns a non-reactive copy of the results.

    library(shiny)
    ui <- fluidPage(
      textInput("a", "", "A"),
      textOutput("b")
    )
    server <- function(input, output){
      output$b <- renderText({
        isolate({input$a})
      })
    }
    shinyApp(ui, server)

UI

An app's UI is an HTML document. Use Shiny's functions to assemble this HTML with R.

    fluidPage(
      textInput("a", "")
    )

Returns HTML:

    ## <div class="container-fluid">
    ##   <div class="form-group shiny-input-container">
    ##     <label for="a"></label>
    ##     <input id="a" type="text"
    ##       class="form-control" value=""/>
    ##   </div>
    ## </div>

Add static HTML elements with tags, a list of functions that parallel common HTML tags, e.g. tags$a(). Unnamed arguments will be passed into the tag; named arguments will become tag attributes. Run names(tags) for a complete list.

    tags$h1("Header") -> <h1>Header</h1>

The most common tags have wrapper functions; you do not need to prefix their names with tags$:

    ui <- fluidPage(
      h1("Header 1"),
      hr(),
      br(),
      p(strong("bold")),
      p(em("italic")),
      p(code("code")),
      a(href="", "link"),
      HTML("<p>Raw html</p>")
    )

To include a CSS file, use includeCSS(), or:
1. Place the file in the www subdirectory
2. Link to it with:

    tags$head(tags$link(rel = "stylesheet",
      type = "text/css", href = "<file name>"))

To include JavaScript, use includeScript(), or:
1. Place the file in the www subdirectory
2. Link to it with:

    tags$head(tags$script(src = "<file name>"))

IMAGES — To include an image:
1. Place the file in the www subdirectory
2. Link to it with img(src = "<file name>")

Layouts

Combine multiple elements into a "single element" that has its own properties with a panel function, e.g.

    wellPanel(
      dateInput("a", ""),
      submitButton()
    )

Panel functions: absolutePanel(), conditionalPanel(), fixedPanel(), headerPanel(), inputPanel(), mainPanel(), navlistPanel(), sidebarPanel(), tabPanel(), tabsetPanel(), titlePanel(), wellPanel()

Organize panels and elements into a layout with a layout function. Add elements as arguments of the layout functions.

sidebarLayout() — a side panel next to a main panel:

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(),
        mainPanel()
      )
    )

fluidRow() — rows of columns:

    ui <- fluidPage(
      fluidRow(column(width = 4),
               column(width = 2, offset = 3)),
      fluidRow(column(width = 12))
    )

Also flowLayout(), splitLayout(), verticalLayout(), fixedPage(), and fixedRow().

Layer tabPanels on top of each other, and navigate between them, with:

    ui <- fluidPage( tabsetPanel(
      tabPanel("tab 1", "contents"),
      tabPanel("tab 2", "contents"),
      tabPanel("tab 3", "contents")))

    ui <- fluidPage( navlistPanel(
      tabPanel("tab 1", "contents"),
      tabPanel("tab 2", "contents"),
      tabPanel("tab 3", "contents")))

    ui <- navbarPage(title = "Page",
      tabPanel("tab 1", "contents"),
      tabPanel("tab 2", "contents"),
      tabPanel("tab 3", "contents"))

Themes

Use the bslib package to add existing themes to your Shiny app ui, or make your own.

    library(bslib)
    ui <- fluidPage(
      theme = bs_theme(
        bootswatch = "darkly",
        ...
      )
    )

Build your own theme by customizing individual arguments:

    bs_theme(bg = "#558AC5",
             fg = "#F9B02D", ...)

?bs_theme for a full list of arguments.
bootswatch_themes() — Get a list of themes.
bs_themer() — Place within the server function to use the interactive theming widget.
Informative and Elegant Survival Curves with survminer

SURVIVAL CURVES

The ggsurvplot() function creates ggplot2 plots from survfit objects.

    library("survival")
    fit <- survfit(Surv(time, status) ~ sex, data = lung)
    class(fit)
    ## [1] "survfit"

    library("survminer")
    ggsurvplot(fit, data = lung)

[Figure: Kaplan-Meier curves of survival probability vs. time for strata sex=1 and sex=2.]

Use the fun argument to set the transformation of the survival curve, e.g. "event" for cumulative events, "cumhaz" for the cumulative hazard function, or "pct" for survival probability in percentage.

    ggsurvplot(fit, data = lung, fun = "event")
    ggsurvplot(fit, data = lung, fun = "cumhaz")

[Figure: cumulative event and cumulative hazard curves for strata sex=1 and sex=2.]

With further parameters you have full control over the plot:

    ggsurvplot(fit, data = lung,
               pval = TRUE,
               fun = "pct",
               size = 1,
               linetype = "strata",
               palette = c("#E7B800", …),
               legend.labs = c("Male", "Female"))

[Figure: customized survival plot in percent (p = 0.0013) with a number-at-risk table, e.g. Female: 90, 53, 21, 3, 0 cases over time.]

TESTING THE PROPORTIONAL HAZARDS ASSUMPTION

The function cox.zph() from the survival package may be used to test the proportional hazards assumption for a Cox regression model fit. Graphical verification of this assumption may be performed with the function ggcoxzph() from the survminer package. For each covariate it produces plots of scaled Schoenfeld residuals against time.

    library("survival")
    fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
    ftest <- cox.zph(fit)
    ftest
    ##             rho chisq     p
    ## sex      0.1236 2.452 0.117
    ## age     -0.0275 0.129 0.719
    ## GLOBAL       NA 2.651 0.266

    library("survminer")
    ggcoxzph(ftest)

[Figure: scaled Schoenfeld residuals vs. time; Global Schoenfeld Test p: 0.2656; individual tests p: 0.1174 (sex) and 0.7192 (age).]

FOREST PLOT

The function ggforest() from the survminer package creates a forest plot for a Cox regression model fit. Hazard ratio estimates along with confidence intervals and p-values are plotted for each variable.

    library("survival")
    library("survminer")
    lung$age <- ifelse(lung$age > 70, ">70", "<= 70")
    fit <- coxph(Surv(time, status) ~ sex + ph.ecog + age, data = lung)
    fit
    ## Call:
    ## coxph(formula = Surv(time, status) ~ sex + ph.ecog + age, data = lung)
    ##
    ##            coef exp(coef) se(coef)     z       p
    ## sex      -0.567     0.567    0.168 -3.37 0.00075
    ## ph.ecog   0.470     1.600    0.113  4.16 3.1e-05
    ## age>70    0.307     1.359    0.187  1.64 0.10175
    ##
    ## Likelihood ratio test=31.6
    ## n= 227, number of events= 164

    ggforest(fit)

[Figure: forest plot of hazard ratios with confidence intervals, N=228: sex 0.57 (0.41-0.79, p<0.001), ph.ecog 1.60 (1.28-2.00, p<0.001), age>70 (0.94-1.96); n.events: 164, p.value.log: 6.4e-07, AIC: 1463.37, concordance: 0.64.]

DIAGNOSTICS

The function ggcoxdiagnostics() plots different types of residuals as a function of time, linear predictor or observation id. The type of residual is selected with the type argument. Possible values are "martingale", "deviance", "score", "schoenfeld", "dfbeta", "dfbetas", and "scaledsch". The ox.scale argument defines what shall be plotted on the OX axis. Possible values are "linear.predictions", "observation.id", "time". The logical arguments hline and sline may be used to add a horizontal line or a smooth line to the plot.

    library("survival")
    fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
    ggcoxdiagnostics(fit, type = "schoenfeld",
                     ox.scale = "linear.predictions")

[Figure: Schoenfeld residuals plotted against the linear predictions.]

ADJUSTED SURVIVAL CURVES

The function ggcoxadjustedcurves() from the survminer package (renamed ggadjustedcurves() in later versions) plots Adjusted Survival Curves for a Cox Proportional Hazards Model. Adjusted Survival Curves show how a selected factor influences survival estimated from a Cox model. Note that these curves differ from Kaplan-Meier estimates since they present expected survival based on a given Cox model.

    library("survival")
    library("survminer")
    lung$sex <- ifelse(lung$sex == 1, "Male", "Female")
    fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
    ggcoxadjustedcurves(fit, data = lung)

    # stratified variant
    fit2 <- coxph(Surv(time, status) ~ strata(sex) + age, data = lung)
    ggcoxadjustedcurves(fit2, data = lung)

[Figure: adjusted survival curves vs. time for the variable sex (Male, Female).]
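The recode-then-model pattern above (ifelse() to dichotomize, then coxph()) can be checked quickly with the survival package alone, before any plotting. The group labels and cut point here mirror the example but are otherwise illustrative:

```r
library(survival)

# Recode age into two illustrative groups, as in the ggforest() example,
# working on a copy so the built-in lung data is left untouched.
lung2 <- lung
lung2$age_grp <- ifelse(lung2$age > 70, ">70", "<=70")

# One coefficient per term / non-reference level.
fit <- coxph(Surv(time, status) ~ sex + ph.ecog + age_grp, data = lung2)
coef(fit)       # log hazard ratios
exp(coef(fit))  # hazard ratios, the quantity ggforest() displays
```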
This onepager presents the survminer package [Alboukadel Kassambara, Marcin Kosinski 2017] in version 0.4.0.999. CC BY Przemysław Biecek, http://github.com/pbiecek (https://creativecommons.org/licenses/by/4.0/).
See https://github.com/kassambara/survminer/ for more details.
rmarkdown :: CHEAT SHEET
RStudio® is a trademark of RStudio, PBC • CC BY SA RStudio • [email protected] • 844-448-1212 • rstudio.com • Learn more at rmarkdown.rstudio.com • rmarkdown 2.9.4 • Updated: 2021-08
Use the document's YAML header to set an output format and customize it with output options.

    ---
    title: "My Document"
    author: "Author Name"
    output:
      html_document:
        toc: TRUE
    ---

Indent the format 2 characters; indent its options 4 characters.

OUTPUT FORMAT           | CREATES
html_document           | .html
pdf_document*           | .pdf
word_document           | Microsoft Word (.docx)
powerpoint_presentation | Microsoft PowerPoint (.pptx)
odt_document            | OpenDocument Text
rtf_document            | Rich Text Format
md_document             | Markdown
github_document         | Markdown for GitHub
ioslides_presentation   | ioslides HTML slides
slidy_presentation      | Slidy HTML slides
beamer_presentation*    | Beamer slides

* Requires LaTeX; use tinytex::install_tinytex()
Also see flexdashboard, bookdown, distill, and blogdown.
Use ?<output format> to see all of a format's options, e.g. ?html_document

IMPORTANT OPTIONS (applicable formats in brackets):
anchor_sections — Show section anchors on mouse hover (TRUE or FALSE) [HTML]
citation_package — The LaTeX package to process citations ("default", "natbib", "biblatex") [PDF]
code_download — Give readers an option to download the .Rmd source code (TRUE or FALSE) [HTML]
code_folding — Let readers toggle the display of R code ("none", "hide", or "show") [HTML]
css — CSS or SCSS file to use to style document (e.g. "style.css") [HTML]
dev — Graphics device to use for figure output (e.g. "png", "pdf") [HTML, PDF]
df_print — Method for printing data frames ("default", "kable", "tibble", "paged") [HTML, PDF, Word, PPT]
fig_caption — Should figures be rendered with captions (TRUE or FALSE) [HTML, PDF, Word, PPT]
highlight — Syntax highlighting ("tango", "pygments", "kate", "zenburn", "textmate") [HTML, PDF, Word]
includes — File of content to place in doc ("in_header", "before_body", "after_body") [HTML, PDF]
keep_md — Keep the Markdown .md file generated by knitting (TRUE or FALSE) [HTML, PDF, Word, PPT]
keep_tex — Keep the intermediate TEX file used to convert to PDF (TRUE or FALSE) [PDF]
latex_engine — LaTeX engine for producing PDF output ("pdflatex", "xelatex", or "lualatex") [PDF]
reference_docx/_doc — docx/pptx file containing styles to copy in the output (e.g. "file.docx", "file.pptx") [Word, PPT]
theme — Theme options (see Bootswatch and Custom Themes below) [HTML]
toc — Add a table of contents at start of document (TRUE or FALSE) [HTML, PDF, Word, PPT]
toc_depth — The lowest level of headings to add to table of contents (e.g. 2, 3) [HTML, PDF, Word, PPT]
toc_float — Float the table of contents to the left of the main document content (TRUE or FALSE) [HTML]

RENDER
When you render a document, rmarkdown:
1. Runs the code and embeds results and text into an .md file with knitr.
2. Converts the .md file into the output format with Pandoc.

    .Rmd --knitr--> .md --pandoc--> HTML / PDF / DOC

Save, then Knit to preview the document output. The resulting HTML/PDF/MS Word/etc. document will be created and saved in the same directory as the .Rmd file. Use rmarkdown::render() to render/knit in the R console. See ?render for available options.

SHARE
Publish on RStudio Connect to share R Markdown documents securely, schedule automatic updates, and interact with parameters in real time. rstudio.com/products/connect/
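As a sketch combining several of the options above in one YAML header (the option values chosen here are illustrative, not required):

```yaml
---
title: "My Document"
author: "Author Name"
output:
  html_document:
    toc: TRUE
    toc_depth: 2
    toc_float: TRUE
    code_folding: "hide"
    df_print: "paged"
---
```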
Data Visualization in R: ggvis Cheat Sheet
by shanly3011 via cheatography.com/20988/cs/3865/

The 4 essential components are:
1. Data
2. Coordinate system
3. Marks
4. Properties

We use the piping operator '%>%' for our syntaxes.

Syntax example:

    faithful %>%
      ggvis(~waiting, ~eruptions) %>%
      layer_points() %>%
      add_axis("x", title = "Waiting period",
               values = c(1,2,3,4,5,6,7), subdivide = 9)

The properties for points are: fill, x, y, stroke, strokeWidth, strokeOpacity, opacity, fillOpacity, shape, size.

Sample code:

    faithful %>%
      ggvis(~waiting, ~eruptions,
            fillOpacity = ~eruptions, size := 100,
            fill := "red", stroke := "red",
            shape := "cross") %>%
      layer_points()

Mapping vs. Setting properties
Properties for lines — the properties for lines include …

Syntax: compute_smooth()
It returns a dataset with 2 variables, one named pred_ and the other resp_.

    Long way: mtcars %>% compute_smooth(mpg ~ wt) %>%
                ggvis(~pred_, ~resp_) %>% layer_lines()
    In-built: mtcars %>% ggvis(~wt, ~mpg) %>% layer_smooths()

Syntax: compute_bin()
It returns a dataset with 4 variables: x, x2, y, y2.

    Long way: faithful %>% compute_bin(~waiting) %>% …

Similarly, we have compute_count() or the in-built function layer_bars().

Syntax: compute_density()

    Long way: faithful %>% compute_density(~waiting) %>%
                ggvis(~pred_, ~resp_) %>% layer_lines()
    In-built: faithful %>% ggvis(~waiting, fill := "green") %>%
                layer_densities()
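The "long way"/"in-built" pairs above perform the same computation. As a rough base-R sketch of what compute_smooth(mpg ~ wt) produces (no ggvis required; the grid size of 80 is an assumption for illustration), a loess fit evaluated on a grid of wt values yields a pred_/resp_ data frame:

```r
# Approximate ggvis::compute_smooth(mpg ~ wt) with base R: a loess fit
# evaluated on an evenly spaced grid, returned as pred_/resp_ columns.
fit <- loess(mpg ~ wt, data = mtcars)
pred_ <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 80)
resp_ <- predict(fit, newdata = data.frame(wt = pred_))
smooth_df <- data.frame(pred_ = pred_, resp_ = resp_)
head(smooth_df)
```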
R color cheatsheet

This is for all of you who don't know anything about color theory, and don't care but want some nice colors on your map or figure….NOW!

R uses hexadecimal to represent colors
Hexadecimal is a base-16 number system used to describe color. Red, green, and blue are each represented by two characters (#rrggbb). Each character has 16 possible symbols: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F.
"00" can be interpreted as 0.0 and "FF" as 1.0, i.e., red = #FF0000, black = #000000, white = #FFFFFF.
Two additional characters (with the same scale) can be added to the end to describe transparency (#rrggbbaa).

R has 657 built-in color names
To see a list of names: colors(). Example: peachpuff4. These colors are displayed on P. 3.

R translates various color models to hex, e.g.:
• RGB (red, green, blue): The default intensity scale in R ranges from 0-1; another commonly used scale is 0-255, obtained in R using maxColorValue=255. alpha is an optional argument for transparency, with the same intensity scale.
    rgb(r, g, b, maxColorValue=255, alpha=255)
• HSV (hue, saturation, value): values range from 0-1, with an optional alpha argument.
    hsv(h, s, v, alpha)
• HCL (hue, chroma, luminance): hue describes the color and ranges from 0-360; 0 = red, 120 = green, 240 = blue, etc. The ranges of chroma and luminance depend on hue and on each other.
    hcl(h, c, l, alpha)

A few notes on HSV/HCL
HSV is a better model for how humans perceive color. HCL can be thought of as a perceptually based version of the HSV model….blah blah blah…
## My Recommendation ##
Without delving into color theory: color schemes based on HSV/HCL models generally just look good.

R Color Palettes

Finding a good color scheme for presenting data can be challenging. This color cheatsheet will help!
TIP: When it comes to selecting a color palette, DO NOT try to handpick individual colors! You will waste a lot of time and the result will probably not be all that great. R has some good packages for color palettes. Here are some of the options.

Packages: grDevices and colorRamps
grDevices comes with the base installation; colorRamps must be installed. Each palette's function has an argument for the number of colors and transparency (alpha):
    heat.colors(4, alpha=1)
    > "#FF0000FF" "#FF8000FF" "#FFFF00FF" "#FFFF80FF"
grDevices palettes: cm.colors, topo.colors, terrain.colors, heat.colors, rainbow (see P. 4 for colorRamps options).
For the rainbow palette you can also select the start/end color (red = 0, yellow = 1/6, green = 2/6, cyan = 3/6, blue = 4/6 and magenta = 5/6) and saturation (s) and value (v):
    rainbow(n, s = 1, v = 1, start = 0, end = max(1, n - 1)/n, alpha = 1)

Package: RColorBrewer
brewer.pal() has an argument for the number of colors and the color palette (see P. 4 for options):
    brewer.pal(4, "Set3")
    > "#8DD3C7" "#FFFFB3" "#BEBADA" "#FB8072"
To view colorbrewer palettes in R: display.brewer.all(5)
There is also a very nice interactive viewer: http://colorbrewer2.org/

Package: colorspace
These color palettes are based on the HCL and HSV color models. The results can be very aesthetically pleasing. There are some default palettes: diverge_hcl, diverge_hsl, terrain_hcl, sequential_hcl, rainbow_hcl.
    rainbow_hcl(4)
    > "#E495A5" "#ABB065" "#39BEB1" "#ACA4E2"
Select the # of colors in the palette. Save a palette for future R sessions as:
• a txt file with hex codes
• a .R file with a function describing how to generate the palette. source() can be used to import the function into R; one complication is that you have to open the .R file and name the function to use it.
• values copied into the relevant colorspace functions:
    Diverging color schemes:
      diverge_hcl(7, h = c(260, 0), c = 100, l = c(28, 90), power = 1.5)
    Sequential color schemes:
      sequential_hcl(n, h, c. = c(), l = c(), power)
    Qualitative color schemes:
      rainbow_hcl(n, c, l, start, end)
      (for qualitative schemes, start/end refer to the H1/H2 hue values)
Display the color scheme with different plot types.

Option 2
If you want to control which colors are associated with the levels of a variable, I find it easiest to create a variable in the data:

    iris$color <- factor(iris$Species,
      levels=c("virginica", "versicolor", "setosa"),
      labels=rainbow_hcl(3))
    plot(Sepal.Length ~ Sepal.Width,
      col=as.character(color), pch=16, data=iris)

Continuous variables

Option 1
Break into categories and assign colors:

    iris2 <- subset(iris, Species=="setosa")
    color <- cut(iris2$Petal.Length,
      breaks=c(0,1.3,1.5,2), labels=sequential_hcl(3))

Or, break by quantiles (be sure to include 0 & 1; five breakpoints make four intervals, so four labels are needed):

    color <- cut(iris2$Petal.Length,
      breaks=quantile(iris2$Petal.Length, c(0, 0.25, 0.5, 0.75, 1)),
      labels=sequential_hcl(4), include.lowest=TRUE)
    plot(Sepal.Width ~ Sepal.Length, pch=16,
      col=as.character(color), data=iris2)

Option 2
Fully continuous gradient:

    data <- data.frame("a"=runif(10000), "b"=runif(10000))
    color <- diverge_hcl(length(data$a))[rank(data$a)]
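The cut()-based recipes above can be sanity-checked on a small made-up vector (break points, labels, and hex values here are illustrative, not from the original):

```r
# Bin a toy numeric vector with explicit breaks, then map each bin to a
# fixed color; same pattern as the iris examples, minus colorspace.
x <- c(0.2, 1.1, 1.4, 1.9, 0.9, 1.6)
bins <- cut(x, breaks = c(0, 1.3, 1.5, 2),
            labels = c("low", "mid", "high"))
pal <- c(low = "#E495A5", mid = "#ABB065", high = "#39BEB1")
cols <- unname(pal[as.character(bins)])
table(bins)
```

Passing the factor itself to col= would use its integer codes, which is why the examples above convert with as.character() first.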
Page 2, Melanie Frazier
Useful Resources:
A larger color chart of R named colors:
http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.pdf
http://students.washington.edu/mclarkso/documents/colors Ver2.pdf