
Introduction to R

Kevyn Stefanelli

Financial Econometrics with R 2021-2022

General lines
This is a guide to using the software R for the “Financial Econometrics with R” course.
The R commands are in the grey areas of the file, while the explanations and useful notes are outside them. A separate R script containing only the code with comments is also available. You can copy and paste the commands from the R script to the R console, or type them directly there. Copy-pasting is fast; typing is the best way to learn.
Important: Copy-pasting from Word or pdf documents sometimes produces errors in R. Pay attention
when you copy "", ”, -, and other special characters. You can still copy and paste into R and then just replace these
characters.
The guide is divided into sections, each of which covers a specific topic and the related
exercises. The exercises have different levels of difficulty: (∗) “easy”, (∗∗) “intermediate”, (∗ ∗ ∗) “advanced”.
In order to pass the exam you must be able to complete all the (∗) “easy” exercises. The (∗∗) “intermediate”
and the (∗ ∗ ∗) “advanced” exercises will progressively enhance your final grade.
You must bring your personal computer to class to do live exercises.
The exam consists of two tests:
1. Short Multiple Choice exam (30 mins): you will have 10 questions to be completed in 30mins.
Your questions are chosen randomly from a set of questions of the same difficulty. There will be a time
window open for 24 hours during which you can start and complete the test. This test accounts for
40% of the final grade for the Introduction to R course. You will be asked to upload the script you
used to make calculations.
2. Long Multiple Choice exam (4 hours): you will be provided with a dataset and a list of questions
to be completed. Your questions are chosen randomly from a set of questions of the same difficulty.
Again, you will be asked to provide the entire R script used to answer these questions. This test
accounts for 60% of the final grade of the Introduction to R evaluation.
Further details on exam dates and procedures will be provided soon. In both tests you can use all the material
provided, but you must complete them on your own.

How to download R
You can download R from the Comprehensive R Archive Network (CRAN) following this link:
https://cran.r-project.org/
You can find a video tutorial on the Blackboard page of the course on how to download R. Alternatively,
you can follow the instructions here to download R for your specific operating system (Windows, macOS, or
Linux):

macOS

The download will start automatically.

Windows

The download will start automatically.

Linux

Choose your option and follow the instructions.

Introduction to R
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It
is open source software, which means that it is totally free to use and is developed/updated/optimized by the R
community. Once we run R we will see a new window with the symbol “>”, which means that R
is ready to receive instructions.

Basic commands
R is a calculator:
# R basic operations
6+4
3-2
2*4
9/2
1.2^2

However, it is better to work in an R script instead of in the R command window. A script is a text file where
you can save your code and then analyze/re-execute/modify it later. To open a new script in R:
• Windows: File –> New Script
• Mac: File –> New Rd Document/New Document
Once you have written one or more lines of code in an R script, to execute them in the command window:
1. Select the line(s).
2. Press
• on Windows: CTRL + R
• on Mac: cmd + Enter
You can also add lines that will not be compiled (executed) by R. You need just to add “#” at the beginning
of the line(s). These lines are usually called “comments”.
ADVICE FOR YOUR CAREER (and for any other course you will take): always write as many comments as
possible in your code. They will be extremely useful when you use the code again in the future and
have forgotten why/how you wrote it!
When you finish working with R you can save your script on your computer to open it the next time.

R Objects
R is an object oriented language, which means that we can store “things”, values, in objects. For example, we
can assign to the letter “x” the value 5 and, from now on, when we say “x” R responds 5. All these objects
(or variables) are contained in a space called Global Environment.
In a variable we can store different types of values. Among others:
1. numeric (integer or double): 1,2,3,4,0.5,-2.3;
2. character (strings): "Kevyn", "Rome", "Nice";
3. boolean: TRUE (or just T), FALSE (or just F).
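
To check which type R has assigned to a value we can use the function class(). The following is a minimal sketch; the variable names are only illustrative.

```r
# one value for each basic type (illustrative names)
n_int <- 3L        # integer (the L suffix forces an integer)
n_dbl <- 0.5       # double
txt   <- "Rome"    # character
flag  <- TRUE      # boolean (R calls this type "logical")

# class() returns the type as a string
class(n_int)       # "integer"
class(n_dbl)       # "numeric"
class(txt)         # "character"
class(flag)        # "logical"
```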

How do we define (assign a value to) a variable?
# assign to x the value 2 (to comment the script start a line with #)
x = 2
# assign to y the value -1.4
y <- -1.4
# alternatively
4 -> z
# print (visualize) them
x
y
z
# define a as a character variable
a='Hey! How are you?'
a

There are some rules in defining variables:


1. A variable name cannot start with a digit (z1 = 4 ok!, 1z=4 no!!!).
2. A variable name cannot contain symbols such as !"£\$%&/()=’?ˆ*§°#@ (the dot “.” and the underscore “_” are allowed).
3. R is case sensitive (“a” is different from “A”).
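
The rules can be tried directly in R. This is a small sketch with illustrative names; parse() and tryCatch() are base R functions, used here only to test an invalid name without stopping the script.

```r
# valid names: letters, digits (not in first position), "." and "_"
z1 <- 4
my_var <- 10
my.var <- 20

# case sensitivity: "a" and "A" are two different objects
a <- 1
A <- 2
a == A   # FALSE

# a name starting with a digit is a syntax error:
# parse() reads the text as code, tryCatch() intercepts the error
bad_name <- tryCatch(parse(text = "1z = 4"),
                     error = function(e) "syntax error")
bad_name
```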

Let’s see what our environment contains now:


# display the names of the objects which are currently stored within R
ls()
# create another variable and store it
italy = 0.1 ; france = 3.9 ; alps = italy*france # ";" separates commands written on the same line
# repeat ls() and see the update
ls()

And we can also remove some (all) of them. To remove objects the function rm is available:
# remove x
rm("x")
# remove everything
rm(list = ls())

We can also update the value of an object:


# define a variable called K
K = "What time is it?"
K
# now we update K and define it as a numeric variable
K = 5
K

Vectors in R
We can define a vector of values and store it into a variable. A vector is just a collection of values stored
in the same object. Vectors in R are introduced by “c()” and their elements are separated by a comma as
follows:
# define a vector containing five numbers
myfirstvector = c(0,2.4,1.1,-1.3,-0.6)
myfirstvector
# define a vector containing three characters
ID = c("Kevyn", "Stefanelli","Rome")
ID

Note: If a vector contains at least one character element, then the whole vector takes the character type.

Furthermore, we can define vectors as a sequence of consecutive integer values as:
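
A short sketch of this behaviour, with an illustrative vector. The function as.numeric() is a base R function not used elsewhere in this guide; it is shown only as one possible way to go back to numbers.

```r
# two numbers and one string in the same vector
mixed <- c(1, 2.5, "Rome")
mixed          # the numbers are now stored as the strings "1" and "2.5"
class(mixed)   # "character"

# as.numeric() converts back; text that is not a number becomes NA
# (suppressWarnings() hides the warning R gives in that case)
nums <- suppressWarnings(as.numeric(mixed))
nums
```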


# define a vector containing number 1 to 10
onetoten = c(1:10)
onetoten
# or just
onetoten = 1:10 # there is no need to write c()
# for reverted sequences:
tentoone = 10:1
tentoone

To define the step of a sequence (the distance between two consecutive values) we can use the
function seq(), which takes three inputs:
1. from: the starting point.
2. to: the ending point.
3. (optional) “by” to determine the step, or “length.out” to determine how many (equispaced) points we
want between the starting and the ending point.
Note: for the first two inputs there is no need to write “from” and “to” (R already knows them from their position), while for the
third input (or parameter) we must specify “by” or “length.out”.

# define a vector containing the sequence from 1 to 2 by 0.25


onetotwo = seq(1,2, by=0.25)
onetotwo
# define a vector containing a sequence from 1 to 2 of length 10
onetotwo2 = seq(1,2, length.out = 10)
onetotwo2

We can define a vector by replicating a value/string n times using the function rep():

# repeat the number "1" 10 times


rep1 = rep(1,10)
rep1
# repeat the word "Hello" 5 times
rep2 = rep("Hello",5)
rep2

We can fill a vector with random numbers using several functions, including runif(). This function creates a
vector of n random numbers drawn from a Uniform distribution on the interval [a, b], with
a ≤ b. The Uniform distribution ensures that each number in this interval has the same probability of being
selected. The syntax of the function is the following:

runif (n, min, max)

Example: define a random vector of 10 elements between 0 and 100 from a Uniform distribution.

# define a random vector of 10 elements between 0 and 100 from a Uniform distribution
randv = runif(10,min=0,max=100)

Note: the default values of runif() for min and max are 0 and 1. So, if we specify only n, this function
returns a vector of values uniformly drawn between 0 and 1.

# define a random vector of 100 elements between 0 and 1 from a Uniform distribution
randv = runif(100)

IMPORTANT: runif, like any other simulating function, gives new values every time you execute it, because
it provides random numbers. To obtain the same values as in a previous execution of the code,
you can fix the seed of the random number generator with the function set.seed(), which requires an
integer as input (e.g., set.seed(1234)).
For example:

# first randv:
randv = runif(10)
randv
# second randv:
randv = runif(10)
randv

Conversely, using set.seed():

# fix the random generator seed:


set.seed(1234)
randv = runif(10)
randv
# again
set.seed(1234)
randv = runif(10)
randv

Similarly to the function runif(), we have the function rnorm():

rnorm(n, mean, sd)

which returns n random values from a Normal distribution with mean equal to mean and standard deviation
($\sqrt{\text{variance}}$) equal to sd.
For example:

# define v_randn as a vector of length 10 containing draws from a Normal distribution


# with parameters mean = 10 and sd = 4
v_randn = rnorm(10,mean=10,sd=4)
v_randn

Note: always remember to fix the random numbers generator seed.

We can round our results to a fixed number of decimals (e.g., one, two, three, . . . ) or obtain (a vector of) integer
numbers (rounding to the units) using the function round() as follows:

round(x, k)

which rounds the object x (a number, a vector, . . . ) to k decimals. If we do not specify k, it rounds x to
the closest integer (k = 0 by default).
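
For example, a minimal sketch on an illustrative vector x:

```r
# an illustrative vector of decimal numbers
x <- c(3.14159, 2.71828, -1.5678)

# round to two decimals
round(x, 2)   # 3.14 2.72 -1.57

# round to the closest integer (k = 0 by default)
round(x)      # 3 3 -2
```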

R built-in functions
R contains a large amount of native (built-in) functions. They also are objects which contain one (or more)
actions. In other words, we can use them to make calculations, plots, or defining particular objects. They are
characterized by a name followed by round brackets “()”. In the brackets we insert the inputs (objects to be
processed) and the function returns an output (the results of the process).
We have already seen a couple of them:
1. c(): create a vector. Inputs are the vector elements.
2. seq(): generate regular sequences.
3. rep(): replicate values.
4. . . .

Once you have installed R, you have a lot of built-in functions ready to use. For example:
• min(x): returns the minimum of the input values x.
• max(x): returns the maximum of the input values x.
• mean(x): returns the arithmetic mean of the input values x.
• median(x): returns the median of the input values x.
• sd(x): returns the standard deviation of the input values x.
• var(x): computes the variance (covariance) of the input vector (matrix/vectors) x.
• abs(x): computes the absolute value of the input values x.
• exp(x): computes the exponential of the input values x ($e^x$).
• log(x): computes the natural logarithm of the input values x.
• sqrt(x): computes the square root of the input values x.
• sort(x): sorts the vector x in ascending/descending order or alphabetical order (for characters); default:
ascending.
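
A quick sketch of some of the functions in the list, applied to an illustrative vector v:

```r
# an illustrative vector
v <- c(1, 4, 9, 16)

var(v)        # sample variance, equal to sd(v)^2
abs(-v)       # 1 4 9 16
sqrt(v)       # 1 2 3 4
log(exp(2))   # log() is the inverse of exp(): 2
```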

R has an online manual which contains all the information about these and the other functions. To consult
the manual we can type:
• ?function_name: returns the specific page of the function with the Description, Arguments, Details, See
Also, and Examples sections.
• ??word: returns all the manual pages where this "word" is located.

Let’s see an example on R:


# consult the help of the function "median"
?median
# search in the manual the word "median"
??median

Now, define a vector of 10 observations and try these functions:
# define a vector
vector10 = c(0,-2,-1,-5,3,1,-8,2,9,4)

# try these functions


min(vector10)
max(vector10)
mean(vector10)
median(vector10)
sd(vector10)
# sort by ascending order
sort(vector10)
# sort by descending order
sort(vector10, decreasing = T)

R Packages
R is (again) an open source software, so everyone can write their own functions (including us, as we will
see later) and then publish them in the R repository (the large online warehouse of functions from where
we download R). These functions are collected in packages (like warehouse boxes), each of which comprises a list of
functions often related to a particular topic (e.g. graphics, statistics, . . . ).
Every function we have used until now is contained in one of the pre-installed packages (e.g. base, utils, . . . ),
but when we do something more specific we have to download and install new packages. Once a package is installed, we can use
it by simply loading it in our R session, without installing it again.
We can do everything with just two lines of R. Try to install and load MASS, a package containing a
large number of datasets to use:
# install the package MASS
install.packages("MASS")
# now MASS is on our computer.
# We have just to load (say to R that we need) MASS, because we are going to use it:
library("MASS")
# To see all the functions contained in a package:
help(package = "MASS")
# To see all the packages currently loaded in our R session:
search()

R "homemade" functions
We can write our own functions, which are algorithms (from the easiest ones to those most complex) that do
exactly what we ask them for. Basically, we write a function that given a (or a sequence of) value(s), called
input(s), returns one (or more) value(s), called output(s). The syntax is straightforward:

name = function(inputs){formula}

• name: the name we choose to assign to the function.
• inputs: one or more R objects. Multiple inputs must be separated by a comma (e.g. function(param1,
param2)).
• formula: one or more R actions (commands) which process the inputs and return an output.

Illustrative Examples:

1. Write a function that given two numbers returns their sum.(*)

# define sum_of_2 as a function that given two numbers a and b returns their sum
sum_of_2 = function(a,b) {a+b}

# now we can easily compute the sum of two numbers


sum_of_2(2,3)
# or the sum of two variables, let's say x=2 and y = 9
x=2; y=9
sum_of_2(x,y)

Note: Now, our function is stored in the Global Environment for further use with the name “sum_of_2”.

2. Write a function that calculates exp(x+4) for a user-specified x and evaluate it for the values 2 and -4.
(*)

# Write a function that calculates e^(x+4) for a user-specified x


exp4= function(x) {exp(x+4)}

# evaluate it
exp4(2)
exp4(-4)

3. The following equation represents the probability density function (pdf) of a Normal distribution:

$$ f(x, m, s) = \frac{1}{\sqrt{2\pi s^2}} \exp\left\{ -\frac{1}{2} \left( \frac{x-m}{s} \right)^2 \right\}, $$

where x, m, s are user-specified numbers (x numeric, m = mean and s = standard deviation).

Write a function called “fNormal” that calculates this pdf and evaluate it at x=2, m=-3, and
s=1.5. (**)

# Ex3: implementation of the Normal pdf


fNormal = function(x,m,s){
y = 1/sqrt(2*pi*s^2) * exp(-0.5 * ( (x-m) / s )^2)
y
}

fNormal(2,-3,1.5)

Note: pi is a variable already defined in R and holds the value of π.

4. Given m = 2 and s = 1, create x as a vector of the integers from 1 to 100 and evaluate the
fNormal function defined in the previous step at each point of the vector x. (**)

We can evaluate the function on an entire vector of values as follows:


# Evaluate fNormal on the vector x of length 100
# define x
x = 1:100
# then define y using fNormal
y = fNormal(x,2,1)
y

Note: If the formula contains only one command we can also omit the {}.
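
For instance, the sum of two numbers can be written in one line without braces. This is a small sketch; the name sum_of_2_short is only illustrative.

```r
# a one-command function written without {}
sum_of_2_short <- function(a, b) a + b

sum_of_2_short(2, 3)   # 5
```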

5. Focus on the built-in sd() function.

The sd() built-in function computes the sample standard deviation, the unbiased estimate. What is the
difference from the standard deviation of the population?

Standard Deviation (SD)

Sample: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Population: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$

Table 1: Difference between sample and population SD. n and N are the size of the sample and of the
population, respectively, and x̄ represents the sample mean.

So, if we need the population standard deviation, called also biased SD, we cannot use the built-in function.
Let’s see our alternatives doing the following exercise:

Given the vector w


w = (1, −1, 5, 6, 1, −6, 8, 9, 1, 3),
compute the biased SD. (**)

Alternative A: compute the biased SD manually

# define the vector w


w = c(1,-1,5,6,1,-6,8,9,1,3)
# compute the arithmetic mean of w
m_w = mean(w)
# define N as the length of w
N = length(w)
# compute the population SD:
sd_pop = sqrt(sum((w-m_w)^2)/N)
# see the difference with the results of sd()
sd_pop
sd(w)

Alternative B: adjust the results of the built-in function.


The only difference between the two formulas in Table 1 is the denominator. So we can pass from one to the
other as follows:

$$ s = \frac{\sqrt{n}}{\sqrt{n-1}}\,\sigma, \qquad \sigma = \frac{\sqrt{n-1}}{\sqrt{n}}\,s $$

Then:
# define the vector w
w = c(1,-1,5,6,1,-6,8,9,1,3)
# define N as the length of w
N = length(w)
# compute the population SD:
sd_pop2 = sd(w)*(sqrt(N-1)/sqrt(N))
# compare with the result of sd()
sd_pop2
sd(w)

Take Home Exercises (Part 1)


1. Write a function that takes as input two vectors x and y and returns their difference (x − y). (*)
2. Write a function that takes a vector as input and returns the product of its elements (hint: prod(v)
computes the product of the elements of v). (*)
3. Write a function that takes a number as input and returns its cubic root (remember: $\sqrt[3]{x} = x^{1/3}$). (**)
4. Write a function that takes a number x as input and returns y defined as follows: (***)

$$ y = \frac{x + 2x - 3}{\log(x) - 2} $$

5. Write a function that given the (unbiased) sample standard deviation as input returns the (biased)
population standard deviation (as in Example 5 above). (**)
6. Define two vectors u and g both of length 10, where u is the results of a draw from a Uniform distribution
in the interval [0,10], while g is the results of a draw from a Normal distribution with mean 5 and
sample standard deviation equal to 0.5. Then, use the function defined in the first exercise to compute
the difference between u and g. Set the random numbers generator seed equal to "092021". (***)

Matrices
In R we can define a matrix using the function matrix().
We can define a matrix in different ways.
Define (initialize) a variable A as a 3x2 matrix as:

$$ A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{pmatrix} $$

1st method: fill the matrix with the vector c(1, 2, 3, 4, 5, 6) and specify (at least) the number of rows or columns
of A. Once we specify the number of rows, R automatically knows the number of columns, and vice versa.
Finally, we have to specify how R has to fill the matrix (by rows vs by columns).
Summarizing, we define in R a generic matrix M as:

M = matrix(data, nrow, ncol, byrow)

In the case of our matrix A we will do:

A = matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)

So we choose as “data” a vector containing the sequence of numbers between 1 and 6, then we specify the
number of rows and columns (only one is strictly necessary), and finally in which way we want to fill the
matrix, so we specify byrow = TRUE (or just T). Then:

# define a matrix A containing the integer between 1 and 6


A = matrix(data=1:6,nrow=3,ncol=2,byrow = TRUE)
A
# it is fine also to pre-define a vector and then use it to fill the matrix as follows:
# define a vector containing the number between 1 and 6
vA = c(1,2,3,4,5,6)
# alternatively:
vA = 1:6
# or
vA = seq(1,6,by=1)
# now, fill the matrix with the vector vA
A = matrix(data=vA,nrow=3,ncol=2,byrow = TRUE)
A
# see what happens when we do not specify the number of columns/row
A = matrix(data=vA,nrow=3,byrow = TRUE)
A
A = matrix(data=vA,ncol=2,byrow = TRUE)
A

Let’s create B, a 3x3 square matrix containing the values between 1 and 9, and fill it by column:

# define a matrix B containing the integer between 1 and 9 (fill by column)


B = matrix(data=1:9,nrow=3,ncol=3,byrow = FALSE)
B
# or alternatively
B = matrix(data=1:9,nrow=3,byrow = FALSE)
B

Note: by default “byrow” is set as “FALSE”. So we need to specify it only if we want to fill the matrix by
row:
# define a matrix B containing the integer between 1 and 9
B = matrix(data=1:9,nrow=3)
B
# it works fine without specifying the number of columns and the way to fill the matrix

2nd method: Matrices are just a collection of (row/column) vectors. So we can define a matrix by binding
(by row or by column) pre-created vectors. In order to do that, we use the functions rbind() (to bind
vectors by row) and cbind() (to bind vectors by column). They require as inputs just the collection of vectors
of the same length to bind together.

Example:

Let v1 = (2, 0, 1), v2 = (1, 0, 3), v3 = (−1, −1, −1) be column vectors. We want to define C and D as

$$ C = \begin{pmatrix} 2 & 1 & -1 \\ 0 & 0 & -1 \\ 1 & 3 & -1 \end{pmatrix} \qquad D = \begin{pmatrix} 2 & 0 & 1 \\ 1 & 0 & 3 \\ -1 & -1 & -1 \end{pmatrix} $$

# define three column vectors


v1 = c(2,0,1)
v2 = c(1,0,3)
v3 = rep(-1,3)
# now we define C binding these three vectors by columns and D by rows

C = cbind(v1,v2,v3)
D = rbind(v1,v2,v3)
C
D

Element position in vectors and matrices
We can select/extract an element from a vector/matrix by indicating its position. To select an element of an
object we need to use the “[ ]”. In particular:

• Vectors have one dimension, so to extract the element in position k from the vector v:

v[k]

• Matrices have two dimensions (rows and columns), so we need to specify the row(s) and column(s)
numbers in the square brackets ("[ ]"), separated by a comma. To extract the element in row i and
column j from the matrix M:

M[i, j]

Let’s see a couple of examples. Define a vector w and a matrix E as follows

$$ w = (1, -1, 7, 2, 0) \qquad E = \begin{pmatrix} 1 & 2 & -1 & 2 \\ 0 & 1 & -2 & 0 \\ 4 & -1 & 3 & -3 \\ 5 & 0 & -1 & 0 \end{pmatrix} $$

# define w and E
w = c(1,-1,7,2,0)
E = matrix(c(1,0,4,5,2,1,-1,0,-1,-2,3,-1,2,0,-3,0), nrow = 4)

# print the element of w in position 5


w[5]
# save in a variable called "d" the second element of w
d = w[2]
d
# print the element of E in row 2 and column 1
E[2,1]
# print the element of E in row 1 and column 2
E[1,2]
# print the entire second column of E
E[,2] # --> "= all the elements in the second column"
# print the entire first row of E
E[1,] # --> "= all the elements in the first row"

# select only the first two rows of E


E[c(1,2),] # remember to specify c(1,2) as a vector!!
# select from the second to the fourth elements of the third row of E
E[3,2:4]
# select only the elements in the first two rows and in the third and fourth columns of E
# (E has only four columns, so asking for a fifth one would give an error)
E[1:2, c(3,4)]

# define G as a new matrix equal to E without its third column
G = E[,-3]
G

# define H removing the rows 1 and 4 from E

H = E[-c(1,4),]
H

There are two other useful tools to work with the dimension/position of the elements of vectors and matrices:
• length(x): returns the length of the object (vector) x.
• dim(x): returns the dimension of the object (matrix, and other) x.

Example :

# retrieve the length of the vector w


length(w)
# retrieve the dimension of the matrix E
dim(E)
# retrieve the dimension of the matrix H
dim(H)

## get the number of rows of H


# dim() returns a vector of length 2 containing in position 1
# the number of rows and in position 2 the number of columns, then:
dim(H)[1]
# get the number of columns of H
dim(H)[2]
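
As a shortcut, the functions nrow() and ncol() return the two elements of dim() directly. A minimal sketch on a small illustrative matrix M (not one of the matrices defined above):

```r
# an illustrative 2x3 matrix
M <- matrix(1:6, nrow = 2)

nrow(M)   # same as dim(M)[1]: 2
ncol(M)   # same as dim(M)[2]: 3
```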

Basic matrix calculus with R

In R we can do math operations (sum, difference, product, etc.) between a matrix and a scalar simply as
follows. Let A be a matrix (or a vector; a column vector is a matrix with 1 column) and b a scalar.

Operation                  R Command
Sum/Difference: A ± b      A + b, A - b
Product: A · b             A * b
Division: A / b            A / b

Table 2: Operations between a vector/matrix and a scalar.
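
A short sketch of these operations on an illustrative 2x2 matrix; in each case the operation is applied to every element of the matrix.

```r
# an illustrative matrix, filled by column: rows are (1, 3) and (2, 4)
A <- matrix(1:4, nrow = 2)
b <- 2

A + b   # adds b to each element
A - b   # subtracts b from each element
A * b   # multiplies each element by b
A / b   # divides each element by b
```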

Given two matrices A and B, the following table reports the most common mathematical matrix operations.

Operation                  R Command
Sum: A + B                 A + B
Product: A · B             A %*% B
Determinant: det(A)        det(A)
Inverse: A^{-1}            solve(A)
Transpose: A'              t(A)
Main diagonal: diag(A)     diag(A)

Table 3: Matrix operations.

Example

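Given two matrices A and B and a scalar c

$$ A = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 1 & 2 \\ 1 & -1 & 1 \end{pmatrix} \qquad B = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{pmatrix} \qquad c = 3, $$

compute:

1. A + A
2. A · B
3. det(B)
4. A^{-1}
5. A'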

# define the three object A,B and c
A = matrix(c(rep(1,3), 0:2,c(1,-1,1)),nrow=3,byrow = T)
B = matrix(1:9, nrow=3,byrow = F)
c=3
A;B;c

# 1
A+A
# 2
A%*%B
# 3
det(B)
# 4
solve(A)
# 5
t(A)

Take Home Exercises

Given A, B and c, the three objects just defined, compute:

1. A · B
2. det(A · B)
3. A ∗ A^{-1}
4. (A · B) / c
5. A · B'
6. (A − B) · c
7. diag(B) · c
8. A[2, 3] · B[3, 3]

Conditions in R
The comparison between two or more objects is important in every programming language. The following
table summarizes the math condition operators and how to express them in R.

Condition    R
<            <
>            >
≤            <=
≥            >=
=            ==
≠            !=
and          &
or           |

Table 4: Condition operators in R

Note: The output of a comparison between two (or more) objects is a variable of type boolean (i.e.,
TRUE/FALSE).
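
A minimal sketch of the operators in Table 4, using illustrative values:

```r
x <- 5
y <- 3

x > y                  # TRUE
x == y                 # FALSE
x != y & y <= 3        # both conditions hold: TRUE
res <- x < y | y < 0   # neither condition holds: FALSE
res
class(res)             # "logical", the R name for boolean

# on a vector the comparison is done element by element
v <- c(-2, 0, 4)
v > 0                  # FALSE FALSE TRUE
```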

Exercise:

Let A be a 4x4 matrix as follows:

$$ A = \begin{pmatrix} 1 & 0 & 4 & 5 \\ -1 & -3 & 4 & 3 \\ 2 & 1 & -6 & 2 \\ 0 & 0 & 2 & 4 \end{pmatrix}. $$

Write the R code to answer the following questions about the matrix A:
1. Is the element in position [1,1] greater than that in position [4,1]? (*)
2. Is the element in position [3,2] different from that in position [2,2]? (*)
3. Is the element in position [1,2] equal to that in position [4,2] and that in position [4,3]? (**)
4. Is the element in position [3,1] lower than that in position [3,3] or than that in position [4,2]? (**)
5. Is the sum of the first row of A lower than that of the second column of A? (***)
6. Is the minimum of the last column of A greater than or equal to the maximum of the last row of A?
(***)

# define A
A = matrix(c(1,0,4,5,-1,-3,4,3,2,1,-6,2,0,0,2,4), ncol=4, byrow = T)

# 1
A[1,1] > A[4,1]
# 2
A[3,2] != A[2,2]
# 3
A[1,2] == A[4,2] & A[1,2] == A[4,3]
# 4
A[3,1] < A[3,3] | A[3,1] < A[4,2]
# 5
sum(A[1,]) < sum(A[,2])
# 6
min(A[,ncol(A)]) >= max(A[nrow(A),])

Dataframes
A dataframe is a “special” matrix which can contain more information and has several desirable
features. Each column of a dataframe (or dataset) has a name and represents a variable, hence an observed
feature of the statistical units (which are reported in rows).
We can create our first dataframe starting from a list of vectors. We select a class of 20 people coming from
different countries of the world. For each one of them we collect name, age, height, nationality, gender,
and the final grade in the math exam.
# define a vector of names
names = c("Andrew","Anna","Alice","Antony",
"Barbara", "Brian","Boris","Barney",
"Claudia","Cliff","Cecilia","Clara",
"David","Dora","Denise","Donatello",
"Emma","Elise","Esteban","Elon")
# define a vector of ages
ages = c(20,22,27,25,18,22,26,21,19,24,
27,23,22,19,23,28,22,24,25,19)
# define a vector of heights
heights = c(180,170,155,175,150,197,178,182,183,170,
175,178,170,160,175,194,180,165,172,183)
# define a vector of nationalities
nationalities = c("France","Scotland","Italy","Poland",
"France","India","UK","Poland",
"Italy","Scotland","UK","France",
"Mexico","USA","France","Germany",
"USA","France","Spain","Poland")

# define the gender ("male" = 0, "female" = 1)


gender = c(0,1,1,0,1,0,0,0,1,0,1,1,0,1,1,0,1,1,0,0)
# define a variable for the exam grades
grades = c(16,18,19,18,15,14,15,18,17,20,
20,19,15,16,18,14,20,15,19,17)
# create a dataset called "class" containing these variables
class = data.frame(names,ages,heights,nationalities, gender, grades)

We obtain the following dataset:

names ages heights nationalities gender grades


Andrew 20 180 France 0 16
Anna 22 170 Scotland 1 18
Alice 27 155 Italy 1 19
Antony 25 175 Poland 0 18
Barbara 18 150 France 1 15
Brian 22 197 India 0 14
Boris 26 178 UK 0 15
Barney 21 182 Poland 0 18
Claudia 19 183 Italy 1 17
Cliff 24 170 Scotland 0 20
Cecilia 27 175 UK 1 20
Clara 23 178 France 1 19
David 22 170 Mexico 0 15
Dora 19 160 USA 1 16
Denise 23 175 France 1 18
Donatello 28 194 Germany 0 14
Emma 22 180 USA 1 20
Elise 24 165 France 1 15
Esteban 25 172 Spain 0 19
Elon 19 183 Poland 0 17

Table 5: My first dataset

We can call these variables by using their names as follows:

1. dataframe_name$variable_name
2. dataframe_name[,'variable_name']
For example:
# print the names of the units
class$names
# or
class[,'names']

Among the useful functions related to dataframes:

• class(dataframe$variable): returns the type of a variable (numeric, integer, character, ...).
• head(dataframe): prints the first six rows of a dataframe.
• dim(dataframe): print the number of rows and columns.
• names(dataframe): print the names of the dataframe columns.
• nrow(dataframe): returns the number of rows of the dataframe.
• ncol(dataframe): returns the number of columns of the dataframe.
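
A quick sketch of these functions on a small dataframe. The dataframe df and its values are hypothetical and serve only as an illustration.

```r
# a small illustrative dataframe (hypothetical data)
df <- data.frame(id = 1:8,
                 city = c("Rome", "Nice", "Rome", "Pisa",
                          "Nice", "Rome", "Pisa", "Rome"),
                 score = c(16, 18, 19, 18, 15, 14, 15, 18))

class(df$score)   # type of one variable: "numeric"
head(df)          # first six rows (the default)
head(df, 3)       # first three rows
dim(df)           # 8 rows and 3 columns
names(df)         # "id" "city" "score"
nrow(df)          # 8
ncol(df)          # 3
```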

We can add a variable to our dataset by calling the dataset and defining a new variable inside it.

For example, we want to add the variable HStudy, which indicates the number of hours spent studying in the last week by
our students. So:

# add a variable called HStudy to our dataset class


class$HStudy = c(34,36,39,37,31,30,32,37,35,37,39,38,31,34,37,31,40,32,39,35)

Note: the new variable must have the same number of elements as the other variables in the dataset.

We can also add new variables as transformations of pre-existing variables in the dataset. For example, we
can add a new variable called “MinStudy” which expresses the variable HStudy (currently measured in hours)
in minutes. In formula:

MinStudy = HStudy · 60

# add a variable called MinStudy to our dataset class


class$MinStudy = class$HStudy*60

We can also export the dataset (or a matrix) we have created in a ‘.csv’ file (comma separated values) to use
it on Excel.
# export our dataframe
write.csv(class, file="class.csv")
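
To check an exported file, we can read it back with read.csv(). The sketch below is only illustrative: it uses a small hypothetical dataframe and a temporary file created with tempfile(), so that nothing is written to your working folder. The argument row.names = FALSE (an optional choice, not used above) avoids saving the row numbers as an extra column.

```r
# a small illustrative dataframe (hypothetical data)
df <- data.frame(name = c("Anna", "Boris"), grade = c(18, 15))

# write to a temporary file, without the column of row numbers
path <- tempfile(fileext = ".csv")
write.csv(df, file = path, row.names = FALSE)

# read the file back and look at its content
df2 <- read.csv(path)
df2
```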

Illustrative exercises: we learn R by doing.


The dataset “class” contains the following variables: names, ages, heights, nationalities, gender, grades, HStudy, MinStudy. Write the R
code to obtain:
1. how many students are there in the class? (*)
2. the arithmetic mean of the variable "ages". (*)
3. the median grades. (*)
4. the highest and the lowest grades. (*)
5. the absolute frequencies for the variable "nationalities". (**)
6. define two subsamples separating females and males. Which group has the highest grade, on average?
(***)

Solutions:

1. How many students are there in the class?


To address this question we can simply count the number of rows of our dataframe. In fact, each row
corresponds to a student. Then:
# count the unit in our sample = n rows dataframe
nrow(class) # return the number of rows of a dataframe/matrix

2. Compute the arithmetic mean of the variable “ages”.


The variable ages is stored in the dataset class. Then:

# compute the arithmetic mean of the variable ages


mean(class$ages) # with $ we select a variable in a dataframe

3. The median grades and 4. the highest and the lowest grades.
Again, we know how to compute median, maximum and minimum. Then:

# compute the median


median(class$grades)
# compute the minimum
min(class$grades)
# compute the maximum
max(class$grades)

5. Compute the absolute frequencies for the variable “nationalities”.


In R we can use the function table() to build a contingency table and then to compute the absolute frequencies.

# compute the absolute frequencies


table(class$nationalities)

Note: to compute the relative frequencies we divide by the total number of observations, i.e. nrow(class):

# compute the relative frequencies


table(class$nationalities)/nrow(class)

6. Define two subsamples separating females and males. Which group has the highest grade, on average?

We can split the dataset into two subsets according to a specific condition. In this case, each unit will belong
to the new “female” or “male” datasets according to the variable “gender”. To express a condition we refer
to those operators presented in Table 4. In particular, for each dataset we select only the rows where the
expressed condition is TRUE.

# Female = 1, Male = 0.
female = class[class$gender==1,]
male = class[class$gender==0,]

In words:

We select all the rows (the first space in the square brackets, [], before the “,”) where the condition (female:
class$gender==1, and male: class$gender==0) is TRUE, and then we select all the columns (because we do
not specify one or more variables after the “,” and before the last “]”).
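To see what happens inside the square brackets, it can help to look at the condition on its own. A minimal sketch, assuming a made-up dataframe called toy (its names and values are hypothetical, not the course dataset):

```r
# hypothetical data, only for illustration (Female = 1, Male = 0)
toy = data.frame(names  = c("Anna", "Brian", "Carla", "David"),
                 gender = c(1, 0, 1, 0),
                 grades = c(28, 25, 30, 27))

keep = toy$gender == 1   # logical vector, one TRUE/FALSE per row
keep
toy[keep, ]              # rows where keep is TRUE, all the columns
toy[keep, "grades"]      # same rows, only the column grades
```

The condition is just a vector of TRUE and FALSE values, and R keeps the rows in the positions where it is TRUE.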

Now we have two different datasets and we can compute their average grades separately as follows:

# Compute the two means


mean(female$grades)
mean(male$grades)
# compare them using a condition
mean(female$grades) > mean(male$grades)
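When only the group means are needed, the split into two datasets can probably be skipped by using the base R function tapply(), which applies a function within each group. A minimal sketch, assuming a made-up dataframe toy (hypothetical values, not the course dataset):

```r
# hypothetical data, only for illustration (Female = 1, Male = 0)
toy = data.frame(gender = c(1, 0, 1, 0, 1),
                 grades = c(28, 25, 30, 27, 26))

# mean of grades within each value of gender
group_means = tapply(toy$grades, toy$gender, mean)
group_means

# label of the group with the highest mean
names(group_means)[which.max(group_means)]
```

The result is named after the values of the grouping variable, here "0" and "1".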

Note:
Sometimes we need matrices and not dataframes (e.g., matrix calculus). To transform a dataframe (or a
subset of it) into a matrix: as.matrix(dataframe_name).
# 7. Transform the last three columns of the dataset class into a matrix called Grades
Grades = as.matrix(class[,(ncol(class)-2):ncol(class)])
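As an illustration of why the conversion matters, here is a minimal sketch on a made-up dataframe toy (the column names test1, test2, test3 and the values are hypothetical). It selects the numeric columns, converts them and then uses a matrix product:

```r
# hypothetical data, only for illustration
toy = data.frame(names = c("Anna", "Brian", "Carla"),
                 test1 = c(28, 25, 30),
                 test2 = c(27, 24, 29),
                 test3 = c(30, 26, 28))

# keep only the numeric columns: a matrix stores a single data type
M = as.matrix(toy[, c("test1", "test2", "test3")])
is.matrix(M)   # TRUE
dim(M)         # number of rows and of columns
t(M) %*% M     # matrix product (%*% needs matrices, not dataframes)
```

If a text column such as names were included, as.matrix() would turn every value into a character string, so it is safer to select the numeric columns first.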

Take Home Exercises
Write the R code to answer the following questions.
1. Which is the most represented country in the class? (*)
2. Do the mean and the median of the variable "grades" coincide? (*)
3. What is the proportion of females in the class? (*)
4. Is the median age of our class over 24? (*)
5. Where does Emma come from? (**)
6. Is David older than Brian? (**)
7. Are the Polish taller than the Scottish, on average? (***)

8. Which is the name of the tallest person in the class? (***)

Import data
We can import vectors or matrices defined in an external source. We distinguish three main types of
files (data vectors/dataframes):

1. R datasets: already existing in your R.


2. Online datasets: load them directly from the web.
3. Local datasets: import data from your computer.

Example
# Import data in R

# 1. load existing datasets


data(iris)
# 2. import (read) data from the web (in this case it is a .csv)
fishing = read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/COUNT/fishing.csv')
# 3. local data
# You have to find where your document is (again a .csv file)
# In general:
data = read.csv("Path where your CSV file is located in your computer\\dataname.csv")
data = read.csv('C:\\...\\dataname.csv') # Windows
data = read.csv("/Users/.../dataname.csv") # Mac

# Alternatively, we can see where R is working (in which working directory (WD))
getwd()
# and then set the working directory where the dataset is contained
setwd("C:\\Users\\...\\WD") # Windows
setwd("/Users/.../WD") # Mac
# and then just read the file without specifying the WD
data = read.csv('dataname.csv')

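Note: inside an R string a Windows path is written with \\ (a single \ is read as an escape character) or with /, while Mac uses /.

Note: read.csv() and read.table() read plain text files, so they cannot open an Excel workbook (.xls, .xlsx) directly. You can save the sheet as .csv from Excel and then use read.csv(), or install and then load the package readxl, which is strongly recommended (install.packages("readxl")).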
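The import step can be tried without any external file. A minimal sketch, assuming a small made-up dataframe toy (hypothetical values): it is saved as a .csv in R's temporary folder and then read back with read.csv().

```r
# hypothetical data, only for illustration
toy = data.frame(names = c("Anna", "Brian"), ages = c(23, 25))

# build a path inside R's temporary folder; file.path() joins the pieces with "/"
path = file.path(tempdir(), "toy.csv")

write.csv(toy, path, row.names = FALSE)  # save the dataframe as .csv
toy2 = read.csv(path)                    # and import it again

toy2
file.exists(path)   # check that the file is really there
```

Building the path with file.path() avoids typing the separators by hand, so the same line should work on both Windows and Mac.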

R for graphical representations
R is a powerful tool for graphical representations. It contains a large list of (customizable) functions to
represent our data. We can see its potential with the following demo:

demo(graphics)

Below are some basic commands to represent our data:

Scatter plots

plot() This function provides a scatterplot of our data. It takes as input two vectors of coordinates (x and
y) and it is fully customizable according to our preferences (if only one vector is supplied, it is plotted against
its index). For example, with the extra (that is, optional) input "type" we can decide how to draw our scatterplot.
In particular:

• type = "l" joins the x-y points with a line.


• type = "p" represents only the x-y points.
• type = "b" both lines and points.
• ...and so on
The default type is “p”.

We can also customize our plot by adding a title, specifying main as an extra input. For other graphical parameters
and many examples type ?plot on the R console.

Let’s make our first plot:

# retrieve our class dataset


class = read.csv("/Users/.../class.csv")
# plot ages (x) vs grade (y)
plot(class$ages, class$grades)
# this is a simple plot. Let's customize it.
plot(class$ages, class$grades, pch=18, cex=1, main= "Ages vs Grades",
xlab = "Ages", ylab = "Grades", col="blue")
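To compare the values of type described above, a minimal sketch with two made-up vectors x and y (hypothetical data, chosen only because the curve is easy to recognize):

```r
# hypothetical data, only for illustration
x = 1:10
y = x^2

plot(x, y)                                    # default: type = "p", only points
plot(x, y, type = "l", main = "Only a line")  # points joined by a line
plot(x, y, type = "b", main = "Both")         # points and line together
plot(y)   # only one vector: it is plotted against its index 1, 2, ...
```

Each call to plot() replaces the previous picture in the graphical window, so it is worth running the lines one at a time.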

Here is a list of plot function parameters:


• pch: an integer between 0 and 25 that specifies the shape of the points.

• cex: the dimension of the points (0.5x, 1x, 1.2x, 2x, ...).

• main: a string that indicates the title of the plot.
• xlab/ylab: specify the x and the y-labels.
• col: the color of points.

Bar and Pie plots

We can represent frequency distributions using either a bar plot (barplot()) or a pie plot (pie()) as follows:

# represent the frequency distributions of the variable nationalities

# save the absolute frequencies in a variable


nat_freq = table(class$nationalities)
# barplot
barplot(nat_freq) # ugly
barplot(nat_freq, las=2, col="orange2",
main= "Nationalities distribution in the class") # nice
# "las" parameter: 1 horizontal x labels, 2 vertical labels!

# to represent the relative frequencies instead of the absolute ones


barplot(nat_freq/sum(nat_freq), las=2, col="orange2",
main= "Nationalities distribution in the class")

# ALTERNATIVE: pie plot


pie(nat_freq, main = "Nationalities distribution in the class")

Note: You can choose the color you prefer. In this case we have 10 nationalities, hence you can specify
a character vector of 10 colors. Furthermore, there are predefined palettes of colors. To see all the colors
available in R: http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
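The names of the colors can also be listed directly from the R console with the base function colors(), and rainbow() is one of the predefined palettes. A minimal sketch, where the frequencies in freq are made up for illustration:

```r
head(colors())     # first names in the list of built-in colors
length(colors())   # how many named colors are available

# a predefined palette returns a character vector of n colors
my_cols = rainbow(10)
my_cols

# hypothetical frequencies, only for illustration
freq = c(A = 3, B = 5, C = 2)
barplot(freq, col = c("orange2", "steelblue", "green3"))
```

When the vector passed to col is shorter than the number of bars, R reuses the colors from the beginning.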

Histograms

We use a histogram when we represent continuous variables. Let’s represent the distribution of heights in
our dataset class:

# represent the distribution of heights in our dataset


hist(class$heights) # nice, but we can do better
# specify a vector of 10 colors
colors = c("red4","red3","red2","red1","orange1",
"yellow" ,"green1","green2","green3","green4")
hist(class$heights, breaks = 10, col=colors, xlab = "Heights (in cm)",
ylab="Absolute Frequencies", main = "Heights distribution in the class")
# to represent densities instead of counts add freq = FALSE: each bar's height is
# its relative frequency divided by the bin width, so the bars' total area is 1
hist(class$heights, breaks = 10, col=colors, xlab = "Heights (in cm)",
ylab="Density", main = "Heights distribution in the class", freq = FALSE)
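To see exactly what hist() puts on the vertical axis, its output can be inspected without drawing. A minimal sketch, assuming a made-up vector of heights (hypothetical values, not the course dataset):

```r
# hypothetical heights in cm, only for illustration
heights = c(158, 162, 165, 167, 170, 171, 174, 176, 180, 185)

h = hist(heights, plot = FALSE)  # compute the bins without drawing
h$breaks                         # limits of the bins
h$counts                         # absolute frequencies (drawn with freq = TRUE)
h$counts / sum(h$counts)         # relative frequencies
h$density                        # what is drawn with freq = FALSE

# density = relative frequency / bin width, so the total area is 1
sum(h$density * diff(h$breaks))
```

With freq = FALSE the heights of the bars are densities. They coincide with the relative frequencies only when every bin has width 1.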

Take Home Exercises
Load the dataset “iris”, which is already present in R and answer the following questions:

1. Represent the distribution of the length of the flowers' petals ("Petal.Length") through a histogram.
2. Draw a pie plot of the variable "Species".
3. Make a scatter plot which shows the relation between Petal.Length and Petal.Width.

Feel free to provide basic plots (∗) or to customize them (∗∗)/(∗ ∗ ∗).

To satisfy the great artist who is in you:


https://www.r-graph-gallery.com/

Control Flow Statements
We can ask R to return a result according to particular conditions or launch an iterative process (repeat a set
of actions) specifying an exit rule (when to stop). For this type of procedures we need conditional statements
and loops.

Conditional Statements

We can ask R to do something only if a condition is satisfied, and to do something else if the
same condition is not met. The syntax is straightforward:
• if (condition) do something.
• if (condition) do something else do something else.
• if (condition1) do something else if (condition2) do something different else do something else
(multiple conditions, you can add as many conditions as you want).
Examples:
# conditional statements in R

### the "if" statement


# 1. Write a command that returns the message
# "You inserted a number greater than 5"
# if the object a is greater than 5
a = 7
if (a > 5) "You inserted a number greater than 5"

### the "if" - "else" statement

# 2. write a function that given a number x as input


# returns if this number is an even or an odd number
# try this function for x = 4 and x = 3
even_odd = function(x){
if (x %% 2 == 0) y= "even" else y="odd"
y # it means: return (print on the R console) y
}

even_odd(3); even_odd(4)
# %%: returns the remainder after the number is divided by divisor (2).

## if - else if - ... - else statement


# 3. write a function that given a number x as input returns
# a string of character that tells if the number is negative,
# zero or positive.
# try the function for x = {-2,0,1}
numb_sign = function(x){
if(x<0) y= "It is a NEGATIVE number"
else if (x==0) y = "It is ZERO"
else y = "It is a POSITIVE number"
y
}

numb_sign(-2); numb_sign(0); numb_sign(1);
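Note: if() expects a single TRUE or FALSE. When the same condition has to be checked on every element of a vector, base R offers the vectorized function ifelse(). A minimal sketch, assuming a made-up vector x (hypothetical values):

```r
# hypothetical input, only for illustration
x = c(-2, 0, 1, 7)

# ifelse(condition, value if TRUE, value if FALSE), element by element
parity = ifelse(x %% 2 == 0, "even", "odd")
parity

# nested ifelse() for more than two cases
signs = ifelse(x < 0, "NEGATIVE", ifelse(x == 0, "ZERO", "POSITIVE"))
signs
```

The result has one element for each element of x, so no loop is needed here.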

Take Home Exercises
1. Your vending machine

Define a catalogue of four goods (e.g. “Coffee”, “Cappuccino”, “Tea”, and “Hot Water”) and assign them
the relative prices (e.g. 1, 1.50, 0.80, 0.05). Using the conditional statements, write a function that given a
string of character containing one of the four goods returns its price. If the input is different from those four
products the function returns "Sorry, product not available.".
Try to order a Coffee, a Tea and a Ginseng.
(Remember: R is case sensitive: a ≠ A)
2. Saving when I do shopping

The following table contains a list of grocery shops which sell tomatoes, potatoes, zucchinis, and eggplants along
with their prices per kg.

Shops              Tomatoes  Potatoes  Zucchinis  Eggplants
TheLaughingTomato  0.99      0.23      0.54       0.34
PoorZucchini       0.98      0.28      0.50       0.34
BadApple           1.07      0.17      0.48       0.39
ToPearOrNotToPear  0.93      0.22      0.57       0.33

I have to buy for my restaurant:


• 2 kg of tomatoes;
• 3 kg of potatoes;
• 5 kg of zucchinis;
• 4 kg of eggplants.
Where should I go shopping to save money?
Write an R function that takes as input a dataframe called Grocery, which is a copy of the table above, and
returns a string containing the name of the cheapest grocery shop.

Loops
Loops allow us to repeat the same set of actions/procedures n times. We can specify how many times R
must repeat this set of actions according to:

• a fixed number of times, using a counter ⇒ for statement.


• as long as a condition is met ⇒ while statement

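Graphically a loop works in this way:

[Figure: flowchart of a loop. START, then i = 1, then "do something", then the check "is still i < n?". If YES, i = i + 1 and "do something" is repeated; if NO, END. Here i = counter and n = number of iterations / exit condition.]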

For

Syntax

for (counter in initial_value : ending_value) do something

Note: Again, we do not need {} when we ask R to perform only one action.
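A minimal sketch of the two cases, with and without {}. The names total and i are chosen only for this illustration:

```r
# one action: {} can be omitted
for (i in 1:3) print(i^2)

# more than one action: {} is needed
total = 0
for (i in 1:5) {
  total = total + i   # add the counter to the running sum
  print(total)        # display the partial sum
}
total                 # 1 + 2 + 3 + 4 + 5 = 15
```

After the loop the counter i keeps its last value, here 5.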

Example
Ex1: The grocery shop at the end of the street is interested in evaluating the number of watermelons sold in
the last week. The owner registers these quantities in a vector called “wm” as follows:

wm = [10, 5, 4, 6, 8, 11, 3].

In particular, he is satisfied when he manages to sell at least 5 watermelons in a day. Write a function in R that
returns the total number of days when the shop sold at least 5 watermelons.
Solution:
Our strategy consists in assessing how many elements of wm are not lower than 5.
Step 1: define an empty vector s that records, for each day, whether the sale was a success:

$$ s[i] = \begin{cases} 1, & \text{if } wm[i] \ge 5 \\ 0, & \text{if } wm[i] < 5 \end{cases} $$

with i = {1, 2, . . . , 7} the index of the for loop. Then, we sum the elements of the new vector s and return
the number of days when the shop sold at least 5 watermelons.

wm_affair = function(vector){
s = c() # define an empty vector, REQUIRED if we want to fill it in a for loop
n = length(vector) # define n as the length of the input vector
# for loop
for (i in 1:n) {
if (vector[i]>=5) s[i] = 1 else s[i]=0
}
# the for loop ends here, when i reaches n. Then, s is filled.
result = sum(s)
result
}

# Let's try our function.


# define wm:
wm = c(10,5,4,6,8,11,3)
# and apply our function:
wm_affair(wm)
# let's see if it works also for another week, like wm2
wm2 = c(1,5,8,1,5,7,2)
wm_affair(wm2)
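As a cross-check of the loop, the same count can be obtained in one line, because comparisons in R work element by element. A minimal sketch on the two vectors defined above (an alternative shown only for comparison, not a replacement for the exercise):

```r
wm  = c(10, 5, 4, 6, 8, 11, 3)
wm2 = c(1, 5, 8, 1, 5, 7, 2)

wm >= 5        # logical vector, TRUE where at least 5 were sold
sum(wm >= 5)   # TRUE counts as 1 and FALSE as 0
sum(wm2 >= 5)
```

The logical vector wm >= 5 plays the role of the vector s built inside the loop.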

While

Syntax

while (condition for looping) { do something }
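A minimal sketch of this syntax, where the number of iterations is not known in advance. The names value and steps are chosen only for this illustration:

```r
# hypothetical example: double a number until it is at least 100
value = 1
steps = 0
while (value < 100) {
  value = value * 2    # the action repeated at each iteration
  steps = steps + 1    # count the iterations
}
value   # first power of 2 that is not lower than 100
steps   # number of iterations performed
```

The condition is checked before each iteration, so the loop stops as soon as value < 100 becomes FALSE. If nothing inside the loop changed value, the loop would never end.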

Examples

Ex1: The owner of the grocery shop continues tracking the sales of watermelons, but the summer is nearly
over. His wm vector now contains 30 days of observations:

wm_tot = [10, 5, 4, 6, 8, 11, 3, 1, 5, 8, 1, 5, 7, 4, 9, 7, 2, 3, 9, 6, 2, 1, 2, 3, 4, 1, 0, 0, 2, 1].

His objective is the same: see on how many days he sold at least 5 watermelons. But this time he wants to
consider only the first 20 days, because they are the only reliable ones for his analysis. Write a code in R to
see how many times he manages to sell at least 5 watermelons in the first 20 days.
Solution
# Write a code in R to see how many times he manages to sell
# at least 5 watermelons in the first 20 days.
wm_tot = c(10,5,4,6,8,11,3,1,5,8,1,5,7,4,9,7,2,3,9,6,2,1,2,3,4,1,0,0,2,1)

# we solve it without defining a function:


# define another empty vector: it will contain 0 if fewer than 5 wm were sold that day, and 1 otherwise
s_tot = c()
# initialize i, our index
i = 1
while (i <= 20) { # i <= 20, so that day 20 is included
if(wm_tot[i]>=5) s_tot[i] = 1 else s_tot[i] = 0
i = i+1 # at each iteration we increase the value of i by 1
}
sum(s_tot)
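The number of days actually covered by a loop is easy to get wrong by one, so a quick check without a loop can be useful. A minimal sketch, selecting the first 20 elements explicitly:

```r
wm_tot = c(10,5,4,6,8,11,3,1,5,8,1,5,7,4,9,7,2,3,9,6,2,1,2,3,4,1,0,0,2,1)

first20 = wm_tot[1:20]   # select the first 20 days
length(first20)          # check: 20 elements
sum(first20 >= 5)        # days with at least 5 watermelons sold
```

The value returned by the last line should coincide with the one obtained from a loop that covers days 1 to 20.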

Take Home Exercises

1. Given the following vector ω


ω = (1, 7, 4, 7, 4, 8, 4, 2, 5, 8),
write a for loop which at each iteration displays "Greater than 4" if the element in position i is greater
than 4, and "Not greater than 4" otherwise. (*)
2. Print the same message using the same condition of Q1, but only for the first 4 elements.
3. Load the dataset iris, already existing in R. Add a variable called FlowerType to the dataset using a
loop according to the following rule: (*)

$$ FlowerType_i = \begin{cases} \text{Type1}, & \text{if } Sepal.Length_i \ge 5.4 \\ \text{Type2}, & \text{otherwise} \end{cases} $$
