‘Data Analytics & Visualisation’

Minor ‘Data Science’

Hogeschool Rotterdam, CMI
Week 1

Skills Data Scientist

Let us introduce
ourselves …

Statistics Visualization


a n d:
Practical Information
• Data Visualisation on Thursday 9:00 - 11:30

• Project lessons Thursday 12:00 - 16:00.

• Other days for homework, project group working meetings.

• We use Microsoft Teams to share learning materials and

communicate about the course.

• Assessment will be done with a written Exam and a

The final grade is the average of the two partial grades.


Big Data
Data Explosion
Insight, Decision, Action
Why Visualization?

We live in the age of Big Data.

Human beings process visual information
60,000 times faster than textual information!

Useful in two phases…

1. In exploration phase for yourself, trying to

understand and possibly find hidden patterns in the data set.

2. In presentation phase for your audience, trying to

communicate insights and trigger actions.

Static Infographics
Tangible Visualisations
Data Animations

Interactive Visualisations

In-Class Assignment 1.1

Look for interesting visualizations of the Corona-virus

pandemic, that is now sweeping over the world.
• What data was used for the Viz?
• What story is told by the Viz?

Truth or Beauty?
• Absolute values

• Relative values (ratios, percentages, per capita numbers)

• Cumulative values

• Logarithmic scale

• Aggregate

• Filter

• Summarize

• …

R for data collection, filtering, cleaning,

wrangling, slicing, dicing, munching,
crunching, modelling, …, graphing,
plotting and drawing.

Shiny for user interaction.

Assignment 1.2
Follow the tutorial on R Data Structures and Graphics.
Make notes of things you don’t understand.
Base Data Types
Numbers
   Discrete Numbers: Number of children, Floor in a building, …
   Continuous Numbers: Temperature, Time, Length, …

Text
   Western Script (UTF-8): “Rotterdam”
   Other Scripts: Chinese, Arabic, …

Categories: Man || Woman, Pass || Fail || Inconclusive…

Real World Data Types

Container Data Structures
• Vector: one dimension, elements have same data type

• List: one dimension, elements may have different data types

• Matrix: two dimensions, elements have same data type

• Dictionary: key, value pairs, values may have different

data types

• Table or Data.Frame: each column is a vector!

Interactive R Demo
Example Data Sets
• R contains a number of example Data Sets

• Display available Data Sets in R: > data()

• Once you have chosen one you can find its data

structure with: > str(data.set)

• And its description with: > help(data.set)


Exploring Large Data Sets

• To see the first few rows use: > head(data.set)

• To see the last rows use: > tail(data.set)

• To determine the size: > dim(data.set) (rows and columns; > length(data.set) only counts the columns of a data frame)

• Parts of the data can be selected with square brackets

data.set[…], e.g. > data.set[3, 4], > data.set[1:5, ] or > data.set["column name"]

• To get the contents of a single column (one attribute)

use: > data.frame$
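
The exploration commands above can be tried on the built-in data set women. This is only a minimal sketch; the variable names first.rows, size and heights are illustrative and do not come from the slides.

```r
# Explore the built-in data set 'women' (a data frame)
str(women)                      # structure: column names and data types
first.rows <- head(women, 3)    # the first three rows
size <- dim(women)              # number of rows, number of columns
heights <- women$height         # one column as a vector

women[3, 2]                     # row 3, column 2
women[1:5, ]                    # rows 1 to 5, all columns
women["weight"]                 # one column, still a data frame
length(women)                   # for a data frame: the number of columns
```

Note that length() on a data frame counts columns, which is why dim() or nrow() is usually the better way to ask how large a table is.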
Basic Plotting
• Graph or Scatter diagram: > plot(x,y)

• Histogram: > hist(x)

• Barplot: > barplot(x)

• Many more possibilities can be found in the package ggplot2 (see its reference manual,

ggplot2/ggplot2.pdf). We will come back to this
package later. For now, we use the plotting functions from base R.

Line Graph
Scatter Diagram
Histogram versus Bar Graph
Assignment 1.3
Install RStudio on your laptop. Explore some of the data
sets that are packaged with R.
• What container data structure, what data types?
• Which plots can be used to explore the data?

Install RStudio on your laptop
Homework Assignment: Iris
• Use the built-in data.frame “iris”. For all the plots, make sure that you have
human-readable titles and clear labelling (please don’t use just the variable names).

• Use > help(iris) to understand what attributes there are. Make sure that you
understand what they all mean.

• Make a histogram with > hist() with 20 bins of petal width for the Iris Setosa.

• Make a scatterplot of sepal length versus petal length. Show each of the three
species of iris on the same plot with a colored legend to separate them.

• Make a scatterplot of sepal length versus sepal width for all irises whose petal
width is larger than 1.5

• Make one more plot that shows something interesting about the differences
between the species of Irises.

If something is unclear or you need additional help please contact me!

Email: [email protected]

Or send me a msg on Teams!

Week 2

Any Questions, New Insights
Part 1 - Chapter 1:3

Recap …
base data types, vectors,
data.frames, subsetting,

Base Data Types

Discrete Numbers (whole numbers, count): Number of children, Floor in a building, …

Continuous Numbers (fractional numbers, weigh): Temperature, Time, Length, …

Text
   Western Script (UTF-8): “Rotterdam”
   Other Scripts: Chinese, Arabic, …

Logical (a.k.a.

Categories (in R: Factors with Levels): Man || Woman, Pass || Fail || Inconclusive…

Container Data Types

          Homogeneous                               Heterogeneous

0D        25

1D        Vector, c(2, 4, 6) or c("Amsterdam", …    List, list(1, "Rotterdam", …

2D        Matrix                                    Data.Frame

multi-D   Array
• Construct your own with: c(1, 2, 3)
• A vector can be given a name: my.vector <- c("A", "B", "C")
• The elements of a vector can also be named (e.g. built-in
data set islands). Use names() to manipulate them.

> str(islands)
Named num [1:48] 11506 5500 16988 2968 16 ...
- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...
> head(islands)
      Africa   Antarctica         Asia    Australia Axel Heiberg       Baffin
       11506         5500        16988         2968           16          184
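
Naming the elements of your own vector works the same way as in islands. A small sketch, assuming a made-up vector ages (the names and numbers are invented for illustration):

```r
# A vector whose elements have names (values are made up for illustration)
ages <- c(Anna = 31, Bram = 45, Chris = 27)

names(ages)                  # read the names
names(ages)[2] <- "Bas"      # change one of the names
ages["Bas"]                  # select an element by its name

# Name of the element with the largest value
oldest <- names(ages)[which.max(ages)]
```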


Data Frames (1 of 2)
• Construct your own with: data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• A data frame can be given a name: my.df <- data.frame(col1 = c(1,2,3), col2 = c("A", "B", "C"))
• The rows and columns of a data frame can also be named (e.g. built-in
data set women). Use rownames() or colnames()
to manipulate them.

Data Frames (2 of 2)
> str(women)
'data.frame': 15 obs. of 2 variables:
$ height: num 58 59 60 61 62 63 64 65 66 67 ...
$ weight: num 115 117 120 123 126 129 132 135 139 142 ...
> head(women)
  height weight
1     58    115
2     59    117
3     60    120   <- one row = one Observation
4     61    123
5     62    126
6     63    129


Filtering / Subsetting with

[…] and $
A Vector is one dimensional: select with one number, e.g. my.vector[3], or one
name, e.g. my.vector["First"], or
• Vector with selected elements, e.g. c(2, 4, 5)
• Vector with names
• Logical statement that is TRUE or FALSE for each element
A Data Frame is two dimensional: select with two numbers separated by a
comma, e.g. my.df[3, 1], or two names, e.g. my.df["First", "B"], or
• Two vectors with selected elements
• Two vectors with names
• Two logical statements that are TRUE or FALSE for each row and each column
• If you leave the space in front of or behind the “,” empty, everything
is selected in that dimension

For convenience it is possible to
select a complete column (which is,
by the way, a vector) of a
Data Frame with “$”.
E.g. weights <- women$weight
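
Selecting with a logical statement is the form of subsetting you will use most. A minimal sketch on the built-in data set women; the thresholds 150 and 70 are arbitrary choices for illustration:

```r
# Logical subsetting of a vector: keep the elements for which the test is TRUE
weights <- women$weight
heavy <- weights[weights > 150]

# Logical subsetting of a data frame: the condition before the comma
# selects rows, an empty position after the comma means "all columns"
tall <- women[women$height >= 70, ]

# Rows and a column at the same time
tall.weights <- women[women$height >= 70, "weight"]
```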

Predefined R Functions

> help()
> str()
> length()
> head()
> nrow()
> ncol()
> summary()
> table()
> plot()
> barplot()
> boxplot()

Each function has input (actual parameters), possibly default parameter values, and output (return value).
Side effects? E.g. generating a plot.

Explore real-world Data Sets




Import Data Sets from
files on the Internet
Data Structures in Memory
(Vector, Matrix, Data.Frame,
List, …) are different from Data
Structures stored in a File.
Data File Formats






Lots of File Formats

CSV: Comma Separated Values
JSON: JavaScript Object Notation

• Lists: ["Amsterdam", "Buenos Aires", "Cairo"]


• Dictionary: {"Name": "Jan Kroon", "City": "Utrecht", "Children": ["Lente",

• Lists of Dictionaries, Dictionaries containing Lists.

R Code Hints
• > read.csv() for reading “comma separated value” files (“.csv”).
• > read.csv2() variant used in countries that use a comma “,” as
decimal point and a semicolon “;” as field separator.
• > read.delim() for reading “tab-separated value” files (“.txt”). By
default, a point (“.”) is used as decimal point.
• > read.delim2() for reading “tab-separated value” files (“.txt”). By
default, a comma (“,”) is used as decimal point.
• > install.packages("rjson")
library("rjson")


json_file <- ""

json_data <- fromJSON(paste(readLines(json_file), collapse=""))
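
To see the difference between the two CSV flavours without downloading anything, the sketch below writes a small table to a temporary file and reads it back. The data frame cities and its values are made up for illustration.

```r
# Write a small data frame in the "continental" CSV format to a temporary file
# (the data values are made up for illustration)
cities <- data.frame(City = c("Rotterdam", "Utrecht"),
                     Temperature = c(12.5, 11.8))
csv.file <- tempfile(fileext = ".csv")
write.csv2(cities, csv.file, row.names = FALSE)

readLines(csv.file)           # semicolons between fields, decimal commas

# read.csv2() understands this format directly ...
good <- read.csv2(csv.file)

# ... and is the same as read.csv() with the separators spelled out
same <- read.csv(csv.file, sep = ";", dec = ",")
```

If the numbers in a column arrive as text instead of numeric values, a wrong field separator or decimal mark is the first thing to check.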

Visualisation to explore
data sets

What would you
like to explore?

Source: Dr. Andrew Abela “Chart Chooser”

2D Math / Stats plots
Line Plot
• Purpose: Explore the development of variable
values over time. Is the variable almost constant?
Does it increase or decrease? How fast does it
change: linear growth? explosive growth?
exponential growth?

• Example:
> time <- seq(from=0, to=10, by=0.1)
> growth <- time*time
> plot(x=time, y=growth, type="l")
visual comparison
over time

Trend Line

Ice Cream Sales

Seasonal / Cyclic Pattern
• Purpose: Get an impression of differences between

• Example: built-in data set precip

> barplot(precip, horiz = TRUE)

Stacked Barplot if the data is a matrix (convert a data.frame with as.matrix()).

visual comparison
between variables

• Purpose: Get an impression of the distribution of a
variable in a data set: Center, Quartiles, Outliers…?

• Example: built-in data set islands

> boxplot(islands)

visual inspection
of distribution

• Purpose: Get an impression of the distribution of a
variable in a data set: Symmetrical or Skewed?
Uniform distribution? Normal distribution? otherwise

• Example: built-in data set islands

> hist(islands, breaks=10)

visual inspection
of distribution

Scatter plot (Dutch: strooidiagram)

• Purpose: Compare two variables of the same
person or object (Dutch: proefpersoon of proefobject). Is
there a relationship?

• Example: built-in data set women

> plot(x=women$height, y=women$weight)

visual inspection
of relationship

3D plot
> persp(volcano)
Heat Map
> image(volcano)
Contour Graph (isobaren,
> contour(volcano)
Multiple Data Sets in a
single plot

• plot() creates a new plot

• points(), lines(), text(), legend(), …

add data to the existing plot.

• boxplot() with multiple arguments plots multiple

box-plots side by side.
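
As a sketch of how plot() and the adding functions work together, the example below draws two made-up growth curves in one plot. The variable names and the axis labels are illustrative only.

```r
# Two made-up growth curves over the same time axis
time <- seq(from = 0, to = 10, by = 0.5)
linear <- 10 * time
quadratic <- time^2

# plot() opens a new plot; ylim must leave room for both curves
plot(time, linear, type = "l", col = "blue",
     ylim = range(c(linear, quadratic)),
     main = "Linear versus quadratic growth",
     xlab = "Time (s)", ylab = "Growth")

# lines() and points() add data to the existing plot
lines(time, quadratic, col = "red")
points(time, quadratic, col = "red", pch = 20)

# legend() explains which colour belongs to which data set
legend("topleft", legend = c("linear", "quadratic"),
       col = c("blue", "red"), lty = 1)
```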

visual comparison
between variables

Geospatial Plots
London Cholera Map
Dr. John Snow, 1854

Coordinates on Earth
Longitude, Latitude
Different Projections …
… result in different maps.

with different properties:

Dates and Times

Time Zones
Date and Time formats
Quick Demo
> df <- data.frame(Date = c("10/9/2009 0:00:00", "10/15/2009
> as.Date(df$Date, "%m/%d/%Y %H:%M:%S")

> library(lubridate)
> mytime <- ymd_hms("2015-08-14-05-30-00", tz="America/Halifax")

> mytime

> # Leap year?

> leap_year(mytime)

> # Time differences?

> date1<-ymd_hms("2017-06-20-03-45-23")
> date2<-ymd_hms("2017-10-07-21-02-19")
> difftime(date2,date1)
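
The same kind of date handling is possible without extra packages. The sketch below uses only base R; the second date string in the first example is an assumption, chosen to match the format of the first one.

```r
# Parse month/day/year text into dates
# (the second date string is made up for illustration)
dates <- as.Date(c("10/9/2009 0:00:00", "10/15/2009 0:00:00"),
                 format = "%m/%d/%Y %H:%M:%S")
days.between <- as.numeric(dates[2] - dates[1])

# Date-times with an explicit time zone
date1 <- as.POSIXct("2017-06-20 03:45:23", tz = "UTC")
date2 <- as.POSIXct("2017-10-07 21:02:19", tz = "UTC")
gap <- difftime(date2, date1, units = "days")

# Print a date in day-month-year order
format(dates[1], "%d-%m-%Y")
```

Always state the format explicitly: "10/9/2009" is 9 October in the United States and 10 September in most of Europe.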

Full tutorial:

Advanced package
• tidyverse: The tidyverse is a set of packages that work in harmony because they share common data representations and API design

> install.packages("tidyverse")
> library("tidyverse")
• Contains a lot of packages that are useful in data science, for this Course

• Midwest dataset: a built-in dataset of package ggplot2

• Try on your own: 

> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))

• Midwest dataset: a built-in dataset of package ggplot2

• Try on your own: 

> data("midwest", package = "ggplot2")
> ggplot(midwest, aes(x=area, y=poptotal))
+ geom_point()

You need to specify what kind of graph you want to draw!

> g <- ggplot(midwest, aes(x=area, y=poptotal)) +
geom_point() + geom_smooth(method="lm")
# set se=FALSE to turn off confidence bands

# Delete the points outside the limits

> g + xlim(c(0, 0.1)) + ylim(c(0, 1000000))

In Class Assignment 2.1

1. Use (built-in) data set Seatbelts to visualize the impact
of introduction of the law to use seatbelts. (First turn
Seatbelts in data.frame format with data.frame(Seatbelts))

2. Use (built-in) data set LakeHuron to explore whether

there is a seasonal pattern in the water heights. What is
the trend?

3. Use the (built-in) dataset state.x77.

a) Make sure the object is a data frame; if not, change it to a data frame.
b) Find out how many states have an income of less than 4300.
c) Find out which is the state with the highest income.
d) What are possible causes of high murder rates?

Week 3

Seatbelts Data Set

Seatbelts Data Set (continued)
Exercise 2: 

a) With the dataset swiss, create a data frame of only the rows 1, 2, 3, 4, 5, 6, 7 and only the
variables Examination, Education and Infant.Mortality.

> tm <- swiss[1:7, c('Examination', 'Education', 'Infant.Mortality')]
b) Create a row that will be the total sum of each column, name it Total.

> tm["Total", ] <- colSums(tm)
c) Create a new variable swissbe that will be the proportion of
Examination (Examination / Total).
> tm$swissbe <- tm$Examination / tm$Examination[length(tm$Examination)]

Exercise 3
For the dataset state.x77 

a. Remove column Frost 

> sta <- data.frame(state.x77)
> keep <- c("Population", "Income", "Illiteracy", "Life.Exp",
"Murder", "HS.Grad", "Area")
> sta <- sta[keep]

b. Add a variable to the data frame which should categorize the level of illiteracy: [0,1) is low,
[1,2) is some, [2, inf) is high.
>sta$illlvl <- ifelse(sta$Illiteracy<1, 'low',ifelse(sta$Illiteracy
<2, 'some', ifelse(sta$Illiteracy>=2,'high', NA)))
Different data structures
Data Cleaning
How is it done in R?
Data transformation
> install.packages('dplyr')
This package will allow you to manipulate the data easily.

An example:

There is a dataset in Teams which we will use to work on

> df <- read_delim("heartatk4R.txt")

What message do you see on the screen? How can we fix it?

df <- read_delim("heartatk4R.txt",
Data transformation

# %>% is the pipe operator; data is sent to the next step

# arrange(): sort in ascending order; desc(AGE) for descending order

# mutate(): adds new variables and preserves existing ones

# filter(): filter the dataset according to conditions given

Data transformation

Can you explain to me what this code does?

Missing values
df <- df %>% drop_na()
But this might lead to problems such as:

- data bias

- data loss
Possible solutions:

- add more data: (calculate mean, median), (linear regression), many more …

By now you should know how to identify whether your data has any outliers.

How to deal with outliers?

1. Remove the rows with outliers from your data set

2. Consider outliers & inliers separately

3. Remove & replace via imputation
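
A small base R sketch of two of these ideas: filling in a missing value with the median, and finding outliers with the same rule that boxplot() uses for its whiskers. The measurements in x are made up for illustration, and median imputation is just one of the possible choices.

```r
# Made-up measurements with one missing value and one suspicious value
x <- c(12, 15, 14, NA, 13, 16, 95, 14)

# Missing values: replace NA by the median of the known values
x.filled <- x
x.filled[is.na(x.filled)] <- median(x, na.rm = TRUE)

# Outliers: boxplot.stats() reports the points beyond the whiskers
outliers <- boxplot.stats(x.filled)$out

# Option 1 from the list: drop the outliers
x.clean <- x.filled[!(x.filled %in% outliers)]
```

The median is used here instead of the mean because the mean itself would be pulled upwards by the suspicious value 95.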

Remove duplicates
Modify Data Elements:
> dutch.number
> # Cast number to a character string
> d.n <- as.character(dutch.number)
> # Substitution
> result <- gsub(",", ".", d.n)
> # Cast character string to num
> international.number <- as.numeric(result)
Sorting a data.frame by
> head(mtcars)
> # Sort a column (a vector)
> sort(mtcars$mpg)
> # Sort the whole data.frame
> order(mtcars$mpg)
> mtcars[order(mtcars$mpg),]
> mtcars[order(-mtcars$mpg),]
> mtcars[order(mtcars$mpg,
-mtcars$cyl), ]
Merging two data.frames
> head(area, 3)
  Continent     Country Land.Area.2013
1    Europe Netherlands          16164
2    Europe     Belgium          11787
3    Europe      France         210026
> head(inhab, 3)
  Continent     Country Inhabitants.2016
1    Europe Netherlands         16987330
2    Europe     Belgium         11358379
3    Europe      France         64720690
> merge(area, inhab)
  Continent Country Land.Area.2013 Inhabitants.2016
1      Asia   China        3700000       7466964280
2      Asia   India        1240000       1324171354
3    Europe Belgium          11787         11358379
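
The data frames area and inhab are not built into R. The sketch below rebuilds a three-row version of each from the head() output above, so that merge() can be tried directly; the column Density is an added illustration.

```r
# Two small data frames that share the columns Continent and Country
# (values copied from the head() output above)
area <- data.frame(Continent = "Europe",
                   Country = c("Netherlands", "Belgium", "France"),
                   Land.Area.2013 = c(16164, 11787, 210026))
inhab <- data.frame(Continent = "Europe",
                    Country = c("Belgium", "France", "Netherlands"),
                    Inhabitants.2016 = c(11358379, 64720690, 16987330))

# merge() matches rows on the columns both data frames have in common,
# so the different row order in 'inhab' does not matter
both <- merge(area, inhab)

# A relative value: inhabitants per unit of land area
both$Density <- both$Inhabitants.2016 / both$Land.Area.2013
```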
Aggregating data
> aggregate(sales$Cars.Sold, list(sales=sales$Year), sum)
sales x
1 2001 209
2 2002 209
3 2003 209
> aggregate(sales$Cars.Sold, list(sales=sales$Month), sum)
sales x
1 April 54
2 August 27
3 December 48
4 February 21
5 January 36
6 July 51
7 June 99
8 March 45
9 May 75
10 November 54
11 October 69
12 September 48
> df[ , 2]
> df[2, ]
> df[2, 2]
> df[df$var1 == "Male", ]
> subset(df, df$var1 != "Female")
Add trend lines

> abline(a=0, b=1, col="blue")

a denotes the intercept

b denotes the slope

y = a + b*x
Homework / In class
Week 4

Explore real-world Data Sets





Practical Problems …
You can’t find data: CBS, Kaggle, collections, Google Data Set Search

Data is polluted, in wrong format: clean!, gsub(), use lubridate library for
Dates and Times, use options of read.csv(): stringsAsFactors = FALSE

Data is distributed over multiple Data Frames: use merge()

Data is distributed over multiple Data


Data is too detailed, need summaries,

totals per category.
Data Transformations:
filtering, sorting, wrangling, slicing,
dicing, munching, crunching,
merging, aggregating, …

Matching of arguments
R function arguments can be matched positionally or by
name. So the following calls to sd() are all equivalent:
> mydata <- rnorm(100)
> sd(mydata)
> sd(x = mydata)
> sd(x = mydata, na.rm = FALSE)
> sd(na.rm = FALSE, x = mydata)
> sd(na.rm = FALSE, mydata)

Even though it’s legal, messing around with the order of the arguments

too much is discouraged, since it confuses fellow programmers.

Visualisation to
present data sets
Predefined R Functions
> help()
> str()
> head()
> nrow()
> ncol()
> summary()
> plot()
> install.packages()
> library()
> merge()
> aggregate()
…

Each function has input (actual parameters), possibly default values, and output (return value).
Side effects?

User Defined R Functions
Functions can be created using the function() keyword and are stored as R
objects just like anything else. In particular, they are R objects of class “function”.
f <- function(<formal parameters>) {
    # function body
}

Functions in R are “first class objects”, which means that they can be treated
much like any other R object.
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside of another function.
• The return value of a function is the last expression in the function body to be evaluated.
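
Each of these properties can be tried in a few lines. The function names rescale, apply.twice and make.adder below are made up for this sketch.

```r
# A function with one required parameter and one default parameter value
rescale <- function(x, maximum = 100) {
  x / max(x) * maximum       # the last evaluated expression is the return value
}

# Functions can be passed as arguments to other functions
apply.twice <- function(f, x) {
  f(f(x))
}

# Functions can be defined inside other functions (and returned by them)
make.adder <- function(n) {
  function(x) x + n
}
add.five <- make.adder(5)

rescale(c(1, 2, 4))          # 25 50 100
apply.twice(sqrt, 16)        # 2
add.five(10)                 # 15
class(rescale)               # "function"
```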



Shiny for User Interaction

User Interface
Shiny generates HTML
Data-driven Outputs
Recap UI
Server Function
See what this code does
library(shiny)

ui <- fluidPage(
  selectInput("dataset", label = "Dataset", choices = ls("package:datasets")),
  verbatimTextOutput("summary"),
  tableOutput("table")
)

▪ fluidPage() is a layout function that sets up the basic visual structure of the page.
▪ selectInput() is an input control that lets the user interact with the app
by providing a value. In this case, it’s a select box with the label “Dataset”
and lets you choose one of the built-in datasets that come with R.
▪ verbatimTextOutput() and tableOutput() are output controls that tell Shiny where to put
rendered output. verbatimTextOutput() displays code and tableOutput() displays tables.

See what this code does

server <- function(input, output, session) {
  output$summary <- renderPrint({
    dataset <- get(input$dataset, "package:datasets")
    summary(dataset)
  })

  output$table <- renderTable({
    dataset <- get(input$dataset, "package:datasets")
    dataset
  })
}
shinyApp(ui, server)

See what happens if you remove red or green box !

Reactive programming
Reactive programming is another programming paradigm: it is

programming with asynchronous data streams.

You are able to create data streams of anything, not just from click and hover events.
Streams are cheap and ubiquitous, anything can be a stream: variables, user inputs,
properties, caches, data structures, etc. 

 The key idea of reactive programming is to specify a graph of dependencies so that

when an input changes, all related outputs are automatically updated. 
The input argument is a list-like object that contains all the input data sent
from the browser, named according to the input ID. For example, if your UI
contains a numeric input control with an input ID of count, like so:
ui <- fluidPage(
  numericInput("count", label = "Number of values", value = 100)
)

then you can access the value of that input with input$count. It will
initially contain the value 100, and it will be automatically updated as the
user changes the value in the browser.

Unlike a typical list, input objects are read-only. If you attempt to

modify an input inside the server function, you’ll get an error:
server <- function(input, output, session) {
  input$count <- 10
}

shinyApp(ui, server)
#> Error: Can't modify read-only reactive value

This error occurs because input reflects what’s happening in the
browser, and the browser is Shiny’s “single source of truth”. If you
could modify the value in R, you could introduce inconsistencies,
where the input slider said one thing in the browser,
and input$count said something different in R.


One more important thing about input: it’s selective about who is allowed to read it.
To read from an input, you must be in a reactive context.



Read this:
Page Lay-out with Panels
ui1 <- fluidPage(
  titlePanel("EduCode 'Functions of two variables'"),
  selectInput(inputId = 'chosen.function', label = 'Function description: ',
              choices = c('f(x,y) = x + y', 'f(x,y) = x * y',
                          'f(x,y) = x^2 + y^2',
                          'f(x,y) = 100*sin(x + y)/sqrt(x^2 + y^2)')),
  sliderInput(inputId = 'angle', label = '3D view angle: ', min = 0,
              max = 360, value = 90),
  # tabsetPanel() is assumed here: tabPanel() needs a container,
  # and the original layout calls did not survive on the slide
  tabsetPanel(
    tabPanel("3D Plot", plotOutput("Three.D.plot")),
    tabPanel("Contour Graph", plotOutput("Contour.graph"))
  )
)

more to explore:

Create an app that greets the user by name. You don’t know all the functions you need to do this
yet, so I’ve included some lines of code below. Think about which lines you’ll use and then copy
and paste them into the right place in a Shiny app.
library(shiny)

ui <- fluidPage(textInput("name", "What's your name?"), textOutput("greeting"))

server <- function(input, output, session) {
  output$greeting <- renderText({
    paste0("Hello ", input$name)
  })
}

Suppose your friend wants to design an app that allows the user to set a
number (x) between 1 and 50, and displays the result of multiplying this
number by 5. This is their first attempt:

Data storytelling
Data storytelling is the concept of building a
compelling narrative based on complex data
and analytics that help tell your story and
influence and inform a particular audience.
•Adding value to your data and insights.

•Interpreting complex information and

highlighting essential key points for the audience.

•Providing a human touch to your data.

•Offering value to your audience and industry.

The star of
the show: DATA

•Think about your theory. What do you want to prove or disprove? What do you think the
data will tell you?
•Collect data. Collate the data you’ll need to develop your story.
•Define the purpose of your story. Using the data you gathered, you should be able to write
what the goal of your story is in a single sentence.
•Think about what you want to say. Outline everything from the intro to the conclusion.
•Ask questions. Were you right or wrong in your hypothesis? How do these answers shape the
narrative of your data story?
•Create a goal for your audience. What actions would you like them to take after reading
your story?

And this is where the Data

Visualisation comes in
•Reveal patterns, trends, and findings from an
unbiased viewpoint.
•Provide context, interpret results, and articulate
•Streamline data so your audience can process
•Improve audience engagement.

Build your narrative

As you tell your story, you need to use your data as
supporting pillars to your insights. Help your audience
understand your point of view by distilling complex
information into informative insights. Your narrative and
context are what will drive the linear nature of your data story.

Use visuals to enlighten
Visuals can help educate the audience on your theory. When
you connect the visual assets (charts, graphs, etc.) to your
narrative, you engage the audience with otherwise hidden
insights that provide the fundamental data to support your
theory. Instead of presenting a single data insight to support
your theory, it helps to show multiple pieces of data, both
granular and high level, so that the audience can truly
appreciate your viewpoint.

Show data to support
Humans are not naturally attracted to analytics, especially
analytics that lack contextualization using augmented
analytics. Your narrative offers enlightenment, supported by
tangible data. Context and critique are integral to the full
interpretation of your narrative. Using business analytic
tools to provide key insights and understanding to your
narrative can help provide the much-needed context
throughout your data story.

A Good Plot contains …
1. Title (What is plotted?)

2. Axis titles including Units

3. Numbers on all axes

4. Legend labeling of all lines or dots if more than one

5. Legible (Colours visible on screen, on projection and in print)

6. Source (Where do underlying Data Sets come from?)

Avoid Data
Stephen Few’s pitfalls
1. Exceeding the boundaries of a single screen
2. Supplying inadequate context for the data
3. Displaying excessive detail or precision
4. Expressing measures indirectly
5. Choosing inappropriate media of display
6. Introducing meaningless variety
7. Using poorly designed display media
8. Encoding quantitative data inaccurately
9. Arranging the data poorly
10. Ineffectively highlighting what’s important
11. Cluttering the screen with useless decoration
12. Misusing or overusing color
13. Designing an unappealing visual display

Levels of Understanding
1. Describe data sets (Descriptive Statistics /
Summary Statistics, Visualization)

2. Understand, explain some relations between

variables (Inferential Statistics, Detect patterns)

3. Predict new, unseen, future values!

4. “What If …” analysis, predict effect of possible


New Terminology
• Feature: What we called Variable

• Label: Variable that we are interested in

• Regression: Predict future numerical outcomes based

on historical data

• Classification: Predict future categorical outcomes

based on historical (labelled) data

• Clustering: Group (unlabelled) Observations in clusters


You can reduce the dimensionality
(number of variables) by selecting the
most significant variables (variables that
have the most influence on the outcome).
You can see the relation between two
numerical variables in a scatter
diagram. You can calculate this relation
with Pearson’s correlation coefficient.

Pearson’s Correlation Coefficient
is a measure of linear correlation between two sets of data
• Strong / Weak?

• Direction?

• Linear / Non-linear?

• Pearson’s correlation
coefficient, number
between -1 and +1.
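
In R the coefficient is computed with cor(). The sketch below uses the built-in data set women and also recomputes the value from its definition (covariance divided by the product of the standard deviations); the parabola at the end is a made-up example.

```r
# Pearson's correlation coefficient for height and weight
r <- cor(women$height, women$weight)

# The same number computed from its definition
r.manual <- cov(women$height, women$weight) /
  (sd(women$height) * sd(women$weight))

# A strong but non-linear relation can still have a coefficient of 0:
# y = x^2 on a range that is symmetric around 0
x <- -5:5
r.parabola <- cor(x, x^2)
```

This is why the coefficient only tells you about linear relations: always look at the scatter diagram as well.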

Nice Viz:


How to discover

• Make some plots! Of course …

• Typically make scatter plots of each pair of variables


• Can you see a relationship? Weak or Strong?

Demo: Wine Quality

• R makes this easy: pairs()
Correlation does NOT
imply Causation
Types of causation
If A and B are correlated then there are different possibilities for

• A causes B

• B causes A

• C causes A and B (‘lurking factor’)

• A causes C which causes B (or vice versa, indirect causation)

• A causes B and B causes A (cyclic or bi-directional)

• There is no connection between A and B at all (coincidence)


How to discover
• Make some plots! Of course …

• Typically make scatter plots of each pair of variables


• Can you see a relationship? Weak or Strong?

• Describe the relationship using functions (not only

linear: y = a.x + b but also quadratic, exponential)

R demo: USJudgeRatings

• R makes this easy: pairs()

In general, plot the independent values on the
(horizontal) x-axis and the dependent values on the
(vertical) y-axis.

y = f(x)

Multivariate dependencies:

z = f(x, y)

Some families of functions

• Linear, one variable: f(x) = a*x +b

• Linear, multiple variables: f(x,y) = a*x + b*y + c

• Polynomial (quadratic, cubic, …): f(x) = a*x^2 + b*x + c,

f(x) = a*x^3 + b*x^2 + c*x + d

• Exponential: f(x) = 10^x

• Logarithmic: f(x) = log(x)

• Gaussian:

Why we need to be quantitative

Later on we are going to try to use some variables to predict
others; this requires fitting a sensible function to the
available data.

These problems come in three main categories

1. You have a theoretical model for how the variables
should be related
2. You have no theoretical model and have to guess
something from the data
3. A combination of the two due to e.g. some unexpected


Guessing functions
Often there is not a single right answer.

Which function is good enough?

• Needs to describe the major features of the data
• Should be minimal, as simple as will work
• May well not be unique, you can try fitting multiple
functional forms and see which works best.


What features count …

Things to look for and check

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes


Be creative with scale on y-axis

Interpolation and
Horse Manure Crisis (1894)
Interpolation and Prediction
• Interpolation is estimating a value between two
nearest known data points.

• Extrapolation (or Prediction if the Data Set is a

time series) is estimating a value outside the range
of the Data Set using all data points.

• The problem with Extrapolation / Prediction is that
there will always be a trend break (Dutch: trendbreuk)
somewhere in the future, but it is unknown when.
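
The difference can be made concrete with the built-in data set women, whose heights run from 58 to 72. This is only a sketch; the heights 62.5, 80 and 20 are arbitrary values picked for illustration.

```r
# Interpolation: estimate the weight at height 62.5,
# halfway between the two nearest known data points
inter <- approx(x = women$height, y = women$weight, xout = 62.5)$y

# Extrapolation: fit a line through all data points and evaluate it
# outside the observed range of heights
fit <- lm(weight ~ height, data = women)
extra <- predict(fit, newdata = data.frame(height = c(80, 20)))
```

The interpolated value lies between its two neighbours. The extrapolated weight for height 20 comes out negative, which shows how quickly a fitted trend stops making sense outside the range of the data.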

Fitting a function
Linear Regression
• We want to get a function that describes our data well,
but we know that there are some uncertainties that
cause some scatter in the data points.

• Linear function of one variable:

> fit1D <- lm(y ~ x) or glm(y ~ x)

• Linear function of two variables:

> fit2D <- lm(y ~ x + z) or glm(y ~ x + z)

• Get some statistics on how good the fit is:

> summary(fit1D)
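
A runnable version of these steps, sketched on the built-in data set women (weight as a linear function of height). The plot title and axis labels are illustrative.

```r
# Fit weight as a linear function of height
fit1D <- lm(weight ~ height, data = women)

co <- coef(fit1D)              # intercept a and slope b of y = a + b*x
summary(fit1D)$r.squared       # fraction of the variance explained by the fit

# Draw the data and add the fitted line
plot(women$height, women$weight,
     main = "Weight versus height (data set women)",
     xlab = "Height (in)", ylab = "Weight (lb)")
abline(fit1D, col = "blue", lwd = 2)
```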


R: Linear Fit, one variable

# linear fit (one variable)

fit <- glm(y ~ x)

co <- coef(fit)

# draw the data first, then add the fitted line
plot(x, y)
abline(fit, col = "blue", lwd = 2)

R: Linear Fit, more variables

# linear fit (two variables)

fit <- glm(y~x+z)

co <- coef(fit)

persp(fit, …)
Non-Linear Regression
• We want to get a function that describes our data
well, but we know that there are some uncertainties
that cause some scatter in the data points.

• Non-Linear function: nls(y ~ f(x), data = …,

start = list(p0 = …, p1 = …, …))

• (Calculating sensible starting parameters will make

your life easier.)
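
One way to obtain sensible starting parameters for an exponential function is to fit a straight line to log(y) first. The sketch below does this on simulated data; the "true" values a = 2 and b = 0.5, the noise level and all variable names are assumptions made for the example.

```r
# Simulated data: y = 2 * exp(0.5 * x) plus some noise (made-up example)
set.seed(1)
x <- seq(0, 5, by = 0.25)
y <- 2 * exp(0.5 * x) + rnorm(length(x), sd = 0.2)
d <- data.frame(x = x, y = y)

# Starting values from a straight-line fit of log(y) against x,
# because log(y) = log(a) + b*x when y = a * exp(b*x)
start.fit <- lm(log(y) ~ x, data = d)
start <- list(a = exp(coef(start.fit)[[1]]), b = coef(start.fit)[[2]])

# Non-linear least squares fit of y = a * exp(b * x)
fit <- nls(y ~ a * exp(b * x), data = d, start = start)
co <- coef(fit)
```

With starting values far from the solution, nls() may stop with an error instead of converging, which is why this extra step is worth the effort.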

R : Polynomial Fit
# polynomial fit

f <- function(x, a, b, c) {a*x^2 + b*x + c}

fit <- nls(y ~ f(x, a, b, c), data = …, start = c(a = 1, b = 1, c = 1))

co <- coef(fit)

curve(f(x, a = co[1], b = co[2], c = co[3]), add = TRUE, col = "pink", lwd = 2)
R : Exponential Fit
# exponential fit

f <- function(x, a, b) {a*exp(b*x)}

fit <- nls(y ~ f(x, a, b), data = …, start = c(a = 1, b = 1))

co <- coef(fit)

curve(f(x, a = co[1], b = co[2]), add = TRUE, col = "green", lwd = 2)
R: Logarithmic Fit
# logarithmic fit

f <- function(x, a, b) {a*log(x) + b}

fit <- nls(y ~ f(x, a, b), data = …, start = c(a = 1, b = 1))

co <- coef(fit)

curve(f(x, a = co[1], b = co[2]), add = TRUE, col = "orange", lwd = 2)
Under Fitting and Over Fitting
• If you fit a straight line through data points with a
non-linear functional relationship, then you will not
be able to describe the behavior of the data well.
This is called Under Fitting.

• If you define a suitably complex function, you can
get it to pass through all your data points (like with
the splines). However, this does not mean that the
features in your function really exist! They are
probably caused by statistical noise. This is called
Over Fitting.

Check for Under / Over Fitting

• Look at the data points and the fit and use your


• Ask yourself if the fit describes the data well?

• Ask yourself if the function you have used is the

simplest one that could describe the data?

• Test the fit on a subset of the data (training set)


What is the best fit?

A. Visual inspection: choose the simplest function that looks good.

B. Separate the available data into ‘training data’ (> 90%) used to fit a model, and ‘test data’ (the rest) to test the model. Use the least squares method to calculate the distance between predictions (values calculated with the fitted function) and observations (measured values) of the test data.
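Option B can be sketched as follows. The data are simulated, and the two candidate models (a straight line and a quadratic) as well as the helper `sse` are assumptions chosen for illustration; any pair of fitted models could be compared the same way.

```r
# Simulated data (assumption): a quadratic relationship plus noise
set.seed(42)
n <- 200
d <- data.frame(x = runif(n, 0, 10))
d$y <- 0.5 * d$x^2 - 2 * d$x + 3 + rnorm(n, sd = 2)

# Split: 90% training data, the rest test data
train_idx <- sample(n, size = round(0.9 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# Fit two candidate models on the training data only
fit_line <- lm(y ~ x, data = train)
fit_quad <- lm(y ~ poly(x, 2), data = train)

# Least squares on the test data: sum of squared prediction errors
sse <- function(fit, newdata) {
  sum((newdata$y - predict(fit, newdata = newdata))^2)
}
sse_line <- sse(fit_line, test)
sse_quad <- sse(fit_quad, test)
print(c(line = sse_line, quadratic = sse_quad))
```

The model with the smaller sum of squares on the test data describes unseen observations better; a model that only does well on the training data is likely over fitted.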

“With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”

–John von Neumann

Residual deviance

• A residual is the difference between the observed value y and the predicted/expected value ŷ.

• Residual deviance adds up all residuals after making them positive, so that positive and negative residuals do not cancel out. For least-squares fits, R reports the sum of squared residuals as the residual deviance. The higher the residual deviance, the worse the fit.
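A small sketch of these quantities, using R's built-in `cars` data set (the choice of data and of the model `dist ~ speed` is an assumption for illustration):

```r
# Built-in 'cars' data set: speed and stopping distance
fit <- glm(dist ~ speed, data = cars)   # Gaussian family by default

res <- cars$dist - fitted(fit)          # residual = observed - predicted

sum_abs <- sum(abs(res))                # sum of absolute residuals
sum_sq  <- sum(res^2)                   # sum of squared residuals

# For a Gaussian glm, deviance() equals the sum of squared residuals
print(c(sum_abs = sum_abs, sum_sq = sum_sq, deviance = deviance(fit)))
```

Both sums grow when the fit gets worse; squaring penalises large residuals more heavily than taking absolute values.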

R² vs Pearson correlation R

• Briefly describe the difference between the two concepts above.
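As a starting point for the comparison, the sketch below computes both quantities for the built-in `cars` data (an assumed example data set). For a straight-line fit with a single predictor, R² equals the square of the Pearson correlation; for other models this shortcut does not hold in general.

```r
# Pearson correlation R between two variables
r <- cor(cars$speed, cars$dist)

# R squared of a straight-line fit of dist on speed
r2 <- summary(lm(dist ~ speed, data = cars))$r.squared

print(c(R = r, R_squared = r2, R_times_R = r^2))
```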

Homework: three simple

Data Analytics Process

[Diagram: Research Question? · Find Data Sets · Clean (Tidy) · Visualize · Model · Target Audience?]


What do we do with other types of data?

1. Image datasets
2. Natural Language

Photos and Movies:

Image Processing
How do machines store images?
R, G, B (and Alpha) channels
1. Grayscale pixel values
2. Mean pixel value of
3. Extract edge features
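The three feature ideas listed above can be sketched in base R on a tiny synthetic image. The image itself, the reading of item 2 as "mean pixel value of each channel", and the simple neighbour-difference edge measure are all assumptions made for illustration; real images would be loaded with an image package first.

```r
# A tiny synthetic RGB image (assumption): 4 x 6 pixels, values in [0, 1].
# The left half is dark, the right half is bright.
img <- array(0.1, dim = c(4, 6, 3))   # rows x columns x channels (R, G, B)
img[, 4:6, ] <- 0.9

# 1. Grayscale pixel values: average the three channels per pixel
gray <- apply(img, c(1, 2), mean)

# 2. Mean pixel value of each channel (assumed reading of the slide)
channel_means <- apply(img, 3, mean)

# 3. A simple edge feature: absolute difference between horizontal neighbours
edges <- abs(gray[, -1] - gray[, -ncol(gray)])

print(gray)
print(channel_means)
print(edges)
```

The edge matrix is large only between columns 3 and 4, where the image jumps from dark to bright.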
Image processing
• Segment an image into useful regions

• Perform measurements on certain areas

• Determine what object(s) are in the scene

• Calculate the precise location(s) of objects

• Visually inspect a manufactured object

• Construct a 3D model of the imaged object

•  Find “interesting” events in a video

Magick package
Natural Language Processing
NLP- many problems

§ Enraged Cow Injures Farmer with Ax

§ Teacher Strikes Idle Kids

§ Hospitals Are Sued by 7 Foot Doctors

§ Ban on Nude Dancing on Governor’s Desk

§ Iraqi Head Seeks Arms

§ Stolen Painting Found by Tree

§ Kids Make Nutritious Snacks

§ Local HS Dropouts Cut in Half

NLP- even more ideas to solve it

• Term Frequency (TF) is the ratio of the number of times a word occurs in a document to the total number of words in the document.

• Inverse Document Frequency (IDF) is the logarithm of (total number of documents divided by number of documents containing the word).

New Feature TF-IDF is

• Highest when a word occurs many times within a small number of documents (thus lending high discriminating power to these documents)
• Lower when the term occurs in many documents (thus offering a less pronounced relevance signal)
• Lowest when the term occurs in virtually all documents
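TF-IDF is the product of TF and IDF. The definitions above can be sketched in base R as follows; the three toy documents and all variable names are assumptions for illustration, and a real project would typically use a text-mining package instead.

```r
# Three toy documents (assumption, for illustration only)
docs <- c(d1 = "the cat sat on the mat",
          d2 = "the dog chased the cat",
          d3 = "the bird sang")
tokens <- strsplit(docs, " ")
vocab  <- sort(unique(unlist(tokens)))

# TF: count of the word in the document / total words in the document
tf <- t(sapply(tokens, function(w) {
  as.numeric(table(factor(w, levels = vocab))) / length(w)
}))
colnames(tf) <- vocab

# IDF: log(total number of documents / number of documents with the word)
n_docs_with_word <- colSums(tf > 0)
idf <- log(length(docs) / n_docs_with_word)

# TF-IDF: multiply each TF column by the IDF of that word
tfidf <- sweep(tf, 2, idf, `*`)
print(round(tfidf, 3))
```

The word "the" occurs in every document, so its IDF is log(1) = 0 and its TF-IDF is zero everywhere; a word such as "mat", which occurs in only one document, gets a comparatively high score there.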
Grammar is the way in which words are
put together to form proper sentences
What do they all have in common?

All of them use feature extraction

The Curse of Dimensionality
Often a Data Set has so many features (variables) that a Data Analyst does not know where to begin: he/she suffers from The Curse of Dimensionality.

Up to now we looked at Feature Selection to reduce the number of variables (dimensionality reduction): Which features contribute the most to the studied effect?

It is sometimes better not to select existing features (variables), but to construct new features from the existing features. We call this Feature Extraction.


“How does the dependent variable change when the independent variable(s) change?”
y = b0 + b1*x + e, where:
• b0 and b1 are known as the regression beta coefficients or parameters:
◦ b0 is the intercept of the regression line; that is, the predicted value when x = 0.
◦ b1 is the slope of the regression line.
• e is the error term (also known as the residual errors), the part of y that cannot be explained by the regression model.
Regression in R
The residuals are the difference between the actual values and the predicted values.

So how do we want to interpret this?

Median should be around 0, as we want our prediction to be symmetrical for both
Q - Q Plot
A Q–Q plot (quantile-quantile plot) is a probability plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.

> qqnorm(cars$speed, pch = 1, frame = FALSE)

> qqline(cars$speed, col = "steelblue", lwd = 2)
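For checking a regression, the same two calls are often applied to the residuals of a fitted model rather than to a raw variable. The sketch below assumes the built-in `cars` data and the model `dist ~ speed`; both are illustrative choices.

```r
# Fit a straight line to the built-in 'cars' data (assumed example model)
model <- lm(dist ~ speed, data = cars)
res <- residuals(model)

# Q-Q plot of the residuals against a normal distribution
qq <- qqnorm(res, pch = 1, frame = FALSE)

# Reference line through the quartiles of the SAME sample
qqline(res, col = "steelblue", lwd = 2)
```

If the points follow the reference line closely, the residuals are approximately normally distributed; systematic curvature at the ends points to skewed or heavy-tailed residuals.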

Regression in R
Residual standard error: The residual standard error is a measure of how well the model fits the data.

R-squared: It tells us what percentage of the variation within our dependent variable is explained by the independent variable. In other words, it’s another method to determine how well our model is fitting the data.
Regression in R
With linear regression we are building a linear model of

y = b0 + b1*x
y = 0.16557(x) + 8.28931

The standard error of a coefficient tells us how much uncertainty there is in that coefficient. The standard error is often used to create confidence intervals.
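The quantities discussed on the regression slides can all be read from a fitted model object. The sketch below assumes the built-in `cars` data and a model of `speed` as a function of `dist`; the slides do not state which model was used, so treat this choice as an assumption.

```r
# Built-in 'cars' data; modelling speed from dist is an assumption here
model <- lm(speed ~ dist, data = cars)
s <- summary(model)

print(coef(model))                   # intercept b0 and slope b1
print(median(residuals(model)))      # should be close to 0
print(s$sigma)                       # residual standard error
print(s$r.squared)                   # R-squared
print(s$coefficients)                # estimates with their standard errors
print(confint(model, level = 0.95))  # confidence intervals for b0 and b1
```

`confint` builds the intervals from the coefficient estimates and their standard errors, which is the use of the standard error mentioned above.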

Reminder: how did we do it so far

Extract new feature: BodyMassIndex <- Weight / Height^2

Add column with Feature

Use this new feature for predictions
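The three steps above might look like this in R. The data frame `people`, its values, the units (kg and m) and the threshold of 25 are hypothetical and only serve to make the sketch runnable.

```r
# Hypothetical measurements (assumption): weight in kg, height in m
people <- data.frame(Weight = c(60, 75, 90),
                     Height = c(1.65, 1.80, 1.75))

# Extract the new feature and add it as a column
people$BodyMassIndex <- people$Weight / people$Height^2

# Use the new feature, here simply to flag values above an
# illustrative threshold of 25 (a stand-in for a real prediction model)
people$Above25 <- people$BodyMassIndex > 25
print(people)
```

One extracted feature now replaces two original variables, which is the dimensionality reduction that Feature Extraction aims for.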

All together

With what we have learned so far we can distinguish 7 main properties when it comes to data visualization.
The basis: first three of seven elements
• Data: the actual variables to be

• Aesthetics: visual characteristics that represent data, e.g. position, size, color, shape, transparency

• Geometries: the shapes we use to represent our data


Three more, advanced

• Facets: rows and columns of

• Statistics: summaries and mathematical models

• Coordinates: the plotting space we are using

Finally add design element.

• Theme: non-data (meta-data or eye candy)
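These seven elements correspond closely to the layers of the ggplot2 package. The slides do not name a package here, so using ggplot2, the built-in `mtcars` data and the specific layer choices below are assumptions for illustration.

```r
library(ggplot2)  # assumption: ggplot2 is installed

p <- ggplot(mtcars,                                    # Data
            aes(x = wt, y = mpg,
                colour = factor(cyl))) +               # Aesthetics
  geom_point() +                                       # Geometries
  facet_wrap(~ am) +                                   # Facets
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                            # Statistics
  coord_cartesian(xlim = c(1, 6)) +                    # Coordinates
  theme_minimal()                                      # Theme

print(p)
```

Each `+` adds one element, so a plot can be built up, and read back, one layer at a time.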
What to do with your data: good advice
Indicate measurement
Distinguish between measurements (current, history) and predictions (future), e.g. by using solid and dotted lines.
Show and emphasize trend lines and patterns.
Look at Data Sets from every angle!
Fitting a function (a very simple mathematical
Check for Under / Over Fitting
• Look at the data points and the fit and use your

• Ask yourself: does the fit describe the data well?

• Ask yourself: is the function you have used the simplest one that could describe the data?

• Tune the fit on a subset of the data (training set) and test it on the remaining (labelled) data (testing set).


What features count …

Things to look for and check that they match:

• behaviour as x -> +/- infinity

• turning points (maxima, minima), gradient = 0

• crossing points with the axes


Some Examples 1

Some Examples 2
Some Examples 3

[Figure annotations: Data still visible; Scales selected; Geometry: colored dots]
Some Examples 4

[Figure annotation: Statistical element added]
Florence Nightingale
Bad Practice
Typical Exam
Theory Exam

• See sample questions

Exam Example Questions
Shotgun questions (a short answer is sufficient)
1. Discuss the four V’s that are often used to describe what Big Data is.
2. What is the difference between a bar graph and a histogram?
3. Mention at least two file formats and discuss the differences between them.
4. Suppose you have a data set with many variables. Which R function can be used to quickly investigate the relations between all the variables?

Case Study
2. Visualization Design
Since 2014 there have been earthquakes in Groningen, a region in the north of Holland where natural gas is pumped up. Initially NAM, the responsible company, denied responsibility for the earthquakes and the collateral damage to houses. In 2014 the Dutch government decided to put a cap on the quantity of gas that could be pumped up per year. This cap was lowered in the subsequent years, when the earthquakes did not stop.
• 2014: Decision to limit the extraction of natural gas to 42.5 billion cubic meters, and to reduce it by 80% around Loppersum (where the heaviest quakes occurred).

• January 2015: Decision to lower the cap to 39.4 billion cubic meters.

• June 2015: Decision to lower the cap to 30 billion cubic meters.

The NAM decided not to reduce gas production at every pump location, but only at a select number of pump locations (marked with an orange colour on the map below).

Translation Dutch – English
Meer gaswinning – More pumping up of natural gas
Minder gaswinning – Less pumping up of natural gas
Groninger gasveld – Natural gas field of Groningen
Actieve productie lokaties – Active production locations
Productie lokaties met verminderde gaswinning – Production locations with decreased pumping up of natural gas
Gasleidingen – Gas pipes

1. Suppose you have earthquake data (date, time,
latitude, longitude, magnitude and depth) and
information about NAM pumps (latitude, longitude,
quantity of gas pumped up each month). Design and
draw a visualization that shows whether there is a
relationship between the reduction of pumped up
natural gas and the earthquakes.
Good Plot

• What are the properties of a good plot? Can you describe 3 of the properties? And give examples of bad practices?
