
Multivariate Data Analysis

6th Edition
An introduction to Multivariate Analysis, Process
Analytical Technology and Quality by Design

Kim H. Esbensen
and
Brad Swarbrick

with contributions from Frank Westad, Pat Whitcombe and Mark Anderson
Published by CAMO Software AS:

Norway
CAMO Software AS
Gaustadalléen 21
N-0349 Oslo, Norway
Tel: (+47) 223 963 00

United States
CAMO Software Inc.
33300 Egypt Lane
Suite L100
Magnolia, TX 77354, United States
Phone: +1 732 726 9200

The Unscrambler® is a trademark of CAMO Software AS.


Design-Expert® is a trademark of Stat-Ease, Inc.

ISBN 978-82-691104-0-1
© 2018 CAMO Software AS

All rights reserved. No part of this publication may be reproduced, stored or transmitted, in any form or by any means, except with the prior permission in writing of the publishers.

Cover art by Gry Andrea Esbensen Norang.

6th Edition

Digital book(s) (epub and mobi) produced by Booknook.biz.


Contents

Preface

Chapter 1. Introduction to multivariate analysis


1.1 The world is multivariate
1.2 Indirect observations and correlations
1.3 Data must carry useful information
1.4 Variance, covariance and correlation
1.5 Causality vs correlation
1.6 Hidden data structures—correlations again
1.7 Multivariate data analysis vs multivariate statistics
1.8 Main objectives of multivariate data analysis
1.8.1 Data description (exploratory data structure modelling)
1.8.2 Discrimination and classification
1.8.3 Regression and prediction
1.9 Multivariate techniques as geometric projections
1.9.1 Geometry, mathematics, algorithms
1.10 The grand overview in multivariate data analysis
1.11 References

Chapter 2: A review of some fundamental statistics


2.1 Terminology
2.2 Definitions of some important measurements and concepts
2.2.1 The mean
2.2.2 The median
2.2.3 The mode
2.2.4 Variance and standard deviation
2.3 Samples and representative sampling
2.3.1 An example from the pharmaceutical industry
2.4 The normal distribution and its properties
2.4.1 Graphical representations
2.5 Hypothesis testing
2.5.1 Significance, risk and power
2.5.2 Defining an appropriate risk level
2.5.3 A general guideline for applying formal statistical tests
2.5.4 A Test for Equivalence of Variances: The F-test
2.5.5 Tests for equivalence of means
2.6 An introduction to time series and control charts
2.7 Joint confidence intervals and the need for multivariate analysis
2.8 Chapter summary
2.9 References

Chapter 3: Theory of Sampling (TOS)


3.1 Chapter overview
3.2 Heterogeneity
3.2.1 Constitutional heterogeneity (CH)
3.2.2 Distributional heterogeneity (DH)
3.3 Sampling error vs practical sampling
3.4 Total Sampling Error (TSE)—Fundamental Sampling Principle (FSP)
3.5 Sampling Unit Operations (SUO)
3.6 Replication experiment—quantifying sampling errors
3.7 TOS in relation to multivariate data analysis
3.8 Process sampling—variographic analysis
3.8.1 Appendix A. Terms and definitions used in the TOS literature
3.9 References
Chapter 4: Fundamentals of principal component analysis (PCA)
4.1 Representing data as a matrix
4.2 The variable space—plotting objects in p-dimensions
4.2.1 Plotting data in 1-D and 2-D space
4.2.2 The variable space and dimensions
4.2.3 Visualisation in 3-D (or more)
4.3 Plotting objects in variable space
4.4 Example—plotting raw data (beverage)
4.4.1 Purpose
4.4.2 Data set
4.5 The first principal component
4.5.1 Maximum variance directions
4.5.2 The first principal component as a least squares fit
4.6 Extension to higher-order principal components
4.7 Principal component models—scores and loadings
4.7.1 Maximum number of principal components
4.7.2 PC model centre
4.7.3 Introducing loadings—relations between X and PCs
4.7.4 Scores—coordinates in PC space
4.7.5 Object residuals
4.8 Objectives of PCA
4.9 Score plot–object relationships
4.9.1 Interpreting score plots
4.9.2 Choice of score plots
4.10 The loading plot–variable relationships
4.10.1 Correlation loadings
4.10.2 Comparison of scores and loading plots
4.10.3 The 1-dimensional loading plot
4.11 Example: city temperatures in Europe
4.11.1 Introduction
4.11.2 Plotting data and deciding on the validation scheme
4.11.3 PCA results and interpretation
4.12 Principal component models
4.12.1 The PC model
4.12.2 Centring
4.12.3 Step by step calculation of PCs
4.12.4 A preliminary comment on the algorithm: NIPALS
4.12.5 Residuals—the E-matrix
4.12.6 Residual variance
4.12.7 Object residuals
4.12.8 The total squared object residual
4.12.9 Explained/residual variance plots
4.12.10 How many PCs to use?
4.12.11 A note on the number of PCs
4.12.12 A doubtful case—using external evidence
4.12.13 Variable residuals
4.12.14 More about variances—modelling error variance
4.13 Example: interpreting a PCA model (peas)
4.13.1 Purpose
4.13.2 Data set
4.13.3 Tasks
4.13.4 How to do it
4.13.5 Summary
4.14 PCA modelling—the NIPALS algorithm
4.15 Chapter summary
4.16 References

Chapter 5: Preprocessing
5.1 Introduction
5.2 Preprocessing of discrete data
5.2.1 Variable weighting and scaling
5.2.2 Logarithm transformation
5.2.3 Averaging
5.3 Preprocessing of spectroscopic data
5.3.1 Spectroscopic transformations
5.3.2 Smoothing
5.3.3 Normalisation
5.3.4 Baseline correction
5.3.5 Derivatives
5.3.6 Correcting multiplicative effects in spectra
5.3.7 Other general preprocessing methods
5.4 Practical aspects of preprocessing
5.4.1 Scatter effects plot
5.4.2 Detailed example: preprocessing gluten–starch mixtures
5.5 Chapter summary
5.6 References

Chapter 6. Principal Component Analysis (PCA)—in practice


6.1 The PCA overview
6.2 PCA—Step by Step
6.3 Interpretation of PCA models
6.3.1 Interpretation of score plots—look for patterns
6.3.2 Summary—interpretation of score plots
6.3.3 Interpretation of loading plots—look for important variables
6.4 Example: alcohol in water analysis
6.5 PCA—what can go wrong?
6.5.1 Is there any information in the data set?
6.5.2 Too few PCs are used in the model
6.5.3 Too many PCs are used in the model
6.5.4 Outliers which are truly due to erroneous data were not removed
6.5.5 Outliers that contain important information were removed
6.5.6 The score plots were not explored sufficiently
6.5.7 Loadings were interpreted with the wrong number of PCs
6.5.8 Too much reliance on the standard diagnostics in the computer program
without thinking for yourself
6.5.9 The “wrong” data preprocessing was used
6.6 Outliers
6.6.1 Hotelling’s T2 statistic
6.6.2 Leverage
6.6.3 Mahalanobis distance
6.6.4 Influence plots
6.7 Validation score plot and PCA projection
6.7.1 Multivariate projection
6.7.2 Validation scores
6.8 Exercise—detecting outliers (Troodos)
6.8.1 Purpose
6.8.2 Data set
6.8.3 Analysis
6.8.4 Summary
6.9 Summary: PCA in practice
6.10 References

Chapter 7. Multivariate calibration


7.1 Multivariate modelling (X, Y): the calibration stage
7.2 Multivariate modelling (X, Y): the prediction stage
7.3 Calibration set requirements (training set)
7.4 Introduction to validation
7.4.1 Test set validation
7.4.2 Other validation methods
7.4.3 Modelling error
7.5 Number of components/factors (model dimensionality)
7.5.1 Minimising the prediction error
7.6 Univariate regression (y|x) and MLR
7.6.1 Univariate regression (y|x)
7.6.2 Multiple linear regression, MLR
7.7 Collinearity
7.8 PCR—Principal component regression
7.8.1 PCA scores in MLR
7.8.2 Are all the possible PCs needed?
7.8.3 Example: prediction of multiple components in an alcohol mixture
7.8.4 Weaknesses of PCR
7.9 PLS-regression (PLSR)
7.9.1 PLSR—a powerful alternative to PCR
7.9.2 PLSR (X, Y): initial comparison with PCA(X), PCA(Y)
7.9.3 PLS—NIPALS algorithm
7.9.4 PLSR with one or more Y-variables
7.9.5 Interpretation of PLS models
7.9.6 Loadings (p) and loading weights (w)
7.9.7 The PLS1 NIPALS algorithm
7.10 Example—interpretation of PLS1 (octane in gasoline) Part 1: model development
7.10.1 Purpose
7.10.2 Data set
7.10.3 Tasks
7.10.4 Initial data considerations
7.10.5 Always perform an initial PCA
7.10.6 Regression analysis
7.10.7 Assessment of loadings vs loading weights
7.10.8 Assessment of regression coefficients
7.10.9 Always use loading weights for model building and understanding
7.10.10 Predicted vs reference plot
7.10.11 Regression analysis of octane (Part 1) summary
7.10.12 A short discourse on model diagnostics
7.10.13 Residuals in X
7.10.14 Q-residuals
7.10.15 F-residuals
7.10.16 Hotelling’s T2 statistic
7.10.17 Influence plots for regression models
7.10.18 Always check the raw data!
7.10.19 Which objects should be removed?
7.10.20 Residuals in Y
7.11 Error measures
7.11.1 Calculating the SEL for a reference method
7.11.2 Further estimates of model precision
7.11.3 X–Y relation outlier plots (T vs U scores)
7.11.4 Example—interpretation of PLS1 (octane in gasoline) Part 2: advanced interpretations
7.11.5 Sample elimination
7.11.6 Variable elimination
7.11.7 X–Y relationship outlier plot
7.12 Prediction using multivariate models
7.12.1 Projected scores
7.12.2 Prediction influence plots
7.12.3 Y-deviation
7.12.4 Inlier statistic
7.12.5 Example—interpretation of PLS1 (octane in gasoline) Part 3: prediction
7.13 Uncertainty estimates, significance and stability—Martens’ uncertainty test
7.13.1 Uncertainty estimates in regression coefficients, b
7.13.2 Rotation of perturbed models
7.13.3 Variable selection
7.13.4 Model stability
7.13.5 An example using data from paper manufacturing
7.13.6 Example—gluten in starch calibration
7.13.7 Raw data model
7.13.8 MSC data model
7.13.9 EMSC data model
7.13.10 mEMSC data model
7.13.11 Comparison of results
7.14 PLSR and PCR multivariate calibration—in practice
7.14.1 What is a “good” or “bad” model?
7.14.2 Signs of unsatisfactory data models—a useful checklist
7.14.3 Possible reasons for bad modelling or validation results
7.15 Chapter summary
7.16 References

Chapter 8. Principles of Proper Validation (PPV)


8.1 Introduction
8.2 The Principles of Validation: overview
8.3 Data quality—data representativity
8.4 Validation objectives
8.4.1 Test set validation—a necessary and sufficient paradigm
8.4.2 Validation in data analysis and chemometrics
8.5 Fallacies and abuse of the central limit theorem
8.6 Systematics of cross-validation
8.7 Data structure display via t–u plots
8.8 Multiple validation approaches
8.9 Verdict on training set splitting and many other myths
8.10 Cross-validation does have a role—category and model comparisons
8.11 Cross-validation vs test set validation in practice
8.12 Visualisation of validation is everything
8.13 Final remark on several test sets
8.14 Conclusions
8.15 References

Chapter 9. Replication—replicates—but of what?


9.1 Introduction
9.2 Understanding uncertainty
9.3 The Replication Experiment (RE)
9.4 RE consequences for validation
9.5 Replication applied to analytical method development
9.6 Analytical vs sampling bias
9.7 References

Chapter 10. An introduction to multivariate classification


10.1 Supervised or unsupervised, that is the question!
10.2 Principles of unsupervised classification and clustering
10.2.1 k-Means clustering
10.3 Principles of supervised classification
10.4 Graphical interpretation of classification results
10.4.1 The Coomans’ plot
10.5 Partial least squares discriminant analysis (PLS-DA)
10.5.1 Multivariate classification using class differences, PLS-DA
10.6 Linear Discriminant Analysis (LDA)
10.7 Support vector machine classification
10.8 Advantages of SIMCA over traditional methods and new methods
10.9 Application of supervised classification methods to authentication of vegetable oils using FTIR
10.9.1 Data visualisation and pre-processing
10.9.2 Exploratory data analysis
10.9.3 Developing a SIMCA library and application to a test set
10.9.4 SIMCA model diagnostics
10.9.5 Developing a PLS-DA method and application to a test set
10.9.6 Developing a PCA-LDA method and application to a test set
10.9.7 Developing a SVMC method and application to a test set
10.9.8 Conclusions from the Vegetable Oil classification
10.10 Chapter summary
10.11 References

Chapter 11. Introduction to Design of Experiments (DoE) Methodology

11.1 Experimental design
11.1.1 Why is experimental design useful?
11.1.2 The ad hoc approach
11.1.3 The traditional approach—vary one variable at a time
11.1.4 The alternative approach
11.2 Experimental design in practice
11.2.1 Define stage
11.2.2 Design stage
11.2.3 Analyse stage
11.2.4 Improve stage
11.2.5 The concept of factorial designs
11.2.6 Full factorial designs
11.2.7 Naming convention
11.2.8 Calculating effects when there are many experiments
11.2.9 The concept of fractional factorial designs
11.2.10 Confounding
11.2.11 Types of variables encountered in DoE
11.2.12 Ranges of variation for experimental factors
11.2.13 Replicates
11.2.14 Randomisation
11.2.15 Blocking in designed experiments
11.2.16 Types of experimental design
11.2.17 Which optimisation design to choose in practice
11.2.18 Important effects
11.2.19 Hierarchy of effects
11.2.20 Model significance
11.2.21 Total sum of squares (SStotal)
11.2.22 Sum of squares regression (SSReg)
11.2.23 Residual sum of squares (SSError)
11.2.24 Model degrees of freedom (ν)
11.2.25 Example: building the ANOVA table for a 2³ full factorial design
11.2.26 Supplementary statistics
11.2.27 Pure error and lack of fit assessment
11.2.28 Graphical tools used for assessing designed experiments
11.2.29 Model interpretation plots
11.2.30 The chemical process as a fractional factorial design
11.2.31 An introduction to constrained designs
11.3 Chapter summary
11.4 References

Chapter 12. Factor rotation and multivariate curve resolution—introduction to multivariate data analysis, tier II

12.1 Simple structure
12.2 PCA rotation
12.3 Orthogonal rotation methods
12.3.1 Varimax rotation
12.3.2 Quartimax rotation
12.3.3 Equimax rotation
12.3.4 Parsimax rotation
12.4 Interpretation of rotated PCA results
12.4.1 PCA rotation applied to NIR data of fish samples
12.5 An introduction to multivariate curve resolution (MCR)
12.5.1 What is multivariate curve resolution?
12.5.2 How multivariate curve resolution works
12.5.3 Data types suitable for MCR
12.6 Constraints in MCR
12.6.1 Non-negativity constraints
12.6.2 Uni-modality constraints
12.6.3 Closure constraints
12.6.4 Other constraints
12.6.5 Ambiguities and constraints in MCR
12.7 Algorithms used in multivariate curve resolution
12.7.1 Evolving factor analysis (EFA)
12.7.2 Multivariate curve resolution–alternating least squares (MCR–ALS)
12.7.3 Initial estimates for MCR-ALS
12.7.4 Computational parameters of MCR–ALS
12.7.5 Tuning the sensitivity of the analysis to pure components
12.8 Main results of MCR
12.8.1 Residuals
12.8.2 Estimated concentrations
12.8.3 Estimated spectra
12.8.4 Practical use of estimated concentrations and spectra and quality checks
12.8.5 Outliers and noisy variables in MCR
12.9 MCR applied to fat analysis of fish
12.10 Chapter summary
12.11 References

Chapter 13. Process analytical technology (PAT) and its role in the quality by design (QbD) initiative

13.1 Introduction
13.2 The Quality by Design (QbD) initiative
13.2.1 The International Conference on Harmonisation (ICH) guidance
13.2.2 US FDA process validation guidance
13.3 Process analytical technology (PAT)
13.3.1 At-line, online, inline or offline: what is the difference?
13.3.2 Enablers of PAT
13.4 The link between QbD and PAT
13.5 Chemometrics: the glue that holds QbD and PAT together
13.5.1 A new approach to batch process understanding: relative time modelling
13.5.2 Hierarchical modelling
13.5.3 Classification–classification hierarchies
13.5.4 Classification–prediction hierarchies
13.5.5 Prediction–prediction hierarchies
13.5.6 Continuous pharmaceutical manufacturing: the embodiment of QbD and PAT
13.6 An introduction to multivariate statistical process control (MSPC)
13.6.1 Aspects of data fusion
13.6.2 Multivariate statistical process control (MSPC) principles
13.6.3 Total process measurement system quality control (TPMSQC)
13.7 Model lifecycle management
13.7.1 The iterative model building cycle
13.7.2 A general procedure for model updating
13.7.3 Summary of model lifecycle management
13.8 Chapter summary
13.9 References
Preface

The field of chemometrics is the application of multivariate data analysis (MVA) methodology to solve chemistry-based problems, and it has developed into a mature scientific discipline since its start in the early 1980s. Although there are numerous applications in daily use in
many industries, the knowledge and use of methods are not so
widespread. While the academic MVA research toolbox is well
established and ever increasing, its industrial counterpart has slowly
gained momentum, particularly over the last decade. This could be for many reasons, such as the advancement of industrial data management solutions and the semi-continuous financial crises that drive industry towards smarter manufacturing paradigms and intelligent data mining to reduce costs and waste. Whatever the reasons, it is clear
that chemometrics and multivariate analysis have a big role to play in
connecting the dots between research and the new industrial
paradigm with advanced and more robust sensors, agile
manufacturing processes and extensive product quality control. The
necessity of proper procedures for collecting and analysing data has
become even more important as everyone is into “Big Data” these
days; however, a better term might be “Smart Data”. Welcome to the
multivariate world!
This book aims at conveying a universal philosophy within the field
of multivariate data analysis. The authors believe that all processes or
systems are multivariate in nature until proven otherwise, and
therefore must be analysed, modelled and understood as such. The
topic of proper data quality is emphasised. There are two important
aspects in any field that requires experimentation—the ability to
understand the outputs of the experiment and the ability to put this
newfound knowledge into use, for business gain or deeper scientific
understanding. Whether the experiment involves measurements
generated by scientific instruments, sensory data, manufacturing
process data or psychometric variables, the two main questions data analysis must answer are: can the process/experiment be understood, and how can the data be turned into useful information?
This book will, first, arm the reader with the multivariate tools
available to understand the data they generate and develop valid and
robust models from such data and, second, show how these models
can be applied in real-world situations. Chapter 1, Introduction, elaborates on the above themes in full depth. In particular, the data analysis scope presented in this book also serves as a warning against overly trigger-happy machine learning forays—the data analyst must be in charge.
The revision of this book includes an extended chapter on
validation. Over the years, it has become clear that validation as a
concept is not presented explicitly in most educational institutions.
One reason might be that it does not belong in a specific science
subject and therefore “falls between many stools”. The same can also
be said about the principles of Design of Experiments (DoE), although
its usage should be mandatory for anyone performing trials or
observing processes. DoE is an important tool for understanding
causality in a system and for confirming/rejecting test hypotheses. In
combination with multivariate analysis, the symbiosis of DoE and
MVA gives the best of both worlds. It is also a fundamental basis for
chemometrics that the owner of the data should also analyse them,
with subject matter and domain specific background knowledge in
mind.
The basic topics in the 5th edition have all been extended significantly, making the book thoroughly revised, updated and modernised, among other things with a comprehensive chapter on basic statistics and a long-needed fundamental chapter on sampling, which allows a new holistic view on both data quality and on the concept of “replication”.
Finally, the book now contains a tour de force introduction to Process
Analytical Technology (PAT) and Quality by Design (QbD)
implementation. Although PAT has a strong connection to the
pharmaceutical industry, the principles apply with equal force to
many other application areas in technology and industry.
While the multivariate methods lend themselves to empirical
analysis of data sampled from science, technology and nature, i.e.
any system with multiple underlying structures, there is nothing that
prevents the use of first principle models in combination with actual
observations.
CAMO Software’s philosophy is not to cram every method in the world into its Unscrambler® platform, but to provide methods that are versatile and suited to any kind of data, regardless of size and properties. The authors and CAMO believe that the focus should be
on graphical presentation of results and their interpretation rather
than tables with p-values. This is related to the distinction between
significance and relevance. With a high number of objects, almost any difference between two groups or correlation between two variables will come out as statistically significant. Thus, a table of p-values does not necessarily show whether a model is suitable for predicting selected properties, such as product quality, at the individual level.
That said, summarising the important findings from a project or study is often efficiently done with bullet points or univariate statistics. Our message is that the
multivariate methods provide the fastest insight into complex data to
arrive at the correct conclusions and to avoid “searching for
correlations”. Even after 40 years of multivariate methods, and of multivariate calibration in particular, most people remain unaware that selectivity is not needed to predict the quality of a product or to classify/identify samples such as raw materials.
Being a data analyst is about practising the methods and software on your own data.
We wish you all the best, and may your models be with you!
Previous editions of this book included exercises with detailed descriptions of how to operate The Unscrambler® software package. In this edition, however, the focus is on the analytical methods, their interpretation and, finally, a topic that is not well covered to date: the implementation of models, including their proper use. For more details, see www.camo.com or videos on
YouTube.
The 6th edition of this textbook/software education package
represents a fundamentally updated, revised, extended and
augmented new product. The present textbook is a result of an
inspiring and pleasant joint effort of the two main authors in every
aspect involved: didactic concept and design, revising trusted old
chapter cores, de novo writing of ~50% new material, production of
new illustrations and editorial subject-matter proofing. The final result
owes a great debt of gratitude to two external reviewers with
extensive didactic chemometric experience. We express our most
sincere gratitude for their comprehensive treatment of the first draft
manuscript with well-reflected suggestions for an optimised learning
“flow” as well as sharp insights concerning critical presentation
details.
This book is now in its 6th edition (2017), being released on the
basis of a total of 33,000 earlier copies; there is also a Russian
language version (abbreviated contents). For the first author, it has
been an immense privilege to serve the chemometric community, and
indeed beyond, with this book since 1994 (1st Edition). The present 6th
edition is—finally—one with which to be truly satisfied. A profound
thank you goes to Brad for stepping in with gusto!

Kim H. Esbensen, Brad Swarbrick (authors—with a mission)


Frank Westad (contributing author, CAMO CSO)
Copenhagen, Sydney, Oslo, 1 August 2017
Chapter 1. Introduction to
multivariate analysis

This chapter presents an overview of the elements of multivariate data analysis, the most typical data analysis objectives and a set of
methodological issues surrounding multivariate data analysis: not
only how to, but also why do multivariate data analysis in this, or that,
particular way. More than 35 years of chemometric experience on the part of the lead and contributing co-authors and CAMO Software
lies behind this sixth completely updated and revised edition of
Multivariate Data Analysis—An Introduction. This has resulted in a
unified, comprehensive data analysis philosophy including the utmost
quality demands for software (The Unscrambler®) and with respect to
teaching and documentary materials. This approach also makes
stringent demands on the relevance, generalisation potential and
quality of the selected data sets provided as examples of data
analytical workflows for the reader (some data sets have survived all
six editions, but over the years many new data sets have also been
introduced).
Most notably, the philosophy behind the current textbook
distinguishes itself from other approaches by specific didactic use of
geometric projections as a critical lead-in to the more traditional
algebraic formulations of the data analysis methods. Certain aspects
of the Theory of Sampling (TOS) are also included as a basis for a
thorough understanding of the essential validation issues, which have
to do with a necessary understanding of the characteristics of
heterogeneity. This book sets itself the goal of leaving the reader with a complete understanding of all data error sources “from field (or from
plant) to data analysis”. Thus, the chemometric world does not start
with a data matrix, but rather with sampling (acquiring samples for
analysis) or with an experimental plan of how to acquire (measure)
data and with what sensors/instruments, issues that critically
influence the way chemometric data analysis is carried out and the
way the resulting models are to be validated.

1.1 The world is multivariate


Nature is multivariate, as are most data-generating pathways from
science, technology and industry, in the sense that any one particular
phenomenon a data analyst would like to study in quantitative detail
often overwhelmingly depends on several factors. For instance, the
weather depends on influencing variables such as wind, air pressure,
temperature, dew point (and many others)—besides the obvious
seasonal variations (also giving this issue a time series aspect). The
health status of a human individual most certainly depends on a
multitude of factors, including genes, social position, eating habits,
stress, environmental impact and much more… that may need to be
taken into account before diseases can be characterised, understood
and cured in a proper, holistic context. The tone quality of a violin?
The environmental impact from car exhaust gases? The climate
crisis is by now very well understood to be a result of a plethora of
influencing factors—add to that political decision-making issues. All
these examples certainly depend on a score of factors, as do very
many technological systems: think of the absorption spectrum of just
one analyte in the presence of other analytes, other matrix elements
and perhaps externally introduced contaminants. No matter to which
science one directs data analytical attention, properties only very, very rarely depend on one, and only one, variable.
This is also the case in other scientific disciplines in which
underlying causal relationships give rise to observable data, for
instance economics, sociology, psychology as well as in technology.
In the industrial sphere the multivariate dictum is the rule in all of the
processing and manufacturing sectors.
Accordingly, data analytical methods dealing with only one
variable at a time, known as univariate methods, often turn out to be
of limited use in today’s complex contexts. It is still necessary to
master these univariate methods, however, as they often carry
important marginal information, and they are a very powerful, natural
stepping stone into the multivariate realm—while always keeping in
mind that they are often insufficient for a complete data analysis.
In this book, attention is given to the basic univariate statistics and
approaches that apply to a wide-ranging series of scientific and
technological examples that naturally lead to multivariate problems
and associated multivariate data. However, focus shall
overwhelmingly be on systems which exhibit such a degree of
complexity where, typically, more than three variables are needed.

1.2 Indirect observations and correlations


It is thus often necessary to sample, observe, study or measure more
than one variable. When the measuring or recording instruments
correspond directly to the phenomenon being investigated,
everything is fine. For instance, if one wishes to determine the
temperature, a first thought would no doubt be to use a thermometer
—the readings would constitute a direct univariate measurement.
Unfortunately, this is very seldom the situation with anything but the
simplest of systems. When it is not possible to measure or observe a
desired parameter or variable directly, one is forced to turn to indirect
measurement, which is the situation in which multivariate data are
most often generated. So, for example, if the temperature, say inside a blast furnace, is high enough to melt the thermometer, or if it is simply not practical to use a thermometer, one would have to determine the temperature indirectly. In this context it is worth mentioning that a thermometer
also needs to be calibrated; in fact it does not directly measure the
temperature as such but the volume expansion of mercury. This is an
example of an analytical calibration in the strict univariate sense. To
determine the temperature of the furnace load, an alternative way
would be to use Infrared (IR)-emission spectroscopy and then
estimate the temperature from the recorded IR-spectrum. This would
be an indirect measurement: measuring something else to determine the desired quantity; and it is not only temperature which can be
derived from such a multivariate spectral signal (e.g. chemistry,
certain physical parameters …). In the particular case of the blast
furnace, one would have to use many spectral wavelengths to do the
job; in other words, an indirect multivariate characterisation. This is a
typical feature of nearly all types of such indirect observation and
therefore underscores an obvious need for a multivariate approach in
the ensuing data analysis or data modelling.

1.3 Data must carry useful information


The basic assumption underlying multivariate data analysis,
multivariate data modelling, is that the measured data actually carry
information about the desired properties and experimental objectives.
If the objective is to find, say, the concentration of a particular
chemical substance in a liquid mixture, the measurements performed
on the mixture must in some way reflect the concentration of that
substance. Quite clearly the mere act of measuring many variables is
in itself not necessarily a sufficient condition for multivariate data
analysis to bring forth information. This may not always seem to be
an obvious issue in many complex real-world situations. Still,
throughout this book it will be shown how powerful multivariate
methods are and, indeed, it will often be tempting to use them on any
measuring problem. But not even multivariate methods can help in
cases where the data do not contain relevant information about the
property that is being sought.
The implied information in the data will depend on how well the
problem to be solved has been defined, and whether one has
performed the observations appropriately. “Appropriately” is always a
problem-dependent issue; much will be said on this topic in this and
subsequent chapters. The data analyst has a very clear responsibility
to provide, or request (if data are supplied from another party),
meaningful data. Often it is much more important which variables
have been measured than how many observations have actually been
made. In spectroscopy for instance, one often works under “the
burden of availability”, meaning that one cannot freely select an
optimal wavelength region for the problem at hand, but is rather often
forced to choose from an available wavelength range set by the
instrumentation. “How many” is an issue that often also crops up in the form of “replicated measurements”, which is discussed in more detail
in chapter 9. There is major confusion regarding this issue in
quantitative spectroscopy, and this discussion is started here.
Collecting more spectra is often very easy indeed. Just because it is easy to collect more spectra does not necessarily mean that the additional spectra also carry additional information relative to solving the problem at hand, over and above the simplest analytical uncertainty issue, which reflects nothing but the smallest contribution to the total Measurement Uncertainty (MU).
This is in contrast to standard statistical methods where the
number of observations is the only game in town; here the minimum
number of observations depends on the number of parameters to be
determined: the number of observations is usually stipulated to exceed the number of parameters to be estimated by a factor of 10. This is not necessarily the case with the multivariate projection
methods treated in this book. In one particular sense, variables and
objects may, at least to some degree, stabilise one another.
But it is equally important that one has the ability to select
appropriate measurement ranges of the variables in the data
acquisition setup, or in the experimental plan (in the situation in which
the person doing data analysis also has full control of the data
acquisition). More details on Experimental Design are provided in
chapter 11. In multivariate data analysis on one data matrix alone, an
effective spanning of the experimental space (X) is of paramount
importance. In a regression setting it is the spanning of the response
space (Y) which is the highest priority.
Indirect characterisation requires a quantitative relationship
between the set of measured variables and the property of interest for
which the indirect multivariate observation stands in proxy. If the
values of the measurement variables change, the value of the indirect
property must change accordingly (given a salient associated
uncertainty). Mathematically, this is formulated in such a way that the
desired property (response), called Y, is a function of the measured
variables (predictors), termed X. Thus Y is typically dependent on
several, many or very many variables and X is therefore represented
as a vector, called the measurement vector or the object vector. This
is why the response variable is often known as the dependent
variable(s) and the predictors are known as the independent
variables. The property of interest may, for example, represent a type
of measurement that has to be carried out by a more cumbersome,
laborious or expensive analytical method (a reference method). The
X-variables on the other hand may often be “more direct variables”,
which typically could be faster, easier, less expensive or more
efficient to obtain; such as spectroscopic measurements and/or
others that can be carried out instrumentally, automatically or
otherwise. This X–Y relationship is central to the multivariate
calibration issue and will typically occupy the centre stage in what is
to follow.
Some basic definitions needed for multivariate data analysis are provided in Frame 1.1. The two types of measurements [X, Y] are always organised in two matrices (Figure 1.1).
In most software packages, as also in The Unscrambler®, the X
and Y matrices can be stored as one master data matrix
(concatenated as XY). In many data analysis tasks, it is up to the data
analyst to specify the X and the Y block at the outset, which, with
experience, becomes a very versatile option, allowing the easy
selection of single, or several, non-consecutive variables or
contiguous variable sub-blocks. This option makes explorative data
analysis very powerful, and allows what could be called “experimental
data analysis” (“What if?”).
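As a purely illustrative sketch of this data organisation (not tied to The Unscrambler® or any other package; all numbers are made up), the X block and Y block can be held as an n × p matrix and an n × q matrix and, if desired, concatenated into one master matrix:

```python
import numpy as np

# Hypothetical example: n = 5 objects (rows), p = 3 X-variables
# (fast, "inexpensive" measurements) and q = 1 Y-variable
# (an "expensive" reference measurement); all values invented.
X = np.array([
    [0.12, 1.05, 3.4],
    [0.15, 1.10, 3.6],
    [0.11, 0.98, 3.1],
    [0.19, 1.22, 4.0],
    [0.14, 1.07, 3.5],
])                                                  # shape (n, p)
Y = np.array([[7.1], [7.6], [6.8], [8.4], [7.3]])   # shape (n, q)

# The blocks may also be stored side by side as one master matrix,
# from which the analyst later designates the X and Y columns.
XY = np.hstack([X, Y])
print(X.shape, Y.shape, XY.shape)                   # (5, 3) (5, 1) (5, 4)
```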
1.4 Variance, covariance and correlation
A minimum set of statistical definitions pertaining to univariate summary statistics is listed below. The meaning of the most
important ones will be briefly discussed. The points brought up are
highly relevant to all the following chapters. In chapter 2, a more
detailed explanation of these statistics is provided.
The variance of a variable is a measure of the spread of the
recorded values, which corresponds to how large a range is covered
by the measured n values and has major implications in much of what
is discussed in this book. This univariate concept of “variance”
should be kept in mind also when dealing with more complex
multivariate data structures. It is traditional to express the measure of this spread in the same units as the raw measurements themselves, hence the commonly used measure [square root of the variance] = the standard deviation (std).

Object: Observation on one sample, recorded as a vector.

X-Variable: Easily available, “inexpensive” measurements on one object; an independent measurement.

Y-Variable: Difficult or “expensive” to measure observation(s) on the same object; a dependent measurement.

N: Total number of objects (samples, observations).

P: Total number of X-variables.

Q: Total number of Y-variables.

Frame 1.1: Basic definitions.

The covariance between two variables, x1 and x2, is a measure of their linear association. If large values of variable x1 occur together
with large values of variable x2, the covariance will be positive.
Conversely, if large values of variable x1 occur together with small
values of variable x2, and vice versa (small values of variable x1
together with large values of variable x2), the covariance is said to be
negative. A large covariance (in absolute values) means that there is a
strong linear dependency between the two variables. If the
covariance is small, the two variables are essentially independent of
one another: if variable x1 changes, this does not affect the
corresponding values of variable x2 very much. Notice the analogy
between the defining equations for variance, which concerns one
variable, and covariance, which is a measure concerning a pair of
variables.
Practical experience shows that talk of “large” and “small” is
usually subjective and therefore imprecise; what is “large”, what is
“small” and what is just “a little” must be defined on a relative scale.
For example, if the covariance between pressure and temperature in a system is 512 [°C·atm], is this a large (strong) or a small (weak) covariance? Does it mean that temperature and pressure follow each other closely, or that they are nearly independent? And what about the covariance between temperature and the concentration of a substance, say, if the covariance is 12 [°C·(mg dm⁻³)]? How does that compare with the covariance between temperature and pressure?

Figure 1.1: Matrices for the two types of measurements X and Y, made on the same set of n
objects.

The absolute magnitude of covariance depends on the measuring units of the variables, which is why this measure is often not useful for
variables that are measured in different units. To put everything on an
equal footing, i.e. in order to compare and model linear
dependencies, the correlation between two variables is an alternative
directly interpretable and more practical measure of association. The
correlation between two variables is calculated by dividing the
covariance with the product of their respective standard deviations.
Correlation is thus a unit-less number, a scaled covariance measure.
The correlation is, therefore, the most useful measure of interdependence between variables, as two or more correlation coefficients are directly comparable whatever units the variables are measured in. Pearson’s product-moment correlation coefficient, r, is defined below (Frame 1.2); its square, r², is often interpreted as the fraction of the total variance that can be modelled by a linear relationship. It is highly recommended, indeed mandatory, always to quote and use r² rather than r and/or the covariance.
Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Variance: $s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$

Standard Deviation: $s_x = \sqrt{s_x^2}$

Covariance (between x and y): $\mathrm{cov}(x,y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$

Correlation (between x and y): $r = \dfrac{\mathrm{cov}(x,y)}{s_x\, s_y}$

Frame 1.2: Fundamental univariate statistical measures.

Frame 1.2 gives the few necessary basic statistical measures needed for this chapter. Chapter 2 gives a comprehensive statistical
introduction.
The mean, the average of n measurements, is an estimate of the
localisation of a set of measurements; the variance is an estimate of
the spread of these measurements with respect to the mean, as is the
standard deviation, std. Correlations always lie between –1.0 and
+1.0 (Figure 1.2).
A correlation of 0.0 means there is absolutely no correlation; in other words, no linear relationship at all between the variables.
A correlation of +1 means there is an exactly linear positive
relationship between the variables, i.e. there is no MU at all (a
highly unusual, indeed unrealistic situation in practice).
A correlation of –1 means there is an exactly linear negative
relationship between the variables (see above for comment on the
realism of such a case as well).
r² is the most common form of expressing correlation, but note that the sign of the r measure is lost. This is of no consequence,
however, because, as will be revealed in this book, one always plots
the raw data relationships in order to make use of the visual data
structure information. Based on this, it is always fully clear whether a
particular correlation is negative or positive. In the type of plots that
illustrate the multivariate data analysis methods introduced in this
book, the correlation sign will always also be evidenced directly by
this plotting tradition.

Figure 1.2: Illustration of correlation (r) signs and (r²) magnitudes.

The reader will find many examples in the scientific and technical
literature of the use of the “strength” of correlation, often expressed
by the correlation coefficient r, and it is not infrequent to see claims,
for example, that r being of the order of 0.50 is testimony to a “good
correlation”. However, this says that only 25% of the total scaled covariance is systematic (see the right-most panel in Figure 1.2); a correlation coefficient squared of 0.261 is anything but “good”. One should always use r² as a measure of the fraction of the total correlation as an expression of a systematic relationship—never r.
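To make the measures in Frame 1.2 and the r vs r² distinction concrete, the following minimal sketch (hypothetical numbers, plain NumPy, not taken from any of the book's data sets) computes them for one pair of variables:

```python
import numpy as np

# Hypothetical paired measurements, e.g. temperature [degC] and pressure [atm]
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
y = np.array([1.02, 1.10, 1.21, 1.27, 1.38, 1.45])
n = len(x)

mean_x = x.sum() / n                                         # mean
var_x = ((x - mean_x) ** 2).sum() / (n - 1)                  # variance
std_x = np.sqrt(var_x)                                       # standard deviation

cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)   # covariance, unit-dependent
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                 # correlation: scaled, unit-less
r2 = r ** 2                                                  # fraction of variance modelled linearly

print(f"cov = {cov_xy:.3f}  r = {r:.3f}  r2 = {r2:.3f}")
```

Changing the units of x or y changes cov_xy, but leaves r and r² unchanged, which is exactly why the scaled measures are preferred.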

1.5 Causality vs correlation


While correlation is a statistical concept that furnishes a quantitative measure of the degree of linear association between two variables, it is basically a neutral, phenomenological measure. It tells nothing of which of the two variables involved influences the other in a causal context.
Cause and effect dealings may ensue from interpretation of
deterministic relationships, and may, for example, lead to a causal
understanding of such issues. Many think in terms of cause and
effect when using the term correlation, but this is to be avoided if the
necessary domain-specific knowledge is small or lacking. One should
most specifically use all application and domain-specific knowledge
when interpreting correlated variables with the aim of determining
cause and effect. For example, a statistical survey amongst a number
of small Danish rural towns showed that the number of babies born in
the towns could be correlated fairly well to the number of storks
found in the area (squared correlation, r2, of about 0.85—no less).
However, nowadays very few people believe that storks bring the
babies.
Examples of such confusion between descriptive statistics and
causal interpretation hereof are common in science, technology and
industry, where they may crop up in much more complex contexts
than this—be aware. One can do meaningful first foray multivariate
data analysis both with, but indeed very often also without, such
specific, detailed causal understandings.
1.6 Hidden data structures—correlations again
So far, the concepts of indirect observations and correlations have
been introduced: the prime concepts regarding multivariate data.
There will always be some correlation (strong, intermediate, weak) between any set of multivariate measurements and the property one wishes to estimate via a set of indirect observations, if for no other reason than that many characterising variables were chosen. In general, the higher such correlation, the more precise the Y estimate may be.
Similar statements apply to one set of X variables alone. There will
always be correlations when dealing with multivariate data sets from
science, technology and industry, as indeed also from many other
disciplines and application fields (economic, sociological, sensory
data, as well as quantifiable aspects within the humanities). In the
marginal situation in which no variable is significantly correlated to
any other in an X matrix, the only relevant approach is through
univariate data analysis and traditional univariate statistics. If this is
the case for an X matrix in a regression context, Multiple Linear
Regression (MLR) is all that has to be applied; there will be no need
for the regression techniques which will be introduced here, Principal
Component Regression (PCR) and Partial Least Squares (PLS), which
are discussed in detail in chapter 7.
However, it is the simultaneous contribution from several different
variables, each with a significantly different information content
(minor, intermediate, strong), that often enables multivariate
modelling, precisely because of, and through, their mutual significant
correlations. Only in very rare, special cases will the Y property
depend on only one X variable, in which case the correlation between
property and variable necessarily must be high. This is called a
“selective variable”.
Multivariate data analysis typically deals with “non-selective”
variables – and invites one to use many (very many if need be) of
these in lieu of selectivity. But more is not always good. In general,
multiple measurements will always also contain elements (quantitative
components) that are irrelevant to the property being sought. These
may be effects that have absolutely nothing to do with what one
seeks to model, things that are causally uncorrelated. Sampling and
instrumental noise, and other effects, will always be present to some
degree as well (the topic of how to distinguish between measurement error and sampling error effects—the latter often very influential—will be addressed extensively later). It is only the degree
of such effects that typically is not known in advance. There may also
be other phenomena that often just happen to be measured at the
same time, a situation which is always problem-specific. A warning
note is sounded here already of the inherent dangers in assuming that
all such effects are of the nature termed “random” in statistics. This is
far from the case in most real-world data. In fact, the present
introduction to multivariate data analysis is quite specific in also
focusing on other than random (systematic) effects, even though
these at first sight would appear much more difficult to deal with than
white, or symmetric, noise. An example of this would be the case
where one wishes to find the concentration of substance A in a
mixture which also contains substances B and C. Spectroscopy can
be used to determine the A concentration, but the measured spectra
will not only contain spectral bands from A, which is the analyte of
interest, but necessarily also contributions from bands from the other,
irrelevant, compounds, which one cannot avoid measuring at the
same time; as illustrated in Figure 1.3.
The problem will therefore be to find which contributions come
from A, and which from B and C. Since it is substance A that is to be
determined (quantified), B and C must in this context be considered
as interferents making up the background matrix, i.e. “noise”, which,
as can be plainly seen, is highly structured—not at all complying with
a random noise assumption.
Figure 1.3: Partially overlapping spectra, illustrating a very often encountered situation in
multivariate calibration of spectroscopic data (schematic illustration).

Whether the B and C signals are considered to be noise is, of course, strongly dependent on the problem definition; if B was the
substance of interest, A and C would now be considered interferents.
In still another problem context, interest may be focused on
measuring and quantifying the contributions from both A and B
simultaneously. This is one of the particular strengths of the so-called multivariate calibration realm; see further below. In the latter case only the contributions from C would now be considered noise
arising from a matrix interferent. Multivariate calibration, in many
ways a crowning achievement in chemometrics, can handle all these
situations and much more, even without knowing what substances B
and C are when establishing a model for A.
The issue here is that it often is the context of the problem that
determines what to consider as “signal” and what to consider as
“noise”. Real world problems are, of course, often more complex than
the above, but the issues are essentially identical. In general,
multivariate observations can, therefore, be thought of as a sum of
two parts:
The data structure is the signal part that is correlated to the
property of interest. This is often called the information part of the
data. The noise part is “everything else”; that is to say contributions
from other components, instrumental noise, analytical errors and
oftentimes also effects from sampling errors etc. The “noise” term is
often alternatively designated as “errors”.
In multivariate data analysis, the objective is to find (and keep) the
structured part and throw away the noise part. This is carried out by
focusing on modelling the structured part in some optimal fashion
(appropriate optimisation criteria are essential). The problem is that
our observations always are a sum of both of these parts, and the
structure part will at first be “hidden” in the raw data. Thus, what
should be kept and what should be discarded may not be apparent at
first, and indeed good use is made of the noise part, as an important measure of model fit (strictly speaking, it is a measure of model lack-of-fit, which can be used just as well—it just needs to be sufficiently small).
This is where multivariate data analysis enters the scene. One of
its most important objectives is to make use of the intrinsic variable
correlations in a given data set to separate these two parts. There are
quite a number of multivariate methods, and a fair selection is
included in The Unscrambler®. This book emphasises exclusively methods that use covariance/correlation directly in this signal/noise separation, or decomposition, of the data.
In point of fact one may say that the variable correlations act as a
driving force for the multivariate analysis.
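The structure-plus-noise view can be made tangible with a small simulation. The sketch below is a hypothetical illustration only (invented band positions, concentrations and noise level): it builds mixture "spectra" from three partially overlapping bands standing in for A, B and C, adds random noise, and shows that the measured X is the sum of a correlated structure part and a noise part.

```python
import numpy as np

# Minimal simulation of "data = structure + noise" (hypothetical values).
wavelengths = np.linspace(0, 100, 200)

def band(centre, width):
    """Gaussian-shaped pure-component band."""
    return np.exp(-0.5 * ((wavelengths - centre) / width) ** 2)

S = np.vstack([band(40, 8), band(50, 8), band(62, 8)])  # pure spectra of A, B, C

rng = np.random.default_rng(0)
C = rng.uniform(0.1, 1.0, size=(30, 3))          # concentrations of A, B, C in 30 mixtures

structure = C @ S                                # correlated, information-carrying part
noise = rng.normal(0.0, 0.01, size=structure.shape)  # unstructured measurement noise
X = structure + noise                            # what is actually measured

# No single wavelength is selective for A (every channel also sees B and/or C),
# yet X still carries the information needed to model A through the
# correlations among many wavelengths.
print(X.shape)                                   # (30, 200)
```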

1.7 Multivariate data analysis vs multivariate statistics

Multivariate data analysis and multivariate statistics are related fields
with no clear-cut division, but each has a different focus. In
multivariate statistics one is, in addition to the signal part, also very interested in the stochastic (random) error part of the data, the noise
part. This part plays an essential role for statistical inference,
hypothesis testing and other statistical considerations.
In multivariate data analysis on the other hand, the overall focus is
mainly related to the practical, problem-specific use (and
interpretation) of the structure part. Still the noise part is also
important here, amongst other things helping to get the most
information out of the structure part and, on the other hand, guarding
against overfitting a model. And one always needs to know
quantitatively how much of the empirical, observable, data variance is
information (structure) and how much is representing errors (noise) in
the data. This is typically reported as a “% proportion of the total
variance modelled” in the data.
Thus there is not a high wall separating these two fields, and there
are many situations where they overlap somewhat in practice. The
methods described in this book make use of univariate statistics to
estimate, e.g., critical limits and uncertainty for model parameters.
For typical multivariate data analysis purposes, the emphasis in this
book will mostly be on the structure part of the data.

1.8 Main objectives of multivariate data analysis


There are very many multivariate data analysis techniques. Which
method to choose depends on the type of answer that is relevant for
the application. It is very important that the data analytical problem is
formulated in such a way that the goal for the analysis is perfectly
clear and that the data are organised in a form suited for achieving
this goal. This is by no means a simple task and should form a major
part of the initial efforts of experimentation; there may at times be a
lot of work involved in these seemingly menial tasks, but this effort
always pays dividends, in the form of much easier data analysis and
interpretation.
When the problem and goal are both well specified, the choice of
technique is generally not difficult, and rather often it will be obvious
which technique to use. But first one must acquire an experience-
based overview of exactly what the seemingly alternative methods do
(and do not do) in order for this choice to become obvious. Making
this an informed choice is one of the main objectives in the present
textbook.
Multivariate data analysis is traditionally used for a number of distinctly different purposes, but for the present introduction these can be grouped into three main objectives:
1) Data description (exploratory data analysis; empirical data structure
modelling)
2) Discrimination and classification
3) Regression and prediction
Below follows an initial introduction to these typical data analysis
objectives.

1.8.1 Data description (exploratory data structure modelling)

A large part of multivariate analysis is concerned with simply looking for structure(s) in data, characterising the data by useful summaries
and finding useful object and/or variable relationships from a first
analysis. This is often done by displaying the latent or hidden data
structures in the X and Y matrices visually by suitable graphical
projection plots. As a case in point, the data in question could be
state parameter values monitored in an industrial process at several
locations, or measured variables (temperature, refractive indices,
reflux times etc.) from a series of organic syntheses—in general any
p-dimensional characterisation of n samples.
The specific objectives of univariate and multivariate data
description can be manifold: determination of simple means and
standard deviations etc., as well as establishing correlations and
functional regression models. For the example of organic synthesis,
one may naturally be interested in seeing which variables affect the
product yield or the selectivity of the yield as a function of the
experimental parameters. The variables from the synthesis could also
be used to answer questions like: how correlated is temperature with
yield? Is distillation time of importance for the refractive index?
In chapters 4 and 6, Principal Component Analysis (PCA) will be introduced. PCA is by far the most powerful and popular, as well as
the most frequently used, method for general data description and
exploratory data structure modelling of any generic (n,p)-dimensional
data matrix.

1.8.2 Discrimination and classification

Discrimination deals with the separation of groups of data. Suppose that one has a large number of spectra collected by Near Infrared
(NIR) reflectance measurements of apples and, after the data
analysis, it turns out that the measurements are clustered into two
groups—and that these correspond to, e.g., sweet and sour apples.
There is now the possibility to derive a quantitative data model in order to discriminate between these two groups (by using NIR spectra) for future, as yet undiscriminated, apples. The discrimination objective is identical when more than two groups are on the agenda.
Classification has a somewhat similar purpose, but here one
typically knows a set of relevant groupings in the data before
analysis, that is to say which a priori discriminated data groups are
relevant to model. Classification thus presumes the existence of a
relevant set of “training groups”. These may either be given directly
from the problem context, or they may have appeared in a previous
stage by way of an initial exploratory data analysis. This is but the first
example of a close interplay between objectives 1) and 2) above;
many more will be presented below.
In the above example, this would mean that at the outset one already knows that there are differences between sweet and sour apples; this is “supervised pattern recognition”, to introduce a term which will be elaborated below. The aim of the data analysis would now be to
assign, i.e. to classify, new apples (based on new measurements) to
the classes of sweet or sour apples. Classification thus requires an
a priori class description. Interestingly, the method of PCA can be
used to great advantage [Soft Independent Modelling of Class
Analogies (SIMCA) approach], but there are other, competing
multivariate classification methods to opt for as well. Here,
experience will be a great asset when choosing the appropriate
method to use.
There is a close relationship between discrimination and
classification:
Discrimination deals with dividing an unknown data matrix into two or
more groups of objects (measurements) according to their
multivariate signatures.
Classification deals with assigning new data (not present in the
original data matrix), to the set of already modelled data groups, data
classes.
In chemometric data analysis it is gratifying that the basic method
(PCA) may be used for both of these objectives, simply used in
slightly different fashions. The topic of Multivariate Classification is
covered in great detail in chapter 10.

1.8.3 Regression and prediction

Regression is an approach for relating two sets of variables (often
called “variable blocks”) to each other. It corresponds to modelling
one (or several) Y variables on the basis of a well-chosen set of
relevant X variables, where X in general must consist of more than,
say, three variables. Note how this setup is closely related to the
concept of indirect observation. The indirect observations would be
collected in X and the property, or properties, of interest would be
collected in Y. Regression is used widely in very many diverse
scientific and technological fields as well as in industry.
In this introduction the fundamental regression methods Principal
Component Regression (PCR) and Partial Least Squares Regression
(PLSR) are first presented, while making due reference to the
statistical approach Multiple Linear Regression (MLR). After this
foundation, the reader will also be introduced to more advanced
regression approaches, for example Support Vector Machines.
Prediction means determining Y values for new X objects,
quantitatively, or for classification purposes, based on a previously
established (calibrated and validated) model. A predicted value is
thus dependent both on the new X data but also on the whole
ensemble of X data that went into the training data set.
Næs et al. [1] provide an introduction to the “Multivariate
Calibration and Classification” problem which in many ways is a
parallel to the present book. Two related, highly merited textbooks within the multivariate regression and prediction field, both invoking a fuller statistical basis, are Multivariate Calibration by Martens & Næs [2] and Prediction Methods in Science and Technology by Høskuldsson [3]. These three textbooks are highly recommended as a next step up the competence ladder, but only after having accomplished the task set for the reader with the present introduction.

1.9 Multivariate techniques as geometric projections
There are several ways to learn about the multivariate techniques
treated in this book, which differ significantly in their approach.
The methods PCA, PCR and PLSR may, of course, be introduced
as data analytical/statistical methods, in which field they are known
under the name “bilinear methods”. The term denotes a mathematical
approach in which one starts with the fundamental mathematics and
statistics behind the methods. This is the preferred approach in many
statistical and data analytical textbooks, no doubt because of the
unparalleled success of mathematics as the universal language of
science.
It is also of relevant historical interest that when these methods
were introduced to chemometrics, a significant focus was on various
aspects of the algorithms and their implementation. This approach
often led to new method developments originating as new twists on the existing algorithms. Among others, Høskuldsson [3] traces this historic development in the much wider context of statistics proper and its
relationship with the sciences and technology, a view that will also be
indispensable for the reader’s further education.
However, this tradition will not be followed here. Instead a
geometric representation of all data structures will be used that can
be expressed as matrices [X, Y] to explain how the different methods
work—specifically by using the central concept of geometric
projections. This approach is highly visual, utilising the fact that there
is no better pattern recogniser than the human cognitive system. It is
this approach which distinguishes the didactics behind the present
textbook through all editions. An extensive experience has been
accumulated based on this approach, which has found a very wide
appeal in science, technology and industry. It is felt that this is a
superior approach precisely for this book’s introductory purpose.
Responses from many generations of students, as well as from practising data analysts in technology and industry, confirm this impression.
Following this, viewing data sets primarily as geometric points will
become more familiar as will viewing data as point swarms in a
multidimensional data space. This will allow a grasp of the essence of
most multivariate concepts and methods in a remarkably efficient
manner. In fact, the ability to perform multivariate data analysis
seemingly without having to master much of the underlying
mathematics and statistics will become common practice.

1.9.1 Geometry, mathematics, algorithms

It is appropriate, however, to point out that it is not the authors’ belief
that such a geometric approach is all that is required in the
multivariate realm. This is but an introductory textbook. After
completion, reading of higher level textbooks for additional theoretical
background in the mathematics and statistics of these methods is
suggested, to deepen the initial understanding of principles defined in
this book and how they work. Introduction to the specific algorithms
in this book are presented at a first foray level, but it is strongly
encouraged to pursue this avenue—albeit first after having secured
the solid phenomenological geometric projection understanding
offered here. Relevant references are supplied below as well as
referral to selected journal articles to serve these further needs.
Parallel to the present textbook there are several other key
introductory works on chemometrics, e.g. Wold et al. [4], Wold et al.
[5] and Jackson [6].

1.10 The grand overview in multivariate data analysis
Data analysis is the central issue here, whether univariate, multivariate, megavariate or hyperspectral (the latter two are but fashionable buzzwords for “multivariate”). The main purpose of this introduction is
to convey a solid foundation for the principles of problem-dependent
multivariate data analysis. This will be achieved if only the reader has
the stamina to work through the curriculum in the book with diligence
and personal responsibility! If the reactions from generations of students (and self-learners) and the number of copies sold of the five earlier editions are anything to go by, this book will provide a thorough, professional acquaintance with the selected set of the most important methods currently in use.
“Data quality” is an important issue, although this term needs
careful consideration, as it does not lend itself to direct, unambiguous
understanding. Many factors are included in the definition of data
quality, of which analytical uncertainty (analytical error) is only one
(but undoubtedly the one that first comes to mind), sample
preparation error would be another, as would contamination, storage
stability … Note that all these data quality elements are related to
what happens after the primary sample was obtained.
A third issue is therefore concerned with the preceding errors:
sampling errors. Sampling errors in practice almost always dominate the uncertainty budget and they are also intimately
connected with the effects and issues surrounding the major data
analysis theme of validation. This is such an important issue that two
completely new chapters have been included in the present edition of
the book, chapters 3 and 8, related also to chapter 9.
Sampling errors, data quality and validation are treated in full
depth in the chemometric and analytical literature by Esbensen &
Julius [7] and Esbensen & Wagner [8]. In this book a key overview of
these issues is provided in chapters 3 and 8.
Data quality assurance often starts with primary sampling which
takes on many different manifestations, on many scales, depending
on the heterogeneity of the target material as well as lot
dimensionality, size and geometry. There are quite specific rules to
the game of guaranteeing, and being able to document,
representative sampling at all scales from field/plant to analysis; these
can be found in chapter 3. Not many textbooks on data analysis take
the reader all the way back to the data origins. At other times, data
acquisition starts with an experimental design (DOE), discussed in
detail in chapter 11, which provides a much more direct data
generation situation. Data analysts must be aware of and in
command of all primary data generation situations.
If the primary sample, or if a sensor signal, is not representative,
the ensuing mass reduction (sub-sampling as secondary, tertiary
sampling), sample preparation before analysis, or digital signal
processing, as well as data analysis, will be compromised, and it will never be known to what degree—upon reflection this is a terrible situation for a data analyst to be in. No important decision should
ever be made based on data that are non-representative. This dictum
applies to stationary lot sampling, to process sampling as well as to
Process Analytical Technology (PAT). This understanding is a central
theme of this book and is covered in detail in chapter 13.
No data analysis (nor any kind of statistics) can ever make up for, nor remedy, faulty sampling or poor data quality (inaccurate or
imprecise data)! This is why a first understanding of the error-
generating issues across the full field-to-analysis-to-data analysis
pathway is necessary to become a proficient and competent data
analyst. It is tempting, but it is not enough, just to be competent with
respect to data analysis methods and algorithms. The sixth edition of
this book aims to give the reader the most comprehensive, relevant
and fully updated holistic introduction to multivariate data analysis
and all its prerequisites.
A comprehensive overview of the fundamental methodologies of
Design of Experiment (DoE) is provided in chapter 11. DoE is being
used more and more, particularly in the pharmaceutical and related
industries as a means to define the so-called Design Space. Within
this space, great understanding of the process/product has been
gained through the use of a systematic, rather than an ad hoc
approach to experimentation. No matter what industry or discipline,
DoE can find a place in the way experimenters plan and execute their
experimental protocols with the goal being to extract the maximum
amount of information from a minimum amount of experimental effort.
An introduction to the world of curve resolution and simple
structure is provided in chapter 12. This is a first foray into a more
advanced topic in multivariate analysis and shows how greater
interpretability may be achieved by rotating the multivariate space in
such a way that a physical, rather than a mathematical interpretation
may be possible. There are other “sneak-previews” of more
advanced aspects of data analysis and related fields in chapters 12–
14, brought here to give the reader a first taste for the life-long quest
of mastering chemometric data analysis and its foundations at an
ever increasing level.
Enjoy the learning experience with multivariate methodology
presented in this introductory textbook—the authors’ extensive
experiences shall be with you all the way, and we promise that you will become a competent chemometrician. The authors intend
to be quite specific on exactly how to reach this goal—and all the
learning goals established. There is not a single superfluous chapter,
section or paragraph in this book—everything and every chapter
matter. Enjoy the work in front of you!
1.11 References
[1] Næs, T., Isaksson, T., Fearn, T. and Davies, T. (2002). A User-
Friendly Guide to Multivariate Calibration and Classification, NIR
Publications, Chichester. ISBN 0-9528666-2-5.
[2] Martens, H. and Næs, T. (1996). Multivariate Calibration, Wiley,
Chichester. ISBN 0-471-90979-3.
[3] Høskuldsson, A. (1996). Prediction Methods in Science and
Technology, Vol 1. Basic Theory, Thor Publishing, Copenhagen.
ISBN 87-985941-0-9.
[4] Wold, S., Albano, C., Dunn III, W.J., Edlund, O., Esbensen, K.,
Hellberg, S., Johansen, E., Lindberg, W. and Sjostrom, W.
“Multivariate Data Analysis in Chemistry”, in B.R. Kowalski (Ed),
Chemometrics, Mathematics and Statistics in Chemistry, D.
Reidel Publ., pp. 17–95. ISBN 90-277-1846-6
[5] Wold, S., Esbensen, K.H. and Geladi, P. (1987). “Principal
component analysis – a tutorial”, Chemometr. Intell. Lab. Syst.
2, 37–52. https://doi.org/10.1016/0169-7439(87)80084-9
[6] Jackson, J.E. (1991). A User’s Guide to Principal Components,
Wiley. ISBN 0-471-62267-2
[7] Esbensen, K.H. and Julius, L.P. (2009). “Representative
sampling, data quality, validation – a necessary trinity in
chemometrics”, in Brown, S., Tauler, R. and Walczak, R. (Eds),
Comprehensive Chemometrics, Wiley Major Reference Works,
vol. 4, pp. 1–20.
[8] Esbensen, K.H. and Wagner, C. (2014). “Theory of Sampling
(TOS) versus measurement uncertainty (MU) – a call for
integration”, Trends Anal Chem (TrAC) 57, 93–106.
https://doi.org/10.1016/j.trac.2014.02.007
Chapter 2: A review of some
fundamental statistics

The science of statistics can, at first, seem daunting if it is approached from a purely mathematical point of view. However,
when applied to real experimental or process based data in practice,
statistical methodologies form an essential toolbox for revealing
information on system variability and stability. When correctly
applied, statistics can help in isolating critical parameters essential for
better understanding of complex systems, for example, sound
statistical principles can be used to systematically design
experiments that minimise the overall time and resource effort
required and at the same time maximise the amount of information
obtainable. This is the subject of chapter 11, Design of Experiments
(DoE) and will be discussed further there.
When statistical methodologies are applied to a single measured
variable, for example the pH or temperature of a chemical reaction,
this represents the regime called Univariate Statistics. This one variable at a time (OVAT) approach is used by many disciplines on a routine basis, especially in industrial settings, usually in the form of Statistical Process Control (SPC). Statistical methods have
received a recent revival in the form of the Six Sigma (6σ) initiative.
Six Sigma, as a philosophy, has revolutionised the way companies
like General Electric, Motorola, Boeing and Toyota, approach R&D
and manufacturing issues. The pharmaceutical and related industries
have also attempted to implement 6σ, however, due to the fact that
most of their processes are highly multivariate in nature, 6σ
methodology has met with mixed success. This has led to an
extension of the methodology known as Design for Six Sigma (DFSS).
DFSS places a much higher emphasis on DoE, but even more so, it
has highlighted the need for Multivariate Analysis (MVA) for studying
complex systems. For more detailed descriptions of Six Sigma and
DFSS, the interested reader is referred to the extensive literature
available [1–3].
Statistical tools study and manage the variability of a system, with
the aim of being able to predict such variability in the future.
Variability can be viewed as either a good or a bad thing, depending
on the problem context and application. In the analytical laboratory
for example, or on the manufacturing floor, high variability is a major
cause of lost productivity. From another point of view, high variability
in analytical results can lead to extended research, by missing a real
breakthrough due to poor measurement quality. On the other hand, if
a researcher is trying to classify new species or is trying to
understand diverse systems, high variability may be desired. Isolating
such variability and reducing it is the key task of the process
engineer.
In order to estimate the variability inherent in a system, a target
value is usually required for comparison. This could be in the form of
a “Golden Standard”, or a pre-defined specification (tolerance), but in
most cases, it is usually compared to a historical mean and variance
established over a long-time period or similar.
This chapter is not intended to be a definitive or exhaustive
description of basic statistics; there is very extensive coverage in the
literature. It is meant to be a quick reference point for understanding
the principles introduced in later chapters. Also, the focus is solely on
continuous variables; analysis of discrete and categorical data, while
highly important, is a complete topic onto itself and is well covered in
widely used introductory statistical texts, such as Montgomery [4].

2.1 Terminology
In the statistical literature, the concepts of population and sample are
central. A population is taken to mean a large (usually theoretical)
collection of observation values (in theory all possible values
pertaining to the context). A sample, in the statistical context, is a
small selected set of observations, taken from a population and
which is assumed to be able to represent that population. For
historical reasons the terminology used in statistics and in the Theory
of Sampling (TOS), refer to chapter 3, overlap for a few critically
important concepts, that can cause serious confusion if not defined
very clearly, foremost of which are the terms “sample” and
“sampling”. It is necessary to distinguish clearly between what can be
termed a samplestat vs a sampleTOS, (the first is a set of values, the
second is a physical sample extracted from a predecessor lot)—and
samplingstat vs samplingTOS (the sampling process employed in the
statistical and the physical sampling contexts respectively); these
issues are laid out more fully in chapter 3. In the present chapter,
usage of sample is mostly in the statistical sense unless carefully
noted.
The branch of statistics known as parametric statistics is founded
on the properties of the so called Normal Distribution. The normal
distribution (to be discussed in more detail in section 2.4) is described
by two parameters, the mean and the variance; hence the name parametric statistics. The normal distribution is
described by a bell-shaped curve where the majority of values of
observations are concentrated around the centre of the curve and
when moving away from the centre, in both directions, the
concentration of observations decreases rapidly. When the values of
the mean and the variance are calculated from a small sample of
observations, these values are called statistics rather than
parameters.
The normal distribution forms much of the basis of Inferential
Statistics, i.e. the ability to make inferences (or deductions) about the
large data set, the population, based on a small, but well understood
sample taken from the entire population.
Table 2.1: Terminology and symbols used in parametric statistics.

              Mean    Variance    Standard deviation
Parameter     μ       σ²          σ
Statistic     x̄       s²          s

Table 2.1 summarises some of the terminology and symbols used
for describing populations and samples. Population parameters are
expressed by Greek symbols, while sample statistics are expressed
as Roman symbols. The parameters/statistics listed in Table 2.1 will
be described in more detail in the following section.

2.2 Definitions of some important measurements and concepts
Often, when a data analysis is to be performed, it is desired to know whether the data “congregate around a specific value”. The mean value, i.e. the arithmetic mean, of a set of values is a measure of the central tendency of the measurements. In many applications, it is
expected that the mean value is as close as possible to some pre-
defined target value. But the mean, by itself, is not a meaningful
measure without some sort of knowledge of the spread of the data
around it. This is quantified as the variability, or the dispersion of the
data about the mean.
A number of commonly encountered situations exist which
describe the relationship between the mean and variance of a data
set. This can be illustrated using a “dart board” analogy. If the bull’s-
eye of the dart board represents the target value, then Figure 2.1a
represents a set of data with a mean value close to the target with a
very small variance (spread, or grouping of the darts). In this figure,
mean and accuracy are related to each other, in the sense that if the
mean is close to the target, the result is considered to be accurate,
i.e. there is no bias between the mean value and the target. Precision
and variance are also related to each other in the sense that if the
grouping of the observations is tight, the variability is small and
therefore the observations have been measured with high precision.
This situation represents a system measured under tight control and
on target. With statistical thinking, it is encouraged to think of the
target as the true mean value of the population, while the relatively
few data shown represent a statistical sample from which a
discernment of the characteristics of the population is required.
Figure 2.1b shows a situation in which the variance in the data has
increased; however, the overall mean is still close to the target value.
This situation occurs when there is a lot of variability, either in the
actual system (see chapter 3 where the same phenomenon is termed
heterogeneity), or introduced by the measurement system itself. It is
therefore the task of the analyst to determine the source of this
variability and, if possible, to manage the consequences hereof in the
data analysis or in the data modelling. Figure 2.1c shows a case
where the variability is small; however, the centre of this data (the
sample mean) lies well away from the target. This systematic
deviation is known as bias (statistical bias). Bias is sometimes
attributable to a drift, or step changes in a process/system, it may be
due to poor calibration/instability of the measuring equipment, or due
to other systematic reasons. There also exists a sampling bias (which
has a fundamentally different nature than the statistical bias and
which is often of very large influence), which is further discussed in
chapter 3.
Figure 2.1: Precision and accuracy as defined by the dart board analogy.

Figure 2.1d represents a poorly controlled system which results in
many a headache to the process analyst, analytical chemist or
sensory scientist for example, but may be the delight of the ecologist
or biologist looking for diversity. Therefore, the value that is perceived
in the data must be built into the objective of the experiment or into
the observation campaign. Either way, statistical sampling and data
analysis is often called for.
A convenient way of visualising real data, without having to
construct a dart board, is to draw a histogram. A histogram
(discussed in more detail in section 2.4.1) is a special form of bar
chart used to visualise the frequency distribution (i.e. the occurrence)
of the data. The shape of the histogram provides important
information regarding the nature of the system under study. For well-
controlled processes or natural systems of low variability, the
maximum point in the histogram should line up closely with the
expected mean value of the system (if there is such an expectation)—
at other times the statistical data analyst is only interested in
describing the empirical data structure, in which case the mean is
what the mean turns out to be—this case is termed exploratory data
analysis. When the spread around the sample mean is broadly
symmetrical and non-structured, indicating that the variability is
randomly distributed, the observation values are said to be normally
distributed. The tighter this spread, the smaller the variability is.

2.2.1 The mean

By definition, the arithmetic mean is the sum of the values of the observations divided by the number of observations. Mathematically, for N observations, sum the individual values and divide this sum by the total number of observations, N. Equation 2.1 provides the formal calculation of the mean.

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \qquad (2.1)$$

where x̄ is defined in Table 2.1, the subscript i identifies the individual x values being summed and N is the total number of observations.
The mean is sometimes referred to as a non-robust measure of
central tendency. Robust in this sense refers to a parameter or
statistics ability to relatively remain unchanged in the presence of an
outlier. Outliers, and their detection represent a most powerful
application of multivariate data analysis tools and they will be
described in great detail throughout the context of this book. To
demonstrate this principle of robustness, consider the following set of
data,

[1.0, 2.0, 3.0, 4.0, 5.0], i.e. a data set where N = 5

Using equation 2.1, the mean can be shown to be 3.0. Consider
now, for example, that an experimenter makes a mistake in the form of a transcription error (e.g. a misplaced decimal point) so that instead of writing down the value 5.0 for the last
observation, 50 was erroneously recorded. The mean value for the
new data set is now 12.0. This shows that the mean is stretched
toward the highest value in the set, i.e. the mean is sensitively
affected by an outlying value (or several outliers).
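As a minimal illustration (a sketch in Python, assuming only the standard library; the variable names are chosen for this example), the sensitivity of the mean to the transcription error discussed above can be verified directly:

```python
# The arithmetic mean (equation 2.1) with and without the transcription error
data_correct = [1.0, 2.0, 3.0, 4.0, 5.0]
data_error = [1.0, 2.0, 3.0, 4.0, 50.0]  # 5.0 mistyped as 50.0

def mean(values):
    """Sum of the observation values divided by the number of observations N."""
    return sum(values) / len(values)

print(mean(data_correct))  # 3.0
print(mean(data_error))    # 12.0 -- the mean is dragged towards the outlying value
```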

2.2.2 The median

Another measure of central tendency exists, which is robust to such
extreme values. This is known as the median and is referred to as a
non-parametric value. Non-parametric statistics is outside of the
scope of this text and the interested reader is referred to the
extensive literature available on this topic [5–6]. However, the median
has some interesting properties, which enables an analyst to
establish how symmetric a data set is.
Consider the data set described above in section 2.2.1. The
median is defined as the point where the data set is balanced, i.e.
there are equal numbers of observations on either side of the mid-
point of the set. In this case, there were five observations and using
the analogy of a seesaw, the median aims to place the pivot where
equal numbers of observations are on either side. Since N = 5 is an odd number, the pivot can be placed under the third observation (the value 3.0), resulting in two observations lying on one side of the pivot and two on the other side.
Figure 2.2: Pictorial representation of the median using the analogy of a balanced seesaw.

When N is an even number, the pivot cannot be placed under one
particular value. For example, if N = 6 for the following set of data,

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0], i.e. a data set where N = 6

if the pivot is placed under either 3.0 or 4.0, this results in an
unbalanced distribution of values around the pivot. To account for
this imbalance, the pivot must be placed at the centre point of the
observations between 3.0 and 4.0 (i.e. at 3.5). This results in there
being three observations on either side of the median. The seesaw
principle of the mean and median is shown in Figure 2.2.
To show the effectiveness of using the mean and the median for detecting gross (or extreme) points, return to the mean values calculated in this section for the data with and without the transcription error. The data are provided below for convenience.
1.0, 2.0, 3.0, 4.0, 5.0 and 1.0, 2.0, 3.0, 4.0, 50.0,

where N = 5 in both cases.


The previously calculated means for the two data sets were x̄ = 3.0 and x̄ = 12.0, respectively. Now, since N = 5 is odd, the median for both sets occurs at the value x₃. Note that in both cases, even when an extreme value is in the set, the value of the median is still 3.0. The median does not depend on the magnitudes of the values but only on their ranking (or ordering).
It should also be noted that in the case where no outliers are
present and the mean and median are equal, this is an indication of
symmetry in the data. For the data with an extreme value the mean is
12.0 and the median is 3.0. The mean is highly affected by the
extreme value (as it is a weighted sum) and this indicates that the
data set displays a non-symmetry (or skewness). When data are
skewed, this indicates the existence of infrequently occurring high (or
low) values. This departure from symmetry is further discussed in
section 2.4.1.
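To make the comparison concrete, the following short Python sketch (assuming only the standard library statistics module) contrasts the mean and the median for the data set with and without the transcription error:

```python
import statistics

data_correct = [1.0, 2.0, 3.0, 4.0, 5.0]
data_error = [1.0, 2.0, 3.0, 4.0, 50.0]  # extreme value caused by a transcription error

for data in (data_correct, data_error):
    # The mean is pulled towards the extreme value; the median is unaffected
    print("mean =", statistics.mean(data), " median =", statistics.median(data))
# mean = 3.0  median = 3.0
# mean = 12.0  median = 3.0
```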

2.2.3 The mode

Another frequently encountered measure in statistics for representing
central tendency is the mode. By definition, the mode is the most
frequently occurring value present in a data set. This may be an
important attribute for a set of data in specific cases, but no more will
be said here about the mode except for this definition.

2.2.4 Variance and standard deviation


Variance, also known as variability, is a major contributor to profit loss in industry and a major reason why experimenters may miss important observations when trying to distinguish between small effects. If an important difference in a system is not detected, one
possible reason is that no consideration has been given to the
measurement system’s or process equipment’s ability to perform a
given task. The Six Sigma initiative has helped companies to look at
the fundamental characteristics of the “gauges” used to evaluate
critical quality attributes. One should always endeavour to understand
the specific measurement systems variability characteristics in order
to be able to make the best possible decisions regarding the data
they generate.
A gauge is usually defined as any system that provides a single
number estimate of a particular attribute. This could, for example, be
associated with measuring the height of a number of military recruits
using a ruler or using High Performance Liquid Chromatography
(HPLC) for measuring the potency of a particular brand of
pharmaceutical tablet. If there is no ability to assess the specific error associated with the measurements obtained, the measurement uncertainty (MU), interpretation of such results may quickly become problematic, resulting in a situation of “flying blind” regarding all small(er) differences that may nevertheless be significant in context.
The formal calculation of variance for a population and a sample is
shown in equations 2.2a and b, respectively.

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 \qquad (2.2a)$$

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2 \qquad (2.2b)$$

where σ², s², μ and x̄ are described in Table 2.1, i represents the particular x value being summed and N is again the total number of observations. Note in equation 2.2b, the denominator is N – 1 rather than N. The term N – 1 is known as the degrees of freedom (dof) of the data and this will be discussed further in section 2.5.4.

Table 2.2: Analysis of Difference and Sum of Squares data in the Variance calculation.

Sample number      Observed value     Difference from the mean     Squared difference
1                  1.0                –2.0                         4.0
2                  2.0                –1.0                         1.0
3                  3.0                0.0                          0.0
4                  4.0                1.0                          1.0
5                  5.0                2.0                          4.0
Sum                (mean = 3.0)       0.0                          10.0

Further reflection on equation 2.2 reveals the following important points about the variance: the numerator represents the sum of the
squared differences of the individual values with respect to their
mean value. This measure is known as the Sum of Squares (SS) of
the data set and is a very important concept in statistics, especially in
Analysis of Variance (ANOVA) and Design of Experiments (DoE). Both
these topics are discussed in chapter 11.
The reason why the individual differences are squared can be argued in the following way, by revisiting the data used previously in section 2.2.1. Table 2.2 describes the calculations performed by equations 2.2a and b.
When the straight sums of the differences are considered, there is
no information obtainable about their spread. This is just an effect of
the balance of the data around its calculated mean. When the
differences are squared, then summed, a value representing the
spread of the data is obtained. Dividing this sum of squares value by
its degrees of freedom provides the effective average of this variance
measure.
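The calculations in Table 2.2 can be reproduced with a short Python sketch (standard library only). The sample standard deviation of equation 2.3, introduced below, is included for completeness:

```python
data = [1.0, 2.0, 3.0, 4.0, 5.0]
N = len(data)
mean = sum(data) / N                          # equation 2.1 -> 3.0
diffs = [x - mean for x in data]              # differences from the mean (they sum to 0.0)
ss = sum(d ** 2 for d in diffs)               # sum of squares (SS) -> 10.0
sample_variance = ss / (N - 1)                # equation 2.2b, N - 1 degrees of freedom -> 2.5
sample_std = sample_variance ** 0.5           # equation 2.3 -> approx. 1.58
print(sum(diffs), ss, sample_variance, round(sample_std, 2))
```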
An interesting property of the variance is that for two or more sets
of data, if the variances of the individual data can be shown to be
equivalent, i.e. of similar magnitude, they can be pooled for further
statistical analysis. This will be described further in section 2.5 on
hypothesis testing.
The major disadvantage of variance as a measure of spread is that
the units of measure are the squared values of the original variables.
For example, if the variance of measurements from a Vernier calliper is being established in millimetres (mm), then the units of variance will be mm². One way
to overcome this problem is to take the square root of the variance.
This is known as the Standard Deviation. The terminology and
symbols used for the standard deviation are provided in Table 2.1.
The formal calculation of the sample standard deviation is provided in equation 2.3.

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2} \qquad (2.3)$$

The associated population standard deviation is calculated by replacing N – 1 by N in the denominator.
The standard deviation provides a measure of the spread
expressed in terms of the original measurement units. It is frequently
used in hypothesis testing and for defining the characteristics of the
normal distribution.

2.3 Samples and representative sampling


As mentioned in section 2.1, inferential statistics helps to draw
conclusions about a larger set of data, based on a small, but
assumed representative statistical sample, taken from a population.
The key term used here is representative and this is directly
associated with the concept of random sampling (from a statistical
population), samplingstat.
The Theory of Sampling (TOS) is concerned with sampling
(extracting a small mass) from a physical lot which is characterised by
a significant spatial as well as compositional heterogeneity. Here a
“sample” is defined as a small mass fraction of the lot, which can be
documented to be representative of the total physical lot; this
process, which is of a distinctly different nature, shall be
distinguished as samplingTOS. These issues are discussed in full detail
in chapters 3, 8 and 9. A focus on representative sampling in the
statistical sense, samplingstat will be taken in this chapter in order to
be sufficiently familiar with some basic concepts, terms and the
statistical sampling process, mostly from a practical point of view.
Since it is impossible to measure every value of an individual’s or
an object’s characteristics (either because of sheer numbers or of
destructive or expensive testing), more often than not an
experimenter is at the mercy of what is available at the time. For
example, from a well-controlled and validated process, the output is
expected to exhibit very small variability in its quality measures. In
this case, the assumption is that units taken from the process at any
time (physical samples), can be expected to be representative of the
overall process, unless or until proven otherwise. Observe that the
thinking here is of random sampling of the process stream,
samplingstat.
Situations exist, however, where process operators, research
scientists or even market analysts know beforehand where, or when,
to take measurements a priori, to the exclusion of samples that will
not meet expected criteria. This is known as non-random or biased
sampling. In such a case, the units collected may not be
representative of the overall population and therefore incorrect
inferences could be made. The crux of the matter is that in such a
case it is never known how badly the resulting data is deceiving. This
is clearly an undesirable situation.
Whether data is collected from a process or from a natural
system, in statistical thinking such data is regarded as being taken
from the theoretical population of all possible values that can ever be
measured. Since it is impossible to collect all such data, inferences
have to be made about populations based on a small sampled data
set, randomly collected over a time period of relevance, for the
problem context. A (very) large collection of historical data may be
used as an approximation to the theoretical population, provided that
the units collected are random and representative. It is not easy to
assign a clear and unambiguous meaning to the critical attribute
“representative”, but, for example, the text by Box, Hunter and Hunter [7] offers an excellent discussion.

IMPORTANT NOTE: In order to make correct inferences about a
population it is necessary that both the samples used are
representative and that they have been measured using reliable
and precise methods. If one of these aspects of sampling and
measurement is inferior (or faulty), the statistical inferences made
run a significant danger of being suboptimal or directly unreliable.

2.3.1 An example from the pharmaceutical industry

The manufacture of pharmaceutical tablets is a highly complex
process consisting of a number of unit operations, including milling,
blending, drying and compressing into tablets. Of key importance to
the overall quality of the drug product is the uniformity of the drug
substance, in the blend. A common reference analytical method is
that of High Performance Liquid Chromatography (HPLC). From the
powder blend, small (unit dose) samples are taken from the blender
(which is considered as the population) at a small set of pre-defined
points (at approximately 10 locations) in order to assess the
uniformity over a standardised spatial arrangement in the blender.
This sampling scheme stipulates a few triplicate samples at pre-
selected locations within such an array. Throwing in a smattering of
basic statistics, the replicate sample analytical results are described
by local mean and standard deviations and compared to the same
statistics calculated for all 10 results. On this basis guidance
documents lead the quality control department to conclude whether
the blend uniformity is acceptable locally, generally, or not (the
relevant variations must not be larger than 5% of the overall mean
level).
Some researchers have recently shown that this sampling
approach, which is based on sampling spears (also known as
sampling thieves) is not representative, Muzzio [8], and this critique
was recently augmented by Esbensen et al. [9], who a.o. also showed
that the pre-defined sampling plan in no way can be considered as
random samplingstat; in fact, their technical and regulatory analysis
concluded with a call for a radical paradigm shift completely away
from this established practice, ibid. However, the standard spear
sampling approach has been in place for many years and is still
widely accepted; change often comes only with reluctance and
resistance.
In recent times the United States Food and Drug Administration
(US-FDA) introduced an initiative known as Process Analytical
Technology (PAT), refer to chapter 13. This was put in place to
encourage the pharmaceutical industry to adopt the latest
technologies for monitoring and understanding processes, with the
ultimate aim of reducing variability and thereby improving quality. One
of the key technologies used in PAT is Near Infrared (NIR)
spectroscopy. NIR has found important applications in nearly all industrial sectors, mainly due to its ability to analyse samples non-destructively and because it can often easily be implemented directly into a process stream for real-time analysis. The major advantage of being
able to measure a process in real time is that the measurements are
performed on “samples”, as they exist in the process. It is often
assumed that this approach is a statistically representative
samplingstat with clear advantages for process monitoring and control
(“all you need is a PAT sensor and some statistics dealing with the
analytical results”). It has been shown, however, that this apparent
universal “solution” is critically dependent upon whether the
principles laid down in the Theory of Sampling (TOS) are followed
strictly. These principles govern the physical sampling, samplingTOS of
the extracted units. Even if these units are actually not samplesTOS in
this case but PAT sensor spectra, Esbensen and Paasch-Mortensen
[10] showed that there is a complete duality between the sampling
situation as regards both physical samplesTOS and spectral signal
acquisition; see more details in chapter 3.
Spectroscopic methods (including NIR), PAT and proper process
sampling are topics dealt with in later chapters and form a large basis
for examples and cases therein. Interestingly they all critically rely on,
or are intimately related to chemometrics in general, to multivariate
calibration in particular.

2.4 The normal distribution and its properties


By normally distributed is meant that a set of data has a mean value (x̄) which is centrally located and that the distribution of the data around this mean is fairly tight and symmetric, with a variance (s²). The normal distribution is the most widely used of all known distributions in statistics, since very many univariately measured
variables in science, technology and industry often distribute
themselves in this manner. It also forms the basis of the hypothesis
tests described in section 2.5.
Even when the parent distribution is non-normal in character, it
can be sampled in such a way that the resulting values are normally
(or near normally) distributed. This is the topic of section 2.5.3.
There are many other distribution types that describe certain
specific data structures, and there also exist transformations (but not
necessarily always) that can transform these distributions into normal
like, or at least into symmetric distributions. These are outside of the
scope of this introductory textbook, however; below, data that
naturally follow the normal distribution are used. The interested
reader is referred to the literature for more information on data
transformation, Montgomery [11]. The following sections describe
some important features of the normal distribution, including
graphical aspects and special properties that make it suitable for
hypothesis testing purposes.
The usual notation, as found in the statistical literature, for a normal distribution is x ~ N(μ,σ²). This is to be read as: variable x is normally distributed (N) with mean μ and variance σ². Another common notation often met with is
the independently and identically distributed (i.i.d) characteristic. By
this statement, it is assumed that the individual measurement values
in a data set are independent from each other (i.e. random) and that
each measurement (value) comes from a population with the same
mean and variance. When data are dependent, i.e. one result relies
on the level of another; ways of determining the degree of such
correlation are required. Autocorrelation is briefly introduced in
section 2.6.

2.4.1 Graphical representations

The histogram
A Histogram is a special form of bar chart that plots the frequency of
occurrence of values falling in pre-defined data intervals, i.e. the data
range is divided up into a specified number of “bins” of particular
width, and the height of each bin bar is a reflection of how many
times data values lie in these bins. The histogram serves the purpose
of providing a graphical overview of the spread of a data set. A
general histogram is provided in Figure 2.3.
The following questions can be answered using a histogram:
Figure 2.3: The histogram of a normally distributed data set.

Figure 2.4: The most often encountered types of univariate data distributions.

1. Are the data spread evenly over a wide range of the variable
outcome space, or are they concentrated into only a small region?
2. Do the data form one coherent group, or do they form two, or
possibly more groups?
3. Are the data symmetrically distributed around a central point (i.e.
normally distributed) or are they skewed to the left or to the right?
(See further below.)
The human cognitive system is highly adept at detecting patterns
when data are displayed in simple graphical formats such as the
histogram. Figure 2.4 shows some of the most commonly
encountered types of univariate distributions.
To complement Figure 2.4, a few examples of such data structures
and their typical meanings are provided below.
Normal: A quality attribute of a tightly controlled process should
show a normal distribution of data with a mean value close to the
expected target value and a small spread, indicating tight control. In
regression analysis, the spread of the so called residual values around
zero, with a small variance is a requirement of least squares fitting.
Left Skewed: This type of distribution can be expected, for example,
when a system is under development and has not yet been
optimised. The target value lies to the right in the histogram and may,
for example, represent overall yield. The “tail” of the data tapers off to
the left of the graph, hence the name. The values to the left may
represent instances where the desired yield was not achieved.
Typical: This is the common shape of most experimental data and
may sometimes (but not necessarily always) represent a situation
somewhat close to the normal distribution.
Right Skewed: This is the usual situation, for example, when testing
for impurities in finished products. The majority of the measurements
are to the left of the plot, with the tail skewed to the right. The tail
indicates occurrences where high impurities are the exception, rather
than the norm.
Bimodal: This is the typical situation when two populations may exist
in the same data set.
Boxcar: This represents an even distribution of measurements over
the entire range. It is the desired distribution of data for multivariate
calibration development purposes discussed in chapter 7. A boxcar
distribution has a small value of kurtosis (“peakedness”), usually less
than 3; this is described in more detail below in this section.
For a small number of samples N, the so-called “small sample”
case in statistics, the shape of a histogram can be very misleading. It
has been suggested that histograms are best used when the number
of samples is greater than 75, Montgomery [4]. The reader will
undoubtedly quickly get a feeling for how, and to what extent, a
small-sample basis serves proper statistics, or not…
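As an illustration of this small-sample caveat (a sketch assuming numpy and matplotlib are available; the data are simulated), histograms of samples of size N = 20 and N = 2000 drawn from the same normal population can be compared side by side:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, n in zip(axes, (20, 2000)):
    data = rng.normal(loc=5.0, scale=1.0, size=n)  # x ~ N(5, 1)
    ax.hist(data, bins=15)                         # frequency of values falling in each bin
    ax.set_title(f"N = {n}")
plt.show()  # the small-N histogram can look misleadingly irregular or skewed
```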

Special properties of the histogram for normal distributions


Some other important properties of the normal distribution that are
essential and very useful in hypothesis testing are described in the
following,
1. The normal distribution is described by an exact mathematical function, known as the Gaussian function, equation 2.4

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (2.4)$$

where the entire notation is as usual and provided in Table 2.1.
2. The Gaussian function is a member of the exponential family of
distributions. The tails of the curve extend in principle to infinity in
either direction without ever touching the x-axis. The integrated
area under the curve is 1.00, i.e. from equation 2.5 the area is calculated as

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.00 \qquad (2.5)$$
3. Since the normal distribution can be viewed as a PDF, the sigma-
scaled (sigma representing the standard deviations) normal
distribution has the properties shown in Figure 2.5.
Since the histogram measures the frequency of occurrence of
values between two specified points (a bin), it can be interpreted as a
Probability Density Function (PDF). Using equation 2.5, and since the
total area under the curve is equal to 1.00, any interval area under the
curve can therefore be interpreted as a probability estimate.
Figure 2.5: Properties of the normal distribution in terms of a probability density function.

These properties allow for the determination of the probability of
an event’s likelihood, based solely on knowing the mean and
standard deviation of a data set. Section 2.5.3 describes a formal
statistical test for assessing whether a data set is normally
distributed.
From this definition, it is possible to calculate the probability of a
sample’s value lying in an interval between a specified region of the
PDF (a bin) using the following formula (equation 2.6)

$$P(a \le x \le b) = \int_{a}^{b} f(x)\,dx \qquad (2.6)$$
This can be represented graphically in Figure 2.6 where the
shaded region shows the interval for which the probability estimate is
being calculated.
Returning to Figure 2.5, the main features of the normal
distribution can be described in plain words as follows.
Figure 2.6: Using the normal distribution to find the probability of all values lying within the
interval between a and b.

Approximately 68% of the area under the normal distribution
curve lies between ±1 standard deviation around the mean.
Approximately 95% of the area lies between ±2 standard
deviations around the mean (this is where the definition of the 95%
confidence interval comes from, which is discussed in more detail in
section 2.5.2).
Finally, approximately 99% of the area lies between ±3 standard
deviations around the mean.
Practically, what this means is that if a measured variable value
lies at a distance >3 standard deviations from the mean of a normally
distributed data set, then that value has a very low probability of
belonging to that particular population (again, provided the data used
are representative).
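These interval areas can be computed directly from the cumulative distribution function. A minimal sketch, assuming scipy is available, evaluates equation 2.6 for the ±1, ±2 and ±3 standard deviation intervals of a standard normal distribution:

```python
from scipy.stats import norm

# Probability that x ~ N(0, 1) lies within k standard deviations of the mean,
# i.e. the integral of the PDF from -k to +k (equation 2.6)
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"P(-{k} <= x <= +{k}) = {p:.4f}")
# approx. 0.6827, 0.9545 and 0.9973
```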
Figure 2.7: The cumulative distribution for a normal data set and its representation using a
normal probability plot.

Skewness and kurtosis


Two important parameters that are used to describe the shape and
symmetry of a distribution are known as skewness and kurtosis.
Skewness is based on the third central moment of the distribution (the first moment being the mean and the second central moment the variance). Skewness acts to distort the normal distribution in a
particular direction (either left or right). Skewness is therefore a
measure of the asymmetry of the distribution and its effect is
measured as the coefficient of skewness (equation 2.7)

$$\gamma_1 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^3}{N s^3} \qquad (2.7)$$
Figure 2.8: Examples of normal probability plots for non-normal data.

The kurtosis of a distribution is a measure of its “peakedness”. It
is the fourth central moment of a distribution and is calculated using
equation 2.8

$$\gamma_2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^4}{N s^4} \qquad (2.8)$$
Skewness and kurtosis are used as initial statistical measures to
assess the degree of departure from a normal distribution. Positive
values of γ1 indicate positive skewness, negative values indicate
negative skewness. Values of γ1 greater than ±1 represent situations
of high skewness. Values close to zero are usually indicative of a
normal distribution. For kurtosis, values of γ2 less than 3 represent flat distributions, whereas values of γ2 larger than 3 represent distributions with sharper peaks and less dispersion than N(μ,σ²), Evans [12].
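As a small sketch (assuming scipy and numpy are available; the data are simulated), sample skewness and kurtosis in the sense of equations 2.7 and 2.8 can be estimated as follows. Note that scipy returns kurtosis on the γ2 scale used here (normal ≈ 3) only when fisher=False is specified:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
normal_data = rng.normal(size=5000)
right_skewed = rng.exponential(size=5000)  # long tail towards high values

for name, data in (("normal", normal_data), ("right skewed", right_skewed)):
    g1 = skew(data)                    # coefficient of skewness, cf. equation 2.7
    g2 = kurtosis(data, fisher=False)  # kurtosis, cf. equation 2.8 (normal -> approx. 3)
    print(f"{name:13s}  skewness = {g1:5.2f}  kurtosis = {g2:5.2f}")
```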

Normal probability plot


Another graphical representation of a distribution is to plot the points
representing the set values using a probability plot. When a data set
consists of only a small number of samples (between say 5 and 50), a
normal probability plot is an expression of the Cumulative Distribution
Function (CDF) on a scale that approximates the normal distribution
as a straight line. Deviations from this approximate straight line are
indicative of departures from normality. A generic example of the
cumulative distribution and a normal probability plot for a normal
distribution are provided in Figure 2.7.
Figure 2.8 provides some examples of the shape of some
probability plots and their interpretations.
Normal probability and cumulative distribution plots provide the
basis of formal statistical tests of normality and section 2.5.3
discusses one such much used approach known as the Kolmogorov–
Smirnov test.
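A normal probability plot of the kind shown in Figure 2.7 can be generated in a few lines (a sketch assuming scipy and matplotlib; the data are simulated). Strongly curved or S-shaped point patterns, as in Figure 2.8, indicate departures from normality:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(loc=10.0, scale=0.5, size=40)  # small sample, nominally normal

# Ordered data plotted against theoretical normal quantiles;
# an approximately straight line supports the normality assumption
stats.probplot(data, dist="norm", plot=plt)
plt.show()
```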

Box plots
A Box Plot (sometimes known as a “box and whisker” plot)
represents the distribution of a particular data set in terms of its
percentiles, separating the data into quartiles as illustrated in Figure
2.9.
Figure 2.9: The Box Plot and its graphic quartile distribution subdivisions.

Fifty percent of the data are contained within the central box part of the plot, while the whiskers (extending out to the extreme values) cover the outer fifty percent. The horizontal line found in the box part of the plot represents the median value.
The case where the whiskers are of equal size, and the median
lies in the centre of the box, is a clear sign that the data set is
symmetrically distributed (i.e. possibly normally distributed—but only
the kurtosis will tell). The vertical length (size) of the inner box and of
the whiskers represents the overall spread of the data. When the
sizes of the whiskers are different from each other, the longest
whisker signifies skewness in the data.
The box plot is a powerful tool for providing an initial overview of
data before further analysis. It finds major use when comparing the
uniformity and magnitudes of multiple variables in a single overview
plot. Figure 2.10 provides an example of a multiple box plot for
sensory preference data collected on raspberry samples used for jam
making.

Figure 2.10: Multiple box plots showing the relative distribution and median magnitudes of a
number of sensory variables measured and monitored in the jam making process.

From a practical point of view, the following properties of Figure
2.10 should be noted.
1. Since the data were collected on a scale from 1 to 9, this plot can
be used to ensure that all of the sensory panelists used the scale
correctly.
2. Both the highest and lowest preferred attributes can be visualised in one simple plot.
3. The variability of the panelists can be assessed for each attribute
highlighting those variables not described well—and also for
determining the type of scaling (if any) to apply to the data.
The use of descriptive statistics, including the use of box plots is
highly recommended as a first high level data overview, provided the
number of variables measured is not too large. As a general rule,
these plots are useful for less than a few tens of variables, but can
also have applicability in the analysis of spectroscopic data (where
the number of variables measured can be of the order of thousands).
The use of multivariate techniques provides a much better picture of
the data when the number of variables becomes large.
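A multiple box plot of the kind shown in Figure 2.10 is straightforward to produce. The sketch below assumes matplotlib and numpy are available and uses simulated panel scores on a 1 to 9 scale for three hypothetical sensory attributes:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
# Hypothetical panel scores (1-9 scale) for three sensory attributes
scores = {
    "sweetness": np.clip(rng.normal(6.5, 1.0, 60), 1, 9),
    "colour":    np.clip(rng.normal(5.0, 2.0, 60), 1, 9),
    "off-taste": np.clip(rng.normal(2.5, 1.2, 60), 1, 9),
}
plt.boxplot(list(scores.values()))
plt.xticks([1, 2, 3], list(scores.keys()))
plt.ylabel("panel score (1-9)")
plt.show()  # box = middle 50%, horizontal line = median, whiskers = spread
```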

2.5 Hypothesis testing


Up until now, focus has been given to the various
parameters/statistics and graphical methods used to describe the
distributional properties of small and large data sets.
A next logical question, from a statistical standpoint would be,
“how can two (or more) sets of data be compared to each other in
terms of equivalence?”. Equivalence may be expressed practically,
for example, by asking questions such as “is batch 1 similar to batch
2?” or “is analyst 1 better than analyst 2 at performing a specific
task?”. In order to answer such questions, however, we must set up a
hypothesis that clearly and unambiguously defines what similar or
better than means—in quantitative terms. This is the field of statistics
known as Hypothesis Testing.
Hypothesis testing employs use of a formal statistical
methodology to make inferences regarding one or more data sets.
Since these tests are based around the assumption of normality of
the data sets being analysed, they are known as parametric tests (it is
a requirement to know, or to have estimated, the respective
distributions parameters in order to be able to make the statistical
inference). When it cannot be safely assumed that the data are
normally (or near normally) distributed, the use of non-parametric
tests may be required, see references [5–6].
There are always two opposing statistical hypotheses that must be
established for a formal statistical test.
The Null Hypothesis: This hypothesis, usually denoted by H0, states
that there is no (statistical) difference between the two data sets
under investigation.
The Alternative Hypothesis: This hypothesis, usually denoted by Ha,
states that there is a statistical difference between data sets, with
respect to the Null Hypothesis.
Note: In statistics, hypothesis tests are based on probability
distributions and not on exact values, i.e. they represent the likelihood
of a difference. It is therefore essential to understand that statistics
cannot help an analyst to accept a hypothesis as being true, either H0
or Ha—statistics can only give quantitative help to reject one
hypothesis in favor of another.
In order to perform a formal statistical test, a measure of how large an observed difference must be before it is declared significant has to be established. This threshold is defined by the concepts of significance,
risk and power which are discussed in the next section.

2.5.1 Significance, risk and power

Establishing that an observed difference is statistically different does not necessarily mean that it is also practically different. There can be
a massive difference in the two definitions, and this has to do with
how many measurements are involved. In some cases, so many data
points may have been collected for a particular analysis that even the
smallest test difference may be found to be statistically significant. It
is always necessary to keep in mind a clear overview of the various
assumptions and prerequisites behind formal statistical hypothesis
testing.
One particular example taken from the pharmaceutical industry
will illustrate this. Machinery used to produce tablets is set to
manufacture within tight tolerances. In one particular production run,
the average thickness of the tablets manufactured was found to be
10.1 mm with a tolerance of ±0.05 mm. During the next
manufacturing run, the average tablet thickness was found to be 10.4
mm with the same tolerance. Since over 1000 tablets were measured
routinely during the quality control examination of the production, the
likelihood that the two production run means were statistically
different was quite high. However, the acceptable range for the
tablet thickness was between 10.0 mm and 11.0 mm (based on a
priori regulatory or in-house stipulations), making the observed
difference practically insignificant.

Significance level, risk and the concept of p-values


How is a statistically significant difference established? This is usually
based on definitions made in carefully defined experimental
strategies, or may be known a priori. It is also based on the risk that is
willing to be taken—in statistics focus is always on the risk of being
wrong! Risk and probability are closely related subjects; when a
statistical evaluation of data is made, it is done so accepting to some
degree there is a chance of making a wrong decision. Most formal
statistical tests provide a p-value along with other calculated
statistics for this purpose.
The p-value is defined as the smallest level of significance that
would lead to rejection of the null hypothesis and takes on values
between 0 and 1. p-values close to one usually indicate that a value
has a high probability of belonging to the population defined by the
null hypothesis, whereas values approaching 0 indicate that a value is
most likely coming from the population defined in the alternative
hypothesis. The p-values are usually computer calculated and must
be compared to the significance level stated for the test.
Risk, or significance level, is usually represented by the symbol α
in the statistics literature. For a more exact meaning of the significance level, refer to the discussion on the normal distribution in section 2.4.1 and, in particular, to Figure 2.5.
Provided the assumption of normality holds, approximately 68%
of observations should lie within ±1 standard deviation of the mean,
approximately 95.4% within ±2 standard deviations and
approximately 99.7% within ±3 standard deviations. This now forms
the basis for a formal statistical test. The general steps involved when
comparing two sets of data, usually include (but are not limited to)
1. Set up the null and alternative hypotheses.
2. Define the level of acceptable risk, α.
3. Perform the appropriate statistical test and reject one hypothesis in
favour of another.
With this procedure, a valid decision can be made about the test
hypothesis from a statistical point of view, but before proceeding any
further, there are some important issues to be considered regarding
risk. Risk must always be related to the specific context of the
problem at hand. If the statistical test is to be performed regarding a
critical measure, such as in a life or death situation, analysts will take
a more conservative approach regarding the risk they are willing to
accept. In this case they are willing to accept fewer borderline observations, thereby increasing the likelihood of making a type I error.
In terms of the null hypothesis, a type I error occurs if the null
hypothesis is rejected when in fact it is true. In the practical sense,
limits are being set on the values such that samples that are
bordering on the edges of being part of a particular population are
rejected for safe measure. Conversely, by increasing the likelihood of
a type I error, the likelihood of making a type II error is reduced.
Type II errors occur if/when the null hypothesis is not rejected,
when in fact it is false. This is the most serious of the error types and
is further explained by the use of an example. Suppose a
manufacturer produces a test kit for detecting bowel cancer. The
worst case that could occur is when the test shows up as negative,
when it is positive. It is therefore better to design the test kit in such a
way that the type I error is increased and the type II error is
decreased. In such a case, some tests may show up as positive,
when in fact they are not (hence the term false positive). This type of
result is very often considered as acceptable because it can be
nullified by further work and diagnostic tests. However, if a type II (or
false negative) was observed, the test would provide a negative result
for someone who actually did have bowel cancer. The consequences
speak for themselves; this is the kind of “confidence” nobody likes
from a visit to the doctor.
There is a broad carrying-over potential from this example to
other, similar situations, even though the critical “life-or-death”
characteristic is no longer literal. There are plenty of identical
situations in science, technology and industry where the
consequences of false negative outcomes of statistical tests take on
the same, grim appearance, e.g. “the new installation was subjected
to a proper statistical test for equivalence… but the factory exploded
anyway”; “…the O-ring material was tested extensively, but not at so
low ambient temperatures (it was in fact never tested at negative
temperatures)” (the Challenger shuttle disaster); or the old medical
adage: “the operation succeeded, but the patient died”. These three
examples also serve to point out that the context in which simple
statistical tests can meaningfully (and very powerfully) be carried out
may often take on considerably more complex appearances. It is the
job of the competent data analyst/statistician to cut to the
cause/effect core of such more elaborate scenarios.
When stated as mathematical expressions, Type I and Type II errors can be formulated as equations 2.9 and 2.10, respectively,

α = P(Type I error) = P(reject H0 when H0 is true)   (2.9)

β = P(Type II error) = P(fail to reject H0 when H0 is false)   (2.10)

The Power of a Statistical Test


From equations 2.9 and 2.10 the power of a test can be defined by equation 2.11,

Power = 1 – β = P(reject H0 when H0 is false)   (2.11)

From this definition, power is the probability of correctly rejecting the null hypothesis H0. Even though a little mathematical and
statistical prowess is needed to understand these probabilistic
statements and the statistical tests at a first encounter, it pays to be
diligent and work hard to understand these fundamental principles as
early as possible in any data analysis career.
The general procedure to be taken when defining a statistical test
is first to specify the risk (α, Type I) and design the test in such a way
that maximises power (minimises β, Type II). β is a function of sample
size, i.e. the larger the sample, the smaller is β, but it is also a
function of risk and of the signal-to-noise ratio (S/N). Through this S/N influence, the power of a test determines whether some critical test magnitude, say δ, can be detected with respect to the empirical spread (variance) or, inversely, the precision of the data. This is
further discussed in section 2.5.5 regarding t-tests.
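To make this concrete, the following is a small illustrative sketch (using Python/scipy, not part of the original text) of how the power of a two-sided, two-sample t-test grows with the number of samples per group for a given difference δ, spread σ and risk α; the noncentral t-distribution used here is the standard way to compute such power.

import numpy as np
from scipy import stats

def two_sample_power(delta, sigma, n, alpha=0.05):
    # Power to detect a true mean difference delta with n samples per group,
    # assuming equal variances and a two-sided test.
    df = 2 * n - 2
    nc = delta / (sigma * np.sqrt(2.0 / n))        # noncentrality (signal to noise)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)

for n in (5, 10, 30):
    print(n, round(two_sample_power(delta=1.0, sigma=1.0, n=n), 3))
# Larger n, and larger delta/sigma, give higher power (smaller beta).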

2.5.2 Defining an appropriate risk level

After taking the above short detour into proper statistical thinking, the
main question to be asked is, “what is an acceptable risk level?”.
There is unfortunately no hard and fast rule for setting a value for α.
However, the most commonly used value is α = 0.05—this is the
canonical choice if there are no special arguments for other choices.
This value signifies 95% confidence of making a right decision in
terms of the type I error (compared to the risk of 5% of making a
wrong decision). The value 95% is obtained using the 100(1 – α)%
criterion. Some other commonly used significance levels are 0.25, 0.1
and 0.01 representing 75%, 90% and 99% confidence levels,
respectively.
A p-value obtained for a particular test must be compared with
the significance level chosen. Table 2.3 provides some general rules
for interpreting the p-value at the canonical α level 0.05.
The concept of p- and α-values can be illustrated graphically as
shown in Figure 2.11; these snapshots of fundamental statistical
concepts have been found useful for scores of entry-level data
analysts for decades, as they embody the essential issues in a
remarkably simple format.
Acknowledging that around the mean, the probability of a
samplestat value in this vicinity is close to 1, then, as one moves
further and further away from the mean, the p-value approaches zero.
This means that there must be some point at which a decision must
be made as to whether the observation value (measurement) belongs,
or not, to the population defined by the null hypothesis. At α = 0.05,
there is an acceptance that there is a 5% (a 1 in 20) chance of being
wrong. If the p-value calculated is >0.05 this is indicative of a sample
belonging to the population defined by the null hypothesis. This is
another compelling argument to be in full command regarding
variance, i.e. always try to get measurement variances (indeed also
the sampling variabilities) estimated as validly as possible. Indeed,
the mandate is always to work diligently to curtail all influential variance-inducing phenomena and processes: it is often possible to reduce both sampling and measurement errors (variances) significantly, provided one has the necessary competences, see chapters 3, 8 and 9. Work of this kind cannot fail to give desirable results, among others in the form of increased statistical test power.

Table 2.3: Standard interpretations of p-values resulting from properly conducted statistical tests.

p-value Interpretation at the α = 0.05 level

>0.1 Insignificant

0.05–0.1 Potentially significant

0.01–0.05 Significant

<0.01 Highly significant


Figure 2.11: Confidence intervals with the normal distribution, defining test significance.

Another consideration that must be taken into account is whether the test is a one-sided or a two-sided test. The situations described
in Figure 2.11 are for the case of two sided tests. This issue is
described in more details in the next section.

One-sided and two-sided tests of significance

In section 2.5, hypothesis testing was introduced and, in particular, the concepts of the null and alternative hypotheses were discussed.
Focus is now turned towards the alternative hypothesis, Ha for the
purpose of discussing the issues of one-sided versus two-sided
statistical tests.
The alternative hypothesis Ha can be set up in two principal ways:
1. Ha: Population 1 ≠ Population 2 or
2. Ha: Population 1 > Population 2 (alternatively: Population 1 <
Population 2).
Case 1 above is known as a two-sided hypothesis since the
statement population 1 ≠ population 2 could mean that the values
obtained are either less than or greater than some hypothesised
difference. In most cases, this difference is zero, since testing for
equality is being performed.
Case 2 presents the two possibilities associated with a one-sided
hypothesis. In this case, a test of whether the difference between the
means of the two populations is greater than, or less than some
hypothesised threshold difference, δ is being performed. In both
cases, it is most important to state the form of the test up front, as
this determines the value of α to be used. This is shown below in
Figure 2.12 for the case of α = 0.05. Note that for a two-sided test, if
it is stated that α = 0.05, then both tails of the total distribution are
being investigated. Therefore, the test is assessing whether some
hypothesised value lies in the outer 2.5% tails of the distribution,
together making up a total of 5%.
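As a small numerical illustration (not from the original text), the critical values that correspond to α = 0.05 for the two cases can be obtained from the standard normal distribution with Python/scipy:

from scipy import stats

alpha = 0.05
z_two_sided = stats.norm.ppf(1 - alpha / 2)   # about 1.96: 2.5% in each tail
z_one_sided = stats.norm.ppf(1 - alpha)       # about 1.645: all 5% in one tail
print(f"two-sided: |z| > {z_two_sided:.3f}; one-sided: z > {z_one_sided:.3f}")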

Figure 2.12: The difference in risk for one-sided and two-sided hypothesis tests.

Confidence intervals
As the name suggests, a confidence interval is a statement of how
confident an analyst is that a particular value lies within some
tolerance (interval). There is a confidence, to some level of probability,
α, that the value x lies within a lower bound, defined as L and an
upper bound, defined as U. The 100(1 – α)% confidence interval is
defined by the alternative test types (one-sided or two-sided) and the
probability level (risk) one is willing to accept. The smaller the α value, the wider the confidence interval, i.e. there is a willingness to accept more type II errors. General statements of the confidence
interval are provided as follows and these are discussed in more
details in section 2.5.5 when introducing the t-test.
Two-Sided Confidence Interval: L ≤ μ ≤ U
One-Sided Confidence Interval: L ≤ μ or μ ≤ U
Here is an example to illustrate how to set up an appropriate
statistical test scenario: it is the purpose of the test to see if the mean
value of a samplestat is equivalent to some specified target value. Now
the novice “statistician” gets to work, and formulates this wish so as
to match the appropriate test framework: Equivalence is, in this case,
defined such that if the target value lies within the confidence interval
of the sample mean, there is no statistical reason to consider the
sample as being different from the target. Furthermore, if α = 0.05 is
chosen and the target value lies within the confidence interval of the
sample, there is a 95% level of confidence that the sample meets the
target value specification. This type of (re-)formulation of the practical
test objective into the framework of a standard statistical test and all
the, at first sight, intricate paraphernalia of risk, confidence, error
types and what-not, will take a lot of hard work if met with for the very
first time—and there is no easy fix. There are insights, experiences
and competences that can only be obtained by hard work. Do not
give up; there is a world of real power in mastering statistical
hypothesis testing (“it was all your competitors that gave up too
early…”).

2.5.3 A general guideline for applying formal statistical tests

Most of the parameters required to statistically describe populations and representative samplesstat taken from such populations have now
been defined. The following sections provide a suggested approach
to applying formal statistical tests for comparing two sets of data and
interpreting the results obtained. Other methodologies exist when we
want to compare more than two sets with each other, for which the
reader is referred to standard statistical textbooks.
A statistical test of normality

The first step in applying a parametric test to data sets is to establish whether they are distributed “normally” (or display near normality).
There exist a number of formal statistical tests that assess the
assumption of normality, examples include the Anderson–Darling test
[13], the Ryan–Joiner test [14] and the Kolmogorov–Smirnov (K–S)
test, Miller and Miller [15].
The K–S test is performed by comparing the Empirical Distribution
Function (EDF) of the data set under study with the expected
theoretical CDF for the same data set. The two distributions are
drawn on the same plot and in the case where they follow each other
closely (in the statistical sense), the assumption of normality cannot
be rejected. The usual output of the K–S test is either a plot of the
distributions or a normal probability plot. In the case where a normal
probability plot is used, departures from a straight line (along with a
p-value < 0.05) indicate a non-normal distribution (refer to Figure 2.8
for possible shapes of the normal probability plot and their
interpretation).
When the EDF and CDF are the chosen graphical output, the EDF is
commonly represented as a step function, while the theoretical CDF
is represented as a continuous curve. The K–S method tests the
hypothesis that the maximum vertical distance between the two
distributions is insignificant compared to a set of tabulated K–S
values. In order to evaluate this, a test statistic is generated by first
transforming all of the x-variables in the data set using the Standard
Normal Variate (z) transformation (equation 2.12),

z = (x – μ)/σ   (2.12)

where μ represents the mean of the empirical data set and σ is the
best possible estimate of the sample standard deviation. The
transformed variables are then placed in order, ranked, from lowest to
highest values and the expected cumulative frequency values (in steps of 1/n, where n = number of samples) are plotted for each transformed x-
variable as a stepped cumulative distribution plot. The maximum
vertical distance between each point in the step curve and the
theoretical distribution is calculated and compared to a K–S table. If
this distance exceeds the tabulated distance at a specified probability
level (usually p = 0.05), the null hypothesis, that the data come from a
normal distribution, must be rejected in favour of the alternative
hypothesis. An example output is provided in Figure 2.13.
A correction commonly applied when using the K–S method is the
“Lilliefors correction”. It is used to test the null hypothesis that the
data come from a normal population, when the null hypothesis does
not specify which normal distribution, i.e. does not specify the
expected value (the mean) and the variance. This test procedure
follows the original K–S procedure closely, and only differs in the way
it calculates the test statistic for assessing normality. For more details
on this procedure, refer to Miller and Miller [15].
In the case where a data set is found to be non-normally
distributed, there exists a family of power transformations that can be
used to transform non-normally distributed data into normally
distributed data for the purposes of statistical comparison. Two
common examples are the Box–Cox transformation, Draper and
Smith [16], and the Johnson Transformation, Chou et. al. [17]. Care
should be taken when using such transformations as the underlying
parent distribution is non-normal. This is outside the scope of this
introductory level and the interested reader is referred to the literature
for more information.

Example 2.1: Ten samplesTOS were randomly extracted from a


process line and their masses were measured as part of a periodic
quality control (QC) testing procedure. The measured values for the
masses obtained are listed in Table 2.4.
The question to be answered is: can this samplestat be considered
as coming from a normal distribution at the α = 0.05 level? To answer
this question, the K–S normality test was applied to the data in Table
2.4 which resulted in the statistics shown in Table 2.5.
These data can now be interpreted as follows: Since the
calculated K–S statistic is less than the Critical Value, there is no
reason to reject the null hypothesis (that the data come from a normal
distribution). This is overwhelmingly confirmed by the p-value of
0.9700 (> α = 0.05).
A final confirmation that the data come from a normal distribution
is to present the data as a normal probability plot, shown in Figure
2.14. In this plot the data lie very close to a straight line, providing further strong evidence of a normal distribution. The straight-line criterion for normality was discussed in section 2.4.1.
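For readers who wish to reproduce this check numerically, the following is a minimal sketch using Python/scipy (not the software used in the book); because the plain K–S p-value does not include the Lilliefors correction, the result will not match Table 2.5 exactly, but the conclusion is the same.

import numpy as np
from scipy import stats

masses = np.array([94.1, 93.2, 91.4, 95.1, 92.4, 91.2, 95.2, 94.3, 93.2, 95.3])

# Standardise with the sample mean and standard deviation (equation 2.12)
z = (masses - masses.mean()) / masses.std(ddof=1)

# Compare the empirical distribution of z with the standard normal CDF
ks_stat, p_value = stats.kstest(z, "norm")
print(f"K-S statistic = {ks_stat:.4f}, p-value = {p_value:.4f}")
# A p-value well above 0.05 gives no reason to reject normality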

Figure 2.13: An example of the cumulative distribution output associated with the K–S test.

Table 2.4: Quality measures (masses) from process data for Example
2.1.
Sample number Mass (g)

1 94.1

2 93.2

3 91.4

4 95.1

5 92.4

6 91.2

7 95.2

8 94.3

9 93.2

10 95.3

If the parent distribution is non-normal, normality may actually be induced by sampling the parent distribution; this is discussed immediately below.

Samplingstat of a distribution and the standard error

When samplesstat are taken from a non-normally distributed parent population, these samples may distribute themselves in any one of
the distributions shown in Figure 2.4. Since parametric tests are
based on the prerequisite of normality, ideally the collected
samplesstat should be normally distributed. One way of inducing
normality, even when the parent population is non-normal, is to
samplestat the parent distribution a number of times.

Table 2.5: Summary of K–S statistics for Example 2.1.

Kolmogorov–Smirnov statistic 0.1478

Critical Value (with Lilliefors correction) 0.2630


p-value 0.9700

Figure 2.14: Normal probability plot of data described in Example 2.1.

For example, say samplesstat are being drawn from a right skewed
distribution and it is decided to take five measurements at a time. By
taking the average of these values, even if there is one extreme value
in the set, their average will usually be closer to the majority of the
sample values than the extreme. By iterating this process numerous
times, this will result in new subset average values that are nearly
normally distributed. The number of times the parent distribution has
to be sampled to obtain normality is defined by the Central Limit
Theorem (CLT). The interested reader is referred to chapter 8* and the
literature for more details of the CLT, Montgomery [4].
Nevertheless, when a parent distribution is sampled many times, the averages of these sub-samplesstat provide an analyst with a much more confident estimate of the population mean than “grabbing” just one value at random. If the sample
size is n ≥ 5, there will definitely be more confidence in the values of
the variance observed, i.e. as manifested by the standard deviation of
the spread of the individual samplestat means. The value of n ≥ 5 has
been stated as a sufficient number of samples in order to provide a
reliable estimate of variance Massart et al. [18], but it is fair to say that
there is not universal agreement about this very small N.
Observe that this dispute, along with several others (all related to
“what is the minimum N needed in order to…”), often reflects the
evergreen desire to get away with the current task in data analysis
and statistics with the least effort—but even the present cursory
introductions should have made it abundantly clear that there is no
magical “small number, N” that will fix such problems.
The standard error (s.e.), often called the standard error of the mean, quantifies how reliably the sample mean estimates the population mean. It is defined by equation 2.13,

s.e. = σ/√N   (2.13)

If the population variance σ² is not known, the sample variance s² can be used in its place. This equation shows that the more
samplesstat taken from a population, the smaller the standard error
and therefore, the more confidence there is in this estimate of the
mean. The standard error is used extensively in defining confidence
intervals and will be referred to extensively when performing t-tests.
Figure 2.15 provides a graphical representation of the standard error
of the mean compared to the overall distribution of the observations
in a data set.
Figure 2.15: Illustration of the standard error of the mean.

The following provides a numerical example of how a non-normally distributed parent can be sampled in order to provide a
normally distributed sample. An initial population of 30 random
observations was taken from an experimental system (with unknown
distribution) and a K–S test of normality was applied. The results are
shown in Figure 2.16 and the statistics are summarised in Table 2.6.
The null hypothesis, that the data come from a normal distribution,
must be rejected, since p < 0.05. The population was then sampledstat
by drawing seven observation values at random and deriving their
average value. This was performed a total of 30 times (with
replacement) and the results of application of the K–S test, based on
these 30 subset average values, can be appreciated by Figure 2.17
and Table 2.7.
This didactic experiment shows that by repeatedly sampling a
non-normal distribution, a normally distributed sampled population of
sub-set averages can be obtained. In this case, the parametric tests
described above may now be appropriately applied with confidence
and the result is improved confidence in the proper statistical
inference opportunities.
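The same didactic experiment can be sketched in a few lines of Python/numpy (synthetic data, not the data used for Figures 2.16 and 2.17): draw a skewed parent population, average random sub-samples of seven observations and test both for normality.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
parent = rng.exponential(scale=1.0, size=30)          # strongly right-skewed parent

subset_means = np.array([rng.choice(parent, size=7, replace=True).mean()
                         for _ in range(30)])         # 30 sub-set averages

for name, data in [("parent", parent), ("subset means", subset_means)]:
    z = (data - data.mean()) / data.std(ddof=1)
    stat, p = stats.kstest(z, "norm")
    print(f"{name}: K-S = {stat:.3f}, p = {p:.3f}")

# Standard error of the mean (equation 2.13) estimated from the parent sample
print("s.e. =", parent.std(ddof=1) / np.sqrt(len(parent)))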
Figure 2.16: Cumulative distribution showing the characteristics of a non-normal distribution.

Table 2.6: Summary of K–S statistics for a non-normal population. The p-value less than 0.05 confirms this finding.

Kolmogorov–Smirnov Statistic 0.3372

Critical Value (with Lilliefors Correction) 0.1600

p-value < 0.05

2.5.4 A Test for Equivalence of Variances: The F-test

After a normality test has been applied to observations on two (or more) sets of data, the next step in hypothesis testing is to compare
the sets for equivalence of variance. Variance is directly related to the
total measurement error (MU), often called precision in statistics. In
order to compare two sample sets in a fair way, it should be
established whether their variances are equivalent. When the variance
(precision) of one set of observations is poor, compared to a second
set, this lack of precision means that it may be difficult to establish
whether a real difference exists between the two sets. This is shown,
by way of a rather extreme case, in Figure 2.18.

Table 2.7: Summary of K–S statistics for the re-sampled subset average values, indicating a normal population.

Kolmogorov–Smirnov Statistic 0.1037

Critical Value (with Lilliefors Correction) 0.1600

p-value > 0.05

Figure 2.17: Cumulative distribution of the re-sampled subset average values showing the characteristics of a normal distribution.
Figure 2.18: Schematic differences in variance between two data sets exhibiting identical
means.

The F-test, first proposed by Sir Ronald Fisher [19], calculates the ratio of the variances for two data sets. The null hypothesis is set
up such that it postulates that there is no significant difference
between the variances, versus the alternative hypothesis, that one
variance is greater than the other, i.e. this F-test application is usually
a one-sided test. If the null hypothesis cannot be rejected, the data
analyst is left with the practical interpretation that the ratio of the
variances is close to one (within the limits of the random variation).
When it cannot be assumed that the difference is due to random
variation, a significant difference between the two variances exists
and thus the null hypothesis is rejected. In section 2.5.5, a discussion
of the ways of comparing sample means when it cannot safely be
assumed that there is equivalence of variances is presented. The F-
ratio is provided in equation 2.14,

F0 = σ1²/σ2²   (2.14)

In the case where the population variances are known, equation 2.14 is the appropriate form of the F-test. If, however, the variance
must be estimated from two samplesstat, σ2 in equation 2.14 is
replaced by s2. The usual convention for application of the F-test is to
put the largest variance in the numerator; this ensures that the ratio of
variances is always greater than 1. The calculated test statistic F0 is
compared to an F-table (the so-called Snedecor F-table, Beyer [20])
for a specified number of degrees of freedom. The form of the test
statistic is as follows,

Fcrit = Fα,ν1,ν2

where α = significance level (usually set to 0.05), ν1 = degrees of freedom for observation set 1 and ν2 = degrees of freedom for
observation set 2.
A p-value is also generated for this test. If p > 0.05 (at 95%
confidence), then the null hypothesis cannot be rejected. If p < 0.05,
the null hypothesis (that the variances are equivalent) is rejected.
Questions such as “do the implemented process improvements result in less variability?” can be answered with the F-test. The outcome of the F-test also helps to determine what type of
t-test to apply when comparing the means of two samples (section
2.5.5).

Example 2.2: Returning to the mass data in Example 2.1, suppose a quality improvement initiative was undertaken to reduce the variability
in the masses of products manufactured. After implementation of the
initiative, a second sample of 10 units was taken from the production
lot and their masses measured. The combined data are provided in Table
2.8.
It was first shown, using the K–S test, that the new sample is also
normally distributed, therefore the F-test can be applied to these two
data sets. The sample variances were calculated using equation 2.2b
and the F-ratio was calculated using equation 2.14. The relevant
information is summarised in Table 2.9.
Since there are 10 samples in both data sets, the degrees of freedom (ν) are calculated as n – 1. In this case, ν1 = ν2 = 10 – 1 = 9 degrees of freedom. Therefore, the calculated test statistic F0 is
compared to the tabulated critical F-value with 9 degrees of freedom
in the numerator and 9 in the denominator.
The p-value for this example was found to be 0.026 meaning that
the null hypothesis that the two sample variances are equivalent must
be rejected in favour of the assumption that the new sample variance
is less than the original sample variance. In this case, the objective
has been verified, the new process improvements have indeed
resulted in less product mass variability. This can be displayed
graphically using a variance comparison plot as is shown in
Figure 2.19, where it very clearly can be seen that the variance for the
post improvement is much less than that of the pre-improvement
results, powerfully substantiated by the F-test, which shows this difference to be statistically significant.

Table 2.8: Mass results for two samples taken from a manufacturing
process.

Sample number Mass (g) (pre improvement) Mass (g) (post improvement)

1 94.1 93.4

2 93.2 92.9

3 91.4 93.5

4 95.1 94.8

5 92.4 93.7

6 91.2 92.4

7 95.2 94.1

8 94.3 93.7

9 93.2 92.8

10 95.3 93.5

Table 2.9: Summary of statistics calculated for an F-test of the data


in Table 2.8.

Sample 1 variance (s1²) 2.31


Sample 2 variance (s2²) 0.47

F0 4.949

F0.05,9,9 3.179

p-value 0.026
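The calculation behind Table 2.9 can be reproduced with a few lines of Python/scipy (a sketch, not the book's own implementation); the p-value of 0.026 quoted in Table 2.9 appears to follow the two-sided reporting convention, i.e. twice the one-sided tail area computed from the F-distribution.

import numpy as np
from scipy import stats

pre  = np.array([94.1, 93.2, 91.4, 95.1, 92.4, 91.2, 95.2, 94.3, 93.2, 95.3])
post = np.array([93.4, 92.9, 93.5, 94.8, 93.7, 92.4, 94.1, 93.7, 92.8, 93.5])

F0 = max(pre.var(ddof=1), post.var(ddof=1)) / min(pre.var(ddof=1), post.var(ddof=1))
df1 = df2 = len(pre) - 1                       # 9 and 9 degrees of freedom

F_crit = stats.f.ppf(1 - 0.05, df1, df2)       # F(0.05, 9, 9), about 3.18
p_one_sided = stats.f.sf(F0, df1, df2)
print(f"F0 = {F0:.3f}, F_crit = {F_crit:.3f}")
print(f"one-sided p = {p_one_sided:.3f}, two-sided p = {2 * p_one_sided:.3f}")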

A short discussion on degrees of freedom


In the last section and in section 2.2.4 on variance and standard
deviation, the term degrees of freedom (dof) was mentioned, but what are degrees of freedom formally? To answer this, the small data set introduced in section 2.2.1 will be used. These data are reproduced
below for convenience.

1.0, 2.0, 3.0, 4.0, 5.0

The mean was previously calculated as x̄ = 3 and this was the weighted average of the individual results, weighted by the inverse of
the number of observations (1/N). Say, as an example, that the mean
value of a data set was known and all but one value in the set was
also known. Then the calculation of the missing value is easily
performed using simple algebra. In general, the number of dof is
calculated as the number of samples minus the number of estimated
parameters. In the above case, the mean is an estimated parameter,
so there are 5 – 1 = 4 dof for the above sample set.
Figure 2.19: A variance comparison plot showing that the post-improvement data (right) has
a significantly less spread than the pre-improvement data (left).

Returning to the calculation of variance and standard deviation, equations 2.2b and 2.3 both contain the term N – 1 in the denominator, which is the degrees of freedom. This term is in the equation since an estimate of the sample
mean is required in order to calculate variance. As the number of
samples N gets larger, the term N – 1 becomes more and more
equivalent to N in practice. This is the reason why in equation 2.2a for
the calculation of the population variance, the denominator contains
the full term N as no parameters are estimated.
Degrees of freedom are used widely in statistical tests, especially
when comparing calculated statistics to tables of critical values (such
as the F- and t-tables). They represent a measure of the confidence in
the calculated values (remember from section 2.5.3 when N is large,
there is more confidence in the estimated mean). When the dof is a
large number there is a better chance of making the right decisions
and this is also related to the power and type II errors discussed in
section 2.5.1.

2.5.5 Tests for equivalence of means

It is often the case in science, technology and industry that, for example, a set of experimental trials is conducted in order to
determine if a response can be generated and detected (cause and
effect). The results obtained depend solely on the objective of the
experiment, for example, in process improvement applications,
questions like, “will a process modification result in a higher yield”
may be asked, or in the analytical chemistry laboratory questions
such as “is operator 1 equivalent to operator 2 in performing a particular test?” need to be answered. In the first case, improvement is the objective, whereas in the second case, equivalence is sought.
Such objectives must be established with respect to some defined
reference or target outcome.

Table 2.11: Summary of statistical tests used for comparing the equivalence of population and sample means.

Test Description and usage

Z-test Used when the number of measurements > 30


Uses the normal Z-distribution to define critical values
Requires a priori knowledge of the population variance

t-test Used for “small sample” cases (usually between 5 and 30)
Uses the t-table which scales the distribution based on the
number of measurements
Requires no prior knowledge of the population variance

When comparing two samplesstat for equivalence, inferences are being made about the two mean values (with the assumption that the
variables are continuous). In order to compare the means, information
is required about the spread of the data and also, a statement of how
confident the estimated means are. In section 2.5.3, a method was
discussed for calculating the confidence in the mean through the
standard error (s.e.). The standard error plays a key role in hypothesis
testing and calculating confidence intervals.
The type of test applied to compare sample means depends upon
a number of factors, in particular, the number of measurements being tested and whether the variance of the population from which the samplesstat are drawn is known a priori. The two tests available for
comparing the equivalence of means are listed in Table 2.11.
In this book, focus is given only to the t-tests postulated by Gosset [21] (under the pseudonym “Student”), and the interested reader
is referred to the literature for more information regarding Z-tests,
Montgomery [4]. It is stated here that as the number of samples
becomes larger, the t-table approaches the Z-table, so there is no
generality lost by focusing only on the t-tests. The t-tests described in
this chapter are also referred to as Student’s t-tests.
In Table 2.11 it was stated that the t-table is scaled depending on
the number of samples. This follows on from the above discussion on
degrees of freedom and that the more samples taken, the more
confidence there is in the estimated mean value. To compensate for
the lack of information gained from taking small samples, the t-
distribution is wider for small sample sizes and becomes more like
the standard normal (Z) distribution as N approaches 30 (this is
another outcome of the central limit theorem mentioned in section
2.5.3). Since the t-distribution is wider for smaller sample sizes, the
confidence intervals around calculated estimates are also wider to
account for the lack of precision. This is shown in Figure 2.20 where
the spread of the t-distribution is widest for small samples and
approaches the normal distribution as N approaches ~30.
Remember from Figure 2.5 in section 2.4.1 it was stated that
approximately 95% of the area under the normal distribution lies
within approximately 2 standard deviations of the mean. The actual
value from the Z-table is 1.96 standard deviations from the mean.
Note from Figure 2.20 that if N is ridiculously small, e.g. N = 3 (only two degrees of freedom), there is only 95% confidence when the confidence interval is calculated at a distance of 4.30 standard deviations from the mean.
A suggested procedure for applying the t-tests described in the
following sections is as follows:

Figure 2.20: The t-distribution is scaled based on the number of samples in the set. When
this number approaches 30, the t-distribution is equivalent to the Z-distribution.

1. Test for normality of the observations (by applying the K–S test).
2. Test for equivalence of variance (F-test).
3. Apply the appropriate t-test (described in the following sections).

The one-sample t-test

In some applications, comparison of the mean of a samplestat set to some standard or target value is performed. Usually, the target is
specified by company policy as meeting some critical quality criteria,
or it may be specified as part of government regulations, especially in
areas such as environmental analysis and in the healthcare sector.
In the case where only one sample set is collected and it is to be
compared to a specified target value, this is known as a one-sample
t-test. The form of the test is described by equation 2.15,

t0 = (x̄ – μ0)/(s/√N)   (2.15)

where t0 = test statistic to be calculated, x̄ = sample mean, μ0 = specified target value, s = sample standard deviation (calculated from
equation 2.3) and N = number of measurements used to calculate the
mean. Note that the denominator of equation 2.15 is the standard
error as defined by equation 2.13. The numerator is calculating the
difference between the estimated mean of the sample set and the
specified target value. The closer this value is to zero, the closer the
sample mean is to the target. When the numerator is small compared
to the denominator, this means that the confidence interval of the
observed mean is likely to enclose the target value. This is shown in
Figure 2.21 where the specified value μ0 is enclosed within the
confidence interval of the mean. Remember, this confidence interval
is determined based on the number of samples used to calculate the
mean and becomes wider when N becomes smaller.

Example 2.3: Returning to the data in Example 2.1, the specified


target for acceptable product is 95 g. It must be determined whether
the target value lies within the 95% confidence interval of the sample
mean, i.e. at the α = 0.05 level of significance.

Figure 2.21: For the mean value of a samplestat to be effectively equivalent to the target
value, the target should lie in the confidence interval of the estimated sample mean.
Note that this is a two-sided test because there is interest in
testing the following hypotheses,

H0: x̄ = μ0

Ha: x̄ ≠ μ0

In the calculation of the critical t-value (tcrit), the form of the statistic is as follows,

tcrit = tα/2,ν

The value α/2 indicates that the test is two-sided and therefore at
95% confidence, the 2.5% tails of the t-distribution are being tested
(refer to Figure 2.12). Since there is only one sample, ν represents the
degrees of freedom for the sample set and is equal to N – 1. For this
example, the number of samples is 10, therefore the dof is 9 and the
critical t-value has the form t0.025,9. Table 2.12 provides the details of
the calculation.
The null hypothesis can only be rejected if |t0| > tα/2,ν. In this case,
since 3.04 > 2.262, it was concluded that the mean value of the
sample set does not meet the specification for this product. The
calculated p-value also suggests that the null hypothesis should be
rejected, i.e. p < 0.05.

Table 2.12: Summary statistics of the one-sample t-test applied to the data of Example 2.3.

Sample mean 93.5

Specified target μ0 95.0

Sample standard deviation (s) 1.5

N 10

t0 –3.04
tα/2,ν 2.262

p-value 0.015

As was described for the F-table, the relevant portion of the t-table is reproduced in Table 2.13 to illustrate how the critical value is found.
A 95% confidence interval around the sample mean can be
calculated using equation 2.16. This equation is just a re-expression
of equation 2.15 in the form of an equality,

x̄ – tα/2,ν (s/√N) ≤ μ0 ≤ x̄ + tα/2,ν (s/√N)   (2.16)

By defining this relationship, if μ0 lies within the confidence interval of the sample mean, there is no reason to reject the null hypothesis.
Substituting the calculated values from Table 2.12 into this
expression results in the following confidence interval.

92.3 ≤ μ0 ≤ 94.8

Table 2.13: Relevant part of the table of critical t-values for Example
2.3.

dof (ν) Significance level (α)

0.1 0.05 0.025

8 1.397 1.860 2.309

9 1.383 1.833 2.262

10 1.372 1.812 2.228

Since the specified value of 95 g does not lie within the 95%
confidence interval of the sample mean, the null hypothesis that the
sample mean is equal to the specified mean value is rejected in
favour of the alternative hypothesis: these means are statistically
different.
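The same one-sample t-test and confidence interval can be sketched in Python/scipy as follows (rounding in the book's tables means the printed values may differ slightly in the last digit).

import numpy as np
from scipy import stats

masses = np.array([94.1, 93.2, 91.4, 95.1, 92.4, 91.2, 95.2, 94.3, 93.2, 95.3])
target = 95.0

t0, p_value = stats.ttest_1samp(masses, target)         # two-sided by default
t_crit = stats.t.ppf(1 - 0.05 / 2, df=len(masses) - 1)  # t(0.025, 9)

half_width = t_crit * masses.std(ddof=1) / np.sqrt(len(masses))
ci = (masses.mean() - half_width, masses.mean() + half_width)   # equation 2.16

print(f"t0 = {t0:.2f}, t_crit = {t_crit:.3f}, p = {p_value:.3f}")
print(f"95% confidence interval for the mean: {ci[0]:.1f} to {ci[1]:.1f}")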

The Two-sample t-test

The two-sample t-test is used to compare the equivalence of means for two independent sets of observations. By independent, it is
assumed that the results are random samples from a population that
are not influenced by the results of another population. A typical
example is the taking of samples from two different manufacturing
batches of a material and comparing the mean results for
equivalence. The test can be set up for two principal situations,
1. Test for the equality of means using the assumption of equal
variances.
2. Test for the equality of means using the assumption of non-equal
variances.

Equal variance assumption: When it can be assumed that the variances of the two sets of observations are equivalent, i.e. the null hypothesis of equivalence of variance cannot be rejected using an F-test, the form of the t-statistic is defined by equation 2.17:

t0 = (x̄1 – x̄2)/(sp √(1/n1 + 1/n2))   (2.17)

The numerator contains the term x̄1 – x̄2 which measures the difference between the two set means. The closer this value is to 0,
the more likely the two sets of observations come from the same
population. The denominator contains the term sp, which is called the
Pooled Standard Deviation, and is defined by equation 2.18,

sp = √[((n1 – 1)s1² + (n2 – 1)s2²)/(n1 + n2 – 2)]   (2.18)

The pooled standard deviation is a measure of the common
spread of the two populations and can only be representative of both
populations when the variances are tested to be equivalent (F-test).
Returning to equation 2.17, the denominator represents another
form of the standard error, this time taking into account the pooled
variance of both sets. Therefore, the t-statistic is here a measure of
the ratio of the difference between two sample means and the overall
pooled precision associated with the data sets.
The two-sample t-test can be either one-sided or two-sided. The
null hypothesis and alternative hypotheses are usually set up as
follows:

H0: x̄1 = x̄2 (i.e. no difference)

Ha: x̄1 ≠ x̄2 (two-sided)

or

Ha: x̄1 < x̄2, x̄1 > x̄2 (one-sided)

A p-value > 0.05 (or |t0| < tcrit) indicates that the null hypothesis
cannot be rejected, i.e. there is no difference between x̄1 and x̄2. A p-
value < 0.05 (or |t0| > tcrit) suggests that the sets of observations are
significantly different and therefore the null hypothesis must be
rejected.

Table 2.14: Mass data for example 2.4.

Sample number Mass (g) (post improvement 1) Mass (g) (post improvement 2)

1 93.4 94.2

2 92.9 95.1

3 93.5 93.3

4 94.8 94.2

5 93.7 93.8
6 92.4 95.1

7 94.1 94.3

8 93.7 92.9

9 92.8 93.7

10 93.5 94.7

Table 2.15: Summary data for the two-sample t-test performed in Example 2.4.

Sample mean 1 93.5

Sample mean 2 94.1

Difference (x̄1 – x̄2) 0.65

Pooled Standard Deviation (sp) 0.67

n1 10

n2 10

t0 –2.063

tα/2,ν 2.101

p-value 0.054

Example 2.4: Again, using the data from Example 2.3, a second set
of mass data was collected on the improved process to determine
whether the changes made a real impact or whether they were just
the result of chance. The data are provided in Table 2.14 and the
calculations for the t-test are provided in Table 2.15.
The first step is to establish normality and equivalence of variance
of the data sets. This was performed using the K–S and F-tests and
concluded that the equal variance assumption t-test can be used.
It is important to note that in this case the p-value is marginal (i.e.
between 0.05 and 0.1, refer to Table 2.3) so a decision must be made
as to whether a significant difference exists or not. This is a
case where the difference between statistically significant and
practically significant must be made, i.e. is the mean value of 93.5 g
really different from 94.1 g? So much for the power of properly
conducted statistical hypothesis tests—but perhaps the more
interesting real-world question here is: “Why are the data not centred
around 95 g, the specified process target?”.
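For completeness, the equal-variance two-sample t-test of Example 2.4 can be reproduced with Python/scipy (a sketch, assuming the F-test has already justified pooling the variances):

import numpy as np
from scipy import stats

post1 = np.array([93.4, 92.9, 93.5, 94.8, 93.7, 92.4, 94.1, 93.7, 92.8, 93.5])
post2 = np.array([94.2, 95.1, 93.3, 94.2, 93.8, 95.1, 94.3, 92.9, 93.7, 94.7])

t0, p_value = stats.ttest_ind(post1, post2, equal_var=True)   # pooled-variance form
t_crit = stats.t.ppf(1 - 0.05 / 2, df=len(post1) + len(post2) - 2)

print(f"t0 = {t0:.3f}, t_crit = {t_crit:.3f}, p = {p_value:.3f}")
# |t0| just below t_crit and p just above 0.05: the marginal case discussed above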

Figure 2.22: Situations describing equivalent and non-equivalence of sample means using
the two-sample t-test.

Further comments on the power of a statistical test: When the two-sample t-test is applied, what is being asked is, “is there a significant difference between sample 1 and sample 2?”. Figure 2.22 shows this
for two situations, to the left the case where there is no significant
difference between the samples, and to the right the case where a
significant difference exists.
The value x̄1 – x̄2 is the observed difference between the means
from the two sample sets. In some cases, the t-test is used to
determine if this difference can be distinguished from zero (in the
statistical sense) and in other cases, it is a specified difference δ that one wishes to detect. This is set up as a hypothesis test and depends
primarily on three factors, the variance of the two sample sets, the
number of samples measured and the risk. As was mentioned
previously (section 2.2) variance and precision are related to each
other. Therefore, the more precise measurements can be made, the
greater the likelihood of detecting smaller differences, e.g. between
the two means.
In the situation where an estimate of the precision is not available,
an estimate of the mean can be improved by taking more samples. It
was shown above that the precision of the mean is a function of the
square root of the number of measurements. Therefore, if one wishes
to double the precision, 4 measurements must be made, but to
increase the precision 4-fold, there is a need for 16 measurements
and to improve 10 times, 100 measurements are needed. This can
soon become practically impossible, so all attempts should be made
to improve the precision of the measuring system instead, if critical
decisions need to be made. There is here a direct link with chapters
7, 8 and 9, which continues this discussion.

Figure 2.23: Graphical representation of making type I and type II errors.

Figure 2.23 provides a graphical representation of the relationship between type I and type II errors. Type II errors are usually considered
the most business critical and are directly related to the power of the
test. Power was defined previously as (1 – β) in section 2.5.1.
The probability of making a type I error is α and this defines the
rejection region of the hypothesis test. However, since the sample
distributions overlap, it is possible that the test statistic will fall into
the acceptance region, even though the sample comes from a
population with true mean μ1. This region is denoted by β, i.e. the
probability of making a type II error. In order to minimise the risk of
making such an error, it must be stated again that the precision of the
samplingstat and measurement methods must be fully understood.
Alternatively, increasing the sample size serves to increase the
precision estimate of the mean, therefore maximising the power of
the design.

Comparison of two dependent means: the paired t-test

When alternative sets of observations come from measurements performed on the same samplestat, the assumption of independence is
no longer valid. An example of dependent data sets could for
example concern measuring the durability of the soles of shoes made
from two different materials. In this context, the cases (shoes) are
tested in the same way by fitting them to a specified number of
people (different sole materials for each foot) and measuring the
durability each sole shows after a given time. In this case, wherever
one shoe goes so does the other, therefore there is dependence. The
difference in the durability values is a measure only of the difference
between the materials used to make the shoe soles.
Paired t-tests are commonly used to test the equivalence of
operators performing similar tasks, or for comparison of a new
analytical method with respect to an established method. In the
calculation of the paired t-test statistic, a number of other important
statistics are calculated along the way, including the bias between the
results and the Standard Deviation of Differences (SDD), usually used
for establishing the standard error of the laboratory associated with the
measuring system.
The form of the paired t-statistic is calculated using equation 2.21,

t0 = d̄/(sd/√N)   (2.21)

The term sd represents the SDD and is calculated using equation 2.22,

sd = √[Σ(dj – d̄)²/(N – 1)]   (2.22)

where dj is the observed difference between a pair of measurements and d̄ is defined as the average difference between them (bias). Bias, the systematic difference between the two series,
is further described in the following text.
Equation 2.21 is similar in form to the t-statistic formulas
discussed earlier, but here everything centres on the mean difference
between the two series. In this case, the numerator contains the term d̄ (the mean difference between the two sets of measurements). The closer d̄ is to zero the more likely the two sets of observations come from the same population. The denominator contains the term sd/√N which is just another way of expressing the standard error (precision) of the mean differences between the observations. When the term d̄ is found to be statistically significant, there is reason to believe that a
real difference exists between the sample sets.

Example 2.5: A producer of health care products wants to replace an existing destructive analytical test with a non-destructive rapid NIR
method of analysis. In order to determine if the new method is
equivalent to (or better than) the existing method, it was decided to
run a paired t-test on 10 representative samplesTOS from the process.
This can be done because the NIR method is non-destructive. The
test procedure therefore first involved scanning the samples with NIR
and then submitting the same samples to be analysed using the
existing (destructive) method. The results obtained are provided in
Table 2.16 along with the differences between the pairs of
measurements, the mean difference and the standard deviation of
differences.
Table 2.16: Data from analytical method comparison for Example 2.5.

SampleTOS Existing method NIR method Differences

1 84.63 83.15 1.48

2 84.38 83.72 0.66

3 84.08 83.84 0.24

4 84.41 84.20 0.21

5 83.82 83.92 –0.10

6 83.55 84.16 –0.61

7 83.92 84.02 –0.10

8 83.69 83.60 0.09

9 84.06 84.13 –0.07

10 84.03 84.24 –0.21

Mean 84.06 83.90 0.159

Std Dev 0.34 0.34 0.57

Table 2.17: Summary statistics for the paired t-test performed in Example 2.5.

t0 0.882

t0.025,9 2.262

p-value 0.401

It was first checked and shown that the two measurement series both come from a normal distribution (K–S) and also that the
differences form a normal distribution and that the variances of the
two sets of observations were equivalent (F-test). The test statistics
are provided in Table 2.17.
Because there were 10 differences calculated, there are 9 degrees
of freedom associated with the test. In the case where the variances
are found to be significantly different, the paired t-test is not performed; however, it is then not necessary either: it can simply be stated that the method with the lowest variance would be the preferred method (provided its mean value is close to target, which is a matter of validation of the analytical method).
The results in Table 2.17 show that |t0| < tcrit, (0.882 < 2.262),
therefore the difference between the two methods is statistically insignificant and the null hypothesis cannot be rejected, i.e. the
difference between the two measurements series could not be said to
be statistically different from zero. The practical conclusion was,
therefore, that the NIR method could replace the existing method with
confidence, thus reducing the use of resources and eliminating
product loss to destructive testing.
Note that the critical t-value in Table 2.17 (2.262) is the same as
the one calculated for Example 2.3 and displayed in Table 2.13. This
is because in both examples, the data sets consisted of 10 samples,
therefore when reference is made to the t-table, 9 degrees of freedom
were used in both cases. Therefore, the t-table is independent of the
form of the test used.
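The paired t-test of Example 2.5 can likewise be reproduced with a short Python/scipy sketch (not the book's own implementation), which also yields the bias and the standard deviation of differences directly:

import numpy as np
from scipy import stats

existing = np.array([84.63, 84.38, 84.08, 84.41, 83.82,
                     83.55, 83.92, 83.69, 84.06, 84.03])
nir      = np.array([83.15, 83.72, 83.84, 84.20, 83.92,
                     84.16, 84.02, 83.60, 84.13, 84.24])

d = existing - nir
d_bar, s_d = d.mean(), d.std(ddof=1)            # bias and SDD (equation 2.22)
t0 = d_bar / (s_d / np.sqrt(len(d)))            # equation 2.21

t0_check, p_value = stats.ttest_rel(existing, nir)   # same test, two-sided
print(f"bias = {d_bar:.3f}, SDD = {s_d:.2f}, t0 = {t0:.3f}, p = {p_value:.3f}")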

Figure 2.24: An example time series plot with a histogram. The horizontal lines are control
limits.

The point about samplingstat vs samplingTOS having been made clearly in this chapter, the index “stat” or “TOS” may occasionally be omitted, but only when there is no danger of misunderstanding.
2.6 An introduction to time series and control
charts
The methods investigated to this point have produced what is known
as static analyses of data. Static analyses are a summary of the
results collected over a specific period of time. However, what if the
time dependence of the data distribution is to be investigated? This is
the process realm, a process = a time-varying system (much more to
be said about this in chapter 13). The time dependent data structure
is best displayed in a time series plot. The time series plot shows a
measured variable’s distribution over a time interval, where all the
time periods (intervals) between points are equal (equidistant
observation times).

Figure 2.25: Example of a time series plot showing a step change in the process. Note that
the marginal histogram (and its fitted distribution) is now non-normal.

The main advantage of the time series plot is that it allows an analyst to visualise the time-dependent nature of the data
distribution. For example, if a data set shows random and constant
distribution around a target value over time, this situation represents
a stable system and is illustrated in Figure 2.24. Also, shown in Figure
2.24 is the marginal distribution of the data around the mean in the
form of a histogram.
Figure 2.26: Time series plot of cyclic data. Note that the histogram looks normally
distributed and is therefore unable to detect the cyclic structure of the data.

If, for instance, the data showed trending, a step change or even
cyclic behaviour, static measures are not useful in detecting when
such changes occurred. The marginal histogram can only summarise
the final data structure. The data in Figure 2.25 shows the effect of a
step change in a process. Note that the distribution is now skewed,
indicating a change in the process, but it cannot locate when this
event happened. This can only be addressed using the time series
plot.
There is a direct link between hypothesis testing and time series
plots when applied to process or other data collected over time. The
time series chart is sometimes known as a run chart, where it is
expected that data is collected at regular time intervals. In a run
chart, the mean is usually plotted as a horizontal line and the variation
of the data around the mean can be visualised and further
investigated.
Time Series Analysis (TSA) [22, 23] is used to forecast the
possibility of a future event. It uses techniques such as linear
regression (chapter 7) and methods such as smoothing and de-
trending (chapter 5) to isolate the main patterns in the data. From
these patterns, models are developed that best describes future
events. This, of course, is highly dependent on the reliability and the
representativity of the historical data used to develop the model.
Therefore, TSA models are usually updated on a regular basis. Figure
2.26 provides a time series plot of what is known as cyclic data
(periodic data). Cycles may be due to many behind-the-scene
influences, e.g. they may reflect seasonal changes—or may be based
on external drivers that occur due to inflation and deflation of
economies over certain time periods. Again, if a histogram is fitted to
the cyclic data, it only provides a static indication of the variability of
the data as a whole and cannot detect the cycles in the data.
Figure 2.27 provides the same data as Figure 2.26, however, this
time the smoothed time series is plotted together with the original
data. Smoothing very often helps to better visualise the cyclic nature
of the data.
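As a minimal, hedged illustration of this kind of smoothing (the data are simulated, not taken from the figures, and the 11-point window is an arbitrary assumption), a centred moving average in Python/numpy can reveal a cyclic structure hidden by noise:

    import numpy as np

    # Synthetic cyclic process data: a sine cycle plus random noise
    rng = np.random.default_rng(0)
    t = np.arange(200)
    y = 10 + 2*np.sin(2*np.pi*t/50) + rng.normal(0, 0.8, size=t.size)

    # Centred moving average with an (assumed) window of 11 points
    window = 11
    kernel = np.ones(window) / window
    y_smooth = np.convolve(y, kernel, mode="same")

    # The smoothed trace reveals the cyclic structure hidden by the noise;
    # a histogram of y alone would not.
    print(np.round(y_smooth[:5], 2))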
Regarding time series plots in general: if the data are symmetrically (or near symmetrically) distributed around their mean value in a random fashion, for example similar to the data in Figure 2.24, one can be reasonably confident that the data are near normally distributed around the mean. The variation, or the
spread of the data with respect to the mean would provide an
indication of how close the overall data are to this mean. In process
applications, this is of primary importance as it is the goal of any
company to produce goods with as little variability as possible. If the
mean value of the data set is close to a specified target, it could be
said that the process, or system, is behaving as expected (the system
is “capable”). This is very similar to the scope of the problem
discussed in section 2.5.5 on one-sample t-tests.
Figure 2.27: Plot of original time series data (blue) with the smoothed data (red) overlaid to
better visualise the cyclic nature of the data.

If it can be established that the data are normally distributed around the mean, then it is feasible to set up Control Limits on the run chart based on prior knowledge of the standard properties of the normal distribution described in section 2.4. The run chart now becomes a
Control Chart and these were first introduced by Shewhart in 1931
[24]. Very little detail of how control limits are set up will be provided,
except to state the fundamental characteristic that control limits are
usually set at approximately ±2 or ±3 standard deviation limits around the mean value, representing approximately 95% and 99% confidence
intervals, respectively. Figure 2.28 provides a schematic example of a
control chart.

Figure 2.28: A schematic example of a control chart.

The usual convention in a control chart is that the lower control limit is denoted by LCL while the upper control limit is denoted as
UCL. Some control charts display both the 95% and 99% limits on
one chart and these are called the Warning and Action limits,
respectively. When a sample lies between the warning and the action limits, it has entered the tails beyond the 95% region of the normal distribution, and there is a risk of making a type II error regarding such points. If multiple
points in a row are found to lie in this region, this is usually indicative
of a step change, or a bias, in the system. This is the basis of
Statistical Process Control (SPC) discussed briefly in the introduction
to this chapter.

Figure 2.29: An example of a control chart showing warning and action limits.

In most modern manufacturing processes, a number of systems are implemented for detecting what are known as Special Cause variations. These are described in detail in the book by Montgomery [4] and, if any such warnings indicate a developing problem, Engineering Process Control (EPC) can be used to send a feedforward/feedback signal to the process equipment to make changes aimed at rectifying the problem.
It is common practice to stop the process once an action limit has
been crossed. Figure 2.29 shows the control chart with warning and
action limits with respect to a normal distribution of data. It also
illustrates the relationship of these limits to the hypothesis tests
discussed in this chapter.
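A minimal sketch of how such limits can be computed and points flagged, assuming normally distributed data and the ±2 and ±3 standard deviation warning/action convention described above (the process values are simulated for illustration only):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=100.0, scale=2.0, size=60)   # in-control process data
    x[45] += 7.0                                    # simulate a special-cause event

    mean, s = x.mean(), x.std(ddof=1)
    warn_lo, warn_hi = mean - 2*s, mean + 2*s       # warning limits (approx. 95%)
    act_lo, act_hi = mean - 3*s, mean + 3*s         # action limits

    for i, xi in enumerate(x):
        if xi < act_lo or xi > act_hi:
            print(f"sample {i}: beyond action limit ({xi:.2f})")
        elif xi < warn_lo or xi > warn_hi:
            print(f"sample {i}: between warning and action limits ({xi:.2f})")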

2.7 Joint confidence intervals and the need for multivariate analysis
The methods for control charting are excellent ways for monitoring
one variable, or a very few variables at a time. They are primarily used
to determine when a single parameter, be it a predicted quality
measure, or the direct output from process equipment is out of
control or out of calibration. These are important events to detect as
they can lead to detrimental results if left undetected, but the major
flaw with univariate SPC is that it fails to take into account the
relationship, known as the covariance, of the variables with one
another.
Covariance, and its scaled version, correlation (see chapter 1), is a
key topic of this textbook and is key to understanding the multivariate
methods presented hereafter. A very simple situation that will form
the basis for much of the multivariate thinking presented in this book
will be used as an illustration. The simplest multivariate situation
occurs when two variables are measured that are acting on a single
system. What happens when both of the variables temperature and
pH of a certain system are measured?

Figure 2.30: Individual control charts of temperature and pH showing no indication of out of
control situation.
Figure 2.31: The covariance structure of temperature and pH. This highlights the process
failure in a multivariate data space.

It is known that temperature and pH influence each other, based on sound chemical and physical principles. If both variables are
measured simultaneously on a single system over time, simple
control charts can be generated for these variables individually along
with appropriate control limits. As the process continues, more and
more data points are collected on both variables. When a point arises where a critical quality parameter is out of specification, yet analysis of the control charts in Figure 2.30 reveals no reason to suggest that temperature and pH could have led to this out of specification (OOS) situation, a definite problem exists. It is also
possible that the conclusion drawn here would be that temperature
and pH are providing no real information on quality. As is shown in
Figure 2.31, this couldn’t be further from the truth.
An alternative simple multivariate chart can be created by plotting
the same information (the same data) in the two control charts
(perhaps appropriately scaled) in a scatter plot. The scatter plot
simply consists of plotted data pairs taken at the same time on the
control chart. For convenience, the control charts are plotted in
Figure 2.31 in such a way as to reveal the covariance structure in
these data sets.
It is now clear from the covariance structure, as visualised in the
scatter plot, that at one point in time, a data pair breaks the overall,
strong linear pattern shown by the majority of all other points. It is
here that the crucial deviation in quality occurred, but this would have
remained undetected with a solely univariate SPC approach. If Figure
2.31 is further investigated, the boxed area bounded by the limits of
the two control charts in the scatter plot represents the allowed
univariate region of variability. This is why the suspect point went
undetected. To fully understand the allowed region of variability, the
Joint Confidence Interval of the two sets of data must be taken into
account. This is found by looking at the distribution of the data in the
histograms. By plotting the confidence intervals of the two data sets
together, it is found that the resulting Multivariate Interval is defined
by the ellipse surrounding the data in Figure 2.31.
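As a hedged sketch of this idea (the temperature and pH numbers are simulated, not the book's data), one simple way to make such a point visible is its squared Mahalanobis distance from the centre of the data cloud: an observation can sit inside both univariate ranges yet lie far outside the joint ellipse.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    temp = rng.normal(80.0, 1.0, n)                           # simulated temperature
    ph = 5.0 + 0.4*(temp - 80.0) + rng.normal(0, 0.05, n)     # pH correlated with temperature

    # Inject one observation that respects both univariate ranges
    # but breaks the covariance structure (high temperature with low pH)
    temp = np.append(temp, 81.5)
    ph = np.append(ph, 4.6)

    X = np.column_stack([temp, ph])
    centre = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))

    # Squared Mahalanobis distance of each observation from the centre
    d2 = np.einsum("ij,jk,ik->i", X - centre, inv_cov, X - centre)
    print("most extreme observation (index, d^2):", d2.argmax(), round(d2.max(), 1))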
The direction of this ellipse is defined by whether the variables are
positively or negatively correlated to each other. If the confidence
ellipse is positively sloped from bottom-left to top-right, then the
variables are said to be positively correlated to each other. If the
ellipse slopes in the reverse direction, then the variables are said to
be negatively correlated. The slope of the line formed by the two
variables determines the degree of correlation between them. A slope
close to ±1 (on an equal scale) indicates high correlation whereas a
slope close to zero indicates little to no correlation. Covariance and
correlation were introduced in chapter 1 and provide deep insight into
the multivariate structure of data. The so-called Autocorrelation of
data is a key aspect of TSA and is used to test the independence of
data [22]. Independence is absolutely critical in the definition of
randomness and should be established before the application of
formal statistical tests to data.
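A brief, hedged sketch of such an independence check via the sample autocorrelation (simulated data; the lag-1 choice is arbitrary):

    import numpy as np

    def autocorr(x, lag):
        """Sample autocorrelation of series x at the given lag."""
        x = np.asarray(x, dtype=float)
        x = x - x.mean()
        return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

    rng = np.random.default_rng(3)
    white = rng.normal(size=500)                   # independent data
    walk = np.cumsum(rng.normal(size=500))         # strongly dependent data

    for name, series in [("white noise", white), ("random walk", walk)]:
        r1 = autocorr(series, 1)
        # For independent data, |r1| should stay within roughly 2/sqrt(N)
        print(f"{name}: lag-1 autocorrelation = {r1:.3f}")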
Whether it is applied to industrial or research applications, the
reliance on solely univariate approaches can potentially mask
important interactions such as in the example presented in this
section. In the chapter on Design of Experiments (DoE) (chapter 11),
ways of systematically designing rational experiments are discussed
that are used to detect the presence of interactions and then
quantify their impact on some quality or other response variable.
When it can be established which variables are dependent on
each other and which ones are not, suitable control systems can be
developed that can detect the onset of failure before it becomes an
issue. The multivariate methods discussed in this book form the basis
of Multivariate Statistical Process Control (MSPC), which is currently a
hot topic for industrial applications. For example, measuring single variables univariately can lead to the detection of gross problems associated with instrument failure, mis-calibration or
a real process event. The combination of all process variables acting
simultaneously can lead to the detection of subtle interactions that
are detrimental to a system's normal operation, but would otherwise
go unnoticed. This is the role of the multivariate model. Such
endeavours are usually guided by the results of a good experimental
design, if there is a possibility to carry out such preemptive
investigations.
Finally, the outputs of both the univariate and multivariate models
can be used for root cause analysis of events, or they can be used in
control loops that utilise the results of forecasted values (from time
series models) to correct or shut down the system before an event
becomes a problem. This is the most proactive approach to smarter
manufacturing and avoids the mistakes of the past made by taking
reactive approaches. And it all comes down to the professional
competence of the data analyst to be in full command of Multivariate
Data Analysis (MVDA).
2.8 Chapter summary
Univariate statistical methodology, performing samplingstat, was introduced in this chapter; the most commonly used approaches relate to what is known as parametric statistics. Parametric methods
can only be applied to observations that display a normal (or near
normal) distribution. The concept of normality is best described by
the standard Gaussian bell-shaped curve, for which many statistical
and probability tables are available to describe the various data sets
encountered in real world situations.
To assess the normality of a set of observations, the Kolmogorov–Smirnov (KS) test was discussed; applying this (or another valid test such as the Anderson–Darling or Ryan–Joiner tests) is the first step in determining whether the observations can be assessed using parametric statistics or not. If the observations do not meet the
criteria for a normal distribution, non-parametric statistical methods
have to be used, which are outside of the scope of this introductory
textbook.
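A small, hedged sketch of this kind of normality screening, assuming scipy is available and using simulated measurements (note that, strictly speaking, estimating the mean and standard deviation from the same data makes the standard KS p-value approximate, which the Lilliefors correction addresses; the sketch only conveys the workflow):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    data = rng.normal(loc=50.0, scale=3.0, size=80)   # simulated measurements

    # KS test against a normal distribution with parameters estimated from the data
    mu, sigma = data.mean(), data.std(ddof=1)
    stat, p = stats.kstest(data, "norm", args=(mu, sigma))

    if p > 0.05:
        print(f"p = {p:.3f}: no evidence against normality; parametric tests may be used")
    else:
        print(f"p = {p:.3f}: normality rejected; consider non-parametric methods")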
The concept of variance (a central theme throughout the entire
textbook) was described in terms of single variables and the
comparison of sets of observations for equivalence of variance. This
is a highly important topic, as variance is the framework within which the data analyst determines how much of a data set can be modelled. In the case where there is no variance, the precision of
the results is said to be perfect. When two data sets can be
compared, and shown to be equivalent because of identical empirical
variances, this is an indication that the two sets of data were
collected under similar levels of precision. It is only when there is a
significant difference, or shift in variance between two data sets that
further investigation is warranted. If, for example, in analytical chemistry, two methods are being compared for equivalence, the method with the lowest variability will generally be preferred (all other factors being equal) since it is the method of higher precision.
To deal with the complementary feature of accuracy, methods such as the t-test are used to assess the equivalence of means. In particular, three tests were discussed (a brief computational illustration follows the list below).
1. The single sample t-test that compares the mean of a set of
observations to a single tolerance value. This is used regularly in
industrial applications where a set of data is compared to a
specification to ensure the process has not drifted from target.
2. The two-sample t-test is used by industry and research alike, for
example, to determine whether two alternative treatments or
process conditions result in observations that are statistically
equivalent. This test may, for example, help to indicate whether a facility can produce two batches of the same material consistently, or whether two methods of sample preparation lead to the same predicted
values when using a specific analytical method.
3. The paired t-test is used mainly to test the equivalence of samples
that have been split in such a way that they can be compared as
being drawn from exactly the same population. It is commonly
used to test the equivalence of primary and alternative test
methods to see if there is any significant bias between the
methods. Another area of common application is in product
development, where, for example, the compound for the soles of
shoes is to be tested for equivalence. One shoe (left foot) would be
made of compound A and one shoe (right foot) of compound B.
Since the individual wearing the test shoes would impart the same
wear and tear on both shoes, i.e. where one foot goes, so does the
other, a test for equivalence of wear and tear for the two
compounds can be made.
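The brief computational illustration referred to above is a hedged sketch of the three tests using scipy.stats (an assumption about the available tooling; all numbers are simulated and purely illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)

    # 1. One-sample t-test: does the batch mean differ from a target of 100?
    batch = rng.normal(100.4, 1.0, size=10)
    t1, p1 = stats.ttest_1samp(batch, popmean=100.0)

    # 2. Two-sample t-test: are two process conditions equivalent in mean?
    cond_a = rng.normal(100.0, 1.0, size=12)
    cond_b = rng.normal(100.8, 1.0, size=12)
    t2, p2 = stats.ttest_ind(cond_a, cond_b)

    # 3. Paired t-test: primary vs alternative method on the same split samples
    primary = rng.normal(100.0, 1.0, size=10)
    alternative = primary + rng.normal(0.3, 0.2, size=10)   # small systematic bias
    t3, p3 = stats.ttest_rel(primary, alternative)

    for name, p in [("one-sample", p1), ("two-sample", p2), ("paired", p3)]:
        print(f"{name} t-test: p = {p:.3f}")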
In many industrial applications, Statistical Process Control (SPC)
is used to look for process deviations in real or near real time. The
main outputs are control charts that are a measure of the equivalence
of means, i.e. a single sample t-test compared to a target value and
the variance within the samples for the subsample size used. While
SPC can be used as an effective root cause tool in many situations,
its major drawback is that it considers the variables measured as
being independent of other variables measured in a process or other
situation. By neglecting the covariance between variables, there is a high risk that when a process is going out of control, the individual variables may not be capable of detecting the fault, because it is a combination of in-specification variables that is causing the issue.
This is where Multivariate Analysis (the central theme of this textbook)
comes into practice as it not only provides information about the
variables being measured, but also their correlations with other
variables being measured.
The rest of this textbook is dedicated to solving the multivariate
data analysis problem and only refers to univariate methods of analysis
when a certain hypothesis needs to be tested based on a multivariate
result. Before delving into this multivariate world, however, the next
chapter presents another basic topic that must be mastered before
starting to perform multivariate data analysis: the topic of representative samplingTOS.

2.9 References
[1] George, L.M., Rowlands, D. Price, M. and Maxey, J. (2005).
Lean Six Sigma Pocket Toolbook, McGraw Hill.
[2] Brue, G. and Launsby, R.G. (2003). Design for Six Sigma,
McGraw Hill.
[3] Yang, K. and El-Haik, B. (2003). Design for Six Sigma, A
Roadmap for Product Development, 1st Edn, McGraw Hill
Professional.
[4] Montgomery, D.C., (2005). Introduction to Statistical Quality
Control, 5th Edn, John Wiley & Sons.
[5] Conover, W.J. (1998). Practical Non-Parametric Statistics, 3rd
Edn, John Wiley & Sons.
[6] Hollander, M. and Wolfe, D.A. (1999). Non-Parametric Statistical
Methods, 2nd Edn, Wiley Interscience.
[7] Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978). Statistics for
Experimenters, John Wiley & Sons.
[8] Muzzio, F.J., Robinson, P., Wightman, C. and Brone, D. (1997).
“Sampling practices in powder blending”, Int. J. Pharm. 155,
153–178. https://doi.org/10.1016/S0378-5173(97)04865-5
[9] Esbensen, K.H., Roman-Ospino, A.D., Sanchez, A. and
Romanach, R.J. (2016). “Adequacy and verifiability of
pharmaceutical mixtures and dose units by variographic
analysis (Theory of Sampling) – A call for a regulatory paradigm
shift”, Int. J. Pharm. 499, 156–174.
https://doi.org/10.1016/j.ijpharm.2015.12.038
[10] Esbensen, K.H., Paasch-Mortensen, P. (2010). “Process
sampling: Theory of Sampling – the missing link in process
analytical technologies (PAT)”, in Process Analytical
Technology, Ed by K.A. Bakeev, John Wiley & Sons, pp. 37–80.
https://doi.org/10.1002/9780470689592.ch3
[11] Montgomery, D.C. (2001). Design and Analysis of Experiments,
5th Edn, John Wiley & Sons.
[12] Evans, J.R. (2007). Statistics, Data Analysis and Decision
Modeling, 3rd Edn, Pearson Prentice Hall.
[13] Anderson, T.W. and Darling, D.A. (1954). “A test of goodness-
of-fit”, J. Amer. Stat. Assoc. 49, 765–769.
https://doi.org/10.1080/01621459.1954.10501232
[14] Ryan T.A., Joiner B.L. (1976). Normal Probability Plots and Tests
for Normality, Technical Report, Statistics Department, The
Pennsylvania State University. Available from:
http://www.minitab.com/uploadedFiles/Shared_Resources/Documents/Arti
[15] Miller, J.N. and Miller, J.C. (2005). Statistics and Chemometrics
for Analytical Chemistry, 5th Edn, Prentice Hall.
[16] Draper, N.R. and Smith, H. (1998). Applied Regression Analysis,
3rd Edn, John Wiley & Sons.
https://doi.org/10.1002/9781118625590
[17] Chou, Y.M., Polansky, A.M. and Mason, R.L. (1998).
“Transforming non-normal data to normality in statistical
process control”, J. Qual. Technol. 30(2), 133–141.
[18] Massart, D.L., Vandeginste, B.G.M., Buydens, L.M.C., De Jong,
S., Lewi, P.J. and Smeyers-Verbeke, J. (1997). Handbook of
Chemometrics and Qualimetrics, Part A, Elsevier Science.
[19] Fisher, R.A. (1954). Statistical Methods for Research Workers,
Oliver and Boyd.
[20] Beyer, W.H. (1987). CRC Standard Mathematical Tables, 28th
Edn, CRC Press.
[21] Student (1908). “The probable error of a mean”, Biometrika 6(1),
1–25. https://doi.org/10.1093/biomet/6.1.1
[22] Montgomery, D.C., Jennings, C.L. and Kulahci, M. (2008).
Introduction to Time Series Analysis and Forecasting, John
Wiley & Sons.
[23] Box, G.E.P., Jenkins, G.M. and Reinsel, G.C. (2008). Time
Series Analysis, Forecasting and Control, 4th Edn, John Wiley &
Sons. https://doi.org/10.1002/9781118619193
[24] Shewhart, W.A. (1931). Economic Control of Quality of
Manufactured Product, D. Van Nostrand Company.

* Observe already now, however, that the CLT stipulates that this desirable and useful effect
is strongly dependent upon taking many samplesstat and that they each should be “large”,
i.e. N should ideally be (much) larger than five (sic.).
Chapter 3: Theory of Sampling (TOS)

This chapter presents essential information on the origin of sampling errors and their impact on “data quality”. Not everything can be understood as “measurement uncertainty” in the traditional analytical sense only, which is shown below to be far too narrow in scope. Sampling error effects in fact constitute a missing link in data analysis and statistics, providing insight into a contribution that has, until now, typically been overlooked.
say 0.1 g, as the very last step before analysis, or, for example,
conducting primary sampling from a 40,000-ton shipload of soy
beans, sub-samples of which are later to be analysed for Genetically
Modified Organisms (GMO). Within these scale opposites lie a great
many scenarios in which physical sampling, subsampling or mass
reduction takes place. These scenarios all contribute to “data
uncertainty” and the same principles also hold for sensor sampling,
such as employed in Process Analytical Technology, PAT (for more
details on PAT, refer to chapter 13). Wherever, or whenever, sampling
is taking place from a heterogeneous material, sampling errors will
arise, large or small. Sampling errors will very nearly always constitute
the largest component in the total measurement uncertainty. The
effect is all the more serious in direct proportion to the degree of
heterogeneity of the material lot, but the sampling process itself also
produces errors if not properly counteracted. This arena has received
little to no attention in traditional analytical chemistry, in process
technology and within many specific disciplines in which sampling
also takes place. Since these issues all influence data uncertainty (“data quality”), there is an obligation to be in command of a minimum of relevant insight.
The Theory of Sampling (TOS) is a much-neglected area of
expertise that covers these issues with a unified framework of basic
principles for representative sampling. This chapter presents a brief
overview of TOS, sufficient to be able to entertain in full the
discussions on data analytical validation presented in chapter 8.
Salient remarks on the interrelationships between TOS and
Measurement Uncertainty (MU) are also offered below. It is, for
example, not clear to everybody that even when applying on-line/in-
line sensor technologies (such as PAT) in the pharmaceutical and
related industries, or Advanced Process Control (APC) in other
industries, there is still sampling taking place, only here it is spectra
that are acquired. It is essential to understand that this type of
sampling is governed by the exact same principles elucidated by
TOS. Sampling in biological systems also requires careful
consideration—the spatial variation of fat within one salmon specimen, for example, is much larger than the variation of the average fat content among 100 salmon specimens. This chapter covers
both sampling from stationary lots, as well as from moving lots, i.e.
process sampling.
Collectively chapters 3 and 8 should be viewed as contributions to
developing a comprehensive scientific approach to data analysis and
statistics: it is manifestly not enough to study and master the
methods, procedures, algorithms of data analysis only. The issue of
data quality is equally important. It was introduced briefly in chapter 1
and will be further discussed intensively below. A comprehensive
understanding of all aspects of data quality is a critical success factor
in data analysis.
There are three key elements along the pathway to data analysis
in all of science, technology and industry:
i) Sampling (i.e. procuring the necessary “samples-for-analysis-for-
data-analysis”)
ii) Analysis proper (procuring the analytical results), which includes
reference analysis. Analysis can be chemical, physical, other…
iii) Data analysis (statistical, data analytical), data modelling…
It is the obligation of everyone who wants to perform valid data
analysis (statisticians, data analysts, chemometricians, applied
scientists, engineers…) to be well acquainted with all three elements
“from field-to-analysis”. It is not enough just to be a data analyst,
however competent, without concern for the context, the origin and
thus the quality of data, Esbensen and Wagner [1].

3.1 Chapter overview


Naturally occurring or processed materials in science, technology and
industry (including manifestations hereof occurring in all pre-analytical
stages) are heterogeneous at all effective operative scales related to
materials’ handling producing the final aliquot to be analysed.
Therefore, sampling cannot be satisfactorily carried out in practice,
without a working understanding of the phenomenon of heterogeneity
and how its effects must be counteracted in the sampling process.
Sampling processes interact with the heterogeneous material making
up a lot, irrespective of its scale, size, form, volume etc., and also
create sampling errors of their own, due to non-compliance with the
practical, mechanical, maintenance and operative procedural
principles in TOS. For stationary lots this generates five principal
types of sampling errors, which are the prime topic of the present
chapter. Process sampling will also be covered here as well as in the
context of PAT, chapter 13.
The objective of a framework for representative sampling must
cover the conditions and circumstances under which it can be
guaranteed that a reliable sample with an analyte concentration, aS,
sufficiently close to the true average lot concentration, aL can be
obtained. TOS shows that in addition to the intrinsic heterogeneity of
the material sampled, much also rests with the sampling process
itself. At the outset is a fundamental prerequisite: It is not possible to
ascertain whether a specific “sample” is representative or not from
inspection of the sample itself. Neither is there any way to
compensate for faulty sampling, statistically or otherwise. Faulty
sampling is not a matter that can be rectified by throwing however
much and however complicated statistics at the problem. A complete
understanding of TOS includes: heterogeneity, five sampling errors,
the fundamental sampling principle, lot dimensionality, proper
methods for mass reduction, sampling correctness, seven sampling
unit operations (SUO) and the replication experiment.
The heterogeneity common to all materials, regardless of their
physical state makes it possible to approach all sampling tasks in a
unified manner, relying on a singular set of common principles. Focus
here is placed on the material and its corresponding sampling
process.

3.2 Heterogeneity
Heterogeneity of stationary lots and materials has two fundamental
aspects: constitutional heterogeneity (CH) and distributional
heterogeneity (DH), which are sufficient to describe (and quantify) all
operative aspects of sampling. Only a few salient definitions are
needed.
The Constitutional Heterogeneity represents the heterogeneity
dependent on the physical or chemical differences between individual
lot units, which TOS terms “fragments”, but “grains” provides a more
useful image for the physical lot constituents, e.g. mineral grains,
seed grains, kernels, biological cells. Any given target to be sampled,
characterised by lot geometry, material type and state, and grain-size
distribution, exhibits a CH which is an inherent property of the lot
material. Thus, CH plays out its role as a summation of the between-
grain scale relationships in all lots. CH is a material-specific property;
CH can only be reduced by altering the physical state of the material
(by comminution, i.e. crushing the material to a smaller overall grain
size). The grain (the fragment*) is the smallest lot unit in the sampling
scenario.
The Distributional Heterogeneity complements this
characterisation by describing all aspects of heterogeneity dependent
upon the spatial distribution of all practical constituents in the lot, as
seen from the point of view of the operative sampling tool size
(volume/mass) used; this reference volume is called an increment.
This sampling unit could, for example, be a sampling scoop, but will
also take on many different forms in relation to lot type and size, for
example, a spatula, shovel, drill core, syringe, knife, vial, a front-
loader bucket…
The physical manifestations of DH are stratification, segregation
and/or local groups-of-fragment concentrations, i.e. “clumps” of
material with a significantly higher (or lower) analyte concentration
than the average lot concentration, aL (positive or negative “hot
spots”). DH can actively be reduced by using a suite of “correct”
sampling methods to be delineated further in this chapter. DH can
never be larger than CH (in a sense DH always constitutes a fraction
of CH) and CH can never be strictly zero. Depending on the purpose
and scale of sampling (the effective “scoop size”), DH may be
negligible, but it is never zero. TOS defines homogeneity as the
(theoretical) limiting case of zero heterogeneity. If a homogeneous
material actually did exist, sampling would not be needed—as all
sampling errors would be zero, because all “samples” would be
identical. Upon reflection, it should be obvious that homogenous
materials do not exist in nature, technology or industry, but the
concept of homogeneity is still very useful as an ideal attribute or
state in some theoretical deliberations. In practice, however, TOS
considers that all materials are indeed always heterogeneous, and
can consequently be treated in a unified manner. If one can deal
effectively with significantly heterogeneous matter, one can deal with
any type of material—opening up for a much simplified, indeed
universal approach with which to deal with any kind of lot, material,
situation w.r.t. sampling, defined in the document DS 3077 [2], and
Esbensen and Minkkinen [3].

3.2.1 Constitutional heterogeneity (CH)


TOS defines a heterogeneity contribution to the total lot heterogeneity
by focusing on the scale of the individual fragments (grains). TOS
characterises all fragments according to the component of interest
(the analyte, A), expressed as the proportion (or grade), ai, and the
fragment mass, Mi. If a lot consists of NF individual fragments with
individual masses, Mi, an average fragment mass M̄i, a lot grade aL and a lot mass ML, the heterogeneity contribution from each individual fragment, hi, is defined by equation 3.1:

\[ h_i = \frac{a_i - a_L}{a_L} \cdot \frac{M_i}{\bar{M}_i} \qquad (3.1) \]
Heterogeneity contributions are dimensionless intensive quantifiers. hi expresses the compositional deviation of each fragment, while also factoring in variation in the fragment masses; for
identical grades, larger fragments result in a larger influence on the
total heterogeneity than smaller ones. This viewpoint constitutes a
major distinction from “classical statistics” where all population units
contribute equally (“with equal statistical mass”). hi constitutes an
appropriate measure of mass-weighed heterogeneity as contributed
by each of the NF fragments to the lot.
The total constitutional heterogeneity of the lot, CHL, can easily be
defined on the basis of the individual hi’s—as the variance of the
distribution of the heterogeneity contributions of all fragments in the
lot (equation 3.2):

\[ CH_L = s^2(h_i) = \frac{1}{N_F} \sum_{i=1}^{N_F} h_i^2 \qquad (3.2) \]

This variance measure is a convenient estimate of the total heterogeneity variability in the lot. Although the spatial analyte distribution of a heterogeneous lot does not comply with a random distribution assumption, as discussed later in this chapter, TOS has nevertheless derived the above theoretical understanding of CHL, resulting in the practical and very simple view that every lot is made up of specific heterogeneity contributions from each fragment.
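A minimal numerical sketch of the heterogeneity contributions and CHL as defined above (the fragment grades and masses are invented for illustration only):

    import numpy as np

    # Invented fragment data: analyte grade a_i and mass M_i for a tiny "lot"
    a = np.array([0.02, 0.05, 0.01, 0.08, 0.03])    # fragment grades
    M = np.array([1.2, 0.8, 1.5, 0.5, 1.0])         # fragment masses (g)

    M_L = M.sum()                                   # lot mass
    a_L = np.dot(a, M) / M_L                        # mass-weighted lot grade
    M_bar = M_L / M.size                            # average fragment mass

    # Heterogeneity contribution of each fragment (equation 3.1)
    h = (a - a_L) / a_L * (M / M_bar)

    # Constitutional heterogeneity of the lot (equation 3.2): variance of the h_i
    CH_L = np.mean(h**2)
    print("h_i:", np.round(h, 3), " CH_L:", round(CH_L, 3))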

3.2.2 Distributional heterogeneity (DH)

By ascending one scale level, from the scale of fragments to the operative level of one sampling unit (sampling scoop), the increment,
it is possible to tackle the complementary realm of the spatial lot
distributional heterogeneity, DHL. No longer concerned with the lot
consisting of the totality of NF fragments, at this larger scale of
scrutiny, lots can alternatively be considered as being made up of a
number of potential sampling increments, (in TOS termed “groups” or
“groups-of-fragments”), NG, commensurate with the selected
operative volume of the sampling tool in use.
Other than this hierarchical scale difference, the focus and
formalism is identical, viz. quantitative description of the differences
in composition (concentration of the analyte, an) between the groups
(increments). DHL is calculated in a strict analogue to the definition of
heterogeneity carried by a single fragment. But the effective unit is
now the sampling volume, the increments, which contain a group-of-
fragments, which will be analysed in toto. Thus, a group in the lot
(index n), Gn, similarly carries a contribution of the total lot
heterogeneity, hn, which can be calculated from the concentration of
the group in question, an, the group mass, Mn, the average group mass M̄n, and the average grade over all groups, ā, as defined in equation 3.3:

\[ h_n = \frac{a_n - \bar{a}}{\bar{a}} \cdot \frac{M_n}{\bar{M}_n} \qquad (3.3) \]
The total heterogeneity for the entire lot can be calculated in an identical fashion as for fragments, as the variance of all group heterogeneity contributions (equation 3.4):

\[ DH_L = s^2(h_n) = \frac{1}{N_G} \sum_{n=1}^{N_G} h_n^2 \qquad (3.4) \]
Due to the fact that the aggregate sum of all (virtual) groups
constitutes the physical lot in its geometric entirety, it follows that
DHL in fact is a measure of the total spatial heterogeneity exhibited by
the lot, hence the term “distributional heterogeneity”, DHL. In this
complementary view, the lot is made up of a set of potential (virtual)
groups, NG in total. This two-scale understanding of the heterogeneity
of any lot, system or material constitutes the effective theoretical
concept with which one is able to understand all key issues of
heterogeneity and representative sampling. DHL accounts for the
material heterogeneity in a specifically relevant form, namely that
corresponding to the specific sampling size used, characterised by a
specific sample mass, MS. The effective group in question is the
sampling increment used. It is possible to ascertain the quantitative
effect of the lot heterogeneity interacting with alternative sampling
processes, for example using alternative sampling volumes or by
using an alternative sampling procedure. This would result in a
numerically different measure of the spatial heterogeneity.
A specific type of increment, termed a grab sample, has historically
been very much used, particularly in industrial applications of which
the pharmaceutical industry is one—and the most important aspect
of any such sampling process is the size of this sampling unit MS and
the way it has come about. But as will become clear immediately
below, such a single-scoop sample is very, very rarely acceptable
(it results in an unacceptably inflated sampling bias), so MS is nearly
always to be understood as the compound mass of a composite
sample, which consists of Q increments. One of the most important objectives of TOS is to provide the sampler with the most reliable way to estimate the magnitude of Q; refer to DS 3077 [2].
Unlike CHL, which is an intrinsic characteristic of the given
material, DHL can actively be reduced, especially by choosing a
smaller sampling tool, thereby increasing the number of increments in
composite sampling/sampling frequency, and/or the lot can be
thoroughly mixed, blended etc. In large lots, forced mixing is often
impractical or impossible, however; in such cases increasing the
number of increments is the only option for reliable primary sampling.
If there is a significant segregation or grouping (fragment clustering) in
the lot, increasing the sample size for a one-increment sample, MS,
only results in a comparatively minor effect and will soon reach an
impractical limit. TOS has much to say (all negative) regarding the
universal futility of such grab sampling. Grab sampling is in fact never
reliable and must accordingly be abolished.
By way of contrast and effectiveness, composite sampling is
always a good choice of action. It is advantageous to think of more
increments as synonymous with better lot coverage, as illustrated in
Figure 3.1.
It follows that sampling from a heterogeneous lot can never result
in completely identical analytical results; there will always be a
sampling variance (more accurately, a sampling-and-analysis
variance) as expressed by a set of analytical results. Even a set of
identically replicated samples, carried out following an identical
protocol, will give rise to a distinct, non-zero sampling variance (see
section 3.6: Replication experiment). This is solely due to the fact that
no sampling process can fully eliminate the effect of heterogeneity at
all possible scales. The role of representative sampling is to reduce
this fundamental sampling effect as much as possible, and to be able
to quantify the remaining sampling variance. It may happen that
particular systems possess extraordinarily small heterogeneities, but
no generalisations as to universal relationships regarding
“homogeneity” or “sufficient homogeneity” can be drawn from such
particular instances. It is highly advisable always to treat any lot
material as if it carried a significant degree of heterogeneity.
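A hedged simulation sketch of why composite sampling helps (a toy segregated one-dimensional lot with all numbers invented): averaging Q increments spread across the lot reduces the sampling variance of the lot estimate relative to single-increment grab samples (Q = 1).

    import numpy as np

    rng = np.random.default_rng(6)

    # Toy 1-D "lot": analyte concentration trends across 1000 potential increments
    positions = np.arange(1000)
    lot = 5.0 + 0.004*positions + rng.normal(0, 0.5, size=1000)   # segregation + noise
    true_mean = lot.mean()

    def sampling_std(Q, n_repeats=2000):
        """Composite sample of Q increments spread evenly over the lot (Q = 1 is grab sampling)."""
        errors = []
        for _ in range(n_repeats):
            # systematic selection with a random offset, covering the whole lot
            step = 1000 // Q
            offset = rng.integers(0, step)
            idx = offset + step*np.arange(Q)
            errors.append(lot[idx].mean() - true_mean)
        return np.std(errors)

    for Q in (1, 4, 16, 64):
        print(f"Q = {Q:2d} increments: sampling std of the lot estimate = {sampling_std(Q):.3f}")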
3.3 Sampling error vs practical sampling
TOS’ analysis of the phenomenon of heterogeneity outlines three
factors which are responsible for the magnitude of the distributional
heterogeneity:
CHL (constant for any given material in a specific state).
Grouping (depends on the size (volume/mass) of the extracted
increments).
Segregation (depends on the spatial distribution of
fragments/groups in the lot).
Both segregation and grouping can be quantified if need be;
methods and equations are described in detail in the pertinent
literature and in section 3.9. More important, however, is how to
counteract the effects arising from DHL in practical sampling. In order
to extract samples from heterogeneous materials with sufficiently low
sampling variation it is necessary to minimise the effective DHL. For
any given material state, the case of reducing the two
phenomenological factors grouping and segregation can principally
be achieved in two ways only:
Decreasing the size of the extracted increments, thereby increasing
the number of increments (or increasing the sampling frequency)
combined to form a total sample mass, MS. The purpose of more
increments is to be able to cover the volume of the lot in a more
effective, and hence satisfactory fashion. This approach
counteracts grouping and segregation on the scales between
increment and the whole lot.
Mixing/“homogenising” the lot reduces macro-scale lot
segregation. Mixing is used universally at all sampling stages
except the primary sampling stage where, obviously, this option is
not possible—unless a particular original lot is very small. Thus,
mixing is not a universally available option, whereas composite
sampling always is.
Figure 3.1: “Lot coverage” illustrated in the case of composite sampling from an aggregate
stationary lot. Left, insufficient lot coverage (too small sampling “footprint”). Right, better
coverage, but sampling is still only taking place at the surface of the lot, which is not good
enough—the interior 90% of the lot is structurally never possible to sample (full disclosure
can be found in DS 3077, 2013). See also Figure 3.7.

The increased number of increments is always to be used diligently to increase the spatial lot coverage as appropriately as
possible. It is the worst possible option to give in to demands to be
“economical” or excessively “practical” at this juncture. Also, the cost
for analysis of one well-sampled composite sample in the laboratory
is identical to that for a single unreliable, non-representative grab
sample.
Spatial lot coverage is the most fundamental requirement for
starting out on the right foot regarding representative sampling.
Figure 3.1 portrays the critical success factor of “lot coverage” in
connection with composite sampling. Even though this illustration
pertains to primary sampling of a large lot, an identical principle
pertains to all lower scales as well, ending with the analytical aliquot.
Grab sampling is compared with composite sampling in Figure 3.8.
If these DH-countermeasures are insufficient for a given total error
acceptance level, it will be necessary also to reduce the constitutional
heterogeneity itself, CHL, which may, for example, necessitate
physical reduction of the fragment sizes, comminution (grinding or
crushing), if the lot size allows—or increasing the critical parameter in
composite sampling, NG. The reader is referred to the international
standard, DS 3077 [2] and the chapter references for a broader
overview regarding the full complement of principles and procedures
in TOS, including further background literature.

3.4 Total Sampling Error (TSE)—Fundamental Sampling Principle (FSP)
All analytical results are associated with a specific uncertainty
stemming from the analytical method itself, often expressed as the
Total Analytical Error (TAE). To this must be added the much larger
uncertainty components related to the entire sampling process; TOS
aggregates all such error sources as the Total Sampling Error (TSE).
TAE and TSE together form the Global Estimation Error (GEE). The
total schema of errors is presented in Figure 3.2.
TAE is often in good, even excellent, control in the laboratory, and
is usually of little concern in comparison to sampling. In fact, TSE is
often 20–50–100 times larger than TAE, depending on the intrinsic
heterogeneity met with. Exceptions would reflect only uniform
materials with an exceptionally small heterogeneity—which are rare
indeed.
Figure 3.2: Stationary lot sampling errors and their inter-relationships. Full definitions of error
abbreviations can be found in the references and in Appendix A of this chapter.

TSE has many sources. The objective of TOS is to identify, eliminate or reduce all contributing sampling errors. While much of
the sampling procedure and sampling equipment issues and efforts
are to some extent under control by the sampler, the part from
constitutional heterogeneity is dependent on the material properties
only. This error is termed the Fundamental Sampling Error (FSE), as it
cannot be altered for any given system (lot, geometry, material, state,
size distribution); FSE can only be reduced by comminution of the lot
material.
On the other hand, contributions from the spatial distribution of
the material are not fixed and can more easily be altered, although
not by any quick fix, but rather by informed systematic work. This is
because the effective DHL is not only dependent on the material
characteristics but also on the sampling procedure and the nature of
counteraction measures invoked (if for example mixing can be
applied before sampling or not). The variation stemming from
distributional heterogeneity is represented by the Grouping and
Segregation Error (GSE).
At the highest conceptual level, representative sampling is
dependent on adherence to a simple logical requirement, termed the
Fundamental Sampling Principle, (FSP): all possible, virtual,
increments must have the same probability of being selected, i.e. of
being materialised by a specific sampling process. FSP must never
be compromised and thus demands equal potential physical access
to all virtual sampling increments of the lot. This is why the improved
sampling in Figure 3.1 (right panel) is still not good enough as there is
no attempt to sample the comparatively largest inner volume fraction
of the lot. While this demand at first sight (often) has been claimed to
be difficult to realise for some lots/lot types, in practice there can be
no way to avoid contemplating the consequences of non-fulfilment. If
there exist areas in a lot which cannot be accessed if selected in a
composite scenario, there will per force be parts of the lot which will never be sampled, with the unintended, but unavoidable, consequence that any sample (grab or composite) simply cannot be
representative of the entire lot. This is a structural impossibility if FSP
cannot be followed. For this reason, TOS contains a wealth of
practical guidelines on how to achieve compliance, DS 3077 [2].
Following the above analysis, it is readily acknowledged that there
may well be large complements of TSE + TAE compromising “data
quality”, i.e. there may very well be significant “hidden” error
components embedded in the numerical data making up the
independent X and dependent Y matrices associated with the
corresponding data analysis procedure. The crucial insight is that this
problem cannot be solved by any data analytical or statistical
correction; in particular, there is no specific chemometric fix. The only
way out of this situation is to be able to contribute towards reduction, or elimination, of the most influential sampling errors.
Both the analytical and the sampling errors are expressed as
variances, partly (in practice) because variances are additive and
subtractive. Variances are the quantitative reflection of imprecision,
thus TSE + TAE is an important messenger regarding the sampling-
plus-analysis precision.
There is, however, another aspect of how well data represent
reality—accuracy. This aspect is of particular importance in TOS. The
above theoretical and practical brief is meant to inspire the reader to
seek a modicum of further TOS competence, see Esbensen and
Julius [4], Esbensen and Paasch-Mortensen [5] for a more detailed
exposition of this topic.

3.5 Sampling Unit Operations (SUO)


During work in the last decade on making TOS more accessible, a set
of useful Sampling Unit Operations (SUO) has been formulated. This
framework constitutes a complete set of procedures and general
principles regarding practical sampling. In one way, these procedures
are all that is needed in order to be able to perform representative
sampling, but their full potential can of course better be realised by a
dedicated understanding of the basic principles of TOS. An attempt
has been made here to deliver a minimum background sufficient for
the informed data analyst.
The Sampling Unit Operations can be divided into two groups
according to their use:
1) General principles: normally used only once in planning or
optimisation of new or existing sampling procedures, for example:
Transformation of lot dimensionality (transforming “difficult-to-
sample” 2-D and 3-D lots to “easy to sample” 1-D lots). It is always
possible to acquire some form of non-representative specimen
from 3-D and 2-D lots, but whether this is based on probabilistic,
correct, unbiased methods is a much more difficult issue, see DS
3077 [2].
Characterisation of 0-D sampling variation by a Replication
Experiment, see section 3.6 for more details.
Characterisation of 1-D (process) variation by variographics.
2) Four practical procedures: often used, among other things, during
practical sampling:
Lot, or sample, homogenisation by mixing or blending
Composite sampling (Q increments must “cover” the entire lot as
best possible)
Particle size reduction by crushing (comminution)
Representative mass reduction, see Petersen et al. [6]
DS 3077 [2] operates with seven SUO in all, while Esbensen and
Wagner [1] takes this to a final description of six Governing Principles
(GPs) and the same four practical procedures as above.
The following three examples illustrate the use of SUO.
i) If the fundamental sampling principle appears difficult to uphold (for
example for large stationary lots) Sampling Unit Operation #1 must
be invoked, which is Lot Dimensionality Transformation (LDT), as
shown in Figure 3.3. In this fashion, what are often considered
“impossible-to-sample” lots (2-D, 3-D lots) can in fact often be
transformed into a 1-D lot configuration, by far the easiest
configuration for representative sampling, i.e. the process sampling
situation.
ii) All primary sampling must employ composite sampling unless it has
been specifically proven that acceptable sampling quality can in
fact be achieved based upon a single increment sample with
reference to CVrel (see below). Grab sampling can never be
accepted without a comprehensive qualification.
iii) All mass reductions in the analytical laboratory are in fact ordinary
sampling operations, only taking place at the smallest scale of
interest. Every requirement for representative sampling must be
upheld in this important realm of sub-sampling, including sample
splitting, see Petersen et al. [6].
Figure 3.3: Illustrating TOS’ concept of 0-, 1-, 2- and 3-dimensional lots. It has been found
advantageous to transform many 3-D and 2-D lots into a 1-D configuration. This constitutes
the Lot Dimensionality Transformation (LDT), one of the seven Sampling Unit Operations.

Thus, all SUOs may, should and will operate on lots at all scales—
from the largest conceivable lot size to the final, pre-aliquot powder
mass immediately before analysis. It is only the physical and
geometrical manifestation of the sampling equipment which
“scales”—in this context a spatula, a scoop, a shovel (spade), a front
loader bucket, are all identical: all are “scoops” taking out
increments. This makes representative sampling completely general
and easy to implement.

3.6 Replication experiment—quantifying sampling errors
The quantitative effect of DHL interacting with a particular sampling
process (a specific sampling plan, grab sampling, composite
sampling, other…) can be quantified by extracting and analysing a
number of replicate primary samples covering the full geometry of the
lot and calculating the resulting empirical variance of the analytical
results aS. Often a relatively small number of primary samples will
suffice, though never less than 10 (more is certainly recommended).
This procedure is termed a Replication Experiment, and can be very
useful in order to reach a practical estimate of the magnitude of the
total sampling error involved before analysis; this will give the data
analyst a first idea of the “data quality”, and an indication whether
something should be done about it (reducing TSE).
The Replication Experiment must be governed by a fixed protocol
that specifies how the sampling and analysis methods are to be
carried out and replicated. It is essential that both primary sampling
as well as all sub-sampling and mass-reduction stages, sample preparation etc. are replicated in an absolutely identical fashion. The
powerful feature of the replication experiment is that it can be applied
to any sampling procedure. It is thus acceptable to “try out” the
replication experiment assessment protocol in the (legitimate) hope
that a current procedure may meet the acceptance criteria as
described below. If not, however, only a small effort was spent, and
very important information was gathered: the current sampling
procedure is unacceptable and TOS must be invoked in order to
reduce TSE when this is mandated; see DS 3077 [2].
It has been found advantageous to employ a standard statistic to
the analytical results obtained from a replication experiment. The
relative coefficient of variation (RSV or CVrel) is a useful measure of the magnitude of the standard deviation (std) in relation to the average (Xavr), usually expressed as a percentage (equation 3.5):

\[ RSV = \frac{std}{X_{avr}} \times 100\% \qquad (3.5) \]
RSV is influenced by the heterogeneity of the material, as expressed by the current sampling procedure—as well as the
sampling procedure itself as it contributes to the effective, total
variance. RSV includes all sampling errors (TSE), primary sampling
error, secondary, tertiary etc. as well as errors incurred by mass
reduction and the analytical error(s). RSV is a particularly apt
characterisation of the Global Estimation Error (GEE).
It is convenient to use RSV for all initial characterisation of an
existing sampling procedure—as well as to compare the numerical
percentage resulting from modified, hopefully improved procedures.
The replication experiment can also be applied to individual stages of
a compound sampling procedure; there is often a need to check a
particular sampling or sample handling stage, which may be
suspected to be out of control. A focused replication experiment will
produce the required information immediately. There is an RSV measure for stationary lot sampling, RSV0-dim, contrasting with RSV1-dim, which is estimated from a process sampling experiment (section 3.8).
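A minimal sketch of the RSV computation for a replication experiment (the ten analytical results below are invented, and the 20% threshold is one of the illustrative levels discussed further below):

    import numpy as np

    # Ten invented analytical results a_S from ten replicated primary samples
    a_s = np.array([3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.6, 3.4, 3.2, 2.7])

    rsv = a_s.std(ddof=1) / a_s.mean() * 100      # relative coefficient of variation, in %
    print(f"RSV = {rsv:.1f} %")

    # Compare against an (illustrative) acceptance threshold
    threshold = 20.0
    print("acceptable" if rsv <= threshold else "sampling procedure must be improved (invoke TOS)")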
International efforts in the last decade have culminated in
formulating a so-called “Horizontal sampling standard” (“matrix-
independent standard”), refer to DS 3077 [2]. This work has focused
on developing a rationale for specification of authoritative threshold levels for RSV formulated for practical sampling. For significantly
heterogeneous materials and systems, a level corresponding to 35%
has been suggested based on extensive practical experience. For
unspecified materials, the level is a tentative 20%. It is important to
acknowledge that for many systems in which the heterogeneity is
substantially less (including so-called uniform systems) the RSV
threshold should be set as low as 10%, or 5% (or even lower),
depending upon the typical material heterogeneity and the traditions
prevailing in the relevant community.
Be advised, however, that DS 3077 does not recommend a small
set of universal specific thresholds—this would be nonsensical in
view of the myriad of different heterogeneous lots and materials (each
with their own specific DHL) in the world of science, technology and
industry. The obligation from DS 3077 is that all process steps from
lot-to-analysis, i.e. GEE, must be characterised by a specific RSV—
and that this RSV is mandated to be made public, voluntarily. Imagine
a world in which all analytical results, data, were always so well
qualified. Imagine that data quality was always taken so seriously.
Figure 3.4 shows the framework for a replication experiment.
RSV constitutes a simple, yet very powerful sampling quality
measure that can be used for all material types, at all scales and at all
individual stages in the sampling process. If it is desired to limit
oneself to one general threshold level for all materials (which would
de facto lump together all of the world’s widely varying materials and
heterogeneities), RSV < 20% would be the choice for those who are
not willing to go behind the scene, but this is most emphatically not
recommended, DS 3077 [2].

3.7 TOS in relation to multivariate data analysis


What is the status of conclusions drawn from a(ny) multivariate data
analysis… in the knowledge that “measurement uncertainty” in the
traditional strict TAE sense, is regularly exceeded by factors of 10–50
due to heterogeneity and ill-reflected sampling errors? It is of the
utmost importance to understand the basics of the effects arising
from heterogeneity in any material at any scale, and that the sampling
process also makes significant contributions to the total
“measurement uncertainty”. These are the fundamental, and dominating, aspects of “data quality”. Only TOS is able to analyse the
complete and general issues involved and derive solutions.
Figure 3.4: Illustration of a generic replication experiment. Any sampling operation,
representative or not, can be replicated for example 10 times, allowing an empirical
estimation of RSV. It is an easy matter to compare this with a relevant threshold.

Some chemometricians have earlier voiced hope in the conjecture (wishful thinking, as it turns out) that error decomposition in bilinear models is tantamount to compensating for the sampling error effects.
This notion quickly had to be abandoned, however, once the full
impact of TOS was introduced in chemometrics. More serious
scientific issues in multivariate data analysis are:
What is the TSE effect on the direction of bilinear components, i.e.
the loadings?
What is the TSE effect on the projected localisation of objects on
components, i.e. the scores?
What is the TSE effect on the magnitude of the data analytical
errors, ɛ?
The very short answer is: make every possible effort to reduce
(and eliminate where possible) all TSE! But this chapter is only an
introduction to TOS. Full answers to these chemometrics questions
can be found in the background literature on TOS, which is extensive
and much of it has been published in chemometric journals signifying
its importance to statistics, data analysis and chemometrics, see the
chapter references section for a list of relevant background literature.

3.8 Process sampling—variographic analysis


The experimental variogram of a process stream (also termed a one-
dimensional lot) can be estimated either from historical data or from a
variographic experiment. The general variographic analysis approach
has proved its worth over and over in all areas of science, technology
and industry. For a variographic analysis, N increments are typically
collected using a systematic sample selection mode along the time
dimension. The collected increments are in this experiment
specifically first treated as individual samples, i.e. they are analysed
individually which results in a series of N analytical results, aS (these
results may later be aggregated in various fashions to form composite
samples as illustrated in Figure 3.5). The variographic data analysis is
often used to simulate the effects of composite sampling schemes in
process sampling (by averaging a number of successive analytical
results). The cost of a variographic experiment is that associated with
taking the N primary increments (samples) and analysing all of these
in the laboratory; there are no other costs and the experiment need
not be repeated in order to investigate all aspects of the possible
improvement effects from composite sampling with a varying number
of increments, Q. This is an indispensable advantage, and it represents a very great savings potential for many, if not all, process industries.
Figure 3.5: Increments (correctly sampled from a 1-D lot) can be grouped in several different
sets each characterised by a different “lag” (inter-increment distance). Three sets with lag = 1
(A), lag = 2 (B) or lag = 3 (C) are illustrated here. Dots represent individual analytical results
aS. This procedure employs a “systematic sampling” mode.
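To illustrate the aggregation idea only, here is a small Python sketch that averages successive analytical results in non-overlapping groups of Q, mimicking the effect of composite sampling; the simulated data and all names are hypothetical.

```python
import numpy as np

# Hypothetical series of N analytical results from individually analysed increments
rng = np.random.default_rng(0)
a_s = 5.0 + 0.4 * rng.standard_normal(60)   # N = 60 increments (simulated)

def simulate_composite(a_s, Q):
    """Average successive, non-overlapping groups of Q analytical results,
    mimicking the effect of forming composite samples from Q increments."""
    n_full = (len(a_s) // Q) * Q
    return a_s[:n_full].reshape(-1, Q).mean(axis=1)

for Q in (1, 2, 5, 10):
    composites = simulate_composite(a_s, Q)
    print(f"Q = {Q:2d}: std of composite results = {composites.std(ddof=1):.3f}")
```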

Variographic analysis can answer important questions of the type:
“what will be the (TSE + TAE) associated with a suggested composite
sampling alternative to the single-sample, full-analytical-cost
baseline?” Indeed, variographic analysis provides an answer as to what
would be an optimal number of increments (the Q-value in the
interval [2, 3, 4, 5, ..., N]) and will furnish an estimate for different
optional sampling rates as well.
Depending on the heterogeneity of the target lot or material, a
practical minimum number of increments, Q, needed for a valid
variographic experiment is preferentially taken to be in the interval
60–100 samples. In the situation where the material flux is relatively
constant (varying below ±20% rel.) a simple arithmetic average of all
analytical results will provide a good estimate of the average lot
concentration, aL.
The experimental variogram is calculated from the
heterogeneity contributions, hi, for increasing sample lags from 1 up to a
maximum of N/2. Equation 3.6 uses the relative units, which is
recommended (for the hi definition, see section 3.2.1).
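As an illustration only, the following Python sketch computes an experimental variogram from simulated analytical results, assuming the standard heterogeneity-based definition V(j) = Σ(h(i+j) – h(i))² / [2(N – j)] and increments of equal mass; all names and data are hypothetical, and the nugget effect is approximated here by V(1) rather than by a proper extrapolation to lag zero.

```python
import numpy as np

def relative_heterogeneities(a_s):
    """Relative heterogeneity contributions h_i = (a_i - a_mean) / a_mean,
    assuming (for simplicity) increments of equal mass."""
    a_mean = a_s.mean()
    return (a_s - a_mean) / a_mean

def experimental_variogram(h, max_lag=None):
    """V(j) = sum_i (h_{i+j} - h_i)^2 / (2 * (N - j)), for j = 1 .. N/2."""
    N = len(h)
    max_lag = max_lag or N // 2
    lags = np.arange(1, max_lag + 1)
    V = np.array([np.sum((h[j:] - h[:-j]) ** 2) / (2.0 * (N - j)) for j in lags])
    return lags, V

# Hypothetical process data: 60 increments analysed individually
rng = np.random.default_rng(1)
a_s = 5.0 + np.cumsum(0.05 * rng.standard_normal(60)) + 0.1 * rng.standard_normal(60)

lags, V = experimental_variogram(relative_heterogeneities(a_s))
nugget = V[0]      # V(1) used as a crude stand-in for V(0); a proper estimate extrapolates to lag 0
sill = V.mean()    # average variogram level, used here as a simple sill estimate
print(f"RSV_1-dim estimate (nugget / sill): {100 * nugget / sill:.1f}%")
```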

Figure 3.6 provides an example of a generic variogram along with its most important features.
All variograms are defined by a set of only three parameters: the
nugget effect = V(0), the range (R) and the sill = V(j > R), as shown in Figure
3.6. From a variographic analysis it is possible to derive a total
sampling error estimate, RSV1-dim, as the relative magnitude of the
nugget effect in relation to the sill. This is the essential quality
assurance index for process sampling. RSV1-dim is interpreted as the
percentage of the total observed process variability that is
attributable to all sampling errors, including TAE. Since the objective
of process monitoring is to make valid interpretations of the real
process variability, stripped of sampling and analytical error
impediments, it is obvious that the nugget effect must not grow too large
compared to the sill, i.e. RSV1-dim must not become too large. Esbensen and Romañach [7] and Esbensen et
al. [8] treated these aspects in full depth in relation to a chemometric
prediction model monitoring a pharmaceutical mixing process.
Figure 3.6: Generic variogram illustrating the three fundamental parameters: range, sill and
nugget effect. The range is the lag at which the variogram V(j) becomes effectively constant
(“flat”), characterised by the maximum data set variance, the sill (S). V(0) is the Minimum Possible
Error (MPE), the nugget effect. In the variogram depicted, RSV1-dim is below the acceptance
threshold of 20% (i.e. V(0) is less than 20% of the sill value).

Process sampling competence is an important aspect of PAT, in which chemometrics is essential. Esbensen and Paasch-Mortensen
[5] point out the essential role of proper TOS-sampling in order for
process data not to be encumbered by unnecessary sampling errors.
Variographic characterisation is described comprehensively in major
overview references in the analytical, chemometric and sampling
literature, see the reference section of this chapter for a list of
recommended reading.
Without proper TSE assessment, through RSV and variographic
analysis, one is left with no relevant information as to the effective
error level in one’s data. This is not an acceptable situation for any
analyst, any field scientist or process engineer, nor for any
data analyst. In chapter 8 it will be shown how the effective TSE (or
GEE) has a dominating impact on validation, especially resulting in
significantly inflated prediction error (RMSEP) magnitudes. The data
analyst has a clear responsibility to obtain the relevant information as
to the entire chain of events starting with the primary sampling, if for
no other reason in order to be able to contribute towards a reduction
in RMSEP. In order to be able to document that effective TSE(GEE)
reductions have taken place, it is necessary to be able to perform
relevant RSV0-dim (RSV1-dim) assessments.

Figure 3.7: RSV assessment can be invoked for any sampling procedure/sampling stage.
Here, different lot coverage for a biological material is used to compare composite sampling (Q = 11
increments) with grab sampling employing a single sample (blue). If a data analysis,
unwittingly or otherwise, is based on analytical data ultimately pertaining to grab samples,
this figure illustrates how spatial heterogeneity is not represented unless composite
sampling is used. Data analysts should also be interested in the pre-history of their data.

It is strongly recommended to carry out repeated assessments in
situations in which decision making is of critical importance. When is
this? In this introduction to multivariate data analysis, a term that shall
become very familiar is “problem-dependent”. Many issues
surrounding data analysis do not have their solution in the data
analysis method or algorithm per se, but in the domain-specific
context of the “problem owner” (or “data owner”), aka the problem
domain.
The scenario illustrated in Figure 3.7, or closely similar grab-
sampling situations, gives rise to endless frustrations in chemometric
data analysis: why is this classification rate so low? Why is this
RMSEP so large? What can I do about it? Above all: there may be a
significant, perhaps dominating, proportion of sampling-and-analysis
errors (GEE) behind the type of data analysed in chemometrics. Data
quality matters very much—due diligence is always required!

3.8.1 Appendix A. Terms and definitions used in the TOS literature

Theory of Sampling (TOS). The body of theoretical work originating in
1950 with the French scientist Pierre Gy, who over the following
25 years developed a complete theory of heterogeneity, sampling
procedures and sampling equipment (design principles, operation
and maintenance requirements). Pierre Gy himself published over
275 papers, including 7 books, on the subject of sampling; over the latter
25 years or so he has been joined by several other international sampling experts.
A first overview of the history of TOS was given in Esbensen and
Minkkinen [3]. Figure 3.6 shows an overview of TOS.
Lot (stationary, dynamic). The lot is the sampling target, the
specified material subjected to the sampling procedure. The term
“lot” refers both to the material itself as well as to its size, its physical
and geometric features and form. Lots are distinguished into
stationary and dynamic lots. The latter is a material flux, where
sampling is usually carried out at one, or more, stationary sampling
locations.
Sample. A sample is material correctly extracted from the lot
so as to be representative (TOS’ definitions of “correct sampling” and
“representative sampling” are imperative). Representative samples
can only be the result of a representative sampling process. It is not
possible to discern whether a particular increment is representative or
not by any criterion derived from the increment itself. Representativity
is not a matter of degree: either a particular sample is, or is not,
representative.
Specimen. A specimen is made up of material extracted from the
lot in any incorrect fashion, i.e. from any sampling process which
cannot be documented to be accurate and which therefore cannot be
representative either. TOS puts critical emphasis on the difference between
a sample and a specimen, whereas, e.g., ISO 6644 does not address
the concept of a specimen. From a specimen one cannot draw valid
conclusions concerning the properties of the whole lot.
Increment. An increment is a partial sample unit, which is intended
to be combined with other increments to form a composite sample.
The designation sample vs increment is critically determinant with
respect to the subsequent use, as an increment is supposed to be
physically mixed with other increments to make up a composite
sample. TOS is the only framework which distinguishes between the two
possible outcomes of any sampling process: representative
increments (or samples) vs non-representative specimens. An
individual increment can sometimes also serve in the capacity of a
sample, especially in variographic analysis (see below).
Grab sample. An increment resulting from a single sampling operation
(literally “grabbing”), almost always motivated by alleged efficiency,
inexpensiveness and minimal effort. Grab sampling can
result in representative samples only in the rarest of instances,
however, and this must always be documented by impeccable
heterogeneity characterisation and DQO assessment. Without such
documentation, grab sampling must be assumed to result
in worthless specimens.
Composite sample. In TOS, a composite sample is by definition
made up of several increments, Figures 3.1 and 3.10. The ISO
equivalent of composite sample is the bulk sample. There is full
conceptual consistency between the definition of composite (TOS)
and bulk (ISO) sample, but a composite sample may either be
representative or not, according to the characteristics of its
increments, a distinction only made in TOS.
Representative sampling. Representative sampling is invariably a
multi-stage process covering all stages from the moment an
increment, or a primary sample, is materialised from the original lot,
until an aliquot or test portion is administered to the final analytical
operation. All constituents of the lot (batch, container, conveyor belt
or pipeline cross-section etc.) must have an equal probability of being
selected, and must not be altered in any way that would change their properties,
while all elements that do not belong to the lot must have zero
probability of being selected, thereby eliminating both cross-contamination
between samples and contamination from any
other external sources. These criteria make up the Fundamental
Sampling Principle (FSP), which must never be violated, lest
representativity is compromised.
Measurement Uncertainty (MU). All analytical results have inherent
uncertainties stemming from the specific analytical method
employed. The analytical error s.s. is the deviation of the analytical
result (an estimate of aL) from the true concentration value in the
analytical aliquot, whereas the measurement uncertainty (MU) s.l. is
viewed as caused both by the Total Sampling Error (TSE) and the
Total Analytical Error (TAE). TSE comprises the combined effects of the
entire procedure from the moment the primary sample is defined and
sampled until the analytical quantification, or testing, is completed,
the final stage being the only one to which TAE applies. In much of science, technology and
industry, an unfortunate lack of attention and precision in
definitions has often led to the complacent notion that TAE includes TSE,
which has caused widespread confusion and serious
misconceptions. TSE and TAE are point estimates, magnitudes
associated with an individual datum (for example in the form of a ±
percentage), whereas MU is an interval estimate within which one is
likely to find the parameter value sought.
Grade vs concentration. TOS defines the grade aL of the
constituent of interest of the lot as the mass of the constituent
present in the lot divided by the total mass of the lot; and the grade aS
of the constituent of interest in the sample as the mass of the
constituent of interest present in the sample divided by the total mass
of the sample. In this respect, what is defined as grade in TOS is
referred to as “concentration” in many ISO contexts.
Analyte. The chemical or physical determinand (metrology), the
quantity of which is estimated by the analysis employed subsequent
to sampling. The analyte, “that which is analysed”, may be a chemical
compound, a physical parameter or a mass.
Comminution. Preferential reduction of the top particle sizes in an
aggregate material subjected to crushing. Comminution is the
technical term describing the effect of crushing on a specific particle
size distribution.
Mass reduction. All sampling operations perforce also perform
mass reduction. The critical issue, however, is whether particular mass
reduction equipment and procedures are representative. A
comprehensive benchmark survey of procedures and equipment
types for mass reduction was reported by Petersen et al. [6], covering
all major approaches met with in science, technology and industry.
Fundamental Sampling Principle (FSP). All potential increment
volumes of a lot must have identical probability and practical
possibility to end up as the physically extracted increment (or
sample). There cannot exist areas, volumes, parts of a lot which are
not physically accessible, lest representativity is impossible to
achieve.
Sampling Correctness Principle (SCP). TOS specifies correctness
as the elimination of all “incorrect sampling errors”.
Data Quality Objective (DQO). Quantitative sampling variability
index, usually expressed as a unit-less ratio (relative %) or as a
fraction of a variance ratio (also expressed as a relative %).
Relative Standard Variation (RSV). The standard deviation expressed relative to
the average of the measured concentrations (usually as a percentage). RSV is the proper DQO
for the Replication Experiment.
Replication Experiment. Procedure for estimating RSV for a
stationary lot being sampled by a specific sampling procedure. A
replication experiment can also be used, with proper lot coverage, for
a lot which has been transformed into an elongated, 1-dimensional
lot.
Variography. Estimation of total variance at increasing lag intervals
for process sampling (dynamic lots).
Lag. Between-increment sampling distance (between-sample
distance), j, in process sampling.
Variogram. Graphical expression of total variance, expressed as a
function of the lag, V(j).
Sill. Average variance of an experimental variogram. In the case of
a sufficient number of increments/samples (50–60), the sill takes on
the appearance of a “ceiling” to the variogram.
Range. The lag, R, at which the variogram becomes effectively
constant (i.e. a “flat” variogram). At lags beyond the range, the process
variance remains constant; V(j > R) constitutes the maximal variance in
a process sampling system.
Nugget effect. The minimum variance in the variogram, V(0). V(0)
contains all the stationary sampling error variance as well as the total analytical error
variance. The nugget effect indicates the Minimum Possible Error
(MPE) in any process sampling situation.
Experimental variogram. Empirical process sampling assessment
involving 50–60 increments (minimum) resulting in a variographic
characterisation. The information in an empirical variogram can be
expressed by three measures only: range, nugget effect and sill.
RSV1-dim is the proper DQO for process sampling.
RSV1-dim. The ratio of the nugget effect to the sill in an
experimental variogram.
Incorrect Sampling Errors (ISE). Four errors (IDE, IEE, IWE and IPE)
that add to form the incorrect sampling error.
Increment Delimitation Error (IDE). Occurs when the boundaries of
the selected increment do not coincide with an isotropic volume of
observation.
Increment Extraction Error (IEE). Occurs when the sampling tool is
selective in what it extracts, thus not representing all parts of the lot
equally.
Increment Weighting Error (IWE). Occurs when the collected
increments are not proportional to the flow rate (1-dimensional lots), or to the
thickness of a stratum (2-dimensional lots), at the time or place of
collection.
Increment Preparation Error (IPE). Occurs as the result of, for
example, contamination, losses, alteration of physical or chemical
composition, human error, ignorance, carelessness, fraud or
sabotage.
Sampling Unit Operation (SUO). A system of operations defining
representative sampling. SUOs are a late addition to TOS intended to
provide a succinct, minimum practical framework for representative
sampling, Esbensen and Minkkinen [3], Esbensen and Julius [9]. An
overview of the five stationary lot sampling SUOs (and the additional two
process sampling SUOs) is given in section 3.5, in the context of TOS’s
sampling errors; the present appendix provides further in-depth background.
Material class. Closely related material types, for which specific
heterogeneity characterisation and sampling DQO quantification is
not necessary, e.g. closely related commodity types, or aggregate
materials with closely related grain size distributions (N.B. similar
density, surface stickiness etc.). In all likelihood, however, it will very
often be easier, and less expensive, to perform a heterogeneity
characterisation for all new materials not sampled before, given the
economic and other potential consequences of relying on
undocumented sampling procedures.
Due diligence. It is mandatory always to assure a correct, and
hence accurate, sampling process in order to cancel the sampling
bias. Making a particular sampling process accurate may sometimes
be fraught with considerable practical and labour demands in
particularly adverse situations, but this is unavoidable if
representativity is the objective. After the demands for correct
sampling have been met, the remaining sampling errors can be
minimised to any level required (as operationalised by the DQO), given only
the willingness to employ the practical and labour efforts needed.
Unified sampling responsibility. All sampling is a multi-stage
process. Sampling at any stage must respect the same criteria as
primary sampling in order to be representative; there is no difference
but the scale at which sampling takes place. Because of the close
association with analysis, sampling operations occurring in the laboratory
(collectively termed “sample preparation”) have traditionally been the
responsibility of the analyst. It is of the utmost importance that all
sampling operations at all scales are under a unified responsibility.
Whether this be the analyst, a process engineer or a sampling-responsible
entity at another level is irrelevant; the unified
responsibility is an absolute necessity.

3.9 References
[1] Esbensen, K.H. and Wagner, C. (2014). “Theory of Sampling
(TOS) vs. Measurement Uncertainty (MU)—a call for
integration”, Trends Anal. Chem. 57, 93–106.
https://doi.org/10.1016/j.trac.2014.02.007
[2] DS 3077 (2013). Representative Sampling—Horizontal
Standard. Danish Standards. www.ds.dk
[3] Esbensen, K.H. and Minkkinen, P. (Eds) (2004). “Special Issue:
50 years of Pierre Gy’s Theory of Sampling. Proceedings: First
World Conference on Sampling and Blending (WCSB1).
Tutorials on Sampling: Theory and Practise”, Chemometr. Intell.
Lab. Syst. 74(1), (2004).
[4] Esbensen, K.H. and Julius, L. (2013). “DS 3077 Horizontal—a
new standard for representative sampling. Design, history and
acknowledgements”, NIR news 24(8), 16–19.
https://doi.org/10.1255/nirn.1406
[5] Esbensen, K.H. and Paasch-Mortensen, P. (2010). “Process Sampling
(Theory of Sampling, TOS) – the Missing Link in Process
Analytical Technology (PAT)”, in Bakeev, K.A. (Ed.), Process
Analytical Technology, 2nd Edn. Wiley, pp. 37–80.
https://doi.org/10.1002/9780470689592.ch3
[6] Petersen, L., Dahl, C. and Esbensen K.H. (2004).
“Representative mass reduction in sampling – a critical survey
of techniques and hardware”, in “Special Issue: 50 years of
Pierre Gy’s Theory of Sampling. Proceedings: First World
Conference on Sampling and Blending (WCSB1)”, Esbensen,
K.H. and Minkkinen, P. (Eds). Chemometr. Intell. Lab. Syst.
74(1), 95–114.
[7] Esbensen, K.H. and Romañach, R.J. (2015). “Proper sampling,
total measurement uncertainty, variographic analysis & fit-for-
purpose acceptance levels for pharmaceutical mixing
monitoring”, in “Proceedings of the 7th International Conference
on Sampling and Blending, June 10-12, Bordeaux”, TOS forum
Issue 5, 25–30. doi: https://doi.org/10.1255/tosf.68
[8] Esbensen, K.H., Román-Ospino, A.D., Sanchez, A. and
Romañach, R.J. (2016). “Adequacy and verifiability of
pharmaceutical mixtures and dose units by variographic
analysis (Theory of Sampling) – A call for a regulatory paradigm
shift”, Int. J. Pharmaceut. 499, 156–174.
https://doi.org/10.1016/j.ijpharm.2015.12.038
[9] Esbensen, K.H. and Julius, L.P. (2009). “Representative
sampling, data quality, validation – a necessary trinity in
chemometrics”, in Comprehensive Chemometrics, Brown, S.,
Tauler, R. and Walczak, B. (Eds). Wiley Major Reference Works,
Vol. 4, pp. 1–20. Wiley, Oxford. https://doi.org/10.1016/b978-
044452701-1.00088-0

* TOS is able to simultaneously consider both the original units, “grains”, as well as all possible
fragments hereof produced by the sampling process in operation, by calling both entities
“fragments” (i.e. both true fragments as well as unaffected original grains). This ingenious
conceptual twist allows a comprehensive description of all types of materials, whether they
are subjected to little, some or much fragmentation (or none at all).
Chapter 4: Fundamentals of principal
component analysis (PCA)

In this chapter the basic workhorse of multivariate analysis, Principal Component Analysis (PCA), is introduced. PCA involves
decomposing one data matrix, X, into a “structure” part and a “noise”
part, which allows powerful projection visualisation of the hidden data
structure, the “latent structure” in X. There is no Y-matrix at this
stage; the Y-matrix will be very prominent in later chapters on
multivariate regression. This chapter should be worked through very
closely and reflected upon most carefully: it describes how to “think”
multivariate data analysis, and it introduces the basic philosophy that
is fundamental also for many of the methods described later. Above
all, this chapter provides invaluable hands-on experience in
multivariate data analysis.
PCA has been described in a plethora of ground-breaking
literature, starting with Sir R.A. Fisher (1936) [1] and spanning some 75+
years; see also, e.g., Geladi and Esbensen [2] and Esbensen and
Geladi [3]. Indeed, the didactic outreach regarding multivariate data
analysis has continued right up to the present moment on the internet,
where useful introductions to PCA can be found, some with excellent
illustrative graphics among other resources. However, far from all are equally excellent
w.r.t. the supporting framework, didactic scope and educational
breadth; indeed, much of the material found there is partial and disjointed.
The present authors make a bid for history with this book.
A complete historical bibliography is out of bounds for the present
purpose, but it is fervently stated that a modicum of historical insight
and experience is of importance in shaping the mindset of a
competent data analyst. For this purpose, a compact historical
viewpoint has been collected in references [1–7], to which is added
an optional set of truly excellent textbooks in statistics, data analysis,
chemometrics (and even other fields in which PCA has received
ample introductory treatment), surely of interest for the developing
data analyst [8–17]. The present authors owe a great debt of gratitude
to all erstwhile educators and their valuable contributions, many of
which still loom large in the firmament. This extensive general
background literature is recommended to the reader at this starting
point, rather than giving meticulous individual references below.

4.1 Representing data as a matrix


The starting point is an X-matrix with n objects and p variables, i.e. an
n by p matrix (Frame 1.1 in chapter 1). This is called the “data matrix”,
the “data set” or simply “the data”. The row vectors of the matrix are
the objects which represent observations, samples or experiments,
while the variables are the “measurements” carried out on each
object. The important issue is that the p variables collectively
characterise each of the n objects in a comparable manner, each
object differing from others by the numerical content of its object
vector, i.e. its row data.
The exact configuration of the X-matrix, such as which variables
to use—and for which set of objects, is a strongly problem-
dependent issue. The main advantage of PCA is that there is freedom
to use a practically unlimited number of variables for a multivariable
characterisation—and likewise regarding the number of objects, for
all types of X-data.
The purpose of all multivariate data analyses is to decompose the
data in order to detect and model “hidden (or latent) phenomena”.
The concept of variance is of central importance. It is a fundamental
assumption in multivariate data analysis that the underlying
“directions of maximum variance” must be caused by important
influences, in fact the general case is that principal components are
direct manifestations of these hidden phenomena. All this may
perhaps seem a bit unclear now, but what PCA does will become
very clear through this chapter and the accompanying examples
presented.

4.2 The variable space—plotting objects in p-dimensions

4.2.1 Plotting data in 1-D and 2-D space


The data matrix X, with its p variable columns and n object rows, can
be represented in a Cartesian (orthogonal) co-ordinate system of
dimension p.
Consider for the moment the first variable, i.e. column X1. The
individual entries for each object can be plotted along a 1-
dimensional axis (see Figure 4.1a). The axis must have an origin, a
zero point, as well as a direction and a measurement unit. If X1 is a
series of measured masses, for example, the unit would be mg, kg or
some other unit of mass.
This can be extended to take in another variable, X2 (see Figure
4.1b). This would result in a 2-dimensional plot, often called a
“bivariate” scatter plot.
The axes for the variables are orthogonal and have a common
origin, but may have different measurement units, entirely a function
of exactly what was selected by the experimenter, or the data
analyst, to be X1 and X2. Continuing this extension is possible until all
p variables are covered, for example by plotting all pertinent variable
pairs.

4.2.2 The variable space and dimensions


The p-dimensional co-ordinate system described above is called the
variable space, the space spanned by the p variables. While the
dimension of this space is p, the effective dimension related to the
rank of the matrix representation (mathematically: the number of
independent basis vectors, statistically: the number of independent
sources of variation within the data matrix) may often be less than p
as will be clarified further below. Multivariate data analysis aims at
determining this “effective” dimensionality. However, at the outset
one should always assume p dimensions before starting data
analysis. Most, if not all, of multivariate data analysis will benefit from
the concept of plotting X-data in the p-dimensional variable space.

4.2.3 Visualisation in 3-D (or more)

The didactic approach of this book encourages the data analyst to think of multivariate data as a swarm of points in variable space. Of
course, as p increases above 3, visualisation is no longer physically
possible in the mind, or on paper. However, this is of no serious
consequence as it is not necessary to be able to picture anything
more complex than 3-dimensional systems to learn to understand
multivariate data analysis in its full p dimensions. Plots in 1, 2 and 3
dimensions will be used as exemplars, and this insight and data space
“feeling” is directly applicable to all higher dimensions.
Figure 4.1: Plotting data in a) 1-dimensional and b) 2-dimensional space.

4.3 Plotting objects in variable space


Assume that an X matrix has n objects and only 3 variables, i.e. p = 3.
The variable space will have 3 axes representing x1, x2 and x3. If the x-
values are plotted for each object in this variable space, object
number 1 has the set of variable measurements x11, x12 and x13;
object number 2 is characterised by the set x21, x22 and x23, and so on.
In the Cartesian co-ordinate system, each object can be
characterised by its coordinates, its row vector elements (x1, x2, x3).
Each object can therefore be represented, and plotted, as a point
in this variable space. When all X-values for all objects are plotted in
the variable space, the result is a swarm of points as shown in Figure
4.2. One of the useful features of PCA is that there are only n points
delineated in this p-dimensional space. Observe, for example, how
this rendition of the (n, p)-dimensional two-way data matrix allows
direct geometrical insight into the hidden data structure.
In this geometrical view, Figure 4.2, it is easy to get an
appreciation that there is a marked trend among the objects in this
data set, a trend that is so prominent that it is called a “hidden linear
association” among all three variables plotted. This means that all
three variables in this example are in fact correlated to each other.
This geometric impression of an underlying covariance (or
correlation) data structure that is revealed when plotted in 3 (p)
dimensions is actually all that is needed for a full phenomenological
understanding of PCA.
Figure 4.2: Data plotted as a swarm of n points in the variable space, revealing a hidden
linear trend.
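For readers who wish to reproduce such a view on their own data, the following Python sketch (using matplotlib; all data simulated and entirely hypothetical) generates three correlated variables driven by a single hidden phenomenon and plots them as a swarm of points in 3-dimensional variable space.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical example: three variables all driven by one hidden phenomenon, t
rng = np.random.default_rng(42)
t = rng.uniform(0, 10, 50)                       # the "hidden" (latent) variable
X = np.column_stack([
    2.0 * t + rng.normal(0, 0.5, 50),            # x1
    -1.5 * t + rng.normal(0, 0.5, 50),           # x2
    0.8 * t + rng.normal(0, 0.5, 50),            # x3
])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2])
ax.set_xlabel("x1"); ax.set_ylabel("x2"); ax.set_zlabel("x3")
plt.show()                                       # the swarm delineates a single linear trend
```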

4.4 Example—plotting raw data (beverage)

4.4.1 Purpose

This example illustrates how to study single-variable relationships and object similarities using standard descriptive statistical tools and plots, i.e. univariate data analysis. This data set will be used to gain an insight into the need for PCA modelling.

4.4.2 Data set

The data set consists of 17 objects and 6 variables. The objects are
17 cities in Europe and the variables represent yearly consumption
(2012) of various beverages:
Beer consumption, litres per year

Wine consumption, litres per year

Coffee consumption, kg per year

Tea consumption, kg per year

Bottled water consumption, litres per year

Fruit juice consumption, litres per year

The primary relationship between Wine and Beer consumption is
shown in Figure 4.3. The correlation coefficient was calculated to be
–0.55, resulting in a fitted linear regression line with a negative slope.
This plot confirms the general knowledge that more wine is
consumed in the Latin-speaking countries, whereas countries like
Ireland, Austria and Germany consume more beer. At first glance it
may seem that Spain is not located where one would expect, but on
further consideration it is realised that Spain is a wine-producing
country with relatively high beer consumption. This is just one of
many possible 2D scatter plots to show the interrelationships
between the six variables. The reader is encouraged to try their
hand at plotting their own data in pairwise combinations. In general,
even comparatively small data sets typically contain both
positively and negatively correlated relationships. There are also
“random shot” interrelationships to be found in many data sets.
Figure 4.3: 2D scatter plot (Wine vs Beer consumption).

With p variables, there are a total of p × (p – 1)/2 such pairwise
combinations. With just six variables, there are thus already 15
bivariate scatter plots to peruse. It does not take a rocket scientist to
appreciate the work involved in the investigation of all variable pairs in
even a moderately dimensioned multivariate data set, say p in the
interval 10–50. Surely there must be an easier way? One common
way is to calculate the cross-correlation between all variables and
represent them as a correlation table. However, this condensed
representation of the data table does not show the distribution of the
objects in the multivariate space.
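As an illustration only, the following Python sketch (using pandas, with entirely hypothetical consumption values and city names) counts the p × (p – 1)/2 pairwise combinations and computes the condensed correlation table discussed above.

```python
import numpy as np
import pandas as pd

# Hypothetical beverage-style data: 5 cities x 6 consumption variables
data = pd.DataFrame(
    np.array([
        [120, 30, 5.1, 0.4, 90, 25],
        [80, 55, 4.2, 0.6, 110, 30],
        [140, 20, 6.0, 0.3, 70, 20],
        [60, 70, 3.8, 0.8, 130, 35],
        [100, 45, 4.9, 0.5, 95, 28],
    ]),
    columns=["Beer", "Wine", "Coffee", "Tea", "BottledWater", "FruitJuice"],
    index=["City1", "City2", "City3", "City4", "City5"],
)

n_pairs = len(data.columns) * (len(data.columns) - 1) // 2
print(f"Number of pairwise scatter plots: {n_pairs}")     # 15 for p = 6

print(data.corr().round(2))   # condensed correlation table (does not show the object layout)
```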
One could also study three variables at the same time using a 3-D
scatter plot, which is shown for Beer, Coffee and Bottled water in
Figure 4.4. One drawback with 3D-plots, although favoured by many,
is that the spatial interpretation of the relationships of the objects
becomes somewhat difficult. Including vertical grid lines in the
background may help, but at the same time it clutters the plot.
Even though these simple two-variable and three-variable plotting
routines are powerful in their own right, the number of variables (in
this particular case only six) quickly makes it impractical to
investigate all pertinent combinations either in isolation or in toto.
One of the (many) reasons for using PCA is to make it possible to
survey all pertinent inter-variable relationships simultaneously. Other
data analysis objectives are postponed below until the principles of PCA,
and the interpretation of its results, have been covered and well understood.
With PCA and related methods the focus is on conveying variable
relationships, and how the objects are distributed, in terms of plots
that enable interpretation based on the user’s background
knowledge. PCA often conveys renditions of meaningful data
structures, some of which may be known beforehand, but the true
value of PCA is that it may reveal something new about the data,
features which were previously unknown. This is exactly what may
lead to new insights and innovation in a practical setting. Performing
PCA is, technically, the easy task; interpreting the PCA results is the
challenging objective. But all the technicalities of PCA must be
mastered first, which is the focus of the remainder of this chapter.
Figure 4.4: 3D scatter plot of Beer, Coffee and Bottled water consumption.

4.5 The first principal component

4.5.1 Maximum variance directions


The data swarm in Figure 4.2 shows that when correlation between
variables exists, the objects delineate a new direction in variable
space. In this case, a central axis could be drawn through the swarm
and that this line would describe the data swarm almost as efficiently
as all the original p (p = 3) variables (Figure 4.5). Due to the strong
correlation, the effective dimension is not 3, but rather 1 (the issue of
whether the “transverse scatter” orthogonal to the central axis has
data analytical meaning will be dealt with below).
This central axis, which appears very natural here because the
swarm indeed “looks” linear, is actually the direction of maximum
variance in the data set. The variance, the spread of all the objects in
the data set, is largest along this axis in the sense that all other
directions will span a smaller spread. In the general case this axis can
have any orientation in the 3-dimensional space (it need not be
parallel to any of the original X-axes; in fact, it very seldom is). It is
usually the case that several variables collectively display this type of linear
association. However, if the direction of maximum variance lines up
along a single original axis, this represents the situation where only that
variable is responsible for most of the variation in the system.
This “trend” aspect of PCA will soon become very familiar.
When “the variance” is described, the question is “the variance of
what?” The feature in question is the variance along the direction
described/represented by the central axis, whatever this new variable
may actually “represent”. Observe that this new axis functions exactly
like a variable: it is linear, i.e. 1-dimensional. This highlights what is
meant by the term: “modelling a hidden phenomenon”. There is a co-
varying, linear behaviour along this central axis due to “something”,
some unknown cause or reason. If the original x1, x2 and x3 variables
are looked at individually, there is no such apparent connection,
except that their pairwise co-variances are large. But this simple
simultaneous geometrical plotting in 3-dimensional space reveals the
hidden data structure in its entirety very effectively.
All PCA does is to allow such geometrical understanding to
be generalised to any arbitrary higher p-dimensionality; et voilà:
PCA without mathematics (at least to a first understanding).
The central axis is called the first Principal Component, in short
PC1. PC1 thus not only lies along the direction of maximum variance
in the data set—it is the direction of maximum variance in any data
set, X, in fact. It is said that there is a “hidden variable” associated
with this new axis, a Principal Component, PC.
At this stage, it is not known what this new variable “means”. PCA
modelling will result in this first—and similar other—PCs, but it is up
to the analyst to interpret what they mean or which phenomena they
describe. This issue of interpretation will be addressed soon, but a
little more familiarity with principal components is still required.
In the example where only three variables were available, it was
recognised that a linear behaviour was apparent by plotting the
objects in the 3-dimensional variable space. When there are more—
many more—variables (like in spectroscopy, where each row of the
matrix is a spectrum of perhaps several hundred or thousands of
wavelengths), this procedure is, of course, not feasible in a directly
similar fashion. Identification of this type of linear behaviour in a
space with several thousand dimensions, of course, cannot any
longer be done by direct visual inspection. However, here PCA can
help in the discovery of the hidden data structures just as easily as
above due to its powerful projection characteristics.

Figure 4.5: 3-Dimensional data representation with PC1 fitted.

4.5.2 The first principal component as a least squares fit


The central new variable is defined as the axis along which the PC
variance is maximised. There is also a complementary way to view
this axis. Assume the same swarm of points exists and now an
arbitrary axis is drawn through the swarm, i.e. an axis with a randomly
chosen direction. This is just a proxy direction to illustrate the
concept—not the actual final axis direction, which will be found by
the PCA algorithm.
Each point is projected perpendicularly down onto this proxy axis.
The (perpendicular) distance from point i (object i) is denoted ei (and
is called the object residual). As can be seen from Figure 4.6, each
point is situated at a certain “transverse” distance from the line. The
first PC can be thought of as finding the line that is a best
simultaneous fit to all the points through the use of the least-squares
optimisation principle in the sense that there is only one line (one
direction) that minimises the sum of all squared transverse distances
from all objects in the data matrix, i.e. minimises Σ(ei)2.
This line is the exact same PC-axis that was found more
“intuitively” above. When using the Least Squares approach on the
residuals, a completely objective algorithmic approach is obtained
with which to calculate the first PC through a simple sum-of-squares
optimisation.
It is appreciated that the n objects contribute differently to the
determination of the axis direction through their individual orthogonal
projection distances. Objects lying far away from an arbitrary PC axis
in the “transverse” sense will pull heavily on the axis’ direction
because the residual distances count by their squared contributions.
This property will be discussed later and is known as leverage.
Conversely, objects situated in the immediate vicinity of the overall
“centre of the swarm of points” will contribute very little to
establishing the PC direction in comparison. Objects lying far out
“along” the PC axis extensions may, or may not, display similarly
large (or small) transverse residual distances. However, only the
transverse residual is reflected in the least squares minimisation
criterion.
There are now two approaches, or criteria, for finding the (first)
principal component: the principal component is the direction (axis)
that maximises the longitudinal (“along axis”) variance or the axis that
minimises the squared transverse residual projection distances.
Upon reflection, it may be realised how these two criteria are
actually but two sides of the same coin. Any deviation from the
maximum variance direction in any elongated swarm of points must
necessarily also result in an increase of Σ(ei)2—and vice versa. It will
prove advantageous to have become thoroughly familiar with these
two simple geometrical models of a principal component.
A third conceptual understanding shall also be presented below
(PC = a specific linear combination of all p variables).

Figure 4.6: Projections onto a PC; each object has a transverse residual (not all are drawn in
this projection). All projections are perpendicular with respect to the subspace modelled
(here a 1-dimensional PC).
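For the numerically inclined reader, the following Python sketch (simulated, hypothetical data; the PC computed here via singular value decomposition, one of several possible algorithms) illustrates how the PC1 direction simultaneously maximises the variance of the projected scores and minimises the sum of squared transverse residuals.

```python
import numpy as np

rng = np.random.default_rng(7)
t = rng.normal(0, 3, size=100)
X = np.column_stack([2*t, -t, 0.5*t]) + rng.normal(0, 0.4, size=(100, 3))

Xc = X - X.mean(axis=0)                 # mean-centre the data (see section 4.7.2)

# PC1 direction (loading vector) via singular value decomposition of the centred matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
p1 = Vt[0]                              # unit-length direction of PC1

scores1 = Xc @ p1                       # projections ("foot points") onto PC1
residuals = Xc - np.outer(scores1, p1)  # transverse residual vectors e_i

print("Variance along PC1:          ", scores1.var(ddof=1))
print("Sum of squared residuals:    ", (residuals**2).sum())

# Any other direction spans less variance and leaves a larger residual sum of squares
q = np.array([1.0, 1.0, 1.0]) / np.sqrt(3)
print("Variance along arbitrary axis:", (Xc @ q).var(ddof=1))
```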

4.6 Extension to higher-order principal components
The 3-variable example illustrated above can now be generalised to
higher-order cases.
The swarm of points in the simple illustrations above could be
approximated quite well with just one PC. Suppose now that the X-
point data swarm in fact is more complex and not so simple as to be
modelled by a singular PC (Figure 4.7). There is only one thing to do
after having found PC1 and that is to find one more, called PC2, as
the point data swarm in this case would appear to be quite planar in
disposition. The second principal component, per definition, must lie
along a direction orthogonal to the first PC and along the direction of
the second largest data set variance.
Again, this is a variance caused by some “unknown” phenomenon
or new hidden (latent) variable, the specific origin of which may be
interpreted later (or not), but its manifestation in one specific
direction orthogonal to PC1 is exactly what is represented by the
second principal component, PC2.
One may perhaps at this stage already be wondering how to find
(to calculate) the PCs. This need not be of concern yet, however; at
this stage, it is only important to grasp the geometric concepts of two
mutually orthogonal PCs, collectively defining a plane, i.e. the PC1
and PC2 plane, (Figure 4.7). The mathematics behind and the
algorithmic procedure to find them is actually very simple and will be
described in due course.
It is now conceptually easy to continue understanding higher-
order components. By definition, PC3 will be orthogonal to both PC1
and PC2 while simultaneously lying along the direction of the third
largest variance—and so on for PC4, PC5 etc. The final PCA will
consist of a number of orthogonal PCs, each lying along a maximum
variance direction representing still more “transverse” data
extensions, but of decreasing variance extension.
The PCA method is designed to first find the largest variance PC1,
followed by the orthogonal direction of the second largest variance,
PC2… and so on. Each new higher-order PC describes a smaller
fraction of the total data set variance than the previous. This is a most
convenient feature of PCA.
Figure 4.7: Quasi-planar data swarm which is modelled well with two PCs. Each object will
now be projected onto a 2-dimensional PCA-model made up of PC1 and PC2. The
projection distances will be orthogonal to the PC1–PC2 plane.

Such a system of PCs actually constitutes a new, alternative coordinate system relative to the original variable space with its p
variables. In fact, a new set of “variables” is available, made up of
each PC calculated, which are uncorrelated with each other since
they are mutually orthogonal.
Thus, these new variables, called PC variables for the moment, do
not co-vary. By introducing the PCs, good use is made of the
correlations between the original variables, and a new, independent,
orthogonal coordinate system is constructed. Leaving the original
Cartesian co-ordinate system, one is effectively substituting the inter-
variable correlations with a new set of orthogonal coordinate axes,
with which the data analyst will be able to develop a PC model of the
X data structure, but, as shall be shown, with a much-reduced
number of directions.
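A small numerical check of these properties can be made as follows; this Python sketch uses simulated, hypothetical data, computes the PCs via singular value decomposition and verifies that the loading vectors are orthonormal and the score vectors mutually uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated data
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T               # loadings: columns are PC directions
T = Xc @ P             # scores: coordinates of the objects in PC space

# Loadings are orthonormal and score vectors are mutually uncorrelated
print(np.allclose(P.T @ P, np.eye(P.shape[1])))           # True
print(np.round(np.corrcoef(T, rowvar=False), 6))          # ~identity: off-diagonals are zero
```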

4.7 Principal component models—scores and loadings
It is now possible to develop the systematics of PC modelling more formally.

Definition: A Principal Component model is an approximation to a given data matrix X, i.e. a model of X, to be used instead of the
original X. A PC model represents the hidden data structure in X by a
series of principal components PC1, PC2, PC3… It is tacitly assumed
that this substitution has some advantages for data
analysis/interpretation purpose(s), an assumption that has been
amply justified by (literally) many, many thousands of PCAs carried
out since its first appearance in its modern form in 1933 (in a seminal
paper by Hotelling [18]). The method was originally outlined in 1901
by the famous statistician Karl Pearson [19].

4.7.1 Maximum number of principal components

There is an upper limit to the number of PCs that can be derived from
an X-matrix. The largest number of components is either n – 1
(number of objects minus 1) or p (number of variables), whichever is the
smaller. For example, if X is a 40 × 2000-dimensional
matrix (40 spectra, each with 2000 variables), the maximum number
of PCs is 39. In this case, the largest number of potential PCs is
limited by the number of samples. Notice that this is the maximum number
of components; in general, a much smaller number, << p, will suffice.
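A quick numerical confirmation of this upper bound, using simulated, hypothetical data, might look as follows (the rank of a mean-centred 40 × 2000 matrix is at most 39).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))          # hypothetical: 40 spectra, 2000 variables
Xc = X - X.mean(axis=0)                  # centring removes one degree of freedom

# Singular values of the centred matrix: at most min(n - 1, p) are non-zero
s = np.linalg.svd(Xc, compute_uv=False)
print("Number of non-negligible components:", int(np.sum(s > 1e-10)))   # 39
```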
From a data analytical point of view, using the maximum number
of PCs corresponds to a simple change of co-ordinate system from
the p-variable space to the new PC space, which is also orthogonal
(and with uncorrelated PC axes).
Mathematically, the effective dimension of the PC space, i.e. the
space spanned by the number of PCs in the model developed (much
smaller than p), is termed the rank of X. The methods used to
estimate this critical parameter of a PC model are discussed below.
Since the first few PCs typically define the major fraction of the
variation in a data set, the later PCs will lie along directions where
there is progressively less and less spread in the objects, or perhaps
there are no longer any significant underlying phenomena to model.
These higher order PCs may only be representing random noise.
Thus, all higher-order directions can progressively be thought of as
potential “noise” directions. This gives an indication of how PCA can
decompose the original data matrix X into a structured part (the first
PCs that span the largest variance directions that represent the major
structured phenomena), and the noise part (directions in the data
swarm where the variance/elongation is small enough perhaps to be
neglected).

In any event, the PCA concept is open-ended: if one is interested
in a more effective, more regularised alternative coordinate system
only, PCA delivers. But, much more relevant for the purpose of this
book: chemometric data analysis works on the premise that there is
to be found a small upper bound to the number of PCs needed to
effectively model the hidden X data structure. It is all about how to
find this number (much smaller than p) reliably and reproducibly. This
number will be denoted as “A” or “Aopt” (the optimal number of
principal components).

4.7.2 PC model centre


A PCA model thus consists of an upper bound set of orthogonal axes
(A), all determined sequentially, and independently, as the maximum
variance directions. They have a common origin, as can be seen in
Figure 4.8. There are a number of ways to choose this origin.
Sometimes it can be the same origin as the origin for the original p
variables, but this is very rarely optimal and is in fact only used in
extremely special circumstances (so rarely that new data analysts can
safely disregard this highly advanced option). The
overwhelmingly most frequent choice of origin for the PCs is as the
average object, the average point in the data swarm:

$\bar{\mathbf{x}} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)$, where

$\bar{x}_k = \frac{1}{n} \sum_{i=1}^{n} x_{ik}$

is the mean of variable index k, taken over all objects.

Figure 4.8: Centring the PC coordinate system at the average data object (translating the
original coordinate system).

This PC-origin can also be viewed as a translation of the origin in
variable space to the “centre-of-gravity” of the swarm of points
representing X. This procedure is called centring, and it establishes the
origin of the common principal component coordinate system,
known as the mean centre. Thus the model PC
coordinate system will extend outwards from this mean centre data
point, and will be made up of the individual PC directions calculated.
Observe that the “average point” usually is an abstraction. It does
not have to correspond to any physical object present among the
available samples. It is a very useful abstraction, however, anchoring
the entire PCA at just the right place.
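In code, mean centring is a one-line operation; the following Python sketch (hypothetical numbers) computes the average object and translates the data swarm so that the mean centre becomes the new origin.

```python
import numpy as np

X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 2.0],
              [4.0, 8.0, 3.0],
              [5.0, 10.0, 4.0]])      # hypothetical 4 x 3 data matrix

x_mean = X.mean(axis=0)               # the "average object" (mean of each variable)
Xc = X - x_mean                       # centred data: PC origin moved to the mean centre

print("Mean object:", x_mean)
print("Column means after centring:", Xc.mean(axis=0))   # all ~0
```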

4.7.3 Introducing loadings—relations between X and PCs


Mathematically PCs, strictly speaking, are variance-scaled half-
vectors in variable space, whose directions are determined as
explained above, originating (back-to-back) at the mean centre
object. The directional polarity is useful for practical data analysis, as
is revealed below regarding scores and how to interpret and use
these.
Any single PC direction can be represented as a linear
combination of the p unit vectors defining the variable space, i.e. unit
vectors along each original axis in the variable space. The linear
combination for each PC will contain p coefficients, relating to each
of the original p unit vectors. These shall be called directional
coefficients pka, where k is the index for the p variables and a is the
index for the successive PCs. As an example, p23 would be the
coefficient for the second p-basis vector, defining X2, in the linear
combination that makes up PC3.
These coefficients are termed loadings and there are thus p
loadings for each PC. The loadings for all the PCs in a model
constitute a matrix P. This matrix can be thought of as a
transformation matrix between the original variable space and the
new space spanned by the PCs (centred at the average data point). In
PCA, the loading vectors, the columns in P, are orthogonal. Loadings
provide information about the relationship between the original p
variables and the PC directions. In a way they constitute a bridge
between variable space and PC space.
The PCs are in fact nothing but linear combinations of the original
variables (unit vectors). Returning to a previous definition, the
loadings provide a “weighting” of each variable’s contribution to a PC
direction. When a weighting is high for a variable, i.e. the variable has
a high loading (numerically), this variable contributes significantly to
the variance expressed by that PC. This is important for interpretation
purposes. PC loadings will be discussed in great detail below, and
especially how to interpret them.
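As a small illustration of loadings as directional coefficients, the following Python sketch (simulated, hypothetical data; PCs obtained via singular value decomposition) extracts the loading matrix P and shows that each loading vector holds the p weights of the original variables in that PC direction.

```python
import numpy as np

rng = np.random.default_rng(11)
t = rng.normal(size=50)
X = np.column_stack([t + 0.1*rng.normal(size=50),
                     2*t + 0.1*rng.normal(size=50),
                     -t + 0.1*rng.normal(size=50)])
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T                                  # loading matrix, shape (p, A): columns are PCs

# Each column of P holds the p coefficients (loadings) of one PC,
# i.e. the weights of the original variables in that PC direction.
print("Loadings for PC1:", np.round(P[:, 0], 3))
print("Each loading vector has unit length:", np.allclose(np.linalg.norm(P, axis=0), 1.0))
```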

4.7.4 Scores—coordinates in PC space

Earlier, residuals were introduced as resulting from the lack of fit of a projection of objects onto one, or more, PC axes. For this reason, one
often sees PCA and similar methods designated as projection
methods. Consider for instance PC1. If an object i is projected down
onto PC1 (Figure 4.9), it will have a projection “foot point” on this PC-
axis. On this axis, this foot point has a distance (co-ordinate) relative
to the PC origin, falling on either of the half-axes extending out from
this origin when a direction polarity has been assigned (as is the case
in PCA); the “sign” of this distance measure is either negative or
positive.
This co-ordinate is termed the score for object i, and is designated
ti1. The projection of object i onto a possible PC2 will be called the
score ti2, the next ti3 and so on. Just like in variable space, the
projected objects correspond to ordinary points in the new co-
ordinate system, only now expressed with their projected co-
ordinates, scores (ti1, ti2, ... , tiA) only (the projection distances are now
ignored).
Taking the step further will allow generalising to a PC coordinate
system with A components. In this case object i is projected onto an
A-dimensional surface called an A-dimensional subspace (or an “A-
dimensional flat”, highlighting the geometrical 3-dimensional
projection metaphor introduced earlier). Each object will thus have its
own set of scores identifying its exact location in this dimensionality-
reduced subspace of dimension Aopt. The number of scores, i.e.
number of subspace co-ordinates for each object, will be Aopt. When
no confusion can occur the PC index will be called either A or Aopt
below.
Collecting all scores for all N objects generates the score matrix T,
which will have dimensions [N, Aopt]. Notice that the Aopt scores for an
individual object make up a row in T, of dimensions [1, Aopt]. The
columns of the score matrix T, each of dimensions [N, 1], are orthogonal, a very important
property that will be of great use when the data analyst gets to
interpret the revealed data structures in X.
There is often reason to refer to individual score vectors. A score
vector is a column of T, i.e. the vector of “foot-point locations”
representing all the N objects projected down onto a particular PC.
Therefore, there will be a score vector for each PC, A in number. Each
score vector will have the same number of elements as there are objects,
N.
Usually, “scores” means “elements in the T-matrix” without further
specification. Historically, PCA was first developed in the area of
psychology (“psychometrics”) and the term “score” denoted a
response factor to some particular stimuli. This notation has been
used and preserved ever since in all of chemometrics [20].
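The scores themselves are simply the coordinates of the centred objects projected onto the loading vectors; a minimal Python sketch (simulated, hypothetical data) might look as follows.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(20, 4)) @ rng.normal(size=(4, 4))    # hypothetical 20 x 4 data
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                  # retain the first two PCs
P = Vt[:A].T                           # loadings, shape (p, A)
T = Xc @ P                             # scores, shape (N, A): t_i1, t_i2 for each object

print("Score matrix T has shape:", T.shape)      # (20, 2)
print("Scores of object 1 (t_11, t_12):", np.round(T[0], 3))
```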

4.7.5 Object residuals

There are many advantages in using a PC model of the X-data, i.e. substituting the A scores for the p variables, but there is always a
price to be paid. This is expressed by the size of the projection
distances ei. They represent potential “information lost” by assuming
that the projected foot points are fully satisfactory for the data
analysis. The projection rendition is thus always but an approximation
to the original data set, the left-out part represented by the object
residuals. Here it pays to remember that the PCs can be thought of as
the result of minimising the sum of squared object distances (object
residuals). If, in sum total, these distances, the object residuals, are
“large”, this implies that the model fit is not good; the model does not
represent the original data well. Thus much insight about a PCA
model can be gained from inspection of various statistics based on
the object residuals (“misfit statistics”), all to be explored further
below.
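A minimal Python sketch (simulated, hypothetical data) of how object residuals can be computed and summarised per object is given below.

```python
import numpy as np

rng = np.random.default_rng(9)
t = rng.normal(size=25)
X = np.column_stack([t, 2*t, -0.5*t]) + 0.2*rng.normal(size=(25, 3))
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 1
P = Vt[:A].T
T = Xc @ P
E = Xc - T @ P.T                            # residual matrix: what the A-component model misses

object_residual_ssq = (E**2).sum(axis=1)    # one "misfit" value per object
print("Residual sum of squares, first 5 objects:", np.round(object_residual_ssq[:5], 4))
```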

Figure 4.9: Scores as projected object coordinates in the A-dimensional subspace.

4.8 Objectives of PCA


The practical objective of PCA is not only to substitute the initial
object representation, in the form of the p original variables, with the
new PC coordinate system, but especially also to gain the advantage
of dropping the noisy, higher-order PC directions.
Thus PCA performs a dual objective: a transformation into a more
relevant and much easier-to-interpret co-ordinate system (which lies
directly in the centre of the data swarm of points), and a
dimensionality reduction (using only the first Aopt PCs which reflect
the effective data structure). The only “problem” remaining is then:
how many PCs are needed for this purpose?

How is the magnitude of Aopt determined?


Aopt is the only “free parameter” of a PCA model. This, the most
important model parameter, is in some approaches determined
algorithmically, an approach that is strongly rejected here
(never mind that The Unscrambler® program also offers an estimate;
more on this issue later). It is up to the data analyst to decide on the
numerical magnitude of Aopt based on the complete information
generated by the principal component analysis. This includes the
information revealed and interpreted from the scores and the
loadings as well as from the explained-variance plots. This personal
responsibility for setting Aopt is a key issue in the present approach to
PCA.
In Figure 4.10 it can be appreciated how the new PC co-ordinate
system amounts to a reduction of the dimensionality from 3 to 2. This
is of course not an especially impressive reduction in itself, but it
should be kept in mind that PCA handles the case of, say, 300 → 2,
or even 3000 → 2 equally well. The 3-D → 2-D (or 1-D) reduction is
only a particularly useful geometric metaphor for the general,
powerful dimensionality reduction potential of PCA of any high
dimensionality, p, to the effective rank of the data set, Aopt. In fact,
any large number of variables can often be compressed into a
relatively small number, e.g. 2, 3, 4 PCs. This allows the data analyst
to see the data structures almost totally regardless of the original
dimensionality with but a few plots.
However, all such plots represent projections onto an A-
dimensional subspace, i.e. the data analyst only observes the
relationships between the object foot points. Considerations
regarding the complementary, screened-off residuals shall be further
delineated below.
Figure 4.10: The general PC coordinate system (A = 3). This disposition shall serve as a
metaphor for all higher-dimensional PCA cases.

4.9 Score plot–object relationships


One of the most powerful tools that PC-based methods offer is the
score plot. A score plot is any pair of score vectors plotted against
each other (Figure 4.11); although scores can also be plotted
individually as 1-D serial data plots to reveal underlying phenomena
in-between objects, of particular interest for process data.
Plotting score vectors in the 2-D fashion corresponds to
visualising the projected objects as they are dispositioned in a
particular PC planar sub-space. Score plots are typically referred to
by their score designations, for example t1 vs t2 for the PC1–PC2
score sub-space. Score plots can be viewed as particularly useful 2-
D (or 3-D) “windows” into PC space, where one observes how the
objects are related to one another. The full effective PC space may
certainly not always be fully visualised in just one 2-D plot, in which
case one or a few more score plots are needed.
The most commonly used plot in multivariate data analysis is the
score vector for PC1 versus the score vector for PC2. This is easy to
understand since these are the two directions along which the data
swarm exhibits the largest and the second largest variance,
respectively. As a practical example, in sensory science, the quality of
green peas, as assessed by a trained sensory panel, was analysed
using PCA. A plot of PC1 scores versus PC2 scores, the t1 vs t2 plot,
is shown for this data set in Figure 4.11.
Scores for PC1 are traditionally plotted along the “x-axis”
(abscissa) and the scores for PC2 are plotted along the “y-axis”
(ordinate). Note how the powerful option of having one, two (or more)
of the object name characters serving as plotting symbol (and group
colours from external information) can be used. This option greatly
facilitates interpretation of the meaning of the inter-object
dispositions. The data analyst should use their own creativity in
making effective use of such annotation facilities, of which there are
many in The Unscrambler® software.
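
For readers who want to reproduce this kind of plot outside The Unscrambler®, the following is a minimal sketch in Python (numpy/matplotlib) of computing scores from mean-centred data via a singular value decomposition and plotting t1 vs t2 with object names as plotting symbols. The data matrix X and the label list used here are random, hypothetical stand-ins, not the pea data discussed in the text.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 10))                  # hypothetical stand-in for an n x p data table
labels = [f"A{i % 5 + 1}" for i in range(25)]  # hypothetical object names used as plotting symbols

Xc = X - X.mean(axis=0)                        # mean-centre each variable (column)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                      # score matrix; column a holds the scores on PC a
t1, t2 = T[:, 0], T[:, 1]

fig, ax = plt.subplots()
ax.scatter(t1, t2, s=5)
for x, y, name in zip(t1, t2, labels):         # annotate each object with its name
    ax.annotate(name, (x, y), fontsize=8)
ax.axhline(0, lw=0.5)
ax.axvline(0, lw=0.5)
ax.set_xlabel("PC1 scores (t1)")
ax.set_ylabel("PC2 scores (t2)")
plt.show()
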

4.9.1 Interpreting score plots

With reference to Figure 4.11 it can be seen that there is a singular object in the far left of the plot, labelled A (in fact there are three
replicate measurements for this object A; the complex issue of
replication is treated in full in chapter 9). It has the most negative
score for PC1, approximately –7.0, and a negative score for PC2,
approximately –2.1. Another observation is that there is a time trend along PC1. The objects are labelled with the letters A, B, C, D, E, denoting the type of peas ("cultivars"), while the numbers 1–5 indicate successive harvest times. The score plot was grouped using
harvest time as the categorical variable and it can be seen that the
objects move systematically from left to right, along the first PC
direction. The “PC1” variable would appear to be strongly associated
with the progress of harvest time. From this it can be concluded that
PC1 lies along the maximum variance of this “hidden” phenomenon,
which can be described as “progressively later harvest time”.
It is noted that “harvest time” was not a variable in the original X-
matrix. Still it is observed as the dominant trend across the score plot
because of the systematic distribution of the harvest time along the
PC1 axis. For this particular example, one is led to the conclusion
that time of harvest is important for the taste of the peas. Since PC1
is the most dominant Principal Component (in fact, it describes 53%
of the total X-variance), harvest time seems to be a very important
factor for the sensory results.

Figure 4.11: Example of a 2-D score plot PC1 and PC2 (t1 vs t2) with external information
annotated for more comprehensive interpretation.

The same reasoning can be followed if interest is also on interpreting PC2. Only here a different pattern is observed in the
plotting symbols moving down the plot from positive PC2 to negative
PC2 scores. Based on the symbols used to annotate the plot, there is
no visual relationship of the external information and the PC2
direction. However, this direction is described by one of the
measured variables “colour” and this will be described further when
an interpretation of loadings is presented below.
It is noted (how this is calculated shall be seen below) that PC1 represents a proportion of 53% of the total data variance, while PC2 takes care of 26% in comparison; thus the PC1 vs PC2 plot "accounts for 79%" of the total data variance in the data set, as the terminology goes. Thus PCA decomposes the original X-matrix
data into a set of orthogonal components, which may be interpreted
individually (the PC1 phenomenon may be viewed as taking place
irrespective of the phenomenon along PC2 etc.). In reality, of course,
both these phenomena act out their role simultaneously, as the raw
data undoubtedly come from the one-and-only X-matrix. Together
they account for 79% of the total variance in the data set X.
A full interpretation of this PCA on the peas data set is presented
in the following sections (this may be the very first full PCA for many
readers—congratulations), and an astounding comprehension
emerges: PCA is able to model the growth-harvest development for
five apparently very closely related pea cultivars; their general
evolution in the PC1 vs PC2 score plot are closely similar, yet with the
help of PC2, it is possible to discern even minute systematic
differences between these.
This possibility for orthogonal, decomposed data interpretation is
one of the most powerful aspects of PCA. But these are only simple
examples of the type of information that can be obtained from score
plots; there are very many more uses. Score plots are, among other things, used for
outlier identification, identification of trends, identification of groups,
exploration of replicate similarities and more. When interpreted
together with loadings, the influence of the original variables can also
be deduced. This will be discussed further below and in more detail in
chapter 6. Referral is also made to numerous application notes to be
found on the CAMO Software homepages.

4.9.2 Choice of score plots


The (t1 vs t2) score plot is the fundamental “work horse” of PCA,
which is always viewed first, with no exceptions (after looking at the
explained variance plot, which helps to determine how many PCs are
to be interpreted). However, any pair of principal components may be
plotted against each other in the same way as t1 vs t2.
Which components need to be addressed is another matter and is
always related to the actual data structure present. Enter the
dimensionality reduction feature of PCA, offering but a few PCs
instead of p variables.
There are two rules of thumb concerning selection of score
plots:
1) Always use the same principal component as abscissa (x-axis) in all
the score plots, e.g. t1 vs t2, t1 vs t3, t1 vs t4, … In this way a method
for gauging all the other PC phenomena against the same
yardstick, t1 is achieved. This will greatly help getting the desired
overview of the compound data structure, even if it needs a higher
number of PCs, 3, 4, 5… On rare occasions PC1 might not be of
particular interest to the data analysis problem at hand (even
though dominating with respect to the fraction of total variance
modelled). It may, for example, be reflecting a very strong
systematic interference which has to be modelled before its effect
can be disregarded, i.e. disregarding PC1.
2) Thus for some applications it is possible that PC1 lies along a
direction that is not relevant for this X-axis plot anchor role. This is
particularly true in complex applications where PC1, in addition to
the interference possibility, may alternatively be associated with
sampling variability that is not related to the chemistry of the
system—or there may be an altogether other reason, which it is up
to the interpretation diligence of the data analysts to discover. If the
time of harvesting in the pea example above was, say, described by PC3 and PC4 instead of by PC1 and PC2, it would not make much sense to plot PC1 vs PC2 for studying these aspects. PC1 and PC2
would certainly describe “something” (other), but not what is being
sought for in that case. Correlation is not per se equivalent to
causality (more of this in later demonstrations of data analyses and
exercises).
These rules of thumb are rather general and there are exceptions.
The best advice is to start all data analysis following these simple
rules, but always look out for possible special features. After an initial
analysis, it may be that higher-order score plots are not necessary for
interpretation at all. There are also many interesting cases in which
the problem-specific information is “swamped” by, or distributed
along many components. At other times, both the first largest
components, and the truly insignificant higher-order ones have to be
discarded: the particular subject matter investigated may be revealed
specifically in one, or more, of the intermediate components.

A case in point: In the 1980s KHE was involved in geochemical prospecting in Sweden; the following features are courtesy of an
erstwhile pioneering Swedish prospecting company, Terra Swede AB.
Working with 1 kg moraine overburden samples, for which over 30
geochemical main and trace elements were analysed, an appropriate
multivariate approach was critically necessary and both PCA and
Partial Least Squares Regression (PLSR, see chapter 7) were
employed. The geochemical prospecting campaign was directed
towards finding new, hidden gold mineralisations (gold was only one
of more than 30 X-variables measured for).
The specific gold-correlated variables (proprietary information)
could in fact all be isolated in PC-3, which amazingly only accounted
for 8% of the total variance (PC1: 57%; PC2: 17%). The first two
components were related to major geological processes responsible
for the overall moraine deposits and their chemistry, i.e. chemistry not
related to rock fragments originating from buried gold mineralisations.
Still, these glacial processes and their impact on the overall chemical
makeup of the primary moraine samples were of a dominating
magnitude so as to “control” the first two PCs (accounting for no less
than 74% of the total variance). It was only after successful isolation
of the following PC3 that the geochemical mapping and exploration
gained momentum—all based on the plotted geographical disposition
of the scores of PC3.
The task of analysing the raw geochemical data, X (~2200 samples), was overwhelming: consider, for example, interpreting
more than 30 individual geochemical maps simultaneously, one for
each of the 30 elements analysed for—an impossible challenge even
for trained geologists. With PCA this job easily became the
successful culmination of the screening phase of geochemical
prospecting, by which the company was able to “zoom in” effectively
to a very few areas of maximum potential. One of these turned out to
be the proverbial “pot of gold” at the end of the rainbow, indeed
making the first breakthrough for the exploration company, and a
gold ore body was soon ready to be developed. As it happened, the
company was later sold to an interested, much larger entity in the
mining and exploration sector—with PCA now playing an
instrumental role in all further endeavours.
This example is about PCA in a very specific context (geologists
specialising in geochemical exploration are naturally few and far
between), but the usefulness, versatility and power of applied PCA is
easily understandable for all data analysts whether working with other
types of data, however different. There is a powerful lesson to be
learned about the power of creative use of PCA, both in standard
applications as well as in more problem-specific contexts. This
informed use of applied PCA is a key feature of multivariate data
analysis—and there is a wide carry-over potential from the practical
PCA examples outlined above: food and beverage consumption, pea
flavour characteristics or “hidden” gold mineralisation signatures—the
swathe of data types that can be analysed profitably with PCA is
enormous.

4.10 The loading plot–variable relationships


A loading plot can be viewed as a “map of variable interrelations”. As
with scores, loading vectors can be plotted against one another.
These plots are important for interpretation in their own right, but take
on their full role when compared with their corresponding score plots.
In this regard, as scores relate to samples, loadings relate to the
variables used to characterise the samples.
Above, the object relationships were studied with the help of
Figure 4.11, the PC1 vs PC2 score plot. Figure 4.12 displays the
corresponding (p1 vs p2) plot, made by plotting loading vector PC1
versus loading vector PC2. Note that the points plotted here
represent variables (instead of objects). The loading plot provides a
projection view of the inter-variable relationships (variable similarities),
such as these are expressed by correlations (assuming the data set is
auto-scaled, see further below).
The loading plot shows how much each variable “contributes to
each PC”, i.e. how much each original variable contributes towards
defining the direction of each PC. This explanation makes for easy
understanding when met with for the first time, but recall that each PC can be represented as a linear combination of the original unit vectors,

PCa = p1ae1 + p2ae2 + … + ppaep

and that the geometrical understanding of loadings is as directional coefficients for the PCs. The loadings are simply the mixing
coefficients in these linear combinations. Each variable contributes to
more than one PC, in fact all variables contribute to all PCs
calculated, only with different coefficients for each. The loadings must
be different, as PCs per definition represent orthogonal directions in
the variable space.
In Figure 4.12 the x-axis shows pk1, the loadings of all the variables on PC1, k = [1 … p]. The y-axis, correspondingly, shows the loadings defining the direction of PC2 in
the variable space i.e. pk2.
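
As a companion to the score-plot sketch given earlier, a loading plot can be produced in the same minimal Python/numpy fashion; the loadings are the right singular vectors of the mean-centred data. Again, the data matrix and the variable names below are hypothetical stand-ins, not the pea sensory data.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 10))                            # hypothetical stand-in data (n x p)
var_names = [f"var{k + 1}" for k in range(X.shape[1])]   # hypothetical variable names

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
P = Vt.T                                       # loading matrix; column a is the direction of PC a

fig, ax = plt.subplots()
ax.scatter(P[:, 0], P[:, 1], s=5)
for k, name in enumerate(var_names):           # annotate each point with its variable name
    ax.annotate(name, (P[k, 0], P[k, 1]), fontsize=8)
ax.axhline(0, lw=0.5)
ax.axvline(0, lw=0.5)
ax.set_xlabel("PC1 loadings (p1)")
ax.set_ylabel("PC2 loadings (p2)")
plt.show()
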
The variables “Sweet”, “Fruity” and “Pea Flav” are located to the
right of the plot. “Sweet” thus contributes strongly to PC1 but not to
PC2, since its loading value on the PC2 axis is very close to 0. An
earlier look at the (t1 vs t2) score plot for the peas data found that PC1 was related to harvest time, and the inferred relation to pea flavour can, strictly speaking, only be appreciated with this loading plot; "Pea
Flav” also “loads” very high on the positive PC1 direction (as the
parlance goes). From this relationship between three variables in the
loading plot it can be deduced that measurements of “Sweet” can be
used, together with the other similar variables “Fruity” and “Pea Flav”,
to evaluate harvest time. The later the peas are harvested, the
sweeter they are.

Figure 4.12: Loading plot of PC1 versus PC2 (p1 vs p2).

From the loading plot it can be seen that other variables also
contribute to PC1, but at the opposite end of the p1 axis (represented
by negative loadings on PC1), these are the variables “Mealiness”,
“Hardness” and “Off Flav”. It does not take an expert to deduce that
sweet peas do not have an off flavour, nor are hard mealy peas fruity
and have a characteristic pea flavour. Therefore, properties that lie at
approximately 180° along a PC direction from each other typically
represent opposite features that are negatively correlated with one
another to different degrees.
Due to the orthogonal properties of PCs, it can be deduced that
Sweetness has very little to do with the property described by PC2.
From the loadings plot, PC2 seems to be describing the colour of the
peas, with those located in the positive PC2 direction being greener
than those in the negative PC2 direction (being whiter in colour).
Some variables display positive loadings, that is to say positive
coefficients in the linear combinations, while others have negative
loadings. For instance, “Sweetness” contributes positively to PC1,
while “Off Flav” has a negative PC1 loading (as well as a very small
positive PC2 loading). PCA loadings are usually normalised to the
interval [–1,1], more of which in the next section on Correlation
Loadings.

4.10.1 Correlation loadings

When a PCA [and for that matter a Principal Component Regression (PCR) or PLSR, see chapter 7] has been performed and a two-
dimensional plot of loadings is displayed, a Correlation Loadings plot
can be used to aid in the visualisation of the variable data structure
[21, 22].
As the sum of the squared loadings for each PC in PCA is 1.0 it is
difficult to set a fixed cut-off for when it is relevant to interpret
variable relationships in the loading plot. For this purpose, correlation
loadings can be computed for each variable as the correlation with
the particular PC. Recall that components are linear combinations of
all p variables, a kind of “super variable”. In addition, this new plot
contains two ellipses to convey how much variance is taken into
account. The outer ellipse indicates 100% explained variance for
each variable, while the inner ellipse indicates 50% explained
variance. The importance of individual variables is thus visualised
more clearly in the correlation loadings plot compared to the standard
loading plot (compare Figure 4.12 to Figure 4.13).
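
A minimal sketch of how correlation loadings can be computed, following the definition above (the correlation of each original variable with each PC score), is given below; the data are hypothetical and the 50% and 100% ellipses are drawn as circles of radius sqrt(0.5) and 1. This is an illustration only, not the exact computation used in any particular software package.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 10))                  # hypothetical stand-in data (n x p)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U * s                                      # scores

# correlation of every original variable with the PC1 and PC2 score vectors
corr_load = np.array([[np.corrcoef(Xc[:, k], T[:, a])[0, 1] for a in (0, 1)]
                      for k in range(Xc.shape[1])])

fig, ax = plt.subplots()
theta = np.linspace(0, 2 * np.pi, 200)
for r in (np.sqrt(0.5), 1.0):                  # inner (50%) and outer (100%) ellipse
    ax.plot(r * np.cos(theta), r * np.sin(theta), lw=0.5)
ax.scatter(corr_load[:, 0], corr_load[:, 1], s=5)
ax.set_aspect("equal")
ax.set_xlabel("Correlation with PC1")
ax.set_ylabel("Correlation with PC2")
plt.show()
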
Variables close to each other in the correlation loading plot reflect
a high positive correlation but only to the degree that the two
components explain a significant portion of the variance of X
(variables in diagonally opposed quadrants are negatively correlated
to each other). In Figure 4.13, for example, variables "Sweet" and
“Fruity” are highly correlated and overwhelmingly explained in the
positive PC1 direction. “Colour 2” and “Colour 3” are also highly
correlated but they are explained by the positive PC1 and PC2
directions. “Texture” is not well described by PCs 1 and 2 (as it lies in
the inner ellipse).

Figure 4.13: Correlation loadings of the pea sensory data, PC1 vs PC2. Compare Figure
4.12.

Figure 4.14: Correlation loadings for the pea sensory data, PC3 vs PC4.

Furthermore, as the plot clearly displays, if none of the variables are explained to more than, for example, 50%, this also indicates whether the
PC has any useful information or not. The threshold for deciding what
would be the least relevant fraction of explained variance of course
depends on the nature of the data; this is why the Explained Variance
plot (section 4.12.9) is a welcome companion plot. The above is also
a lead-in to a more general discussion about the difference between
significance and relevance.
Figure 4.14 shows the correlation loadings for PCs 3 and 4 for the
same data set.
From the PC3 vs PC4 loading plot, “texture” is an important
variable along PC3 as it has the highest loading and lies close to the
100% ellipse. However, PCA on an array of completely random data
will also result in loading plots in which some variables just by chance
will show high loadings. This is so because the sum-of-squares for all
variables is always 1.0 for each PC, no matter if there is systematic
structure in the data set or not. Loading plots must therefore always
be interpreted and accorded careful consideration as to the fraction
of the total data set variance captured by the particular component(s),
which could be low. In contrast to these subtleties, the correlation
loadings show clearly that PC3 (10%) is dominantly due to the
variable Texture and that PC4 (5%) has no variable explained well
(the variable Skin is close to the inner ellipse boundary).
Another important feature of correlation loadings is that they are invariant to the scaling of the original variables, apart from a possible rotation of the axes depending on whether or not the variables are weighted. This means
that if the model is based on mean-centred data with weights 1.0 (as
is sometimes an unfortunate, misunderstood option for data from
instrumental measurements, e.g. in spectroscopy), the correlation
loadings will reveal if the variables with small variances carry the
same information content as those which show up with high loading
values.
Figure 4.15 illustrates this for a data set from a spectroscopic
application. The data are near-infrared spectra of mixtures of three
alcohols for which a PCA was performed without weighting the
variables as 1.0.* For this type of instrumental data, it is often
assumed that since the variables are of the same type, variables with
a higher empirical variance are more important than variables
displaying a low(er) variance. It is not so common to display 2-D
loading plots for instrumental data as 2-D scatter plots, but this can
give valuable information related to the complementary score cross-
plot and line correlation loadings plots.
Thus, in Figure 4.15 the left 2-dimensional loadings plot shows a
group of variables having high loadings on negative PC1 whereas
PC2 is described by only a few variables, also with positive
contributions from PC1. The line correlation loadings plot to the right
shows the loadings for one PC at a time and shows how the actual
spectral profile correlated to the positive and negative PC directions,
thus making it simpler to interpret.
This feature is identical for similar loading plots.
When PLSR is described in chapter 7, it will be shown that the loadings model the structure of X only; in the case of PLSR, however, the scores originate from the covariance of X and Y. The loading-weights (W) represent the covariance (correlation, if the variables are auto-scaled to unit variance) between the individual X-variables and Y for each PLSR factor. The loading weights have been found to be the most relevant for interpreting the importance of the X-variables in prediction modelling. But even so, also in this setting, a correlation loading plot combined with the loading-weights and regression coefficient plots gives a valuable synoptic overview for each
component alone, or in combination. This is where correlation
loadings are a valuable diagnostics tool for many of the methods
discussed in this book.

Figure 4.15: Comparison of 2-dimensional and 1-dimensional correlation loadings plots for
spectroscopic data.
4.10.2 Comparison of scores and loading plots

The corresponding scores and loadings plots are complementary and give valuable information about both the objects and the variables
when studied together. Figure 4.16 shows the comparison of the
scores and loadings plots for the peas sensory data discussed earlier
in this chapter. Here the objects in the score plot are numbered
instead of, as previously, shown by their annotation codes. The use of
optional name-character(s) as plotting symbols has many advantages
that will gradually become clear in many examples below.
Sample (= object) A1#3 has a position in the score plot that
corresponds to the direction of variable “Off-
flavour”/“Mealiness”/“Hardness” in the loading plot. This means that
this sample has a high value for these variables. Sample A5#5 is very
sweet, and so on. The knowledge gained about PC1 would now
suggest that early harvesting time seems to give off-flavoured peas
that are hard and mealy, while late harvesting (positive PC1 scores)
results in sweet peas with strong pea flavour. Now PC2 can be
interpreted as a colour axis. The variables relating to colour do not sit
closely on the PC2 axis (like the sweetness and off flavour variables),
however, they are primarily defined by this PC.
There is probably nothing particularly surprising to food science in
the above interpretations of this simple data analysis. However, the
complementary interpretation strategy of using the score plot to
come up with the how and the loading plot to understand why is a
very general, indeed a universal feature. To be specific:
How? (...are the objects distributed with respect to each other, as
shown by the decomposed PC score plots?). The Score plot shows
the object interrelationships as revealed in the variance maximised
set of PCs plotted.
Why is this so? (...which variables go together, as manifested by
their correlations, which are defining the PCs?). The Loading plot
shows the effective correlation interrelationships between
variables, and which are the most dominantly responsible for the
disposition of the objects in the complementary score plot.
The loading plot is used for interpreting “the reasons” behind
the object distribution as shown in the scores plot.
In many cases one uses similar 2-D score/loading plots illustrated
above, but not always; in practice data sets can be more complicated
than what can be modelled with the use of just two PCs, but this only
means that some of the higher-order components are also needed.
These issues will become clearer as hands-on experience grows by
doing exercises and eventually one’s own data analyses.

Figure 4.16: Complementary scores and loadings plots establishing the optimal basis for
data structure interpretation.

4.10.3 The 1-dimensional loading plot

In some cases (as shown in Figure 4.15), the inclusion of many variables may restrict the usefulness of the 2-dimensional loading
plots. In spectroscopy, for example, where there can be up to several
thousand variables, the 2-dimensional loading plots are generally too
complex for simple interpretation and result in very overloaded plots
(especially if variable names are annotated). This is generally of very
little use, but these plots can occasionally be useful for detection of
selective variables, i.e. variables not correlating with many others. The
type of detailed interpretation as was done with the pea data set
above is often very difficult based on 2-D plots in such cases. It is
then necessary to use the 1-vector loading plots for one PC at a time.
These are often referred to as “loading spectra”, because they often
take on the appearance of “typical spectra”, but only in a
morphological sense. Nevertheless, experience from spectroscopy is
a distinct bonus when interpreting such loading spectra and their
relationships.
Figure 4.17 is a 1-dimensional PC2 loading plot from a PCA of a
set of mid-infrared spectra of various edible oil samples (described in
detail in chapter 10). The 1-D loading plots offer a great advantage,
as they are very useful for the assignment of spectral bands. As can
be seen from Figure 4.17, such a loading plot indeed shows where
the second greatest source of variance occurs in the data, which in
this case can almost exclusively be related to the single peak
observed at 960 cm–1. The loadings in Figure 4.17 belong to a PC that
describes trans-fatty acids in the oils. For comparison, the raw mid-
infrared spectra of all of the oil samples are presented in Figure 4.17.
Figure 4.17: Line loading plots in spectroscopic data can help to isolate single sources of
variability within complex data structures.

In the spectral region chosen, the general profile and features of the data are quite similar. The raw data was grouped by oil type and
the spectra of “Corn Margarine” have a distinct spectral feature (after
preprocessing the data, refer to chapter 5). Note that this is a very
realistic example of the typical way one goes about interpreting the
“meaning” of a particular principal component in PCA of
spectroscopic data. Hopefully it is apparent why the 1-dimensional
plotting (a “line plot”) is a very useful feature for all data sets of the
“spectral type”, be this “real chemical spectra” or data that take on
the appearance of spectra only in an analogous sense. This latter is
important. The 1-dimensional line plot option is not restricted to
spectroscopic data but can, indeed should also be applied to many
other spectra-like data types.
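
A minimal sketch of such a 1-D ("line") loading plot for spectral-type data is given below; the wavenumber axis and the spectra are hypothetical stand-ins for data like the edible-oil example, not the actual mid-infrared measurements.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
wavenumbers = np.linspace(1800, 900, 300)      # hypothetical spectral axis (cm-1)
X = rng.normal(size=(40, wavenumbers.size))    # hypothetical stand-in spectra (n x p)

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

plt.plot(wavenumbers, Vt[1], lw=1)             # PC2 loading vector plotted as a line ("loading spectrum")
plt.gca().invert_xaxis()                       # spectroscopic convention: high wavenumbers to the left
plt.xlabel("Wavenumber (cm-1)")
plt.ylabel("PC2 loading")
plt.show()
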
A final note on the graphical display of loadings and loading plots.
2-D score-plots can be understood as 2-D variance-maximised maps
showing the projected inter-object relationships in variable space,
directed along the directions of the particular two PC components
chosen. By a symmetric analogy: 2-D loading plots may be viewed as
2-D maps, showing projected inter-variables relationships, directed
along the pertinent PC-component directions in the complementary
object space.
The object space can be thought of as the complementary co-
ordinate system made up of axes corresponding to the n objects in
matrix X. In this space, one can conveniently plot all variables and get
a similar graphical rendition of their similarities, which will show up in
their relative geometrical dispositions, especially reflecting their
correlations. The analogy between these two complementary
graphical display spaces is matched by a direct symmetry in the
method used to calculate the scores and loadings by the Non-linear Iterative Partial Least Squares (NIPALS) algorithm [23] (see section 4.14).

4.11 Example: city temperatures in Europe


This example applies PCA to City Temperature data, taken from a
table of monthly temperatures for 26 European cities. In addition,
there is a category variable named “Region” that organises the cities
in five groups (North, East, South, West, Central Europe). This
example illustrates how background knowledge, “domain-specific
knowledge”, allows informed exploratory multivariate data analysis
and how categorical information can be used directly for visualising
groups of samples and for validation insight.

4.11.1 Introduction

The criterion of PCA is to maximise the modelled variance for a set of principal components, thereby modelling orthogonal latent data
structures. Although this criterion is strictly mathematical it will in
many applications reveal data structures that are recognisable to the
“owner” of the data in terms of both well-known as well as new,
unknown, underlying phenomena in the data system. The City
Temperature example is small in magnitude (26 × 12), but rich in
didactic power to illustrate many of the above issues and features.

4.11.2 Plotting data and deciding on the validation scheme


Figure 4.18 shows a line plot of the data grouped after region; the
names of the cities are deliberately not given.
The initial line-plot of the raw data show the obvious feature that
temperatures are higher during summer months compared to winter
months (in the northern hemisphere). As expected, there is a
tendency that the southern cities have higher temperatures than the
Nordic cities, especially in the winter. A few apparent anomalies can
also be seen on the right side of the plot, more about this later.
The first thing to decide on is centring and scaling. In this case the
variables will be mean centred but not scaled to unit variance
because they are all measured on the same scale. Observe, however,
that these data can just as well be scaled—the data analytical results
will be identical in the sense that the relative score pattern(s) will not
be changed. N.B. This example is not meant to set a procedural standard; usually, in the beginning, the data analyst will do well to always auto-scale—as no harm can ever be done.
The next question is how the model should be validated. The most
conservative option would be to set aside, say, 40% of the cities as a
separate test set. However, as there are only 26 objects, it may be
decided to have all the cities in the model, to interpret the underlying
structures in toto. Full cross-validation tests whether the pattern for
the monthly temperatures is stable when one city is taken out of the
model at a time. Another option would be to leave out one region at
the time, thereby testing the hypothesis that the relationship between
the variables is not changing between regions. This would yield a
more conservative validation estimate. The topic of validation is a
central theme in this book and a more detailed analysis of this topic is given in chapter 8.

Figure 4.18: Line plot of the temperature data, annotated and grouped after region.

4.11.3 PCA results and interpretation

A PCA model was computed with full cross-validation. The score plot for the first two PCs is shown in Figure 4.19. The model explains 87% of the variance in the first PC and 8% in the second. The validation variance (not shown here) was almost identical: 87% and 7%, which indicates that the model is stable with respect to successively keeping out one city at a time and projecting it onto the model based on the other 25. The ellipse
(known as Hotelling’s T2 ellipse) represents a 95% confidence interval
around the model mean and is one criterion for detecting outliers. As
can be seen, there are no clear outliers in the model space when PC1
and PC2 are jointly assessed, however, Moscow lies slightly outside
the ellipse. Based on the position of cities in the score plot, the first
component corresponds to a geographical north–south axis, whereas
the second component describes the complementary east–west axis.
This latter observation is interpreted more to reflect varying proximity
to the Atlantic Ocean, which is the probable causal effect concerning
the yearly temperature profile, and not the geographical east–west
axis per se.
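
A minimal sketch of the mechanics behind this kind of cross-validation, i.e. leaving one city out, building the model on the remaining 25 and projecting the left-out city onto it, is given below; the 26 × 12 matrix used here is a random, hypothetical stand-in for the actual temperature table.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=10, scale=8, size=(26, 12)) # hypothetical stand-in: 26 cities x 12 months
A = 2                                          # number of PCs in the model

i = 0                                          # index of the city left out
X_train = np.delete(X, i, axis=0)              # the other 25 cities
mean_ = X_train.mean(axis=0)
Xc_train = X_train - mean_                     # mean-centre only; no scaling (same units)

U, s, Vt = np.linalg.svd(Xc_train, full_matrices=False)
P = Vt[:A].T                                   # loadings of the A-component model

x_new = X[i] - mean_                           # centre the left-out city with the training mean
t_new = x_new @ P                              # projected scores of the left-out city
residual = x_new - t_new @ P.T                 # the part of the city not explained by the model
print("scores:", t_new, " residual sum of squares:", residual @ residual)
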
The loadings show to what extent the monthly variables are
important in the various PCs, as a function of how the variables were
scaled before the modelling. It is common to present loadings as line
plots for data where it is most natural to look at the raw data as one
line per object. The first loading in Figure 4.20 confirms that
temperatures are indeed higher in southern Europe than in the
northern parts (positive loadings for all months). Also, the north–south
difference is slightly more pronounced in the winter than in the
summer as was (perhaps) also seen by the data analyst in the line
plot of the raw data.
Figure 4.19: Score plot for PC1 vs PC2 for the European City Temperature data.

Figure 4.20: Line plot of loadings for the first two PCs for the European City Temperature
data set. (Blue line PC1, red line PC2)

The pattern for PC2 can be interpreted as the variation of temperature for the cities over the year; the PC2 loading looks
remarkably similar to the raw data profiles.
This is a powerful example of how PCA decomposes the hidden
data structure into two distinctly different phenomena (a geographical
north–south polarity and a yearly temperature profile). Of course,
these two phenomena play out their role in the yearly temperature
developments in the cities in Europe in a thoroughly interrelated
fashion—behold the power of data analysis!
From the score plot it can be seen that Dublin and Bucharest span
the extremes of this PC2 dimension. However, all cities along the line
between Dublin and Bucharest have the same average yearly
temperature because their score on PC1 is the same.
The data model itself does not “know” anything regarding the
geographical location and climatic particulars of these cities, but this
information is inherent in the data, and is very effectively brought to
the attention of the diligent data analyst/interpreter. “The data always
speak objectively about the latent structure(s)”. Here is a sneak
preview of coming attractions regarding advanced features and
interpretation of PCA results: what could be the reason behind, the
cause of the conspicuous small “dipping deviation” from a suspected
smooth year-profile shown by the PC2 loading spectrum for the
month of June? This feature is not immediately obvious either in
Figure 4.20 or Figure 4.21.
Although in this case the loadings could be easily interpreted from
the line plot of the raw data, this is not always the case. As the
starting point for the model is the mean centred data, the raw data
could alternatively also have been visualised after mean centring.
Figure 4.21 thus shows the data after mean centring, and it is now evident that the (centred) original data can, in fact, be reconstructed by multiplying scores with loadings: Xcentred = TPT + E, which equals TPT exactly when all PCs are retained.
Figure 4.21: City Temperature data after mean centring, grouped and annotated according
region.

The Influence plot is important because it can, to some degree, be used for detecting outliers (Figure 4.22). The abscissa shows the
Hotelling’s T2 statistic with the critical limit, and the ordinate
represents the residuals (in this case the F-residuals) which are a
measure of the samples’ fit to the model. This makes it possible to
distinguish between samples that are extreme or influential within the
model space and samples that have high residuals because their
pattern for the variables does not follow the other samples (in the
model). Moscow, Marbella and Dublin are the cities with the largest
distance from the model centre after inclusion of two PCs in the
model and consequently show the largest Hotelling’s T2 values. Since
the influence plot is shown corresponding to two PCs, there is a one-
to-one relationship between the Hotelling’s T2 critical limit and the
confidence ellipse in the 2-D score plot; Moscow lies slightly outside
the ellipse.
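
A minimal sketch of how the two axes of such an influence plot can be computed, Hotelling's T2 within the model space and the squared residual per sample, is given below. The data are hypothetical, and the 95% limit shown is one commonly used formula, offered here as an assumption rather than the exact recipe implemented in The Unscrambler®.

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)
X = rng.normal(size=(26, 12))                  # hypothetical stand-in data (n x p)
n, p = X.shape
A = 2                                          # number of PCs retained

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A]                             # scores for the retained PCs
P = Vt[:A].T                                   # corresponding loadings

score_var = s[:A] ** 2 / (n - 1)               # variance of each score vector
T2 = np.sum(T ** 2 / score_var, axis=1)        # Hotelling's T2 per sample (distance within the model)

E = Xc - T @ P.T                               # residual matrix after A PCs
Q = np.sum(E ** 2, axis=1)                     # squared residual per sample (distance to the model)

# one commonly used 95% limit for T2 (an assumption, not necessarily the book's exact recipe)
T2_lim = A * (n - 1) / (n - A) * f.ppf(0.95, A, n - A)
print("T2 limit:", round(T2_lim, 2))
print("T2:", T2.round(2))
print("residuals:", Q.round(2))
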
Belgrade has a high residual after two PCs, which is also shown
by plotting the scores for PC1 vs PC3 as shown in Figure 4.23.
To understand what is special about Belgrade, the line plot of the
loadings for PC3 is where to look (or the loadings for PC1, PC2 and
PC3 as a 3D scatter plot). As seen for November and December,
Figure 4.24, the loadings have high absolute values for these two
months. Interestingly, the differences between the temperatures for
November and December for Belgrade and what would have been
the expected values for a normal pattern are approximately 5° and 8°,
respectively. By multiplying the loadings with the score for Belgrade
for PC3, one gets (–0.45) × (–11) = 4.95 and (–0.72) × (–11) = 7.92
which actually gives the residual for Belgrade after two PCs in the X
residual matrix (not shown). In this case the reason for the
discrepancy was actually the data source itself from which the data
was transcribed.
This example illustrates that although PCA is “only mathematics”
it often reveals underlying structures that can be interpreted
physically, chemically, geographically… when in possession of the
relevant domain-specific knowledge. Therefore, the person that owns
the data and knows their origin should preferably also be the one that
analyses the data. But it is just as useful if the owner and the data
analyst both sit in front of the PC screen as the PCA results come up.

Figure 4.22: Influence plot after two PCs of the European City Temperature data set.
Figure 4.23: Score plot of PC3 vs PC1 of the European City Temperature data set.

Figure 4.24: Line plot of loadings for PC3 of the European City Temperature data set.

Also, patterns that do not match preconceived expectations will be spotted; sometimes this is due to errors in the data and these samples can be taken out with a good conscience (in fact, transcription errors are among the most frequent causes of outliers).
On other occasions, unexpected patterns may lead to enhanced
understanding of the system under observation and open up the
opportunity for innovation, revealing scientific anomalies as it were...

4.12 Principal component models

A more formal description of PC modelling can now be presented
with a view to the more practical aspects of constructing adequate
PC models.

4.12.1 The PC model

X = TPT + E = Structure + Noise

PCA is concerned with the decomposition of the raw data matrix X into a structure part and a noise part. The equation as written above
hints at how (by describing the separation with respect to E: the error
part, i.e. the noise).

4.12.2 Centring
The X-matrix used in the equations above is not precisely the raw data set. The original variables have first been centred using equation 4.1:

xik(centred) = xik – x̄k   (4.1)

where x̄k is the mean of variable k. Centring was previously presented and is more fully described in chapter 5, as a translation of the variable space coordinate system
origin into the mean data object, as well as in connection with a first description of loadings. (It is sometimes claimed to be desirable to analyse without centring, but this applies only in the rarest of cases and to very special situations. Such situations are not instructive when the objective is to gain a first understanding of all other aspects of PCA; non-centring shall only be touched upon later, in its right context.)
A longstanding, well-established procedure within chemometrics
will be followed (and traditional notation shall be used) which does
not differentiate between centred and un-centred data. For the novice
data analyst, X is always to be centred. When residuals are discussed
below this theme will be touched upon again, but generally: X is
henceforward used as the designation for the centred data matrix.
This model is defined in equation 4.2:

X = TPT + E   (4.2)

In PCA the assumption is that X can be split into a sum of a matrix product, TPT, and a residual matrix E. T is simply the score matrix
described above, and PT is the accompanying loading matrix
(transposed), so there is really nothing new here. But the essentials of
PCA data modelling can now be presented using some simple
mathematics, representing the geometric understanding achieved so
far.
The goal is to determine T and P and use their outer product
instead of X, which will then, most conveniently, have been stripped
of its error component, E:
The PC model is the matrix product TPT
E is not a part of the model per se. E is the so-called residual
matrix. It is that part of X which cannot be accounted for using the
available number of PCs, A; in other words, E is not “explained” by
the model, nor should it be (for which an extra number of
components would have been needed). Thus, E is by definition the
part of X that is not to be modelled by (not included in) the product
TPT. E is therefore a good measure of “lack-of-fit”, which describes
how close the model is to the original data, and for this purpose, a
measure of the magnitude of E is needed. While the data analytical
use of PCA models is mainly concerned with the first data structure
part, TPT, the complementary “goodness-of-fit” measure residing in E
(a large E corresponds to a poor model fit and vice versa) is also a
must.
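
A minimal numerical sketch of this decomposition, splitting a mean-centred data matrix into the structure part TPT and the residual part E for a chosen A, is given below. The data are hypothetical, and the SVD is used here purely as a convenient way of obtaining T and P.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))                   # hypothetical raw data (n x p)
A = 3                                          # number of PCs kept in the model

Xc = X - X.mean(axis=0)                        # centring (equation 4.1)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

T = (U * s)[:, :A]                             # score matrix, n x A
P = Vt[:A].T                                   # loading matrix, p x A
structure = T @ P.T                            # the model part, TPT
E = Xc - structure                             # the residual (noise) part

print(np.allclose(Xc, structure + E))          # True: Xc = TPT + E holds exactly
print(np.sum(E ** 2) / np.sum(Xc ** 2))        # fraction of the total variance left in E
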
Figure 4.25: The PC model as a sum of outer vector products.

4.12.3 Step by step calculation of PCs

Equation 4.2 is compact. It is useful to write the equation as individual PC contributions, as individual outer vector products, as in Equation 4.3:

X = t1p1T + t2p2T + … + tApAT + E   (4.3)

An outer product results in a matrix of rank 1. Here, ta is the score vector for PCa and is n × 1-dimensional. pa is the corresponding loading vector; since it is p × 1-dimensional, paT is thus 1 × p-dimensional. Each outer product tapaT is therefore n × p-dimensional, i.e. the same dimension as X and E, but all these tapaT matrices have the exact mathematical rank 1. This is illustrated graphically in Figure
4.25.
Equation 4.3 makes it easier to understand the actual PC
calculations than the compact matrix equation. PCs are often
calculated one at a time by an iterative algorithm, as is firm tradition
in chemometrics. The following outlines how PCs are calculated in a
first introductory method overview:
1) Calculate t1 and p1 from X
2) Subtract the contribution of PC1 from X: E1 = X – t1p1T
3) (Note: at this stage X = t1p1T + E1)
4) Calculate t2 and p2 from E1
5) Subtract the contribution of PC2 from E1: E2 = E1 – t2p2T
6) (Note: At this stage X = t1p1T + t2p2T + E2)
7) Calculate t3 and p3 from E2
and so forth until the Aopt components have been calculated. The
subtractions involved are referred to as updating the current X-matrix,
or in some cases “deflation”, meaning that the remaining total data
set variance has been reduced by the PC just modelled. It is one of
the most general features of the usefulness of PCA that usually Aopt
<< p. Aopt is the dimension of the final structure sub-space used.
Notice that the letter E is used from the moment PC contributions
are subtracted. When the residual matrix is referred to, it can be seen
why E is used in the step-by-step model calculations.
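
The following is a minimal, illustrative sketch of the one-PC-at-a-time scheme outlined above, written as a NIPALS-style iteration with deflation; it is not presented as the exact implementation used in The Unscrambler® or any other package.

import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    E = X - X.mean(axis=0)                     # start from the centred data, E0
    n, p = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    for a in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # initial score guess: most variable column
        for _ in range(max_iter):
            p_vec = E.T @ t / (t @ t)          # regress E on t to get a loading
            p_vec /= np.linalg.norm(p_vec)     # normalise the loading to unit length
            t_new = E @ p_vec                  # regress E on p to get an updated score
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, a] = t
        P[:, a] = p_vec
        E = E - np.outer(t, p_vec)             # deflation: subtract the rank-1 contribution ta paT
    return T, P, E                             # scores, loadings and the final residual matrix

X = np.random.default_rng(1).normal(size=(15, 6))    # hypothetical data
T, P, E = nipals_pca(X, 3)
print(np.allclose(X - X.mean(axis=0), T @ P.T + E))  # True by construction
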

4.12.4 A preliminary comment on the algorithm: NIPALS

The basis for the step-by-step calculation of PCs is the Non-linear Iterative Partial Least Squares (NIPALS) algorithm [23], see section 4.14. The fact that T and P are both orthogonal results in a very
efficient algorithm for PC calculation. Suffice here to highlight that it is
the specific delimitation between the data structure part, TPT, and the
noise part, E, which is carried out by establishing the numerical value
of Aopt. When Aopt has been determined (correctly), the PCA model
can be considered complete. Of course, the key question remains:
what does “determined correctly” mean?
A signifies the number of PCs to be calculated; A << p. Recall that
the maximum number of possible PCs is min(n – 1, p). If the number
of PCs calculated is deliberately chosen to be A ≡ min(n – 1, p), a so-
called “full model” has been calculated. This means that
decomposition of X amounts to replacing the original co-ordinate
system (the variable space) with a new co-ordinate system (the PC
space) with the exact same full dimensionality, p. In this case the new
origin corresponds to the mean data object, due to centring, but the
number of PCs (and therefore the dimensionality) has not been
reduced. Only by determining Aopt << p, can the PCA advantage, the
splitting of X into a structure part and a noise part, be achieved. If the
full set of PCs is used, all the noise contributions remain in the
decomposition and therefore E will be 0.
The data analyst, or the software PCA program, must arrive at Aopt
in order that the model, TPT, only contains the “relevant structure” so
that the “noise”, as far as possible, is collected in E. This is a central
theme in most of multivariate data analyses. This objective is not
trivial, however, and there are some pitfalls.
Most PCA software will provide information on which to base a
decision, or will trigger-happily offer a proposal for the “correct model
dimensionality”, Aopt, but in the end it is up to the analyst to decide
how many PCs to use. There are many situations where the human
cognitive facilities excel, which simply cannot be programmed
regardless of how far the field of Artificial Intelligence has been
developed. The software suggestions for optimal A are always
algorithmically derived, and in this endeavour everything hinges on
which optimality criterion is used by the particular "A-algorithm".
Rather than skipping this important point by decreeing that all new
data analysts can safely rely on such an algorithmic approach, this
book emphatically demands that the readers take it upon themselves
to fully understand the underlying principles behind PCA, so as to be
able to take proper responsibility for this key issue and therefore also
be able to assess the validity of the algorithmic offerings, which are
also routinely spewed out by all competent software packages (“as a
service to the user”).
There are thus no two ways about it. The responsible question
(always) is: “how is the optimum number of PC model components,
Aopt, arrived at”? The answer is in principle simple: keep track of the
magnitude of the E-matrix as the PCA model evolves one PC after
another—and understand full well how to use all three types of PCA
visualisations (scores, loadings, residual variances). Experience is
everything. Do as many exercises as possible—and then some!

4.12.5 Residuals—the E-matrix

The residual term E can be used to monitor an accumulating PC model fit; large E = insufficient fit, small E = good model. The
magnitude of E is directly connected to A, the number of PCs, as well
as to the data structure present in X, but generally: a small A ⇒ more
noise in E; a larger A ⇒ less noise in E.
However, these terms “small”, “large” and “good” are imprecise.
What is a large E? E is a matrix consisting of n × p matrix elements—
what if in the matrix E some elements are large and some are small?
It is obvious that a definition and quantification of these relationships
need to be precisely defined.
Evaluation of E is always relative to the total variance of a given
data set, X, which is calculated with respect to the centred origin.
This is the new origin in the centre of the data point swarm from
which all PCs are developed (calculated). For systematic reasons, it is
useful to think of this point as a “zeroth Principal Component”. If the
development of a model via step-by-step approximations to the data
can be imagined, each step being more refined than the previous, the
very first approximation is to represent X through the average object,
the zeroth Principal Component. In general, this will undoubtedly be a
very bad PC model for almost all real-world data sets, X, but the point
is that this is never intended to be a full model—it is only a
beginning...
This contribution is subtracted from X. Subtraction of the zeroth
Principal Component is identical to mean centring of the raw data
matrix X. Thus for A = 0, the residual matrix termed E0, is the same as
the centred X. E0 plays a fundamental role as the reference when
quantifying the (relative) size of E as the modelling evolves, and thus
also when beginning to get a grip on “how small” is a small E?

4.12.6 Residual variance

The residuals will change in magnitude as more PCs are calculated and subtracted from X; the residuals must become smaller as more
PCs are added to model the data structure. This is reflected in the
notation for E. E is ascribed an index that designates how many PCs
have been calculated so far and included in the current model. These
higher-order E matrices are compared with E0, the reference.
At the start of all PCA, E0 is identical to X in Equation 4.3 and if there is no TPT term calculated yet, the PC model is also E0:

E0 = X (the centred data matrix)

It is advantageous to compare the variance of the developing residual matrices E in terms of fractions of the total data set variance
E0. So for A = 0, E = 100% (of E0). The residual variance is 100% and
the modelled or explained variance is 0%. Note that the variances are
additive. The size of E is expressed in terms of squared deviations or
variance. E is an error term, so it is natural to evaluate it in this proper
statistical, squared, fashion as a variance.
Summation of the squares of the individual E-matrix elements, i.e. the individual error terms, will be dealt with below. There are two ways
that these summations can be performed: either along the rows,
which provides object residuals, or down the columns, which results
in variable residuals.

4.12.7 Object residuals

The squared residual of an object i, ei², is given by Equation 4.5:

ei² = Σk eik²,   summed over k = 1 … p   (4.5)

and the residual variance is ResVari = ei²/p. Taking the square root of
this sum delivers a measure of the geometric distance between
object i and the model “hyper plane”, i.e. the “flat” or space spanned
by the current A PCs as expressed in the original variable space.
Thus, the object residual is nothing but the geometric projection
distance between the object and the model representation. The
smaller this distance is, the closer the PC representation (the “foot
point”) of the object is to the original object. In other words, the rows
in E are directly related to how well the model fits the original data—
by using the current number of A model components.

4.12.8 The total squared object residual

The squared residual of one object was defined in Equation 4.5. When developing a PCA model, it is desired to fit all the objects as well as possible, simultaneously. Therefore, a total sum of squared residual distances that accounts for all objects is needed. The total squared object residual is defined as the sum of all the individual squared object residuals, as described by Equation 4.6:

e²tot = Σi ei²,   summed over i = 1 … n   (4.6)

and the total residual variance is ResVarTot = e²tot/(p × n). In general, this is referred to as the "total residual variance" without specifying objects.

Figure 4.26: Relative residual variances for the peas sensory data.
The residual variance can be plotted for each object, as shown in
Figure 4.26 for the peas sensory data presented earlier in this
chapter.
This plot is used mainly to assess the relative size of the object
residuals. For example, in Figure 4.26 it can be seen that object A5#1
has a visually larger residual variance than the other objects. The
model does not fit or “explain” this object as well as the others. The
plot could be an indication that object A5#1 may be an outlier, i.e. it is
not like the rest of the objects. The concept of outliers will be returned
to several times in this book, as they are very important. An object
like A5#1 may be the result of erroneous measurements or data
transfer errors, or a sampling error or otherwise…, in which case it should perhaps be removed from the data set, so that it is not modelled together with, and does not influence the model for, the relevant data. Or it may
in fact be a legitimate and significant object representing important
phenomena that the other objects do not include to the same extent.
In the multivariate data analysis realm everything is always “problem
dependent”.
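
A minimal sketch of how the object residual statistics defined in equations 4.5 and 4.6 can be computed for a fitted A-component model is given below; the data matrix is a hypothetical stand-in.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 12))                  # hypothetical data (n x p)
n, p = X.shape
A = 2                                          # number of PCs in the model

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
E = Xc - (U * s)[:, :A] @ Vt[:A]               # residual matrix after A PCs

e2_i = np.sum(E ** 2, axis=1)                  # squared residual per object (equation 4.5)
res_var_i = e2_i / p                           # object residual variance
e2_tot = e2_i.sum()                            # total squared object residual (equation 4.6)
res_var_tot = e2_tot / (p * n)                 # total residual variance
print(res_var_i.round(3))
print("total residual variance:", round(res_var_tot, 3))
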

4.12.9 Explained/residual variance plots

A PCA model is thus calculated in a step-by-step manner, sequentially including PCs one by one; the task is completed once a reliable "stopping rule" is established, Aopt << p. PCA always starts with a total residual variance given by the centred X-matrix, E0. Then t1 and p1 are used to calculate their outer product, which is then subtracted from E0, resulting in the matrix E1. Modelling E1 by adding a new PC will give rise to a new total residual variance (computed from E2), which is compared with the previous total residual variance (computed from E1). The new value must, by definition, be less than the previous one—as it is approximating the X-
data using a criterion that minimises the distances between model
and objects, and the residual variances are measures of these
distances. For each new PC, an updated total residual variance is
obtained which is smaller than the previous one. This can be plotted
as the total residual variance as a function of the number of PCs
(Figure 4.27). This is the third type of key PCA plot (scores, loadings,
residual variance); these three standard plots must always be
evaluated together.
For the total residual variance plot, the graph function must be a
decreasing function of the number of current components, A. It must
decrease towards exactly 0 when A reaches its maximum, i.e. is
equal to min (n – 1, p).
Explained variance is the converse of residual variance and, in this setting, the maximum value that the explained variance can take is 100%. Thus, for a PCA model, the more PCs that are added, the closer the explained variance plot will converge to 100%. It is a matter of choice for the user which plot to use: residual variance plots provide a scale in terms of overall variance, whereas explained variance plots describe the information contained in the model as a percentage of what can be explained overall.

Figure 4.27: Total X-residual calibration variance.
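
A minimal sketch of how such a total residual (and explained) variance curve can be computed as a function of the number of PCs is given below, again for hypothetical data rather than the peas set shown in Figure 4.27.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))                  # hypothetical data (n x p)
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

total_var = np.sum(Xc ** 2)
residual = [100.0]                             # A = 0: the residual variance is 100% of E0
for A in range(1, len(s) + 1):
    E = Xc - (U * s)[:, :A] @ Vt[:A]           # residual matrix after A PCs
    residual.append(100 * np.sum(E ** 2) / total_var)
explained = [100 - r for r in residual]

plt.plot(range(len(residual)), residual, marker="o", label="residual variance")
plt.plot(range(len(explained)), explained, marker="o", label="explained variance")
plt.xlabel("Number of PCs")
plt.ylabel("% of total variance")
plt.legend()
plt.show()
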

4.12.10 How many PCs to use?

The details in the residual variance plot, Figure 4.27, help find the Aopt
for a given data set. There are several ways to go about this.
There exists an empirical rule which says that a large break in this
function—going from a steep slope to a much slower decrease—
often represents the “effective” dimensionality; the optimum A. In
Figure 4.27 this is at PC number 1 (with 3 representing the point
where there is no more variation). There is a logical argument behind
this rule of thumb. Recall that the PCs represent directions of
maximum variances, i.e. along the elongated directions of the data
swarm, in decreasing order. When projected down onto these
directions, the distance measure from the objects to the PC model
will decrease. Also, remember the duality of maximum elongation
variances and minimisation of the transverse residual distances for
each component. As long as there are (still) new directions in the data
swarm with relatively “large” variances, the total residual distance
measure will decrease significantly when adding still one more PC.
This translates into a relatively large decrease in the total residual
variance from PCa to PCa + 1, corresponding to a relatively steep slope
in the plot, from component a to the next.
One may use the mental metaphor that the next PC would still
have something “to bite into”, or that there still is some definite
direction in the remaining projections of the data swarm for the next
component to model. This goes on until the remaining data swarm
does not show preferred orientations (elongations) any longer. At this
point there will no longer be any marked gain with respect to adding
any further PCs (“gain” is here to be understood as modelling gain,
represented by a significant total variance reduction). Consequently,
the total residual variance curve will start to flatten out and the gain
per additional PC will be significantly less than before, thus producing
a slope break. Once this “noise region” has been reached, all of the
subsequent PCs will represent slowly decreasing E matrices as the
added components are in fact only modelling random variation.
To conclude: the optimal number of PCs to include is often the number of PCs that gives the clearest break point in the total residual variance plot. But this is only a first rule-of-thumb. Alas, there are plenty of exceptions to this rule, most of which are due to one form or another of "special" data structure. Note that non-standard
variance plots will be the rule, for example, as long as influential
outliers have not yet been policed out of the training data set. Marked
“clumpiness” is also a special data set structure that will give rise to
an apparent number of components that very likely will not stand up
to diligent final modelling. This issue is taken up again in the central
chapter on validation (chapter 8).

4.12.11 A note on the number of PCs

Interpretation of the PCs is always necessary in PCA. Often this may be nothing more than carefully comparing the patterns in the score
plots with knowledge about the specific problem. The main data
analytical objective is to arrive at the “correct” number of
components, before one can perform any interpretations of the
meaning of the Aopt components retained. If too few components are
retained, resulting in what aptly is termed an under-fitted model, the
interpretation runs the risk of relating only to the most dominating
parts of the data structure, with an absolute certainty of leaving
something significant out. Think of the gold mineralisation PCA
example presented earlier in this chapter: if only the irrelevant first
two PCs were retained—the price to pay would literally be the gold-
related PC3!
Using too many components on the other hand, clearly leading to an over-fitted model, is equally bad because it carries the risk of the frustrating task of trying to ascribe meaning to what in reality is only the noise structure.
Thus, it is a basic PCA assumption that directions of “small”
variance do not correspond to significant data structure(s), but rather
to noise. In PCA it is tacitly assumed that large variances correspond
to systematic phenomena hidden in the data and that these dominate
over lesser variance contributions which rapidly will be thought of as
“random noise” (e.g. sampling errors, measurement errors). When a
specific data set is analysed the intent is to look for definite
phenomena. The philosophical stand behind PCA and related
multivariate methods is that the “large” PCs are likely to be correlated
with the information sought, while the “smaller” ones usually turn out
to be noise and thus irrelevant to the data structure modelling. These
lesser PCs should therefore not be included in the modelling; they
should remain in E. Therefore, by “filtering” the noise out of a data
set, an analyst can concentrate on the structured, in principle,
noiseless part. The total residual variance plot is generally used to
assess where the modelled structure stops and the noise starts.
However, this is by no means a simple and straightforward
procedure—and it should under no circumstances be automated.
The structure in multivariate data sets can be surprising when
revealed graphically, and the crucial task of finding the “optimal”
number of PCs for a given X is absolutely not something that can be
left to an algorithm, no matter how ingenious. A very important lesson
here is that these evaluations are always carried out in the problem-
specific context of the analyst’s full knowledge about the problem, or
situation from which the data matrix X originates. It is bad form
indeed to analyse data without this critical regard for the problem
context—indeed no interpretation is possible without it.

4.12.12 A doubtful case—using external evidence

Consider, for the sake of argument, the case in which it is known, from irrefutable external evidence, that there is some definite level of
residual variance below which the modelling should not proceed. Say
that a number of chemical variables are being analysed, which
happen to be comparable and all more-or-less have the same definite
measurement uncertainty, e.g. 10–12%. Accept for this illustration
then that there is no point in measuring with a relative uncertainty
better than 12%. When trying to model the data matrix X, it would
then not make sense to include more PCs than the number best
suited to give a residual variance above this level (but, of course, as
close as possible). If the next component for example takes the
residual variance from 15% to 6%, it should not be included in the
analysis.
The point here is that indisputable external evidence must override
all internal data modelling results. However, such evidence must be
totally reliable, must be proven beyond doubt. There are cases where
the external factors have not held true upon closer inspection; the
modelling results using more components were later found to be
correct. If ever in a situation like this, one should neither reject the
modelling results immediately, nor ignore the external evidence. It will
be prudent to reflect again on the results and the evidence before
deciding. Again, there is no substitute for building up a large personal
data analytical experience.

4.12.13 Variable residuals

Above, the squared object residuals were studied, which were calculated by summing squared E-elements along the rows in E. If instead the summation is performed along the columns, the corresponding variable residual variances are obtained. Similarly to the case for objects, a squared residual for each variable can be defined by Equation 4.7:

$$e_k^2 = \sum_{i=1}^{n} e_{ik}^2, \qquad k = 1, 2, \ldots, p \tag{4.7}$$

And a total squared variable residual can also be defined by Equation 4.8:

$$e_{\mathrm{tot}}^2 = \sum_{k=1}^{p} e_k^2 = \sum_{i=1}^{n}\sum_{k=1}^{p} e_{ik}^2 \tag{4.8}$$

where eik denotes the elements of the residual matrix E.

Here only the former will be discussed. The residual variance per
variable can be used to identify non-typical variables, “outlying”
variables, in a somewhat similar fashion as for the objects. These
cannot, however, be interpreted in an exact analogous fashion in
terms of distances without introducing a complementary object
space in which variables can be plotted. This concept lies outside the
scope of this introductory book, however.
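A minimal NumPy sketch of the two summation directions, using a hypothetical two-component model on random stand-in data: row sums of the squared E-elements give the object residuals, column sums give the variable residuals of Equation 4.7, and their grand total is Equation 4.8.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 8))                     # stand-in data: 30 objects, 8 variables
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
A = 2                                            # illustrative number of retained PCs
E = Xc - (U[:, :A] * s[:A]) @ Vt[:A, :]          # residual matrix after A PCs

object_residuals   = np.sum(E ** 2, axis=1)      # one value per object (row sums)
variable_residuals = np.sum(E ** 2, axis=0)      # one value per variable (Eq. 4.7)
total_residual     = variable_residuals.sum()    # Eq. 4.8
assert np.isclose(total_residual, object_residuals.sum())
```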

4.12.14 More about variances—modelling error variance

The E-matrix represents a measure of the modelling error. This is the error arising because a complex data structure is being modelled with
but a limited, small number of PC components A << p; this could be
termed the modelling error variance, or the modelling error for short.
In PCA this error denotes the deviation between model and real data
(X).
This is a different viewpoint as opposed to the modelling error associated with the methods PCR and PLSR (to be discussed in chapter 7), where interest is mostly in a measure of the
prediction error (related to Y).
The term "explained variance", the complement of the residual (modelling error) variance, was introduced previously in this chapter. Recall that the residual variance is compared to the total residual variance for A = 0. At this point the total residual variance is 100% and the explained variance is 0%. When A = min(n – 1, p) the residual variance is 0% because E is 0, and the explained variance is 100%. The explained variance is the variance accounted for by the model, the TPᵀ product (always
relative to the starting level, the mean centred X, E0). An easy relation
to remember is as follows:

%Explained variance + %Residual variance = 100%

This is shown graphically in Figure 4.28.

4.13 Example: interpreting a PCA model (peas)


A formal evaluation of the peas data set used earlier in this chapter
will now be provided with full analysis steps and interpretations.
Figure 4.28: Explained and unexplained residual variance always total 100%.

4.13.1 Purpose

The purpose of this analysis is to find which of a number of sensory variables are best suited to describe the quality attributes of green
peas as determined by a trained sensory panel.

4.13.2 Data set

The data set, also used previously, consists of a compilation of a number of replicate measures of 12 sensory variables that have been determined to best characterise pea taste and appearance when
positioning such products in the market for sale. After reformatting
there are 60 pea samples (objects). The names of the samples again
reflect harvest time (1–5) and pea type (A–E). The variables were not
presented in great detail earlier: Each X-variable represents sensory
ratings of all the pea samples, on a scale from 0 to 9, as carried out
by a panel of trained sensory judges. In this example, the complete
PCA is presented.
4.13.3 Tasks

Study score-, loading- and residual variance-plots of the data and interpret the PCA model.

4.13.4 How to do it
Depending on the qualifications of the sensory panel, it is common
practice to auto-scale the data (refer to chapter 5). The present pea
model used auto-scaled data and the method of cross validation was
used to validate the model (see chapter 8 re. the necessary
qualifications).
The PCA overview for this model is presented in Figure 4.29. It
can be seen from the Explained Variance plot that the first two PCs
explain around 79% of the total variance in the data. This is regarded
as good for sensory analysis, due to the relatively high noise level
(relatively high sensory judgment reproducibility variance) in this kind
of “measurement”.
The Explained Variance plot suggests that there are three or four
components that could be interpreted (note the point where the blue
curve flattens out in the plot). The red curve is the validation curve
and is an indication of model quality. More on this in chapter 8.
Now study the score plot. For convenience, it has been grouped
by Harvest Time (as previously shown in Figure 4.11). Previously it
was stated that the PC1 direction was influenced by time (growing,
maturing), which is directly related to the variables “Sweet”, “Fruity”
and “Pea Flavour”. These are diametrically opposed to the variables
“Mealiness”, “Hardness” and “Off Flavour”.
To understand the nature of the second source of variation, along
PC2, the variables “Colour 1”, “Colour 2”, “Colour 3” and
“Whiteness” occupy the PC2 space. “Whiteness” and “Colour 1”
occupy a similar space and are therefore correlated. This would
indicate that “Colour 1” and “Whiteness” are a measure of the same
thing. At approximately 180° to these variables lie the variables "Colour 2" and "Colour 3" (imagine a line connecting the two sets of diametrically opposed variables passing through the origin). This
indicates that “Colour 2” and “Colour 3” are correlated and since
green peas are being assessed, it can be concluded that these two
variables are a measure of the greenness of the peas. This indicates
that PC2 is a “colour direction”. It does not line up exactly with PC2
and therefore, colour and flavour (PC1) are also correlated. This is
why the scores look like they distribute along a gradient along PC1
and PC2.
The scores and loadings pairs for PCs 3 and 4 are provided in
Figure 4.30, this time displaying the loadings as correlation loadings.
From Figure 4.30, it can be seen that PC3 describes a further 10% of the total data variability and is mainly related to the variable "Texture". The scores plot indicates no trending of texture with Harvest Time or Cultivar. PC3 can therefore be assigned as a measure of pea texture and nothing else. PC4 describes an additional 5% of the data variability and is in turn weakly related to the variable "Skin". The correlation loadings plot indicates, however, that this variable is not highly correlated to PC4, and a reasonable conclusion is not to include PC4, as it is not really informative, only indicative.

Figure 4.29: PCA overview of the assessment of green peas using sensory data.
4.13.5 Summary

There is no clear break in the Explained Variance plot, but two PCs
describe 79% of the total variance while the third explains an
additional 10%. The two first PCs are simple to interpret and are
probably sufficient to determine the most important variables for
description of pea quality. The key here is to carefully follow the
quantitative fractions of the total variance associated with each PC.
The score plot showed that the PC1 direction describes the
harvesting time since the samples are distributed from left to right
according to their Harvesting Time numbering. There is no similar
obvious clear pattern in PC2 from the scores plot.
The loading plot shows that “Pea Flavour”, “Fruitiness” and
“Sweetness” co-vary. “Hardness”, “Mealiness” and “Off Flavour” are
also positively correlated to each other, while they are negatively
correlated to “Pea Flavour”, “Sweetness” and “Fruitiness”, since the
two groups are on opposite sides of the origin along PC1 (i.e. they are
diametrically opposed). This means that PC1 mostly describes how
the peas taste and feel in the mouth, which is perhaps not so
surprising: the first direction is defining what trained sensory judges
base their assessment on. The corresponding score plot indicates
that taste is indeed related to harvest time—the riper the peas, the
sweeter they taste.

Figure 4.30: PC scores and loadings for PCs 3 and 4 for the assessment of green pea
quality.

Along PC2, "Colour 1" and "Whiteness" were correlated to each other and are negatively correlated to "Colour 2" and "Colour 3"
projected near the top of the plot. This means that pea samples
projected to the bottom of the score plot are whiter, while those
projected to the top are more colourful (greener).

4.14 PCA modelling—the NIPALS algorithm


In 1966 Herman Wold invented the NIPALS algorithm [23]. This is
claimed to have taken place on the back of an envelope—literally,
which goes a long way to explaining why it does not require any
advanced mathematical training in order to be able to grasp the
essentials. For the present exposition this acronym is used to signify
Non-linear Iterative Projections by Alternating Least-Squares,
although the original meaning was slightly different.
This algorithm has since the 1970s been the standard workhorse
for the chemometric computing behind bilinear modelling (first and
foremost PCA and PLSR), primarily through the pioneering work by
one of the co-founding fathers of chemometrics, Svante Wold
(Herman’s son). The history of the NIPALS algorithm in the
chemometric context has been told by Geladi [24], Geladi and
Esbensen [2], Esbensen and Geladi [3]. The latter two references
actually deal with “The early history of chemometrics”, a topic of
interest to some new chemometricians, hopefully.
In the present introductory treatise on multivariate data analysis,
the main features of the NIPALS algorithmic approach will be
presented for two reasons. 1) Deeper understanding of the bilinear
PCA method following the introductory geometric projection
approach. 2) Ease of understanding of the subsequent PLSR
methods and algorithms, and as a basis for understanding other
methods at a more advanced level. Thus, an in-depth presentation of any specific issues will not be given here; suffice it to appreciate the
general projection/regression characteristics of NIPALS.
The NIPALS algorithm proceeds according to the following steps.
1) Centre and scale the X-matrix appropriately
Index initialisation, f: f = 1; Xf = X
2) For tf choose any column in Xf (initial proxy t-vector)
3) pfᵀ = tfᵀXf / (tfᵀtf); scale pf to unit length
4) tf = Xfpf
5) Check convergence: if ||tf,new – tf,old|| < criterion (10–6), stop iterating; else go to step 3
6) Deflation: Xf+1 = Xf – tfpfᵀ
7) f = f + 1
Repeat steps 2 to 7 until f = A (optimum number of components), or min(n – 1, p)

Explanation of the NIPALS algorithm:


1) The process starts with the centred (usually also scaled) X-matrix.
2) It is necessary to start the algorithm with a proxy t-vector. Any
column vector of X will do, but it is advantageous to choose the
largest column, max |Xi|.
3) Calculation of loading vector, pf, for iteration no. f.
4) Calculation of score vector, tf, for iteration no. f.
Step 4 represents projection of the object vectors onto the fth PC in the variable space. By analogy one may view step 3 as the symmetric operation projecting the variable vectors onto the corresponding fth component (in the so-called object space). Note how these projections also correspond to the regression formalism for calculating regression coefficients, for which reason steps 3 and 4
also have been described as the “criss-cross regressions” heart-of-
the-matter of the NIPALS algorithm. “Criss-cross projections” may be
an equally good understanding.
5) Convergence? NIPALS usually converges to a stable t-vector
solution in less than, say, 20–40 iterations (empirical experience).
The stopping criterion may be 10–6, or less, as desired. A difference
smaller than this for the last two t-iterations signifies that the
NIPALS algorithm has reached a practical stable solution, that is to
say that the proxy PC in variable space has stabilised to the
maximum variance direction sought.
6) Updating:
The updating step is often called deflation. Regardless of the name, it consists of subtraction of component no. f.
The PC model TPᵀ is calculated for one component dimension at a time. After convergence, this rank-one model, tfpfᵀ, is subtracted from Xf. A very important consequence of the way NIPALS goes about
its business is that both the set of score-vectors and loading-vectors
are mutually orthogonal for all f. This is directly responsible for the
superior interpretation features of PCA.
The primary characteristics of the NIPALS algorithm are that the
PCs are calculated one-at-a-time. NIPALS goes about this iterative
PC calculation by working directly on the appropriately centred and
scaled X-matrix. This is an approach for bilinear analysis that sets
itself apart from several other numerical calculation methods, such as the Singular Value Decomposition (SVD) method and the so-called direct XᵀX diagonalisation methods, the description of which falls outside the scope of this book. Appropriate
references for these endeavours can be found in Martens and Næs
[14].
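To make the algorithmic steps concrete, here is a compact Python/NumPy sketch of NIPALS as outlined above. It is a pedagogical illustration only (variable names, tolerance and the proxy-column choice are illustrative choices, not a vendor implementation), but it follows the centre–project–converge–deflate cycle directly.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-6, max_iter=500):
    """One-component-at-a-time NIPALS PCA on a mean-centred X-matrix (sketch only)."""
    Xf = X - X.mean(axis=0)                          # step 1: centre (scaling optional)
    n, p = Xf.shape
    T = np.zeros((n, n_components))                  # scores
    P = np.zeros((p, n_components))                  # loadings
    for f in range(n_components):
        t = Xf[:, np.argmax(np.sum(Xf ** 2, axis=0))].copy()   # step 2: proxy t-vector
        for _ in range(max_iter):
            p_vec = Xf.T @ t / (t @ t)               # step 3: loadings by projection
            p_vec /= np.linalg.norm(p_vec)           #          scale to unit length
            t_new = Xf @ p_vec                       # step 4: scores by projection
            if np.linalg.norm(t_new - t) < tol:      # step 5: convergence check
                t = t_new
                break
            t = t_new
        T[:, f], P[:, f] = t, p_vec
        Xf = Xf - np.outer(t, p_vec)                 # step 6: deflation (rank-one subtraction)
    return T, P

X = np.random.default_rng(2).normal(size=(20, 6))    # stand-in data
T, P = nipals_pca(X, n_components=3)
E = (X - X.mean(axis=0)) - T @ P.T                   # Data = Structure (TP') + Noise (E)
```

Note how the score- and loading-vectors come out (numerically) mutually orthogonal, and how the deflation step is exactly the subtraction of the rank-one tfpfᵀ model described above.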

4.15 Chapter summary


This chapter introduced the fundamental concepts of the method of
Principal Component Analysis (PCA). PCA is an Exploratory Data
Analysis (EDA) method used to reveal the hidden (or latent) structure
in complex data sets. Below it will be shown how PCA also is an
extremely versatile data modelling facility in connection with other
data analysis objectives, e.g. classification, discrimination…
PCA finds this structure by decomposing the data into two sets of
vectors, one describing the object relationships (scores) and one
describing the variable correlations (loadings). There is a mutual
relationship, complementarity, between scores and loadings as one
cannot be defined nor understood (interpreted) without the other. This
is a reflection of the fact that objects cannot be measured without
quantitative measure (data) relating to the variables (and vice versa:
variables are always related to the objects they characterise).
Together, the scores and loadings define a Principal Component, i.e.
a direction in space that describes the maximum amount of variability
in that direction.
There are as many principal components (PCs) to be calculated as
there are sources of real information in the data. If too many PCs are
included in a model, only noise may be captured in higher PCs,
therefore the data analyst applies experience-based stopping rules to
determine when the appropriate number of optimal components (Aopt)
has been reached. In general, the PCA model has the form:

Data = Structure + Noise

In mathematical terms, this translates to the PCA model

$$\mathbf{X} = \mathbf{T}\mathbf{P}^{T} + \mathbf{E}$$

In practice, what this means is that the original data X can be represented (to a good approximation) by multiplying each score and
loading vector for each PC and summing them (summing up to Aopt).
The fidelity of the model determines how well the original data is re-
constructed using the Aopt components retained. The ability to
decompose data into rank 1 scores and loadings has great
interpretability features when trying to describe the nature of the data
being analysed. The residual and explained variance plots are a
measure of how many PCs to interpret.
PCA is highly graphical in nature; the adage that a picture is worth
a thousand words is extremely apt. PCA can, for example, be used
for unsupervised cluster analysis (see chapter 10) where groups of
objects may be visualised based on the particular variables
measured, or it may be used to determine if one batch of products is
equivalent to another, when multiple process variables and
measurements have been made.
PCA can equally well be applied to spectroscopic and non-
spectroscopic data. It has been found that the use of appropriate
plotting techniques can reveal chemical information pertaining to
specific sample groups and many other features of the data. PCA is
used to determine whether a data set has enough systematic
variability in order to justify a bilinear model and make use of this in
future applications, for example regarding Multivariate Statistical
Process Control (MSPC) (see chapter 13).
PCA is a technique that all data scientists and engineers should
have in their toolkit. Since it is so highly visual it can act as a map to
point an analyst in the right direction based on the full multivariate
data advantage, rather than taking the limited two-variable “scatter
gun” approach. It is used to make systematic decisions based on
interpretable plots and an assurance that the model can be validated.
Chapter 5 introduces the common methods of preprocessing, a fundamental first step in data analysis. By removing the rough edges of raw data, preprocessing makes analyses such as PCA, and later regression analysis (chapter 7) and multivariate classification (chapter 10), much more interpretable by removing unwanted variance
components that would otherwise use up one or more components in
PC space. Chapter 6 then goes into the practice of applying PCA to
data in a pragmatic way, so that experience in the method can be
gained in a fast and intuitive manner.

4.16 References
[1] Fisher, R.A. (1936). “The use of multiple measurements in
taxonomic problems”, Ann. Eugenics 7, 179–188.
https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
[2] Geladi P. and Esbensen, K. (1990). “The start and early history
of chemometrics: Selected interviews. Part 1”, J. Chemometr. 4,
337–354. https://doi.org/10.1002/cem.1180040503
[3] Esbensen, K. and Geladi, P. (1990). “The start and early history
of chemometrics: Selected interviews. Part 2”, J. Chemometr. 4,
389–412. https://doi.org/10.1002/cem.1180040604
[4] Wold, S., Albano, C., Dunn III, W.J., Edlund, O., Esbensen, K.,
Geladi, P., Hellberg, S., Johansen, E., Lindberg W. and
Schöström, M. (1984). “Multivariate data analysis in chemistry”,
in Chemometrics, Mathematics and Statistics in Chemistry, Ed
by Kowalski, B.R. D. Reidel Publ., pp. 17–195.
https://doi.org/10.1007/978-94-017-1026-8_2
[5] Kowalski, B.R. (Ed) (1984). Chemometrics, Mathematics and
Statistics in Chemistry. D. Reidel Publ. ISBN 978-94-017-1026-
8
[6] Sharaf, M.A., Illman, D.L. and Kowalski, B.R. (1986).
“Chemometrics”, Volume 82 in Chemical Analysis. John Wiley
and Sons.
[7] Esbensen, K.H. and Geladi, P. (2009). “Principal Component
Analysis (PCA): Concept, geometrical interpretation,
mathematical background, algorithms, history”, in
Comprehensive Chemometrics, Ed by Brown, S., Tauler, R. and
Walczak, R. Wiley Major Reference Works, Vol. 4, pp. 211–226.
https://doi.org/10.1016/B978-044452701-1.00043-0
[8] Ehrenburg, A.S.C. (1978). Data Reduction—Analysing and
Interpreting Statistical Data. Wiley. ISBN 0-471-23398-6.
[9] Brereton, R. (1990). Chemometrics, Applications of Mathematics
and Statistics to Laboratory Systems. Ellis Horwood. ISBN 0-13-
131350-9.
[10] Jackson, J.E. (1991). A User’s Guide to Principal Components.
Wiley. https://doi.org/10.1002/0471725331
[11] Johnson, R.A. and Wichern D.W. (1988). Applied Multivariate
Statistical Analysis. Prentice-Hall.
https://doi.org/10.2307/2531616
[12] Krzanowski, W.J. (1988). Principles of Multivariate Analysis—A
User’s Perspective. Oxford Science Publications, Oxford
Statistical Science series No. 3. ISBN 0-19-852230-4.
[13] Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate
Analysis. Academic Press. ISBN 0-12-471252-5
[14] Martens, H. and Næs, T. (1989). Multivariate Calibration. Wiley.
ISBN 0-471-90979-3
[15] Massart, P.L., Vandegiste, B.G.M., Deming, S.N., Michotte, Y.
and Kaufman, L. (1988). Chemometrics: A Text Book. Elsevier.
ISBN 0-444-42660
[16] Höskuldsson, A. (1996). Prediction Methods in Science and
Technology, Vol. 1. Basic Theory. Thor Publishing, Denmark.
ISBN 87-985941-0-9
[17] Jolliffe, I.T. (1986). Principal Component Analysis. Springer
Verlag, New York. https://doi.org/10.1007/978-1-4757-1904-8
[18] Hotelling, H. (1933). “Analysis of a complex of statistical
variables into principal components”, J. Educ. Psychol. 24, 417–
441. https://doi.org/10.1037/h0071325
[19] Pearson, K. (1901). “On lines and planes of closest fit to
systems of points in space”, Phil. Mag. Ser. B. 2, 559–572.
https://doi.org/10.1080/14786440109462720
[20] Harmon, H.H. (1976). Modern Factor Analysis, 3rd Edn.
University of Chicago Press.
[21] Lorho, G., Westad, F. and Bro, R. (2006). “Generalized
correlation loadings. Extending correlation loadings to
congruence and to multi-way models”, Chemometr. Intell. Lab.
Syst. 84, 119–125.
https://doi.org/10.1016/j.chemolab.2006.04.023
[22] Swarbrick, B. and Westad, F. (2016). “An overview of
chemometrics for the engineering and measurement sciences”,
in Handbook of Measurement in Science and Engineering, Ed.
by Kutz, John Wiley & Sons, Hoboken, NJ, pp. 2331.
https://doi.org/10.1002/9781119244752.ch65
[23] Wold, H. (1966). “Estimation of principal components and
related models by iterative least squares”, in Multivariate
Analysis, Ed by Krishnaiah, P.R. Academic Press, NY.
[24] Geladi, P. (1988). “Notes on the history and the nature of partial
least squares (PLS) modelling”, J. Chemometr. 2, 231–246.
https://doi.org/10.1002/cem.1180020403
* Elsewhere in this book (see chapter 5 on scaling and preprocessing), it is discussed
whether, and why, it is advisable to apply auto-scaling by default. It will be appreciated that
this issue more revolves around whether the empirical variance is approximately constant
over the entire wavelength range, than with a specific “type” of data.
Chapter 5: Preprocessing

5.1 Introduction
It is a fair statement that in more cases than not, data requires some
form of preprocessing. In general, preprocessing is but a minor
modification of a data set, for example, to minimise the impact from
“extraneous noise” such that structural information is more readily
extracted. An excellent analogy is viewing the moon on a slightly
cloudy night. It is obvious that the moon is in the sky but the thin
layer of cloud detracts from the desired clear view. If preprocessing could be likened to a strong wind, it would blow the layer of cloud away, making the moon much more visible and clear.
There are three major data types when preprocessing methods
are considered. These are:
1) Discrete variables. These originate from systems that generate
process data such as temperature, pressure, pH readings etc. or,
for instance, sensory data where individual products are assessed
for characteristics such as taste, colour etc.
2) Spectroscopic variables. These are typically the most common
variables encountered in chemometrics. These are typically
absorbance values generated at each wavelength (wavenumber) for
methods such as infrared, near infrared, Raman, UV-vis. There are
other types of spectroscopic-like variables, e.g. acoustic frequency
variables.
3) Time series variables. These are typically response variables measured over time at regular intervals, and various methods exist to extract the true underlying structures from noisy data.
A new generation of models that include aggregation ("fusion") of discrete variables with spectral data is also available; these require methods such as block weighting of the data. For the purposes of
this chapter, only discrete and spectroscopic variable preprocessing
will be discussed. The interested reader is referred to the text by
Montgomery, Jennings and Kulahci [1] for a more detailed discussion
of how to preprocess time series data.
Chromatographic data can fall into a grey area between
discrete/spectroscopic and time series due to the nature of the data
and its generation. This chapter briefly touches on the preprocessing
of chromatographic data as well.
As mentioned, preprocessing is a method of minimising
extraneous noise in a data set. This is important for the multivariate
methods discussed in chapters 4, 7 and 10, since these methods aim
to extract the greatest sources of variability in the data set. If noise
effects are large in the data that can easily be minimised by some
form of simple mathematical transformation, then the effect is
explicitly removed before the model is developed, rather than making
the model more complex by implicitly removing the effect using a
model factor.

5.2 Preprocessing of discrete data

5.2.1 Variable weighting and scaling


Discrete data are typically the easiest types of data to preprocess as
in most cases, they only require some form of scaling to each other.
This type of preprocessing is typically referred to as variable wise
processing and there are three commonly used approaches,
1) Mean centring
2) Variance scaling
3) Auto-scaling
Each of these approaches will be investigated using a set of
discrete, dissimilar variables to show how the effects of relative
magnitudes and ranges can seriously distort the underlying structure
of the data when comparing variables with each other.
As an example, consider a chemical reaction that is being
monitored using: Temperature, Pressure and pH instrumentation, with
the following acceptable process ranges,

7 ≤ pH ≤ 7.5
250 ≤ Temperature (°C) ≤ 260
1200 ≤ Pressure (kPa) ≤ 1250

It may, or may not, be immediately apparent that a small change in pressure will completely cloud a small change in pH if all variables were modelled in their raw units. The question is, how can the
variables be given equal weighting so that they can each contribute
to a multivariate model?
As a first option, the data could be mean centred by subtracting the mean of each variable from the corresponding value for each object; in this way, all variables would now be centred around zero.
This is shown in Figure 5.1 where the raw data is shown in the top
plot and the mean centred data in the bottom plot.
The raw data now show the magnitude differences in each
variable, with pH almost invisible in the plot. Mean centring of the
data removes the general magnitude. The magnitude of the variability
of each variable around zero can now be appreciated in its relevant
measurement units. This highlights that a small change in pressure
would completely outweigh a small change in pH. The key point
being made is the word “weigh”.
Figure 5.1: Raw and mean centred chemical process data.

There are a number of solutions available for scaling variables, some based on parametric statistics, others on non-parametric statistics; a popular choice is variance scaling. In this
approach, the variables are left un-centred and are divided by their
standard deviation. This approach is really only statistically sound if
the variables are normally distributed, but in the multivariate,
empirical modelling world, this is just considered a source of
variability that can potentially be modelled. The variance-scaled data
are shown in Figure 5.2.
Even after variance scaling, the magnitude of the variables is still
not comparable, however, each variable can now contribute on a
comparable variance scale.
The most common approach of comparing dissimilar variables in
chemometrics is the approach termed auto-scaling, which means first
to mean centre the data set followed by dividing (weighting) each
variable by its pertinent standard deviation. This has the effect of
putting each variable both on an equal level with an equal scaling, so
that small changes in one variable can be compared to small changes
in another. With auto-scaling the bilinear data modelling methods
(PCA, SIMCA, PLS-R etc.) are in effect decomposing the scaled
variance–covariance relationships between all p variables—which is
the mutual correlations between all variables, refer to chapter 1. The
auto-scaled chemical reaction data are illustrated in Figure 5.3.
It is clearly seen that each variable’s variability can now participate
equally during the data modelling process. As is the case with
variance scaling, auto-scaling is strictly speaking only theoretically
valid if the sample distribution of variable values is normally (or near
normally) distributed (refer to chapter 2 for a discussion on normality
testing). In multivariate analysis, data analysts are typically not overly
concerned about the distribution of the data, but more about the
variance data structures, i.e. whether there are spurious or even large changes in the data and how these correlate with all other variables.
Another objective concerns detection of outliers.

Figure 5.2: Variance scaled chemical process data.


Figure 5.3: Auto-scaled chemical process data.
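The three weighting options can be written out in a few lines of Python/NumPy. The numerical values below are hypothetical readings chosen to lie within the process ranges given above; the point is only to show the effect of each operation.

```python
import numpy as np

# columns: pH, Temperature (deg C), Pressure (kPa) -- hypothetical readings
X = np.array([[7.1, 252.0, 1210.0],
              [7.3, 255.0, 1235.0],
              [7.4, 258.0, 1248.0],
              [7.2, 254.0, 1222.0]])

mean_centred    = X - X.mean(axis=0)                              # remove the general magnitude
variance_scaled = X / X.std(axis=0, ddof=1)                       # un-centred, divided by std dev
auto_scaled     = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # centre, then scale

print(np.round(auto_scaled, 2))   # each variable now varies on a comparable scale around zero
```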

For those who wish to adhere to statistical practices, there are variants of the auto-scaling theme that accommodate non-normal data distributions. These include the combinations of
1) Median centring and interquartile range scaling.
2) Minimum centring and range scaling.
3) Spherising: a multivariate version of auto-scaling.
These have certain specific uses in very specific data sets,
however, for the majority of multivariate data analysis in a practical
and pragmatic setting, mean centring followed by standard deviation
scaling is by far the most common method used—i.e. auto-scaling.
Variables may also be weighted guided by a priori information
obtained externally, or even weighted to zero, i.e. the influence of a
particular variable can be completely removed from a contemporary
analysis. Another popular choice of scaling is the logarithmic
transformation. While there exist many other types of variable
transformations, including the well-known Box–Cox [2] and Johnson
transformations [3], this introductory text only focusses on the
simplest of the weighting and scaling methods.
For an excellent overview of a broad spectrum of practical
preprocessing in chemometrics, the interested reader is referred to
the instructive book by Beebe, Pell and Seasholtz [4]. The reader is
also strongly recommended to read the eminently illustrative paper by
Deming, Palasota and Nocerine [5].
5.2.2 Logarithm transformation

Many phenomena (e.g. physical, biological, geochemical or medical) display skewed distribution characteristics that may be compensated for, for example, by a logarithmic transformation:

$$\mathbf{X}^{*} = \log(\mathbf{X})$$

where the logarithm is applied element-wise and X* represents the transformed data matrix X. Figure 5.4 shows the effect of the logarithmic transformation of a skewed data set and how
it results in a more symmetric distribution.
If there is no prior knowledge about the data, a study of the
histograms of the variables is an excellent way to start—in fact this is
mandatory. If the frequency distribution is very skewed, a variance-
stabilising transformation such as the logarithmic function may help.
It is also important to make use of background knowledge when
transforming individual variables such as temperature, as its effect in
a process is rarely linear. That said, one should not try many
individual transforms just to make the distribution more normal. Most
of the multivariate methods do not require normally distributed
variables, but the residuals from the models should not show any
structure. Logarithms are also very useful in the linearisation of
spectroscopic data and more on this topic will be provided in section
5.3.1.
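A small illustration (using hypothetical log-normally distributed data) of how the logarithmic transformation reduces skewness:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # strongly right-skewed variable
x_log = np.log10(x)                                 # element-wise log transform, X* = log(X)
print(round(skew(x), 2), round(skew(x_log), 2))     # skewness drops towards ~0
```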

5.2.3 Averaging
Row-averaging is used when the goal is to reduce the number of
samples in a data set, for example to reduce the effect from
measurement uncertainty (e.g. from reference measurements or
sampling) or to reduce the effect of noise (e.g. instrumentation
precision limitations). Data sets with many replicates of each sample
can also often be averaged with the advantage to ease handling
regarding validation and to facilitate interpretation. The result of
variable averaging (or column-averaging) is a smoother data set. But
there is a price to be paid in reduced resolution; averaging is by no
means always a beneficial option—everything depends on the data
analysis context.

Figure 5.4: Logarithm transformation of skewed data into a symmetric distribution.

While taking the arithmetic mean of discrete variables may be helpful in better estimating the population centre of a data set, typical situations calling for averaging in routine applications are found in fast instrumental measurements, for instance spectroscopic X-measurements that replace time-consuming Y-reference methods. It
is very common to acquire several scans (co-adds) for each analytical
sample. One common question that arises is whether to average the
scans and predict one Y-value for each sample, or should several
predictions be made and the average of these used as the final
result? The answer is that both alternatives give the same result,
which is why averaging can also be done on the calibration data and
its reference values.
The general aspects of “replication”, which reflect a much more
complex situation than the replication of analytical scans situation,
are described in more detail in chapter 9.
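As a small sketch of row-averaging (hypothetical layout: five samples, each measured with three replicate scans of 100 spectral variables, stored replicate-by-replicate):

```python
import numpy as np

rng = np.random.default_rng(4)
scans = rng.normal(size=(5 * 3, 100))               # 15 rows: 3 replicate scans per sample

# Average the replicate scans of each sample before modelling (row-averaging).
averaged = scans.reshape(5, 3, 100).mean(axis=1)    # shape (5, 100): one row per sample
print(averaged.shape)
```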
5.3 Preprocessing of spectroscopic data
Spectroscopic data come in many types and forms, but differ from discrete data in that they typically consist of many variables, all on the same measurement scale, very often giving them some sort of continuous appearance when going from one variable to the next (a spectral appearance). Of the spectroscopic methods, near
infrared spectroscopy (NIRS) has received the most chemometric
attention. One could almost say that chemometrics was designed for
NIRS as the spectral features obtained from an NIR spectrum are
typically broad and overlapping. This means that NIR spectra often
require preprocessing to reveal certain structures in the data that
makes them more amenable to modelling. Many of the examples in
this book are based on the application of multivariate methods to
NIRS data as it provides a wealth of opportunities to show how
various method work. In particular, NIR spectra are typically collected
on solid materials, where what are known as additive and
multiplicative effects act together to offset and skew the spectra with
respect to each other and a common baseline. Methods for
correcting and minimising these effects are discussed later in this
chapter, but for an extensive overview of the NIRS method and the
preprocessing of such data, the interested reader is referred to the
chapter by Swarbrick in the Handbook of Measurement in Science
and Engineering [6].
In fact, all forms of spectroscopic data may require some form of
preprocessing, whether Raman, mid infrared (mid-IR), UV-vis, X-ray
absorption spectroscopy (XAS) etc. The two most common
transformations are those that linearise the original data and those
that smooth it. These approaches are discussed in the following
sections.

5.3.1 Spectroscopic transformations

Spectral data abound. A light source (specific to a particular range of the electromagnetic spectrum) is used to illuminate a sample.
However, to quantify how much light the sample absorbs (or reflects),
requires the use of a standard reference measurement as a baseline.
The two most common modes of spectral data acquisition are
Transmission: Where light is passed through a sample and the light
that is not absorbed is collected on a detector.
Reflectance: Where light is incident on a sample and what is
reflected is collected on a detector.
In the case of transmission, the reference measurement is the
incident light, unimpeded on the detector while in reflectance mode, a
suitably reflective material with minimal to no absorbance
characteristics is used. This incident light intensity is defined as I0 for
each wavelength (λ). After collection of the reference spectrum, the
sample is introduced into the path of the same light intensity where
the transmitted light or reflected light is measured at each
wavelength. This sample light intensity is defined as I.
Transmission (T) is defined as the percentage of transmitted light received by the detector with respect to the incident light intensity I0 and is defined in equation 5.1,

$$T = \frac{I}{I_0} \times 100\% \tag{5.1}$$

A similar expression exists for the Reflectance (R) collected from a sample with respect to the reference spectrum and is defined in equation 5.2,

$$R = \frac{I}{I_0} \tag{5.2}$$

Unlike transmission, reflectance is measured on a scale of 0–1.


When the transmission, or reflectance values, have been collected
over all wavelengths covered by the spectrometer, they can be
plotted in an intensity (Y) vs wavelength or wavenumber plot (a “line
plot” in the Unscrambler® parlance) in which the plotted points are
connected along the wavelength direction to form a spectrum. The
transmission and reflectance spectra are measured on a non-linear
scale, i.e. the peak maxima do not respond linearly with a similar
change in concentration of a material in the sample. This is the
reason that most spectroscopists prefer to work with data
transformed to the absorbance scale. An absorbance spectrum can
be generated from either a transmission or a reflectance spectrum
using equation 5.3,

$$A = \log_{10}\!\left(\frac{I_0}{I}\right) = \log_{10}\!\left(\frac{1}{R}\right) \tag{5.3}$$

The logarithmic transformation is a linearising transformation of the data such that changes in concentration of components now can
be linearly related to the absorbance of the constituent. This is known
as Beer’s Law (strictly Beer–Lambert’s law) which is stated in its
general form by equation 5.4,

$$A = \varepsilon b c \tag{5.4}$$

where A is the absorbance of the sample at a particular wavelength, ɛ
is known as the molar absorptivity constant of an absorbing species
and is wavelength specific, b is the pathlength of the light that passes
through (or off) the sample and c is the concentration of a particular
species being measured in the sample.
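The chain from raw intensities to absorbance and then to concentration can be sketched in a few lines of Python (all intensity values and the ɛb product below are hypothetical, chosen purely for illustration):

```python
import numpy as np

I0 = np.array([1000.0, 980.0, 1020.0, 995.0])   # reference (incident) intensities per wavelength
I  = np.array([ 410.0, 120.0,  830.0, 640.0])   # intensities with the sample in the beam

T = I / I0 * 100.0            # transmission in percent (eq. 5.1)
R = I / I0                    # reflectance on a 0-1 scale (eq. 5.2)
A = np.log10(I0 / I)          # absorbance (eq. 5.3), equivalently log10(1/R)

epsilon_b = 2.5               # assumed (hypothetical) molar absorptivity x pathlength
c = A / epsilon_b             # Beer's law (eq. 5.4) rearranged for concentration
print(np.round(A, 3), np.round(c, 3))
```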
Beer’s law is typically used in laboratory applications of UV-Vis
spectroscopy where a sample is physically preprocessed by
digestion, extraction and dilution such that the final absorbance
readings are on a scale of 0–1 (the region of linearity where Beer’s
law applies). In order to develop a linear relationship between A and
c, the term ɛb must be kept constant. To do this, the choice of a
single wavelength will keep ɛ constant and if the sample to be
measured is a homogeneous, non-scattering liquid, then it can be
contained repeatably in a cuvette of constant pathlength.
Calibration is performed through the preparation of standard
solutions of the analyte of interest covering a concentration range in
the expected region where the unknown sample is expected to
absorb. From Beer’s law, a straight-line calibration curve results and
the unknown samples absorbance can be used to predict the
concentration of the analyte in the sample. The use of physical
preparation usually removes any chemical interferents that may affect
the absorbance reading and by dilution, the calibration curve can be
made linear. This sounds like a lot of work to do to get a predicted
value.
In a real situation where fast measurements are required to obtain
business critical results, the physical preparation of samples is (in
many cases) not feasible, so how can selectivity and specificity of an
analyte be achieved for spectroscopic measurement performed on
samples as they exist in nature or in a process? By the use of
preprocessing, of which much more will be discussed in this chapter.
When solid samples are measured by spectroscopic techniques,
sample packing and density changes will affect the absorbance
scale, which introduces variability in the data due to a physical, rather
than chemical influence. Referring back to Beer’s law (equation 5.4),
in order for absorbance to change for a sample measured at one
wavelength with the same amount of analyte in the sample, the only
parameter that can change is pathlength. In solid samples, pathlength
is typically a function of particle size, for instance, when particles are
large, the spaces between the particles are larger, allowing more light
to penetrate into the sample, thus increasing the effective pathlength
that the light has to travel, before emerging from the sample and
being collected by the detector. When the particle size is small, the
particles tend to pack tighter and the space between particles is
small. This results in less penetration of the light into the sample and
thus a smaller pathlength results. This situation is shown
diagrammatically in Figure 5.5 where the effects on the absorbance of
a material changes with pathlength.
Beer’s law is just a re-expression of the equation of a straight line
where the slope term (b) is represented by ɛb and the intercept (b0) is
set to zero by measuring a blank sample, thus forcing the intercept
through the (0,0) point. It will be shown in chapter 7, that a
multivariate regression equation is an extension of the univariate
straight line equation and in particular, for a spectroscopic calibration
model, the regression coefficients (bn) represent the chemical
importance of the measured absorbances at the n-wavelengths for
predicting the analyte of interest. But wait, is this not just an
extension of Beer’s Law? The answer is yes. As will be shown when
regression models are discussed, the form of the model is defined by
equation 5.5,

$$c = b_0 + b_1 A_1 + b_2 A_2 + \cdots + b_n A_n \tag{5.5}$$

Figure 5.5: Change of absorbance when particle size is changed.

This is an inverse form of Beer's law; however, the regression coefficients (bn) may be interpreted as the combinations of ɛ and b
modelled from the data, representing both chemical and physical
aspects of the prediction process. Thus, in terms of spectroscopic
data, preprocessing can formally be defined as a mathematical
reduction in the effects of pathlength variability in order to reveal the
underlying chemical information (ɛ) in the data. This is the reason why
b-coefficients have some interpretability after all (even though
individual component loading-weights display much more fidelity, see
chapter 7). The methods used to reduce unwanted physical effects
are discussed in the remainder of this chapter. In general, the initial
inspection of all absorbance spectra is recommended.

5.3.2 Smoothing

All data generated by a scientific instrument contain some level of measurement noise that should be removed as much as possible
before multivariate analysis methods are applied. Smoothing is one
such method, crude but effective, that helps reduce the noise without
reducing the number of variables. It is a row-oriented transformation,
i.e. a single variable is mostly influenced by its immediate
neighbouring variables.
This transformation is relevant for variables which are themselves
a function of some underlying influence, for instance the existence of
intrinsic causal spectral intervals (absorbance). In general, smoothing
cannot be performed on non-numeric data, but can be applied when
there are missing data in the set.
In smoothing, X-values are averaged over one segment
symmetrically surrounding a data point. The raw value of this point is
replaced by the average over the segment, thus creating a smoothing
effect. The nature of the smoothing window is determined by the
algorithm used. Four commonly used algorithms are listed as follows.
Moving Average: Also known as boxcar filtering, replaces a central
data value in a selected window by averaging the values within a
segment of data points.
Savitzky–Golay: Fits a polynomial to the data points selected in a
predefined smoothing window and replaces the centre point with the
fitted value.
Median Filter: Replaces the central value in a selected smoothing
window with the median value of the points.
Gaussian Filter: Computes a weighted moving average within the
predefined smoothing window selected for the data points.
The smoothing methods listed above are recursive methods, i.e.
the selected smoothing window is moved across the spectrum and
the process repeated until all points have been processed. Only the
moving average and Savitzky–Golay methods will be described
further in this section.

Moving block smoothing


As the name suggests, a moving block smoothing filter requires a
data analyst to set an appropriately sized window (block) of points to
be smoothed followed by a recursive (boxcar) application of the
smoothing window over the entire region of interest. The smoothing
filter is the arithmetic mean of the data points within the selected
window.
Starting with the raw data, a smoothing window is defined based
on prior knowledge of the spectroscopy being used and the width of
the peaks of interest to be smoothed. If too large a window is chosen,
the signals may be dampened too much, "smoothed out", but if too
small a window is chosen, not enough noise is filtered out.
Experience is king. Figure 5.6 shows the moving block smoothing
window applied to a noisy Gaussian signal.
The smoothing window size must always be an odd integer value
since the average is calculated for the centre point of the window.
This ensures that the new, filtered point aligns with the absorbance
value in the original x-scale.
Figure 5.6: Moving block filter applied to a noisy Gaussian peak.

Savitzky–Golay smoothing
The Savitzky–Golay filter (also available as a derivative function, refer
to section 5.3.5) fits a low order polynomial function to the window
size defined. In this case the centre point of the window is replaced
by an estimate of the fitted polynomial and like the moving block
smoothing algorithm, it is applied recursively across the region of
interest. It is likewise a requirement of the Savitzky–Golay filter
window that the window size is an odd integer value. Figure 5.7
shows how the Savitzky–Golay smoothing filter works and more
details are provided when a discussion of the derivatives is presented
later in this chapter.
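Both filters are readily illustrated in Python; the sketch below applies a moving average (via convolution) and SciPy's Savitzky–Golay filter to a noisy synthetic Gaussian peak (cf. Figures 5.6 and 5.7). The window size and polynomial order are illustrative choices only.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(-5, 5, 201)
rng = np.random.default_rng(5)
clean = np.exp(-x**2 / 2)                            # underlying Gaussian peak
y = clean + rng.normal(scale=0.05, size=x.size)      # noisy "spectrum"

window = 11                                          # odd, so the mean aligns with the centre point
y_ma = np.convolve(y, np.ones(window) / window, mode="same")   # moving average (boxcar)

y_sg = savgol_filter(y, window_length=11, polyorder=2)         # polynomial fit in each window

print(round(float(np.std(y - clean)), 3),            # noise level before...
      round(float(np.std(y_sg - clean)), 3))         # ...and after Savitzky-Golay smoothing
```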

5.3.3 Normalisation
Normalisation is a row wise (or object wise) method that puts all
objects (or variables) on an even footing. Many of the transformations
described and applied so far have been column-transformations, i.e.
making specific preprocessing or transformations which act on one
column-vector individually (single-variable transformations). The
process of normalisation rescales, i.e. normalises, each object into a
common sum, for example 1.00 or 100%.
The row sum of all variable elements is computed for each object
individually. Each variable element is then divided by this object sum.
The result is that all objects now display a common size—they have
become “normalised” to the same sum area in this case.
Normalisation is a row analogy to variance scaling (refer to section
5.2).
Normalisation is a common object transformation. For instance, in
chromatography it is used to compensate for (smaller or larger)
variations in the amount of analyte injected onto the chromatographic
column. It would be of considerable help in the analytical process if
this particular measurement variance can be controlled by a simple
data analytic preprocessing, otherwise (as will be discussed in great
detail in chapter 7), the inclusion of an increased number of data
model components would have to be included in order to model such
irrelevant input variations. In such cases, the use of preprocessing
explicitly removes unwanted physical artefacts rather than relying on
modelling to implicitly account for them, thus unnecessarily
increasing its complexity.
Figure 5.7: Savitzky–Golay filter applied to a noisy Gaussian peak.

There are several other data analysis problems where normalisation can be used in a similar fashion, even though the
physical or chemical reasons for the phenomena compensated for
may be very different from the chromatographic regimen mentioned
above. For example, using nuclear magnetic resonance (NMR)
spectroscopy (a very different spectroscopy compared to the
vibrational and electronic spectroscopies already discussed in this
chapter, particularly in its mode of collection), a method known as
peak normalisation can be used to scale the entire spectrum to the
height of a known, included standard compound. Peak normalisation
is also a useful method when using mass spectroscopic data for
pattern recognition purposes.
The general calculation of a normalised spectrum is provided in equation 5.6 for area normalisation,

$$x_{ik}^{*} = \frac{x_{ik}}{\sum_{j=1}^{p} x_{ij}} \tag{5.6}$$
The denominator in equation 5.6 represents the area normalisation
factor. Replacing this value with the range of the data results in range
normalisation, or if the denominator is replaced with the maximum
value in the spectrum, then maximum normalisation results. The
choice of the appropriate normalisation factor is subject matter
dependent, as described by the chromatographic and NMR
examples.
Beebe et al. [4] suggest that the use of normalisation after the
application of a derivative (refer to section 5.3.5) reduces pathlength
variations when collecting NIR spectra and this is consistent with the
statement in the previous section that the entire purpose of
preprocessing is to minimise pathlength effects (or similar analogous
effects) in samples.
In Figure 5.9, some mid-IR spectra of oils are presented with a
common baseline offset. Figure 5.8 shows the effect of area
normalisation and peak normalisation (at 1743 cm–1).
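The normalisation variants mentioned above differ only in the denominator used for each object (row). A short sketch on hypothetical spectra:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.abs(rng.normal(size=(4, 50)))                       # 4 hypothetical spectra (rows)

area_norm  = X / X.sum(axis=1, keepdims=True)              # eq. 5.6: divide each row by its sum
max_norm   = X / X.max(axis=1, keepdims=True)              # maximum normalisation
range_norm = X / (X.max(axis=1) - X.min(axis=1))[:, None]  # range normalisation
print(np.round(area_norm.sum(axis=1), 2))                  # every object now sums to 1.00
```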

5.3.4 Baseline correction

In some spectroscopic methods, the spectra only appear to have a common and flat offset to each other. A typical example occurs for
mid-IR spectra as shown by example in Figure 5.9.
The spectra in Figure 5.9 show a constant offset across the entire
spectral range. The method of baseline correction aims to subtract
the common offset from the data such that they better overlay each
other. This type of offset is known as an additive effect and is
primarily caused by purely pathlength differences imposed by the
measurement system, or due to sample density differences.
Figure 5.8: Area and peak normalised mid-IR spectra of oils.

The baseline offset corrected spectra for the oil samples are provided in Figure 5.10.
After baseline offset correction, the physical effects of sampling
have been minimised, revealing the true uniformity of the sample
spectra. In applications such as library searching and raw material
identification, such baseline effects and their correction are important
for reliable classification. More on classification methods applied to
oil data is presented in chapter 10.
Figure 5.9: Mid infrared spectra of oil samples showing a common offset effect.

In some cases, the baseline offset is not described by a common offset factor, but may also be affected by linear or non-linear baseline
shifts. In some cases, both common offset and sloping baselines can
occur in the same data. The simplest correction method when the
slope change of the data is constant and linear is a two-point
baseline correction method. In this context it is assumed that the data
can be expressed as the simple function,

$$X_{\mathrm{uncorr}} = X_{\mathrm{true}} + \alpha + \beta x$$

where Xuncorr is the uncorrected spectra, Xtrue is the true signal free of baseline effects and α + βx are to be determined to correct the spectra such that,

$$X_{\mathrm{corr}} = X_{\mathrm{uncorr}} - (\alpha + \beta x) = X_{\mathrm{true}}$$

Figure 5.10: Baseline offset correction of oil spectra.
This type of situation is common in chromatography and some


forms of spectroscopy. A simulated example of a Gaussian
chromatographic peak with both baseline offset and linear sloping
baseline is shown in Figure 5.11.
Figure 5.12 shows the same data as Figure 5.11 now corrected for
both slope and offset effects.
When the baseline effect becomes more complex than the linear
situation, the method of detrending (refer to section 5.3.7) can be
used to correct polynomial baseline shifts.
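A minimal sketch of a two-point linear baseline correction in Python (numpy assumed; the two anchor indices chosen in the usage example are hypothetical):

import numpy as np

def two_point_baseline_correct(spectrum, i_left, i_right):
    """Fit a straight line through two baseline points and subtract it."""
    x = np.asarray(spectrum, dtype=float)
    slope = (x[i_right] - x[i_left]) / (i_right - i_left)   # beta
    intercept = x[i_left] - slope * i_left                   # alpha
    baseline = intercept + slope * np.arange(len(x))         # alpha + beta*x
    return x - baseline

# usage sketch: pick two points known to lie on the baseline, for example
# at the start and end of the region of interest
# corrected = two_point_baseline_correct(spectrum, 0, len(spectrum) - 1)

The two chosen points should lie on the true baseline; the fitted straight line then plays the role of the α + βx term above.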
Figure 5.11: Gaussian chromatographic peak showing both baseline offset and linear sloping
baseline.
Figure 5.12: Gaussian chromatographic peak with baseline offset and linear sloping baseline
removed.

5.3.5 Derivatives

By definition, derivatisation (also known as differentiation) is, generally speaking, calculated as the difference between successive points in the spectrum (the second point minus the first, divided by a constant factor), repeated across the spectrum.
The first observation to be made about the derivative is that it centres
the corrected spectrum around the zero line. It also measures the
slope of the spectral features (since by definition, derivatives measure
the rate of change of data).
Using a simple quadratic equation as an example, the concept of
derivatives is shown and extended to the general case of
spectroscopic data. Consider the following quadratic equation,

y = ax² + bx + c

The formal calculation of the first derivative results in

dy/dx = 2ax + b

The constant term c in the original polynomial is the intercept (offset) term of the equation and disappears under first derivation.
This is how the derivative works in the case of spectroscopic data.
When there are only additive effects dominating the data, the first
derivative can be used to remove these effects. This is shown in
Figure 5.13 for the first derivatised Gaussian curves with baseline and
slope effect shown in Figure 5.11.

Figure 5.13: First derivative spectra of the baseline and linearly sloping Gaussian
chromatographic data.

The first derivatised data show how the baseline and slope effects are simultaneously removed. The peak maximum in the original data now appears as a zero crossing, because the slope at the peak maximum is zero. This is one reason why first derivatives are not the
preferred method for the analysis of spectroscopic data as zero
points are often difficult to interpret when the spectra become
complex. In these cases, the second derivative is preferred as shall
be explained in the following. Returning to the quadratic equation
introduced earlier in this section, the second derivative is provided as
follows,

d²y/dx² = 2a

The second derivative defines the curvature in the data and is often used to correct for the quadratic baseline effects
encountered in NIRS (refer to Swarbrick [6]). The major benefit of
using the second derivative is that the zero points in the first
derivative now become the peak minima in the second derivative.
Figure 5.14 shows the second derivative of the Gaussian
chromatographic peak data.

Figure 5.14: Second derivative spectra of the baseline and linearly sloping Gaussian
chromatographic data.

Changing the derivatised data to a peak minimum usually results in much better interpretability of such data.

Comparison of derivatives

Using a simple Gaussian curve as an example, Figure 5.15 shows how derivatives work by comparing the raw, first and second derivative data together on the same plot.

Derivative orders

The simple difference derivative works by subtracting the nth point from the (n + 1)th point and moves this recursively across the entire
spectrum until the derivative spectrum has been generated. If the raw
data are perfectly continuous, this will result in a smooth derivative
spectrum. However, in practice, most data are inherently noisy, due
to a variety of reasons, e.g. instrument electronics, mechanical
vibrations... In this case, taking the simple derivative of a noisy
spectrum will lead to a highly irregular result (in fact the spectrum stays noisy). This is typical for spectra from instruments such as Fourier transform NIR (FT-NIR) spectrometers. Figure 5.16 shows the simple difference first
derivative of an FT-NIR spectrum.
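A minimal sketch of the simple difference derivative in Python (numpy assumed; the variable names are hypothetical):

import numpy as np

def simple_difference_derivative(spectrum, spacing=1.0):
    """First derivative as successive point differences (no smoothing)."""
    x = np.asarray(spectrum, dtype=float)
    # np.diff returns x[n+1] - x[n]; dividing by the point spacing gives
    # the slope estimate. Note the result is one point shorter than x.
    return np.diff(x) / spacing

def second_difference_derivative(spectrum, spacing=1.0):
    """Second derivative as the difference of first differences."""
    x = np.asarray(spectrum, dtype=float)
    return np.diff(x, n=2) / spacing**2

Because each call differences neighbouring points directly, any measurement noise propagates straight into the result, which is exactly the behaviour discussed above for the FT-NIR example.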
Figure 5.15: Comparison of derivatives and raw data.

The reason for the presence of such noise can be explained from
a mathematical point of view. The original data were collected on an
absorbance scale between 0 and 1. Data point resolution determines
the vertical distance between two successive absorbance values. For
the spectrum in Figure 5.16, the difference between two absorbance
values for a major peak is in the order of 0.007 absorbance units. This
is dangerously close to the signal to noise limit of the instrument for
which reason derivatisation can in fact amplify noise. The noise is
further amplified for the second derivative and is shown for
comparison for the FT-NIR data in Figure 5.17.
Figure 5.16: Simple first difference FT-NIR spectrum showing how noise can be amplified in
the derivative calculation.

Note the scale differences between the raw data, the first
derivative and the second derivative spectra. In each case, there is an
approximate two order of magnitude decrease in the scale and for
the second derivative, this is now bordering on the signal to noise
capabilities of the instrument. For a detailed discussion of signal to
noise ratios in analytical instrumentation, the interested reader is
referred to the book by Adams [7].
To overcome such issues with noisy derivatives, two commonly
used derivative algorithms are
1) The segment-gap derivative and
2) The Savitzky–Golay derivative

The segment-gap derivative

The segment-gap derivative enables the computation of derivatives using an algorithm that allows selection of a gap parameter and a
smoothing parameter. The principles of the segment-gap derivative
are based on a modification of the moving average algorithm where a
suitable smoothing window is used to calculate the average point in
the centre of the window. By setting a smoothing window, the effects
of noise are reduced. This will be shown in more detail when
Savitzky–Golay derivatives are presented.

Figure 5.17: Simple second difference FT-NIR spectrum showing how noise can be amplified
in the derivative calculation.

To allow flexibility of the derivative calculation, a gap size can be set such that different separations between the windows are possible; however, the most common value for the gap size is 1. For such functions, Norris [8] suggested that derivative curves with less noise could be obtained by taking the difference of two averages, formed by points surrounding the selected wavelength locations.
Norris introduced the term segment to indicate the length of the
wavelength interval over which absorbance values are averaged, to
obtain the two values that are subtracted to form the estimated
derivative. If too large a segment is defined, the resolution of the
peaks will decrease. Too narrow a segment (smaller than the half-
band width of the peak) may again introduce noise in the derivative
data.
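As an illustrative sketch only (conventions for "segment" and "gap" differ between implementations, and the names below are hypothetical), a gap-segment style first derivative could be written in Python as:

import numpy as np

def gap_segment_derivative(spectrum, segment=5, gap=3):
    """Illustrative gap-segment first derivative: the difference between the
    averages of two segments placed either side of each point, separated by
    a gap. This is one plausible convention, not a reference implementation."""
    x = np.asarray(spectrum, dtype=float)
    n = len(x)
    deriv = np.full(n, np.nan)            # edge points are left undefined
    for i in range(segment + gap, n - segment - gap):
        left = x[i - gap - segment:i - gap].mean()
        right = x[i + gap + 1:i + gap + 1 + segment].mean()
        # optionally divide by the distance between the segment centres
        # to obtain a slope in absorbance units per data point
        deriv[i] = right - left
    return deriv

Averaging over each segment before differencing is what suppresses the noise, at the cost of some loss of peak resolution if the segment is made too wide.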

The Savitzky–Golay derivative


The Savitzky–Golay algorithm (Savitzky and Golay [9]) was introduced in section 5.3.2. It was extended to compute derivatives and has found widespread usage in analytical chemistry. The algorithm is based on performing a least squares fit of a polynomial around each point in the spectrum to smooth the data. The derivative is then computed from differences of the smoothed values at each point. The algorithm includes a smoothing function that determines
how many adjacent variables will be used to estimate the polynomial
approximation of the curve segment. Figure 5.18 shows how the
algorithm works.
Following the procedure of Figure 5.18, a polynomial of defined
order (typically 2) is fitted to the data in the smoothing window. The
window size must again be an odd number as the smoothed point in
the window lies at the centre and is defined as A. The window is then
moved along by a one-point increment and a polynomial is fit to the
data spanning the smoothing window. This becomes point B. A point
C can be calculated in a similar manner. The first and second
derivatives are calculated as per equations 5.7 and 5.8.
Figure 5.18: Diagrammatic representation of the Savitzky–Golay derivative.

In equations 5.7 and 5.8, A, B and C are smoothed absorbances and replace the single absorbance values used in the simple
difference calculations. This is where the effect of the polynomial
comes into play as it provides a smoothed point that will assuredly be
a less noisy datum than the original; therefore, the final derivative is
smooth with minimal noise characteristics.
The most important question that the data analyst has to answer
is: “what is an optimal size of the smoothing window?” Although
many practitioners take an empirical approach to setting the window
size through trial and error, there is actually a scientific approach. The
type of spectroscopy used will typically determine the range of full
width at half maximum (FWHM) of the set of peaks expected in the
spectrum. As long as the derivative window is less than half the
FWHM, the possibility of distorting the information in the original
spectrum, or perhaps even removing it, is reduced. There is no
reason why a single window has to be used across the entire
spectrum either. If some parts of a spectrum contain meaningful
information but are particularly noisy, while other parts also contain
information but are less noisy, then trying to compensate for noise in
one part of the spectrum may be detrimental to the less noisy part. It
is the skill and experience of the spectroscopic analyst that dictates
how to use smoothing windows.
Another important fact to keep in mind when deciding on a
smoothing window is the spectral resolution of the instrument (or
digitalisation data point resolution, but in most cases, these are two
physically completely different things). If the digitalisation data point
resolution is small, there are more points in the spectrum and
therefore the possibility exists to increase the smoothing window size
without loss of spectral features. This difference regarding resolution
is important: digitalisation data point resolution is not in any way the
true physically effective resolution (N.B. FT instruments excluded). As
an example of the effect of different smoothing windows, the first
derivative of the FT-NIR data shown in Figure 5.16 is shown in Figure
5.19 for a number of smoothing points.
Figure 5.19: Application of the Savitzky–Golay derivative to FT-NIR spectra using different
smoothing windows.

The spectra in Figure 5.19 show the trade-off between noise reduction and spectral feature retention that is faced by all data
analysts. As always, it is preferential to have a specific reason to
choose a particular transformation. Smoothing, in all its variants, is
really not to be understood as a trial-and-error optional supermarket.
It should be based on domain expertise and user experience with
similar data.
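To illustrate the effect of different smoothing windows, a short Python sketch using scipy's Savitzky–Golay implementation (savgol_filter); the spectrum array and the window sizes shown are hypothetical:

import numpy as np
from scipy.signal import savgol_filter

def sg_first_derivative(spectrum, window=11, polyorder=2, delta=1.0):
    """Savitzky-Golay smoothed first derivative.
    window must be an odd number of points; delta is the data point spacing."""
    return savgol_filter(spectrum, window_length=window,
                         polyorder=polyorder, deriv=1, delta=delta)

# compare several smoothing windows, as in Figure 5.19
# for w in (5, 11, 21, 41):
#     d1 = sg_first_derivative(spectrum, window=w)

Re-running the same call with increasing window sizes is a quick way of visualising where noise reduction starts to erode genuine spectral features.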

Caveat: To the degree that smoothing has not resulted in aggravated noise characteristics, but has in fact benefitted the spectral signal in
accordance with the above criteria and cautions, the derivatisation
orders-of-magnitude loss in signal strength can be effectively
counterbalanced by applying auto-scaling in the data
analysis/modelling. Strictly speaking scaling (alone) is the operative
process here, but full auto-scaling i.e. including the centring, will not
have any negative effects. By always using auto-scaling until
considerable personal experience has been accumulated, the
beginning data analyst will save her/himself a lot of grief (and
mistakes).

Caution regarding the use of derivatives


The following provides a simple checklist that should be taken into account before applying a derivative to spectroscopic data (Swarbrick [6]).
1) Is the baseline shift linear or non-linear? This aids in the
determination of the derivative order to use, i.e. if the baseline
appears to be quadratic, then a second derivative is a more
appropriate option.
2) The larger the smoothing window, the more noise removed.
However, if too large a window is used, important spectral
information may be reduced or even lost through “averaging out”.
3) Derivatives enhance noise! Since the process of derivatisation is
successive differences between points, depending on the data
point resolution of the spectrum, the first derivative scale can be 1–
2 orders of magnitude lower in absorbance scale than the original
scale of the data. The second derivative has an even smaller scale
(sometimes close to the noise level of the instrument); in all cases where smoothing has achieved its beneficial objective, subsequent auto-scaling will compensate effectively.
4) When applying a smoothing window, by definition, the value of the
window size is odd such that the centre point in the window is the
one that is smoothed. This means that the starting and ending
points of the spectrum cannot be calculated for the window size
defined. If this is not taken into account, the possibility of losing
information at the edges of the spectra may result in poor models.
Compensation for this is, of course, built into many of the
algorithms employed by responsible software vendors.
Overall, derivatives are undoubtedly the most used preprocessing
technique for correction of vibration spectroscopic data. If the above
rules are followed, the preprocessing can be optimised efficiently and
can be applied to new data in identical fashion before predictions are
made using chemometric models.

5.3.6 Correcting multiplicative effects in spectra

Unlike additive effects, multiplicative effects are typically wavelength specific and therefore affect the spectrum in different ways,
depending on which part of the spectrum is being treated. Scatter
effects are primarily associated with NIRS, where two main
mechanisms of light reflectance occur.
1) Specular Reflectance: This is reflected light from the sample that
has little to no information in it due to only little interaction with the
incident radiation.
2) Diffuse Reflectance: This is reflected light that is scattered at large
angles to the incident light and contains a wealth of chemical and
physical information due to the radiation highly interacting with the
sample material.
Specular reflectance occurs in the short wave region of the NIR
spectrum between 720 nm and 1100 nm. This is the region where
transmission measurements are traditionally performed, and these are also influenced by scatter effects.
Diffuse reflectance occurs in the longer wavelength region of the NIR spectrum between 1100 nm and 2500 nm as a result of elastic collisions between the radiation and the sample material. In particular, for solid samples, the particle size (and particle size distribution) is often roughly of the same order of magnitude as the wavelength of the radiation as the spectrum
approaches the 2500 nm region. This typically results in non-linear
baseline effects in the spectrum that require correction in order to
enhance the chemical information in the data. Two common scatter
correction methods used for a variety of spectroscopic applications
are standard normal variate (SNV) and the multiplicative scatter
correction (MSC) algorithms. These will be discussed in detail.
Standard normal variate (SNV)

The method of standard normal variate (SNV) was introduced by Barnes et al. [10] as a model-free method of correcting spectral data
for scatter effects. By model-free, it is meant that the SNV algorithm
only uses the data of the empirical spectrum to correct for scatter,
rather than relying on an external model, as is the case with MSC
(described below in section 5.3.6). The SNV algorithm is very similar
to auto-scaling discussed in section 5.2, but it does not find a
specific correction for each individual wavelength. Instead it
calculates a global mean and standard deviation for the entire
spectral range selected. Mathematically, this is described by equation
5.9,

X*i,j = (Xi,j − x̄i) / si     (5.9)

where X*i,j is the SNV preprocessed spectrum for object i at wavelength j, Xi,j is the absorbance value for object i at wavelength j, x̄i is the mean absorbance value calculated over the entire wavelength region selected for correction for object i and si is the standard deviation of the absorbance values for object i over the entire wavelength region selected.
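A minimal SNV sketch in Python (numpy assumed; the 2-D array X of spectra stored row-wise is hypothetical):

import numpy as np

def snv(X):
    """Standard normal variate: centre and scale each spectrum (row)
    by its own mean and standard deviation."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)         # mean of each spectrum
    row_std = X.std(axis=1, ddof=1, keepdims=True)   # std of each spectrum
    return (X - row_mean) / row_std

Note that the mean and standard deviation are taken along the wavelength axis of each spectrum, not across samples, which is the essential difference from auto-scaling.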
The overall effect of SNV is to centre the data at the global
average level of the spectrum and to scale the data to unit variance
(therefore the name standard normal variate). This is a highly versatile
approach and is probably the most used of the scatter correction
methods due to its simplicity. Subtracting each spectrum's own mean value from every wavelength acts to remove additive effects, while dividing by the standard deviation acts to reduce any residual (multiplicative) scatter effects in the data. Figure 5.20 shows the effect of applying the SNV correction to NIR data collected in the 1100–2500 nm region.
Figure 5.20: Application of SNV to NIR spectra collected in the 1100–2500 nm region of the
spectrum.

The data in Figure 5.20 were collected on pharmaceutical samples to analyse moisture content. The effects of the scatter and baseline offset in the original data mask the true underlying chemistry. When
the SNV preprocessing is applied, the baseline offset is minimised
and the chemistry at 1930 nm (the 5263 cm–1 band in spectroscopic wavenumbers), related to moisture, is enhanced very effectively. As will be seen when multivariate regression is introduced in chapter 7, if the baseline and scatter effects are left in the data, they will typically constitute the largest sources of variation in the data. By preprocessing using SNV in this case, these effects were minimised and it is more than likely that the first component/factor of the multivariate model will be based on chemistry and not on residual physical effects.

Multiplicative scatter correction (MSC) and related techniques

Multiplicative scatter correction (MSC) is a transformation method that also can be used to compensate for both multiplicative and
additive effects. MSC was originally proposed by Martens et al [11]
and was designed to deal specifically with light scattering effects.
However, a number of analogous effects can also be successfully
treated with MSC including:
Path length variations
Offset shifts
Interference
The idea behind MSC is that the two undesired general effects,
amplification (multiplicative) and offset (additive), should be removed
from the raw spectral signals to prevent them from dominating over
the chemical signals or other similar signals, which often are of lesser
magnitude. This, in general, will enable more precise and accurate
modelling, based on the corrected spectra. MSC is known as a
model-based scatter correction technique as it requires a training set
to define what a typical set of “representative samples” looks like
(N.B. representative with respect to the scatter effects that is). MSC
performs a least squares fitting process that calculates the mean
spectrum of the set and subsequently calculates parameters that
make new spectra look as close as possible to the mean spectrum.
The general model for MSC is provided in equation 5.10,

Xi,j = ai + bi x̄j + δi,j     (5.10)

where Xi,j is the original spectrum for object i collected over wavelengths j, ai is the offset parameter calculated by least squares for sample i, bi is the slope parameter calculated by least squares for sample i, x̄j is the mean (reference) absorbance value of the training set at wavelength j and δi,j is the residual after scatter correction is performed, which is assumed to contain the chemical information.
Unlike regular least squares, where the error term δi is to be
minimised, in MSC, the intent is to remove the modelled part of the
data (sic.) such that δi is maximised in terms of chemical information.
In order to apply MSC to a new data set, the mean spectrum of the
training set must be available to regress the new spectrum against in
order to calculate the parameters ai and bi.
When a new spectrum is collected, it is corrected according to
one of the processes described in equations 5.11 to 5.13.
Common offset model:

Xcorr,i,j = Xi,j − ai     (5.11)

Common amplification model:

Xcorr,i,j = Xi,j / bi     (5.12)

Full MSC:

Xcorr,i,j = (Xi,j − ai) / bi     (5.13)
In practice it is up to the data analyst to select the range of X-variables upon which the MSC is to be "trained". One should
preferably select a part of the spectrum that contains no clear
specific chemical information, or the least presumably relevant
chemical information. It is operatively critical that this “MSC-basis”
should only comprise background wavelengths, in so far as this is
possible. If it is not known where this information is, one would
typically try to use the whole range of p X-variables, but, as will be
shown in section 5.4.2, when used indiscriminately MSC can actually
be downright detrimental to a data set. When a larger range of
wavelengths is included in the correction, this also implies a risk of
including noise in the correction or even worse, the accidental
inclusion of some of the “chemical specific” wavelengths in this
correction, resulting in the systematic removal of chemical
information. Software packages like The Unscrambler® are among the few programs that have implemented the MSC algorithm correctly using intelligent functions, e.g. omitting declared test set samples
from the base calculation. Be wary of cheap-and-nasty instrument
vendor chemometrics programs that may have indiscriminately
implemented MSC… This may be the reason why some models out
there may not be working properly.
The MSC basis is calculated on this carefully selected X-range
only and the MSC coefficients (ai and bi) are calculated accordingly.
The whole data set, including the test set, will be corrected using the
training set MSC base. Finally, a calibration model is made on the
corrected spectra. Before any future prediction, new samples must of
course also be corrected using the same MSC base.
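A minimal MSC sketch in Python (numpy assumed; the array names are hypothetical). Each spectrum is regressed against the training-set mean spectrum by ordinary least squares, and the full MSC correction of equation 5.13 is then applied:

import numpy as np

def msc_fit(X_train):
    """Return the MSC reference: the mean spectrum of the training set."""
    return np.asarray(X_train, dtype=float).mean(axis=0)

def msc_correct(X, reference):
    """Full MSC: regress each spectrum on the reference, then correct."""
    X = np.asarray(X, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        # least squares fit: spectrum ~ a + b * reference
        b, a = np.polyfit(reference, spectrum, deg=1)
        corrected[i] = (spectrum - a) / b
    return corrected

# usage sketch:
# ref = msc_fit(X_train)
# X_train_msc = msc_correct(X_train, ref)
# X_test_msc  = msc_correct(X_test, ref)   # same reference for new data

Keeping the reference spectrum fixed and reusing it for all new data is exactly the point made above about the MSC base.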

Figure 5.21: Application of MSC to NIR spectra collected in the 1100–2500 nm region of the
spectrum.

Figure 5.21 shows the application of the MSC preprocessing to the FT-NIR spectra that were previously preprocessed by SNV for
comparison.
The results of Figure 5.21 indeed look very similar to the results
for SNV in Figure 5.20. This is typically the case when the spectral
data show generally similar profiles to the mean spectrum. However,
when the spectra are highly diverse in profile, the MSC transformation
can be highly detrimental and may even destroy the structure of the
data.
To compensate for such situations, Martens and Stark [12]
developed the extended multiplicative scatter correction (EMSC)
preprocessing method. The EMSC model is provided in equation
5.14,

Xi,j = ai + bi x̄j + di λj + ei λj² + δi,j     (5.14)

In this model, the first two terms are the usual MSC model. The supplementary terms di λj and ei λj² represent "channel weighting factors" that take into account linear and quadratic wavelength dependent effects (λj denoting the wavelength of channel j). The use of EMSC will be shown in section 5.4.2
where this time a series of chemically diverse spectra will be used as
an example.
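A sketch of the basic EMSC correction in Python (numpy assumed), written under the model form of equation 5.14; the coefficient names follow the illustrative notation used above:

import numpy as np

def emsc_correct(X, reference, wavelengths):
    """Basic EMSC: fit offset, reference, linear and quadratic wavelength
    terms to each spectrum, then remove everything except the reference part."""
    X = np.asarray(X, dtype=float)
    wl = np.asarray(wavelengths, dtype=float)
    wl = wl - wl.mean()            # centring the axis improves conditioning
    # design matrix columns: [1, reference spectrum, wl, wl^2]
    D = np.column_stack([np.ones_like(wl), reference, wl, wl**2])
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        coeffs, *_ = np.linalg.lstsq(D, spectrum, rcond=None)
        a, b, d, e = coeffs
        corrected[i] = (spectrum - a - d * wl - e * wl**2) / b
    return corrected

As with MSC, the reference spectrum should be computed from the training set only and reused unchanged for any new data.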
An extension to the EMSC model is known as modified EMSC
(mEMSC) first introduced by Martens et al. [13]. In this method, the
EMSC equation is extended with a term to include external
information to direct the scatter effect preprocessing. The new
information (defined by the term G in equation 5.15) is often supplied
in the form a “good spectrum”, i.e. a known constituent’s pristine
information can be used to import useful information in the
preprocessed data—or it can actually be in the form of a “bad
spectrum”, i.e. an interferent whose influence of the spectrum is to be
removed as much as possible. These features make the mEMSC
approach very versatile. The use of the mEMSC model will also be
demonstrated by example in section 5.4.2.

5.3.7 Other general preprocessing methods

The following only provides a brief description of some other, not so common preprocessing methods used for spectroscopic data. The
first is the method of detrending and the second is correlation
optimisation warping (COW).

Detrending

In their original paper on SNV, Barnes et al. [10] suggest that after
preprocessing the data with SNV, this should be followed by the
process of detrending. Detrending is a transformation which
minimises non-linear trends, thus SNV and detrend in combination
reduces multicollinearity, baseline shift and curvature. Detrending is
similar to a sloping baseline correction (refer to section 5.3.5) and
calculates a baseline function as a least squares fit of a polynomial to
the sample spectrum. These transformations are applied to individual
spectra and are distinct from other transformations which operate at
each wavelength in a given set of spectra. As the polynomial order of
the detrend increases, additional baseline effects are removed. (0-
order: offset; first-order: offset and slope; second-order: offset, slope
and curvature.)
Typically, detrending is performed by using a second-order (or higher degree) polynomial in a regression analysis, where the absorbance values are the y-variables and the independent variable, or x-variable (w), is given by the corresponding wavelengths (equation 5.16):

Xi = A + Bw + Cw² (+ Dw³ + Ew⁴) + e     (5.16)

where A, B, C (and D, E) are the regression coefficients for correcting the baseline effect. The base curve in the above relationship is given by the fitted values X*i and thus the spectral values subjected to SNV followed by detrend become (equation 5.17):

Xi(SNV followed by detrend) = Xi(SNV) − X*i     (5.17)
This calculation removes both baseline shift and curvature which may be found, e.g., in diffuse reflectance NIRS data of powders, particularly if they are densely packed. The use of this transform does not change the shape of the data, unlike the application of derivatives. Figure 5.22 shows how a second-order detrending can
remove the quadratic baseline effect inherent in the SNV
preprocessed spectra shown in Figure 5.20.
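A minimal detrend sketch in Python (numpy assumed; a 1-D spectrum and its wavelength axis are assumed as inputs):

import numpy as np

def detrend(spectrum, wavelengths, order=2):
    """Fit a polynomial baseline to one spectrum and subtract it."""
    x = np.asarray(spectrum, dtype=float)
    w = np.asarray(wavelengths, dtype=float)
    coeffs = np.polyfit(w, x, deg=order)     # least squares polynomial fit
    baseline = np.polyval(coeffs, w)         # fitted base curve X*_i
    return x - baseline

# usage sketch: SNV followed by a second-order detrend, applied row-wise
# X_corr = np.vstack([detrend(row, wavelengths, order=2) for row in X_snv])

The polynomial order maps directly onto the baseline effects listed above: 0 for offset, 1 for offset and slope, 2 for offset, slope and curvature.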
Figure 5.22: Comparison of SNV and SNV followed by detrend on NIR spectra collected in
the 1100–2500 nm region of the spectrum.

NIR spectra typically exhibit quadratic baseline effects from around 1500 nm to 2500 nm, primarily due to elastic scattering mechanisms. The detrend corrected SNV spectra of Figure 5.22 show how a second-order detrend function will also minimise the
quadratic baseline by “straightening” the spectrum around the zero
line. To determine whether the detrend adds any value to a
developed model can only be assessed by proper validation and
comparison of the alternative preprocessing options. Such empirical
comparison is a very powerful use of the validation facility, as always
ideally carried out based on a test set. However, (N.B.) for carefully
designed comparisons in which only strictly bracketed preprocessing
options vary, informed use of cross-validation may be acceptable
(segmented cross-validation, conceptual cross-validation, refer to
chapter 8 for more details).

Correlation optimisation warping (COW)

COW (Tomasi et al. [14]) is a method for aligning data where the signals
exhibit shifts in their position along the x-axis. COW can be used to
eliminate shift-related artefacts in measurement data by correcting a
sample vector to a reference. COW has applicability to data where
there can be a poor alignment of the x-axis from sample to sample,
as can be the case for example with chromatographic data, Raman
spectra and NMR spectra. One example of such data is
chromatography, where peak positions change between samples due
to changes in mobile phase or deterioration of the column. Another
example is in NMR spectroscopy, where matrix effects and the
chemistry itself induce position changes in the chemical shifts.
The method works by finding the optimal correlation between
defined segments of the data for which there is a shift in position. The
result of this procedure is one shift value per segment. These are then
interpolated to give a so-called shift-vector for all data points, and a
mapping function (move-back operator) which moves the samples
back to the reference profile’s position.
To cope with various shift lengths, it is suggested to pad a data
table with zeros before performing the shift alignment. Alignment is
done by allowing small changes in the segment length on the sample
vector, and those segment lengths being shifted (“warped”) to
optimise the correlation between the sample and the reference
vector. Slack refers to the maximum increase or decrease in sample
segment length, and provides flexibility in optimising the correlation
between the samples and reference.

Figure 5.23: Simulated misaligned chromatographic peaks corrected using correlation optimisation warping (COW).

The reference sample is the sample used as the reference for COW, and this should of course be a carefully selected "representative sample" with the main peaks present. Segment length, defined by the data analyst, is the size of the data segments that the data are divided into before searching for the optimal correlation. It must
always be smaller than the number of variables divided by 4. The
slack is the flexibility in adjusting the segment size to give the optimal
fit to the reference data, and is the allowed change in position to be
searched for. Slack must be set to be less than, or equal to, the
segment size chosen. Figure 5.23 shows a simple example of COW
applied to shifted Gaussian curves (representative of a shifted
chromatographic peak or NMR peak).
The method of COW is a highly iterative process where segment size and slack have to be set over a number of intervals in order to get the alignment correct. When the data become more complex than a Gaussian peak, this iterative process can take considerable time to get right, so the data analyst must be patient and attempt the correction a number of times, preferably over a number of sessions, in order to reflect on changes and take a fresh view of the problem each time.
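The full COW algorithm involves warping of the segment lengths (the slack) and interpolation between segments; as a heavily simplified, hypothetical illustration of the core idea only (segment-wise shift search by maximising correlation with a reference), a Python sketch might look like this:

import numpy as np

def segment_shift_align(sample, reference, segment=50, max_shift=5):
    """Toy segment-wise alignment: for each segment, find the integer shift
    (within +/- max_shift) maximising correlation with the reference, then
    build a smooth shift vector and warp the sample back. This is NOT the
    full COW algorithm (no segment length warping or slack handling)."""
    sample = np.asarray(sample, dtype=float)
    reference = np.asarray(reference, dtype=float)
    n = len(sample)
    centres, shifts = [], []
    for start in range(0, n - segment + 1, segment):
        ref_seg = reference[start:start + segment]
        best_shift, best_corr = 0, -np.inf
        for s in range(-max_shift, max_shift + 1):
            lo, hi = start + s, start + s + segment
            if lo < 0 or hi > n:
                continue
            corr = np.corrcoef(ref_seg, sample[lo:hi])[0, 1]
            if corr > best_corr:
                best_corr, best_shift = corr, s
        centres.append(start + segment // 2)
        shifts.append(best_shift)
    # interpolate one shift value per data point (the "shift vector")
    shift_vector = np.interp(np.arange(n), centres, shifts)
    # move the sample back to the reference positions
    return np.interp(np.arange(n), np.arange(n) - shift_vector, sample)

Even this toy version shows why the segment size and the allowed shift must be tuned iteratively: too small a segment makes the correlation search unstable, too large a segment cannot follow local shifts.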

5.4 Practical aspects of preprocessing


ONLY ever apply preprocessing if the method applies to the
technique being used and when there is adequate knowledge of why
(and how) the preprocessing works. NEVER LET A SOFTWARE
PACKAGE DETERMINE PREPROCESSING OPTIONS
AUTOMATICALLY.
This latter is tantamount to failure and is in fact only a reflection of
the lack of ability of the data analyst to actually learn what is needed.
Many vendors of instrumentation have a lot to answer for in this
respect (they actually devalue the role of the chemometrician by
stating that this is an easy topic and one that can be automated by
the(ir) software). A recent, horrific example shared by one of the
author’s colleagues was a software package in which was suggested
a fourth derivative, followed by a smoothing—and then application of
a second derivative. Hopefully it is clear that this is beyond utter
rubbish and if it was the case, there would be no need to write
textbooks such as this as software could do it all. This, fortunately is
not the case and informed education is continuing here… Anytime a
black box approach is taken to preprocessing, at its most idiotic, by
using all the available methods, learning has not been achieved, will
not be achieved and cannot ever be achieved. While this might be OK
for a vendor to boost instrument sales by promoting ease of use, it
usually ends up in more effort than it is worth, by orders of
magnitude.
In a practical situation, the following must be considered when
applying a preprocessing to new data.
1) Spectroscopic data ALWAYS require some degree of smoothing.
The smoothing window is determined by the data point resolution
with which the spectra were collected—and whether or not
important information will be smoothed out if the window is too
large. The smoothing may not give better models as such but gives
a better visual appearance.
2) ALWAYS look at the spectral data by plotting using line plots. It
might be beneficial to mean centre the data before plotting as
minor systematic variation will more easily be detected. This will
reveal gross outliers but will also indicate (to the trained eye):
a) If the data are affected by additive, multiplicative or both effects
simultaneously.
b) If there is indeed chemical diversity in the data (as is expected
when building a regression model).
3) Use a scatter effects plot to determine the nature of the effects to
be corrected for.
4) If derivatives are to be used, ALWAYS remember that these contain
a smoothing function, so data pre-smoothing is not required. When
considering a derivative,
a) First derivatives are used to remove purely additive effects,
however, the peak maxima in the raw data will become zero
points in the derivative and these are harder to interpret.
b) Second derivatives are used to account for quadratic baseline
effects as well as additive effects and have the advantage over
first derivatives that the peak maxima in the raw data are now
peak minima in the second derivative spectrum. Be aware
though, the second derivative will enhance noise and may
require a larger smoothing window to be applied (refer to point 1
above regarding window size).
c) A single derivative does not necessarily have to be applied to the
entire spectrum. It can be applied to only part (but ensure that
the smoothing interval does not remove important features from
the ends of the abridged spectra). In some cases, the application
of different derivatives to different parts of the same spectra will
result in better feature extraction, particularly where the noise
characteristics of the instrument change over the wavelength
scale.
5) Scatter correction techniques are only to be used when there are
clear signs of scatter present in the original data. This must, in
principle, be observable in the raw data. There are model-based
and model-free corrections available. These offer the advantage
over derivatives that they can account for multiplicative and
additive effects in the same correction process and they also retain
the original profile of the data.
6) More than one preprocessing can be applied to a data set in the case when one preprocessing method alone was not able to fully minimise an effect. Typical examples include derivative followed by scatter correction, SNV followed by detrend etc. (a small pipeline sketch is given after this list). In the case of the application of scatter correction only, smoothing must ALWAYS be applied first.
7) Scatter corrections are only applicable to the range of the spectrum
that they are applied to. It is important not to include noise or
erratic regions of the spectrum in the correction.
8) Do not overuse preprocessing, i.e. only use 2–3 methods if
absolutely required, for example, smooth followed by SNV,
followed by detrend is acceptable, but smoothing followed by
derivative followed by scatter correction followed by another
derivative is not. Experience and knowledge is king.
9) In the end auto-scaling can be applied to all successfully smoothed
data sets, although some of the corrections will end up doing the
same job—auto-scaling will in these cases simply have no, or only
a very small effect (one that is never adverse). For the novice data
analysts, always using auto-scaling in the data modelling (as in
ALWAYS) is an easy option with which to build one’s own
experience—and this approach will take the new data analyst a
very long way before having accumulated enough experience, and
confidence, with which to address the many, more challenging
pretreatment options.
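As referenced in point 6 above, a minimal sketch of a combined smoothing, SNV and detrend pipeline in Python (numpy and scipy assumed; the array names, window size and polynomial order are hypothetical choices, not recommendations):

import numpy as np
from scipy.signal import savgol_filter

def preprocess_pipeline(X, wavelengths, window=11, detrend_order=2):
    """Smooth (Savitzky-Golay), then SNV, then polynomial detrend, row-wise."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(wavelengths, dtype=float)
    # 1) Savitzky-Golay smoothing along the wavelength axis (odd window)
    X = savgol_filter(X, window_length=window, polyorder=2, axis=1)
    # 2) SNV: centre and scale each spectrum by its own mean and std
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)
    # 3) detrend: subtract a fitted polynomial baseline from each spectrum
    out = np.empty_like(X)
    for i, row in enumerate(X):
        baseline = np.polyval(np.polyfit(w, row, deg=detrend_order), w)
        out[i] = row - baseline
    return out

# the identical pipeline (same window, same order, same sequence of steps)
# must be applied to any new spectra before prediction with a trained model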
If these rules are followed, or adapted in some way to the problem at hand, the preprocessing method will be robust and simple
to implement, but most importantly, the data will in many cases be
interpretable even without the application of a model to the data, i.e.
the information should just fall out when the preprocessing is
performed correctly (although there are cases, such as measuring
protein in wheat, where this may not occur).
Remember, preprocessing bad data will not, indeed can never,
result in improved models. This is why all efforts should be put into
proper sampling and data collection in the first place. Too many times
the present authors have heard the story “we were just told to collect
data however we wanted to do it—and chemometrics will sort it all
out”. Again, this will not work, and there is no substitute for
implementing sound sampling and data collection practices. Again,
from experience, in many situations where an organisation has been
struggling to make a model, the advice given (certainly not from us)
was to “collect more samples and the calibration will come out in the
wash”. Hopefully by now, it can be realised that this is very
incompetent advice, never to be accepted under any circumstance.
Preprocessing is about the removal of small variances in the data
to reveal the sought after information in the data more clearly. Using
an analogy of a wrinkled shirt, visual inspection will reveal that the
object is a shirt, but in a very unusable form in a formal situation. The
preprocessing in this case is a hot iron to remove the wrinkles, thus
resulting in a presentation that is more acceptable for its purpose.
However, if a blowtorch was used to straighten out the wrinkles, there
will be something very wrong and the final result will be distinctly
unacceptable. There will be no princess for the data analyst at this
ball.
5.4.1 Scatter effects plot

One of the simplest and most powerful tools a data analyst has at
their disposal for determining the type of preprocessing to use on a
data set is the scatter effects plot. This plot shows each sample
plotted against the average (mean) sample for a selected set of
samples (the object base). Scatter effects display themselves as
differences in slope and/or offset between the lines in the plot.
Differences in the slope are caused by multiplicative scatter effects.
Offset error is due to additive effects. Sometimes the lines show
profiles that deviate considerably from a straight line. In such
instances, caution must be taken when applying scatter correction,
as major chemical information may be confused with systematic
scatter effects and therefore lost in the transformation. For an
excellent reference of this situation, refer to the article by Martens et
al. [13] and section 5.4.2 where the gluten–starch data set is
presented.
Applying multiplicative scatter correction may improve the model
if these types of scatter effects are observable, but as stated over
and over, caution must always be exercised when applying any
preprocessing method to a data set. Figure 5.24 provides four
example situations to look for when assessing the scatter effects plot.
The four situations shown in Figure 5.24 are described as follows:

Situation a): When the individual spectra in a set have exactly the
same profile as the mean spectrum, plotting the data against the
mean will result in straight line fits. The reason the lines do not overlay
each other is due to a purely additive effect. In this case, simple
baseline correction or derivatives should be considered.

Situation b): When the individual spectra form straight lines when
plotted against the mean, again, this indicates that the mean
spectrum captures the principal characteristics of all the samples,
however, the individual spectra have different slopes with respect to
the mean. This is an indication that a scatter effect needs to be
corrected in the data. Since the data emanate from a single point,
there is no indication of an additive effect. In this case methods such
as SNV or common amplification MSC may be useful in correcting
such data.

Situation c): This is a combination of situations a) and b) above where the effects of scatter and baseline offset are working in parallel in the
data. This type of data requires a scatter correction method with
additive correction and possibly the use of a second additive
preprocessing step to remove residual effects not accounted for by
the first method.
Figure 5.24: Typical data patterns when investigating the scatter effects plot.

Situation d): In this situation the individual spectra are in no way similar to the(ir) mean spectrum. This occurs when there is a lot of
chemical diversity embedded in the spectra, for instance, binary
mixtures of powder including the two pure components or for more
complex scenarios. The mean spectrum would then be the average of
the two (or more) pure components and therefore the spectra are not
representatively consistent profiles across the samples. This type of
data is best handled by using EMSC in cases where scatter is
involved or simple baseline correction/derivative where scatter is not
a dominant effect.
No matter what type of spectroscopic data is being considered for
preprocessing, the method used to correct the data should be
directly related to the type of effect being corrected. The blind use of
a single preprocessing method for all data is completely discouraged
in favour of the situation where subject matter expertise is the driving
factor. Figure 5.25 shows an example of situation a) where only
additive effects are affecting the spectra. This is seen in the scatter effects plot as all spectra lying parallel to each other, offset only in the vertical direction.
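A scatter effects plot can be produced with a few lines of Python (matplotlib and numpy assumed; the spectra array X, with one spectrum per row, is hypothetical):

import numpy as np
import matplotlib.pyplot as plt

def scatter_effects_plot(X):
    """Plot each spectrum against the mean spectrum of the set."""
    X = np.asarray(X, dtype=float)
    mean_spectrum = X.mean(axis=0)
    for spectrum in X:
        plt.plot(mean_spectrum, spectrum, linewidth=0.8)
    plt.xlabel("Mean spectrum")
    plt.ylabel("Individual spectra")
    plt.title("Scatter effects plot")
    plt.show()

Parallel straight lines, fanning straight lines and strongly curved traces then correspond to situations a), b) and d) described above.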

5.4.2 Detailed example: preprocessing gluten–starch mixtures

The data described here come from a publication from Martens et al.
[13] on the use of transmission NIRS for measuring binary mixtures of
gluten and starch at various concentrations. This data is also
presented in chapter 7 on regression to show how the choice of
effective preprocessing can lead to the simplest modelling situations.
Figure 5.25: Scatter plot of FT-NIR data only exhibiting additive effects.

The premise behind this example was that no matter what the
sample packing density or the pathlength used (within reason) for
measuring a binary mixture of powders, the composition of the powders should be predictable using the simplest possible model (i.e. a 1 component/factor model), provided of course that the powder
samples have been carefully prepared and that the area presented to
the analytical instrument is as “homogeneous” as possible for each
sample.
100 spectra of gluten–starch mixtures were collected ranging from
0% to 100% of each constituent in 20% increments. For each
concentration, the powders were scanned in transmission mode in a
variable pathlength cell to introduce variations in the scanning and
the powders were also either packed loosely or tightly to introduce
packing density effects. The spectra of the samples are shown in
Figure 5.26.
Inspection of the data in Figure 5.26 reveals that additive effects
are dominant, i.e. the spectra are highly offset from each other along
the absorbance scale; the other feature observed is an appreciable chemical diversity between many of the spectra. The
scatter effects plot for this data is provided in Figure 5.27.

Figure 5.26: Raw NIR transmission spectra of gluten–starch mixtures.

In Figure 5.27, the top plot shows all data plotted in the scatter
effects plot, where the distinctive offset can be seen. The bottom plot extracts three of the most diverse spectra in the set; their characteristic non-linear behaviour indicates that the sample spectra are very dissimilar from the mean. A scatter correction method like
MSC would not be able to distinguish this variability from scatter
effects and will distort the data structure. This is shown in Figure
5.28.
Figure 5.27: Scatter effects plot of gluten–starch mixtures measured by transmission NIRS.
Figure 5.28: MSC preprocessed gluten–starch data.

The data in Figure 5.28 show that MSC cannot handle the
variability associated with the 100% gluten samples. The
concentration changes between 0% and 100% are also non-linear, indicating that MSC is not an appropriate preprocessing method for
this data. The EMSC algorithm is better able to handle these
wavelength dependent scattering and diversity effects and the
application of this preprocessing method to the data is shown in
Figure 5.29.
The EMSC preprocessing is better able to handle the diversity of
the spectra and shows tight groupings of all of the concentration
ranges. A careful inspection of the absorbance centred around 840
nm shows that gluten concentration varies non-linearly with
absorbance. This was also observed when this data was modelled
using PLSR (refer to chapter 7, section 7.13.6).

Figure 5.29: EMSC preprocessed gluten–starch data.


To correct for the non-linearities, modified extended multiplicative scatter correction (mEMSC) was used in the original publication, Martens et al. [13]. The researchers found that the difference spectrum of gluten and starch could be used as a "good spectrum" in the mEMSC algorithm. This is a particularly inspired example of advanced preprocessing—and a recognition of the original authors' innovative approach to such a practical situation. The
mEMSC corrected spectra are shown in Figure 5.30.
The mEMSC preprocessing has a completely different effect on
the data compared to all other algorithms. It was able to separate the
spectra by composition in a linear manner. The application of PLSR
to this and the other data presented in this example is provided in
chapter 7, section 7.19 (and perhaps not surprisingly, this resulted in
a particularly simple multivariate calibration model).
This example shows how the correct selection of a preprocessing
method is highly important for the form of the corrected spectral data
that results, especially when it comes to the subsequent modelling.
As mentioned earlier in this chapter and typified by Figure 5.30, when
the correct preprocessing method is applied, the information should
simply “fall right out” of the data. This means that the data analyst
has “corrected the data correctly. Experience is King!
Figure 5.30: mEMSC preprocessed gluten–starch data.

5.5 Chapter summary


In this chapter, the critically important topic of preprocessing was
introduced for discrete and spectroscopic variables. The main goal of
preprocessing is to remove unwanted sources of variation due to
noise and other extraneous effects such that the true underlying
structure of the data is revealed and made readily available for
modelling. In particular, it was stated for spectroscopic data that
(based on an argument around Beer’s law) preprocessing is a method
for correcting for pathlength effects when pathlength cannot be
controlled.
In spectroscopic applications, where preprocessing finds most
usage in chemometrics, two main sources of unwanted variation arise: additive effects (sample packing and density effects) and/or multiplicative effects (those caused by light scattering phenomena).
When additive effects dominate, methods such as baseline correction
and derivatives are useful options for preprocessing the data. A
special case of non-linear baseline correction called detrending was
also introduced for more complex situations. When multiplicative
effects dominate, methods such as SNV and MSC are better suited.
A detailed discussion of the merits of multiplicative scatter
correction methods was provided where the method of MSC was
compared to two augmented methods, extended multiplicative
scatter correction (EMSC) and a version of EMSC that also allows the
inclusion of external information to guide the scatter correction to
even more powerful results; this latter process is known as modified
extended multiplicative scatter correction (mEMSC). These methods
were applied to a data set in which the key diverse chemistry signals
were embedded in quite a cloak of effects formerly screaming for
proper preprocessing. These examples highlighted the fact that
domain expertise, and as much experience as can be developed, is
required for informed preprocessing, as opposed to resignation from
the proper responsibilities of the competent data analyst.
A brief discussion was also presented on a different type of
correction that applies to the x-scale of the data known as correlation
optimisation warping (COW). This uses an algorithm that allows
samples to be aligned using x-axis warping and finds use in methods
such as chromatography and nuclear magnetic resonance (NMR)
spectroscopy.
Preprocessing should only be applied to data if the effect of the
method on the data set at hand is well-known. The use of scatter
effect plots can help in directing the choice of preprocessing method.
If a particular preprocessing method is not familiar, it only takes 10 minutes to read about it in a book or to perform a web-based search.
Indiscriminate use of preprocessing can actually destroy the data structure (rather contrary to the desired result): automated preprocessing choices provided by some analytical instrument vendor software should never be considered under any circumstances. Preprocessing is used to take the rough edges off
data and to reveal hidden, more important features—not to
completely change the data structure into something that it is not.
In a sense, what proper sampling understanding is to analysis
(eliminating unwanted, unnecessary total sampling error (TSE, see
chapter 3) in relation to chemical, physical analysis producing the
data), preprocessing is to data analysis (removing further unwanted,
unnecessary errors that will be produced at the interface between
sampling and data analysis if proper and correct preprocessing
countermeasures are not taken). This is shown diagrammatically in
Figure 5.31.

Figure 5.31: Specific removal of unnecessary errors in analysis and in data analysis.

5.6 References
[1] Montgomery, D.C., Jennings, C.L. and Kulachi, M. (2008).
Introduction to Time Series Analysis and Forecasting. John
Wiley & Sons.
[2] Montgomery, D.C. (2005). Design and Analysis of Experiments,
6th Edn. John Wiley & Sons.
[3] Chou, Y.M., Polansky, A.M. and Mason, R.L. (1998).
“Transforming non-normal data to normality in statistical
process control”, J. Qual. Technol. 30(2), 133–141.
[4] Beebe, K.R., Pell, R.J. and Seasholtz, M.R. (1998).
Chemometrics: A Practical Guide. John Wiley & Sons.
[5] Deming, S.N., Palasota, J.A. and Nocerino J.M. (1993). “The
geometry of multivariate object preprocessing”, J. Chemometr.
7, 393–425. https://doi.org/10.1002/cem.1180070506
[6] Swarbrick, B. (2016). “Near infrared (NIR) spectroscopy and its
role in scientific and engineering applications”, in Handbook of
Measurement in Science and Engineering, Vol. 3, Ed by Kutz, M.
John Wiley & Sons, pp. 2583–2608.
https://doi.org/10.1002/9781119244752.ch71
[7] Adams, M.J. (1995). Chemometrics in Analytical Spectroscopy.
The Royal Society of Chemistry.
[8] Norris, K. (2001). “Applying Norris derivatives. Understanding
and correcting the factors which affect diffuse transmittance
spectra”, NIR News 12(3), 6. https://doi.org/10.1255/nirn.613
[9] Savitzky, A. and Golay, M.J.E. (1964). “Smoothing and
differentiation of data by simplified least squares procedures”,
Anal. Chem. 36, 1627–1639.
https://doi.org/10.1021/ac60214a047
[10] Barnes, R.J., Dhanoa, M.S. and Lister, S.J. (1989). “Standard
normal variate transformation and de-trending of near-infrared
diffuse reflectance spectra”, Appl. Spectroc. 43, 772–777.
https://doi.org/10.1366/0003702894202201
[11] Martens, H., Jensen, S.Å. and Geladi, P. (1983). “Multivariate
linearity transformation for near-infrared reflectance
spectrometry”, in Proc. Nordic. Symp. Appl. Stat. Stockland
Forlag Publ., Stavanger, Norway, pp. 205–234.
[12] Martens, H. and Stark, E. (1991). “Extended multiplicative signal
correction and spectral interference subtraction: new
preprocessing methods for near infrared spectroscopy”, J.
Pharm. Biomed. Anal. 9, 625–635.
https://doi.org/10.1016/0731-7085(91)80188-F
[13] Martens, H., Nielsen, J.P. and Engelsen, S.B. (2003). “Light
scattering and light absorbance separated by extended
multiplicative signal correction. Application to near-infrared
transmission analysis of powder mixtures”, Anal. Chem. 75,
394–404. https://doi.org/10.1021/ac020194w
[14] Tomasi, G., van den Berg, F. and Andersson, C. (2004).
“Correlation optimized warping and dynamic time warping as
preprocessing methods for chromatographic data”, J.
Chemometr. 18, 231–241. https://doi.org/10.1002/cem.859
6. Principal Component Analysis
(PCA)—in practice

This chapter follows on from chapter 4 to reinforce the concepts of Principal Component Analysis (PCA) in a practical manner. The
purpose of learning PCA is to apply it to the real-world problems
faced on a daily basis. It is intended to be a step-by-step common
approach to developing a PCA model, however, prescriptive learning
is always discouraged as one size does not fit all. The steps provided
are those that the authors have found useful to solve the majority of
problems faced over the years. It is the responsibility of the data
analyst to find those techniques and approaches that work best and
adapt their own approaches to solve context specific problems.

6.1 The PCA overview


The flowchart in Figure 6.1 provides a generic overview of the
common approach to developing a PCA model.
The flowchart of Figure 6.1 can take one of three possible paths,
with all paths starting by the selection of representative data
(chapters 3, 8, 9).

Path 1: One pass model


1) Preprocess the data based on past experience/subject matter
expertise (chapter 5).
2) Visualise the data using preliminary plots to understand the data
structure better (if possible).
3) Define a validation strategy (test set preferably). This is somewhat optional for PCA modelling, but absolutely necessary for regression modelling.
4) Generate a PCA model and interpret it. Ensure that the number of
PCs for interpretation is a fair reflection of the complexity of the
system being studied.
5) Ensure that the model can be suitably validated.
6) Accept model and use it for future investigative or process control
applications.

Figure 6.1: A general procedure for developing a PCA model.

Path 2: Re-Optimisation Path


1) Same as Path 1 steps 1–4.
2) If it is suspected that the current preprocessing is not optimal (typically shown by a PC1 loadings profile that resembles the mean of the original data set, but also indicated by other aspects of the loadings plot), step 1) of path 1 is revisited and a new model is generated.
3) Inspect and interpret the new model. If it is acceptable and it can
be validated, then use the model for further applications. Ensure
that the model is not overfitted in any way.

Path 3: No Model Possible


1) In the case that path 2 was taken and re-optimisation of the
preprocessing did not result in a valid model, the use of variable
reduction can be tested.
2) If the reduction of variables still does not result in a valid model (as
evidenced in the explained/residual variance plots), then it is most
likely that the current data set has an idiosyncratic data structure or
is “too noisy” for evaluation—or the samples simply do not contain
any systematic variation that can be modelled.
3) Revisit the sampling strategy in terms of:
a) Span of the sample set with respect to variation, i.e. do not take
100 sub-samples from the same region and expect variations.
b) Re-evaluate the measurement system for its calibration state and
quality of measurements produced.
c) Verify that the measured variables indeed contain information
that can be modelled. If this is not the case, find a set of
measurements that can meet the current analysis objectives.
d) There may also be severe TOS-sampling issues at the beginning
of the “lot-to-analysis” pathway (chapter 3, 8, 9).
Unfortunately, every data modelling situation is different and the
above procedures should be used as a guide for adaptation to the
system currently being analysed. Experience cannot be taught; it
must be gained through practice and learning from mistakes.
Multivariate analysis is not a prescriptive method that one just follows
to achieve a result. Although this can be implemented as a Standard
Operating Procedure (SOP) that can be prescriptive for a particular,
well-defined situation, in order to make it prescriptive, the procedure
must be validated according to path 1 above.
6.2 PCA—Step by Step
Using the approach outlined in section 6.1, a first, general description
of how to perform PCA is presented in a more detailed manner.
Enough should have been gained from the previous chapters for
gathering a broad overview of the most important issues involved in
robust model building.

Step 1: Problem Formulation


What is the purpose and end goal of this particular data analysis?
This is the most pertinent question to be asked. It is the data analyst’s
responsibility to ensure that the data set contains enough relevant
information to apply PCA to solve the problem! This translates into a
question not often considered carefully enough: which variables and
which objects—and naturally this is the most problem-specific issue
of all. This is where correct sampling and measurement system
analysis is required.
On the other hand, there is a major advantage: it does not matter
if the data set contains any amount of additional information. The
multivariate analysis will find this easily enough and may lead to
surprises not originally expected when first considering the analysis.
Occasionally, it turns out that important information was to be found
in this additional realm, even though at the outset this was not
thought to be the case. Time spent in analysing particular data
analytical objectives is generally very well spent.
However, there is also a possible downside: even the best data
analytical method in the world cannot compensate for a lack of
information, i.e. a bad or ill-informed choice of variables/objects. The
famous saying “GIGO” applies also to PCA: Garbage In–Garbage
Out.

Step 2: Plotting Raw Data: Getting a “Feel” for the Particular Data Set
There is a saying in multivariate analysis, “Visualise before you
analyse”. When dealing with discrete data sets, use bar plots or line
plots to understand the data structure from an initial univariate
perspective. Histograms of single variables can provide valuable
information on data distribution to assess whether variables
potentially contain information, but of course this process becomes
tedious when the number of variables is even only moderately large.
The use of Descriptive Statistics (chapter 2) may be a better option in
these cases.
For spectroscopic or chromatographic data, the use of line plots
is recommended as the general profile of the data can be assessed
and any gross outliers can very easily be detected (and justifiably
removed) before performing multivariate analysis.
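By way of illustration only, the following short Python sketch (assuming NumPy and matplotlib are available; the data matrix X and all names are hypothetical placeholders, with objects as rows) shows the kind of raw-data plots referred to above.

import numpy as np
import matplotlib.pyplot as plt

# X: (n_objects, n_variables) data matrix, e.g. spectra (placeholder data only)
rng = np.random.default_rng(0)
X = rng.normal(size=(27, 101)).cumsum(axis=1)   # stands in for real spectra

# Line plot: one line per object, useful for spectroscopic/chromatographic data
plt.figure()
plt.plot(X.T)
plt.xlabel("variable index (e.g. wavelength)")
plt.ylabel("measured value")
plt.title("Raw data, line plot")

# Histogram of a single variable, to assess its distribution
plt.figure()
plt.hist(X[:, 0], bins=10)
plt.xlabel("variable 1")
plt.ylabel("count")
plt.title("Raw data, histogram of one variable")
plt.show()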

Step 3: The First Runs


Always start with an initial, open-minded PCA of the total data set.
This will allow familiarisation with the overall data structure and is very
useful for screening purposes. Always centre the data, X. Various
preprocessing schemes can always be evaluated (although in general
one should have a qualified opinion about which scaling/preprocessing
method would be appropriate). Review the chapter on preprocessing (chapter 5) for
more details. More will be revealed throughout this book, but the
range of all preprocessing alternatives is a great challenge not
necessarily mastered quickly—auto-scaling is definitely suggested as
a first strategy until more advanced alternatives are mastered.
In the first runs the computation of “too many” PCs is necessary,
to be sure that there are more than enough to cover the essential
data structure(s). There is a real risk of missing the slightly subtler
data structures, for want of a few more components in the initial runs.
Until the data set is internally consistent (free from all significant
outlying objects and/or variables etc.), there is no point in determining
the optimum number of PCs, as this number may change depending
on what is changed in the data set next.
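A minimal Python/NumPy sketch of such a first run (the data matrix X and all names are hypothetical), using auto-scaling followed by an SVD-based PCA with deliberately "too many" components, is given below.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))          # hypothetical data: 30 objects, 10 variables

# Auto-scaling: centre each variable and divide by its standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition; ask for "too many" PCs initially
A_max = 8
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
T = (U * s)[:, :A_max]                 # scores
P = Vt[:A_max].T                       # loadings

# Explained variance per component (as % of total) and cumulatively
expl_var = 100 * s[:A_max] ** 2 / np.sum(s ** 2)
print(np.round(expl_var, 1))
print(np.round(np.cumsum(expl_var), 1))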

Step 4: Exploration and Initial Interpretations


The first few score plots are investigated to determine the presence
of major outliers, groups, clusters and trends etc. If the objects are
collected in two separate clusters, as an example, an interpretation
must be performed to determine which phenomena separate them
and decide whether the clusters should be modelled separately, or
whether the original objective of analysing the total data set still
stands.
Be especially aware if the score plots show suspect outliers, as
they will also affect the loadings, usually in severe fashions. In this
case do not use or interpret the loadings to detect outlying variables
at this stage of the data analysis. Although, in general, one should
have good reasons for removing anything from the data set, on the
other hand, too much caution can drag down the analysis. Several
“initial runs” may have to be performed for the successive exclusion
of outliers before the data set can be satisfactorily characterised. The
prudent path is to exclude outliers only one (or a few) at a time,
“peeling away the layers of an onion” successively, until a stable,
“homogenous core” of data has been arrived at. One excludes all
outlying objects before embarking upon the subtler pruning away of
information-lacking variables; this order of outlier exclusion (objects
before variables) is extremely important. The most common error that
inexperienced data analysts often make is to leave “too much” as it
is; in other words, one does not take sufficient personal responsibility
with respect to deleting outlying objects or variables, dividing into
sub-groups etc.
At some point, the final data set will have been arrived at.

Step 5: The Later Runs


At this stage, the data set may now be slightly to somewhat reduced,
and perhaps divided into subsets, depending on whether there are
clusters that have to be modelled separately etc. Make sure that
all actions performed to arrive at the present data set are recorded:
which objects and variables were removed and why;
the best type of preprocessing etc. The Unscrambler® keeps a
meticulous log of everything that has been applied to a given data
set.
The later runs also consist of PCA calculation but now on the final
data set of course.

Step 6: Inspection of Variance Plot to Determine the Optimum Number of PCs

There is no point in evaluating the optimum number of PCs before the
data set is in satisfactory final form. The determination of the optimum
number of principal components to use, Aopt, must now be performed
using the total residual variance plot or, alternatively, the total
explained variance plot. Always relate to external knowledge pertaining to
the specific data set when setting the Aopt to use for the interpretation
of the data set.
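As a sketch of this step in Python/NumPy (hypothetical data and names; only calibration variances are shown here, whereas a proper Aopt decision should rest on validated variances, see chapter 8), the total explained and residual variance curves can be computed as follows.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))                        # hypothetical final data set
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # same preprocessing as the model

U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
total_var = np.sum(Xs ** 2)

# Cumulative explained variance and total residual variance after A components
for A in range(1, 8):
    explained = np.sum(s[:A] ** 2) / total_var * 100
    residual = 100 - explained
    print(f"A = {A}: explained {explained:5.1f}%, residual {residual:5.1f}%")
# Aopt is typically chosen where the residual variance curve levels off.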

Step 7: Inspection of Scores and Loadings—Final Interpretations


Now it is time to interpret the PCA model by looking at the
complementary scores and loading plots, for exactly Aopt PCs
validated for use. This can be considered to be the heart of the matter
in PCA. Sometimes a re-evaluation of the original data analysis
objective may have arisen along the way. This is defined by path 2 in
section 6.1, where preprocessing re-optimisation or variable selection
may be performed.

Step 8: Analysis of the Error Matrix E


It is important to fully investigate for any systematic structure in the
residuals after the “successful” Aopt model has been calculated,
evaluated and interpreted. The object residual variances and the
variable residual variances should all now show no systematic
variations that could potentially be modelled. If there are problems
here, e.g. if more outliers are found, simply go back to the modelling
phase, repeating steps 2–8 (path 2 in section 6.1). If the modelling
process has been performed correctly, there should be no surprises
found in the E matrix.
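A short Python/NumPy sketch of this check (hypothetical data; the scores T and loadings P are computed for the chosen Aopt) is given below: the residual matrix E and the object- and variable-wise residual variances are inspected directly.

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 10))
Xc = X - X.mean(axis=0)                       # centred (use the model's preprocessing)

A_opt = 3
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A_opt]                        # scores for Aopt components
P = Vt[:A_opt].T                              # loadings for Aopt components

E = Xc - T @ P.T                              # residual matrix after Aopt components

obj_res_var = np.mean(E ** 2, axis=1)         # residual variance per object
var_res_var = np.mean(E ** 2, axis=0)         # residual variance per variable
print(obj_res_var)
print(var_res_var)
# Neither of these should show strong systematic structure for a satisfactory model.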

6.3 Interpretation of PCA models


One should always try to compare the model based on Aopt PCs with
the best estimate available regarding the “expected” dimensionality.
In section 6.4. an example of a NIR-spectroscopic investigation of
mixtures of alcohol in water is presented. Here, only one or two PCs
may be expected to be appropriate, reflecting a two-component end-
member mixing system (refer to simplex designs in chapter 11).
However, mixing alcohol with water also gives rise to physical
interference effects which require one or maybe even two additional
PCs for a complete description. The number of PCs in practice is
therefore not 1–2 but rather 2–3. On the other hand, if the actual PCA
on the alcohol/water spectra came up with, say, 5–7 components,
this should be treated as suspicious. Such a large number of
components for this simple system clearly implies a gross overfitting
—unless, say, contaminants were at play. This last point exemplifies
the power of multivariate analysis, i.e. the complexity of the observed
model motivates the data analyst not to accept the model, but use
the diagnostic tools of PCA to determine the root cause of the
problem in a systematic and well-planned approach.
Assessing the expected model dimensions even from a more well-
known problem at hand is often not a simple task; in spectroscopic
calibration for instance, instrumental faults/artefacts and/or
background effects often give rise to additional PCs which then must
be included in the model. Only with sufficient knowledge of the
particular data analysis problem at hand is it possible to sort out such
data analytical effects or artefacts when they are encountered.

6.3.1 Interpretation of score plots—look for patterns

Figure 6.2 shows a (t1 vs t2)-score plot for the analysis of gasoline
spectra used to build a calibration model for the prediction of Octane
Number (refer to chapter 7 for more details on this problem), where
three distinct groups can be observed. In this type of situation, it is
advisable to use coded object-names in a way that reflects their
compositions or other known relevant external properties as shown in
this figure. In this case the objects are grouped by high, medium and
low octane number grading.
Necessary Advice: It is the authors' experience that when
external bodies provide “well-labelled” data sets, it is no surprise
to find some of the worst labelling practices in play. The best
advice that can be offered is to be absolutely consistent with sample
naming and labelling; double-checking is a blessing. It will make the
lives of all involved much easier and will typically (from experience)
halve the time taken to analyse the data. Too much time is wasted in
data analysis trying to sort out what may be meant by the person who
compiled the data in the first place. Always decide on a consistent
naming convention and stick to it.
To look for patterns, it might be useful to draw lines between
objects with identical properties and to encircle groups of objects
with similar properties, as has been done in Figure 6.2, manually by
all means on a print out, or aided by some relevant computer
software. It is perfectly admissible to use hand-drawn guide lines on
any type of plot—in fact this is recommended throughout the
modelling phase when new information is still emerging and is being
assessed. The issue here is not how to do this—the issue is to use
whatever appropriate annotation to the plot in question, which will
help with the particular interpretation needs.
The annotated plot in Figure 6.2 shows that octane number
increases along a (virtual) axis that runs from lower left to upper right;
i.e. both PCs 1 and 2 contribute to the determination of octane
number and thus in the decomposed score-plot (which above had
been claimed to result in orthogonal, individually interpretable
components). It is of no great surprise to state that real-world
systems are regrettably not always that simple. There is no guarantee
that all data systems will necessarily be structured in such a simple
fashion so as to be stringently decomposable only into one-to-one
phenomena–PC-component relationships. For more complex data
sets, PCA results are nevertheless decomposed into easily
interpretable axes, which require subject matter expertise in the
particular domain to interpret.
As is found in Figure 6.2, software packages such as The
Unscrambler® always list the explained (modelled) calibration
variance for the PCs displayed. In this case, PC1 describes 86% of
the data and PC2 describes a further 12% for a total of 98%
explained in two PCs. Higher-order PCs therefore describe less than
2% of the variance, so interpretation of, or outliers in, higher-order PCs
would be largely irrelevant.

Score Plots—Outliers
Figure 6.3 is the scores plot from the same set of gasoline samples
described in Figure 6.2, this time with two samples added that
contain an additive. The scores plot now shows that two severe
outliers, objects marked as the additive, are now detected. In this
case one can appreciate the enormous effect outliers can have on a
PC model. PC1 is used almost exclusively to describe these two
objects in opposition to all others (88% in total). The other objects lie
nicely on a line that, although somewhat oblique to it, is well described
by PC2 (10% explained variance). If the objects that are detected as
outliers are erroneous or non-representative (note: they may not be—
even if they at first look very different from the others), they distort the
whole model because of their large leverage effect. If/when they are
removed, the other samples/objects become evenly distributed
throughout the whole score plot (Figure 6.2). Note also that what PC1
described as important in Figure 6.2 accounted for 86% of the total
variability, whereas this information is only 10% explained by PC2 in
Figure 6.3. Such is the effect of large, strongly influential outliers.
Clusters or groups of objects do not always imply problems, but if
such groups are clearly separated, it might be necessary to model
each group separately. Decisions on this type of issue are, needless to
say, only possible based on critical external knowledge. In general, in
score plots, objects close to each other are similar and the further
objects separate, the more dissimilar they become. The topic of
outliers is presented in section 6.6 of this chapter.
Figure 6.2: Annotated score plot. Use problem-specific, meaningful names to ease
interpretation.

Figure 6.3: Score plot with objects with additives revealed as gross outliers.

6.3.2 Summary—interpretation of score plots


The following points should be used as a guide in the interpretation of
PCA score plots.
Objects close to the PC-coordinate origin are very much “typical”,
the most “average”. The ones furthest away from the origin may be
extremes, or even outliers, but they may also be legitimate end-
members. It is the data analyst’s problem-specific responsibility to
decide which is which.
Objects close to each other are similar, those far away from each
other are dissimilar. In particular, in any combination of PC axes
plotted, samples that lie along a virtual 180° line on either side of
the PC axes have inverse properties from one another (i.e. based
on the measurement variables used to describe the data). They are
negatively correlated with one another.
Following on from the above point, samples that lie exactly on two
different PC axes (which are orthogonal to each other) have properties
that are independent of each other and are described exclusively by
their corresponding single PC loadings. Samples lying at 90°
to each other when plotted in PC score space also have
independent properties but may be described by a combination of
PC loadings.
Objects in clear groups are similar to each other and dissimilar to
other groups. Widely separated groups may indicate that a model
for each separate group may be appropriate. This is the basis of
the classification method of Soft Independent Modelling of Class
Analogy (SIMCA) presented in chapter 10.
“Isolated objects” may be outliers—objects that do not fit in with
the rest of the sample set used.
In the ideal case, objects belonging to one coherent data class
typically should be “well spread” over the whole plot. If they are
not, the data structure is to some degree “clumpy” and problem-
specific, domain knowledge must be brought in for deeper
understanding.
By using well-reflected object names that are related to the
important external properties, one may better understand the
meaning of the principal components as directly related to the
problem context.
The layout of the overall object structure in score plots must be
interpreted by studying the corresponding loading plots.
6.3.3 Interpretation of loading plots—look for important
variables

Variables with a high degree of systematic variation typically have
large absolute variances, and consequently large loadings. In a two-
vector loading plot they lie far away from the origin. Variables of little
importance lie near the origin. This is a general statement, which is
always scaling-dependent, however. When assessing importance, it
is mandatory also to consider the proportions of the total explained
variance along each component—if, e.g., PC1 explains 75% and PC2
only 5%, then variables with large loadings in PC1 are much more
important than those with large loadings in PC2—in fact 15 times as
important!

Correlation / Covariance
Variables close to each other, situated towards the periphery of the
loading plots, co-vary strongly, in proportion to their distance
from the PC origin (again relative to the overall total variance fractions
explained by the pertinent components). If the variables lie on the
same side of the origin, they co-vary in a positive sense, i.e. they
have a positive correlation (auto-scaled data). If they lie on opposite
sides of the origin, more or less (some latitude is needed here) along
a straight line through the PC origin, they are negatively correlated
with each other. Correlation is not automatically reflecting a causal
relation, however; interpretation is always necessary.
Most importantly: loadings, which are at approximately 90° to
each other with respect to the origin, are not co-varying, i.e. they vary
independently of each other. Loadings close to a PC axis are
significant only to that PC. Variables with a large loading on two PCs
are significant on both PCs. Figure 6.4 provides a graphical
representation of these situations.
Figure 6.4: Graphical representation on how to interpret 2D loading structures.

With reference to Figure 6.4, the following conclusions may be
drawn from the loadings structure.
Variable A is only positively correlated to PC1.
Variable B is only negatively correlated to PC1.
Variables A and B have opposing properties, i.e. as variable A
increases, variable B decreases and vice versa. Variables A and B
are negatively correlated.
Variable C is perfectly positively correlated only to PC2.
Variable C is orthogonal to variables A and B and thus has
properties independent of these two variables. Variables A and B
influence the PC1 direction only and variable C influences the +PC2
direction only.
Variables D and E are positively correlated to both PC1 and PC2
and, in particular, are positively correlated to each other.
Variables D and E are negatively correlated to variable F, which is
also described by PCs 1 and 2.
Variable G shows the least variability for the PCs plotted.

Spectroscopic data
In spectroscopic applications, and similarly for many X-variable data
sets, the one-vector loading plot is often the more useful. Again, large
loading values imply important variables (e.g. wavelengths). With
reference to the loadings plot of Figure 6.5, the following conclusions
can be drawn:

Figure 6.5: Spectral line loadings of the PCA performed on gasoline samples by NIR
spectroscopy.

PC1 describes 89% of the total variability in the data set.


This variability mostly stems from the region 1100–1250 nm.
In particular, the major source of variability can be interpreted as
follows: when the absorbance bands around 1150 nm increase, the
absorbances around 1200 nm decrease (and, to a lesser extent, the
absorbances around 1400 nm also decrease).
The opposite of the above point is also true if the absorbance at
1150 nm decreases.
Since these loadings are based on spectroscopic absorbances,
these have chemical meanings and therefore interpretation is
possible.
6.4 Example: alcohol in water analysis
This data originates from one of the first courses on multivariate
calibration held in Norway. Mixtures of methanol, ethanol and iso-
propanol were measured with NIR spectroscopy in transmission
mode over 101 wavelengths (1100–1600 nm). This application
concerns principal component analysis and multivariate calibration
and has been described by Harald Martens and Tormod Næs in the
textbook Multivariate Calibration [1].

Data set background:


This data set illustrates the practical use of principal component
analysis (PCA) on a set of designed alcohol mixtures. The spectra
were transformed into absorbance scale in accordance with the Beer–
Lambert law, which means that concentration is proportional to
absorbance [2]. Two sets of spectra are investigated; one with raw
data and one after baseline correction with Multiplicative Scatter
Correction (MSC), refer to chapter 5 for more details on this
preprocessing method.

Data set organisation:


16 calibration objects and 11 test set objects.
Three concentrations of the alcohols summing to 100%.
101 wavelengths, every 5th nanometre in the range 1100–1600 nm.

Investigating the raw data:


Two data sets were available for this investigation, the spectral data
(X) and the concentration data (Y). Plotting raw data is highly
informative and since the mixtures contain three components, it is
easy to visualise the relation between the components using a 3-D
scatter plot. This data is visualised in Figure 6.6.
Figure 6.6: Plot of alcohol end-member (chemical “components”) concentrations showing
the mixture design relationship.
Figure 6.7: Raw NIR transmission spectra of alcohol mixture data.

The arrangement of the data points in Figure 6.6 is noticeable. The
point pattern shows a regular distribution in the form of a mixture
simplex (refer to chapter 11 on DoE for more information on mixture
designs). The distribution of the points is such that an exact
mathematical model can be fitted to this system. In this case, the
objective was to use the experimental design to generate a
calibration set of data that is as independent as possible and covers
the widest possible experimental space.
Figure 6.7 plots the raw spectral data for each observation as a
line plot.
The raw NIR spectra have a similar overall profile; however, one
sample looks suspect in the region 1200–1220 nm, showing an
abnormal spike. This spectrum was removed from further analysis
due to this deviation and the data set was subsequently split into a
calibration set and a test set.
Figure 6.8 shows the score plot with its associated explained
variability plot when applied to the raw transmission spectra.
The explained variance plot suggests that 2–3 PCs describe this
system and the combined PC1 and PC2 explained variance is 96%.
While this is an excellent decomposition of 100 spectral variables
down into 2 interpretable dimensions, the shape of the score plot
bears no reflection of the original mixture design shown in Figure 6.6.
This is a first indication that either preprocessing is required or the
mixtures were prepared incorrectly. Fortunately, the use of
preprocessing does not require the physical re-preparation of
samples (alternative preprocessing can be effectuated in silico), so
this is the first approach taken to further investigate the data.

Figure 6.8: PCA scores and explained variance plots for raw spectra alcohol mixture data.

MSC was the preprocessing method chosen first, even though
transmission spectra were obtained. This is because this
preprocessing is an excellent non-linear transformation that can help
with the fact that the original absorbance scale is outside of the
typical linear range for the Beer–Lambert law to hold. The MSC
preprocessed spectra are shown in Figure 6.9.
Figure 6.9: MSC corrected alcohol mixture spectra.
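A minimal Python sketch of MSC as used here (assuming the common formulation in which every spectrum is regressed on the mean spectrum and then corrected; the spectra matrix X and all names are hypothetical):

import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction: regress each spectrum on a
    reference (by default the mean spectrum) and correct as (x - a) / b."""
    ref = X.mean(axis=0) if reference is None else reference
    Xc = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        b, a = np.polyfit(ref, x, 1)          # slope b and offset a of x vs reference
        Xc[i] = (x - a) / b
    return Xc, ref

# Example use on a hypothetical spectra matrix
rng = np.random.default_rng(4)
base = np.sin(np.linspace(0, 3, 101))
X = np.array([1.0 + 0.1 * k + (1 + 0.05 * k) * base for k in range(16)])
X_msc, ref = msc(X)

When new samples are corrected later, the reference spectrum from the calibration set should be reused rather than recomputed.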

Figure 6.10 shows the score plot with its associated explained
variability plot when applied to the MSC-corrected transmission
spectra. In this case, a two-PC model is an adequate representation
of the data, particularly for mixture data where it is assumed that if
two components of a three-component system are known, the third
component can be found by difference.
The shape of the score plot now resembles the original mixture
design and is an indication of the adequacy of the preprocessing
applied to the data. However, because the points do not align
perfectly with the edges of the simplex, this is an indication of
uncorrected non-linear mixing in the system.
There are plenty of preprocessing methods to try on a data set,
however, as explained in great detail in chapter 5, it is bad form to try
out all alternative scaling, transformations or normalisations
indiscriminately without problem-specific justification. This situation
was exemplified by the previous example where the preprocessing
was chosen as a non-linear correction function and its use was
justified based on the two-dimensional score plot describing 97% of
the total data variability. Preprocessing is not an option that one
selects willy-nilly from shelf abundance in the data analytical
supermarket. At this point, all that is required to know is that the
correct choice of preprocessing is always problem dependent, and that
a wrong choice may actually lead to interpretations that are not
relevant to the specific problem.

Figure 6.10: PCA scores and explained variance plots for MSC-corrected alcohol mixture
data.

6.5 PCA—what can go wrong?


There are quite a number of things that can go wrong in a PCA.
Unfortunately, it is not always simple to detect errors in the PCA
results themselves. Looking at residuals is one way of checking, but
the residuals need not always show that something is wrong. The
most important check is that the results should comply with the
specific problem understanding and that they are internally
consistent. The best strategy is to avoid making mistakes during the
analysis in the first place!
Software packages such as The Unscrambler® provide added
diagnostics such as validation score plots and stability plots, more of
which will be presented in the section on outliers (section 6.6).
Below follows a list of some of the most common potential pitfalls
encountered in PCA. While it is not fully comprehensive (even the
most experienced of the authors of this book has not finished making
illuminative errors), one may certainly use it as a useful checklist.

6.5.1 Is there any information in the data set?

This was defined in path 3 in section 6.1 and can to some extent be
observed when any interpretations do not make sense. However,
there are also some very subtle mistakes that can be made here.
Take the case of carrying out a PCA on a data set and the conclusion
drawn is that PC1 corresponds to viscosity, say, while in reality it may
just as well correspond to temperature (but not enough time was
taken to reflect on the problem formulation, and it was therefore
missed that this important external influence, temperature, should also
be measured). In this case, it is not possible to discover this without
being aware of the temperature dependence of viscosity and defining
the experimental protocol accordingly, for instance, by keeping the
temperature constant in some of the experiments. In a direct PCA,
this kind of lack of reflection of the entire problem domain can easily
lead to misinterpretations, in multivariate regression it can be a
severe problem. Again, it all boils down to subject matter expertise
related to the specific domain of the problem.

6.5.2 Too few PCs are used in the model


This means that not all the potential information is being fully
exploited in the data. Potential information is lost and while not being
the worst mistake that can be made, it should of course be avoided
through careful analysis of plots and model diagnostics.

6.5.3 Too many PCs are used in the model


This can be an equally serious mistake, indeed. In this case noise is
included in the model (the exact opposite of what PCA is being used
for in the first place). The noise contribution must lead to erroneous
interpretations; the analysis will always be wrong, at least partly.

6.5.4 Outliers which are truly due to erroneous data were not
removed

Obviously, this will result in an invalid model. Errors are being
modelled in this case instead of the interesting variations, to some
significant extent.

6.5.5 Outliers that contain important information were removed

To put it bluntly, something important will be missed. The model will
be inferior with respect to the optimal model, as it will not describe all
the phenomena hidden in the data set.

6.5.6 The score plots were not explored sufficiently

Not all relevant score plots were investigated carefully, and too many
important clues were missed. Errors 6.5.4 and 6.5.5 above are
connected to this mistake.

6.5.7 Loadings were interpreted with the wrong number of PCs

This may give rise to serious misinterpretations. Variables with
important information may have been removed because they seem to
show up as outliers. Remember that the loadings constitute the
bridge between variable space and sample space. If the “wrong” PC
space has been chosen, the “bridge” will not take the analysis to the
right place. The bridge will be the wrong one and the loadings cannot
be trusted.

6.5.8 Too much reliance on the standard diagnostics in the computer program without thinking for yourself

This is a very common mistake—and the most serious of them all!


The diagnostics may be adequate and helpful most of the time, but
one must always use one’s own problem understanding and check
for consistency and consequences. Remember that the computer
program has no knowledge of your specific problem—it runs along
standard algorithmic procedures that may not necessarily apply
equally well to all specific data sets, the present The Unscrambler®
included.
Note: It is the authors' common experience that software
programs that allow “automated procedures” are simply unreliable,
because one is not in a position to know when the automated result
actually fits reality and when it does not. This is probably the most
common cause of misinterpretation in multivariate analysis. Avoid all
such claims by software vendors at all times. Besides, if
chemometrics was that easy, there would be no point in writing a
textbook such as this.

6.5.9 The “wrong” data preprocessing was used


This is described in path 2 in section 6.1 and is rather a tricky point.
Preprocessing/pretreatment of the data set is often important or
essential for relevant and valid modelling results. The correct type of
preprocessing is generally given by the type of problem, but this is
certainly up to the data analyst to decide; the software unfortunately
cannot be made clairvoyant. The wrong preprocessing may well
nearly always give rise to misinterpretations! This introductory
textbook deals with some of the most important types of
preprocessing in chapter 5.
Hopefully this list of possible errors, some of which cannot even be
detected when they arise, is not a deterrent to the application of
multivariate analysis! Experience, experience—and still more
experience is the only thing that will help in the journey through many
of these pitfalls.

6.6 Outliers
Outliers are atypical objects (or variables) hopefully detected early on
in the process of PCA(X). If outliers are the result of erroneous
sampling, or measuring, or if they represent truly aberrant data for
other reasons, they should be removed, otherwise the model will not
be correct.
On the other hand, “outliers” may just as well be important, for
example if they are somewhat extreme, but still legitimate,
representatives for the data, in which case it is essential to keep them
in the model. Thus, if they represent a significant or important
phenomenon and they are removed, the model will also be
“incorrect”. It will lack, and will consequently be unable to explain,
this property which is in fact present in the data. The model will be an
equally poor approximation to the real world to which it is to be
applied sooner or later. This will of course appear to be a major
problem for new data analysts: an inferior model will be generated
if true outliers are included, and also if false outliers are
removed.
The good news is that outliers must either be one or the other.
The bad news is that it is always up to the analyst to decide whether
an outlier should be kept or discarded. In fact, the only problem is
that it will take some experience to get to know these things by their
appearances in the appropriate plots, but it is a feasible learning
assignment and it is one which is absolutely essential to master.
Textbook examples, exercises and personal experience will quickly
help to assume the necessary responsibility for detecting and
deleting outliers efficiently. It is essential to realise that there are
essentially only two major outlier detection modes:
1) Data analytical: the relative (geometrical) distribution of objects in,
e.g., the score plots is often all to go by. Decisions must be based
only on the data context and the data analyst's analytical
experience.
2) Domain: i.e. problem-specific knowledge may in certain/many
other situations constitute a welcome reference with which to make
the difficult decision as to whether a particular object, or group of
objects, are outliers or not.
Score plots are particularly good for outlier detection. An outlying
object will appear in some score plot(s) as separate from the rest of
the objects, to a larger or smaller degree. Outliers are characterised
by excessively high scores (with either sign) as compared to all other
objects. Figure 6.11 shows two cases. In the left pane, the object with
an arrow is a potential outlier, but some observers may decide that it
nevertheless still “fits” with the general population, while the object in
the right pane is considerably more doubtful when assessed together
with all the remaining objects, and their internal trend. Observe that
such a relatively “clear” outlier will, in a major way, be significantly
responsible for the direction of a component (PC1 even, in the
illustrated case).

Figure 6.11: Mild and extreme outliers in score space.


The problem of how to quantitatively identify—and thus perhaps
be able automatically to delete—outliers has been the subject of
many suggestions in the data analytical literature. While it may be
acceptable to take out outliers “manually” for data exploration
purposes, for automatic data analysis, e.g. process control, there will
not be the time, nor resources, to perform a continuing visual
inspection of score plots.
Residual object/variable variances can be used for this purpose,
as indeed can the relevant scores/loadings plots as well. This manual
option may sometimes involve a lot of work though, especially when
there are many variables and/or objects.
The general issue of automated outlier deletion is complex;
however, the following sections provide details of some of the more
important statistics and plots used for this purpose, particularly in
applications such as Multivariate Statistical Process Control (MSPC).

6.6.1 Hotelling’s T2 statistic

The Hotelling’s T2 statistic [3] is a multivariate generalisation of the


Student t-test (referred to in chapter 2). It is a measure of distance
from the centre of a multivariate model and has the advantage over
the leverage statistic (also discussed below) that statistical limits can
be placed on these distances. The form of the Hotelling’s T2 statistic
is provided in equation 6.1,
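A sketch of the standard form of this statistic for a single object x_i, written here in LaTeX notation and consistent with the definitions below (the exact notation of equation 6.1 may differ), is

T_i^2 = (\mathbf{x}_i - \bar{\mathbf{x}})^{\mathsf{T}}\, \mathbf{W}^{-1}\, (\mathbf{x}_i - \bar{\mathbf{x}})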

where X is the original data matrix, x̄ is the mean of the data set
and W is the covariance matrix of X.
The Hotelling’s T2 statistic is approximately F-distrubuted as
follows,
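A commonly used approximation for the corresponding critical limit (stated here, in LaTeX notation, as the standard form found in the MSPC literature, with n objects, A components and significance level α; this is an assumed form, not necessarily the exact expression of the original equation) is

T^2_{\mathrm{lim}} \approx \frac{A(n-1)}{n-A}\, F_{\alpha}(A,\, n-A)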
Figure 6.12: Gasoline octane number data score plot with Hotelling’s T2 ellipse at 25%
significance.

This translates to the rule that if any sample exceeds this critical
F-value, it should be evaluated as a potential outlier.
Rather than applying this statistic to the X-data alone, calculation
of Hotelling’s T2 for multivariate models is performed on the scores. A
convenient way to display the Hotelling’s T2 statistic is by displaying it
on a scores plot, where it can be shown at a set of levels of
significance. Figure 6.12 shows how the Hotelling’s T2 statistic can
reveal outliers in a PCA score plot.
The Hotelling’s T2 ellipse set at 25% significance is capable of
separating the two samples with additive. This statistic finds valuable
use in many areas of data analysis and should be reviewed as one of
the first outlier detection methods.
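As a short Python sketch of how such values can be computed outside any particular software package (NumPy and scipy.stats for the F quantile; data and names are hypothetical):

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 8))
Xc = X - X.mean(axis=0)

A = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A]                              # scores for A components
n = X.shape[0]

# Hotelling's T2 per object, computed on the scores
score_var = T.var(axis=0, ddof=1)               # variance of each score vector
T2 = np.sum(T ** 2 / score_var, axis=1)

# Approximate critical limit at the 5% significance level
T2_lim = A * (n - 1) / (n - A) * f.ppf(0.95, A, n - A)
print(np.where(T2 > T2_lim)[0])                 # objects flagged as potential outliers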

6.6.2 Leverage
Leverage is a measure of the effect of an object on a calculated
model and is a measure of the distance of an object from the model
centre. The leverage for an object is scaled between 0 and 1 with
extremely influential objects having a leverage close to 1, and a
“typical” (average) object having a leverage close to zero. Leverage is
calculated according to equation 6.2.
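A sketch of the commonly used form, in LaTeX notation (assuming the convention that includes a 1/n term for the model centre, with t_a the score vector of component a), is

h_i = \frac{1}{n} + \sum_{a=1}^{A} \frac{t_{ia}^2}{\mathbf{t}_a^{\mathsf{T}} \mathbf{t}_a}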

where hi is the leverage for object i, A is the optimal number of
components in the model (Aopt) and tia is the score value for object i in
component a.
For an individual object, there is a one-to-one relationship
between the Hotelling’s T2 statistic and Leverage, making these
measures in principle the same thing (equation 6.3):
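Assuming mean-centred scores and the usual n−1 convention for the score variance, a sketch of this relationship in LaTeX notation is

h_i = \frac{1}{n} + \frac{T_i^2}{n-1}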

The limit of 3 × Leverage as a criterion for detecting outliers
originated from the univariate standard statistical notion of three
standard deviations equating to an approximately 99.7% confidence interval.
If the number of samples is less than 25, the Leverage limit may
be appropriate, otherwise the Hotelling’s T2 statistic is used as it is
based on an F-test and it seems to have become the standard
statistic for multivariate statistical process control. For a data set with
few samples, the variance of the scores, which is the basis for the
Hotelling’s T2 will be large, and the Hotelling’s ellipse in the score plot
will typically be much wider than the samples’ position. This is one of
the challenges when using a model with few samples to test if a
sample is inside the model space.
Both Hotelling’s T2 and leverage are dependent on the number of
components used in the model and it is very important only to use
these statistics for the optimal number of model components Aopt, or
to evaluate how specific samples change their values as more
components are included in the model. High leverage samples highly
influence the model by pulling the direction of the model components
to themselves. This can have dire consequences for model
interpretation if a high-leverage sample is left to influence the model too
greatly, underscoring that even a single unrecognised outlier can
wreck an entire multivariate model.
To maximise the benefits of residuals and the leverage statistic,
there exists a powerful plot that can be used for model diagnostics,
routine analysis testing and for real time applications utilising
Multivariate Statistical Process Control (MSPC). This is known as the
influence plot and is described in detail in section 6.6.4. For a more
detailed treatise on influence and residuals, the interested reader is
referred to the text by Cook and Weisberg [4].

6.6.3 Mahalanobis distance

The term Mahalanobis Distance (MD) is often referred to in the
literature as a criterion to detect outliers and is in principle the same
as the Hotelling’s T2 statistic, but in most cases, it is calculated based
on the original variables. It will be the same as the Hotelling’s T2 for
the maximum number of components i.e. where the residual E is
zero.
However, separating the variance of a sample in two parts as in
the Influence plot (discussed in the next section) is the sounder
approach in that it gives the possibility to distinguish between two
types of outliers (using an easy-to-understand pharmaceutical
analogy):
1) An outlier representing that there is “too much” or “too little” of
what is modelled, e.g. if one has established a model for 90–110%
of an active ingredient in a pharmaceutical tablet, a new sample
with 85% will lie outside the Hotelling’s T2 or Leverage limit.
2) An outlier may be an object that is in fact a new tablet for example
with another active ingredient (large residual in the Influence plot on
the y-axis, or the sample residual plot).
No more will be said about the Mahalanobis distance here and the
interested reader is referred to the literature for more details [1].

6.6.4 Influence plots


A convenient way to visualise the presence or absence of outliers for
both PCA and regression models (PCR, PLSR) is to plot leverage (hi)
or Hotelling's T2 statistic on the x-axis and X-, Q- or F-residuals on the
y-axis. This is known as the influence plot but in fact any pertinent
combination of statistics can be plotted for a particular purpose. Q-
and F-Residuals are defined in greater detail in chapter 7.
The general form of the Influence plot is provided in Figure 6.13
and an explanation of its use is also given.
As an example, consider the plane formed by two PCs in Figure
6.13. Samples that lie close to the plane, but are extreme along the
PC axes are high leverage samples, i.e. they can influence the
orientation of the plane if they become very far away from the centre
of the model.

Figure 6.13: The influence plot and its use for detecting outliers and suspect objects.

Samples that do not lie on the plane in an orientation
perpendicular to the plane show some form of X-residual. As samples
lie further away from the plane, they do not fit the model well and are
therefore X-residual outliers. When a sample is extreme in both X-
residual and leverage, it can in most cases be concluded that the
sample is a true outlier. As always it is preferable to be able to justify
outlier designations with external evidence, wherever available.
There are a number of variants of the influence plot, the most
common forms are:
1) X-residual vs leverage
2) Q/F-residual vs Hotelling's T2
Any combination of the above variants is possible. The advantage
of using Q/F-residuals and Hotelling's T2 in influence plots is that such
plots allow the placement of statistical limits, thus providing objective
evidence for the presence or absence of outliers (assuming, as
always, that the statistical prerequisites regarding the salient
distributions hold up to reality).
Objects that lie far away from all others, or which do not fit within
the bounds of the influence plot, are possible outliers, because they
are different from the others. However, this does not necessarily
mean that they should be removed from the data set. It is the
responsibility of the data analyst to determine why they are different,
and make the crucial choice of whether to keep them or not. Such
objects may represent extreme end-member objects that actually
help span the calibration/validation variations; removing them would
then result in lower variability for a model, which consequently can
then only handle the more typical, average samples.
There are thus plenty of helpful indices at hand for the first
item on any multivariate data analysis agenda, outlier identification,
with subsequent decision-making: which outliers should be policed
out of the training data set? It is here emphasised that in the hands of
the experienced data analyst, much of this work can actually also
take place in visual modus, i.e. by informed inspection of the relevant
visualisation plots. The best data analyst trusts their own experience
and ability to make informed judgements about outliers—while of
course incorporating the additional input from the relevant dedicated
statistics and plots.
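A compact Python sketch of the basic influence plot described above (NumPy and matplotlib; hypothetical data and names), plotting Hotelling's T2 against the Q-residual, i.e. the sum of squared X-residuals per object:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 8))
Xc = X - X.mean(axis=0)

A = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = (U * s)[:, :A]                      # scores
P = Vt[:A].T                            # loadings

T2 = np.sum(T ** 2 / T.var(axis=0, ddof=1), axis=1)   # Hotelling's T2 per object
E = Xc - T @ P.T                                       # X-residuals after A components
Q = np.sum(E ** 2, axis=1)                             # Q-residual per object

plt.scatter(T2, Q)
plt.xlabel("Hotelling's T2 (A = 2)")
plt.ylabel("Q-residual")
plt.title("Influence plot: objects in the upper right are suspect")
plt.show()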

6.7 Validation score plot and PCA projection


As a final topic in the presentation of PCA, two related topics will be
covered, those of validation scores and projection. Projection is the
method of taking a new object and “transforming” it through the
loadings of an existing model to produce a new score value in
multivariate space.

6.7.1 Multivariate projection

Since scores relate to object space and, in particular, the inter-object
relationships, using the method of projection with the accompanying
outlier statistics, a new object can be projected onto an existing
model and the position of the new object can be determined as being
part of the model population, or as being distinctly different.
It is known that once a PCA model has been generated, it can be
used on new data with the error matrix dropped from the definition as
per equation 6.4.
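A sketch of this relationship, written in LaTeX notation for identically preprocessed new data and with the error term dropped, is

\mathbf{X}_{\mathrm{new}} \approx \mathbf{T}_{\mathrm{new}} \mathbf{P}^{\mathsf{T}}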

Rearranging equation 6.4 (and a little bit of matrix algebra) to
make T the subject, the projection equation for PCA is obtained as
equation 6.5.
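Because the loading vectors in P are orthonormal (\mathbf{P}^{\mathsf{T}}\mathbf{P} = \mathbf{I}), a sketch of the resulting projection equation in LaTeX notation is

\mathbf{T}_{\mathrm{new}} = \mathbf{X}_{\mathrm{new}} \mathbf{P}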

where TNew is the projected score value for the new object XNew and P
is the matrix of loading vectors for the PCA model with Aopt components.
In words, projection is the PCA equivalent of prediction using a
quantitative model (as discussed in chapter 7). It is a highly valuable
method for outlier detection post model development when the
model is used in real situations. Figure 6.14 shows the test set left out
of the alcohol mixture example projected onto the scores space of
the calibration set.
In Figure 6.14, green signifies the projected points and these
appear to lie in the same general space as the blue calibration
objects. The residual variance plot shows that the new sample
variance (green curve) follows the same pattern as the original model
(blue and red curves). It can be concluded from this evidence that the
projected objects are from the same population as the calibration
objects.
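A minimal Python/NumPy sketch of projection (hypothetical data and names): the new objects are preprocessed with the calibration mean and then multiplied by the calibration loadings.

import numpy as np

rng = np.random.default_rng(7)
X_cal = rng.normal(size=(16, 101))       # hypothetical calibration spectra
X_new = rng.normal(size=(11, 101))       # hypothetical new/test spectra

A = 2
x_mean = X_cal.mean(axis=0)
U, s, Vt = np.linalg.svd(X_cal - x_mean, full_matrices=False)
P = Vt[:A].T                             # calibration loadings

T_cal = (X_cal - x_mean) @ P             # calibration scores
T_new = (X_new - x_mean) @ P             # projected scores for the new objects

# Residuals of the new objects with respect to the calibration model
E_new = (X_new - x_mean) - T_new @ P.T
print(np.sum(E_new ** 2, axis=1))        # per-object residual sum of squares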

6.7.2 Validation scores

Validation is the topic of chapter 8, and the method of validation used
will determine how reliably the end model can be
interpreted. Validation scores can be used as a measure of how well
a model can predict the “position” of new samples. When cross-
validation is used, it is expected that the projected score values
should closely overlay each other in the scores plot (as all objects are
used for calibration and validation purposes). The case is different for
test set validation where the calibration and validation samples are
independent of each other.
In any case though, if the validation set is representative, then
they should occupy a similar position in multivariate space. Validation
scores are generated using the method of projection as described in
section 6.7.1 but calculated as part of the model generation process.

Figure 6.14: Alcohol mixture test set projected onto calibration set space model (left).
Figure 6.15: Outlier detection using validation scores for alcohol mixture analysis.

As an example, for the alcohol mixture analysis, it was observed
visually in Figure 6.7 that one sample (A16) had a different profile (it
had an unusual spike between 1200 nm and 1220 nm). The PCA
scores for the raw data including this sample are shown in Figure
6.15 along with the calibration and validation scores.
In Figure 6.15, the blue points are the calibration objects and the
red points are the validation objects. The method of validation used
was full cross validation where it can be seen that the majority of the
red and blue points are very similar except for the point A16 that
shows a large difference between the calibration and validation
points. This is because of the spike in the spectrum, and it shows that
the model without this spectral feature is not capable of accounting
for the extra variability in this sample.
A complementary method used to detect outliers when cross
validation is used is the stability plot. The stability plot shows the
projection of the object in question onto the sub-models generated in
the cross-validation process. If the samples project onto each model
and show little variability for all cross-validation segments, the model
can be considered capable of accounting for all of the sample
variability in the data set.
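A sketch of the idea behind such a stability check, in Python/NumPy (hypothetical data and names): leave out one object at a time, refit the PCA, align the sign of each sub-model component with the full model, and project all objects onto every sub-model; a large spread of projected scores for an object across sub-models flags it as influential.

import numpy as np

def pca_loadings(X, A):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:A].T

rng = np.random.default_rng(8)
X = rng.normal(size=(16, 101))            # hypothetical spectra matrix
n, A = X.shape[0], 2

P_full = pca_loadings(X, A)
projected = np.zeros((n, n, A))           # (segment, object, component)

for seg in range(n):                      # full (leave-one-out) cross validation
    X_sub = np.delete(X, seg, axis=0)
    P_sub = pca_loadings(X_sub, A)
    # PC signs are arbitrary: flip sub-model loadings to match the full model
    P_sub = P_sub * np.sign(np.sum(P_sub * P_full, axis=0))
    projected[seg] = (X - X_sub.mean(axis=0)) @ P_sub

# Spread of each object's projected scores across all sub-models
spread = projected.std(axis=0)
print(spread.max(axis=1))                 # large values indicate influential objects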

Figure 6.16: Stability plot of alcohol mixture analysis scores showing the influence object
A16 has on the model.

However, if a gross outlier is present, the projections start to
become unstable and reflect how much the outlier influences the
model. Figure 6.16 shows the stability plot for the score plot and in
particular shows how influential object A16 is to the model.
When the sample A16 is absent from the sub-model during cross
validation, the influence this sample has on the model can be seen
from the large projected difference from the sub-models containing
this object. This is an indication that this sample is very influential and
potentially an outlier. There are corresponding stability plots for
loading plots as well, however, these can be complex to interpret,
particularly when many variables are being investigated.
Figure 6.16 shows that object A16 is different to the rest in PC2
primarily and an investigation of the PC2 line loading plot shows
where this sample is influencing the model, i.e. in the region of the
spectrum that was visually apparent. Figure 6.17 provides the loading
plot for PC2.

Figure 6.17: Loading plot of alcohol mixture analysis for PC2 showing the origin of the
outlying observation A16.

6.8 Exercise—detecting outliers (Troodos)


Finding outliers is always crucial. Here is an example using data from
a real-world case that shows some interesting issues.

6.8.1 Purpose
To learn about outliers and how to recognise them from the score
plot and the influence plot, introduced earlier in this chapter.

Figure 6.18: P. Thy (left) trying out his hand in multivariate data analysis (for the very first
time) supervised by the first author of the present textbook—jointly elucidating the hidden
data structure of the Troodos data set.

6.8.2 Data set

The Troodos area of Cyprus is a region of particular geological
interest, as several hypotheses of its origin are entertained. There has
been quite some dispute over this part of Cyprus' geological history,
which need not be given in all detail here in order for the
data to be used in the present context. Suffice it to say that there are 143 rock
samples available from different locations in the areas underlying the
pertinent section of the Troodos Mountains of Cyprus. The rocks
were painstakingly collected by KHE’s geologist-friend and colleague
since college days, Peter Thy, in a series of strenuous one-man field
campaigns in Cyprus [5]. The data analysis was carried out some
years later as a joint effort. It is always optimal to carry out a
multivariate data analysis with the “data owner” at the data analyst’s
side (this situation is only surpassed by the rare case in which the
analyst is also the collector of the data).

Table 6.1: Description of Troodos variables.

Var Name Description Var Name Description

X1 SiO2 Conc. of SiO2 X6 MgO Conc. of MgO

X2 TiO2 Conc. of TiO2 X7 CaO Conc. of CaO

X3 Al2O3 Conc. of Al2O3 X8 Na2O Conc. of Na2O

X4 FeO Conc. of FeO X9 K2O Conc. of K2O

X5 MnO Conc. of MnO X10 P2O5 Conc. of P2O5

The variables described in Table 6.1 are measurements of the
concentrations of ten rock geochemical compounds (oxides).
Geologists often use such data to discriminate between
“families” of rocks, or rather chemically related rock series, i.e. to
distinguish generically different, or similar, rock groups,
clusters or rock series.
Characterisation by these so-called “major element oxides”
accounts for the dominant geochemical makeup of any rock, usually
to the amount of some 95+%. The rock samples were collected in the
field as “not weathered”. They are therefore supposed to be pristine
“representative samples” as determined by the responsible field
geologist. They are then analysed in the geochemical laboratory and
the relevant compositions of the major element oxides are
determined. The task is to analyse the data in Table 6.1 using a
standard PCA and to start to look for patterns. Picture this as an
open-ended first view of data matrix X.
Figure 6.19: PCA overview of the Troodos data set.

One cardinal question is: are there more than one overall group of
samples?
If the rocks are all geochemically similar, one would expect that
the whole suite of rocks to have been formed geologically by the
same process, perhaps at the same time. If there are clear groups in
the locations of rocks, however, due to different geological
backgrounds, other conclusions about the formation of the area
might be drawn. This work was originally carried out to help settle a
major controversy regarding the entire geological history of the
Troodos massive, see Thy and Esbensen [5], who also show a further
pattern recognition data analysis aspect, not shown here; here we
focus on the initial PCA.

6.8.3 Analysis
Figure 6.19 provides the PCA overview of the Troodos data set, where all
variables were auto-scaled and the method of random cross
validation was used to gain an overview of the data.
The explained variance plot suggests that there are several
sources of variability in this data set, or that the possible common
petrogenetic process is not particularly simple. The score plot shows
that the general object population follows a distinct, narrowly defined
curved profile except for two distinctly different objects, namely
objects 65 and 66. Such extremely lonely samples might very well be
outliers.
The original problem context and the raw data must be
considered while working with these plots. It is also quite normal to
look at the original data matrix again when assessing potential
outliers. Whilst it would not be justified to regard either object 129 or
130 in a similar vein, it is a fairly safe initial bet that objects 65–66 are
indeed gross outliers, while object 129 is in all likelihood an extreme
end-member only. The status for object 130 needs further
investigation. Note that a too swift data analyst may possibly declare
all four objects 19, 20, 129 and 130 as extreme end-members.
The loading plot is shown as correlation loadings and it can be
seen that the variables most influential to PC2 (where objects 65 and
66 lie predominantly) are MnO, FeO and Al2O3. This also appears on
the straight loading plot in this case (not shown here).
The score plot in Figure 6.19 represents a typical picture of a
model where problems may occur. A “good model” spans the
variations in the data such that the resulting score plots show the
samples “well spread over the whole PC space”. In the present plot,
the samples are well spread in PC1, but only a very few samples
represent the major variations in PC2. All the other samples are
situated at, or very close to the origin in this direction (zero score
values in PC2), i.e. they have very little influence on the model in this
direction. It is also noted that the partitioning of the total variance
captured along PCs 1 and 2 is 54 and 23% respectively; thus, a fair
fraction of the total data variance is carried by each of the first two
components, but there may perhaps also be a need for higher-order
components.
The influence plot of Figure 6.19 shows the F-residuals vs
Hotelling's T2. It is reproduced below in Figure 6.20, here illustrated
for PCs 1–4 in a four-pane view (even though the final PCA model will
be based on a smaller number of components).

Figure 6.20: Influence plots of Troodos data with four PCs.

Figure 6.21: PCA overview of Troodos data with objects 65 and 66 removed.

Observe how samples 65 and 66 move as additional components
are added to the tentative model. This is a typical behaviour for
outliers. High residual variance means poor model fit. High leverage
means having a large effect on the model. Therefore, samples in the
upper-right corner (large contribution to the model and high residual
variance) are potentially dangerous outliers.
When more PCs are added to the model, the F-residual decreases
and even outliers will eventually be fitted better to the model. The
model thus concentrates on describing the variations due to these
few different samples instead of modelling the variations in the whole
data set.
An important issue in general is to check the suspicious data. This
is especially true if the data have been typed in manually (if this input
option still exists at all). Typing in any reasonably large number
of data is almost guaranteed to produce some typing errors. Could
this simply be the case for objects 65 and 66?
However, in the present case, after the two objects distinguished
themselves in the early data analysis, their background was
thoroughly scrutinised, and even the physical samples themselves
were re-examined. A plausible geological explanation of their
deviating composition was actually found. There was now enough
evidence to declare these two objects outliers, and they were removed
from the analysis. Figure 6.21 shows the next PCA overview with
objects 65 and 66 removed. It is always best to be able to
substantiate the status of recognised outliers from domain-specific
external knowledge and experience.
The explained variance of two PCs has now risen from 77% to
80%. This is not a large improvement; however, more information has
been partitioned into PC1 (54% in the first model to 61% in the
second). It is also interesting to note that the structure of the loading
plot is quite similar before and after outlier removal.
The influence plot would also suggest that objects 129 to 131 are
potentially extreme. However, removal of these objects has no real
effect on the model and, in particular, the pattern observed in the
loading plot remains the same. It can therefore be concluded that the
model can now be interpreted as is, to reveal whatever hidden data
structure that may be found. In fact, the geochemical/geological
evidence clearly suggested that these two objects were indeed but
extreme end-members of the entire suite of rocks portrayed. The
interpretation of this final model was very interesting (for geologists);
the geological understanding of Cyprus was indeed significantly
modified—but these complicated features need not be followed further here by anybody who is not a geologist (they can of course be found in the basic reference, ibid.). The data analytical illustrations above suffice well to tell the story.

6.8.4 Summary

In this exercise a not-too-simple real-world data set was presented, which after some initial standard PCA apparently contained at
maximum two, possibly four outliers. The difference between outliers
and extremes can indeed be small and was in the present example
only resolved by recourse to extensive domain-specific knowledge.
Removing only one or two objects at a time and making a new model allows visualisation of how the imparted changes manifest themselves. After
removing the first two outliers, the explained variance was slightly
higher at two PCs. Removing the next two outliers did not really
change the explained variance any further.
The first two, or three PCs describe 77–87% of the total variance.
It is not necessarily an objective in itself always to achieve as high as
possible a fraction of the total variance explained in the first PCs at
the exclusion of other data analytical objectives or domain expert
interpretability.
For the present case: in the last score plot for PC1 vs PC2 two
data groupings on either side of the ordinate can be observed (refer
to Figure 6.21). Finding out whether there was only one, or several
rock groups was the overall objective for this particular data analysis.
It is of course difficult for the non-domain expert to interpret the
meaning of PC1 without more detailed geological knowledge about
the objects (field samples). What is geochemically and objectively
clear, however, is that the corresponding PC1 vs PC2 loading plot
indicates that variables 6 (MgO) and 7 (CaO) pull one group to the
right, and the rest (except no. 3 Al2O3) pull the other group to the left
(refer to the correlation loading plot in Figure 6.21). Thus, there is a
very clear two-fold grouping of these 10 variables (one lonely variable
would appear to make up a third group all on its own along the PC2-
direction, which was only of interest for the deviating behaviour of just
two objects, which turned out to be outliers). While there is a total of
ten variables, there are in fact only two underlying geochemical
phenomena present, and the one portrayed by PC1 involves no less
than nine of these variables, and they are all pretty well correlated
with one another; two of them are negatively correlated to the seven
others.
Note that these groupings of objects as well as of variables were
not at all obvious until the two outliers were removed (objects 65 and
66). The main objective—to look for separate groups—could be well
achieved after removing only two outlying, severely disturbing
samples. This revealed “hidden grouping” resulted in a new,
interesting geological hypothesis, ibid.

6.9 Summary: PCA in practice


A very powerful way of looking at a principal component model is as
a transformation in which many original dimensions are transformed
into another coordinate system with (far) fewer dimensions.
The transformation is achieved through projection. For example, a
data structure may be transformed from a 3-dimensional coordinate
system into a 1-dimensional system by projecting the data elements
(objects) onto the particular linear feature calculated (PC1 or higher-
order). The loading vector expresses this direction for the relevant
PC. If the variations along, say, the x3 axis are relatively small, and the
x1–x2 relations show up as a strong correlation, little information will
be lost by using the appropriate correlation feature alone. The lost
information is of course showing up in the model error, E (Figure
6.22).
Translated into the principal components model, the new
coordinate system typically has far fewer dimensions than the original
set of variables, and the directions of the new coordinate axes, called
principal components or t-variables, have been calculated so as to
describe the largest variations in the original data set. This is called
decomposition or data structure modelling, because the important
variations are extracted, and covariance in the data revealed, using a
sequence of (fewer) PCs, i.e. by decomposing into orthogonal
components, which are much easier to interpret. For the chemically inclined reader, bilinear PC-modelling has been likened to "mathematical chromatography", a particularly apt metaphor.

Figure 6.22: Summary illustration of scores and loadings. The direction of any PC is
described by its loading vector, which is composed of its directional coefficients. All objects
are projected down onto one, or more principal components, the coordinates of the resulting
footpoints are termed scores.

The coordinates of the samples in the new system, i.e. their coordinates related to the principal components, are called scores.
The corresponding relationships between the original variables and
the new principal components are called loadings. Each principal
component is fully described by its loading vector, the individual
loadings being synonymous with directional coefficients for the line
defined by the component. The differences between the coordinates
of the samples in the new and the old system, i.e. the information lost due to projection onto the fewer dimensions, make up the modelling error,
or the lack-of-fit with respect to the chosen model, E.
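As a concrete illustration of scores, loadings and the model error E, here is a minimal numpy sketch (our own illustration with simulated data, not code from The Unscrambler®; all names are illustrative):

```python
import numpy as np

# Simulated data: 10 objects x 4 variables built from two underlying "phenomena" plus noise
rng = np.random.default_rng(0)
true_scores = rng.normal(size=(10, 2))
true_directions = rng.normal(size=(2, 4))
X = true_scores @ true_directions + 0.05 * rng.normal(size=(10, 4))

# Centre each variable (column) before PCA
Xc = X - X.mean(axis=0)

# SVD gives the principal components: Xc = U S V^T
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

A = 2                    # number of components retained
T = U[:, :A] * S[:A]     # scores: coordinates of the object "footpoints" on the PCs
P = Vt[:A].T             # loadings: directional coefficients of the PCs

# Bilinear model and residual: Xc = T P^T + E
E = Xc - T @ P.T
print("Fraction of the total variance left in E:", (E**2).sum() / (Xc**2).sum())
```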
PCA is a very practical way to put the magnifying glass on a data
set to reveal embedded, or hidden information. The reader is referred
to several of the “solved PCA cases” presented in application notes
on CAMO’s homepages.

6.10 References
[1] Martens, H. and Næs, T. (1989). Multivariate Calibration. John
Wiley & Sons Ltd.
[2] Skoog, D.A., West, D.M. and Holler, F.J. (1988). Fundamentals
of Analytical Chemistry, 5th Edition. Saunders College
Publishing, pp. 466–472.
[3] Hotelling, H. (1933). “Analysis of a complex of statistical
variables into principal components”, J. Educ. Psychol. 24, 417–
441. https://doi.org/10.1037/h0071325
[4] Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in
Regression. Chapman and Hall.
[5] Thy, P. and Esbensen, K.H. (1993), “Seafloor spreading and the
ophiolitic sequences of the Troodos complex: a principal
component analysis of lavas and dike compositions”, J.
Geophys. Res. 98(B7), 799–805.
https://doi.org/10.1029/93jb00695
7. Multivariate calibration

The central issue in this book (after the necessary introductions to projections and PCA, chapters 4 and 6) is multivariate calibration.
This involves relating two sets of data, X and Y, by regression. In a
systematic context, multivariate calibration can be called multivariate
modelling (X, Y). Multivariate calibration is first addressed in general
terms before introducing the most important methods in more detail.
So far, a Y-matrix has not been dealt with at all, but from here on Y
will become a very important issue.

7.1 Multivariate modelling (X, Y): the calibration stage

PCA concerns modelling one X-matrix; this could be termed
multivariate modelling (X). In PCA a principal component model of
the essential covariance/correlation data structure in X is made. PCA
models are displayed as score plots, loading plots, variance plots etc.
Interpretation is an important problem-specific issue which was duly
dealt with above.
In contrast, multivariate calibration concerns two matrices, X and
Y. The Y matrix consists of the dependent variable(s) while X contains
the corresponding independent variables (in traditional regression
modelling terms). At this point it does not matter whether Y consists
of one variable or several, the following issues are common to both
cases. What matters is the establishment of a relationship between
the X and the Y matrices. A univariate Y is simply a single-column matrix, and a multi-column Y-matrix is treated in just the same way; this relaxed introduction to X → Y regression modelling is an effective facilitator for a smooth understanding of all essential aspects of multivariate
regression modelling; this is greatly helped by the fact that PCA(X)
constitutes the essential matter also for the apparently more complex
(X, Y) modelling.
The focus will first be on a general understanding of the concept
of multivariate calibration—multivariate regression. This concept is
shown visually in Figure 7.1.
The multivariate model for (X, Y) is simply a regression model for
the relationship between the (X, Y) data structures. A model is
established through multivariate calibration. Thus, the first stage of
multivariate modelling (X, Y) is the calibration stage.
But calibration is rarely just establishing/finding a model for the
description of the connection between X and Y. Moreover, the model
is intended to be used, for example for future prediction, i.e. to
estimate new Y-values from new measurements of X or in other
words, to predict Y from X under some optimality criterion. In general
prediction is therefore the second stage of multivariate calibration.
It is mandatory to start with a known set of corresponding X and Y
data of training data before new Y-values can be predicted. However,
in between these two stages comes a validation stage as is laid out in
full in chapter 8.

Figure 7.1: The process of multivariate calibration; establishing a regression model from
known X- and Y-data.

7.2 Multivariate modelling (X, Y): the prediction stage

The above overviews are repeated here, so that the purpose of this
entire chapter is quite clear: first, a multivariate regression model of
the (X, Y) relationship must be established through calibration (and duly validated). The
statistically correct way to describe this is to estimate the parameters
of the (X, Y) regression model [1]. Then this model can be used for
prediction; this latter process is shown visually in Figure 7.2.
The estimated regression model is applied to a new set of X-
measurements for the specific purpose of predicting new Y-values. In
this way, it is possible to use only X-measurements for future Y-
determinations, instead of making direct Y-measurements.
There are many reasons why multivariate calibration is so
extensively used, the main objective being to make as few direct Y-
measurements as possible. Typically, this is because the Y-
measurements may be expensive, difficult to obtain, ethically
undesirable, labour intensive, time consuming, dangerous etc. All
these characteristics of the Y-measurements have one thing in
common: it would be economically, logistically, practically …
desirable to replace them with X-measurements if these are simpler,
cheaper, faster …. and if these estimations can be made with a
known, acceptable small uncertainty. There are very many practical
application examples of this type of measurement in all of science,
technology and industry.
As a prime example, spectroscopic methods are often
implemented as fast alternative approaches that can measure many
simultaneous chemical and physical parameters (Y) indirectly. In the
same vein, there are also cases where it would be advantageous to
substitute several, perhaps slow and cumbersome, physical or wet
chemical measurement methods with one spectroscopic method.
Spectra obtained by this method would then constitute the X matrix
and the sought-after parameters, e.g. chemical composition, flavour,
viscosity, quality would constitute the Y matrix.
Another example would be “biological activity”, e.g. toxicity or
some beneficial feature, e.g. potency of a medical drug or tablet
dissolution time. The biological activity of a compound can be difficult
both to identify and quantify directly, because this may involve
animal, or human, experiments or similar. By using a set of
compounds with known biological activities (Y, the reference data),
connections can be made between easily determined physical parameters, e.g. molecular structure and molecular weight, and the resulting biological activity. This field is called QSAR, Quantitative
Structure Activity Relationships [2], in which indirect multivariate
calibration of the objects in the X-matrix really comes to the fore. A
related field, which is equally dependent on multivariate calibration, is
QSPR, Quantitative Structure Properties Relationships.
In the biological and environmental sciences, and in technology as
well as a legion of industrial applications, there is a tremendous need
for multivariate calibration. An attempt to illustrate these applications
as fully as possible will be made in this chapter, but a complete
coverage of the field would be impossible due to the virtually
unlimited potential of the methods. And neither is this necessary; all
that is needed is a fundamental appreciation of the powerful
usefulness of multivariate calibration. This must, at all times, be used
intelligently in the guarded fashion of being fully dependent upon
proper validation.

Figure 7.2: Using the multivariate regression model to predict new Y-values.

7.3 Calibration set requirements (training set)


The starting point is always a well-characterised set of known
measurements collected and assembled to become the data matrix X
together with the associated Y-values. The objects should be
characterised with the Y-method to be used in the future, when one
wishes to exploit the more desirable X-data based predictions (Ŷpred).
For any training data set, i.e. for each object (row) in the X-matrix, it is
mandatory also to measure the corresponding Y-value collected on
exactly the same “sample” as measured by the X-variables.* Y is, of
course, measured with the method one would like to substitute for in
future work, often called “the reference method”, so the pertinent Y-
matrix is also known in the calibration stage. This is the completely
general starting framework for the development of any multivariate
calibration model. The matrices X and the corresponding Y are
collectively called the calibration set or the training set.
The calibration set is of critical importance: the calibration set
must meet a number of requirements. The most important is that it is
representative of the future population from which the new X-
measurements are to be sampled,† and clearly the measuring
conditions should be strictly similar as well. As an illuminative
example, it makes very little sense to measure temperature-sensitive
spectra of a series of training set samples at 20°C, if/when the
prediction X-spectra have to be acquired at 28°C—and likewise
regarding all other influential conditions. The calibration set
conditions set the scene for what is possible to predict later …
Consider the case where spectroscopy is used to measure the
amount of fat in ground meat, instead of the more time-consuming
laboratory wet chemical fat determination methods. If the future
samples will all have a fat content between 1% and 10% only, then
obviously the spectra of meat with a fat content of 60–75% should
not be considered for the calibration set etc. These simplistic
examples may sound trivial, but the issues are far from and further
details can be found in the literature [3].
The demand that the training set (and the test set, see further
below) be representative is meant to cover all aspects of all
conceivable variations in the conditions influencing a multivariate
calibration. In the case of the ground meat example, suppose the problem context is stated as: “We only want to determine fat in these meat samples, so let us make our own training set from mixtures of fatty compounds in the laboratory to keep things simple”. It would then be known precisely how much fat had been added and how the training samples were made, concentrating on the fat component and adding the other meat and non-meat components in correct proportions etc. (possibly using a designed experiment, see chapter 11). The search for naturally occurring, complex, real-world meat samples, along with the tiresome fat measurements, would then seem unnecessary. Never entertain such thoughts! This idea is based upon univariate calibration theory, which, in reality, will seriously limit any creativity unless a rational design approach is taken.
In the case where a univariate approach is taken, the laboratory
spectra could be so different from the spectra (unless properly
designed) of the “real” meat that the laboratory-quality “fat model”
would not apply at all. Naturally so, because the artificial laboratory
training samples, no matter how precisely they would appear to have
been created, would not correspond to the complexities of the real-
world, processed meat samples. This may well be because of
significant interferences between the spectral signals from the various
meat and non-meat components, as occurring in the natural state—in
spite of their quantitative correct proportions.
A more stringent way of formulating the requirements of an
adequate calibration set is that it must span the X-space, as well as
the Y-space, as widely and representatively as possible, again in the
specific sense of the future usage of the final, correctly validated
prediction model. The consideration of designed experiments should
always be made when the situation allows such control of the
calibration situation. Experimental design helps to ensure that the
calibration set will indeed cover the necessary ranges of the
phenomena involved. However, there are very often some constraints
put on the training set, minor or major, the most often being that the
number of available samples is more-or-less restricted.
In many situations in technological and industrial scenarios, the
data analysts must be content with working with minimal intervention
possibilities which makes the calibration seem more like “we take
what we can get …” This case is not fully satisfactory from a
multivariate calibration point of view, however, and great ingenuity is
often called upon to add to this severely constrained situation, e.g. by
adding “random samples” (in the hope these might actually catch
some of the occasional complexity of the system under calibration).
Or it will sometimes be possible to tweak the system, X, a little (at
least) hoping to elicit some non-standard relevant change in the
response, Y.
Irrespective of the particular situation, awareness of the span and
the representativity of the X-data collected in the future is a must, and
also of the calibration set and its measurement conditions, since this
defines the possible application limitations of the model in future
prediction situations. Only very rarely will an analyst be lucky enough
that the application scenario can be extensively extrapolated beyond
the range of the calibration set.

7.4 Introduction to validation


The topic of validation is addressed in chapter 8 and is only briefly
touched upon here to give a minimum scenario for understanding
multivariate calibration. Validation means testing performance
according to an a priori set of test result specifications, i.e. is the
model fit for purpose? For example, validation testing is concerned with the prediction ability of a model on a new data set, which has not
been used in the development of the model. This new data set is
generically called the test set. It is used to test the model under
realistic, future conditions, specifically because it has been sampled
so as to represent these future conditions. Indeed, if needed, several
test sets can be used for added confidence in the model! “Realistic”
here means that the test set should be chosen from the same overall
target (lot or population as the case may be) as the calibration set,
and that the measuring conditions of both the training set and the
test set are as representative of the future situation as indeed
possible.
However, this does not mean that the test set in general should be
closely similar to the training set; this criterion only applies to the (X,
Y) span of these two data sets. For instance, it is a distinct
misunderstanding, in general, to simply divide the training set in two
halves (provided the original set is large enough), as has unfortunately
been recommended in the chemometrics community on numerous
occasions. There are a number of fundamental reasons for why this is
so, which are all discussed in context in chapters 3, 8 and 9.
The brief overview below is only intended to introduce the key
validation issues because they must be borne in mind when
designing and specifying a(ny) multivariate calibration. From a
properly conducted validation, one gets some very important
quantitative results, especially the optimal, or “correct”, number of
components/factors to use in the calibration model, as well as
proper, statistically estimated, assessments of the future prediction
error levels.

7.4.1 Test set validation

The procedure introduced above—using a completely new data set for validation—is called test set validation. There is an important point
here; the pertinent Y-values for the test set must also be known and
measured on the same samples as the X-variables were collected,
just as was done for the calibration set. The procedure involved in
test set validation is to let the calibrated model predict the Y-values
and then to compare these predicted values of the test set with the
known, real Y-values, which have been kept out of the modelling as
well as the prediction so far. Predicted Y-values will be referred to as
Ŷpred and the known, real Y-values as Yref (hence the term “reference”
values). Sometimes the term “reference validation” may be found in
the literature, obviously meaning the same as “test set validation”.
An ideal test set situation is to acquire a sufficiently representative
training set of measurements for both X and Y, appropriately sampled
from the target lot/population. This data set is then used for the
calibration of the model. Subsequent to this, ideally completely
separated in time/place from the calibration, an independent, second
sampling of the target is carried out, in order to produce a test set to
be used exclusively for testing/validating of the model—i.e. by
comparing Ŷpred with Yref. This second validation set is sometimes
called an External Validation Set.
The comparison of results can be expressed as prediction errors,
or residual variances, which quantify both the accuracy and precision
of the predicted Y-values, i.e. the error levels which can be expected
in future predictions. Regarding accuracy w.r.t. the original lot, referral
is made to chapters 3, 8 and 9.

7.4.2 Other validation methods

There is no better validation method than test set validation: testing on an entirely “new” data set. A data analyst should always strive to
use validation by test set. TEST IS BEST!
There is, however, a price to pay. To some, test set validation
apparently forces the taking of what appears to be “twice as many
samples” as would be necessary with the training set alone. However
desirable, there are admittedly situations in which this is manifestly
not always possible. For example, when the measuring of the Y-
values is (too) expensive, unacceptably dangerous or the test set
sampling is otherwise limited, e.g. for ethical reasons or when
preparing samples is extremely difficult etc. For this situation, there is
a somewhat viable alternative approach, called cross validation.
Cross validation can, in certain special cases and when applied
correctly, be almost as good as test set validation, but only almost—it
can never substitute for a proper test set validation, see chapter 8 for
details.
Too many times, cross validation is used indiscriminately in practice, either suggested to newcomers by overzealous instrument vendors who portray chemometrics as easy and automated by software (and who do not know very much about chemometrics themselves), or chosen by the disinterested analyst who could not be bothered to develop a test set. If there are 1000 samples available to model, why would a final model be developed using cross validation? One of the authors’ experiences many years ago was the use of Leave One Out (LOO) cross validation for a set of >6000 samples. The method did not even produce an acceptable calibration, and chemometrics got a bad name due to the analyst’s lack of knowledge of both chemometrics and validation principles.
Finally, there is a “quick and dirty” validation method called
leverage-corrected validation. This method uses the same
calibration set to also validate the model, only “leverage-corrected”. It
is hopefully obvious that this may be a questionable validation
procedure all depending on the quality of the corrections employed.
Furthermore, this approach often gives results, which (like cross
validation) are too optimistic. However, during initial modelling, where
the validation is not really on the agenda yet, this method can be
useful as it saves time and detects outliers when a large data set is
being analysed. The leverage-corrected validation approach is
included in The Unscrambler® mostly for historical reasons, and is
only used for really large data sets, in which case the recommendation is to use a proper two-segment cross validation. In
chapter 8 it is explained in more detail how these other validation
methods work and how they are related to test set validation.
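For readers who want to see the mechanics, the following is a minimal sketch (using scikit-learn rather than The Unscrambler®; X_train, y_train, X_test and y_test are hypothetical names for an already established calibration set and a separately sampled test set) of how a test set validation and a two-segment cross validation differ in practice:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.metrics import mean_squared_error

# X_train, y_train: calibration set; X_test, y_test: separately sampled test set (assumed given)
pls = PLSRegression(n_components=3)

# Test set validation: calibrate on the training set, predict the independent test set
pls.fit(X_train, y_train)
rmsep = np.sqrt(mean_squared_error(y_test, pls.predict(X_test)))

# Two-segment cross validation: uses the calibration data only
y_cv = cross_val_predict(pls, X_train, y_train,
                         cv=KFold(n_splits=2, shuffle=True, random_state=0))
rmsecv = np.sqrt(mean_squared_error(y_train, y_cv))
```

Note that RMSECV is computed from the calibration data alone and will therefore often be more optimistic than the test-set RMSEP.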

7.4.3 Modelling error

How well does a regression model fit to the X-data and to the Y-
data? How small are the modelling residuals? It is sometimes stated
that a good modelling fit also implies a good prediction ability, but
this is generally not necessarily so. It has to be proven to be so in
each case.

Initial modelling

Detection of outliers, groupings, clusters, trends etc. is just as important in multivariate calibration as in PCA, and these tasks
should always be first on the data analyst’s agenda. In this context,
any validation method in the initial screening data analytical process
may be used, because the actual number of dimensions of a
multivariate regression model is of no real interest until the data set
has passed this stage, i.e. until it is cleaned up for such outliers and
the remaining data set is internally consistent. In general, removal of
outlying objects, or variables, often influences the model complexity
significantly, i.e. the number of components will often change as a
result of the removal of highly influential outliers. However, the final
model must always be properly validated by a test set (or
alternatively, but reluctantly, using cross validation and only in those rare, specific circumstances), but never based on leverage correction.

7.5 Number of components/factors (model dimensionality)

As for PCA, the optimal, or “correct” number of components/factors
is equally essential for multivariate regression methods. For clarity
and consistency of terminology, components are used to describe
latent variables associated with PCA or the so-called Principal
Component Regression (PCR), which is based on PCA, and factors
are used to describe the latent variables associated with Partial Least
Squares Regression (PLSR) modelling. These model types will be
discussed in detail throughout this chapter. An operative procedure
to assess this number has not yet been described; however, for
multivariate calibration there is a very close connection between
validation and finding this “optimal” number of components.
Test set validation, cross validation and leverage correction are all
designed to assess the prediction ability, i.e. the accuracy and
precision associated with Ŷpred. To do this, Ŷpred must be compared to the reference values Yref. The smaller the difference between predicted and real Y-values, the better. The higher the number of PCs/factors used, the smaller this difference will be for the calibration set; for the validation samples, however, it only decreases up to a point, and that point marks the optimal number of components. This procedure is defined as
follows.

7.5.1 Minimising the prediction error

The prediction error is expressed as the residual Y-variance, based on the particular validation method used. It is expressed in several
forms and is typically studied for a varying number of alternative
components/factors in increasing order.
In Figure 7.3 the x-axis shows the number of components
included in the prediction model. The y-axis denotes a measure for
the prediction error variance, which is usually stated in two forms: 1)
the residual Y-variance (also called prediction variance, V(y)Val) or 2)
the RMSEP (Root Mean Square Error of Prediction), but the latter is
simply the square root of the former as defined by equation 7.1.
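Written out in the standard way (a reconstruction consistent with that description, with the sum running over the n validation objects and ŷ denoting the predicted value):

$$\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{y}_{i} - y_{i,\mathrm{ref}}\bigr)^{2}}$$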

As may be observed from Figure 7.3, the overall prediction ability is best when the prediction variance (prediction error) is at its lowest.
This is where the prediction error, the deviation between predicted
values and reference values, has been minimised in the ordinary sum-
of-squared-deviations sense. The plot in Figure 7.3 shows a clear
minimum at two factors, which indicates that this number is optimal.
Inclusion of more factors may improve the specific modelling fit for X,
but will simultaneously reduce the prediction ability. From the
practical point of view of prediction optimisation, this minimum
corresponds to the “optimal” complexity of the model, i.e. the
“correct” number of prediction model factors, also known as the
optimal model rank.

Figure 7.3: Model prediction error vs number of model components/factors.

Note that determination of this optimum is intimately tied in with the validation method used, indeed it is part of validation. It is,
the validation method used, indeed it is part of validation. It is,
therefore, very easy to obtain the correct dimensionality of any
multivariate calibration model—through the use of an appropriate
validation. This is somewhat different in relation to the case for PCA,
in which only the residual X-variance plot was at hand.
These introductory remarks on multivariate calibration will serve
as a sufficient background upon which to focus on the specific
regression methods available.

7.6 Univariate regression (y|x) and MLR


The concepts of bilinear regression methods are now presented:
Principal Component Regression (PCR) and the Partial Least Squares
Regression (PLSR) methods PLS1 and PLS2. The traditional
univariate regression and MLR will first be presented for comparison
purposes.

7.6.1 Univariate regression (y|x)

The simplest form of regression is the so-called univariate regression.


One X-variable only is measured, x, and one response property is
modelled, y. This is described by the usual straight-line relationship, performed every day by many users of spreadsheet packages such as Microsoft Excel (or similar) and defined by equation 7.2.
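In conventional notation (a reconstruction, using the same error term f as in the MLR model further below):

$$y_i = b_0 + b_1 x_i + f_i$$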

Univariate regression is undoubtedly the most often used regression method. It has been studied extensively in the statistical
literature and it is part of any university or academic curriculum in
which quantitative analysis is involved. It is assumed that the reader
is sufficiently familiar with this basic regression technique, but if
required there are scores of relevant statistical texts in the literature
section, for example Draper and Smith [1].
There is a serious potential problem with this approach, however,
as there are no modelling and prediction diagnostics available. It is de
facto impossible to detect situations where univariate regression
gives false estimates. An example of this is illustrated in Figure 7.4. In the left-hand panel, the concentration (y) of compound I in a solution is predicted from the spectroscopic absorbance at one selected wavelength (x), usually chosen to correspond to the peak height. If one or more unknown compounds, for example compound II, are present in the solution and if there is a significant overlap between these spectra in the wavelength
region in which (x) is measured, the observed absorbance will be the
sum of the absorbances of compounds I and II at the wavelength
chosen. This is illustrated in the right-hand panel of Figure 7.4. Thus,
when the spectra are overlapping, one wavelength alone cannot be
used to determine the concentration of compound I. This is a direct
consequence of the fact that the quantitative contribution from the
compound II spectra are unknown in the calibration. The univariate
calibration approach therefore fails, completely without the data
analyst knowing—not a happy thought!

Figure 7.4: Univariate regression: overlapping spectra in the wavelength region (x-axis).

7.6.2 Multiple linear regression, MLR

Multiple Linear Regression, MLR, is the classical method that combines a set of several X-variables in linear combinations, which
should correlate as closely as possible to the corresponding single y-
vector; see Figure 7.5.
In MLR a direct regression is performed between Y and the X-
matrix. Here, only one column vector y will be assessed at a time, but
the method can readily be extended to a whole Y-matrix—in this
case, independent MLR-models are made, one for each Y-variable,
based on the same X-matrix.
Figure 7.5: MLR: regressing one Y-variable on a set of X-variables.

The MLR model is provided in equation 7.3:
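Written out per object i over the p X-variables (standard notation assumed):

$$y_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_p x_{ip} + f_i$$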

This can be compressed into the convenient matrix form (equation 7.4):
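That is (the same expression quoted again in section 7.8.1):

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{f}$$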

The vector of regression coefficients b is found by least squares fitting so that f, the error term, is minimised, i.e. fᵀf is minimised, and the least squares solution is the only one that simultaneously minimises f for all objects. This leads to the following, hopefully well-known
statistical estimate of b (equation 7.5).
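In the familiar normal-equations form (standard statistical notation assumed):

$$\hat{\mathbf{b}} = (\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$$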

Estimating b involves matrix inversion of (XᵀX) and this may cause severe problems with MLR when there are collinearities in X, i.e. if the
X-variables correlate with each other. In some cases, matrix inversion
may become increasingly difficult and in severe cases may not be
possible at all. The (XᵀX)⁻¹ division will in such cases become
increasingly unstable (it will in fact more and more correspond to
“dividing by zero”). With intermediate to strong correlations in X, the
probability of this ill-behaving collinearity is overwhelming and the
usefulness of MLR will rapidly be reduced to the point where it will
not work. To avoid this numerical instability, it is standard statistical
practice to delete correlated variables in X so as to make X become
full rank. At best this may mean throwing away potential information.
To make things worse, it is definitely not easy to choose which
variables should go and which should stay. In the worst case, there
will be too many collinearities and the analysis cannot proceed.
In short MLR may fail when there is:
Significant collinearity in X
More variables than objects
Interference between variables in X
Furthermore, MLR solutions are not as easy to interpret as the
bilinear projection models that follow below. Again, these
“explanations” are only a first, non-statistical introduction into matters
that of course also should be studied in their proper mathematical
and statistical context when needed.
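To make the collinearity problem concrete, the short numpy sketch below (our own illustration with simulated data, not from the book) estimates b via the normal equations of equation 7.5 for two nearly collinear X-variables and shows how unstable the individual coefficients become:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Two nearly collinear X-variables: x2 is x1 plus a tiny perturbation
x1 = rng.normal(size=n)
x2 = x1 + 1e-4 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.1 * rng.normal(size=n)

# MLR estimate b = (X^T X)^-1 X^T y (equation 7.5)
b = np.linalg.inv(X.T @ X) @ X.T @ y
print("Condition number of X^T X:", np.linalg.cond(X.T @ X))
print("Coefficients:", b)   # individually unstable; only their sum (about 2) is well determined

# A minute change in y gives very different individual coefficients
b_perturbed = np.linalg.inv(X.T @ X) @ X.T @ (y + 0.01 * rng.normal(size=n))
print("Coefficients after perturbation:", b_perturbed)
```

Only the sum of the two coefficients is well determined; the individual values swing wildly with tiny perturbations, which is exactly the instability described above.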

7.7 Collinearity
Collinearity means that the X-variables are inter-correlated to a non-
negligible degree, i.e. that some of the X-variables (or in the worst
possible case, all of them) are linearly dependent to some degree; for
example,

x1 = f(x2,x3,…,xp)

If there is a high collinearity between x1 and x2 (see Figure 7.6), the variation along the solid line in the x1/x2 plane is very much larger than
across this line. It will then be difficult to estimate precisely how Y
varies in this latter direction of small variation in X. If this minor
direction in X is important for the accurate prediction of Y, then
collinearity represents a serious problem for the regression modelling.
The MLR solution is graphically represented by a plane through the
data points in the x1/x2/Y-space. In fact, the MLR model can directly
be depicted as a plane optimally least square fitted to all data points.
This plane will easily be subjected to a tilt at even the smallest
change in X, e.g. due to an error in the X-measurements, and thus
become unstable, and thereby more or less unsuited for Y-prediction
purposes.
In such a case, one possible “solution” is to pick out a few
variables such that the remaining variables do not co-vary (or which
correlate the least), and use the information in a combination of these
selected “independent” variables. This is the idea behind the so-
called stepwise regression methods, and this may sometimes work
well in specific applications but certainly not in all. Also, note that the
demands of a particular stepwise calculation method need to be
followed strictly; this is surely something all true data analysts dislike!
There are in general many problems in relation to step-wise methods,
for which reference to Høskuldsson [4] is made. It should also be
mentioned that there is a danger of misinterpretation of regression
coefficients in MLR models and the corresponding ANOVA table (e.g.
p-values) in general when the X-variables are not orthogonal (i.e.
independent). This problem occurs even for moderate collinearity
situations, long before the collinearity gives numerical problems in the
mathematical sense.
However, if the minor, “transverse” directions in X are more or
less irrelevant for the prediction of Y (which may be the case in
spectroscopy), collinearity is not a problem, provided that a method
other than MLR is chosen.
Bilinear projection methods, the chemometric approaches
focused on in this book (PCA, PCR, PLSR) actually utilise the
collinearity feature constructively, by choosing a solution coinciding
with the variation along the solid line in Figure 7.6. Thus, this type of
solution is stable with respect to collinearity.

7.8 PCR—Principal component regression

7.8.1 PCA scores in MLR


The score vectors, t, in PCA are orthogonal to each other, no matter
how many are calculated. Suppose a full PCA was performed and all
calculated PCs, in the form of score vectors, were used directly in
MLR? Since all the PCs were used, there would be no loss of X-
information. On the other hand, just the right type of orthogonal
(independent) variables have been created, which MLR has been
designed to handle in an optimal fashion.
Figure 7.6: Collinearity in X-space—leading to unstable Y-predictions for MLR.

Principal Component Regression, PCR, can therefore be thought of as a two-step procedure: first a PCA is used to transform X to
obtain the resulting T-matrix, which is then plugged directly into the
MLR model (equation 7.6), now giving.
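Assuming the obvious substitution of the score matrix T for X (a reconstruction, not necessarily the book's exact notation):

$$\mathbf{y} = \mathbf{T}\mathbf{b} + \mathbf{f}$$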

instead of y = Xb + f. This “MLR”, now called PCR for obvious reasons, would thus be stable. But not only that—by using the
advantages of PCA additional benefits are obtained, in the form of
scores, loadings and variances etc. which can help in the diagnosis
and interpretation of the model.
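The two-step nature of PCR can be sketched in a few lines of Python (an illustration under our own naming conventions, not the software's implementation): a PCA of the centred X gives the score matrix T, which is then used as the regressor block in an ordinary MLR.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """PCR in two steps: PCA of X, then MLR of y on the retained scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean

    # Step 1: PCA of X via SVD -> loadings P and scores T
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T          # loadings
    T = Xc @ P                       # scores (orthogonal columns)

    # Step 2: MLR of y on the scores; T^T T is diagonal, so the inversion is stable
    b_scores = np.linalg.solve(T.T @ T, T.T @ yc)

    # Express the model as coefficients on the original X-variables
    b = P @ b_scores
    b0 = y_mean - x_mean @ b
    return b0, b

def pcr_predict(b0, b, X_new):
    return b0 + X_new @ b
```

Because the score columns in T are orthogonal, the matrix to be inverted in the regression step is diagonal, and the numerical instability associated with inverting XᵀX disappears.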
7.8.2 Are all the possible PCs needed?

No, a full PC decomposition in the regression step is not required. In fact, it would be better if fewer components were used, since the
“later” components progressively more and more correspond to
noise. There is no a priori reason that using all the available variations
in X will necessarily create a better model for prediction of Y. In fact,
there may easily be data structure elements in X that have nothing to
do with Y at all, i.e. which are simply uncorrelated to Y. These
variance elements should be omitted or else they will disturb the
optimal regression modelling of Y. The problem is that PCA will
model any-and-all variations in the data. Thus, sometimes PCR will
be forced to work on the basis of a set of PC components which are
not all relevant for the Y-variation. This problem is exacerbated by the
fact that in general it is not known a priori when this situation occurs
—there is no general “warning label” associated with a particular (X,
Y) matrix to be modelled.
How is this selection problem solved?
Perhaps it comes as no surprise, but the solution is to use
prediction validation to determine the correct number of PCs to
include in the particular regression model. The procedure is to
continuously increase the number of PCs in the regression one by
one (i.e. increasing the dimension of T in Equation 7.6). Subsequent
use of the resulting model (the resulting vector b, for the augmented
number of components) in the prediction stage determines the
optimal number, thereby arriving at a proper validation variance plot,
in principle similar to Figure 7.3 above.
This validated number will, in general, differ from the optimal
number of PCs found from an independent PCA of X without regard
to Y. This is because the prediction ability determines the number of
components needed in order to optimise the prediction criterion, not
the PCA modelling fit with respect to X alone. This is the first time this
very important distinction between the alternative statistical criteria
has been met, modelling fit optimisation vs prediction error
minimisation, but it will certainly not be the last time in chemometrics;
see Høskuldsson [4] for a comprehensive overview.
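As a sketch of this procedure (illustrative only; X_train/y_train denote the calibration set and X_test/y_test a separately sampled test set, and scikit-learn is used for brevity), PCR models with an increasing number of components are fitted and the number giving the lowest test-set RMSEP is kept:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

def optimal_pcr_rank(X_train, y_train, X_test, y_test, max_components=10):
    """Return the number of PCs giving the lowest test-set RMSEP, plus the whole RMSEP curve."""
    rmsep = []
    for a in range(1, max_components + 1):
        model = make_pipeline(PCA(n_components=a), LinearRegression())
        model.fit(X_train, y_train)
        rmsep.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
    best = int(np.argmin(rmsep)) + 1      # +1 because component counts start at 1
    return best, rmsep
```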
Because the relevant competency regarding PCA has already
been achieved, and MLR and validation have been entertained in a
logical follow-up manner, it has now been possible to introduce all
the essentials of PCR in section 7.8. And now it is time for an
example!

Figure 7.7: PCR: using PCA scores in MLR (T) instead of X.

7.8.3 Example: prediction of multiple components in an alcohol mixture

This example is a continuation of the alcohol mixture problem presented in chapter 6, where PCA was applied to the data to see if any
meaningful conclusions could be derived from the data. It was found
that the MSC preprocessing method resulted in a score plot that
resembled the original mixture design, which justifies this preprocessing as appropriate to the problem.
For each mixture, the amounts of three alcohols, methanol, ethanol and propanol, are known and recorded as part of a Y-matrix. PCR can
model more than one Y-response, but unlike PLSR, it cannot make
use of information regarding correlations between the Y-responses.
Given the resounding success of bilinear data modelling, which thrives on correlations between the X-variables, what more might be achieved by also allowing similar collinearities in the Y-space to come into play? The expectations in chemometrics were very high. The reality turned out rather differently, however; see below.
Figure 7.8: PCR overview of the alcohol mixture analysis.

Figure 7.8 provides the standard four pane regression overview for the methanol model validated using the available test set.
Corresponding models for ethanol and propanol are also available
and will be described as appropriate in the following text.
As was the case with PCA, the first plot to investigate is the
Explained (or Residual) variance plot. This indicates that a two-PC
model is adequate for modelling the response methanol. In a PCR
(and PLSR) analysis, there are two types of explained variance plot,
one for X and one for Y. The Y-Explained Variance plot is shown by
default; however, a data analyst will also look at the Explained X-
variance plot for better understanding of the model. This plot is
provided in Figure 7.9 for the X-calibration data.
Figure 7.9: Explained X-calibration variance plot for response methanol, alcohol mixture
analysis.

Comparing the X- and Y-variance plots shows that two PCs describe the majority of the variance in both data tables and the two-
PC model can be considered as well conditioned. In some cases, it
may take three or more PCs to describe the Y-variance even though close to 100% of the X-variance is explained by one PC. This is a dangerous situation and represents a case where the additional PCs are being used to describe Y-variance that relates to <<1% of the variance in X.
This is why good data analysts always use all the diagnostic tools
available to their advantage. A converse situation may exist where
1% of X-variance may explain 100% of Y-variance. This is a good
situation and putting it into perspective, if 100 variables are being
measured, this represents a possible situation where only 1 variable
is related to Y—or maybe some of them are linearly correlated to the
same effect, i.e. only one “phenomenon” in X related to Y.
Now that it has been established that two PCs are all that is
needed to describe the response methanol, the next step is to look at
the score plot. In PCR, the score plot is identical to the PCA score
plot, since after all, PCA is the method used to extract the scores
from the X-data. This is identical to the score plot observed in
chapter 6 for the same data, so no more about this plot here.
The next plot to investigate is the regression coefficient plot. This
is the actual model to be used on new data to predict the methanol
content in new samples. This plot should be investigated for,
1) Similarity to the expected spectral features in the raw data and
2) Presence of noisy variables, outlined as abnormal features when
compared to the raw data.
Figure 7.10 shows a comparison of the MSC-corrected pure
component spectra and the two-PC regression plot for the model of
methanol.
The wavelength region 1150–1250 nm is highly significant for changes in methanol concentration, along with the 1400–1420 nm region. These regions relate to –CH and –OH
absorption regions respectively and therefore the model now has
chemical interpretability. From the score plot in Figure 7.10, the PC1
direction describes an increase in propanol concentration in the
positive PC1 direction with a corresponding increase in methanol in
the negative PC1 direction. From the regression coefficients plot, the
absorbances at 1160, 1205 and 1410 nm are most important for
measuring increasing methanol content, while the absorbance at
1190 nm is due primarily to the increase in ethanol response
(observed when the PC loadings are analysed). The loadings for the
methanol model for PCs 1 and 2 are provided in Figure 7.11.
Figure 7.10: Comparison of MSC-corrected pure component spectra and regression model
of the response methanol.
Figure 7.11: Comparison of PC 1 and 2 loadings for response methanol in alcohol mixture
analysis.

Overall, the overlaid PC1 and PC2 loadings in Figure 7.11 show
that PC1 is the methanol/propanol direction and PC2 is the
methanol/propanol vs ethanol direction.
The last plot to review is the Predicted vs Reference plot which
shows the quality of the predictions with respect to the reference
measurements for each calibration and validation object. Predicted vs
Reference plots are discussed in more detail in section 7.10.10. For
this example, the linearity of the Predicted vs Reference plot is
visually very strong and the calibration and validation points lie in
their expected spaces.
Regression coefficients and Predicted vs Reference plots for the
responses ethanol and propanol are provided in Figure 7.12 to
complete the analysis.
By comparison of all of the regression coefficients, it is apparent
that each alcohol component is modelled by different specific and
individual parts of the spectrum. This is an outcome of using a
designed experiment to generate the calibration set which is well
suited to support multivariate regression models for several Y-
variables.

7.8.4 Weaknesses of PCR

PCR is a powerful weapon against collinear X-data, and it is composed of the two most studied and used of the multivariate
methods, MLR and PCA. Despite this, it should also be clear that
PCR is not necessarily the final solution.
Observe, first, how PCR is a distinct two-stage process: PCA is
carried out on X from which a derived T-matrix is used as input for
the MLR stage, usually in a truncated fashion, i.e. only the A “largest”
components are used as determined by an appropriate validation.
There are no objections to this if an appropriate number of PCR-
components are used, but there is no warning against using either
too “many”, or too few components. If too many components are
used, the whole idea of projection compression is lost; if too few, the
model cannot become optimal under any circumstance.
But most importantly, with PCR there is one cardinal aspect still
not optimised: There is no guarantee that the separate PC-
decomposition of the X-matrix necessarily produces exactly what is
desired—i.e. only the structure which is correlated to the Y-variable.
There is no built-in certainty that the A first (“large”) principal
components contain only that information which is correlated to the
particular Y-variable of interest (however, from a purist’s point of
view, this is still a challenge to the data analyst involving optimisation
of the relevant pre-processing, sampling and variable selection).
There may very well be other variance components (variations)
present in these A components, or worse, there may also remain Y-
correlated variance proportions in the higher order PCs that never got
into the PC-regression model, simply because the magnitudes of
other X-structure parts (which may be irrelevant in an optimal (X, Y)-
regression sense) dominate. Again, this issue is a real challenge for
the more experienced chemometricians with extensive subject matter
expertise.
What to do?—well PLSR is the answer!

Figure 7.12: Regression coefficients and predicted vs reference plots for responses ethanol
and propanol, alcohol mixture analysis.

7.9 PLS-regression (PLSR)

7.9.1 PLSR—a powerful alternative to PCR


It is possible to obtain the same prediction results, but generally
based on a smaller number of components (sometimes called PLS-
Factors, or just Factors‡ since they describe the relationship between
X and Y and not just the structure in X), by allowing the Y-data
structure to directly influence the X-decomposition. This is
achieved by condensing the two-stage PCR process into one: PLSR
(Partial Least Squares Regression). In the literature, the term
commonly used is just PLS, which has also been interpreted to
signify Projection to Latent Structures. In this book, the terminology
PLSR will mostly be kept for completeness, unless misunderstanding
is out of the question.
PLSR has seen an unparalleled application success, both in
chemometrics and other fields, attested to by a myriad of successful
published calibrations, studies, case histories, research papers and
textbooks. In an article recently published by Swarbrick [5] regarding
the future of chemometrics in NIR spectroscopy, the author states
that PLSR will not see a rival any time soon. However, there are a lot
of smart people out there, so maybe, just maybe a new method is
only around the corner. But from a current perspective, PLSR is
definitely still going to be the method of choice for quite some time to
come. Amongst other features, the PLSR approach gives superior
interpretation possibilities, which are best explained and illustrated by
examples. PLSR claims to do the same job as PCR, only generally
with fewer bilinear components and is easier to interpret. The reason
this can indeed be achieved is because in PLSR the decomposition
of X is directed by Y, whereas for the PCR option the decomposition of X is only guided by internal data structures.

7.9.2 PLSR (X, Y): initial comparison with PCA(X), PCA(Y)

In order to facilitate a clear comparison between PCR and PLSR, focus is first placed on the regression part. PLSR uses the Y-data
structure, the Y-variance, directly as a guiding hand in decomposing
the X-matrix, so that the outcome constitutes an optimal regression,
precisely in the prediction validation sense introduced above. It is no
straightforward matter to explain exactly how this is accomplished
without getting into a mathematical or statistical exposition, or by
covering the details of the PLSR algorithm. This latter will be
explained in section 7.9.3 below, but here is nevertheless a simple introduction to PLSR by way of comparing it with PCA and PCR.
Following the geometrical approach used in this book and
picturing PLSR in the same way that PCR was introduced, Frame 7.1
presents a simplified overview of PLSR, or rather the matrices and
vectors involved. Some help will be at hand from the earlier PCA
algorithm accomplishments, e.g. the specific meaning of the t- and p-
vectors depicted (refer to chapter 4 for more details).
A first approximation to an understanding of how the PLSR
approach works (though not entirely correct) is tentatively to view it as
two simultaneous PCAs, i.e. a PCA (X) and a parallel PCA (Y). The
equivalent PCA equations are presented at the bottom in Frame 7.1.
Note how the score and loading complements in X are called T and
P, respectively (X also has an alternative W-loading in addition to the
familiar P-loading, see further below), while these are called U and Q,
respectively, for the Y-space.§
Note that the general case of several Y-variables (q) is being
treated here. This is not a coincidence. For the uninitiated reader, it is
easier to be presented with the general PLSR concepts in this fully
developed scenario (PLS2) than the opposite (PLS1). This strategy is
almost exclusive to this present didactic approach (as has been the
case in all editions of this book). Most other textbooks on the subject
have decided to start out treating PLS1 and later to generalise to
PLS2. However, over many years of experience, the authors of this book, as well as countless others, have found that the general PLSR concepts are far more easily related to both PCA and PCR when starting out with a focus on PLS2.
The case of one y-variable (PLS1) will later be considered as but a
simple boundary case of this more general situation, in fact only
involving reducing a q-dimensional Y-matrix to a 1-dimensional y-
vector. The distinction between PLS1 and PLS2 is thus a duality,
since the algorithms are very similar, apart from the need to iterate
towards convergence for PLS2.
In reality, however, PLSR does not really perform two independent
parallel PC analyses on the X and Y data from the two spaces. On the
contrary, PLSR actively interconnects the X- and Y-spaces by
specifying that the u-score vector(s) are to act as starting points for
(actually instead of) the t-score vectors in the X-space decomposition
calculations. Thus, in the PLSR method, the starting proxy-t1 is
actually u1, thereby letting the Y-data structure directly guide the
otherwise “PCA-like” decomposition of X. Subsequently, in a
completely symmetrical fashion, u1 is later substituted by t1 at the
relevant stage in the PLSR-algorithm in which the Y-space is
decomposed (symmetrical substitutions).
The crucial point is that it is the u1 (reflecting the Y-space
structure) that first influences the X-decomposition leading to
calculation of the X-loadings, but these are now termed “w” (w
stands for “loading-weights”). Then the X-space t-vectors are
calculated, in a “standard” PCA fashion, but necessarily using this
newly calculated w-vector. This t-vector is now immediately used as
the starting proxy-u1-vector, i.e. instead of u1, as described above. By
this means, the X-data structure also influences the “PCA (Y)-like”
decomposition. Computationally, the algorithm behind PLSR makes
use of an active, symmetric interchange of the t1 and u1 vectors in the
compound scheme, which subsequently performs the exact same
function for loading calculations as does the standard PCA.
This simple view of the inner workings of the PLSR approach is in
fact sufficient for an adequate overview and understanding of the
algorithm, see Frame 7.1.
The PLSR algorithm is specifically designed around these
interdependent u1 ⇒ t1 and t1 ⇒ u1 substitutions in an iterative way
until convergence. At convergence, a final set of (t, w) and
corresponding (u, q) vectors have been calculated for the current
PLSR factor (a), for the X-space and the Y-space, respectively. There
are really only a few minor matters remaining in the full PLSR
algorithm, to be further detailed below.
Frame 7.1: Partial Least Squares Regression: schematic overview of matrices and vectors.

Thus, what might at first sight have been viewed as two parallel
sets of independent PCA decompositions, is in fact based on these
interchanged score vectors instead. In this way, the goal of modelling
the X- and Y-space inter-dependently has been achieved. By
balancing both the X- and Y-information, PLSR actively reduces the
influence of large X-variations which do not correlate with Y, and so
does the job of removing the problem of the two-stage PCR
weakness outlined above.
In many algorithmic implementations of PLSR, a slope of 1 in the
relationship between t and u is assumed, thus Y can alternatively be
represented as Y = TQᵀ + F. Some chemometricians prefer the
simpler relationship based on a common object designation; T, in the
model equations, i.e. working with the Y-space without u-score
vectors as it were. While this will result in numerically identical
predictions, a.o. the possibility of insight into the operative X–Y part
models is completely lost with this option, however, for which
reasons most chemometricians prefer the full t–u approach. The
authors of this textbook are very strongly advocating this latter
approach precisely because of its vastly superior model
interpretability features. “There should be no PLSR modelling without
full model understanding”.
The Non-linear Iterative Partial Least Squares (NIPALS) algorithm was historically developed first [6]. It will be outlined in full
algebraic detail below with reference to the standard PCA algorithm
which was introduced in the same detail in chapter 4, section 4.14.

7.9.3 PLS—NIPALS algorithm

In the following description, particular numerical issues regarding the algorithm are not addressed; it is sufficient to appreciate the specific projection/regression characteristics of the PLSR NIPALS algorithm. Note that the PLSR-component index, termed "a" in the text, is called "f" here.
Centre and scale both the X- and Y-matrices appropriately.
Component index initialisation, f: f = 1; Xf = X; Yf = Y
1) For uf choose any column in Y (initial proxy u-vector)
2) wf = Xfᵀuf / (ufᵀuf); wf = wf / ||wf|| (w is normalised)
3) tf = Xfwf
4) qf = Yfᵀtf / (tfᵀtf); qf = qf / ||qf|| (q is normalised)
5) uf = Yfqf
6) Convergence: if |tf,new – tf,old| < convergence limit, stop; else go to step 2. (uf may be used alternatively)
7) pf = Xfᵀtf / (tfᵀtf)
8) bf = ufᵀtf / (tfᵀtf) (PLSR inner relation)
9) Xf + 1 = Xf – tfpfᵀ; Yf + 1 = Yf – bftfqfᵀ
10) f = f + 1
Repeat 1 through 10 until f = A (optimum number of PLSR factors by validation).

Explanation of the NIPALS algorithm for PLSR:


1) It is necessary to start the algorithm with a proxy u-vector. Any
column vector of Y will do, but it is advantageous to choose the
largest column, max |Yi|, as this will reduce the number of iterations
needed. Some algorithms stipulate the third Y-column, which will
also get the iterative algorithm going just as well, but likely with a
higher number of iterations needed. It actually doesn’t matter which
column one uses... just get the iterations started!
2) Calculation of loading weight vector, wf, for PLS-factor no. f.
3) Calculation of score vector, tf, for PLS-factor no. f.
4) Calculation of loading vector, qf, for PLS-factor no. f.
5) Calculation of score vector, uf, for PLS-factor no. f.
Steps 3 and 5 represent projection of the object vectors onto the
fth PLSR-factor in the X- and Y-variable spaces, respectively. By
analogy, one may view steps 2 and 4 as the symmetric operations
projecting the variable vectors, w and q, onto the corresponding fth
PLSR-factors in the corresponding object spaces. Note also that
these projections all correspond to the regression formalism for
calculating regression coefficients. Thus, the PLSR-NIPALS algorithm
has also been described as a set of four interdependent, “criss-
cross” X–Y regressions. This is a simple approach (if one is able to
grasp the underlying mathematics)—but also simple if one is only
able to appreciate the geometrical projections manifestations.
Note that in PLSR the “loading weights vector”, w, is the
appropriate representative for the PLSR-component directions in X-
space. Without going into all pertinent details, the central
understanding is that of the w-vector representing the direction which
simultaneously both maximises the X-variance modelled as well as a
more important criterion of maximising the Y-variance (actually the
driving criterion for PLSR): After convergence w reflects the direction
which maximises the (t, u)-covariance, or correlation (for auto-scaled
data) between the two spaces. Høskuldsson [4] describes these
relationships in a systematic mathematical manner, made more easily
understandable after having been introduced to the predominantly
geometric projection aspects of the same mathematical model
equations used here.
1) Convergence: The PLSR-NIPALS algorithm very often converges to a stable solution in fewer iterations than for the equivalent PCA/PCR situation, because the correlated data structures in the two spaces interdependently support each other. At convergence, for the stable solution, the PLSR optimisation criterion is composed of the product of both a modelling optimisation term and a prediction error minimisation term; the combined criterion is known as the H-principle, ibid.
2) Calculation of p-loadings for the X-space. These are mainly needed
for the subsequent updating of the model spaces, although some
statisticians prefer also to do the X-space interpretations based on
these p-loadings (as opposed to the w-loading weights). In this
matter opinions between professionals differ rather sharply, but
“luckily” the PLSR-predictions will be the same… Ergon [7] and
Ergon et al. [8] shed a great deal of light on this issue.
3) Calculation of the regression coefficients for the so-called “inner
X–Y space regression” is of paramount importance for PLSR
modelling. This inner relation can be graphically depicted as the “T
vs U plot” (t vs u), which constitutes the central score plot of PLSR,
occupying a similar interpretation role as does the equivalent (t, t)-
plot for PCA (X). There is of course also a double set of such (t, t)-
and (u, u)-score-plots available from a PLSR model for “(X, Y)
space-specific insight”. It bears noting that the inner PLSR relation
is made up of nothing but a standard univariate regression of u
upon t.
This inner relation is to be understood, literally, as the operative
X–Y link in the PLSR-model, calculated and visualised for one
component at the time. This link is specifically estimated one
dimension at a time (“partial modelling”), hence the original PLSR
acronym: Partial Least Squares Regression.
4) Updating:
Xf + 1 = Xf – tfpfᵀ; Yf + 1 = Yf – bftfqfᵀ
This step is often called deflation: subtraction of component no. f in both (X, Y) spaces. The p-loadings are only used in the updating, but the w-loading weights carry the effective model information. Using the w-vectors in the further model development secures the desired orthogonality of the subsequent t-vectors. Data analysts are particularly interested in orthogonal score vectors because this makes causal interpretations easy (much easier than if higher components display an oblique relationship to one another).
A note for more advanced insight: the p-loadings calculated for the
X-space after the w-vectors are in place, are not orthogonal. There
is a surprisingly big cache of issues related to which vectors are
orthogonal to one-another, or not—the theoretical literature is
extensive. This is most definitely not a place to begin your data
analyst career—but for the really interested data analyst, after
some years of practice, referral can be made to the magisterial
treatises by Ergon [7] and Ergon et al. [8], which work through all
pertinent issues with meticulous care.
5) The PLSR model: TPᵀ and UQᵀ are thus calculated—and deflated—for one dimension at a time. After convergence, the rank-one models, tfpfᵀ and ufqfᵀ, are subtracted appropriately, the latter expressed as Yf + 1 = Yf – bftfqfᵀ by inserting the inner relation, so as to allow for appreciation of how Y is related to the X-scores, t. By deflating, an updated set of matrices (X, Y) has been made ready for a new dimension to be modelled, i.e. the "next PLSR-factor" can now be calculated using the exact same algorithmic formalism—it is only the factor number "f" that changes, f = f + 1.
6) It is hopefully clear how the iterative NIPALS algorithm forces the
convenient “one factor at the time” scheme, which is a logical way
to approach a step-wise modelling of the interdependent data
structure of (X, Y). “Direct” decomposition methods exist that
calculate all components in one approach, e.g. Singular Value
Decomposition (SVD) or direct “bi-diagonalisation” approaches.
These numerical alternatives are not treated in the introduction
level of this book, but have been treated extensively in the
chemometric literature.
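For readers who find an explicit computational form helpful, the following is a minimal Python/NumPy sketch of steps 1)–10) above. It is illustrative only: the function name, the choice of the largest Y-column as the starting proxy u-vector and the convergence tolerance are assumptions made for this sketch, and X and Y are assumed to be appropriately centred (and scaled) beforehand.

import numpy as np

def pls2_nipals(X, Y, n_factors, tol=1e-10, max_iter=500):
    """Illustrative PLS2 NIPALS following steps 1-10 above.
    X (N x K) and Y (N x Q) are assumed centred (and scaled) already."""
    X, Y = X.astype(float).copy(), Y.astype(float).copy()
    N, K = X.shape
    nq = Y.shape[1]
    T = np.zeros((N, n_factors)); U = np.zeros((N, n_factors))
    W = np.zeros((K, n_factors)); P = np.zeros((K, n_factors))
    Qy = np.zeros((nq, n_factors)); b = np.zeros(n_factors)
    for f in range(n_factors):
        u = Y[:, np.argmax(np.linalg.norm(Y, axis=0))].copy()  # 1) proxy u: largest Y-column
        t_old = np.zeros(N)
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)                # 2) loading weights ...
            w /= np.linalg.norm(w)               #    ... normalised
            t = X @ w                            # 3) X-scores
            q = Y.T @ t / (t @ t)                # 4) Y-loadings ...
            q /= np.linalg.norm(q)               #    ... normalised
            u = Y @ q                            # 5) Y-scores
            if np.linalg.norm(t - t_old) < tol:  # 6) convergence on t
                break
            t_old = t
        p = X.T @ t / (t @ t)                    # 7) p-loadings (used for deflation)
        b[f] = (u @ t) / (t @ t)                 # 8) inner relation coefficient
        X -= np.outer(t, p)                      # 9) deflate X ...
        Y -= b[f] * np.outer(t, q)               #    ... and Y
        T[:, f], U[:, f] = t, u                  # 10) store and move to the next factor
        W[:, f], P[:, f], Qy[:, f] = w, p, q
    return T, U, W, P, Qy, b

At convergence, the columns of T, U, W, P and Q correspond to the (t, u, w, p, q) vectors of the successive PLSR factors, and b holds the inner-relation coefficients.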
What is the main difference between PLSR and PCR then? PLSR uses the information in Y actively to find the Y-relevant structure in X, with w representing the maximum (t, u)-covariance/correlation. Only after this criterion has been optimised will the X- and Y-space variance maximisation be attended to; in fact the X-variance modelled will by necessity be smaller than for an independent PCA(X) because the operative component direction is tilted in order to satisfy the maximum (t, u)-correlation criterion. This tilting is revealed as the difference between the w- and the p-loadings for each component.
Joint plotting of these alternative vectors can show exactly how much
information could actually be used from the X-space to maximise the
Y-predictability of the model established. There is almost always
some “deadwood” data structure in X, which is better off being
excluded from the PLS-regression model.
Thus, PLSR focuses on the co-varying relationship between these
two spaces, be this expressed as covariance or as correlation. This
then is why PLSR results in simpler models (fewer components). The
exception would be in the case where the X-data structure in fact
only contains Y-relevant information, a situation that certainly does
occur, but not frequently.

7.9.4 PLSR with one or more Y-variables

From a conceptual point of view, one might distinguish between two alternatives of PLSR: "PLS1", the situation with only one Y-variable, and the case with multiple Y-variables, "PLS2".
PLSR gives one set of X- and Y-scores and one set of X- and Y-
loadings, which are valid for all of the Y-variables simultaneously. If
instead one PLSR model is made for each Y-variable, one set of X-
and Y-scores and one set of X- and Y-loadings will be obtained for
each Y-variable. PCR also produces only one set of scores and
loadings for each Y-variable, even if there are several Y-variables.
PCR can only model one Y-variable at a time. Thus, PCR and PLS1
are the natural pair to match and to compare, while PLS2 is truly in a
class of its own.
From a historical data analysis point of view, the use of PLS2 was for many years thought of as the epitome of the power of PLSR: complete freedom—modelling X (an unlimited number of variables) together with an arbitrary number of Y-response variables, simultaneously. Gradually, however, as chemometric experience accumulated (literally over decades), it became increasingly clear that marginally better prediction models were almost always obtained by using a series of PLS1 models on the pertinent set of Y-variables (and sometimes much better than marginally…). The reason for this is easily understood, especially with 20/20 hindsight, see below.

7.9.5 Interpretation of PLS models

In principle, PLSR models are interpreted in much the same way as PCA and PCR models. Plotting the X- and the Y-loadings in the same
plot allows the study of the inter-variable relationships, now also
including the relationships between the X- and Y-variables in the
pivotal t–u plot(s).
Since PLSR focuses on Y, the most Y-relevant information is
usually expected in the first components. There are, however,
situations where the variation related to Y is subtler, or perhaps is
subdued at first by other variances not related to Y, in which case
relatively many components will be necessary to model a significant
fraction of Y. Modelling protein in wheat from NIR spectra, for
example, may require an exceptional 8–18 components, because the
dominating variations in the data may be related also to grain size,
packing, chemistry etc. By studying how the modelled Y-variance
develops with increasing numbers of components, it is often
immediately clear which components explain most of the Y-variance.
The number of PLSR factors to interpret is also a highly relevant topic, particularly in regulated industries like the pharmaceutical industry. Unlike many other application areas, where imprecise predictions carry no real life-threatening consequences, the pharmaceutical industry requires that all loadings (and loading weights) can be interpreted in the event of a serious incident resulting from an imprecise prediction. A general rule of thumb is that the number of components should be commensurate with the complexity of the system, i.e. if a binary mixture of powders is being analysed, ideally the number of PLSR factors in the model should be 1.
Practical limitations may affect this number somewhat, but if a model
of, say, 10 PLS factors was observed for the modelling of such a
simple binary mixture, serious questions must be asked about the
sampling and the reference method before any further work is done.
A classic example of PLSR applied to a binary mixture is the gluten-
starch data set presented at the end of this chapter.

7.9.6 Loadings (p) and loading weights (w)

All PLSR calibrations result in two sets of X-loadings for the same
model. They are called loadings, P, and loading weights (or just
weights or PLSR-weights), W.
The P-loadings are very much like the well-known PCA-loadings;
they express the relationships between the raw data matrix X and its
scores, T. These may be used and interpreted in the same way as in
PCA or PCR, so long as it is remembered that the scores have in fact
been calculated by PLSR (and are not necessarily orthogonal).
In many PLSR applications P and W are quite similar. When this is
the case, it is because the dominant data structures in X “happen” to
be directed more or less along the same directions as those with
maximum correlation to Y. In all such cases the difference is not very
interesting—the p and w vectors are pretty much identical. The
duality between P and W will only be important in the situation where
the P and the W directions differ significantly, however. This
difference between loadings and loading weights will be
demonstrated in the octane in gasoline example presented later in
this chapter.
The loading weights, W, represent the effective loadings directly
connected to building the desired regression relationship between X
and Y. Vector w1 characterises the first PLSR-factor direction in X-
space, which is the direction onto which all the objects are projected
in a PLSR model. In principle, this direction is not identical to the p1 direction, but the difference between p1 and w1 varies from case to case. It is precisely
this difference that provides information on how much the Y-
guidance has influenced the decomposition of X; one may think of
the PCA t-score as being tilted because of this PLSR-guidance. One
illuminating way to display this relation is to plot both alternative
loadings in the same 1-D loading plot or as side-by-side plots.
The sequential updating in the PLSR-algorithm implies that a
similar relationship to that between p1 and w1 also holds for the
higher-order PLSR-factors 2, 3,... The w-vectors that make up the W
matrix are all mutually orthogonal. Therefore, inspection and
interpretation of the loading weights in 2-D vector plots, e.g. w1
versus w2, can be performed exactly as was done for PCA (using p–p
plots), while remembering that W relates to the regression between X
and Y. Thus, W tells the effective PLSR “X-space loading story” with
respect to the inter-variable relationships.
In PLS2 models, there is also a set of Y-loadings, Q, which are the regression coefficients from the Y-variables onto the individual scores, U. Q and W may be used to interpret relationships
between the X- and Y-variables, and to interpret the patterns in the
score plots related to these loadings. The specific use of these
double sets of scores (T, U) and loadings (W, Q) are illustrated by the
practical PLSR-analytical examples to be presented below.
The fact that both P and W are important, however, is clear from
construction of the formal regression equation Y = XB from any PLSR
solution with A factors. This B “compaction” regression matrix is
derived as follows:

B = W(PᵀW)⁻¹Qᵀ
This B-matrix is often used for practical (numerical) prediction
purposes, if there is no specific need for interpretation of the model in
question.
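As a small illustration of this compaction, the sketch below (with illustrative names) forms B from the loading weights, loadings and Y-loadings of an A-factor model and applies it to new, appropriately centred and scaled objects.

import numpy as np

def pls_regression_matrix(W, P, Q):
    """Compact PLSR regression coefficients B = W (P'W)^-1 Q'
    from an A-factor model; W and P are K x A, Q is q x A."""
    return W @ np.linalg.inv(P.T @ W) @ Q.T      # K x q

# Prediction for new, appropriately centred/scaled objects Xnew:
# Yhat = Xnew @ B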

7.9.7 The PLS1 NIPALS algorithm

Now for the NIPALS algorithm for the PLS1 approach, in which the Y-
matrix is in the form of a single y-vector (Q = 1). Because the PLS2
algorithm was discussed in detail above, it will be very easy both to
understand PLS1’s specific features, but especially so to appreciate
its particularly simple “reductionist” position in comparison to PLS2.
In fact, by noting the matrix–vector substitution, Y → y, it will be possible to follow how the relevant steps in the corresponding PLS2 algorithm that concern the score vector u all simply collapse, with the, at first, rather surprising result that the convergence feature is actually made redundant. The result is a much simpler, non-iterative calculation procedure. As usual, centring and scaling first:
1) Centre and scale both the X-matrix and the y-vector appropriately.
2) Index initialisation, f: f = 1; Xf = X; yf = y
3) The y-vector is now its own proxy "u-vector" (there is only one Y-column)
a) wf = Xfᵀyf / ||Xfᵀyf|| (w is normalised)
b) tf = Xfwf
c) qf = yfᵀtf / (tfᵀtf)
d) pf = Xfᵀtf / (tfᵀtf)
e) Xf + 1 = Xf – tfpfᵀ; yf + 1 = yf – qftf
f) f = f + 1
Repeat until f = Aopt (optimum number of PLSR-factors by validation).
The PLS1 algorithm, and procedure, is as simple as this. There are no other bells or whistles. Because of its computational simplicity PLS1 is very easy to perform, but there are other, more important—indeed crucial—reasons why PLS1 has become the single most important multivariate regression method, which will be laid out below.
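A minimal Python/NumPy sketch of steps a)–f) is given below for illustration; the names are chosen for this example and X and y are assumed to be appropriately centred (and scaled) already. Note that there is no iteration loop, only a single pass per factor.

import numpy as np

def pls1_nipals(X, y, n_factors):
    """Illustrative non-iterative PLS1 NIPALS; X (N x K) and y (N,)
    are assumed centred (and scaled) already."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    N, K = X.shape
    T = np.zeros((N, n_factors)); W = np.zeros((K, n_factors))
    P = np.zeros((K, n_factors)); q = np.zeros(n_factors)
    for f in range(n_factors):
        w = X.T @ y                      # a) y acts as its own "u-vector"
        w /= np.linalg.norm(w)           #    (w is normalised)
        t = X @ w                        # b) X-scores
        q[f] = (y @ t) / (t @ t)         # c) Y-loading (a scalar for PLS1)
        p = X.T @ t / (t @ t)            # d) X-loadings
        X -= np.outer(t, p)              # e) deflate X ...
        y -= q[f] * t                    #    ... and y
        T[:, f], W[:, f], P[:, f] = t, w, p
    B = W @ np.linalg.inv(P.T @ W) @ q   # compact coefficients (K,)
    return T, W, P, q, B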

7.10 Example—interpretation of PLS1 (octane in gasoline) part 1: model development

7.10.1 Purpose
The production of petroleum (gasoline) in major oil refineries is big
business, but also one in which even small inefficiencies can translate
to large operational profit losses. One of the major tasks in gasoline production is to optimise the blend of products within tight octane number specifications. Octane number is a measure of the
burning efficiency of the fuel, with higher octane number gasolines
forming the basis of premium fuels (where the highest profit margin
can be made) and lower octane number fuels going into more
standard products.
It is not only the final octane number that requires optimisation,
but also the individual component amounts of the blend feedstocks
used to attain optimal octane numbers, i.e. blend components that
typically display higher octane numbers are either expensive (made
by synthesis), including components such as alkylates, or are
environmentally dangerous and must be regulated (e.g. benzene and other aromatic compounds). In modern refineries, where thousands of
litres of gasoline are produced every minute, if a deviation from recipe
is not detected in a timely manner, then gasoline of higher
performance may be blended into a lower grade product (this would
be to the consumer’s benefit) whereas, in the opposite case, lower
quality gasoline is being produced for premium product (the
consumer’s disadvantage).
It has been estimated that every 0.3 octane number deviation
resulting from inaccurate or inefficient blending costs a refinery
around 0.20 USD per barrel of product (2016 figures). Considering
that a small refinery produces 100,000 barrels (bbl) per day and a
large refinery up to 1.2 mil bbl per day, such “give away” costs range
between 20K and 240K USD per day of operation. If left in such an
un-optimised state even at 0.20 USD giveaway per barrel, over a
continuous year, a single refinery may well give away a very
significant part of its profits. This is why methods such as on-line
spectroscopy are highly desirable particularly for implementation into
blending operations. These systems produce real or near-real time
data that can be used to control feedstock flows into the blend
headers and control the gasoline production to tight specifications.
The following example shows how a model for octane number
prediction can be developed using NIR spectroscopy and PLSR.

7.10.2 Data set

The octane data set consists of two sample sets:


1) A training (calibration) set consisting of 26 gasoline sample spectra
spanning the range of products (high, medium and low octane)
measured between 1100 nm and 1550 nm.
2) A test set (for test set validation) consisting of 13 gasoline sample
spectra spanning the same space as the calibration set.
To add to the complexity of the analysis, a number of samples were prepared with an additive and included as part of the calibration and validation sets. These samples are not known a priori (well, they are, but here it is the data analyst's assignment to find them… a very realistic exercise).

7.10.3 Tasks
This data set will be analysed using The Unscrambler® PLSR
capabilities and a description of the steps involved will be outlined in
detail in this example. The main tasks are to develop a calibration
model for octane number using the training set and validate the
model using the test set.

7.10.4 Initial data considerations

1) Spectral data come in many proprietary formats and The Unscrambler® allows for importation of many of these formats.
Once the data set has been imported, it takes the form of a
spreadsheet where the data analyst must define Row Sets, i.e.
calibration and validation sets and Column Sets, i.e. Spectra (X)
and Octane (y). In the case of this particular data set, two sets of
external information were included,
a) Gasoline Type: high, medium and low octane content.
b) Additive: whether the sample contained an additive or not (but the data analyst is at first denied access to this added information).
2) Since spectral data are best visualised using line plots, it is
important to gain a first overview of the data for spectral
consistency and see if any suspect samples (primarily noisy
samples) can be detected in the data set. The line plot of the
spectra for the calibration and validation sets is shown in Figure
7.13. The spectra were sample grouped based on the external
information (categorical variable) Type.
Initial inspection of these data shows that there are two populations of samples present and the main region of difference occurs between 1350 nm and 1550 nm. The sample grouping does not reveal too much information otherwise. However, if the plot is zoomed into the region around 1200–1250 nm, Figure 7.14 shows that there is internal structure related to the type of gasoline.
There is a decrease in the absorbance at the shoulder centred at
1220 nm that is related to octane number, with increased absorbance
related to low octane and vice versa. Based on subject matter
expertise, this region relates to –CH3 absorbances and in particular
may be related to straight chain hydrocarbons in the samples, which
are known to be lower in octane number.
The corresponding line plot of spectra, now grouped by the
variable “Additive” are presented in Figure 7.15. It is perhaps not
surprising that the very marked differences in the spectral profiles in
the region 1350–1550 nm are in fact related to the additive(s). Clearly
the first order signature of a gasoline with an “additive” resides in this
wavelength interval—but can this also be found out using only the
wavelengths below 1350 nm?
Three simple plots (Figures 7.13–7.15) and already the complexity of the data set is quite clear to the data analyst. Ah—the power of proper initial inspection of X!

Figure 7.13: Line plot of NIR spectra collected on gasoline samples.


Figure 7.14: NIR spectra of gasoline in the 1200–1250 nm region showing the relationship of
absorbance with gasoline type.
Figure 7.15: NIR spectra of gasoline samples grouped by the variable Additive.

Samples collected for octane determination are typically measured using glass cuvettes, or flow-through cells with constant path length. Since the samples are typically non-turbid liquids, the spectra only required simple baseline correction as preprocessing and are now ready for analysis.

7.10.5 Always perform an initial PCA


No matter what the end goal of the multivariate data analysis is,
always perform a PCA to gain some first understanding of the
multivariate internal data structure. In this example, the calibration
data were used to produce the model and “validation” was
particularly easy, as it follows visually from the calibration curve plot.
The PCA overview is shown in Figure 7.16, which distinguishes
between calibration set and validation set sample scores.

Figure 7.16: PCA overview of the NIR octane in gasoline data set.

The interpretation of the data analysis is as follows:


1) The Explained Variance plot shows that the model dimensionality
should be 2, which models 98% of the total data variability.
2) There are two distinct groups of data as shown in the score plot.
PC1 describes this difference very well, while the rest of the sample
population is displaying a systematic within-group trend aligned
along PC2. This is the multivariate manifestation of the
relationships already disclosed by inspection of the raw spectra
alone.
3) The line loading plot shows that PC1 is solely attributable to the
additive signature spectral features between 1350 nm and 1550 nm
(the loadings below 1350 nm are all close to zero). PC2 is
describing changes in the –CHx region of the spectrum and in
particular is showing an inverse effect between the absorbance
band at 1150 nm and the absorbance band at 1200 nm.
4) The influence plot suggests that one sample with high leverage is the only potential outlier in the data set.
When the raw data were sample grouped by the feature “additive”
(Figure 7.15), these samples reveal themselves as the ones with the
very distinct features. The scores plot of Figure 7.17 shows the data
grouped by Additive, and these are distinguished based on spectral
loadings higher than 1350 nm.
Now for the most interesting test of the information density of the
spectra. The data analyst should try to re-do this analysis based only
on wavelengths below 1350 nm.
The question that now most likely arises is: "Is there a requirement for two quantitative models, one for samples with additive and one without?" Or can a multivariate PLSR account for
both these features in one model? These questions can only be
answered by performing the regression analysis. The answer might
surprise beginner data analysts…

7.10.6 Regression analysis

Regression analysis was performed using PLSR, using the calibration set to develop the model and the validation set to determine the
optimal model rank. The PLSR overview is presented in Figure 7.18.
Figure 7.17: Scores plot of octane in gasoline data grouped by Additive.

The interpretation of the data is as follows:


1) The Explained Variance plot shows that the model rank is now 3 and this describes 98% of the total Y-variability. Note how PLS factor 1 only describes a minor fraction of the total Y-variability, which is dominantly modelled by factor 2 with a small, but significant, last addition from factor 3. Correspondingly, the X-Explained Variance shows that 98% of the X-variance is explained already with two factors. This indicates that while factor 3 is useful for explaining Y, it is only using a very small amount of X to do so.
2) As also observed in the PCA, there are two distinct groups of data
as shown in the score plot. In fact, the t–t plots from PCA and
PLSR are quite similar, of course attributed to whether the samples
contain additive or not.
3) The regression coefficient plot shows a smooth continuity profile
along the wavelength axis and this indicates that the region that
mostly describes octane number is around 1200 nm with the region
around 1350 nm the second influencing contributor.
4) The Predicted vs Reference plot is highly linear and indicates that
while the samples may be in two distinct groups, the information
pertaining to octane number resides in a different part of the
spectrum compared to what separates the groups. Thus, it is
possible to model the (X, Y) octane number relationship regardless
of the ± additive features which dominated the independent PCA.
How is this situation at all possible? The answer lies in a joint
assessment of the loadings and the loading weights.

7.10.7 Assessment of loadings vs loading weights

It was observed in the PCA model that the spectral features that
separate the data into two groups occur in the region 1350–1550 nm
and from the initial PLSR, the spectral features related to octane are
in the region 1100–1350 nm. Figure 7.19 provides the loadings and
loading weights for factors 1 and 2 of the PLSR model displayed in a
four-pane overview.

Figure 7.18: PLSR overview of octane number in gasoline sample set.


The left-hand plots show the X-loadings (top) and X-loading weights (bottom) for PLSR factor 1. The X-loadings for PLSR can be interpreted in a similar manner to PCA loadings; in certain wavelength bands they differ in detail from the X-loading weights for PLSR factor 1, although the general form of the entire spectrum shows a strong communality. The loadings describe the source of major variability in the X-data and this is not directly related to the variability in Y. The X-loading weights for PLSR factor 1 show that the region of the spectrum most related to Y is between 1100 nm and 1300 nm and that little influence comes from the region of the spectrum above 1300 nm (which is where the additive signature resides).
The right-hand side of Figure 7.19 shows the loadings and loading
weights for PLSR factor 2. The loadings and loading weights are here
very similar and therefore the spectral variability in X, after removal of
the influence of PLSR factor 1, is being used to model Y.
This comparison answers the question regarding how the samples
that are outliers in PCA and the PLSR scores along factor 1 can be
modelled; the information in X pertinent to Y is found in the
“effective” loading weights, the w-spectrum. This is an illuminative
demonstration of the power of PLSR’s design feature: Y-guided X-
decomposition—one might say that PLSR cuts straight to the chase:
where in the X-data structure is the most Y-correlated information to
be found?

7.10.8 Assessment of regression coefficients

The regression coefficients for the three-factor PLSR model are shown in the top-right pane of Figure 7.18. The main thing to look for
in this plot (particularly for spectral data) is for high-loading (loading-
weight) intervals with smooth features (noise free), which are not too
complex in structure with respect to the original data. Looking at
these coefficients and mentally visualising them on an absolute scale
(i.e. thinking that the negative “peaks” are all positive), then the
regression coefficients look a lot like the original data, with the region
1350–1550 nm effectively weighted to zero. The interpretation of the
regression coefficients, in this case, is that of a component with an
absorption band dominated by the very broad peak centred at 1200
nm (i.e. low octane components). When gasoline components whose
characteristic absorbances are centred around 1160 nm and 1360
nm are present, these absorbances will be weighted higher and result
in higher octane number gasolines.

Figure 7.19: Assessment of PLSR factor loadings and loading weights for octane number in
gasoline data set.

It should be mentioned that the regression coefficients in a PLSR


model give an estimate of the so-called Net Analyte Signal (NAS)
which is the pure spectrum of the constituent of interest after
“subtracting” the pure spectra of the other constituents in the system
(the effective interferents). The subtraction in this case takes place in
mathematical terms as an orthogonalisation. The “art” of interpreting
compound regression coefficients is not at all easy, since all Aopt
PLSR components have been compacted to just one vector. Thus,
the coveted superior insight into the full “mechanics” of a multi-
component PLSR model is again lost… Høskuldsson [4] discusses
these features in a scholarly fashion.

7.10.9 Always use loading weights for model building and understanding

For these reasons, an experienced chemometrician will never just use regression coefficients—on the contrary. One should always make use of the most informative insight into the operative X–Y relationships, which is provided by the complete set of Aopt loading weights. Regression
coefficients carry the same numerical information needed for the
subsequent prediction use of the model, as does the alternative full
dimensional (Aopt) model formulation. There is a famous dictum in
chemometrics: “No model without understanding (w’s); no prediction
without validation” (attributed to Harald Martens). Thus, always use
loading-weights for an understanding of WHY and HOW the model
works (X–Y relationships)... and never press on with prediction before
proper validation has been performed.
In the case of a single Y-variable, the individual elements in the loading weight vector for the first component are proportional to the individual correlations between each X-variable and Y when the variables are scaled to unit variance (another good reason to always use auto-scaling of both the X- and Y-spaces). For mean-centred X-variables the loading weights express the covariances. This also implies that the regression coefficients for a one-component model express the same relative importance of the individual X-variables.
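This proportionality is easily verified numerically. The following small sketch, using synthetic data purely for illustration, autoscales X and y and compares the normalised first loading weight vector with the normalised vector of X–y correlations.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

# autoscale (centre and scale to unit variance)
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
ys = (y - y.mean()) / y.std(ddof=1)

w1 = Xs.T @ ys
w1 /= np.linalg.norm(w1)                       # first PLS1 loading weight

corr = np.array([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(6)])
corr /= np.linalg.norm(corr)

print(np.allclose(w1, corr))                   # True: w1 is proportional to the correlations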

7.10.10 Predicted vs reference plot


The desired features of a Predicted vs Reference plot are a slope
close to 1 and that all points lie close to the fitted line. There are
always two prediction lines available for PLSR modelling,

Figure 7.20: a) Calibration and b) validation predicted vs reference plots for octane in
gasoline data.

1) The calibration line is an assessment of the model fit to the objects used in its construction and
2) The validation line, which depends on the method of validation
used, and which assesses the model against “independent”
reference samples, particularly when test set validation is used.
The Predicted vs Reference plot for the calibration and validation
sets is shown in Figures 7.20a and b.
The calibration and validation statistics for the three-factor PLSR
model are provided in Table 7.1.
The cardinal rules for interpretation of regression model statistics
are:
1) A slope of close to 1.00 is desired and there should preferentially
be close agreement between the calibration and validation lines
(which there is in this case). A slope close to 1.00 speaks of good
(very good) accuracy with respect to the prediction performance of
the PLSR model.
2) RMSE/SE(C/P) should be in agreement with the uncertainty of the
reference method, taking into account propagation of errors. The
Pharmaceutical Analytical Services Group (PASG) has suggested
that as long as the RMSE/SE(C/P) is less than or equal to 1.4 × SEL
(where SEL is the Standard Error of Laboratory), then the primary
reference method can be replaced with the alternative prediction
method. In the present case, the reference method can measure
octane number down to a precision of 0.2 octane numbers,
therefore the calibration and validation sets must have standard
errors less than or equal to 0.28 octane numbers. The present
model meets this criterion. RMSE/SE(C/P) speaks of the precision
obtainable with the PLSR model in question.

Table 7.1: Calibration and validation statistics for octane in gasoline PLSR model.

Parameter                                   Calibration model   Validation model
Elements (number of objects in the set)     26                  13
Slope                                       0.98                1.01
Pearson's correlation (R2)                  0.98                0.98
RMSE(C/P)*                                  0.27                0.23
SE(C/P)**                                   0.27                0.24
Bias                                        <0.001              0.008

*RMSE: Root Mean Square Error, C: Calibration, P: Prediction
**SE: Standard Error

3) Bias should not be significant in the statistical sense. One common measure with which to assess the significance of a bias is the following rule of thumb: –SE(C/P) ≤ Bias ≤ +SE(C/P). This model also meets the criterion for no significant statistical bias. The strong criterion for bias is "close to zero".
4) Pearson's R2 is the final statistic to assess, but specifically only after the correct, optimal number of factors has been found for the model. It has been the present authors' observation over many years of experience with new users in chemometrics that R2 is quickly viewed as the universal justification for model validity. NOTHING CAN BE FURTHER FROM THE TRUTH. In many situations, validation is discarded in favour of quickly adding components to a model, to achieve the best possible calibration model R2 statistic. But as R2 MUST ALWAYS improve with the inclusion of more PLSR factors, this is only partially a measure of the true predictive ability of the model, and a bad one at that. Only proper validation will give a balanced view of the dual criteria necessary to perform a valid performance validation: slope (accuracy) and R2 (precision). People who suggest the single-minded R2 approach to model development do not know whereof they are speaking.
RMSE and SE are defined in detail in section 7.12.
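As a small illustration of how such acceptance checks can be made operational, the sketch below applies them to the validation figures of Table 7.1; the slope tolerance of 0.05 and the SEL of 0.2 octane numbers are assumptions made for this example.

def check_regression_model(slope, rmse, se, bias, sel):
    """Apply the rule-of-thumb acceptance checks discussed above."""
    checks = {
        "slope close to 1":         abs(slope - 1.0) <= 0.05,   # illustrative tolerance
        "RMSE <= 1.4 x SEL (PASG)": rmse <= 1.4 * sel,
        "bias within +/- SE":       -se <= bias <= se,
    }
    return checks

# validation figures from Table 7.1, SEL = 0.2 octane numbers
print(check_regression_model(slope=1.01, rmse=0.23, se=0.24, bias=0.008, sel=0.2))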

7.10.11 Regression analysis of octane (Part 1) summary

At this stage, a short summary of the octane number data analysis will be presented before the next sections, which describe all the
important model diagnostics to consider when interpreting and
validating a regression model. This example will be returned to later
as well.
The octane number data set is an example of “clean”, or “precise”
data, where sampling and preprocessing issues are at a minimum,
resulting in “easy to model” data. The only real issue that could have
arisen in this data set was the collection of poor quality reference
data. As long as the important rule that the same sample scanned is
the one sent for reference analysis is followed, then the risk of bad
reference data is further minimised. But it must be kept in mind that
this “same sample” may or may not be representative of the mixture
stream intercepted at the refinery, an issue that formally lies outside
the data analysis realm and context, but see chapters 3, 8 and 9 for
the strong obligations that weigh on the shoulders of cognisant and
responsible data analysts.
PCA and PLSR of the data set revealed two distinct groups, one
set where an additive was part of the formulation and a set without
this additive. PCA and PLSR loadings were able to detect where the
spectral variables associated with the additive were and sample
grouping helped to confirm the observations. Could this feature also have been elucidated based only on wavelengths below 1350 nm?
Even with the rather disparate samples included, the regression
model was found to be acceptable, with error estimates for a three-
factor PLSR model that met RMSE/SE(C/P) statistics and good
linearity. There was excellent interpretability in the structure of the
loading-weights as well as regarding regression coefficients, which
made overall sense for a situation where such a model could be
tested in a real-world application, including at-line or on-line blend
monitoring.
In part II of this example, the concepts of interactive model
development and variable selection will be addressed. This will
illustrate some practical ways to refine models by perhaps not using
all of the data available and also show the difference between
eliminating suspect samples and refining variables.

7.10.12 A short discourse on model diagnostics

There are several diagnostic procedures available to the data analyst to assess the quality of a multivariate regression model; in fact they can be used for nearly any multivariate model. The models described in this chapter aim to describe the structures of both the X- and Y-matrices, and the residual terms EX and EY (one for each matrix type) can both be used to provide useful information about the model quality. The general forms of the model for the X- and Y-variables, respectively, are

X = TPᵀ + EX and Y = TQᵀ + EY
Analysis of residuals is carried out for several purposes, including,


1) Detection of outliers or suspect samples (as observed in the
identification of the two populations of samples in the octane
number data, refer to section 7.10).
2) Identification of systematic variation(s) which were not accounted
for by the model (especially when variables are closely similar, such
as in a spectrum or a chromatogram),
3) Detection of drifts or trends, or unexpected jumps in the predicted
values generated by the model.
Detection of the classes of samples described in points 1 and 2 mainly relates to these being potential outliers in the X-space, i.e. samples that don't fit the model well, or extreme samples that are then termed influential. The residuals described in point 3 are those related to predicted values being outside specific statistical limits with respect to the perfect residual state ɛ = 0, where

ɛ = ŷ – yreference
Each of these residual types will be discussed in the next


sections. For a comprehensive treatment of residuals in regression
analysis, the interested reader is referred to the book by Cook and
Weisberg [9].

7.10.13 Residuals in X

In the algorithmic details of PLSR as defined in sections 7.9.3–7.9.7, one of the final steps in the process was deflation. In this step, after scores and loadings have been calculated for a PLSR factor, their vector multiplication (outer product) can be used to reconstruct that part of X described by the salient component (component "f"), which is then subtracted from X. This is shown below in equation 7.7.

Xf + 1 = Xf – tfpfᵀ (7.7)
If Xf starts out being the original data matrix X, then Xf + 1 may generically be understood as the remainder of what is left over after factor f has been subtracted. When expressed in terms of residuals, equation 7.8 describes how the X-residual term is calculated for a model with any number of PLSR factors (f).

E = X – t1p1ᵀ – t2p2ᵀ – … – tApAᵀ = X – TPᵀ (7.8)
In this case, A is the optimal number of PLSR factors for the model. If a particular object residual is "large", or displays any form of
systematic and therefore interpretable structure, there may be two
main causes for this,
1) Not enough PLSR factors have been included in the model to
account for the systematic variation still remaining in the X-data
caused by the object in question.
2) The object is a true outlier and its structure is so much different
from the rest of the samples that it should be deleted from the data
set.
X-residuals in PCA, PCR and PLSR also find good use in
assessment of spectral outliers, because regions where the spectra
are not being adequately modelled can be visualised and interpreted
in full wavelength detail. X-residuals alone can be subjective when used as a visualisation tool. Taking the sum of squared residual values can reduce the X-residual to a single value for each sample. This is described by equation 7.9, which is the standard form of the sum of squares for a specified vector or matrix.

SSEi = eieiᵀ = Σk eik² (7.9)
Replacing ei with E performs the calculation over all objects in the data set. The objective assessment of residuals allows for statistical limits to be placed on the size of the X-residual and therefore provides a means to reject objects in a real-time or other situation. To do this
requires the use of the so-called Q-residuals and F-residuals.

7.10.14 Q-residuals
Q-residuals were first introduced by Jackson and Mudholkar [10] to
detect outliers that can be the result of,
1) Too few components used to adequately describe the original data
X.
2) The samples are truly outliers for a specific model.
Q-residuals are calculated from the regular X-residual as a squared sum,

Qi = eieiᵀ = Σk eik²

In order to utilise the Q-residual as a means for rejecting outliers, a critical value for Q can be obtained from equation 7.10.

where Qa is the critical value for the Q-distribution and the remainder
of the terms can be found in the standard text by Jackson [11].
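A minimal sketch of the per-object Q-residual calculation is given below; the names are illustrative, X is assumed to be centred, and T and P are taken to be the scores and loadings of the fitted model. The critical limit of equation 7.10 is not reproduced here.

import numpy as np

def q_residuals(X, T, P):
    """Per-object Q-residuals: row-wise sum of squared X-residuals,
    E = X - T P', for an A-factor model (X assumed centred)."""
    E = X - T @ P.T
    return np.sum(E**2, axis=1)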

7.10.15 F-residuals

F-residuals are further calculated from the Q-residuals and are considered more conservative than Q-residuals. The F-residuals from
the calibration are calculated as per equation 7.11, in which K is the
effective number of variables (the number of degrees-of-freedom):

The validated F-residuals are calculated by adjusting for the degrees of freedom as more factors A are included in the model
(equation 7.12).
where Fi is the F-residual; Qi is the Q-residual; K is the number of
variables; and A is the number of components used in the model.
The F-residual is compared for significance against the standard
F-test hypothesis, (refer to chapter 2 for more details on F-tests),

where i is the number of samples in the model. The major advantages of the F-residual over Q-residuals include,
1) The F-test is a more established test compared to the Q-test based
purely on statistical grounds.
2) F-residual assessment can be applied to both calibration and
validation residuals, where Q-residuals are only applicable to
calibration residuals.
3) The correction for degrees of freedom gives a more conservative
estimate in assessing the robustness of the model and individual
outliers.

7.10.16 Hotelling’s T2 statistic

Hotelling's T2 statistic was defined in chapter 6, section 6.6.1 and was described as the multivariate equivalent of Student's t-test; in particular, it gives a 95% multivariate confidence interval for the PCs
particular, it gives a 95% multivariate confidence interval for the PCs
under investigation. Figure 7.21 shows the T2 ellipse plotted on the
score data from the octane in gasoline example and shows how this
statistic can be used to detect an outlier(s) in scores space.
Since Hotelling’s T2 is based on a statistical test, the size of the
ellipse can be set to match a desired significance level. In the case
where there is minimal risk to the end user of a product or process,
the significance level can be set low, therefore allowing the ellipse to
contain a larger part of the score space. However, when the ellipse is
being used to monitor critical processes, or processes used to
manufacture life critical products, then the significance level can be
set higher to allow for less acceptance of suspect products. Typical
significance levels are listed as follows,

Figure 7.21: Using Hotelling’s T2 ellipse in score space to detect outliers.

1) 1% significance (99% confidence): typical setting for safe product manufacture where acceptance of highly variable product imposes no economic or health risk to the end user.
2) 5% significance (95% confidence): this is the usual benchmark
used by most statisticians to define a 1 in 20 chance of accepting a
bad product/result.
3) 10% significance (90% confidence): a more conservative setting
than the 5% level where the rejection of outliers is critical, but the
inclusion of suspect samples is not detrimental to the business or
end user.
4) 25% significance (75% confidence): typically used in highly critical
processes and analyses where the rejection of even suspect
samples is tolerated such that the absolutely lowest risk of harm to
the end user results.
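As a sketch of how the statistic and a corresponding limit can be computed from the model scores, the following assumes the commonly used A(N – 1)/(N – A) · F form of the limit; the exact convention may differ between software implementations.

import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2(T, alpha=0.05):
    """T2 per object from model scores T (N x A), plus an F-based limit.
    Uses the common A(N-1)/(N-A) * F(alpha; A, N-A) form for the limit."""
    N, A = T.shape
    s2 = np.var(T, axis=0, ddof=1)          # score variance per component
    t2 = np.sum(T**2 / s2, axis=1)
    limit = A * (N - 1) / (N - A) * f_dist.ppf(1 - alpha, A, N - A)
    return t2, limit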
7.10.17 Influence plots for regression models

Influence plots were also described in detail in chapter 6, section 6.6.4. These are the standard outlier detection tool for multivariate
models. In regression modelling and application of regression
models, they serve the purpose of ensuring that a new object has
similar characteristics to the calibration set of objects. They only utilise the X-structure of the data; however, if a prediction value is deemed to be suspect, the influence plot can be used to assess the quality of the sample being measured.
If both the Y-predicted value and the X-data are suspect, then this
indicates that the new sample being measured is different, or that the measurement system being used may be out of calibration. If the Y-value is not suspect, but the sample is detected as an outlier in the influence plot, this may indicate that a new sample matrix has been introduced to the system; however, it will not impact the predicted Y-values (similar to the observation made in the octane in gasoline example, where the samples with additive were spectrally different, but this had little to no effect on the prediction of octane number).

7.10.18 Always check the raw data!


In all cases, a data analyst should always go back to the raw data and
check what makes the outlying object(s) special. Whenever possible,
consult those who collected the data. Maybe it is an erroneous value
caused by an instrument breakdown or a reading mistake, or
typing/data transcription mistake? Maybe this object has been
collected under very different conditions than the rest (but this was
recorded in the log). Or maybe it is just an “accidental extreme”,
whatever is meant by this catch-all phrase. Checking the raw data
very often gives valuable information to be used later in the analysis.

7.10.19 Which objects should be removed?


From a strictly data analytical point of view, objects that do not fit in with the rest should only be removed with justification. When such justification exists, removal is necessary, otherwise outlying objects will harm the model. It is always wise to remove only one or two
outliers at a time, starting with the most extreme ones as revealed by
the first components. Very often removal of a serious outlier results in
significant changes in the remaining data structure, when modelled
again. As one example of more or less counterintuitive results, two "less serious" outliers in one run may behave more like the norm in the next run. It is strongly suggested to perform several outlier-screening runs iteratively instead of trying to catch all outliers in one go.
If an “outlier” is removed that is really only an extreme end-
member object (along the direction of the component model in
question), the model may not get better, or in some cases may even
get worse. Extreme end-member objects actively help to span the
model. Extreme end-member samples are easy to spot. They occupy
the extreme ends of the T vs U regression lines, while outliers lie
orthogonally off this line (perpendicular to the model). This distinction
is much more difficult to appreciate if only numerical outlier warnings
are used, such as leverage indices etc. Always study the score plots
carefully (t–u). It is not comprehensive enough, in fact it is principally deceptive, to try to catch outliers in regression modelling by looking at the t–t and/or the u–u plots alone. These will never be able to catch the "transverse outliers".
In regression modelling, and in The Unscrambler® software, the
critical plot is called the “X–Y relation outliers” plot (refer to section
7.13)—this is the absolutely most important plot for modelling. There
may of course also be problem-specific reasons to dig into how the
objects are distributed in the X-space in a specific PLSR-solution, via
the score plots, i.e. for interpretation purposes.
For a detailed explanation of the diagnostics used in multivariate
calibration, the interested reader is referred to the chapter on
chemometrics by Swarbrick and Westad [12], and there is much
general information on the context of this critical issue in the
validation benchmark paper by Esbensen & Geladi [13].
7.10.20 Residuals in Y

This section concerns the main outputs generated by the calibration model, i.e. the predicted Y-variables (ŷpred). These values are typically
used to make business critical decisions and therefore must meet
some predefined statistical criteria, or similar, in order for the values
to be accepted. This process can only be performed during the
training and validation phase of model development, because when
the method is being used in a real situation, there are no longer any
reference values to check against, unless periodic control samples
are used to test the model’s long-term stability.
The y-residual ɛ for a particular object is calculated as the difference between the object's predicted value generated by the model and the value obtained on the same object measured using the reference method,

ɛi = ŷi – yi,reference
Since the method of least squares is used to develop the calibration model, the usual assumptions regarding the distribution of the residuals are assumed to hold, in particular that the residuals are normally and independently distributed around zero with constant variance s².
This constant and normal variability of residuals around the zero
residual is known as “homoscedasticity” and it is typically a
necessary assumption for a robust regression model. To assess the
normality of residuals, the use of Residuals vs Predicted plots or
Normal Probability Plots can be used. In calibration development, the
plot of calibration residuals and validation residuals should display a
random distribution around the zero-residual point with no trending
visible when the residuals are plotted vs increasing reference values.
In this way, Studentisation of the residuals is possible where ±3
standard deviation limits can be placed around the zero line in order
to detect potential outliers. There are a number of common patterns
that an analyst must be aware of when assessing the residuals of a
calibration model and these are shown in Figure 7.22.
Figure 7.22a represents a normal and homoscedastic distribution
of residuals which could be said to be the ultimate goal of any
calibration model. Figure 7.22b shows the presence of one major
outlier. This will skew the distribution of the residuals from normality
into the direction of the outlier. Figure 7.22c shows the
heteroscedasticity problem, i.e. the residual variance is non-constant
over the regression line. This could, for example, occur when either
the analysis or the reference method has better precision at one end
of the scale compared to the other. Methods such as generalised
least squares or weighted least squares can be used to minimise the
variance at the extremes of the model, however, what is transformed
must also be back-transformed. Figure 7.22d shows the case where
a linear model is not an acceptable fit to the data. This situation
requires the choice of a linearising preprocessing method or the use
of a quadratic or polynomial model fit criterion.
Figure 7.22: Common patterns observed for residuals when developing a calibration model.
Figure 7.23: Normal probability plots of residuals for an acceptable and an unacceptable
calibration model.

Normal probability plots were introduced in chapter 2, and can be used to assess the normality of calibration as well as validation
residuals. Figure 7.23 shows the residuals plotted on a normal
probability plot for an acceptable calibration model and one for an
unacceptable model.
In Figure 7.23, the top pane shows the situation where the model fits the data well, as evidenced by the residuals all lying close to the straight-line fit. The bottom pane shows a situation where the model fit is not optimal, as the residuals do not lie on a straight line and deviate to a large degree at the extremes of the plot. In this case, the plots show a comparison between a final three-factor model fit to the data vs a clearly inferior one-factor model fit.
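The residual checks described above can be sketched as follows; the function and variable names are illustrative, with studentised y-residuals compared against ±3 standard-deviation limits and a normal probability plot produced with scipy's probplot.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def residual_diagnostics(y_ref, y_pred):
    """Studentised residuals with +/-3 SD limits and a normal probability plot."""
    resid = np.asarray(y_pred) - np.asarray(y_ref)
    student = (resid - resid.mean()) / resid.std(ddof=1)
    outliers = np.where(np.abs(student) > 3)[0]        # candidate outliers

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
    ax1.scatter(y_ref, student)
    ax1.axhline(0)
    ax1.axhline(3, linestyle="--")
    ax1.axhline(-3, linestyle="--")
    ax1.set_xlabel("Reference value")
    ax1.set_ylabel("Studentised residual")
    stats.probplot(resid, dist="norm", plot=ax2)       # normal probability plot
    fig.tight_layout()
    return outliers, fig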

7.11 Error measures


The simplest measure of the uncertainty of future predictions is the
Root Mean Square Errors (RMSE) of calibration and validation. These
values are a measure of the average uncertainty that can be expected
when predicting Y-values for new samples, expressed in the same
units as the response variable. The results of future predictions can
then be presented as “predicted values ± 2 × RMSE” (at
approximately 95% confidence, since RMSE values are equivalent to
standard deviation estimates).
This measure is valid provided that the objects used for prediction
are similar to the ones used for model development, otherwise, the
prediction error might be (much) higher. The general RMSE for
calibration (known as RMSEC) is presented in equation 7.13.

RMSEC = √(Σi (ŷi – yi)² / N) (7.13)

where N is the number of samples used to develop the calibration model.
For test set validation, the formula for the Root Mean Square Error of Prediction (RMSEP) is as follows, equation 7.14:

RMSEP = √(Σi (ŷi – yi)² / Ntest) (7.14)

where Ntest is the number of samples in the test set.
For cross-validation, the formula is identical to that for test set
validation, but must formally always be reported as RMSECV lest
serious confusion will ensue. Validation residuals, explained variances
and RMSEP are computed in exactly the same way as calibration
variances, except that prediction residuals are used instead of
calibration residuals.
Plots of RMSE as a function of number of components (for latent
variable methods) are used in the search to find the optimum number
for a model. When the summary validation residual variance is
minimised, RMSEP is also minimised, and the model with an optimal
number of components will therefore have the lowest expected
prediction error.
RMSEP can, among other things, be compared with the precision of the reference method, which is known as the Standard Error of Laboratory (SEL). It is of utmost importance to have an estimate of this analytical precision in order to evaluate the errors generated by the regression model. On this point, the following important information regarding SEL is presented, which should form the basis of all attempts to develop a predictive analytical method using primary reference analytical data.
The precision and accuracy of the reference method must be
known in order to fully assess the complexity of the multivariate
model developed. There is no point placing unrealistic expectations
on the predictive method if the precision of the reference method
cannot achieve the goal in the first place.
Errors propagate and accumulate, so that when the total errors from the reference and alternative methods are combined, the prediction method cannot be expected to have an accuracy better than the reference method. (Some take a more optimistic position but, based on the argument of error propagation, the best achievable result is that the predictive method comes close to, yet never exceeds, the accuracy of the reference method.)
The prediction method can be more precise than the reference
method. The degree of precision enhancement is a function of
analytical sampling, i.e. if the alternative method utilises a sampling
system that requires minimal sample preparation, then a precision
improvement may result.
Estimating SEL for a method is a detailed process, but one that must be performed in order to truly assess an alternative model. Simply stating "we think the reference method has an accuracy of 0.2% w/w" is not good enough unless it is backed up by a protocol showing this to be the case. The present authors have heard this argument from the uninitiated all too frequently, and it is an important reason why the use of chemometrics has not expanded at a greater rate than experienced.
The analytical accuracy is a matter of the magnitude of the bias of
a particular analytical process. Estimating, and subsequently
correcting, for an analytical bias is a standard feature of validation
of any analytical method. This would then take care of optimising
the analytical accuracy—contrary to what applies for a sampling
process, however (chapters 3, 8 and 9). It is particularly germane
not to forget this broader context in which optimisation and
validation of the analytical method per se is embedded. The
sampling bias is very nearly always much larger than the analytical bias; indeed, it very often dominates if not attended to properly (ibid.). Oceans of confusion have arisen in the history of data analysis because of such analytical method/data modelling tunnel vision.
Chapter 2 described the paired t-test as a method for showing the equivalence of means and provided an example calculation for estimating the equivalence of a newly developed NIR (alternative) method to the primary reference method. The calculation of SEL is performed in a similar manner and the following describes an example calculation in detail.

7.11.1 Calculating the SEL for a reference method


1) Select a set of samples (between 10 and 20) from the calibration
set to be used as the set for estimating SEL (if SEL is not already
established). This set is also referred to as the Intermediate
Precision set.
2) Select two analysts (operators) to perform the following analyses. Operator 1 should be the most experienced operator and operator 2 the least experienced. If two highly experienced analysts were used, the resulting value of SEL would be biased (optimistically low). The method of calculating SEL proposed here is therefore more typical of the real level of variation to be expected for the reference method.
3) The samples need to be prepared in such a way that the two
operators get a representative stock of each sample to be
analysed. This means that each sample delivered to the two
analysts must be a representative split, a representative sub-
sample from the progenitor sample (see chapter 3).
4) On Day 1, Operator 1 sets the reference analysis method up,
prepares samples for analysis from their representative original
sub-sample and runs these samples (preferably in duplicates) to
provide a single overall reference value for the samples.
5) On Day 2, Operator 2 performs the same procedure applied to their
sample set on a completely new setup of the reference method and
provides the results for each sample.
6) The combined results are tabulated and the differences between
the operator replicates are calculated (as these can be treated as
paired samples). The average difference between the operators is
the method bias. If the bias is significantly different from zero (as
assessed by the paired t-test), then the operators are deemed to
perform the analysis differently to one-another, and consequently
the Standard Deviation of Differences (SDD) is not a valid measure
of the expected variability of the analytical method. If, however, the
bias is found to be insignificant, then the SDD is a valid
representation of the SEL for the reference method.
7) This SEL value can now be used as the baseline statistic to
determine the number of components in the model to reach a
minimum level of acceptable precision, i.e. intermediate precision.
Once linearity of the model has also been established, analytical
accuracy is then implied [14].
An example calculation of SEL follows in Table 7.2.
After establishing that the values are normally distributed (K–S test,
chapter 2, section 2.5.3) the differences can safely be calculated.
The mean difference of 0.159 is compared to zero using a paired t-
test.
When the bias is found to be insignificant, the Standard Deviation
of Differences (SDD) is calculated and this value becomes the SEL
for the reference method.
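As a minimal illustration (a sketch only; the operator values are those of Table 7.2 and the variable names are purely illustrative), the bias, SDD and paired t-test can be computed in a few lines of Python:

import numpy as np
from scipy import stats

# Reference assay values from Table 7.2 (% w/w)
op1 = np.array([84.63, 84.38, 84.08, 84.41, 83.82, 83.55, 83.92, 83.69, 84.06, 84.03])
op2 = np.array([83.15, 83.72, 83.84, 84.20, 83.92, 84.16, 84.02, 83.60, 84.13, 84.24])

diff = op1 - op2        # paired differences between the two operators/days
bias = diff.mean()      # method bias, to be compared to zero
sdd = diff.std(ddof=1)  # Standard Deviation of Differences (SDD)

# Paired t-test: is the bias significantly different from zero?
t_stat, p_value = stats.ttest_rel(op1, op2)
print(f"bias = {bias:.3f}, SDD = {sdd:.2f}, t = {t_stat:.2f}, p = {p_value:.3f}")
# If p > 0.05 the bias is insignificant and the SDD can be taken as the SEL.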
According to the Pharmaceutical Analytical Services Group (PASG), RMSEP cannot be expected to be any lower than 1.4 times SEL [15]. The RMSEP is the sum of the sampling error, measurement error, model error and reference method error.

7.11.2 Further estimates of model precision

PRESS
An alternative error measure is the Predicted Residual Sums of
Squares, PRESS (equation 7.17):
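A common form, summed over the N validation objects, is

PRESS = \sum_{i=1}^{N} (\hat{y}_i - y_i)^2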

Table 7.2: Calculation of SEL for a reference analysis method.

Sample     Operator 1, Day 1    Operator 2, Day 2    Difference
1          84.63                83.15                 1.48
2          84.38                83.72                 0.66
3          84.08                83.84                 0.24
4          84.41                84.20                 0.21
5          83.82                83.92                -0.10
6          83.55                84.16                -0.61
7          83.92                84.02                -0.10
8          83.69                83.60                 0.09
9          84.06                84.13                -0.07
10         84.03                84.24                -0.21
Mean       84.06                83.90                 0.159
Std Dev    0.34                 0.34                  0.57

As PRESS is reported as a squared quantity, it does not directly relate to the values (or range) of the response variable Y, but its square root does.
Some other useful statistics available when assessing regression
models are given below.

Bias
The bias is the average value of the difference between the reference
and predicted values for a carefully specified set of replicated sample
measurements (equation 7.16).
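A common form is

bias = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)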

Standard error of prediction (SEP)


SEP is equivalent to the standard deviation of the prediction residuals
(equation 7.17).
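A common form, in which the prediction residuals are corrected for the bias, is

SEP = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (\hat{y}_i - y_i - \mathrm{bias})^2}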
7.11.3 X–Y relation outlier plots (T vs U scores)

In "well-behaved" regression models, all samples in a PLSR will lie close to a straight-line regression fit in the relevant X–Y relation outlier plots (T vs U scores); otherwise there is no basis for a model in the first place. Figure 7.24 shows the use of the X–Y relationship outlier plots and how they can be used to locate the optimal number of factors for a model.
In Figure 7.24, four common principal t–u patterns are shown.
Situation a): This defines the case where there is a strong
relationship between X and Y for the selected PLSR factor.
Situation b): This defines the case where there is only a moderately strong relationship between X and Y for the selected number of PLSR factors. As the relationship between the scores for X and Y weakens, so does the linearity of the T vs U plot.
Figure 7.24: The X–Y-relationship outlier plot for assessing the optimal number of PLSR
factors and for detecting outliers.

Situation c): This defines the case where there is little to no relationship between X and Y for the selected PLSR factor. This factor should not be added to the model.
Situation d): This defines the case where there is a strong
relationship between X and Y for the dominant fraction of objects for
the selected PLSR factor. However, there are two objects that do not
fit this implied trend (at all), which should be investigated as possible
outliers in either X-/Y-space or both.
The use of the X–Y relationship outlier plot is highly recommended
for all PLSR model developments and its use is highlighted in the
continuation of the octane in gasoline example presented in the next
section. The hallmark of an experienced multivariate data analyst is
the informed use of t–u plots, together with all other types of relevant
plots. In this game: experience is king!

7.11.4 Example—interpretation of PLS1 (octane in gasoline)


Part 2: advanced interpretations
Returning to the example presented in section 7.10 on octane
number in gasoline samples, it was found that an acceptable model
was able to be developed, even in the presence of what appeared to
be potential gross outliers. By using loadings and loading weights, it
was established that the regions of the spectra that characterised
these “outliers” did in fact not contain relevant information for
predicting octane number. Therefore, the question now is, should
these samples be removed, or is variable selection a better option for
improving the model?
Removing the samples from the model makes the calibration specific to samples that do not contain additives, thereby making the model more sensitive as an outlier test when samples with additive are analysed by the method. Eliminating variables focuses the calibration model on the region where the constituent of interest is modelled, thereby removing "unimportant" information.
To perform both of the procedures described above requires the
use of interactive modelling. Software programs such as The
Unscrambler® allow very easy interactive re-calculation of models
“without marked samples and/or variables”. Sample elimination will
be presented first followed by variable elimination and a comparison
of the results will be provided.

7.11.5 Sample elimination

For the purpose of this example, the samples found to be suspect in


the calibration set were eliminated, however, similar samples were
kept in the test set to show their effect on the model statistics. Using
the functionality in the software, the two suspect samples were
marked and the option to “re-calculate without marked” samples was
used to generate a new model. The PLSR overview is provided in
Figure 7.25.
Interpretation of the results is provided as follows,
1) The Explained Variance plot shows a highly erratic behaviour. A
one-factor PLSR model produces a negative variance which is
interpreted as there being no modelling capability for this one-
factor model whatsoever. By the time PLSR factor 2 is added, most
of the variability is explained but after PLSR factor 2, the model
again becomes highly unstable.
2) Comparing the scores plot in Figure 7.25 with the corresponding
Figure 7.16, shows that the model now has a completely rotated
data structure in which the suspect validation samples lie nowhere
close to the Hotelling’s T2 ellipse. This indicates the model is now
much more sensitive for the detection of such outliers.
3) The regression coefficients for a PLSR model with two factors are very similar to those of the three-factor model developed earlier. It is now evident that the first PLSR factor in the original model was indeed just trying to describe the variation due to the difference between the suspect samples and all the remaining "normal" gasoline samples.
4) While the calibration data are modelled well and the predicted data set also shows a straight-line fit, the validation curve (red) is skewed with respect to the calibration line, indicating that the samples with additive are now biasing the model.

Figure 7.25: PLSR overview of octane in gasoline model with suspect samples removed.

Overall, this model without the samples with additive is more


specific and better able to detect the presence of such samples later
on (refer to the scores plot), but compared to the original model (see
part 1 of this example above) the model has lost its general
applicability. Thus, there is a trade-off, which can usually only be
decided upon on a case-by-case basis in the light of the overall data
analysis objective(s).

7.11.6 Variable elimination


There has been a common misconception in the chemometrics community, particularly for spectroscopic data, that the entire spectrum has to be included in the model, the premise being that the model will "downweigh" the unimportant variables. While this may be the case in some (perhaps even many) situations, it is stated categorically here that inclusion of truly unimportant variables always incurs a risk: if something happens in an "unimportant region", this can cause the model to generate a suspect result, a bit like fighting a fire that just isn't there.
Variable elimination can be performed manually (based on
subject-matter expertise) or invoked using automated approaches
that are designed to select the variables most influencing a calibration
model. One such automated method is known as the Martens
Uncertainty Test [16] (see section 7.13). In this example, variable
selection will be based on the information obtained from part 1 of the
gasoline example, where it was found that above 1400 nm, there was
no real modelling power for octane number.
Using The Unscrambler®, the re-calculate without marked option
was now applied to variables, in order to eliminate all wavelengths
between 1400 nm and 1550 nm. The PLSR overview for this model,
applied to the test set, is provided in Figure 7.26.
Figure 7.26: PLSR overview of octane in gasoline data using the method of variable
elimination.

Interpretation of the results is provided as follows,


1) The explained variance plot shows convergence after PLSR factor
2. After factor 2, the characteristics of the validation set are
different from the calibration set and this is why the calibration and
validation curves diverge from here.
2) The score plot shows that the suspect calibration and validation samples lie inside the Hotelling's T2 ellipse. The previously noted outliers are just inside the ellipse in the factor 2 direction, which indicates the model is still sensitive to these samples but cannot be used as a detection method for such outliers (at the 5% significance level).
3) The regression coefficients for a PLSR model with two factors still hold a similar shape to the other two regression models developed so far in this reduced wavelength region.
4) Both the calibration and validation data are modelled well in the
Predicted vs Reference plot.
Table 7.3 provides a comparison of the three models developed to
date.
The results in Table 7.3 suggest that the sample elimination model is the worst approach, as it eliminates a complete sample type from the calibration model which can, in fact, be modelled globally when its variation is explained by the addition of just one more factor to the model. The variable elimination model is a tempting option, as there is very little difference between it and the original model.
An important last word: It should be kept in mind that for the
objective of producing a workable model for octane number
prediction, the full-spectrum PLSR model was fully functional when
including the gasoline samples containing additives. Thus, for this
prediction objective one would choose to use all available samples.

Table 7.3: Comparison of three models for predicting octane number in gasoline.

                             Original model       Sample elimination    Variable elimination
Parameter                    Cal       Val        Cal       Val         Cal       Val
Elements                     26        13         24        13          26        13
Number of PLSR factors       3                    2                     3
Slope                        0.98      1.01       0.98      0.91        0.98      1.02
Pearson's correlation (R2)   0.98      0.98       0.98      0.94        0.98      0.99
RMSE(C/P)                    0.27      0.23       0.28      0.41        0.27      0.25
SE(C/P)                      0.27      0.24       0.29      0.36        0.27      0.26
Bias                         <0.001    0.008      <0.001    –0.21       <0.01     0.04

It is the overall data analysis objective that determines optimal use


of the many, diverse performance statistics and how to choose
between competing models. It is, strictly speaking, not legitimate to
compare models in which the sample basis is radically different,
although in the present case the data analyst can easily comprehend
the difference between “additive samples” in/out and make
reasonable judgements even so. But in many more complex contexts,
differing with respect to just one sample might be “radical” if this
sample is (or turns out to be)—an outlier.

7.11.7 X–Y relationship outlier plot

To finalise the model development aspects of the octane number in


gasoline example, an analysis of the X–Y relationship outliers plots
will be provided for the models. Figure 7.27 shows a four-pane X–Y
relationship outlier plot for the original model.
Figure 7.27 is called a quadruple split screen plot in The Unscrambler® and plots the T vs U scores for four PLSR factors. This
plot was sample grouped based on whether an additive was present
(Yes) or absent (No). PLSR factor 1 does not model the response well
as evidenced by the two samples with additives lying well off the
regression line. In the PLSR with two and three factors, the true
power of PLS modelling can be appreciated with clarity. The function
of factor 2 is to “rope in” the most deviating samples as they are
outlined by their “transverse” distance to a common regression line
(upper-left panel). The same effect, only at a much smaller scale, can
be observed for component 3. When PLSR factor 4 is reached, the
object (sample) points no longer line up along the regression line. This
is the point where PLSR factors should stop being added to the
model.
The X–Y relationship outlier plot for the sample elimination model
is shown in Figure 7.28 as a double plot, because this model only
requires two PLSR factors.
Elimination of the two suspect samples (i.e. those with additive)
now simplifies the model structure considerably. PLSR factors 1 and
2 are now enough to model the response as best possible. PLSR
factor 3 (not shown) does not show any further modelling capability.
The X–Y relationship outlier plot for the variable elimination model
is shown in Figure 7.29 as a quadruple plot, because this model
could require more than two PLSR factors.
Figure 7.27: X–Y relationship outlier plot for the original octane in gasoline model.

Variable elimination has forced PLSR factors 1 and 2 to be the major predictive factors. It has removed the model's dependence on the additive-contributing wavelengths, and factors 3 and 4 are not required as there is only a weak linear relationship between these factors and their U-scores.

7.12 Prediction using multivariate models


By far the most important goal of developing a regression model is to use it to predict Y-variable "measurements" for future samples. Once a model has been properly defined and the training phase properly executed, including the all-important validation, the end user who developed the model takes a risk when implementing the new (alternative) direct prediction procedure to replace the primary reference method.
The fundamental assumption behind this undertaking is that the new X-data are "statistically similar" to those used for the calibration. In various places in this book it has been pointed out how this assumption is intimately associated with a range of issues [data analysis problem definition, model type selection, training and test set selection, sampling issues (TSE), measurement (TAE), proper validation]. Becoming aware and fully respectful of this complexity behind multivariate calibration is perhaps the most important didactic goal of this book. Compared to this, it is comparatively easy for the new data analyst to master the "mechanical" skills of executing the methods presented in software.
Figure 7.28: X–Y relationship outlier plot for the sample elimination octane in gasoline model.

Assuming all this is in place, there are a number of advantages


associated with the development of predictive models including,
1) The alternative prediction method is faster and less expensive than
the primary method.
2) The alternative method may in many cases be more precise and
allow for multiple measures to be performed to gain an estimate of
prediction errors and repeatability.
3) The alternative method may in some cases require little or even no sample preparation, particularly if the method is spectroscopic in nature and is being implemented for monitoring and control of processes in real time. However, much depends on the nature of the lot material (ranging from 2- (or 3-)phase compound systems, via mixtures of all kinds, to the archetypal infinitely diluted solution). Be aware of the "no sampling" fallacy often associated with the introduction of PAT sensors interacting with such systems. PAT sensors are in fact subject to the exact same set of potential sampling errors as is physical sampling (chapters 3 and 13); see Esbensen and Paasch-Mortensen [17] for details.
There is a major advantage of using multivariate models: the powerful diagnostics used to assess the quality of the calibration model are available for the predictions as well. In prediction, the only real unknown is the actual reference value for the new sample. The confidence one has in the predicted values generated is 100% dependent on the quality of the calibration and its validation. Thus, there exist useful diagnostic tools to assess the quality of the prediction, based solely on the quality of the X-variables collected.
Since Y is unknown in prediction, and since the regression relates the X-variables to the Y-responses, the assumption made in the multivariate model is that if there is a significant difference between the new X-variables and those used to develop the model, then there is reasonable doubt regarding the validity of the predicted values generated. These statistics will be discussed below, but first the general background for applying a model to new data is reiterated here.
1) The new samples must be sampled, processed and presented to
the alternative method following the exact same protocol which
was used to develop the model.
2) The same preprocessing as used for the model development must
also be applied to the new data before the model is applied. This
also goes for the potential variable selection included in the
calibration etc.
3) It is imperative that the model dimensionality used for all predictions is the same number of factors as was used for the final validation of the model, in order for the predicted results to be considered valid.
If these principles are followed on well-maintained equipment, predicted results should be meaningful, whether or not they meet expectations. It is also an important piece of information to know that predictions are "deviating" w.r.t. expectations—somewhere in the line of reasoning and data analysis/modelling/validation there is obviously room for refinement and improvement (this is the topic of model lifecycle management presented in chapter 13).

Figure 7.29: X–Y relationship outlier plot for the variable elimination octane in gasoline
model.

Predicted values can be generated directly using the compacted b-coefficient format, by multiplying each X-variable by its associated coefficient, summing these products and adding the intercept (b0), as per equation 7.18.
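In general form, for a new sample with X-variables x1, …, xJ,

\hat{y} = b_0 + \sum_{j=1}^{J} b_j x_j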

For PCR, the model equations can be expressed in the usual form:

X = TPᵀ + E

and

Y = Tbᵀ + F

The PLSR model equations are:

X = TPᵀ + E

and

Y = TQᵀ + F

For these models, Y is expressed as an indirect function of the X-


variables using the scores T, the X-loadings P and the Y-loadings Q
(for PLSR). The advantage of using the projection equation for
prediction, is that when projecting a new sample onto the X-part of
the model (this operation gives the t-scores for the new sample), one
simultaneously gets a leverage value and an X-residual for the new
object, hence allowing potential outlier detection also for test set
samples using precisely the same rationale as for training objects.
A prediction object with a high leverage and/or a large X-residual
may be a prediction outlier. Such samples may then not belong to the
same “population” as the training samples, and therefore such results
should be treated with caution and/or the new object should be
deleted. In order to have some level of confidence in the predicted
results, the following plots and statistics can be used and evaluated,
1) Projected scores against Hotelling’s T2 ellipse (“likely class
belonging”). (N.B. All class belonging demarcation approaches
assume a normal distribution behaviour of bona fide objects). Who
is to say whether such an assumption is right in a particular case?
The data analyst!
2) Influence plots of Q-/F-residuals vs Hotelling’s T2/leverage (“likely”
outlier disposition).
3) Y-Deviations (“unlikely y-levels” compared to what is known at
present). (N.B. spread and range intervals expectations assume a
normal distribution). Who is to say whether such an assumption is
right in a particular case? The data analyst!
4) Inlier vs outlier plots (N.B. Most outlier demarcation indices assume
one form of normal distribution behaviour or other). Who is to say
whether such an assumption is right in a particular case? The data
analyst!
It is always the data analyst who is responsible for the validity of
the results stemming from a multivariate calibration. The following
argument is completely general, however, and applies to the responsibility for all other multivariate data models used for, e.g., classification, discrimination, time series forecasting and so on.
Thus, it is the analyst’s responsibility to be aware of all inherent
method assumptions and prerequisites, regardless of whether these
are expressed clearly or not. Software vendors are "only" responsible for ensuring that the methods offered carry out their calculations correctly. Even though software packages offer the data analyst many helpful indices etc., these are all to some extent based on more-or-less general assumptions that may be more-or-less realistic for a particular data set.
There exists a grey zone in between these two parties which contains the all-important validation issue (this grey zone extends to sampling as well, only even more so). For example, while all chemometric software packages offer the data analyst all three validation approaches (leverage-corrected, cross- and test set validation), there are very significant differences that the data analyst needs to be aware of in order to choose and use the appropriate approach correctly. There is no doubt that the present didactic approach is by far the most comprehensive. On this cautionary basis, some further helpful diagnostic indices are presented below.

7.12.1 Projected scores

When a bi-linear calibration model is developed and validated, the calibration set defines the population space in which all new samples should ideally also lie. Using Hotelling's T2 statistics, confidence intervals at the desired level of significance can be established such that new samples can be projected onto the calibration space and compared to the limits imposed by the model. This was introduced in chapter 6, section 6.7.1 and is performed as per equation 7.19.
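Following the definitions given below, the projection can be written as

T_{A,New} = X_{New} P_{A,Model}

with any centring or other preprocessing from the calibration applied to XNew beforehand.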

where TA,New are the projected scores for the optimal number of
components/factors, XNew is the new data being presented to the
model and PA,Model are the loadings of the model for the optimal
number of components/factors. By assessing the projected scores,
the new X-data can be tested to see if it is a reasonable assumption
that they originate from the same population as the calibration model.
If the projected scores lie outside of the model space, there is reason
to treat the Y-predicted value as suspect, until/unless proven
otherwise. For PLSR models, it is the Aopt rows of the W-matrix that
are to be used.

7.12.2 Prediction influence plots


Using the same projection approach, the X-variables can be
assessed for closeness of fit to the surface defined by the regression
model. When a sample is not modelled, this must be reflected by the
residual being “large”; Q-residuals (section 7.11.2) and F-residuals
(section 7.11.3) can be calculated and compared to the statistical
limits derived from the validated model. These values will project
along the y-axis of the influence plot. The x-axis will show either the
projected Hotelling’s T2 value or the calculated leverage for the
sample. This plot is sometimes known as a Multivariate Statistical Process Control Chart (MSPC Chart) and shows, in one plot, the complete information for all components/factors used in the model.
From the influence plot, process control scripts or alarms can be
defined to detect outliers during the prediction phase. Refer to
chapter 6, section 6.6.4 for an explanation on how Influence plots
work.
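As a minimal sketch of these diagnostics (assuming a PCA/PCR-type projection with stored loadings P; a PLSR projection would use the loading weights as noted in section 7.12.1, and all names here are illustrative only), the projected scores, Hotelling's T2 and Q-residuals for new objects can be computed as follows:

import numpy as np

def prediction_diagnostics(X_new, x_mean, P, score_var):
    """X_new: (N x J) new data, x_mean: (J,) calibration mean,
    P: (J x A) loadings, score_var: (A,) variances of the calibration scores."""
    Xc = X_new - x_mean                        # apply the model's centring
    T_new = Xc @ P                             # projected scores (cf. equation 7.19)
    E = Xc - T_new @ P.T                       # X-residuals after the model fit
    Q = np.sum(E**2, axis=1)                   # Q-residual per new object
    T2 = np.sum(T_new**2 / score_var, axis=1)  # Hotelling's T2 per new object
    return T_new, T2, Q

# Objects with T2 and/or Q above the validated model limits are flagged as
# potential prediction outliers (cf. the influence plot).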

7.12.3 Y-deviation

In software packages such as The Unscrambler®, the Y-deviations


are estimated as a function of the global model error, the sample
leverage and the sample residual X-variance. A small deviation
indicates that the sample used for prediction is “similar” to the
samples used to make the calibration model, while predicted Y-
values for samples with high deviations are less reliable.
These deviations may be interpreted in a similar fashion to the root mean squared error of prediction (RMSEP, or standard error of prediction, SEP) for new samples, but without taking the "true" Y into account, as this is unknown. While the RMSEP is
calculated based on all samples in the test set, the deviation is
estimated for each individual sample. For any moderately sized
dataset with an assumed multivariate normal disposition, a 95%
confidence interval for the prediction is typically given as Ypredicted ± 1 × Y-deviation. This rests on standard statistical assumptions, however;
deviating situations and/or data sets may occur. Who is responsible
for such a case not going unnoticed? Not the particular software, no
matter how sophisticated—it’s the data analyst!
This latter stipulation shows with all clarity the necessity for full
method and background understanding as described in this and
other chapters to date. While proper statistical underpinning is a
powerful tool for the kinds of deliberations described here, for real-
world data sets there always exist the possibility for significant
“deviation-from-normality” distributions… ranging to extreme
“clumpy” data sets, see Figure 7.24 and Figure 8.1 in chapter 8.

7.12.4 Inlier statistic

The inlier statistic is based on the principle that if samples, when


predicted, lie far from the nearest calibration sample in the scores
plot, they should be flagged as an “inlier”. For this reason, an “inlier”
should be interpreted as a potential outlier. Whereas samples with
high leverages will be found far from the origin of the scores plot
(outside the Hotelling’s T2 ellipse), an inlier may be found anywhere in
the scores plot.
The so-called inlier vs outlier plot is a useful tool for understanding
whether the new sample is likely to be a member of the calibration
population, or whether it lies in a region of the calibration space not
yet spanned by the current variability of the model. The outlier part of
the plot utilises Hotelling’s T2 as the statistic for determining whether
the sample lies inside or outside of the multivariate confidence limits
for the selected number of model components/factors.
The use of the inlier and the other diagnostics plots is
demonstrated in the continuation of the octane in gasoline example.

7.12.5 Example—interpretation of PLS1 (octane in gasoline)


Part 3: prediction
Returning to the octane number example, the test set will be used as a new set of data to which the variable elimination and sample elimination models are applied, to better understand the prediction ability of these models. The Unscrambler® provides functionality to fully
assess inliers and outliers in a prediction data set as well as in a
predicted with Y-deviation plot. These are shown for the test set in
Figure 7.30.
The Y-deviation intervals for the predicted samples are of similar
size to each other, indicating that the test set is typical also of the
calibration samples. The same plot of predicted vs Y-deviation is
shown in Figure 7.31 for the model that eliminated the suspect
samples from the model.
The use of such plots can be highly subjective; therefore, diagnostic plots based on statistical confidence intervals will, in general, be more reliable when business-critical decisions are being made based on the predicted results. The inlier vs outlier plot for the
variable elimination model is provided in Figure 7.32.

Figure 7.30: Predicted with Y-deviation plot for the variable selection model applied to the
test set, octane number in gasoline data.

Figure 7.31: Predicted with Y-deviation plot for the sample elimination model applied to the
test set, octane number in gasoline data.
The inlier vs outlier plot can be interpreted in the same way as the influence plot (refer to chapter 6, section 6.6.4). Predicted objects that
lie within the confidence bounds of the Inlier and Hotelling’s T2
statistics are considered to be similar to the calibration objects.
Those that exceed the Hotelling’s T2 limit only are considered to be
high leverage objects. Objects that exceed the inlier statistic bound
only are samples that could be added to the model to make the
model more robust. Objects that simultaneously lie outside of the
inlier and Hotelling’s T2 bounds should be treated as outliers and
further investigated or deleted outright.
The inlier vs outlier plot for the sample elimination training data set
model is provided in Figure 7.33.
When the sample elimination model is applied to the test set, it is
better able to distinguish the samples with additive as outliers. The
inlier vs outlier plot also shows that one object may be a potential
inlier.
This example highlights that the model development process does
not stop when the initial model has been developed and
implemented. All multivariate models used for real world applications
must have a lifecycle maintenance scheme built around them that
defines when,

Figure 7.32: Inlier vs outlier plot for octane in gasoline variable selection model applied to
test set.
Figure 7.33: Inlier vs outlier plot for octane in gasoline sample elimination model applied to
test set.

1) A model is to be tested, or re-tested, for "fit for purpose" assessment.
2) New objects are to be added to the model, based on detection of
new objects using inlier statistics.
3) A model is to be re-evaluated based on equipment maintenance
and repairs, or changes to the processes used to generate objects.
This topic is further discussed in chapter 13.

7.13 Uncertainty estimates, significance and stability—Martens’ uncertainty test
This section deals with uncertainty estimates of the model
parameters in multivariate methods such as PCA, PCR and PLSR.
The included example will deal with PLSR, although the methodology is also applicable to PCR. It is recommended that this section be read only once a solid understanding of the concept of cross-validation has been reached, as described in chapter 8. This most specifically includes the limits and pitfalls of cross-validation application.
When cross-validation is applied as the method of validation, it
gives a number of individual sub-models that are used to predict the
samples kept out in that particular segment. Therefore, loadings,
loading weights, regression coefficients and scores have been
perturbed and can be compared to the full model. The differences, or
rather variances, between the individual models and the full model
will reflect the stability towards removing one or more of the objects
dependent upon the size of the cross-validation segment in use. The
sum of these variances will be utilised to estimate uncertainties for
the model parameters.
Uncertainty estimates will be discussed in terms of:
Variable selection
Prediction performance
Stability

7.13.1 Uncertainty estimates in regression coefficients, b

The approximate uncertainty variance of the PCR and PLSR regression coefficients b can be estimated by a process known as jack-knifing (refer to equation 7.20).
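A common form of this jack-knife estimate, with M denoting the number of cross-validation segments (additional scaling factors such as (M−1)/M are sometimes applied), is

s_b^2 = \sum_{m=1}^{M} (b - b_m)^2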

where N = number of samples, s²b = estimated uncertainty variance of b, b = regression coefficient at the cross-validated optimal number of components AOpt using all N samples, and bm = regression coefficient at rank A using all objects except the object(s) left out in cross-validation segment m.
On the basis of such jack-knife estimates of the uncertainty regarding the model parameters, useless or unreliable variables may be identified and eliminated, in order to simplify the final model and make it more reliable. This is done by significance testing (refer to chapter 2), where a t-test is performed for each element in b relative to the square root of its estimated uncertainty variance s²b, giving the significance level for each parameter. The uncertainties for the regression coefficients are estimated for a specific number of components, preferably the optimum number, AOpt.
An informative and visual approach is to show the b-coefficients ± 2 standard deviations, as this corresponds to a confidence interval of approximately 95%. The more formal statistical test is whether the confidence interval b ± t·√(s²b) includes zero; in simple words, any b-coefficient whose confidence interval includes zero cannot be distinguished from zero, and the variable it models is therefore an unimportant variable. The article by Efron [18] provides a concise description of jack-knifing, bootstrapping and other resampling plans.
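A minimal sketch of this significance test (assuming the full-model coefficients and the M sub-model coefficients are available as NumPy arrays; the simple sum-of-squares form of the jack-knife variance is used, whereas software implementations may include an additional scaling factor, and all names are illustrative only):

import numpy as np
from scipy import stats

def jackknife_significance(b_full, B_sub, alpha=0.05):
    """b_full: (J,) coefficients of the full model at AOpt;
    B_sub: (M x J) coefficients from the M cross-validation sub-models."""
    M = B_sub.shape[0]
    s2_b = np.sum((B_sub - b_full) ** 2, axis=0)   # estimated uncertainty variance
    s_b = np.sqrt(s2_b)
    t_values = np.abs(b_full) / s_b                # t-statistic per variable
    p_values = 2 * stats.t.sf(t_values, df=M - 1)  # two-sided p-values
    return s_b, p_values, p_values < alpha         # True = significant variable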

7.13.2 Rotation of perturbed models

The individual models have a rotational ambiguity; thus, for a PLSR model, Tm, Pm, Qm and Wm from the cross-validated segments must be rotated before the uncertainties are estimated. A rotation matrix Cm satisfies the relationship of equation 7.21.
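A common way of expressing this relationship, with Cm obtained by orthogonal Procrustes rotation of Tm towards T, is

T \approx T_m C_m, \qquad C_m = \arg\min_{C^T C = I} \lVert T - T_m C \rVert^2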

When Cm is estimated, for instance by orthogonal Procrustes rotation of Tm vs T (equation 7.21), the individual models m = 1, 2, ..., M may be rotated towards the full model:

T(m) = TmCm

for the scores, T, and

[Pᵀ, Qᵀ](m) = Cm⁻¹[Pmᵀ, Qmᵀ]

for the X- and Y-loadings, P, Q.


After rotation, the rotated parameters T(m) and [PT, QT](m) may be
compared to the corresponding parameters from the common model
T and [PT, QT]. The loading weights, W, are rotated correspondingly to
the scores, T. The uncertainties are then estimated as for b, thus the
significance for the loadings and loading weights are also estimated.
This can be used for component-wise variable selection in X-loadings
and loadings weights, but it also gives still another optional criterion
for finding AOpt from the significance of the Y-loadings, Q.

7.13.3 Variable selection

A parsimonious model, i.e. a model with fewer components and


variables than the full PLSR model, will give a lower estimation error,
since the number of model parameters is reduced. This affects the
RMSEP values as well as the uncertainty (or deviation) in the
individual predicted values themselves, named “YDev” in the
terminology of this book.
The general rule is thus that if a model with fewer variables/components is as good as or better than the full model with respect to predictability, the simpler model is always to be preferred (as shown by example in the octane number in gasoline model using variable elimination). However, if the objective is to interpret the components and their underlying structure in full, it can in some cases be an advantage to keep some of the non-significant variables to span the multidimensional space, at least in a first foray analysis. An example of this is the visualisation aspect of plotting the new predicted samples' scores in the score plot from the calibration model.
NIR spectroscopy is a field where multivariate calibration has been shown to be an efficient tool, with its ability to compensate for "embedded" unknown phenomena (interfering compounds, temperature variations and others) in the calibration model. There are, however, still a large number of applications that are based on only two or three wavelengths for routine prediction. These applications have shown that the full PLSR model is sometimes inferior to a model based on only a relatively small number of variables. This is partly due to the redundancy and the large number of noisy, non-relevant variables in NIR spectra, for which variable selection based on jack-knife estimates is a fast and reliable method with low risk of overfitting.
In the multivariate data modelling realm, there are very few solid
rules and principles with 100% applicability. It is much better to get
used to assuming full responsibility, not only for the data model but
also regarding the several associated issues before modelling.

7.13.4 Model stability

Model stability can be visualised in scores and loading plots by


plotting all perturbed and rotated model parameters together with the
full model. Examples will be provided below, but first, a formal
description of the use of stability plots is presented.
The leverage measure, hi, is a useful tool for finding influential samples, i.e. samples which have a high impact on the direction of the components. Simultaneous interpretation of score and loading plots
provides information about which variables span a specific direction
and which objects are extreme in this direction. The loading plots can
give information about correlation between variables, both X and Y,
but the explained variance is also an important part here. Even if
there seems to be a high correlation, it is essential to find out how the
two variables are correlated, i.e. a 2D scatter plot of the variables will
reveal this relationship. These aspects are particularly relevant for
models with a low number of objects.

7.13.5 An example using data from paper manufacturing


A small dataset from paper production is chosen to illustrate how
model stability can be visualised in the so-called stability plot. A
PLSR model was calculated with 14 samples for 15 process variables
(X) and one response variable Y (Quality). Full cross-validation was
performed, and in this operation the Procrustes rotation of section 7.13.2 was applied to estimate the position of the object that was not in the actual cross-validation segment. This yields one projected score value for each individual object when it was kept out during cross-validation, which can be used for visualising the model stability.
Figure 7.34 shows the score plot and Figure 7.35 the stability score
plot from this model. Sample 10 is situated in the lower left and thus
the model direction changes considerably when this sample is not in
the model. To find the reason why this is the case one has to move to
the loading weights plot.
The 2D stability plot of the variables in Figure 7.36 reveals that
sample 10 has a very high value for Permeability compared to the rest
of the samples. Thus, the relation (or correlation) between this
variable and the other variables is changed when sample 10 is kept
out. The stability plots based on cross-validation enable interpretation
of the subtle structures in the data very efficiently in the realm of the
objective: to establish a regression model between X and Y.

7.13.6 Example—gluten in starch calibration

Context
This data set was introduced in chapter 5 on preprocessing and was
taken from the work performed by Martens, Nielsen and Engelsen
[19]. It is well known that scatter effects dominate spectral data collection of solid samples. In the case of transmission spectroscopy, not only is scattering a major effect, but pathlength effects and the packing density of solid samples also influence the shape and slope of the spectra. The objective of the current investigation is to show that, no matter what the sample thickness and presentation, the same constituent composition should be possible to model from the samples.
Figure 7.34: Score plot for Factor-1 vs Factor-2, sample 10 is marked (lower right).

In their original paper, the authors presented the pitfalls of Multiplicative Scatter Correction (MSC) for chemically diverse data, and showed that by adding known chemical information into the scatter correction algorithm, using modified Extended Multiplicative Scatter Correction (mEMSC), a one-factor PLSR model should result from the analysis, i.e. the principle of parsimony.

Data set

The data set consists of transmission NIR spectra collected in the range 720–1100 nm of binary mixtures of gluten and starch. 100 spectra were collected on various gluten–starch mixtures of differing composition and sample preparation, where changes in packing compression and pathlength were introduced into the samples. The raw material spectra of gluten and starch were recorded for use in the mEMSC algorithm.
Figure 7.35: Stability plot. Note the position of sample 10 when it was not part of the sub-
model (occupying an isolated location between objects 8, 14 and 12).
Figure 7.36: Stability plot of loading weights/y-loadings. Note the position of x-variable
Permeability for one of the segments—the segment when sample 10 is left out.

The various steps to be described in this example include:
1) The study of raw data to look at the inherent variability.
2) Apply the regular MSC and EMSC transforms and study the PLSR
models developed.
3) Use chemical knowledge as the basis for modified EMSC and
study the PLSR models developed.
4) Determine the best method of scatter correction.

Data analysis

The first step, as always, is to visualise the raw data. The line plot in Figure 7.37 shows that this data set is highly diverse chemically, i.e. the spectra do not show a similar profile but vary in shape. It is also observed from Figure 7.37 that there is a large effect of packing density, as evidenced by the large baseline offsets between the spectra. It is known that in the short-wavelength region of the NIR, diffuse reflectance is not a dominant effect and specular reflectance is the mechanism of light scattering; however, this is not the case for transmission data, and therefore scattering effects are as relevant to transmission as they are to diffuse reflectance in the longer-wavelength NIR region.
To determine which preprocessing method to use, it was shown in chapter 5, using the scatter effects plot, that this data set did not show constant straight-line properties when plotted against the mean spectrum (due to the diverse chemical nature of the samples). The EMSC transform was better able to reduce the scatter and packing effects in the data, and the application of the mEMSC preprocessing resulted in a very similar-looking data set.
PLSR models for gluten content using the raw, MSC, EMSC and
mEMSC were developed and the results obtained are described in
the following sections.

7.13.7 Raw data model

In all cases, when a model can be applied to raw data, this is the best
option, provided a reasonable number of components/factors are
used in the model. For a binary mixture, the optimal number of PLSR
factors would be 1, however, 2 factors (possibly 3) may also be
acceptable. Any more than this number would likely indicate that the
model is trying to base predictions around random or chance
correlation structures in the data set.
Figure 7.37: Raw NIR transmission spectra of gluten–starch mixtures.

The raw spectra PLSR overview (using test set validation) is


shown in Figure 7.38. From the overview, a three-factor model
describes a highly linear model. The scores plot shows a high degree
of heteroscedasticity (non-constant variance) across the
concentration range (note that there are five distinct groups in the
scores plot and these relate to the gluten content in the samples).
The model statistics are summarised at the end of the example
with all other models for comparison.

7.13.8 MSC data model


The MSC preprocessed data PLSR overview (using test set
validation) is shown in Figure 7.39. From the overview, a six-factor model describes a highly linear model; however, this is not an ideal situation for a model describing a binary mixture. In this model, PLSR factors are being used to implicitly remove unwanted variation rather than the preprocessing performing this task. The scores plot shows a very high degree of heteroscedasticity, particularly at the highest concentration range. The regression coefficients are far too complex and are not smooth in profile. This is a non-ideal model and shows how, for chemically diverse data, MSC can completely destroy the data structure, making it useless for modelling purposes.
Another important observation from this model was that 100% of the X-data was described by PLSR factor 1. This is another reason why this model is unacceptable for use: the remaining five PLSR factors in the model use less than 1% of X to describe the remaining modelled Y-variation.

7.13.9 EMSC data model

The EMSC preprocessed data PLSR overview (using test set


validation) is shown in Figure 7.40. From the overview, a three-factor model describes a highly linear model, i.e. of similar complexity to the model developed using raw spectra. The scores plot shows that the variability seen in the previous two models has been reduced for all concentration ranges; however, a non-linearity is observed which does not appear to be too extreme, based on the predicted vs reference plot.
Figure 7.38: PLSR overview of raw NIR spectra model.

The predicted vs reference plot shows a distinct


heteroscedasticity of the residuals going from high concentrations of
gluten down to low concentrations.

7.13.10 mEMSC data model


The mEMSC preprocessed data PLSR overview (using test set
validation) is shown in Figure 7.41. From the overview, a one-factor
model is now enough to describe a highly linear model; this is the
ideal modelling situation. This preprocessing explicitly removes the
effects of scatter, this time using external information regarding the
difference spectrum of gluten and starch (refer to chapter 5, section
5.4.2). PLSR factor 1 describes very close to 100% of both the X-
and Y-variability of the data and this is commensurate with the
problem, i.e. a one-factor model is the simplest model for solving a
binary mixture problem (there is only one degree-of-freedom
regarding composition, since the two, correlated proportions in any
binary mixture will always sum to 100%).

7.13.11 Comparison of results

The original objective for modelling this data set was to show that
through the effective use of meaningful preprocessing, a simple one-
factor PLSR model should result from the analysis of a binary mixture
of powders with diverse chemical information contained in their
spectra. Table 7.4 summarises the statistical parameters for each
model and also provides the statistics of each model for a one-factor
PLSR situation only for comparative purposes.

Figure 7.39: PLSR overview of MSC preprocessed NIR spectra model.

Inspection of Table 7.4 shows that for the one-factor PLSR


models, none come close to the mEMSC model. This example shows
how critical the correct choice of preprocessing can be and how
application specific problem-related information can guide a model to
its simplest solution. Although this example is quite advanced for an
introductory text, it serves as a good example of what can be
achieved when a greater level of experience is gained with
chemometric modelling.

7.14 PLSR and PCR multivariate calibration—in


practice
As has been demonstrated many times, data can only seldom be
used in its raw format, unless the process of data collection has been
performed at an extremely high level of caution. There may be
outliers, or unsuitable variables, or there is a need to transform some
of the variables, for instance, according to a linearisation objective. In
any event, it is quite normal to make several data analytical runs,
always starting with the raw data, and simply “try out” a few standard
routines to investigate the quality of the models being generated. This
process must be performed because multivariate methods are
empirical models, i.e. they are dependent on the data set presented
for modelling and only relate to the population being modelled, thus
very much making multivariate data analysis an iterative process.
Figure 7.40: PLSR overview of EMSC preprocessed NIR spectra model.

7.14.1 What is a “good” or “bad” model?

In order to determine whether a model is “good” or “bad”, it is


necessary to specify the exact purpose of the data modelling. For
example, a model intended for prediction of the protein content in
wheat (Y), using fast and inexpensive spectroscopic (NIR)
measurements (X), instead of time-consuming laboratory methods,
will in all likelihood require a high prediction accuracy. Authorities or
customers may demand that the prediction error (RMSEP) be within
strictly predefined limits, usually in the same range as the reference
method used for the Y-data of the calibration set. The experienced
data analyst (who knows the full content of chapters 3, 8 and 9 for
example) will of course never rely on the precision criterion alone, but
will be equally concerned about the accuracy that can be obtained—
not the analytical accuracy of course, the total “from-lot-to-analysis”
accuracy: The experienced data analyst is specifically aware that
accuracy is mainly a feature that is determined by the external
sampling representativity governing the samples that make it to the
reference laboratory, in which maximum suppression of TSE reigns
supreme. The specific data analytical/statistical accuracy only holds relevance after this prerequisite has been properly attended to. Romanach and Esbensen [20] give a brief account of these relationships between TOS and the development of spectroscopic calibration models.

Figure 7.41: PLSR overview of mEMSC preprocessed NIR spectra model.

In contrast, a model made in order to understand which process


variables influence the quality of an inherently variable product, say a
commodity like apples for example, can often be accepted with much
less accuracy. An explained Y-variance as low as 60–75% may be
enough to get on the right track of bad quality. The purpose of this
model is more to interpret the patterns of scores and loadings, i.e. to
find the “significant variables”. This may well be achieved without
aiming to find the absolutely largest possible explained Y-variance in
terms of accuracy.
Various expressions of model fit and prediction ability were
explained in this chapter for assessment of the performance of a
prediction model. Every solution must always satisfy the basic
minimum rules for sound data modelling though: no outliers and no
systematic errors. It is always preferable to be able to interpret the
model relationships, to be able to judge unequivocally to which
degree they comply with domain specific knowledge.

7.14.2 Signs of unsatisfactory data models—a useful


checklist

It is not the present authors’ intent that the reader follows prescriptive
rules for model development rigidly (although these summarise a
wealth of empirical experience). However, the use and adaptation of
the following list in the assessment of model quality will help to
achieve the most robust modelling situations for “typical data sets”.
The following list details reasons for caution:

Table 7.4: Comparison of gluten–starch models using PLSR and various preprocessing methods.

                                                Preprocessing
Model ID    Parameter      None                MSC                 EMSC                mEMSC
                           Cal       Val       Cal       Val       Cal       Val       Cal       Val
Optimised   # Factors      3                   6                   3                   1
            Slope          0.986     1.01      0.982     0.984     0.989     0.996     0.998     0.999
            R2             0.986     0.983     0.982     0.985     0.994     0.992     0.998     0.998
            RMSE(C,P)      0.041     0.048     0.047     0.044     0.037     0.032     0.015     0.015
            SE(C,P)        0.042     0.049     0.047     0.045     0.037     0.033     0.015     0.015
            Bias           <0.0001   <0.0001   <0.0001   0.0016    <0.0001   <0.0001   <0.0001   <0.0001
1-Factor    # Factors      1                   1                   1                   1
            Slope          0.693     0.654     0.581     0.537     0.927     0.927     0.998     0.999
            R2             0.693     0.598     0.581     0.639     0.927     0.930     0.998     0.998
            RMSE(C,P)      0.195     0.228     0.227     0.221     0.095     0.095     0.015     0.015
            SE(C,P)        0.196     0.230     0.229     0.224     0.096     0.097     0.015     0.015
            Bias           0.00      <0.0001   <0.0001   –0.0055   <0.0001   0.0043    <0.0001   <0.0001

• The prediction error is “too high” for the problem at hand (w.r.t. external knowledge).
• The residual variance displays an increase before the minimum, or does not decrease at all (in PLSR).
• There are conspicuous “isolated objects” (possible outliers) in scores, residuals or influence plots, or groups of similar outlying objects.
• There are similar isolated, outlying variables, although the interpretation of unnecessary, or bad, variables is not identical to that pertaining to objects.
• The distribution of objects in T vs U score-plots (the X–Y relation outlier plot) has a non-linear shape or shows a worrisome presence of groups (“clumpiness”), among other irregularities.
• There are systematic patterns in the residuals.
• New data are not predicted well.

7.14.3 Possible reasons for bad modelling or validation results

• Outliers are (still) present in the training data (and/or in the test set data).
• Data are not representative: for instance, the calibration data are not representative of future prediction conditions, or the validation data are not representative with respect to the calibration data. There are several other possible reasons for this characterisation as well.¶
• Unsatisfactory validation (wrong validation method).
• Lack of variability in calibration data set range(s); ditto re test set range.
• Need for pretreatment of raw data (e.g. linearisation).
• Inhomogeneous data (subgroups, “clumpy data”).
• Errors in sample preparation.
• Systematic errors in experiments (significant bias, either sampling bias and/or analytical).
• Instrument errors, e.g. drift, upset.
• Lack of information—there is in fact no X–Y relationship (always a possibility).
• Strongly non-linear X–Y relationships.

7.15 Chapter summary


This chapter introduced the concepts and methods used to develop
multivariate regression models. Starting with simple univariate and
MLR models, the pitfalls of these approaches were described
followed by a discussion of why bi-linear multivariate models offer so
many advantages over the simpler model types.
The main model discussed was partial least squares regression
(PLSR), a bi-linear regression approach that aims to use reference Y-
data to guide the decomposition of the X-data such that the final
model uses a minimum number of factors; this crucial dimensionality
is determined by invoking proper validation. The other model type
discussed was principal component regression (PCR), which
overcomes the limitations of MLR by using PC scores as inputs to the
MLR equation.
PLSR offers two extremely useful plots compared to other
methods, namely the X-loading weights plot and the X–Y relationship
outlier plot (t–u plot). Unlike X-loadings in PCA/PCR, X-loading
weights describe the X-variables that effectively model the Y-
responses for a given PLSR factor. This provides X-loading weights
with greater interpretability than regular loadings (unless
of course, the loadings and loading weights are identical).
The X–Y relationship outlier plots show how each PLSR factor
models the response(s). When the relationship between the PLSR
X-score (t) and the Y-score (u) is no longer linear, or when the
particular t–u correlation of a factor no longer contributes to the
modelling of the response(s), this is the dimensionality at which
factors should stop being added to the model.
Correct validation and interpretation of a multivariate regression
model is critically important, and PCR and PLSR models provide
many diagnostic tools for interpretation in both X- and Y-space.
When developing a multivariate regression model, the number of
components/factors to include must be properly validated; this choice
is highly dependent on the complexity of the system being modelled. It
is also of importance that the model prediction error be statistically
comparable to the standard error of laboratory (SEL) for the reference
method. Since error propagates from the reference to the alternative
(prediction) method, the prediction method can never be more
accurate than the reference method, but it can be more precise.
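A commonly used rule of thumb, added here for illustration (it is an assumption, not a formula given in this chapter), expresses this error propagation by treating the reference-method error (SEL) and the intrinsic method error as approximately additive in variance:

```python
import numpy as np

# Rule of thumb (an assumption for illustration, not a formula from this chapter):
#     RMSEP_observed**2  ~  SEL**2 + SEP_method**2
# so the apparent prediction error can never fall below the reference error SEL.
def approx_method_error(rmsep_observed, sel):
    """Approximate method-specific prediction error after removing the
    reference laboratory error (SEL), clipped at zero."""
    return np.sqrt(max(rmsep_observed ** 2 - sel ** 2, 0.0))
```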
When provided with a data set with a large number of X-variables
(spectroscopic data or similar), there is no need to use all of the
variables to make a model. The process of using a smaller, yet more
significant, set of variables, i.e. variable selection, can often be called
into use, and for this purpose the method of jack-knifing was
introduced. This method was used to show how various samples and
variables influence the stability of a model.
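The sketch below illustrates the general idea of such a jack-knife stability assessment, assuming scikit-learn is available; the segmentation scheme and the simple two-standard-deviation rule are illustrative assumptions and do not reproduce the exact procedure of reference [16] or of any particular software package.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold

def jackknife_coefficient_stability(X, y, n_factors, n_segments=10):
    """Jack-knife style assessment of PLSR regression-coefficient stability:
    refit the model with one segment left out at a time and collect the
    resulting coefficient vectors."""
    coefs = []
    for train_idx, _ in KFold(n_splits=n_segments, shuffle=True, random_state=0).split(X):
        model = PLSRegression(n_components=n_factors).fit(X[train_idx], y[train_idx])
        coefs.append(np.ravel(model.coef_))          # one coefficient vector per sub-model
    coefs = np.array(coefs)
    mean_b = coefs.mean(axis=0)
    std_b = coefs.std(axis=0, ddof=1)
    # Variables whose coefficients hover around zero across the sub-models are
    # candidates for removal in a subsequent variable-selection step.
    unstable = np.abs(mean_b) < 2 * std_b
    return mean_b, std_b, unstable
```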
Finally, an advanced example showed how applying well-chosen
preprocessing methods can reduce the complexity of a PLSR model.
By implicitly removing the effects of baseline and scatter from the
data, rather than using PLSR factors to explicitly account for such
effects, the simplest model can be achieved which is not only easy to
interpret, but is more robust in the long term.
The information in this chapter will set the new data analyst on a
path to a rational and robust approach towards professional
multivariate regression modelling.

7.16 References
[1] Draper, N.R. and Smith, H. (1998). Applied Regression Analysis, 3rd Edn. John Wiley & Sons. https://doi.org/10.1002/9781118625590
[2] Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J. and Smeyers-Verbeke, J. (1998). “Quantitative structure activity relationships (QSAR)”, in Handbook of Chemometrics and Qualimetrics, Part B. Elsevier, Ch. 37, pp. 383–420. https://doi.org/10.1016/S0922-3487(98)80047-6
[3] Naes, T. and Isaksson, T. (1992). “Locally weighted regression in diffuse near-infrared transmittance spectroscopy”, Appl. Spectrosc. 46, 34–43. https://doi.org/10.1366/0003702924444344
[4] Höskuldsson, A. (1996). Prediction Methods in Science and Technology, Vol. 1. Basic Theory. Thor Publishing, Denmark.
[5] Swarbrick, B. (2016). “Chemometrics for near infrared spectroscopy”, NIR news 27(1), 39–40. https://doi.org/10.1255/nirn.1584
[6] Wold, H. (1966). “Estimation of principal components and related models by iterative least squares”, in Multivariate Analysis, Ed by Krishnaiah, P.R. Academic Press, NY.
[7] Ergon, R. (2002). “PLS score-loading correspondence and a B-orthogonal factorisation”, J. Chemometr. 16, 368–373. https://doi.org/10.1002/cem.736
[8] Ergon, R., Halstensen, M. and Esbensen, K.H. (2011). “Model choice and squared prediction errors in PLS regression”, J. Chemometr. 25, 301–312. https://doi.org/10.1002/cem.1356
[9] Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
[10] Jackson, J.E. and Mudholkar, G.S. (1979). “Control procedures for residuals associated with principal component analysis”, Technometrics 21, 341–349. https://doi.org/10.1080/00401706.1979.10489779
[11] Jackson, J.E. (1991). A User’s Guide to Principal Components. Wiley Series in Probability and Mathematical Statistics, Applied Probability and Statistics. Wiley. https://doi.org/10.1002/0471725331
[12] Swarbrick, B. and Westad, F. (2016). “An overview of chemometrics for the engineering and measurement sciences”, in Handbook of Measurement in Science and Engineering, Ed by Kutz, J. John Wiley & Sons, Hoboken, NJ. https://doi.org/10.1002/9781119244752.ch65
[13] Esbensen, K.H. and Geladi, P. (2010). “Principles of proper validation: use and abuse of re-sampling for validation”, J. Chemometr. 24, 168–187. https://doi.org/10.1002/cem.1310
[14] ICH Harmonized Tripartite Guideline Q2(R1) (1997). “Validation of analytical procedures: text and methodology”, Federal Register 62(96), 27463–27467.
[15] Broad, N., Graham, P., Hailey, P., Hardy, A., Holland, S., Hughes, S., Lee, D., Prebble, K., Salton, N. and Warren, P. (2002). “Guidelines for the development and validation of near-infrared spectroscopic methods in the pharmaceutical industry”, in Handbook of Vibrational Spectroscopy, Ed by Chalmers, J.M. and Griffiths, P.R. John Wiley & Sons. https://doi.org/10.1002/0470027320.s8303
[16] Westad, F., Byström, M. and Martens, H. (1999). “Modified jack-knifing in multivariate regression for variable selection and model stability”, in Near Infrared Spectroscopy: Proceedings of the 9th International Conference, Ed by Davies, A.M.C. and Giangiacomo, R. NIR Publications, Chichester.
[17] Esbensen, K.H. and Mortensen, P. (2010). “Process sampling (Theory of Sampling, TOS)—the missing link in process analytical technology (PAT)”, in Process Analytical Technology, 2nd Edn, Ed by Bakeev, K.A. Wiley, pp. 37–80. https://doi.org/10.1002/9780470689592.ch3
[18] Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics, Volume 38. Capital City Press, VT, USA.
[19] Martens, H., Nielsen, J.P. and Engelsen, S.B. (2003). “Light scattering and light absorbance separated by extended multiplicative signal correction. Application to near-infrared transmission analysis of powder mixtures”, Anal. Chem. 75, 394–404. https://doi.org/10.1021/ac020194w
[20] Romanach, R. and Esbensen, K.H. (2016). “Theory of Sampling (TOS) for development of spectroscopic calibration models”, Amer. Pharm. Rev. September/November, Spectroscopy/Review section, pp. 1–3.
* “The same sample” is the view from the analytical laboratory and the computer on which
the chemometric software resides. It is imperative not to forget, however, that “samples”
always originate through some process in which sampling is involved. Sometimes this full
understanding has consequences for the ease with which a relevant set of “the same
samples” can be, or will have to be produced. This reminder points to the broader context in
which all multivariate calibrations exist.
† See footnote * regarding the fact that in real life, analytical results most specifically do not
exist only as “data”, i.e. as a potential collection of all possible analytical data which can be
sampled easily enough in the strict statistical sense; they were in fact meticulously arrived at
by the complete pathway from lot-to-analysis, a framework which receives the utmost
emphasis in this book.
‡ There is a burdensome tradition in statistics and data analysis of not always using a strict
terminology. A “factor” is strictly speaking a factor in Factor Analysis, whereas a
“component” is a component in PCA. Both are linear combinations of all the P variables
making up the X matrix, or ditto re the Q variables in the Y matrix. Unfortunately, in the
international literature factors and components are rather often used quite liberally and
interchangeably. Thus “factor” is sometimes used when meaning PCA, PCR or PLS
components. Blame the historical development of chemometrics…
§The systematics of vector and matrix names in the algorithmic overviews of methods were
laid down from the very first beginnings of chemometrics in a very helpful way: “X→Y, look
for the next letter in the alphabet”.
¶ Calling data “representative”, or “non-representative” is clearly one of the most complex
justification tasks one can meet. Representativity is a term that can be (should be) applied to
the sampling process, to the sample preprocessing steps in the laboratory, to the analytical
process and, finally, to the choice of model (model selection). All this takes place before the
“data” originated; thus, stipulating a matrix as indeed containing “representative data” is a
statement for which one can get into the hot seat very quickly. One had better know whereof one
is speaking—in all aspects. Very often one finds an extremely frivolous use of this crucial
term in the pertinent literature, where it is in fact mostly used as wishful thinking…
8. Principles of Proper Validation
(PPV)

This chapter builds directly on the previous chapters on sampling
(chapter 3), PCA (chapter 4) and multivariate regression (chapter 7).
Validation is presented using the example of multivariate
calibration/prediction, but the general principles apply to all data
modelling for which a performance is to be assessed, e.g. model fit,
classification, prediction, time-series forecasting etc.

8.1 Introduction
A set of generic Principles of Proper Validation (PPV) is presented,
based on five distinctions:
i) The PPV are universal and apply to all situations in which validation
assessment of performance is desired: modelling, prediction,
classification, time series forecasting.
ii) The key element behind PPV is the Theory of Sampling (TOS), which
provides insight into all variance-generating factors, especially the
“Incorrect Sampling Errors” (ISE). If not properly eliminated, these
are responsible for an inconstant sampling bias, for which no
correction is possible, contrary to the widespread statistical
tradition for “bias correction” (see also chapter 3). A sampling bias
wreaks havoc with all types of validation—except test set
validation. The sampling strategy should always address how
known (and unknown) sources of variation can meaningfully be
included in the first dataset which is to be modelled, the training
set. Thus, sampling includes qualitative information about time,
location, batch ID and other qualitative information related to the
sampling strategy, and not only the physical procedure. Such a
comprehensive overview is imperative for a reliable estimation of
future prediction or classification performance.
iii) Validation cannot be understood solely by focusing on the
method(s) of validation—it is not enough to be acquainted only with
a particular validation scheme, algorithm or implementation as has
been a longstanding tradition in chemometrics. Validation must be
based on full knowledge of the underlying definitions, objectives,
methods, effects and consequences of the full sampling–reference
analysis–data analysis process.
iv) Analysis of the most general validation objectives leads to the
conclusion that there is one valid paradigm only: test set validation.
In this chapter, the most important alternative validation
approaches are discussed, critiqued and rejected in this
perspective.
v) Contrary to contemporary chemometric practices and validation
myths, cross-validation is unjustified in the form of a one-for-all
procedure for all data sets. Within its own methodological scope,
cross-validation is shown to be but a suboptimal simulation of test
set validation, as it is based on one data set only (the training data
set). However, such a “first” singular data set could, occasionally,
be representative of future variation, but one would never know
whether this is the case within the one-sample-set context alone;
only external experience or evidence would be decisive. Many re-
sampling validation methods suffer from this principal deficiency.
Thus, while there are cases in which cross-validation finds excellent
use, for example, categorical segment cross-validation (in which
categories can be batches, seasons, alternative models or pre-
treatments), see below, such cases represent only special
circumstances and no generalisation regarding validation principles
can be made hereupon.
This chapter shows how a second data set (test set, external
validation set) constitutes a minimum critical success factor for
inclusion of the sampling errors incurred in all “future” situations in
which the validated model is to perform. From this it follows that all re-
sampling validation approaches based on a singular data set only (for
example a data set sampled on one day only, or with one instrument
only, or addressing one batch of raw materials only) should logically
be terminated, or used only with full scientific understanding and
disclosure of their detrimental limitations and consequences. This
chapter builds on a comprehensive analysis of validation in Esbensen
and Geladi [1]; see also Westad and Marini [2].
In the central case of PLSR, a call is here made for stringent
commitment to test-set validation based on graphical inspection of
pertinent t–u plots for optimal understanding of the underlying X–Y
data structures and interrelationships for validation guidance. The t–u
visualisation, or similar regarding other multivariate data modelling
methods in need of validation, is critical in order to stop continuation
of decades of “blind” use of a one-for-all procedure cross-validation,
a state of affairs that has led to quite some confusion among
generations of chemometricians.

8.2 The Principles of Validation: overview


Throughout the history of chemometrics discussions have often
surfaced as to what constitutes proper validation. There are few other
topics which have led to a more marked and heated set of different
opinions. Discussions on this basis can be broad-ranging and
informative, while at other times personal, emotional and counter-
effective. The matter at hand is a thoroughly scientific one, however.
In chemometrics, validation is most known in the context of
prediction validation, of which there are (at least) four types: test-set
validation, cross-validation, “correction validation” (leverage
correction is the prime example) and re-sampling validation methods
(bootstrap, jackknife, Monte Carlo simulation, permutation testing).
The standard cross-validation can be viewed as a single instance of
the jack-knife re-sampling procedure (chapter 7, section 7.13)—and it
can also be viewed as a simulated test set validation, albeit a fatally
flawed simulation.
This chapter illustrates the central PPV by a phenomenological
analysis of prediction validation and its objectives in the specific
multivariate calibration context.
The PPV are concerned with the question of how to establish a
general validation approach that does not depend upon assumptions
of any specific data structure(s), nor associated with any specific
variant of the many validation method alternatives that can be found
in the literature.
A few salient definitions are needed at the outset—strictness and
preciseness are much-needed commodities in the validation debate:

proper adj.: adapted or appropriate to the purpose or circumstance
valid adj.: sound; just; well-founded; producing the desired result
validate v.t.: to make valid; substantiate; confirm

One reason for much of the often deeply felt differences-of-
opinion regarding what constitutes proper validation relates to the
fact that it involves both statistical as well as domain-specific issues
(e.g. chemical, physical, data analytical or other error issues). A
significant proportion of the historical debate simply reflects a much
too limited point of departure from which is typically attempted to
draw far too sweeping generalisations—for example, the belief that
validation is exclusively a statistical issue. Within this view “sampling”
is simply a matter of drawing from a population of independently and
identically distributed (i.i.d.) measurements; this understanding is
termed statistical sampling (sampling ) in what follows.
STAT

In the present context, a broader, TOS-based holistic


understanding of the interconnected sampling, analysis and
validation issues is advocated, while taking care not to fall into the
opposite, equally simplistic position, viz. that all data matrices result
from sampling from heterogeneous material. However, most of the
activities in the field of data analysis and data modelling, in fact, do
occupy a realm in which one has to assume the presence of
significant data errors. If these errors are neglected, they will cause
grave prediction and validation problems. It is, therefore, also
necessary to use the term samplingTOS. The cases in which pure
statistical sampling suffices can simply be treated in an
identical fashion alongside the much more prevalent sampling-error
cases, allowing for unity in all validation considerations. The issues
regarding validation are neither about opinions (personal,
institutional), nor about following one or other established schools-of-
thought or traditions (thereby dodging a personal responsibility for
understanding and method selection). All validation issues are fully
tractable and lend themselves to rational discussion and sound,
objective analysis that ultimately lead to impartial conclusions.

8.3 Data quality—data representativity


The PPV need a few initiating discussion points related to the
concepts of data quality, data representativity and sample
representativity.
“Data quality” is a broad, but often loosely defined term; any
definition that does not include the specific aspect of representativity
is lacking, however. The term “data” is often equated with
“information”, but it is obvious that this can only be in a latent,
potential form. It takes data analysis with appropriate, problem-
context interpretation to reveal the structural “information” residing in
data matrices. In chemometrics, the prime interest is, of course, on
data analysis, while issues pertaining to the prehistory of a data table
usually receive but scant attention. In fact: “Chemometricians analyse
data…” is an often-heard statement going a long way towards
rejecting any chemometric responsibility for data quality, and hence
also for sample representativity. Nothing could be more dangerous,
however. One exception is Martens and Martens [3] which addresses
“multivariate analysis of quality”, where the focus is stated to relate to
the “quality of information”, which is defined as “...dependent on
reliability and relevance”. However, reliability and relevance are open-
ended, general adjectives which must be given a specific,
unambiguous meaning from the problem context at hand. It has,
therefore, been argued that a far more relevant characteristic is
representativity, partly because a clear definition is at hand, but
mainly because the specific derivation of this definition in TOS allows
for a comprehensive account of the underlying phenomenon of
heterogeneity.
It is mandatory to contemplate the specific origin of any data set
and in this context, data analysis is always dependent upon at least
one primary sampling stage in order to produce the sample, PAT or
sensor signal acquisition stage no exception, often including mass-
reduction and sample preparation in later sampling stages. An
analytical stage (i.e. chemical, physical, measurement etc.) is also
required before data analysis can commence. It is, therefore, an
inescapable conclusion that “reliable analytical information” must be
based on representative samples. In this chapter, a critical distinction
is made between statistical sampling and the kind of physical
sampling addressed by TOS. In chemometrics, it is necessary to be
competent with respect to both these kinds of “sampling”.
There will always be large, significant or alternatively only small
sampling/signal acquisition errors involved—the point being that at
the outset this quantitative issue is unknown, and therefore cannot be
dismissed without grave danger. In chemometrics, the type of errors
colloquially known as “measurement errors” are mostly considered to
be related to the X-data only, typically conceptualised in the form
“instrumental measurement errors”, while they logically also must
refer to analytical errors pertaining to “reference measurements” (Y-
data in calibration). These effects are all incorporated into the
concept of Global Estimation Error (GEE) within the realm of TOS,
thus allowing a rational discussion of all sampling and analysis errors
and their impacts on data quality. By dealing universally with these
issues as if sampling issues were always significant, all cases can be
treated identically in a rational and efficient manner, covering all
combinations of large and/or small statistical errors as well as
large/small TOS-sampling errors (including the rare, pure statistical
case alluded to above).
The position most often met throughout the history of
chemometrics is that of simply assuming all sampling errors are
insignificant. This attitude represents a fatal illegitimate generalisation
for which no proof has ever been presented. Chemometric data
analysis without sufficient attention to the full context of relevant pre-
data issues cannot be considered comprehensive, but is in fact
incomplete, indeed unscientific. It is noteworthy how the complex
validation scenario is all too often simply swept under the rug in the
quest for a one-for-all method that automatically takes care of all the
troublesome issues. This is called “Chemometrics without thinking”.

8.4 Validation objectives


Validation, in the multivariate calibration context, means assessing
that the prediction performance is valid, i.e. to confirm that a
particular prediction model is fit for purpose. Usually the prediction
performance is specified as a certain prediction uncertainty (error)
maximum threshold, or similar. But it is known that the prediction
accuracy is a characteristic that must relate back all the way to the
original lot/material (or population in certain cases): the prediction
accuracy is a performance characteristic that refers to the features of
the original lot, represented by the reference analytical values, which
themselves require proper validation before being used to calibrate a
model.
This objective does not refer only to the technical
calibration/modelling and validation process, but first and foremost to
the circumstances surrounding the future performance for new
predictions when using the model on “similar data”. This means that
both the training and validation data sets must be as similar as
possible to those new data sets pertaining to the “future” working
situation in which the model is to perform its task. Contrary to a
current misconception, the training set and the test set should not be
made as identical to one another as possible.
Thus, already when designing and selecting a training data set for
modelling it is imperative also to pay attention to how the model is to
be validated. This means, that one must always try to be in a position
also to be able to choose at least an additional, second independent
data set, to be used only for validation. Such a data set is hereafter
generically called the test set. This is the data set with which to
represent the future working situation of the particular model. As shall
be clear below, at times this may demand some work, sometimes a
lot of work—there may not always be easy fixes here. Irrespective,
however, all prediction models must be validated w.r.t. realistic future
circumstances. It is simply not good enough to secure double as
many objects (samples, measurements) as what appear sufficient for
modelling and then slice off 50% (performing the so-called “test set
split”)*. It will become clear that there is much more involved in
establishing a realistic, reliable validation foundation. This is not to
say that the splitting of a pool of samples into calibration and
validation sets is always bad, it just requires a relevant, problem
specific approach to the design of the data sets involved; above all it
requires a complete understanding of all the issues treated in this
chapter.
In data analysis, statistics and chemometrics, ~20 years ago,
there was a somewhat rude awakening to the fact that far too little
prediction validation was on the agenda, mere modelling fit having
taken precedence. In Höskuldsson’s [4] reassessment of the entire realm of
“Prediction methods in science and technology”, it was described
how modelling fit assessment reigned pretty much supreme as
compared to the necessary, complementary prediction validation, for
which he introduced the “H-principle” of balanced assessment of
both modelling and prediction performance. Today, there is a much
more widespread awareness that modelling fit optimisation is
necessary, but there is still not a sufficiently well-known criterion for
prediction performance. Höskuldsson pointed to the Heisenberg
Uncertainty principle from quantum mechanics when naming his H-
principle of balanced modelling and validation complementarity.†
8.4.1 Test set validation—a necessary and sufficient paradigm

The central theme of the present approach can be stated in quite
unambiguous terms: All other validation methods are but simulations
of test set validation, with various flaws.

simulate v.t.: to assume or have the appearance or characteristics of

The objectives of test set validation are always structurally correct
and complete. If a proper test set was always obtainable (and this is
not that difficult, see further below), no other validation procedure
need ever have been introduced; test set validation would then be the
only validation method in existence.

8.4.2 Validation in data analysis and chemometrics


Validation can be used for many different purposes; it is relevant to
speak of internal as well as external validation scenarios. Below are
discussed both legitimate and illegitimate approaches to validation
focused on multivariate calibration for prediction purposes. Thus,
cross-validation used on one data matrix (X) only, i.e. PCA (chapters
4 and 6) and SIMCA (chapter 10) is not covered in full, but neither is
this necessary for the purposes of this explanation. It is
straightforward to apply PPV for these cases, as most aspects of the
analysis, discussion and conclusions from the prediction scenario
can be carried over without loss of generality. It is the overarching
principles of validation which are in focus, followed by examples and
data analysis assignments which will allow the developing data
analyst ample opportunity to become familiar with all the intricacies of
practical validation.
8.5 Fallacies and abuse of the central limit
theorem
Abraham de Moivre’s Central Limit Theorem [5], also called normal
convergence theorem, states: A collection of means of reasonably
large subsamples taken from a large parent population forms a
population that is normally distributed around the mean of the parent
population, no matter what the distribution of the parent population
is.
It is critically important to note that large statistical subsamples
are required and that the mean of the parent population is only found
in the limit for many such statistical subsamples. On many occasions
this point appears to be gravely misinterpreted or forgotten. Re-
sampling on a single data set (e.g. the training data set), often of a
significantly small size, or even on a fractional subset of a training set,
can only lead to knowledge about this very set alone—and only very
little, if any, useful knowledge about the parent population and much
less as to future samples not yet sampled. Much of the popularity of
cross-validation may be based on a too-swift dependence on the
respectability accorded to the central limit theorem. See Wonacott
and Wonacott [6] and Devore [7] or many current statistics textbooks
for a full description of the central limit theorem.
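The following numpy sketch (the distribution, subsample sizes and counts are arbitrary illustrative choices) demonstrates both sides of this argument: means of many reasonably large subsamples from a skewed parent population are approximately normally distributed around the parent mean, whereas re-sampling tiny segments of one small training set mainly characterises that particular set.

```python
import numpy as np

rng = np.random.default_rng(0)
parent = rng.exponential(scale=1.0, size=100_000)   # strongly skewed parent population

# Means of many reasonably large subsamples: approximately normal around the parent mean.
large_subsample_means = [rng.choice(parent, size=200).mean() for _ in range(2000)]

# Re-sampling tiny segments of one small training set tells us mostly about that set alone.
small_training_set = rng.choice(parent, size=40)
tiny_segment_means = [rng.choice(small_training_set, size=4).mean() for _ in range(2000)]

print(np.mean(large_subsample_means), np.std(large_subsample_means))
print(np.mean(tiny_segment_means), np.std(tiny_segment_means))
```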
The key issue here is that the singular training set is supposed to
carry all possible and necessary information about the background
population, including that it is also able to represent any other (new)
data set(s) to be sampled in the future. In a very real sense the size of
the training set is only but the first quality criterion for getting this
statistical inference correctly started; much more important is the
“coverage” of future data sets as compared with that of the training
set. Small training data sets are often seen re-sampled/subdivided
into even smaller segments which are supposed to perform the role
of a “reasonably large subset” in the sense of de Moivre above.
Clearly a very dangerous practice! It is questionable how often this
fundamental limitation is known, far less respected, when doing a
“routine cross-validation” on a typical training data set with but a
relatively small number of objects, often less than, say, 50. Add to
this the central message of chapter 3, i.e. significant presence of non-
stochastic sampling/signal acquisition errors (TOS-errors). The most
common argument encountered from typically inexperienced
chemometricians is that too small a training set is available for test
set validation, so… “Just perform cross-validation… and all will be
well”. Well, not so fast….

8.6 Systematics of cross-validation


It is advantageous to treat all cross-validation variants under a
systematic heading, termed segmented cross-validation. This allows
significant simplification in discussing the historically disparate
variants: Leave-one-object-out (LOO), the plethora of differently
segmented cross-validation variants, including the so-called “test set
split” option (a particularly obfuscating terminology for an otherwise
straight two-segmented cross-validation approach). Indeed, two of
these names could not have been chosen in a worse fashion; “test
set split” is a terrible misnomer as no test set can ever be created
from this procedure—and “full cross-validation (LOO)” is actually the
worst of all possible segmented cross-validations, to be explained
below. The formal definition of “segmented cross-validation” is as
follows.
Depending on the fraction of training set samples (totalling N) held
out for cross-validation, an optional range of “s” potential validation
segments will be available for the data analyst, the number of
segments falling in the interval s = [2, 3, 4,..., (N – 1), N]. Various
“schools-of-thought” of cross-validation have developed over history
within chemometrics and elsewhere, some favouring “full cross-
validation” (one object per segment, resulting in N segments in total),
some defining 10 segments as the canonical number and others
favouring similar schemes each with its own preference (e.g. 3, 4 or 5
segments), whereas a small, but steadily growing minority see more
complexity in this issue than a more-or-less arbitrary selection from
the full range of s options. Reflection reveals that there always exist
(N – 1) potential cross-validation variants for any given data set with
N samples, but no set of principles for objective determination of the
optimal number of segments has ever been offered in the data
analysis, chemometric or statistical literature.
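By way of illustration, the sketch below runs the full series of segmented cross-validations for a given data set and collects RMSECV for each choice of s, the kind of systematic comparison summarised in Figure 8.2. It assumes scikit-learn is available; the data set, number of factors and segment range are placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

def rmsecv_per_segmentation(X, y, n_factors, segment_counts):
    """RMSECV for a series of segmented cross-validations, s = 2, 3, ..., N.
    s = N corresponds to leave-one-object-out ("full") cross-validation."""
    results = {}
    for s in segment_counts:
        cv = KFold(n_splits=s, shuffle=True, random_state=0)
        y_cv = cross_val_predict(PLSRegression(n_components=n_factors), X, y, cv=cv)
        results[s] = float(np.sqrt(np.mean((np.ravel(y_cv) - np.ravel(y)) ** 2)))
    return results

# e.g. rmsecv_per_segmentation(X, y, n_factors=3, segment_counts=range(2, len(y) + 1))
```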
Below, some hard-won experiences with hundreds of (very)
diverse data set types and data structures are presented that will
allow an easy overview of the systematics of validation.

8.7 Data structure display via t–u plots


The canonical formulation of the PLSR-1 algorithm (X and Y: mean-
centred and scaled as needed) stipulates equations 8.1 and 8.2, in
which t1 is the first PLS X-score, w1 is the first PLS X-loading weight,
u1 is the first PLS y-score and q1 is the first PLS y-loading.
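As an aid to interpretation, the following numpy sketch computes these first-factor quantities using the widely used NIPALS-type relations; it is given for illustration only, is not a reproduction of equations 8.1 and 8.2, and conventions for the scaling of u1 differ between implementations.

```python
import numpy as np

def pls1_first_factor(X, y):
    """First-factor quantities for PLS1 (NIPALS-type relations).

    X (N x P) and y (length N) are assumed mean-centred and scaled as needed.
    Returns t1 (X-score), w1 (X-loading weight), q1 (y-loading) and u1 (y-score),
    i.e. the quantities displayed in a t-u plot for factor 1.
    """
    w1 = X.T @ y / np.linalg.norm(X.T @ y)   # first X-loading weight
    t1 = X @ w1                              # first X-score: projection of X onto w1
    q1 = (y @ t1) / (t1 @ t1)                # first y-loading
    u1 = y * q1                              # first y-score (for PLS1, proportional to y)
    return t1, w1, q1, u1

# Plotting u1 against t1 (and correspondingly for higher factors after deflation)
# gives the t-u plots discussed in the following.
```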
The PLSR algorithm also calculates higher order sets (ta, wa, ua
and qa) [a = 2, 3, 4…], after suitable deflation allowing the following
general appreciation. Plotting ua against ta, [a = 1, 2, 3, 4…] reveals
the succession of the so-called “inner relationships”, which are direct
visual manifestations of the data structure present. It is precisely
these plots that are used as a vehicle for visual assessment of outliers
etc. in any regression context. The “t–u plots” also constitute a useful
check of whether a possible next component seems meaningful or
not, as evidenced by the strength of “inner” partial regressions as the
dimensional index, a, is increased by one. The t–u issues are identical
also for the PLS2 case, in which ua is no longer a scaled (and
sequentially deflated) version of Y alone, but a linear combination of
all Q Y-variables.
Although the PLSR algorithm can be written without explicit
projections in the Y-space, i.e. without equations 8.1 and 8.2, see e.g.
Martens and Næs [8], Martens and Martens [3], there is a serious loss
of potential information about the data set in doing so, as the purpose
of getting visual insight into the empirical data structure is completely
lost. It is immensely informative to assess the specific data structure
by t–u plots. In order to be able to take proper action with respect to
the actual data structure present in training, test or “future” data sets,
a general typology of the principal types of data structures associated
with multivariate calibration is presented in Figure 8.1.
There are three underlying features characterising the particular
manifestations of any multivariate t–u data structure: i) the number of
objects involved, N, ii) the degree of linear (or non-linear) correlation
present between X and Y and iii) data clustering, grouping
(“clumpiness”) and/or significant outliers. The four first cases shown
in Figure 8.1 constitute a systematic series of strong/weak correlation
vs small/large N, outlining the full spectrum of typical data sets for
which PLSR modelling is relevant and legitimate.
As an important contrast, three of the four latter cases in Figure
8.1 represent deviating covariance data structures for which PLSR-
modelling should never even have been contemplated. It is obvious,
that without t–u visualisation, such cases run a very high risk of going
unnoticed. The validation literature is rife with examples of over-
generalised validation discussions and conclusions—which, when
shown on the simple t–u plot simply to be related to such degenerate
data structures, are in reality nonsensical and which should never
have been published. Collegial considerations disallow specific
references.
It cannot be overemphasised how much data structure
information can be gained from diligent inspection of components
cross plots. Many such plots are used throughout this book,
complete with interpretations of the meaning of the data structures
revealed. This is perhaps one of the strongest attributes of PLSR.
Contrary to the case for PCA, in which there only exist t–t plots, in
PLSR (as in PCR), the t–u cross-plots display highly relevant insight
into the X–Y inter-space data structures. While t–t plots can always
also be called up for PLSR model inspection, these are normally only
consulted for specific purposes behind the regression
modelling in which the X-space data relationships are of interest in
themselves. For the so-called “PLS-constrained X-space modelling”
the objective is often specifically to observe the X-space projections,
and here inspection and interpretation of t–t plots will play a central
role, while for ordinary feed forward X→Y regression model building,
t–u plots are the relevant information carriers.
A key issue of the highest value for data analysis beginners is the
following: PCA-like t–t plots are not the proper, and very far from the
most efficient, place to look for outliers when building a regression
model (PLSR, PCR). Identification of outliers influencing PLSR
modelling can most meaningfully, and shall here in fact only, be
pursued on the series of relevant t–u plots. This is an insight that will
spare novice data analysts a lot of grief and which will cut short an
otherwise long learning phase.
With the help of t–u plots, it is easy to appreciate that an empirically
close match between cross-validation and test set validation results
(evidenced by “similar” Aopt and RMSEP) is simply a
manifestation of a particularly strong correlation between the X- and
Y-spaces, e.g. cases a) and c) in Figure 8.1. But this holds exactly
only for strongly correlated t–u data structures, from which it follows
that no generalisation is allowable to other data structures. This is an
example of an illegitimate generalisation, because it is based on a
particular data set structure only. The chemometrics literature is rife
with examples of such invalid transgressions within the realm of
validation.
t–u plots must be inspected for every regression model to be
validated as they will elucidate the underlying relationship between X
and Y (sample groups, non-linearities) and also to indicate the correct
model dimensionality.
There are several traditions within chemometrics that do not
adhere to this flexible understanding, however, which instead
prescribe “blind” adherence to one particular version of cross-
validation with no graphic inspection, i.e. which rely on “blind” cross-
validation with a fixed number of segments for all data sets (aptly
called straightjacket cross-validation). But even a cursory overview of
the principal correlation relationships delineated in Figure 8.1 leads to
the insight that a fixed number of segments will work in markedly
different ways depending on the specific data structure encountered.
This goes a long way to explain why repeated cross-validation,
identical but for alternative starting segment definitions, can lead to
significantly different validation results. A fixed number of segments
can never be said to pay the necessary tribute to the many different
data structures met with in the realm of science, technology and
industry. One, rigid scheme most emphatically does not fit all. There
is ample justified reservation as to the plethora of claims in the
literature, all hailing the so-called “robustness” of cross-validation.
These claims are simply wrong.
By way of contrast, based on Figure 8.1, all types of varying
validation results are completely comprehensible—a simple mental
picture of selecting a fraction of the N objects displayed in a t–u plot
(which is tantamount to selecting a segment in cross-validation)
allows the data analyst to picture the resulting effect of sub-modelling
of the remaining objects (perhaps first after a little experience, but
that is exactly the reason for working through a chemometric
textbook…).
Upon reflection, these relationships are but the reverse issue of
the often claimed optimistic, but not fully thought through, cross-
validation credo: validation is on safe ground as long as/if several
variants of validation, including several different segmented cross-
validations result in similar validation results (identical number of
components, “similar” RMSECV). All is indeed well if-and-when this
hopeful situation occurs—however, the only thing that has been
demonstrated is a case of a strongly correlated X–Y relationship, as
depicted in Figures 8.1a, c, f or h. In reality nothing has been revealed
as to the future prediction potential, unless it has been independently
proven beyond reasonable doubt, that this strong X–Y correlation
remains the defining feature also of all possible “future” data sets on
which the regression model is to perform. Such a relationship cannot
ever be taken on blind faith, however, but should be substantiated
based on theory and background knowledge about the actual
application. This would correspond to believing that all training data
sets are always and at all places… a 100% valid and reliable
representation of the potential myriad of all future data sets
sampledstat (from a population), or sampledTOS from a heterogeneous
target. If this were indeed so, there would be no need for validation:
the training set modelling fit would be all that was ever needed, as it
would be universal—alas, reality checks in here with a very different
lesson, see chapter 3; this is a hopelessly naïve belief.
Figure 8.1: Eight correlation data structures as depicted in PLSR t–u plots. There are three
features that determine the appearance of a t–u data structure: i) the number of objects, N, ii)
the degree of linear (or non-linear) correlation present and iii) data clustering or grouping
(“clumpiness”). An attempt has been made to cover as well as possible all principal data
structure types met with in practical data analysis. Cases e) and h) should never have been
subjected to regression modelling in the first place; case g) cannot be modelled with a one-
component model unless a satisfactory linearising transformation has been employed—case
g) can be modelled straightforwardly, however, with an excess of components.

Figure 8.2: Schematic behaviour of RMSECV-estimation (prediction Y-error variance) based
on the full range of cross-validation segments available to all particular data sets with N
objects, s = [2, 3, 4, … N].

Demonstration of whether such a situation only holds locally, i.e.
for one specific data set, or not, is precisely the reason behind, and the
objective of, including the second data set (test set) in the validation,
while all re-sampling approaches only deal with the singular Xtrain set
exclusively.‡
Still more insight can be gained from careful inspection of Figure
8.1. For all data sets of the type like cases a–d) one observes a
systematic regularity w.r.t. alternative segmented cross-validations
with a varying number of segments [s = 2, 3, 4, … N]. Figure 8.2
depicts the systematics of “RMSECV vs # PLSR-components” plots,
corresponding to the progression of all (N – 1) potential segmented
cross-validations for a particular data set.
There will always be a lowest RMSECV when the number of
segments is at its maximum, N (corresponding to leave one out
cross-validation, LOOCV). Conversely, when s = 2, RMSECV will be
at its maximum. These relationships hold for all reasonably regular
correlation data structures when cross-validation is performed on
one-and-the-same data set. Exceptions may occur; these
relationships may be slightly less regular, but then only due to some
influential data structure irregularity (“clumpiness” or presence of
adverse outliers, especially “transverse” outliers), which will tend to
blur the general pattern slightly. But the point is, again, there is never
any generalisation potential beyond the particular local data structure.
For the same data set, a reduction of the number of segments will,
in general, result in an increase in the RMSECV error estimate, and
vice versa, but there is no reason for confusion: different validation
setups will result in different validation outcomes; the number of
components may change in more influential cases, and the numerical
estimate of RMSECV will always change. Faulty conclusions may
easily result if, for such a particular data set, the data analyst
succumbs to the temptation to select the cross-validation setup that
corresponds to the lowest RMSECV. This may at first appear as a
legitimate cross-validation outcome, but it is only based on a
subjective desire to select an “optimal model”. Such a voluntary
approach is untenable, indeed unscientific, biased and subjective, to
say the least. In this book, the reader will be assigned the task
personally to investigate and substantiate to which degree the
general behaviour depicted in Figure 8.2 holds for the example data
sets supplied. Figure 8.2 is based on jointly accumulated 50+ years of
validation experience between the contributing authors (which
constitute only a minute fraction of the very many practicing
chemometricians who have had the same experiences).
Careful inspection of the pertinent t–u plot of any multivariate
calibration model is thus the only way to fully understand and
interpret results stemming from otherwise “blind” cross-validations.
This is possibly a reason why some data analysis and chemometric
traditions tend to avoid inspection of t–u plots; these may reveal an
inconvenient truth in the form of a complex (as opposed to an
assumed simple) X–Y data structure in regression modelling. Such
potential information does not reach the data analyst if systematic
inspection of t–u plots is not one of the first items on the model
building agenda.

8.8 Multiple validation approaches


When using segmented cross-validation several times over with
different seed sub-datasets, or when using a multitude of different
validation approaches, there is often a tacit assumption that the
majority of approaches will lead to practically the same optimal
number of components also yielding very similar RMSECV results.
When this happens, it is claimed that a successful validation has
resulted, and that the model is “robust”.§ Alas, from Figures 8.1 and
8.2, and the discussion above, it follows that this is a groundless
claim and all that has been proven is that one particular data
structure (the local data set) is characterised by a strong (X, Y)
correlation. In this situation, nothing regarding the general future
prediction performance, nor the universal applicability of any variant
of re-sampling approaches, was in fact proven in any valid sense.
It must be noted, however, that particular scenarios may at times
be so strictly bracketed that all future data sets will indeed behave
more-or-less as typological clones—laboratory calibration of
solutions abiding by Beer’s law could serve as a good example along
with many others, however, generalisation to all systems is still not
warranted. The occurrence of such cases may, or may not be met
with over the entire career of any data analyst, however, it is always
easy to find out the objective situation. Instead of being but a follower
of assumptions—always inspect and interpret the relevant t–u plots
and always perform test set validation where and whenever possible.

8.9 Verdict on training set splitting and many other myths
“Why is duplicated application of an identical sampling protocol in
order to produce two distinct data sets, Xtrain and Xtest, different from
splitting a twice-as-large Xtrain sampled in one operation?”
This is undoubtedly the most often heard remark in discussions on
validation. Below it is argued forcefully why this is indeed so. What
follows is a conceptual analysis—and a refutation of the many
objections to test validation often raised as passive justification for
continuing to apply cross-validation. The following comprehensive
analysis has never before been given at this early stage of the
education of new data analysts.
Taking care of the number of measurements (samples, objects), N,
is the easiest obligation of the experimentalist/sampler/analyst/data
analyst. But it is far more important to be in control of the variance-
influencing factors when trying to secure a sufficiently representative
ensemble of these N objects to serve as the all-important training
data set. In the literature, and from chemometrics courses, there are
usually few useful guidelines in this game—except the universal
stipulation that the training set must span the range of X and Y-values
in a “sufficient” fashion, which is a problem-dependent issue; often
this is the only consideration given to the issue of training data set
“representativity”. This is a much too shallow understanding,
however.
The critical issue is, again, heterogeneity and herewith the
obligation to be in full command regarding identification and
elimination of the sampling errors, lest a sampling bias may dominate
the measurement uncertainty budget. On this basis, any t–u plot must
be seen as a fair reflection of the sum-total of all influencing factors
on the measurement uncertainty. Sampling using a well-reflected,
problem-dependent protocol should ensure that all circumstantial
conditions are influencing the sampling process in a comparable
manner regarding both Xtrain and Xnew, i.e. they are given the same
opportunity to play out their role irrespective of who is doing the
sampling, the sample preparation and analysis. This is the role of an
objective sampling-and-analysis protocol. It is important that the
protocol has both systematic requirements securing an effective
span, as well as a modicum of random selection requirements,
deliberately trying to represent possible unknown circumstantial
effects and their impact on the correlation of the X–Y data structure.
Circumstantial conditions are capricious, however: they are time-
varying and in general defy systematisation. But any second sampling
from a bulk lot will better reflect the situation at the time, or at the
place in the future scenario, pertaining to the application of the
prediction model. This may, or may not, be characterised by the
same set of conditions as governing the training set generation. The
key issue is that the sampler has no control over which, and to which
degree, these conditions may have changed between the time of
sampling the training data set, Xtrain and the “future data set”. The
demand of a “second sampling at a future time/place” is prescribed
so as to deliver a best possible glimpse of the future application
situation. All the data analyst can rely on in this quest is to let the
second data set capture the data structure pertaining to the test set
as objectively as possible. And yes, when all of these intricacies are
fully understood and acknowledged—two independent test sets
would be (even) better. But there are limits to what one can do!
By focusing on test set validation, to the degree conditions have
indeed changed, there is now a trustworthy representation hereof
involved in the validation, as illustrated in Figures 8.3 and 8.4.
Figure 8.3: Synoptic display of Xtrain and Xtest as a basis for evaluation of empirical data
structure differences and their t–u expressions pertaining to two independent sampling
events. The two data set models shown here display significantly different loading
covariances (t,u) and the one data set (grey) displays a distinctly smaller variance in the X-
space than the other. If the grey data set is the test data set, it is obvious that there is no
similarity with the training data set (black), see also Figures 8.4 and 8.5.

In “some cases” it has been argued that the difference between
the training and the test data set is not of sufficient magnitude to be
of practical influence. The winning argument is that it is not possible
to identify such cases a priori. Whatever the situation is at the time of
decision of which validation to go for, the status of this issue is
manifestly unknown, and in fact any of the scenarios shown in
Figures 8.4 and 8.5 can potentially be on the agenda. To the degree
that the two data sets depicted are bona fide training and test sets it
is vital to include them both in the pertinent validation, which can only
be performed using test set validation. The reason such more and
more marked disparities between training and test set can arise is, of
course… heterogeneity. How would one be able to ascertain if such
is the case without going through the reasonable effort of (always)
taking a second, independent test set?

Figure 8.4: Illustration of a training set and test sets of progressively less-and-less overlap.

This is a fundamentally unacceptable uncertainty, which is not
resolvable within the one-data-set-only paradigm, again highlighting
the dangers of an “auto-pilot” cross-validation approach depicted in
the “one-click-model-development” options in less well-reflected
offerings.
In fact, the only way this dilemma could ever be circumvented
would be by carrying out both a test set validation as well as a
particular cross-validation alternative. The cross-validation process in
this case is used to assess the internal stability of the model (using
stability plots) and the test set validation is used as an assessment of
the future reliability of the model. If and when this approach is taken,
the structurally inferior cross-validation would never be accepted as a
performance indicator for a final model; one would only rely on the
superior test set validation result. Still, for the reader’s educational
benefit, in this book there are plenty of data analysis assignments
where it is required to carry out and compare both test set and the
several principal variants of segmented cross-validation in an attempt
to build up experience for both new and experienced
chemometricians.
The logical conclusion to the everlasting “cross-validation”¶
dilemma is to declare a test set validation imperative. Nothing
adverse will ever result from always applying test set validation,
which delivers estimates of both Aopt as well as RMSEP, while
everything is uncontrollably risky by basing a re-sampling cross-
validation on the principally untestable assumption of representativity
in the form of a timeless, constant data structure. Also, a random
splitting of the objects in the first data set into a calibration and test
set does not mean one is “home safe”—since the “second opinion”
of the underlying data structure is missing. Calibration and validation
is a systematic approach in which much understanding of the
sampling background and the span of the X- and Y-space is required
[5].
Figure 8.5: 3-D geometry renditions of progressively less-and-less similar covariance data
structures. To the degree that these are bona fide training and test sets, it is vital to allow
them both to influence the outcome of validation, which can only be performed using test set
validation.

As a pertinent example, the development of a spectroscopic
analysis of pharmaceutical tablets is briefly presented below. When
developing such a method, many batches of previously made tablets
within their expiry dates are usually available for the sample pool. A
protocol is defined that states how many random samples are to be
taken from each lot; thus, a pool of manufactured lots is available
covering the expected range of raw material and manufacturing
conditions at the time of calibration development. This is the best
estimate of the future conditions available and is supplemented with
batches made at the time of method development.
Due to the tight specifications imposed on the production of pharmaceutical tablets, the variability of the Y-responses will typically be very low, so the samples obtained will represent only the centre of the calibration line. In order to develop a linear model, the
Y-response range must be expanded, typically through the
manufacture of development samples that span 75–125% of the
target Y-response, defined by the manufacturing set of samples.
To develop this extended range set, design of experiments (DoE)
is employed to develop tablet formulations that vary the Y-response,
but also change the other components in a designed way, so as not
to lead to unrealistic “binary” mixtures of the constituent of interest
and the rest of the tablet matrix. To test that the extended set and the
manufacturing set of data are from similar populations, a set of
replicated “manufacturing condition” tablets are also developed and
the spectra are compared using methods such as PCA to assess if
they are spectrally identical. If this is the case, the extended set can
be combined with the manufacturing set to produce a representative
sample pool for the development of a calibration model with
appropriately chosen validation set. To choose the validation set, the
pooled objects are typically sorted in ascending order based on the
Y-response. A systematic split of the data into a defined number of
calibration and validation objects can now be made that best covers
the Y-span. The selected objects are then tested for X-span and
some row exchange is employed such that:
1) The calibration sample set spans the greatest variability of both X-
and Y-space simultaneously.
2) The validation sample set completely lies within the calibration
span, but covers the second greatest span. This defines the
working range of the model to be developed.
3) Both the calibration and validation sets cover the widest variations
in raw material composition, manufacturing dates and operating
conditions as best as possible.
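The Y-sorted systematic split just described is straightforward to script. The following is a minimal sketch only, not the procedure of any particular software package; the array names (X, y), the fabricated data and the choice of placing every third non-extreme object in the validation set are assumptions made purely for illustration.

```python
import numpy as np

def systematic_split(X, y, every=3):
    """Sort objects by ascending Y-response and place every n-th
    non-extreme object in the validation set, so that the validation
    set lies within the calibration span (points 1 and 2 above)."""
    order = np.argsort(y)                    # ascending Y-response
    val_idx = order[1:-1][::every]           # keep the two extremes for calibration
    cal_idx = np.setdiff1d(order, val_idx)   # remaining objects calibrate the model
    return cal_idx, val_idx

# Hypothetical pooled data: 60 objects, 200 spectral variables
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
y = rng.uniform(75, 125, size=60)            # % of target Y-response

cal_idx, val_idx = systematic_split(X, y)
print(len(cal_idx), "calibration objects;", len(val_idx), "validation objects")
```

The subsequent check of the X-span and any row exchange (point 3) would still be performed by inspection, for example of PCA scores; the validation objects selected in this way constitute the set discussed next.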
This validation set can be labelled the first internal validation set,
as it is primarily used to assess the choice of preprocessing(s)
applied and also to test the linearity of the calibration model. This
internal validation set is the most representative set available at the
time of model development.
Once internal validation is complete, the model is assessed by its application to an external validation set. This set is typically made up of samples collected from new batches made after the development of the calibration model. It usually only assesses the centre of the model, as the production samples will have only a tight range of Y-response values; however, this is the overall objective of the model development: to assess the production of tablets for consistent manufacture at the target response value.
It is hopefully clear by now that calibration development is not a simple random process in which samples for the validation sets are drawn at random from a manufacturing pool. This approach is highly applicable in industries that can design samples to extend the calibration range, but what happens in situations where this luxury does not exist, i.e. natural/agricultural systems?
In this case, the calibration development analyst is at the mercy of what nature delivers. This does not necessarily have to be a bad situation; however, most models developed on natural systems usually take many seasons of sample collection to become robust.
There are ways of “smart” calibration development that can be
employed for this situation. The process of natural system calibration
development is actually quite similar to the pharmaceutical
development described above, once a sample pool has been
established.
The steps for successful natural system calibration are summarised as follows (a brief code sketch illustrating steps 2 and 3 is given after the list):
1) Collect representative composite samples from the expected strata
(for example, geographical regions) that the calibration model is to
be developed for.
2) Analyse all the collected samples using the X method
(spectroscopic or other) and perform a PCA on the data to look for
trends or groupings.
3) Perform a limited number of reference analyses (Y) on extreme
objects found in the PCA (after removal of gross outliers, should
such exist).
4) Develop an initial model. At this stage, due to the small number of
calibration objects available, one might use cross-validation to
establish a first “indicator” model complexity (only). If reasonable
linearity can be established in a small number of
components/factors, use this model on all new samples obtained
to look for “holes” in the calibration line.
5) If a linear model cannot be obtained, then use the PCA model on all
new samples to isolate objects that are different from what has
been collected to date and submit those for reference analysis until
a pool of samples is available to build extended calibration and test
sets for “robust” model development.
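As announced above, the following is a minimal sketch of how steps 2 and 3 might be scripted: a PCA on the X-data followed by selection of the most extreme objects (largest score distance from the model centre) as candidates for reference (Y) analysis. The two-component model, the score-distance criterion and the simulated data are assumptions for illustration only, not a prescription.

```python
import numpy as np
from sklearn.decomposition import PCA

def extreme_objects(X, n_components=2, n_select=10):
    """Return indices of the objects lying farthest from the PCA model
    centre in score space, as candidates for reference (Y) analysis."""
    T = PCA(n_components=n_components).fit_transform(X)    # mean-centred scores
    d2 = np.sum((T / T.std(axis=0, ddof=1)) ** 2, axis=1)  # scaled score distance
    return np.argsort(d2)[::-1][:n_select]

# Hypothetical pool of composite samples measured spectroscopically (step 2)
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 150))

candidates = extreme_objects(X)
print("Submit these objects for reference analysis:", candidates)
```

Gross outliers would of course be removed before such a selection, as stated in step 3.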
It is stated categorically here that such protocols are the only way
of developing reliable models. One of the current authors has used
this protocol many times in industry and has developed models that
are still in use today after being developed 10–15 years ago. Test set
selection is systematic and requires great planning by the diligent
analyst to develop robust calibrations. Cross-validation in these
method developments is only used for the following reasons:
1) To establish initial models when not enough samples are available
for test set validation. These models are only used to aid in the
finding of more samples that can be used to build the sample pool.
2) To test the internal consistency of the calibration model. In
particular:
a) Random cross-validation is used to assess the stability of the
model when random segments are taken out.
b) Systematic cross-validation is used to assess the quality of
sample replicates.
c) Categorical cross-validation is used to assess the model stability
under predefined conditions like growing seasons, manufacturing
shifts, raw materials and other non-controllable factors.
In the "machine learning" community, for example, random splitting has become a firm tradition; indeed, repeated splits into calibration and test sets, with tuning of model parameters to find the "best" model, are often used here. It should be clear that this is a dangerous tradition, based on the universal belief that any-and-all first data sets are always representative of the future prediction situation. Such a fixed belief is totally unjustified by reality, however, as outlined in the calibration model development protocol described above.
It is often claimed that cross-validation is to be used to determine the "correct" number of factors, i.e. cross-validation is often accepted for internal validation purposes—while (interestingly) the very same cross-validation proponents point to a "completely independent" set for the most reliable estimation of RMSEP. The present
treatment has no quarrels with the latter stipulation of course—but is
in total disagreement regarding any use of cross-validation for
determination of Aopt. Such internal use of cross-validation is the
worst application imaginable, as it can only bring forth information as
to the singular training set. It is never in anybody’s interest to invoke a
two-step, dual method validation approach. By using test set
validation, one is presented the most reliable estimate of RMSEP
(never structurally underestimated) based on the objectively correct
number of components, all in one go.
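To make this operational, the sketch below computes the test set RMSEP of a PLS regression as a function of model complexity and reads off Aopt at the RMSEP minimum (in practice a more parsimonious choice close to the minimum is often preferred). The simulated data, the PLS implementation used and the maximum number of components are assumptions made for the example only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def rmsep_curve(X_train, y_train, X_test, y_test, max_comp=10):
    """Test set RMSEP for 1..max_comp PLS components."""
    rmsep = []
    for a in range(1, max_comp + 1):
        model = PLSRegression(n_components=a).fit(X_train, y_train)
        residuals = y_test - model.predict(X_test).ravel()
        rmsep.append(np.sqrt(np.mean(residuals ** 2)))
    return np.array(rmsep)

# Hypothetical training set and independently sampled test set
rng = np.random.default_rng(2)
X_train = rng.normal(size=(100, 50)); y_train = X_train[:, :3].sum(axis=1)
X_test = rng.normal(size=(40, 50));   y_test = X_test[:, :3].sum(axis=1)

rmsep = rmsep_curve(X_train, y_train, X_test, y_test)
A_opt = int(np.argmin(rmsep)) + 1
print("Aopt =", A_opt, " RMSEP =", round(float(rmsep[A_opt - 1]), 3))
```

Delivering a realistic RMSEP and the accompanying Aopt in one operation, as above, is precisely the advantage argued for here.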
This is the principal reason for not routinely splitting a training data set randomly, however large it may be. With random splitting, there is still no information pertaining to the future application situation, unless the design of the set has been given the complete consideration it requires. A massive redundancy in the number of data
available is mistaken for a realistic basis for future performance
validation. Test set validation is the best possible way to remedy this
predicament—by securing (at least) one new data set from as far in
the “future” as is logistically possible, i.e. the external validation set.
By accepting that circumstantial conditions may well change (on
occasion), but that information about this will usually be unknown, the
test set validation approach is the best one can ever do. This also
brings up the question to what extent an empirical model can be
extrapolated, i.e. to a situation in which test-set samples as well as
other future samples may be found to lie outside the full calibration
space (if this was, somehow, under-represented). It is in order to
guard against such undesirable situations that the demands for a
proper training data set are so stringent.
From this discussion, it also transpires that a regimen of regular
test set validation model checking is a wise approach within the
arena of quality monitoring and quality control. The above discussion
appears particularly easy to understand in the process technology,
process monitoring and process control settings. Proper process
sampling in this context is treated specifically by Esbensen and
Paasch-Mortensen [9] and will be taken up in the chapter on PAT
(chapter 13). In particular, the US FDA in its 2011 process validation
guidance [10] has stated that every batch is now a validation batch in
the realms of quality by design (QbD) and this requires continuous
verification strategies. In other words, the use of process analytical
technology and modern control/data management systems now
allows manufacturers to test set validate every batch produced. This
is where responsible chemometrics meets proper consumer
protection.

8.10 Cross-validation does have a role—category and model comparisons
There is a role for cross-validation—several in fact—but they are all strictly compartmentalised and cannot be generalised.
In the arena of model comparison (both regarding models of
structurally identical nature, but of optionally alternative parameter
settings: for example, different pre-processing alternatives, different
X-variable selection alternatives… as well as more distinctly different
models), cross-validation is in fact a particularly relevant approach.
For this specific purpose, cross-validation furnishes precisely what is
needed, a general identical performance framework within which the
effects from alternative models, parameter settings, preprocessing,
categories (seasons, for example) can be objectively compared
without having to deal with data set structure variations for each
“segment”. In this context, it is a necessity to use the same number
of segments for all sub-validations, in which case it is strongly
recommended to use a low number of segments, preferentially two,
in order to impart the greatest possible semblance of realistic data
set variability to influence the validation results—and never LOO
cross-validation (full cross-validation). In this area of applied validation, there is very good reason to use cross-validation, although it is interesting to contemplate how one is to deal with the possibility of a different number of components, Aopt, for the alternatively optimised models.
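A minimal sketch of this model-comparison use of cross-validation is given below: two preprocessing alternatives (raw spectra vs a standard normal variate, SNV, transform) are compared under one and the same fixed two-segment split, so that any difference in RMSE reflects the preprocessing and not the segmentation. The SNV helper, the fixed odd/even segment assignment and the simulated data are assumptions for illustration only.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def snv(X):
    """Standard normal variate: centre and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def two_segment_rmse(X, y, n_comp=3):
    """RMSE from one fixed two-segment cross-validation (same segments
    for every model being compared)."""
    idx = np.arange(len(y))
    segments = [idx[::2], idx[1::2]]
    errors = []
    for held_out in (0, 1):
        train, test = segments[1 - held_out], segments[held_out]
        model = PLSRegression(n_components=n_comp).fit(X[train], y[train])
        errors.append(y[test] - model.predict(X[test]).ravel())
    return np.sqrt(np.mean(np.concatenate(errors) ** 2))

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 120)); y = X[:, :4].sum(axis=1)

print("raw spectra:", round(float(two_segment_rmse(X, y)), 3))
print("SNV spectra:", round(float(two_segment_rmse(snv(X), y)), 3))
```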
In the example of a prediction model encompassing distinctly different seasons, it is intuitively clear that in order for such a model to have truly predictive power, the only relevant way to segment the training data set is to use the seasons as the cross-validation segments. A "robust" model, i.e. a prediction model able to
predict all year round, should be stable with respect to seasonal
perturbation. A standard “blind” cross-validation segmentation will
invariably have representative objects from all seasons in the different
segments, with the result that the model is not tested at all with
respect to the individual seasons.
There exist several other categories that function similarly to seasons for many data sets, for which the exact same argument holds, as is laid out in full in Westad and Marini [2], in which it is
shown how correctly applied cross-validation gives valuable
information about these specific types of sources of variation. Thus,
validating across relevant categorical object designations enables the
analyst, for example, to evaluate the robustness across raw material
suppliers, location, time, operators etc. These are meaningful
segment definitions that qualify use of cross-validation.
Returning to the pharmaceutical tablet example given above,
assume that the objective of a project is to develop a model for
predicting the active ingredient in every tablet produced at multiple
production locations. For this purpose, a spectrometer is used on-line
to provide the necessary information in real time and it may be
assumed that there exists an established reference analytical method.
The experimenter then needs an estimate of the sources of variation
for such a system to be implemented at the different production sites.
Among the many considerations one needs to take, the final set of
objects may be stratified into segments according to, for example:
a) replicated measurements on one side of one tablet
b) acquiring a spectrum on both sides of the tablet
c) changes over time for one production batch
d) changes between various batches of raw materials
e) changes due to equipment characteristics in the production line at
one site
f) variation across production sites
g) variation due to the standard sampling and analytical procedures
(covered in chapter 9)
By carefully setting up schemes for cross-validation according to
this type of qualitative information about object groups, termed
“conceptual cross-validation”, the influence on the prediction results
from these various sources of variation can be estimated and
compared. If, for example a) above is the main source of variation,
there is a fundamental problem with the measurement process. On
the other hand, if cross-validation across instruments reveals large
differences in the hardware components, the conclusion is that
individual models for each instrument are needed, or some relevant
method for instrument standardisation or model transfer is required.
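A minimal sketch of such conceptual segmentation is given below, using scikit-learn's GroupKFold so that all objects sharing a category label (here a hypothetical production-site label) are held out together. The site labels, the number of PLS components and the simulated data are assumptions for illustration; the essential point is only that the segments follow the category instead of a blind random split.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 80)); y = X[:, :3].sum(axis=1)
site = np.repeat(["site_A", "site_B", "site_C"], 30)   # hypothetical category labels

errors = []
for train, test in GroupKFold(n_splits=3).split(X, y, groups=site):
    model = PLSRegression(n_components=3).fit(X[train], y[train])
    errors.append(y[test] - model.predict(X[test]).ravel())

rmse_across_sites = np.sqrt(np.mean(np.concatenate(errors) ** 2))
print("RMSE with whole sites held out:", round(float(rmse_across_sites), 3))
```

If this figure is much larger than that obtained with segments cutting across sites, individual models or a model-transfer approach would be indicated, as discussed above.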
Sometimes it is argued that since all the objects were acquired on a specific day with a specific instrument, on a specific batch of raw materials and by one person only, the above intricacies do not apply, and one can simply get on with "blind" cross-validation. In
such a case, however, estimates of the model performance will
severely lack credibility because no prospective conclusions can be
made, as per the many arguments above before section 8.10.
Unfortunately, this situation is the typical basis for the validation
presented in quite a number of technical reports, scientific
publications and in oral presentations, which unavoidably must lead
to overoptimistic validation assessments, findings that later cannot be
reproduced. The history of chemometrics contains very many such examples, which have contributed to a certain measure of institutional confusion.

8.11 Cross-validation vs test set validation in practice
Usually there is more focus on strict adherence to one or another
cross-validation procedure, complete with preferred number of
segments (a fixed scheme), than openness with respect to what
exactly are the assumptions and prerequisites behind cross-
validation. This is troublesome, as no amount of discussion pro et
con a specific number of segments will ever reveal the underlying
structural problems associated with cross-validation using whatever
number of segments “s”. The general verdict, following from all of the
above, is:
Cross-validation is, in general, not a validation which incorporates
information as to the future use of the particular data model. Cross-
validation is overwhelmingly an internal sub-setting stability
assessment procedure; cross-validation here only speaks about the
robustness of a particular (local) data model, as gauged by internal
sub-setting of the particular training set.
Caveat: The latter feature can be turned to good use in specific
cases, specifically, the case of “conceptual cross-validation”, which
often occurs in the process realm, as well as when comparing
models. Cross-validation finds valid use for estimating the magnitude
of resulting variabilities, provided all future samples lie inside the
same conceptual modelling domain.**
The operative aspects of cross-validation versus test set
validation are illustrated forcefully by the multivariate image analysis
(MIA) examples in Esbensen and Lied [11]. Even though this
publication is addressing MIA, the image analytical examples here
throw unprecedented illumination on the general principles of cross-
validation because of the extraordinary magnitude of the X and Y
matrices involved. Because each pixel counts as an object, even a modest image constitutes a truly huge data set, i.e. 10,000 to 1,000,000 objects or more. Here the workings of cross-validation are visualised like nowhere else in chemometrics.
8.12 Visualisation of validation is everything
With the help of the relationships presented in Figure 8.2 above,
further developed below as Figure 8.6, it is possible to delineate the
universal deficiency displayed by segmented cross-validation, indeed
also compared with leverage-corrected validation: Test set validation will always result in a higher estimate of RMSEP than any of the segmented cross-validation alternatives (and often very much higher than the leverage-corrected RMSE estimate), precisely because it incorporates all sampling, conceptual-category, model and analysis uncertainties. It will, therefore, always constitute the most realistic estimate. The point is not to search for the lowest RMSE outcome among an arbitrary set of alternative validation methods/variants—the point is to estimate the most realistic future prediction error, and this is universally delivered by the test set estimate.
Figure 8.6 summarises experience with validation from many hundreds of projects and data sets. In the last two to three decades of chemometric experience (teaching, professional, consulting) behind the present book, innumerable data analyses have dealt with all manner of data structures, the general patterns of which are depicted in Figure 8.1, especially those of more regular appearance, types a–d). Occasionally, curves partly deviating from the ones depicted
may appear, but invariably related to a local, more irregular data
structure. The “gap” between the test set validation and two-segment
cross-validation curves represents the missing TSE-component,
which can only be quantified by comparing test set and the cross-
validation results. This represents the missing TOS-error components
that can only be incorporated by sampling a second data set, the
essential feature of which is that the sampling protocol is identical for
both the training and the test data sets. From these universal
relationships emerges one very powerful conclusion: only test set
validation can stand up to the logical and scientific demands of all the
characteristics of proper validation. One should henceforth observe a
test set validation mandate if and whenever possible based on the
arguments provided in this chapter. Cross-validation used as a model
validation technique when it is possible to perform a test set
validation is unacceptable in every way and is a main cause of why
chemometrics has been given a bad name in some situations. In one
case, one of the authors witnessed a situation where a so-called
“data analyst” performed cross-validation on a full set of 6000
samples. The mind boggles at such incompetence!

Figure 8.6: Relationships between the three principal RMSE-estimate procedures as a function of model complexity. Leverage-corrected estimates are universally lower than those pertaining to cross-validation, which are always structurally lower than those stemming from test set validation proper. For one-and-the-same training data set [X, Y], the systematic relationships between the different segment variants of cross-validation are indicated in principle; these were laid out in detail in Figure 8.2. Stronger (X, Y) correlation will result in more closely lying curves, but the principal relationships shown remain the same.

There has been a persistent chemometric tradition of validating all data sets, large or small—regular or chaotic w.r.t. data structure, which is often unknown unless visualised by t–u plots. This is
especially dangerous when dealing with small data sets, see e.g.
Martens and Dardenne [12]. In such situations where the number of
objects is limited, setting aside a certain proportion of the objects as
a test-set comes with the very likely cost of removing significant parts
of the overall latent structure of the data.
Thus, there exist situations (so-called “small sampleSTAT cases”) in
which absolutely all objects are needed for optimal modelling and
interpretation of the data structure, such as the relationships between
the variables etc. In such cases, it is definitely better to allow for the
possibility of not validating data sets when the conditions for proper
validation are lacking. “To validate, or not to validate” is thus an
evergreen valid question for the consummate chemometrician.

8.13 Final remark on several test sets


Arguments can easily be raised for invoking a postulated need for
several test sets: of course, more than one test set will always allow
for more valid assessment, since more test set realisations
correspond to more examples of the future in-work prediction
scenario for the prediction model. One properly materialised test set
will have a decent chance of incorporating the principal information
from the future situation. This in contrast to the vociferous objections
and postulated budget or effort constraints that are often claimed, not
even allowing for a single test set. In a rational context, it is evident
that a decision regarding the real need for several test sets will be
based much more on problem-dependent specifics, always related to
the complete problem-dependent background regarding the likely
consequences, and the price to pay, for sloppy validation. Thus, it is
here advocated always to plan for and materialise one test set,
acknowledging the occasional need for more, but this decision is left
well and safely in the hand of the informed data analysts who are
closer to the relevant data and their background in all cases.
8.14 Conclusions
Re-sampling and cross-validation approaches work on one data set
only, Xtrain. The tradition of cross-validation is particularly strong in the
realm of less experienced chemometricians. The current use of cross-
validation and its huge popularity is based on tacit, unsubstantiated
assumptions of the training set always being fully representative of the
future scenario and future measurements on new samples. However,
this belief finds itself in strong disregard of the extremely varying
origin and the very diverse data structures in the real world. This
widespread assumption was shown to be untenable in the light of the
significant bias-generating sampling errors described in the Theory of
Sampling (TOS).
On the other hand, in some cases, one cannot “wait forever” until
all sources of variation for a given application are represented in the
first training set of objects before the starting to establish a particular
data model. Thus, it is of critical importance exactly what is
represented in this singular data set and if it has been acquired within
the framework of a “suitable sampling strategy” that forces it to
reflect future variation.
Instead of the endless series of partial examples (based on local
data set structures only) presented in the chemometrics and other
literature, and from which no valid generalisation can be made, this
chapter presented first principles, the principles of proper validation
(PPV), which are universal and apply to all situations in which
assessment of performance is desired—be this prediction,
classification, time series forecasting or modelling validation. The
underlying element in PPV is the Theory of Sampling (TOS), chapter
3, which is needed in order to identify and eliminate all bias-
generating sampling errors, which are otherwise responsible for
unnecessary, significantly inflated measurement errors, for which no
statistical corrections are possible. Invoking the complete body of
theoretical and practical experiences from ~60 years of application of
TOS, it was shown to be untenable to continue with bland, unjustified
assumptions regarding universal training data set representativity.
On the basis of chapter 3 and the present chapter, it was
concluded that re-sampling and cross-validation approaches miss
out with respect to the crucial samplingTOS variance. This variance can
only be accommodated by a test set (a second independent sampling
—more than one if deemed necessary by local, problem-dependent
reasons), without which simple re-sampling validation on one-and-
the-same data set will always structurally underestimate the realistic
prediction error. No theoretical procedure exists to derive an
approach that can estimate the magnitude of this missing part. For
this reason, re-sampling and cross-validation should logically be
terminated, except for the cases of exception described in section
8.10. Standard use of “blind cross-validation” only performs
assessment of internal sub-setting model stability. Use of cross-
validation must always be accompanied by full disclosure of the
procedures used and the inherent method deficiencies described in
this chapter.
The main purpose of establishing a model may not necessarily be
for predicting or for classifying new objects, but simply to understand
the inherent structure in the system under observation. The previous
chapters describe methods that provide insights into the underlying
structure of any process or system under observation, through scores
and loadings relationships a.o. All model interpretation is highly
dependent on the number of latent variables retained, and therefore it
is vital to be able to determine the correct dimensionality (rank) of the
model. It is important to distinguish between numerical rank,
statistical rank and the application-specific rank, which may not
always be identical.
Regarding PLSR, a major chemometric regression method, a call
was made for stringent commitment to test set validation based on
graphical inspection of t–u plots for optimal understanding of the
operative X–Y interrelationships. Simple visual inspection will also
allow a reliable premonition of the outcome of any particular
validation approach. There is no justification to reject the work effort
involved in securing a test set for validation purposes, acknowledging
that this is the only approach which eliminates the deficiencies
outlined. The comparatively rare occasions when a test set is not an
available option (historical data a.o.), have no generalisation power.
The comprehensive understanding outlined in this chapter will stand
the data analyst in good stead when, feeling forced to make use of
some form of re-sampling. Complete understanding and full
disclosure of the structural RMSE underestimation deficiency is
mandatory in all such cases.
Many reasons are given in numerous traditional arguments for
continued use of cross-validation and re-sampling for validation. The
following arguments and reasons are not valid:
Complacency: one cross-validation approach/method for all data
sets is an easy buy, but one that completely disregards the gamut
of vastly different data structures and correlations.
Focus is on algorithms, implementation and software, without
critical thinking.
Unwillingness to investigate consequences of traditional statistical
assumptions (myths).
Resistance to the Theory of Sampling (TOS) for complementary
understanding regarding heterogeneity and sampling process
issues.
Misunderstanding, or misplaced universal trust in the central limit
theorem.
No interest for how “data” and “data quality” originate.
Blind adherence to traditions or schools-of-thought: “This is the
way chemometrics has been doing validation for more than 40
years…”
This chapter mostly discussed how a system can be validated
using the best available information about the origin of the data
(objects), Esbensen and Geladi [1]. However, validation may have
various meanings in different scientific communities. Questions like
“do I use the expected chemical information in my instrumental
variables to predict product quality”, “do various methods give the
same interpretation” or “do I find the same subset of variables with
various variable selection approaches?” are examples where cross-
validation in specific, bracketed situations may be useful in broader
and more advanced contexts, see Westad and Marini [2] for more on
these issues.

8.15 References
[1] Esbensen, K.H. and Geladi, P. (2010). “Principles of proper
validation: use and abuse of re-sampling for validation”, J.
Chemometr. 24, 168–187. https://doi.org/10.1002/cem.1310
[2] Westad, F. and Marini, F. (2015). “Validation of chemometric
models—a tutorial”, Anal. Chem. Acta. 893, 14–24.
https://doi.org/10.1016/j.aca.2015.06.056
[3] Martens, M. and Martens, H. (2001). Multivariate Analysis of
Quality. An Introduction. Wiley, Chichester, p. 445.
[4] Høskuldsson, A. (1996). Prediction Methods in the Sciences.
Thor Publishing. Copenhagen.
[5] Press, W.H., Teukolsky, S.A., Vetterling, W.T. and Flannery, B.P.
(2007). Numerical Recipes, The Art of Scientific Computing, 3rd
Edn. Cambridge Press, NY, p. 777.
[6] Wonacott, T., and Wonacott, R. (1990). Introductory Statistics,
5th Edn. Wiley, New York.
[7] Devore, J. (1995). Probability and Statistics for Engineering and
the Sciences, 4th Edn. Duxbury Press, Belmont, CA.
[8] Martens, H. and Naes, T. (1989). Multivariate Calibration. Wiley,
Chichester.
[9] Esbensen, K.H. and Paasch-Mortensen, P. (2010). “Process
sampling (Theory of Sampling)—the missing link in process
analytical technologies (PAT)”, Process Analytical Technologies,
2nd Edn, Ed by Bakeev, K. Wiley-Blackwell, Oxford.
https://doi.org/10.1002/9780470689592.ch3
[10] US FDA, Guidance for Industry—Process Validation: General
Principles and Practices.
http://www.fda.gov/downloads/Drugs/Guidances/UCM070336.pdf
[Accessed 6 January 2017].
[11] Esbensen, K.H. and Lied, T.T. (2007). “Principles of image
cross-validation (ICV): representative segmentation of image
data structures”, in Techniques and Applications of
Hyperspectral Image Analysis, Ed by Grahn, H.F. and Geladi, P.
Wiley, Chichester, Ch. 7, pp. 155–180.
https://doi.org/10.1002/9780470010884.ch7
[12] Martens, H. and Dardenne, P. (1998). “Validation and
verification of regression in small data sets”, Chemometr. Intell.
Lab. Syst. 44, 99–121. https://doi.org/10.1016/S0169-
7439(98)00167-1

* In chemometrics, there has been a key terminology confusion regarding the terms, sample,
object, observation, measurement a.o. The duality regarding “sample” in the statistical vs the
TOS contexts is covered comprehensively in this book. An “observation” can be understood
as a passive measurement. A physical sample may result in several “replicate”
measurements, which then would become individual objects when represented in a data
matrix. Chapters 3, 8 and 9 have made a profound effort to clear up these confused
terminology issues.
† Chemometrics owes Høskuldsson a very great debt of gratitude for his seminal 1996 treatise; the H-principle name may very well connote the last name of this chemometric author as well.
‡ One test set is the minimum requirement… but more can, of course, also be contemplated
in specific situations, especially in cases where data set structures are varying more than
what is comfortable. This issue is particularly relevant for PAT implementations and other
process data modelling realms, see chapter 13.
§ Note that in reality the model is only “robust” with respect to alternative validation methods
or strategies.
¶ In a recent meeting of the Australian Near Infrared Spectroscopy Group (ANISG), the
respected chemometrician Professor Tom Fearn presented a paper entitled: “Crass-
Validation”—a term that well aligns with the current argumentation in this textbook.
** However, there is a paradox: when one set of objects is collected on one day, with one raw material etc., there is still no information about the samples as a basis for systematic validation in the sense of "conceptual cross-validation". In this case only random cross-validation or
setting aside one part of the objects as a test-set are real alternatives. A random split into a
calibration and a test set will not reveal if the model is stable towards future sources of
variation. Again, only a well thought out training data set will allow validation to address all
the relevant issues; the specific cross-validation method alone will be insufficient.
9. Replication—replicates—but of
what?

This chapter provides an overview of an issue which has not always received its fair attention, the question of "replication", exemplified by the following examples that can be found frequently in the literature: i) replication of sampling; ii) replication of samples; iii) replication of measurements (analysis). What is meant by replication here? Are these "replications" identical? The commonly implied connotation is that of a beneficial averaging carried out with the help of replicates. There are, however, many uncertainties and imprecise assumptions involved when considering averaging: averaging of what exactly? This issue needs careful consideration before data analysis can be performed appropriately.

9.1 Introduction
From the discipline of experimental design (design of experiments,
DOE, chapter 11), comes a well-organised strict understanding and
terminology for “replicate measurement”, because of the rigidly
controlled situation surrounding the actual design. For example, in
the situation of chemical synthesis influenced by several experimental
factors (say, temperature, pressure, concentration of co-factors), it is
easy to understand what a replicate measurement means: repeat the
synthesis experiment under identical conditions for these controllable
factors and replicate the outcome measurement, for example, the
yield. By definition of a designed experiment, care has been taken to
randomise all other potential factors, in which case the variance of
the experimental results, be it small or large, is supposed to furnish a
measure of the “Total Analytical Uncertainty”. Upon reflection,
however, this variance also tells of the uncertainty contributions
stemming from other influences, for example, from small-scale
sampling of reactants involved, which may not necessarily represent
“homogeneous stocks”, but more likely are of uniform composition
only, see chapter 3. Added uncertainty contributions may also arise
from resetting the experimental setup, i.e. to what precision can one
“reset” parameters such as temperature, pressure or concentration
levels of co-factor chemical species after having turned the setup off
and cleaned all the experimental equipment (perhaps even waiting
until next week) before “replicating”? However, such uncertainty
contributions are usually considered insignificant because of the
“controllable” situation attending DOE.
There are, however, many other scenarios preceding data analysis
that far from parallel this nicely bracketed situation of a controlled
experiment. Indeed, most data sets do not originate from within the
complacent four walls of an analytical laboratory only, but from
sampling of heterogeneous systems and processes from all of
science, technology and industry. What is described below constitutes the opposing end of a full spectrum of possibilities, in which the researcher/data analyst must recognise significant sampling, handling and preparation errors in addition to the Total Analytical Error (TAE). The issue can most conveniently be organised around one key
concept: What is meant by “replicate samples”? Note that one
physical sample may end up being represented in a data matrix as
several objects etc. with a very real danger of potential confusion.* To
add to the complexities, replication may also be carried out at other
levels, for example, related to sampling of multiple batches, lots or
production units, sampling from different seasons or from different
instruments, perhaps carried out by different operators. All these
higher-order replication issues are discussed in more detail in chapter
8. For each of these scenarios, it is imperative that the reader is
offered full disclosure of the stage, level and intensity of replication.
Upon reflection, this issue will appear more complex than it may seem at first sight; indeed, it merits careful definition, contemplation and a strict terminology, among other reasons because it is also intimately related to the validation issue presented in chapter 8. There has been much confusion (due to unawareness and sometimes neglect) because of far too vague or incomplete definitions of what exactly it is that is replicated.
As the case in point, what is meant by “replicated samples”?
With reference to TOS, it will be appreciated that “replication” can
concern (at least) the following alternative scenarios:

Stage 1: Replication of the primary sampling process (all the way to analysis), with due regard to the possible effects of time, sequence, raw material variation etc.

Stage 2: Replication starting at the secondary sampling stage (i.e. first mass reduction).

Stage 3: Replication starting with the tertiary sampling process (further mass reduction).

Stage 4: Replication starting with aliquot extraction and preparation (e.g. involving compaction or other problem-dependent operations, which all add variance when replicated).

Stage 5: Replication starting with aliquot instrument presentation (e.g. surface conditioning).

Stage 6: Replication of the analysis only (TAE).

The last situation logically corresponds to the term "replicate analysis". But does this mean that the aliquot (the vial) stays in the
analytical instrument all the time while the analyst “presses the
button” repeatedly, say 10 times? Possibly—this would indeed
correspond to TAE sensu stricto, but it may seem equally relevant to
extract the vial and insert it in the instrument repeatedly, allowing
normal temperature variations to influence TAE because this is a
more realistic repetition of the between-samples situation than
repeating measurements on one static sample housed in the
analytical instrument without replacement. This is a first foray into
“Taguchi thinking”. † But to another analyst, it may perhaps appear
equally reasonable also to include some, or all, of the “sample
conditioning/preparation” variations in the replication scheme, for the
same reason: to be more realistic. In that case, such perturbations should then logically also be repeated 10 times (stages 4 and/or 5 above).
Having opened up this avenue, it now seems an unavoidable
logical step to follow up with further, equally relevant and realistic
perturbations of the circumstances surrounding “analysis”, and in fact
include also the tertiary, secondary and ultimately the primary
sampling errors in the replication concept. Following the full impact of
chapters 3 and 8, it is clear that the only complete “sampling-and-
analysis” scenario, which is guaranteed to include all possible
uncertainty contributions to the total Global Estimation Error (GEE), is
the one that starts with replication of the primary sampling method
(“replication, from the top”).
Repeating the primary sampling, say, 10 times, each sample being
subjected to the exact same protocol governing all the ensuing sub-
sampling (mass-reduction), sample handling and preparation, is the
only procedure that allows the full set of uncertainties and errors‡ to
be reliably manifested more than once, i.e. this approach is the only
fully realistic replication of all the elements in the sampling-and-
analysis pathway compared to the routine workflow of typical
laboratories. By contrast, starting at any of the other levels, stages 2–6, will result in inferior TSE + TAE estimations, which are structurally guaranteed to be too low.
There is always an obligation for the analyst to describe the
rationale behind the specific choice of a replication scheme and to
fully disclose voluntarily exactly what was in fact replicated, else the
user of the analytical data will be in the dark. Undocumented,
unexplained (and sometimes even ill-understood) application of the
term “replicates” (w.r.t. sampling, samples, sub-samples, aliquots?)
has been the source of a significant amount of unnecessary
confusion in analytical chemistry, statistics and chemometrics.
However, many times the problem boils down to s²(TAE) simply having been misconstrued to represent the much larger s²(TSE + TAE), a grave error, for which someone must be responsible—but who? Who or what is the culprit in this context? More importantly, how can this be rectified?
The above scenarios illustrate the unfortunate
compartmentalisation of responsibility, which rules the day in a wide
swath of current laboratory, scientific and industrial contexts.
Comments commonly heard are: “..the analyst is not supposed to
deal with sampling outside the laboratory”; “...this department is only
charged with the important task of reducing the primary sample to
manageable proportions, as per the laboratory’s instructions”;
“...sampling is automated, there is no sampling problem here”; “...I
am not responsible for sampling, I analyse the data!” and a legion of
similar excuses for not seeing, or wanting to deal with, the complete
“measurement uncertainty” issue. All too often the problem belongs
to “somebody else”.
If “excuses” like these continue to be allowed in practical
experimental work, in technical guidelines and reports and in the
scientific literature, there is a grave danger that this unfortunate stand
will only be perpetuated: “Replication” will then mostly still take its
point of departure at stage 3 (maybe stage 2, but almost never from stage 1, the primary sampling stage). Add to this a distinct lack of stringency on behalf of authors, reviewers and editors to focus—or be knowledgeable enough to be able to focus—and crack down on this enormous ambiguity regarding "replication". The issue is manifestly critical. Grave errors are still being committed, for which a universal remedy has not yet seen the light. This chapter intends to put an end to this unfortunate issue. How? Well, the issue is no longer what is
wrong, the issue rather is: what can be done about it? Indeed, who is
going to do something about it? Well, it turns out that the answer is
very close—it’s YOU, the insightful and competent data analyst, by
insisting on appropriate data quality precautions as well as relevant
strategies for collection of data to ensure representative variation in
the data to be analysed, which is a key point in all multivariate
considerations.

9.2 Understanding uncertainty


The basic assumption underlying the application of multivariate
analysis is that the measured data carry relevant information about
the studied properties and experimental objectives. It is obvious that
it matters very much whether the individual data in any matrix are
fraught with measurement uncertainties proportional to s²(TSE + TAE) or to s²(TAE) alone, else data analytical interpretations and
conclusions run the risk of addressing patterns, results and issues
which are in reality “below the effective uncertainty level”. Data
quality needs to be quantifiable and how to achieve this is the
purpose of this section.
With reference to chapter 3, most aspects related to the
replication issue can be assigned to incomplete or too vague notions
as to the role and influence of spatial heterogeneity, DHL of the
analyte in question. But with proper insight one will never neglect the most influential TSE contributions from primary, secondary and tertiary sampling, effects which must be determined by empirical inquiry. Depending on the application, various other aspects of
heterogeneity in the system under observation might also be
instrumental, for example, temporal irregularity. What is confusing,
and frustrating, is that in many cases the data analyst is, quite
literally, miles away from the location where the issue originates. It is
not, however, impossible, difficult or expensive to do something about this: the Replication Experiment can easily be carried out by the same personnel responsible for the primary sampling.
9.3 The Replication Experiment (RE)
The scene is now set for a remarkably powerful tool with which to
deal effectively with all issues regarding TSE vs TAE. The Replication
Experiment (RE) will be able to resolve all the issues that were raised
above, and in chapters 7 and 8, regarding the operative influence(s)
on the total measurement uncertainty, and at all relevant sampling
stages. As luck would have it, the principle of a Replication
Experiment is simplicity itself.
The quantitative effect of lot distributional heterogeneity (DHL)
interacting with a specific sampling process (i.e. any sampling
process based on a pre-selected sample mass in a specific sampling
plan using either grab sampling or composite sampling, or whatever)
can be quantified by extracting and analysing a small number of
replicate primary samples with the objective to "cover the spatial geometry, or the salient time span of the lot/process" as well as possible, and from this to calculate the empirical variance of the resulting analytical results, aS. This procedure is termed a Replication Experiment.
A relatively small number of primary samples may often suffice,
though never less than 10 if possible. The issue at hand is not only
statistical, but more associated with the presumed lot heterogeneity.
If DHL is, or is suspected of being, significantly influential, it is
senseless to be frugal w.r.t. the number of replicated primary samples
for obvious reasons. It is suggested to attach an informative replication index, r, to the replication experiment term, RE(r). Thus RE(7) can be meaningfully compared to RE(12), for example. The issue is not so much the exact r value; it is rather that the experimenter honours an obligation to report on what basis the RE was physically carried out—e.g. in Figure 9.1 a manual stream sampling subjected to a RE(10). Note
that this particular sampling operation may, for example, not
necessarily be representative—in which case this is precisely the kind
of insight that will be provided by a replication experiment.
Figure 9.1: A generic example of a specific primary sampling operation (in this case a manual
sampling of a process stream) that can easily be replicated (for example 10 times), which is
all that is needed for a Replication Experiment, RE(10).

The replication experiment must be governed by a protocol that specifies precisely how all procedural elements are to be carried out. It is essential that the primary sampling as well as all sub-sampling and mass-reduction stages, sample preparation etc. are replicated in a completely identical fashion. Obviously, it is preferable for a RE(r) that
all Incorrect Sampling Errors (ISEs) have been eliminated, i.e. that
only correct sampling is employed (TOS’ preventive paradigm, refer
to chapter 3). However, it is also feasible first to gauge an existing
sampling-and-analysis procedure in which this requirement has not
necessarily been fulfilled. The replication experiment will then include
the pertinent error effects from these factors, i.e. include the adverse
sampling bias effects. This will soon be seen as a major bonus,
however, revealing the real practical power of RE(r).
It has been found convenient to employ a standard statistic to characterise results from the replication experiment. The relative coefficient of variation, CVrel, in the TOS literature also termed the relative sampling variability, RSV, is an informative measure of the magnitude of the standard deviation (σ) in relation to the average (Xavr) from a series of properly replicated analytical results, expressed in % (equation 9.1):

RSV = CVrel = (σ/Xavr) × 100%          (9.1)

When RSV is calculated from data originating from a replication experiment starting with the primary sampling, it is clear that it
encompasses all sampling and analytical errors as manifested r times
through the full “lot-to-analysis” pathway. RSV measures the total
empirical sampling variance influenced by the specific heterogeneity
of the lot material as expressed by the current sampling procedure.
The core issue is that the RSV from a properly designed replication experiment is not only a reliable (TSE + TAE) estimator; it simultaneously furnishes a quantitative measure of the effective heterogeneity of the lot, precisely as manifested by the sampling procedure in use. The specific RSV
magnitude can be seen to be directly proportional to the effective
total heterogeneity of a(ny) lot/material, because some form of
sampling of the lot must be carried out in order to end up with
analytical results. What could be more relevant than a proper
quantitative characterisation of any practical “measurement situation”
including the critical sampling component?
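A minimal computational sketch of equation 9.1: given the r analytical results from a replication experiment, RSV follows directly from their mean and standard deviation. The ten values below are fabricated purely for illustration.

```python
import numpy as np

def rsv(results):
    """Relative sampling variability (equation 9.1), in %: the standard
    deviation relative to the average of the r replicate analytical results."""
    results = np.asarray(results, dtype=float)
    return 100.0 * results.std(ddof=1) / results.mean()

# Hypothetical analytical results (e.g. ppm) from a RE(10)
re10 = [96.2, 104.8, 99.5, 110.3, 93.7, 101.1, 98.4, 107.6, 95.0, 102.9]
print(f"RSV = {rsv(re10):.1f}%")   # to be judged against a 20-30% type threshold
```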
Since all sampling errors, at all scales from lot to analysis, are
included for each replicated primary sample, it is certain that both the
inconstant sampling bias (if present) as well as all Grouping and
Segregation Error (GSE)-induced variability effects (always present to
some degree, as are the Fundamental Sampling Error effects, FSE)
are allowed to manifest themselves r replicated times. It follows that
RSV provides a highly relevant expression of the effective total
“measurement uncertainty”. This has the desirable consequence that
all sampling procedures may easily be put to the test, if a specific
threshold with which to compare is at hand.
Figure 9.2: Examples of different empirical RSV magnitudes, expressed with respect to the
pertinent average of the r analyses (for example 100 ppm as illustrated). The larger the RSV
magnitude, the larger the spread of the r final analytical results realised through the RE(r). Is
there a universal threshold?—see text.

From the TOS community, a general acceptance threshold has (reluctantly) been suggested as 20%, but as this is based on
theoretical model understandings of FSE alone, an additional
measure must be added allowing for the effects also from GSE for all
types of significantly heterogeneous materials. Thus, a RSV which is
higher than, say, 30% signifies an unacceptably high sampling
variability—with the mandate that the sampling procedure must be
improved. Remember, however, that such a general threshold is in reality passing judgement over what is strongly problem-dependent (materials heterogeneity may well be such that a higher threshold is warranted, but there also exist many types of materials for which a lower threshold than 20% is entirely appropriate§). Be all this as it
may, for the data analyst, it is sufficient to be conversant enough with
these matters to demand that some form of RSV threshold must be
available, else data analysis operates with a much too uncomfortable
margin.
There is a significant work effort and economic savings potential
in recognising that samples resulting from a proper RE(r) can be
analysed for any number of analytes (variables). It is the same set of
samples which is sent to the analytical laboratory. This constitutes a
comprehensive screening of all potential analytes involved. One of
these is bound to display the largest RSV, signifying that this analyte
is exhibiting the largest heterogeneity, which means that if the entire
sampling-processing-analysis procedure is focused on this aspect
alone, all other analytes will display an acceptably lower
heterogeneity and thus cause no trouble.
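The multi-analyte screening described above amounts to computing the RSV column-wise over the replicate results and focusing the sampling effort on the analyte with the largest value. A minimal sketch with a fabricated replicate matrix (rows: replicate primary samples; columns: analytes):

```python
import numpy as np

# Hypothetical results from a RE(5) analysed for three analytes (A, B, C)
replicates = np.array([
    [101.2, 4.9, 250.1],
    [ 98.7, 5.3, 248.9],
    [104.1, 4.4, 251.7],
    [ 95.9, 6.1, 249.5],
    [102.6, 3.8, 250.8],
])

rsv_per_analyte = 100.0 * replicates.std(axis=0, ddof=1) / replicates.mean(axis=0)
worst = int(np.argmax(rsv_per_analyte))
print("RSV (%) per analyte:", np.round(rsv_per_analyte, 1))
print("Focus the sampling protocol on analyte column", worst)
```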
Quality control of a sampling operation is completely tied in with
the degree of trustworthy spatial (or temporal) “coverage” that was
achieved in the deployment of the (r) primary sampling replications,
Figure 9.3. The schematic lot depicted is meant as a metaphor only—
it is meant to represent very many types of lots in general. For a RE(r)
to be relevant, “coverage” is everything because this is in reality
testing how an alternative singular primary sampling may come out.
Having access to ten such alternative primary samples (and their
corresponding analytical results) furthers critical insight into how well
the particular sampling operation works in relation to the inherent
heterogeneity (CH + DH) of the target lot material, i.e. how stable, or
“robust” is the sampling operation in use?
There is another very useful aspect of RE(r). Consider that an initial test of a sampling procedure resulted in a RSV of, say, 127%. There is obviously something distinctly wrong here: 127% is way above 30% (or 20%, or lower); this represents a situation in which one would very likely find a non-representative operation somewhere in the sampling pathway. It is possible to detect precisely where the culprit part-process is to be found—for which reason the definitions of full vs partial vs hierarchical replication experiments are given first.
Figure 9.3: “Coverage” is everything—but not without insight. Even though the ambitious
sampler is trying to “cover the ground” widely, it is also clear from a TOS point of view that
the whole lot is far from covered properly. This illustration is a warning that deploying a
RE(10) in-and-of itself is not enough and that it can in fact lead to misinformation, if the
fundamental TOS demand of lot coverage is not properly understood.

Full replication: By the option of replication "from the top", i.e. in a baseline RE, direct, unambiguous and quantitative information as to
the efficiency and validity of the total sampling + measurement
procedure can be had. In the event of a RSV transgression, the
message has been sounded clearly—remedying activity will be
needed in order to lower the sampling quality criterion below the
pertinent threshold.¶ In this fashion a full RE(r) delivers immediate,
highly relevant information on any sampling process currently in use.
The “gamble” by starting out with a full RE(r) is that this may
substantiate the current procedure, in which case no further action is
necessary (the gambit was won). But the opposite side of this issue is
that one must implement relevant TOS-remedial actions in the case
where RSV transgresses 20–30%, no exceptions! There is only one
remedy for a “too high” RSV, be this for a full, a partial or a
hierarchical Replication Experiment (r)—TOS to the fore for
remediation.
Partial replication: It is also possible to start the replication
experiment at a “lower stage” in the replication hierarchy. Figure 9.4
illustrates the general setup for full vs partial replication experiments.
Hierarchical replication: A hierarchical replication consists of the full and all partial replication setups combined. This is complete replication at all levels (of course starting with the primary sampling). The situation is illustrated in Figure 9.5, in which the magnitude of each s² is represented by the length of a horizontal bar. As each lower-level variance is included in those pertaining to all higher levels, the lower-level variances can be subtracted from those of all higher levels. By this simple subtraction, one can perform a complete decomposition of the level-specific s². For details, the reader is
referred to http://www.spectroscopyeurope.com/sampling/sampling-
quality-assessment-replication-experiment.
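The subtraction logic behind such a hierarchical decomposition can be written out in a few lines. In the sketch below each stage's replication variance includes everything from that stage downwards, so the stage-specific contribution is obtained by subtracting the variance observed from the next lower stage; the numbers are invented purely for illustration.

```python
# Total variance observed when replication starts at each stage
# (stage 1 = primary sampling ... stage 6 = analysis only, TAE)
variance_from_stage = {1: 25.0, 2: 16.0, 3: 9.0, 4: 5.0, 5: 3.0, 6: 1.5}

stage_contribution = {}
for stage in sorted(variance_from_stage, reverse=True):
    lower = variance_from_stage.get(stage + 1, 0.0)   # stage 6 has nothing below it
    stage_contribution[stage] = variance_from_stage[stage] - lower

for stage, s2 in sorted(stage_contribution.items()):
    print(f"Stage {stage} contributes {s2:.1f} to the total variance")
```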
Figure 9.4: Illustrating full vs partial Replication Experiments, i.e. different starting levels for
the RE (horizontal arrows). This illustration harks back to the hierarchical levels of
“replication” delineated in the introduction, section 9.1.

9.4 RE consequences for validation


From the above, it is apparent that a Replication Experiment (r)
furnishes the exact information needed to tackle all the issues
presented earlier. RE(r) is an unambiguous quantitative index of the
total sampling-and-analysis variability induced through the full lot-to-
analysis pathway. In this sense, precisely, the Replication Experiment
(r) is a “measure of everything”. This has a direct impact on the
validation issues, especially regarding the aspect of “replicate
measurements”. Two easy outlines of these interrelationships can be
found in Esbensen et al. [3–4]. The next section discusses these
principles in more detail focusing on development of multivariate
calibrations for spectroscopic applications.

Figure 9.5: Hierarchical replication setup. A separate RE(r) for each sampling stage allows full
decomposition of RE(r) variance into stage components.

9.5 Replication applied to analytical method development
Multivariate regression is typically carried out on samples, hopefully
representative samples, that must cover the relevant range of
constituent composition (or other properties) such that these samples
in particular span what is expected in the future. Note here that
analyte range is used as a simplified proxy for DHL. Where the
application requires collection of natural, technological or industrial
samples, heterogeneity is a common and serious issue, particularly
the “representativity of the training data set” with respect to the target
lot/system. Additional questions also arise, more related to the
analytical issues, such as “how many replicate scans should be
performed?” In some industries, for example the pharmaceutical industry, there may still be serious issues of heterogeneity even though products are supposed to be manufactured to within tight specifications. This applies to unit operations such as mixing, where it is necessary to extract samples from powder blends for quality control purposes (see Esbensen and Romañach [5] for an example of this situation), or where tablets must be extracted from the full population of produced units for acceptance sampling.
It is important to keep track of the “lot-to-analysis” hierarchy of
scales in order not to be confused. Assume that sampling issues
levels 1–3 have been attended to in a satisfactory manner. The main
remaining issue to be addressed when utilising spectroscopic
methods [such as near infrared (NIR) or Raman spectroscopy] then
concerns the heterogeneity of the “analytical sample”. Thus, the
section below deals with replication stages 4–6 only.
Spectroscopic analysis of solid materials is limited to the size of
the “beam foot print” of the instrument. In the case of materials
exhibiting low heterogeneity (for example, certain raw materials, or
“well blended” powders), the conventional beam spot size may well be able to furnish a representative measure of the entire material, if and when suitably validated and verified. However, in the case of
more heterogeneous materials, including natural samples of fruits,
vegetables, grains, aggregates, soil … the characteristic sample
heterogeneity will be much larger than the beam/spot size scale
footprint (a particular issue with quantitative NIR and Raman
spectroscopy).
In such cases, replication takes the form of taking a number of
scans over one, or more regions of the surface of the analytical
aliquot and averaging to obtain a spectrum that minimises the
inherent variability encountered at this analysis stage. Many vendor
solutions to this particular issue can be found, the common feature of
which is to enlarge the analytical area/volume of the effective
footprint by mechanical means, including the use of a rotating sample
dish or a translating beam. A particularly effective way to increase the
analytical area is to analyse from the outside of a rotating and
translating cylinder. With such solutions, the effective analytical area
can be increased 10-fold to 50-fold, allowing very powerful coverage
and averaging to come into play. This approach could be termed
area-enhancing spectral-acquisition replication, but it is important to
keep in mind that the target here is one analytical sample only (which
may, or may not be representative in itself**).
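A minimal sketch of such replication/averaging over the analytical aliquot is given below, using synthetic spectra as a stand-in for real scan data; if the scan-to-scan contributions are approximately independent, averaging n scans reduces their random variability by roughly the square root of n.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 20 replicate NIR scans (rows) over different spot locations
# on one analytical sample, 700 wavelength channels (columns).
n_scans, n_channels = 20, 700
true_spectrum = np.sin(np.linspace(0, 3, n_channels)) + 1.5
scans = true_spectrum + rng.normal(scale=0.05, size=(n_scans, n_channels))

mean_spectrum = scans.mean(axis=0)   # the averaged spectrum used for calibration

# If the scan-to-scan contributions are roughly independent, averaging n scans
# reduces their standard deviation by about sqrt(n).
print(f"single-scan sd (average over channels): {scans.std(axis=0, ddof=1).mean():.4f}")
print(f"approx. sd of the mean spectrum:        {(scans.std(axis=0, ddof=1) / np.sqrt(n_scans)).mean():.4f}")
```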
Replicated spectral acquisition with the same spot location
amounts to nothing more than an estimate of TAE, which usually has
been produced many, many times over earlier; it is a particularly
nonsensical, and economically wasteful, operation to mandate
replicating the analysis over and over as per routine (clearly without
thinking).
These features should seem obvious, but in many calibration
situations, a bulk sample is still split so that half goes to the
laboratory (for reference analysis) with the other half for
spectroscopic analysis, tacitly assuming that all 50/50 split samples
will always be identical. In many cases, however, calibration models
have poor precision due to this type of disconnect between the
measured sample and the related laboratory reference values,
precisely because there may still be a significant heterogeneity at the
bulk sample level; it is all again a matter of the specific heterogeneity
of the material. Fortunately, there is an easy solution to this problem:
it should always be the same sample that is measured by the
particular analytical instrument which is also sent for laboratory
analysis. If one is even to begin contemplating deviation from this mandate, it is comforting to know that representative sample splitting is entirely possible (chapter 3).
Even in the situation where the reference analysis indeed is
representative of the sample scanned, there is a misconception about
replication in method development. The International Conference on
Harmonisation (ICH) has developed guidance to industry on how to
validate analytical methods. In the document entitled “Validation of
Analytical Procedures: Text and Methodology” [6] the conventional
principles of Repeatability and Reproducibility are defined. These are
important aspects that must be considered when replacing a primary
analytical method with an alternative, secondary method. Within this
context, the precision of an analytical method is separated into three
components.
Repeatability typically measures the precision of the analytical
method on the same sample (stage 6, TAE) and is to be measured
within a short time-period. In this case RSV will of course be low,
otherwise the analytical method itself would be considered to include
too much random variation for the precision to be acceptable. The
suggested value for RSV in this case is expected to be below 2%
(ICH).
Intermediate Precision is a measure of how well a procedure can be performed within the same laboratory beyond a single short time period, assessed over factors such as days, analysts, instruments etc. It is typically performed in a non-
biased way using an experimental design that includes analyst 1
performing analyses on instrument 1 on day 1. The same sample
stock is given to analyst 2 on day 2, using either another instrument
or the same instrument analyst 1 used, but completely “reset”. This is
done in this fashion to ensure that the samples are “true replicates”.
To add statistical credibility to the results obtained, analyst 1 is
typically the most experienced analyst and analyst 2 the least
experienced.
The results of an intermediate precision test are statistically
compared using a paired t-test (chapter 2), in which a significant
analytical bias can be detected and its magnitude can be assessed.
For example, if the bias between analysts is found to be insignificant,
then there is no difference in the replicate samples being measured
by the two analysts and therefore, the Standard Deviation of
Differences (SDD) can be used as a measure of the Standard Error of
Laboratory (SEL), which is a primary statistic used to compare the
precision of the primary method to the alternative method. The SEL is
discussed in detail in chapter 7.
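A hedged sketch of such an intermediate precision comparison is shown below: a paired t-test between the two analysts' results and, in the absence of a significant bias, the SDD as an estimate of the SEL. The assay values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical assay results (%) on the same sample stock:
# analyst 1 / instrument 1 / day 1 versus analyst 2 / day 2.
analyst_1 = np.array([99.8, 100.2, 99.5, 100.1, 99.9, 100.4, 99.7, 100.0])
analyst_2 = np.array([99.6, 100.3, 99.4, 100.3, 99.8, 100.2, 99.9, 100.1])

t_stat, p_value = stats.ttest_rel(analyst_1, analyst_2)   # paired t-test
diffs = analyst_1 - analyst_2
sdd = diffs.std(ddof=1)                                    # Standard Deviation of Differences

print(f"paired t = {t_stat:.3f}, p = {p_value:.3f}")
if p_value > 0.05:
    # No significant analyst bias detected: SDD may serve as an estimate of the
    # Standard Error of Laboratory (SEL), cf. chapter 7.
    print(f"No significant bias; SDD (~SEL) = {sdd:.3f}")
else:
    print(f"Significant bias detected; mean difference = {diffs.mean():.3f}")
```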
Reproducibility is a measure of how well different laboratories can
perform the analysis developed by one laboratory and is a measure of
method robustness. It is obvious that the sample stock heterogeneity
is critical here, and needs to be fully evaluated (for example using a
replication experiment).
These three components may nevertheless constitute a smaller part of the total prediction error, or misclassification rate, than stratification (grouping) of samples due to time, raw material supplier and other uncontrolled sources of variation.
Under the auspices of ICH, when it can be shown that an
analytical method has no bias—accompanied by a suitable precision,
this is usually taken to indicate that the primary sampling, or sampling
regime used, is adequate. But this conclusion is only applicable to
certain, well-specified industry sectors dealing with demonstrable
low-heterogeneity materials, most notably in the pharmaceutical
industry. It is essential not to fall into the trap of illegitimate
generalisation here. Where a significant heterogeneity exists in the lot
material, as well as in the sample being measured, the method
development scientist must take this into account up-front when
developing a relevant sampling plan. The full brunt of the
heterogeneity issues treated above can be present, or be present to
an intermediate degree, or not at all; the point is that one usually does
not know what this status is. But this is all a matter of the
characteristics of the lot material interacting with a specific (good or
bad) sampling method—plus TAE, most emphatically not pertaining
to the analytical method alone.

9.6 Analytical vs sampling bias


A critical feature concerns the relationship between the analytical bias
and the sampling bias. The analytical bias, a well-known concept in
analytical chemistry and metrology, signifies a systematic deviation of
constant magnitude, i.e. the classic statistical bias. Applying a
dedicated experiment, its magnitude can always be estimated, after
which it can be corrected for by a simple subtraction (lower-right
panel in Figure 9.6).
In opposition to this conventional analytical bias understanding,
the sampling bias is of a distinctly different nature—the sampling bias
is not constant, but varying. Because the sampling process interacts
with different, spatially dislocated parts of a heterogeneous lot
material every time when a “replicate sample” is extracted, repeated
attempts to estimate the magnitude of the sampling bias will in
principle result in different dispositions of the ensemble analytical
results, as illustrated in the upper-right panel of Figure 9.6. The
sampling bias is inconstant, and can therefore never be subjected to
any form of bias-correction. This is the most fundamental difference
between appropriate understanding of the analytical process and the
specific issues pertaining to the sampling regimen.†† A significant and regrettable confusion, with this as its root cause, has existed for many decades. These issues are also put under the validation microscope in Esbensen and Geladi [8].

Figure 9.6: Analytical vs sampling bias. The analytical bias is per definition always assumed
to be constant and can therefore be subjected to a statistical bias-correction (lower-right
panel). The sampling bias is of a fundamentally different nature, however, due to
heterogeneity. The sampling bias varies in magnitude every time it is estimated as a
consequence of the heterogeneous nature of the lot/system and can therefore never be
corrected for (upper-right panel). If estimated one more time the sampling variability would
again be both biased and imprecise (but would constitute a different point swarm location
and disposition in the dart board illustration metaphor used here). Instead the sampling bias has to be eliminated, the tools for which are to be found in TOS.

TOS draws the only logical, scientific conclusion possible: the sampling bias must instead be eliminated. As it turns out, this is fully
possible albeit with very different means than a conventional
statistical correction. The salient issue is that it is impossible to know
a priori when the case is of low, intermediate or high material
heterogeneity if a heterogeneity characterisation, in the form of a
replication experiment has not been performed, see above.
It is not possible to design an appropriate sampling procedure or
sampling plan in the absence of information about the inherent
heterogeneity met with (at whatever scale). It is a persisting myth,
borne mostly out of ignorance, that sampling representativity can be
achieved simply by acquisition of a particular piece, brand or type of
sampling equipment or by following a standard(ised) sampling
procedure—the claims of very many standards and OEMs
notwithstanding. Many existing types of equipment are in fact not in
compliance with the principles of TOS, and will not deliver
representative samples, DS 3077 [2].
Without sampling representativity, there can be no data
representativity—without which the data analyst is in reality “flying
blind”, delving into the complexities of multivariate calibration based
on a too-limited conceptual understanding of all the types of errors
influencing the total Measurement Uncertainty (MU). There is no escaping
this troublesome situation, not within analytical chemistry, within
statistics nor within data analysis in general. This is the reason for including chapters 3, 8 and 11 in a curriculum for data analysis.
In point of fact, a firm tradition has developed within chemometrics surrounding multivariate calibration validation in which only measurement uncertainty in the strict sense (“measurements”) is considered, together with the search for the definitive validation index which can be used for all types of data. The most prominent of these is undoubtedly the Ratio of Performance to Deviation (RPD) index used massively within the NIR community. This traditional usage has recently been put into a proper perspective by a slightly iconoclastic paper, Esbensen et al. [9], in which can be found a broader understanding of the straitjacket restrictions that actually pertain to RPD—which makes it significantly less than the universal validation performance indicator sought for.
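For orientation, a common way of computing RPD is sketched below: the standard deviation of the reference values divided by the standard error of prediction (SEP); conventions differ in detail (some use RMSEP in the denominator), which is part of the critique in Esbensen et al. [9]. The validation values shown are hypothetical.

```python
import numpy as np

def rpd(y_reference, y_predicted):
    """Ratio of Performance to Deviation: sd(reference) / SEP.

    SEP is here taken as the bias-corrected standard error of prediction;
    other conventions (e.g. RMSEP in the denominator) are also in use.
    """
    y_reference = np.asarray(y_reference, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_predicted - y_reference
    sep = (residuals - residuals.mean()).std(ddof=1)
    return y_reference.std(ddof=1) / sep

# Hypothetical validation-set reference values and model predictions.
y_ref  = [12.1, 14.3, 11.8, 15.0, 13.2, 12.7, 14.8, 13.9]
y_pred = [12.4, 14.0, 12.1, 14.6, 13.5, 12.5, 14.9, 13.6]
print(f"RPD = {rpd(y_ref, y_pred):.2f}")
```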

9.7 References
[1] Pitard, F. (2009). Pierre Gy’s Theory of Sampling and C.O.
Ingamell’s Poisson Process Approach. Pathways to
Representative Sampling and Appropriate Industrial Standards.
Doctoral thesis, Aalborg University. ISBN 978-87-7606-032-9
(available from the author at [email protected])
[2] DS 3077. Representative Sampling—Horizontal Standard (2013).
Danish Standards, www.ds.dk
[3] Esbensen, K.H., Geladi, P. and Larsen, A. (2013). “Mythbusters
in Chemometrics: The replication Myth 1”, NIR news 24(1), 17–
20. https://doi.org/10.1255/nirn.1390
[4] Esbensen, K.H., Geladi, P. and Larsen, A. (2013). “Mythbusters
in Chemometrics, 6: The Replication Myth 2: Quantifying
empirical sampling plus analysis variability”, NIR news 24(3),
15–19. https://doi.org/10.1255/nirn.1364
[5] Esbensen, K.H. and Romañach, R.J. (2015). “Proper sampling,
total measurement uncertainty, variographic analysis & fit-for-
purpose acceptance levels for pharmaceutical mixing
monitoring”, TOS forum Issue 5, 25–30.
https://doi.org/10.1255/tosf.68
[6] “ICH Harmonized Tripartite Guideline Q2(R1), Validation of
Analytical Procedures: Text and Methodology” (1997). Federal
Register 62(96), 27463–7.
[7] Esbensen, K.H. and Wagner, C. (2014). “Theory of Sampling
(TOS) versus Measurement Uncertainty (MU) – a call for
integration”, Trends Anal. Chem. 57, 93–106.
https://doi.org/10.1016/j.trac.2014.02.007
[8] Esbensen, K.H. and Geladi, P. (2010). “Principles of Proper
Validation: use and abuse of re-sampling for validation”, J.
Chemometr. 24, 168–187. https://doi.org/10.1002/cem.1310
[9] Esbensen, K.H., Geladi, P. and Larsen, A. (2014). “The RPD
myth…”, NIR news 25(5), 24–28.
https://doi.org/10.1255/nirn.1462

* This chapter is intended to deal comprehensively with the confusion surrounding all issues
of “replication”. A general argument will be presented covering the most common and also
less common scenarios.
† Taguchi approach: http://en.wikipedia.org/wiki/Taguchi_methods
‡ “There is always an uncertainty, regardless of how small it is, between the true, unknown
content aL of the lot L and the true, unknown content of the sample, aS. … tradition has
established the word ‘error’ as common practice, though it implies a mistake that could have
been prevented, while statisticians prefer the word ‘uncertainty’ which implies no
responsibility. However, in practice, as demonstrated in the Theory of Sampling, there are
both sampling errors, and sampling uncertainties. Sampling errors can easily be minimised,
while sampling uncertainty for a pre-selected sampling protocol is inevitable. …. Because
the word ‘uncertainty’ is not strong enough, the word ‘error’ has been selected as current
usage in the Theory of Sampling, making it very clear it does not necessarily imply a sense of
culpability”; quoted from Pitard [1] p. 33 who graciously informs that this statement rightly
originates with Pierre Gy (in a monograph written in French, 1967).
§ There has been an extensive debate at meetings and in the literature within the
international sampling community as to setting up a (completely) general RSV threshold.
Opinions have slowly converged to accept a suggestion, which originates from the
characteristics of the Poisson distribution (sampling can in a certain sense be likened to a
Poisson selection process), that a RSV larger than 20% signifies that the average of repeated
sampling-and-analysis is out of control. Be aware, however, that this is an attempt to
characterise all the world’s extremely different materials, lots and processes with one
singular threshold—a very dangerous simplification! More on this important issue can be
found in DS 3077 [2] and in Pitard [1].
¶ There exist many types of materials with significantly different heterogeneity levels. The
general threshold for RSV (20–30%) refers to significantly heterogeneous materials, a very
wide and diverse class of materials, for example, mineralisation, ores, pollutant effects,
industrial aggregates (building materials, waste, biomass …), the list is far from exhaustive. There are
also many other types of materials which exhibit less heterogeneity, for which a lower
threshold will be relevant, say at the 10%, or 5% level, or even lower. It is emphasised that
the demand for a general RSV is a demand which cannot be met universally. It is necessary
to invoke common sense when making RSV operational in a distinct problem-dependent
fashion [2–4].
** A classical case is that of analysis of protein and moisture in wheat, where a bulk sample is
introduced to the instrument and is analysed many times over to produce a single averaged
predicted value, essentially following the same procedure, but can this procedure be
generalised? An appropriate answer would be related to how representative the reference
sample is with which the spectral replication/averaging approach is to be linked in a
calibration, in addition to the specific spot size issue pertaining to the instrument. There is
never analysis without (some form of) preceding sampling, sub-sampling, sample preparation
etc.
†† A full account of these interrelations can be found in Esbensen and Wagner [7].
10. An introduction to multivariate
classification

10.1 Supervised or unsupervised, that is the question!
As discussed in detail in chapters 4 and 6, Principal Component Analysis (PCA) is one of the most powerful Exploratory Data Analysis (EDA) methods available. EDA helps the data analyst to search for potential data patterns, groupings, clusters and outliers in a specific data set—in short, performing an act of Pattern Cognition (PAC) in an unsupervised manner.
By unsupervised, it is meant that the data analyst is using a data
analytic method to look for and identify potential sample groupings,
or clusters, that may exist in the data table. When the data analyst is
using PCA and is contemplating potential patterns by using score
plots, PCA is used as a visual clustering technique without any
pattern structure template; the data analysts discover structured patterns. Other approaches, aiming for the same recognition, include k-Means clustering and Hierarchical Cluster Analysis (HCA). Whether or not clusters are recognised in the selected data dictates whether classification rules can be established that will uniquely classify new samples in future data (in the statistical sense).
When classification rules are developed, this is known as
supervised classification, because there is now a template data
pattern (formulated as a particular multivariate data structure model)
to compare it to. Supervised classification encompasses methods
such as Soft Independent Modelling of Class Analogy (SIMCA), Linear
Discriminant Analysis (LDA), Support Vector Machine Classification
(SVMC) and Partial Least Squares Discriminant Analysis (PLS-DA), to
name just a few.
This chapter provides an overview of some of the most commonly
used unsupervised and supervised multivariate classification
methods for identifying data classes and developing subsequent
classification rules with which to gauge new samples. These methods
find widespread usage in research and industrial applications
including, but most certainly not limited to, grouping of data w.r.t.
potential common patterns or modelling such common patterns with
the aim of better identifying outliers, analysis of medical diagnostic
characteristics, identification of new disease variants, identifying new
biological species (or variants), early detection of process faults,
modelling data structure background (e.g. the natural background
level of multi-element chemical signatures with a view to quantifying
univariate or multivariate pollution signals—or mineralisation signals
in economic geology), identifying common traits in yearly accounting
data… right through to raw material identification in the
pharmaceutical industry. Chapter 13 provides a brief introduction to
Hierarchical Modelling and shows how various supervised
classification rules can be joined together, using Boolean logic, to
develop a decision-making strategy that leads to unique results
without any human intervention.
The scope of use for multivariate classification is, in reality,
limitless and constitutes a “must have” tool set that any scientist or
engineer should have for solving problems of the classification type. It
is stated here that multivariate classification is just as important as
the multivariate calibration methods described in chapter 7 and both
approaches should share an equal portion in the data scientist’s
knowledge space.
10.2 Principles of unsupervised classification and
clustering
The fundamental first step to solving any data analysis problem is
visualisation. As the saying goes: “Visualise before you Analyse”. This
mandates the plotting of samples (objects) as a function of the
variables, to gain a quick understanding of the typical object profiles
and thereby possibly detect gross outliers before formal data analysis
is commenced; saving much analysis time. It is also an excellent
practice to use a Descriptive Statistics analysis first, as this can help
to indicate what type of preprocessing (chapter 5) would appear
optimal before analysis (refer to chapter 2 for more on descriptive
statistics). Experience, practical experience, is as ever the real master
here.
However, in the end multivariate data should be analysed
multivariately and the best place to start is always some form of EDA,
PCA being the by far most commonly used approach. PCA offers the
advantage over most other methods that it can:
1) Show how objects group (cluster) in multivariate space (typically in no more than three dimensions).
2) Provide important information on the variables (and their
interactions), thus providing interpretability of the clusters
observed.
3) PCA is a highly validatable method and thus a quality of fit estimate
of the EDA can be provided.
4) PCA allows easy recalculation of new models based on the
identification of important/unimportant variables (variable
selection).
5) PCA is highly visual and aligns well with the “Visualise before you
Analyse” principle.
There are, of course, several other methods available for
identifying clusters in a data set. Here the methods of k-Means
Clustering, HCA and PCA will be compared using a set of NIR
spectra collected on three different raw materials used in the
pharmaceutical industry.
10.2.1 k-Means clustering

k-Means Clustering is conceptually very easy to understand, but unfortunately highly non-visual in its presentation (unless it is used in
conjunction with a more visual approach such as PCA). It aims to
separate objects into k predefined classes with grouping based on
the objects being closest to a centroid for each class. As new objects
are added to a class, the centroid is recalculated and the objects in
the set are reassessed for class membership. This concept is shown
graphically in Figure 10.1.
When the “nearest neighbour distance” is statistically exceeded,
the new object is assigned to its own unique class and so on until all
objects have been clustered into the number (k) of predefined
classes, Miller and Miller [1]. The algorithm works by minimising the
within-cluster sum of squares, Adams [2], thus allowing the definition
of statistical limits for class belongings, i.e. object
acceptance/rejection from a defined class. In all cases, k-Means aims
to assign all objects into one class only.

Figure 10.1: The concept of cluster definition in k-means clustering.


Adams [2] describes a four-step process for the k-Means
algorithm as follows,
Step 1: Define k clusters (k usually being a small integer) to group
the data into and define any initial objects per cluster (should
knowledge of the number of classes exist). Calculate the cluster
means and the initial partition error.
Step 2: For the first object, calculate the increase/decrease of the
partition error by moving the object into the alternative classes. If
the error is reduced by moving the object to a particular class,
keep it in that class, otherwise, leave it in its original class.
Recalculate the class means each time an object is moved to a
new class.
Step 3: Repeat step 2 for all objects in the data set.
Step 4: If no objects have been moved, stop the process,
otherwise go to step 2.
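The four steps can be condensed into a minimal working sketch, shown below for a small synthetic data set; this bare-bones implementation is intended only to illustrate the iteration described by Adams [2] and is not the implementation used in The Unscrambler®.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Bare-bones k-means: assign objects to the nearest centroid (Euclidean
    distance), recompute centroids, repeat until no object changes cluster."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    labels = None
    for _ in range(n_iter):
        # Steps 2/3: (re)assign every object to its closest cluster centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                              # Step 4: nothing moved
        labels = new_labels
        # Step 1 (repeated): recompute the cluster means (keep old centroid if empty).
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# Synthetic example: three well-separated groups in two variables.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in ([0, 0], [3, 3], [0, 4])])
labels, centroids = kmeans(X, k=3)
print("cluster sizes:", np.bincount(labels))
```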

Where k-Means clustering fails


There are a number of disadvantages encountered when using k-
Means (or the related k-Medians) methods. These include:
The number of clusters into which to partition the objects must be known, and set, before analysis can begin. This can result
in many iterations of cluster definition during the analysis phase
and subjectivity rather than objectivity may start to creep into the
analysis.
The final grouping of objects 100% reflects the initial choice of
clusters, or the initial objects chosen to define the first cluster
centroids (also related to subjectivity).
The data analyst has to determine the similarity/dissimilarity
distance metric used for class assignment. The most commonly
used distance metric is the Euclidean distance, but other methods
such as the City Block or Correlation distances are also available.
This point alone renders k-Means and even HCA a very subjective
approach and these methods should only be used when they can
be verified against a more robust method such as PCA.
For a more extensive discussion of the k-Means algorithm the
interested reader is referred to the texts by Everitt [3] and
Romesburg [4].

Clustering raw materials using NIR and k-Means


For many years, the pharmaceutical and related industries have used
NIR for the identification of raw materials. In this example, three raw
materials (from a number of sub-lots) were scanned to form k = 3
potentially unique clusters. On this basis, it is now the task of the
data analyst to use the cluster analysis method to determine whether
the NIR method is capable of uniquely clustering the available
samples into unique classes. If this is the case, then the method of
NIR opens up the possibility of developing classification rules that
can be used to determine the identity of new samples. This method is
not only limited to NIR, and can be used for Mid-IR, Raman or any
other method that is capable of detecting class differences.
The physical samples were scanned by NIR using diffuse
reflectance mode; the spectra of the three material classes are shown
in Figure 10.2. The spectra were preprocessed using the Standard
Normal Variate (SNV) method previously described in chapter 5.
Initial visual inspection of the data indicates that there are three
distinct classes of objects present and that any clustering algorithm
should be able to separate the materials. To keep things simple, the
use of the Euclidean distance metric was used, as this is the simplest
approach and as always, simplest is “probably” best (the rule of
parsimony).
Figure 10.2: SNV transformed spectra of three raw material classes used in the
pharmaceutical industry.
Figure 10.3: Results of k-Means clustering of three raw materials scanned by NIR.

The results of the cluster analysis are typically presented in a tabular format, as shown in Figure 10.3.
The results are shown in The Unscrambler® as a new Class
Category column and each Cluster as a coloured row band. As
expected, the cluster analysis was successful on these data; the
simple k-Means method was able to cluster the raw materials into
three unique classes.

Hierarchical Cluster Analysis (HCA)

Hierarchical Cluster Analysis (HCA) is a step up from k-Means clustering in that it aims to separate the original data into a few
classes by either agglomerative or divisive methods, Adams [2].
Agglomerative methods fuse together smaller sub-clusters of
samples and successively build to larger clusters of samples,
whereas divisive methods start with a single cluster and divide it into
smaller clusters of similar objects.
As with k-Means, a disadvantage of HCA is that the number of clusters has to be defined at the beginning of the analysis and the distance metric also has to be decided; there are as many choices here as there are available HCA methods, including the well-known Ward’s method, Everitt [3]. The reader is strongly advised to consult the cluster analysis textbook par excellence: Cluster Analysis for Researchers, Romesburg [4], unequalled even today.

Figure 10.4: Dendrogram of clusters after application of HCA to NIR spectra of raw materials.

The major advantage of HCA over k-Means is that it provides a graphical display of the clusters known as a Dendrogram. A
Dendrogram is a tree structure showing the linkages and the
similarity/dissimilarity relationships displayed by the objects. The raw
materials data were analysed using the method of HCA, again using
the Euclidean distance measure and the results are shown in the
dendrogram of Figure 10.4.
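For readers who wish to experiment, an agglomerative HCA and the corresponding dendrogram can be produced along the following lines; the spectra below are synthetic stand-ins for the raw material data used here, and Ward linkage on Euclidean distances is only one of many possible choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Synthetic stand-in for SNV-treated NIR spectra of three raw materials
# (10 sub-lots each, 200 wavelength channels).
rng = np.random.default_rng(2)
base = [np.sin(np.linspace(0, 4, 200) + shift) for shift in (0.0, 0.8, 1.6)]
X = np.vstack([b + rng.normal(scale=0.05, size=(10, 200)) for b in base])

# Agglomerative clustering; Ward linkage implies Euclidean distances.
Z = linkage(X, method="ward")

dendrogram(Z, color_threshold=0.7 * Z[:, 2].max())
plt.xlabel("object")
plt.ylabel("linkage distance")
plt.title("Dendrogram of synthetic raw-material spectra")
plt.show()
```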
This is all that will be discussed regarding HCA here; the interested reader is referred to the excellent texts available by Adams [2], Everitt [3] and Romesburg [4] for more detailed discussions of this and other clustering methods.

PCA for cluster analysis


To show how PCA can be applied as a clustering method, the NIR
spectra of the same pharmaceutical raw materials will be used again.
Also shown is how the k-Means and HCA methods can be validated
using PCA.

Figure 10.5: Cluster analysis of NIR spectra of raw materials using PCA.

A simple PCA was applied to the data using Categorical Cross Validation (based on the three classes of material used for the
analysis) and the t1 vs t2 score plot is shown in Figure 10.5. The three
distinct clusters can be seen in the score plot; these were sample
grouped based on their known class assignments.
To validate the k-Means and the HCA approaches to cluster
analysis, the cluster assignments generated by these methods can be
used to sample group the score plot. Note that two PCs describe
99% of the data structure in Figure 10.5, so the PCA is highly
interpretable.
Figure 10.6a and b show the t1 vs t2 scores plot grouped by the
clusters found by k-Means and HCA, respectively.
This example shows that these three unsupervised methods of
clustering were all able to uniquely group the three materials scanned
in the data set. This demonstrates the clear possibility of creating operative class models for each material in the data set—and the
problem now becomes a supervised classification one.

Fisher’s Iris data


A classical data set suitable for assessing the applicability of a clustering method is Fisher’s famous three-species Iris data set, first published by Sir Ronald Fisher in 1936 [5]. The methods of k-
Means, HCA and PCA will be used here to describe how each
method approaches a data set that is not at all as “clear cut” as the
above, i.e. an example of a data set that cannot be classified into
well-separated, unique classes.
The main aim of the analysis is to develop an objective method of
classifying three types of Iris, viz. Iris setosa, Iris versicolor and Iris
virginica based on four experimental variables measured on 150
individual samples. The variables measured were
1) Sepal Length
2) Sepal Width
3) Petal Length
4) Petal Width
Figure 10.6a and b: Validation of k-Means and HCA clusters using PCA.

Table 10.1: So-called “Confusion Matrix” for Iris classification performance assessment using k-Means clustering (Euclidean distance).

Predicted/actual      Versicolor   Virginica   Setosa   Classification rate (%)
Versicolor            48           2           0        96
Virginica             14           36          0        72
Setosa                0            0           50       100

50 samples of each species were measured to generate a data table of dimension 150 rows by 4 columns. For each sample, the
class name was assigned, therefore, the number of classes to define
is k = 3. Using the Euclidean distance measure, the classification rate
is presented in Table 10.1 for a k-Means clustering.
Using the k-Means method of classification with Euclidean
distance, it can be seen from Table 10.1 that:
1) Setosa can be uniquely classified from versicolor and Virginica
[there are zeros in the off diagonal elements of the Confusion table
(10.1)].
2) Versicolor can be uniquely classified from Setosa, but is “confused”
twice with Virginica (a misclassification rate of 4%).
3) Virginica, like Versicolor, can be uniquely classified from Setosa, but is “confused” 28 times in 100 with Versicolor.
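Because the Iris data set is publicly available (it is, for example, bundled with scikit-learn), the clustering behind Table 10.1 can be approximated with a few lines of code; cluster numbering is arbitrary, and the exact counts may deviate slightly from Table 10.1 depending on the initialisation and implementation used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target                      # 150 x 4 table, three species

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Map each cluster to the species most frequent within it (cluster ids are arbitrary).
mapping = {c: np.bincount(y[km.labels_ == c]).argmax() for c in range(3)}
assigned = np.array([mapping[c] for c in km.labels_])

# Confusion matrix: rows = actual species, columns = assigned cluster/species.
confusion = np.zeros((3, 3), dtype=int)
for actual, pred in zip(y, assigned):
    confusion[actual, pred] += 1

print(iris.target_names)
print(confusion)
```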
To visualise the results of a k-Means analysis, plotting is possible
as long as only three variables are plotted at a time. The Iris example
presents four variables; therefore, only three variables at a time can
be shown. Figure 10.7 provides a 3D scatter plot of the variables
Sepal Length, Sepal Width and Petal Length with samples assigned
to their known clusters shown by colour annotation.
Figure 10.7 shows that the species Versicolor and Virginica are
close to each other in properties and that this is why the algorithm
cannot completely separate the two classes (refer to the confusion
matrix in Table 10.1).
Application of the HCA approach to the Iris data using the
Euclidean Distance measure is provided in Figure 10.8.
Here the green clusters are Versicolor, the red clusters Setosa,
and the blue clusters are Virginica. As with the k-Means method,
Setosa was uniquely classified; however, for Versicolor and Virginica the variables do not hold enough discrimination power to separate these two classes.
To complete the analysis, PCA was also performed on this data
set to see if it could separate the classes uniquely. The PCA overview
is provided in Figure 10.9.
An explanation of the data is provided as follows:
1) The explained variance plot suggests two PCs are enough to describe the data (96% total).
2) The scores plot is grouped by species. As with k-Means and HCA,
PCA can clearly separate Setosa, but cannot separate Versicolor
and Virginica.
3) The correlation loadings plot shows that the main contributors to
separating the classes are Petal Length, Petal Width and Sepal
Length. Sepal Width is the main contributor to why the data
separate along PC2.
This example shows that for data sets with a small number of
variables and a large number of objects, it is the discriminating power
of the variables that is most important for successful classification.
The three main variables responsible for separation allow the unique
classification of the species Setosa using all three clustering methods
in this example, however, there is not enough “discrimination”
information in these variables to allow the distinction between
Versicolor and Virginica.
This is a famous data set in statistical and data analytical circles.
Any new classification approach must be tested w.r.t. the Iris data
set, which has become a difficult-to-classify standard that must be
passed. It is very instructive to try one’s own hand in taking the
discrimination/classification task a step, or two, further than the
introductory level presented above. It is actually possible to push the
discrimination between Versicolor and Virginica almost to perfection,
leaving only one intermediate object, positioned right in-between.
Hint: Try dedicated PCA on these classes only (justified by the
perfect separation of Setosa already proved); the beginner data
analyst should also try to apply PLS-DA to this refined data set
(section 10.5).
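As a starting point for this exercise, the sketch below standardises the four variables and performs a dedicated PCA on Versicolor and Virginica only; it is merely a first step, not the full refined analysis suggested in the hint.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Dedicated PCA on Versicolor (1) and Virginica (2) only, as suggested in the hint.
mask = y > 0
X2 = StandardScaler().fit_transform(X[mask])
scores = PCA(n_components=2).fit_transform(X2)

for species in (1, 2):
    sel = y[mask] == species
    plt.scatter(scores[sel, 0], scores[sel, 1], label=iris.target_names[species])
plt.xlabel("PC1 scores")
plt.ylabel("PC2 scores")
plt.legend()
plt.show()
```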

Figure 10.7: 3D Scatter Plot of Fisher’s Iris data grouped by k-Means clustering.
Figure 10.8: Separation of classes of Iris species using HCA and Euclidean distance.

Figure 10.9: PCA overview of Fisher’s Iris data.

10.3 Principles of supervised classification


Classification is an example of a supervised pattern recognition
objective used, for example, to see if one or more new objects belong
to an already existing group (class) of similar objects. For this, and
related, purposes the Soft Independent Modelling of Class Analogy
(SIMCA) approach is the unsurpassed methodology, alongside the approaches of Linear Discriminant Analysis (LDA), Support Vector
Machines (SVM) and Partial Least Squares Discriminant Analysis
(PLS-DA). However, there is a major pitfall when using LDA and SVM,
as will be discussed in a later section of this chapter.

Soft independent modelling of class analogy (SIMCA)


The philosophy behind this classical chemometric technique—SIMCA
was in fact the very first chemometric method to be formulated, Wold
[6]—is that objects in one class, or group, show similar rather than
identical behaviour; similar in a sense that can be treated by a local
PCA model. This can be described to mean that objects belonging to
the same class show a particular class pattern, which makes all these
objects more similar to each other with respect to any other group or
class. The goal of classification is to assign new objects to the class
to which they show the largest similarity. The SIMCA approach
specifically allows objects to display intrinsic individualities as well as
common pattern characteristics, but only the common properties of
the classes are modelled by PCA.
The easy part of understanding SIMCA is that 90% of this
approach is simply based on PCA (sometimes PLSR models are also
used). SIMCA is nothing other than a flexible, problem-dependent,
multi-application use of PCA-modelling. First, a practical introduction
of how to use SIMCA is given below, without all the technical
background. The SIMCA approach has been retold numerous times
since its inception (see reference list), but really never surpassed by
Wold’s classical paper of 1976 [6], in which can be found the
traditional full introduction, complete with all technical details.
Figure 10.10: Grouping (clustering) as revealed in the initial overview scores plot.

Grouping or clustering appears in bilinear projections according to what is shown schematically in Figure 10.10 (which is similar to the
scores plot shown in Figure 10.5 for the raw materials scanned by
NIR).
SIMCA classification is simply based on using separate bilinear
modelling for each recognised data class, which concept was
originally called disjoint modelling. The individual data class models
are most often PCA models (because in the simplest SIMCA
formulation there is no Y-information present). In today’s terminology,
the two, very clearly individual data classes delineated in Figure 10.10
can each be modelled with local, i.e. independent PCA models.
Classification is only feasible, indeed only interesting, if there are
several objects in each class with some form of intrinsic variability,
because every class has to support an A-dimensional PCA model, in
which A runs in the interval [0, 1, 2, 3... Aopt]. There is not much interest in classifying w.r.t. a set of simple hyper-spheroid class structures, both because such a situation is rare and because SIMCA is far more powerful than this. A fully developed SIMCA
classification usually consists of several PCA models (any number
really), one for each recognised class and each of these with its own
different, or similar, dimensionality (as the data structures may be).
There is also the equally interesting marginal case of just one SIMCA
class, in particular for practical applications that require a Pass/Fail
classification or an in/out assignment with respect to just one class.
As shall be seen, this constitutes one of the most powerful features of
SIMCA, being able to treat rationally the so-called “asymmetric
classification” case.
It is noteworthy that a PCA model made by measuring the same
sample numerous times will result in very little variability with which to
create a more substantiated model (unless the sampling involved is,
in itself, so variable that the model mainly describes physically
significant sampling errors rather than chemical information, chapters
3 and 8). In this case, the PCA loadings will unavoidably mainly reflect
the sum total of sampling + measuring errors, i.e. data analytical
noise—there is simply not much else to model. As a good example,
Figure 10.11 presents score and loading plots for a single raw
material measured, i.e. scanned several times over using diffuse
reflectance NIR.
There is an important general lesson here: “More samples” or
“more measurements” do not by themselves constitute a better basis
upon which to model complex multivariate phenomena; there has to
be a significant object similarity/dissimilarity data structure present
that is causally related to the phenomenon involved which is being
modelled, e.g. by a PCA model (or other types of data models). There
is the ever-important interaction between the physical sampling of the
objects upon which the chemical (or other) analytical characterisation
is based. These relationships illustrate the insight that even powerful
chemometric data analysis cannot create information (data structure)
where none existed in the first place: “Noise in/Noise out” is a nice
way to re-phrase the famous data analytical truism: “GIGO: Garbage
in/Garbage out”.
Building natural variability into the model is a highly important task
for SIMCA model development in order to be able to catch the above
“neutral” case wherever present. Thus the heuristic rule: as a minimum, six unique samples should be present in each class (with a typical number of objects, say, between 15 and 30) in order for the data analyst to hope for what could be called a “robust model”. The
objects should represent and span, as best as possible, the
maximum available variability among the class members, signifying
(for example) different seasons, different lot numbers, different work
shifts, different conditions under which the objects are collected
(sampled), see also chapter 3 and chapter 9.
Figure 10.11: Scores and loadings for a PCA of repeated material measurements.

SIMCA-modelling—a two-stage process: first overview


Disjoint class-modelling is always the first step in the SIMCA
classification procedure, called the training stage. Here the individual
PCA models of the data classes in question are established, with a
strong emphasis on the data structure in each class, and its
validation. This situation is shown graphically in Figure 10.12.

Figure 10.12: Each data class from Figure 10.10 as modelled by an individual (local) PC model (SIMCA).

The subsequent classification stage then uses these established class models to assess to which classes new objects belong. Results
from the classification stage also allow the study of the so-called
modelling and discrimination power of the individual variables. In
addition, a number of very useful graphic plots are available, which
allow the study of the classified objects’ membership of the different classes in more detail, as well as quantifying the differences
between classes—and (much) more.
If the data classes are known in advance, i.e. if the specific
membership of all the training set objects is known a priori, it is very
easy to make a SIMCA model of each class. This is a full-blown
example of supervised classification.
Thus, there are two primary ways into a SIMCA classification:
Classes are known in advance (which objects belong to which
classes)
There is no a priori class membership knowledge available
For the second case, any problem-relevant data analytical method
which is capable of finding patterns, groupings, clusters etc. in the
data may be used (assuming, of course, that a pattern recognition
problem indeed is at hand), e.g. cluster analysis. However, there is
already a very powerful and trustworthy method readily available,
namely PCA to be applied to the full data matrix present. It may be as
simple as that—in fact it often is!
It is not a critical issue by which technique a training pattern has
been discovered and delineated, what matters is that this pattern be
data analytically representative of the classification situation. In any
event, when/after the problem-specific data class setup is known, the
SIMCA-procedure(s) are simple, direct and incredibly effective.

Steps for building a SIMCA library


The following lists the steps involved in the development of robust
SIMCA libraries of individual, independent class models (PCA).
1) Plot the data and look for gross outliers and other anomalies.
2) Preprocess the data appropriately, if necessary (refer to chapter 5
for details on preprocessing).
3) Make a PCA projection model of all objects to initially identify
and/or confirm the individual classes.
4) Delineate the pattern-specific classes, while simultaneously
discriminating between classes. Study all relevant score plots and
identify all the problem-specific groups or clusters as well as
potential outliers. Find out which objects belong to each subgroup
in this training stage. There is always a great deal of interaction
between the general problem context and the initial data analysis
results at this stage. Developing valid SIMCA models is very much
an iterative task; it is especially important that the data analyst is
competent so as to be willing to assume responsibility for relevant
outlier detection/elimination. Therefore, the data analyst should
have a sufficient level of subject matter expertise relevant to the
data set. If this is not the case, it is imperative to ask all the
appropriate and relevant questions needed for a truly
representative class description! The last thing required is to carry
out data modelling in splendid isolation from the relevant
information.
5) Make a separate model for each class. Different modelling contexts
for the different classes can be explored to develop an optimal joint
data pre-treatment (standardisation, weighting or more advanced
preprocessing if necessary). Validate properly, i.e. all classes must
be validated in the exact same way, or the membership indices and
threshold limits will not be comparable.
6) Since removing outliers will have the most influential effect on PCA models, there is sometimes a complex outlier deletion/remodelling phase before a final set of well-understood and accepted individual models is at hand. It is obvious that this stage is critically important and one that should never be left in the hands of automated data modelling (machine learning etc.) before very considerable expertise has been acquired; indeed, the main thrust of the present book is to educate the data analyst to be fully able to assume this comprehensive responsibility.
7) Also, study appropriate higher-order score plots to see if there
could be more classes present than what is “known” in advance
(conventional wisdom may easily be wrong!) If so, repeat from step
4. Now, determine the optimal number of PCs for each class, Aopt.
The classification modelling stage is now completed. This may take
place immediately before the next step (see below), or this may
represent a modelling task carried out earlier, on the basis of which
the present classification is to be carried out—all is problem-
dependent.
8) Classify new objects. Create an overall library of validated PC
models and read the new data into the library software for
classification. The library must be established using the validated
number of PCs for each class model; this number was determined in step 7. The data pretreatments developed for the final set of
models will be applied to the data automatically by the SIMCA
program before classification is performed.
9) Evaluate the classification obtained by studying the results in both
tabular and graphical formats. Which plots to use and how to
interpret them are described in section 10.3.6.
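The essence of the procedure above can be sketched in a few lines: one local PCA model per class, a residual object-to-model distance Si for each new object, and an F-test of (Si/S0)^2 against the pooled class residual S0 (these statistics are discussed further below). The sketch is a deliberately simplified, bare-bones illustration on synthetic data; it is not The Unscrambler® SIMCA implementation, and leverage (Hi) limits, proper validation and the per-class choice of Aopt are omitted.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

def fit_class_model(X, n_components):
    """Local PCA model of one class, plus S0: the pooled residual standard
    deviation of the training objects (simplified SIMCA-style convention)."""
    pca = PCA(n_components=n_components).fit(X)
    residuals = X - pca.inverse_transform(pca.transform(X))
    dof = X.shape[1] - n_components          # residual degrees of freedom per object (simplified)
    s0 = np.sqrt((residuals ** 2).sum() / (dof * X.shape[0]))
    return pca, s0, dof

def classify(x_new, models, alpha=0.05):
    """Assign x_new to every class whose F-test on (Si/S0)^2 is not rejected."""
    hits = []
    for name, (pca, s0, dof) in models.items():
        res = x_new - pca.inverse_transform(pca.transform(x_new.reshape(1, -1)))
        si = np.sqrt((res ** 2).sum() / dof)
        # Simplified degrees of freedom; exact SIMCA conventions differ somewhat.
        f_crit = stats.f.ppf(1 - alpha, dof, dof * pca.n_samples_)
        if (si / s0) ** 2 <= f_crit:
            hits.append(name)
    return hits or ["none of the classes (the null class)"]

# Synthetic two-class training set in five variables.
rng = np.random.default_rng(3)
class_A = rng.normal(0.0, 0.5, size=(25, 5))
class_B = rng.normal(4.0, 0.5, size=(25, 5))
models = {"A": fit_class_model(class_A, 2), "B": fit_class_model(class_B, 2)}

print(classify(class_A[0] + 0.1, models))   # a slightly perturbed class A object: typically assigned to A
print(classify(np.full(5, 10.0), models))   # far from both classes: typically the null class
```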

Classification results
The typical result output for a SIMCA model is the Classification
Table, which shows all objects and their classified, i.e. assigned class
memberships.
For each object, an asterisk is shown in the respective class
column for a particular model whenever the object in question has
been assigned as belonging to this highlighted model with the current
significance, i.e. when the object simultaneously satisfies both the Si
and Hi limits set (these statistics are described in more detail below).
Non-marked objects do not belong to any of the tested classes, i.e.
they will not contain an asterisk in any column of the Classification
Table. This is one of the most powerful aspects of SIMCA with
respect to many other classification methods: its ability to assign an object to “the null class”. This is a further development of
SIMCA’s power to deal with the asymmetric classification scenario
(also see further below).

Classifying a new object can lead to principally different results:


1) The object is uniquely allocated to one class, i.e., it fits one model
only within the given classification threshold limits. This often
means that the distance to the next closest class is typically (much)
larger than the accepted distance with respect to this class, but the
case of new objects lying just within one class and (very) close to
another/several others is principally no different. In this sense
SIMCA is a “hard modelling/classification” approach.
2) The object may simultaneously fit several classes, i.e. it has a
distance that is within the critical limits of several classes. This
ambiguity can be due to two reasons; either the given data are
insufficient to distinguish between the different classes—they may
perhaps have very significant sampling and measurement errors, or
they may actually belong to several classes that are not uniquely
separated in the training stage. A particular object may be a
borderline case or it may have properties characteristic of several
classes (for example, being both sweet and salty at the same time).
If such an object is classified to fit several classes one may for
example study both the object distance (Si) and the Leverage (Hi) to
determine the best fit; at comparable object distances, the object is
probably closest to the model for which it displays the smallest leverage. PCA score plots are indispensable for such “close call”
cases. When these cases occur, multilevel routines such as
Hierarchical Models can sometimes be used to try to resolve the
ambiguities encountered. This is discussed in detail in chapter 13.
3) The object fits none of the classes within the given limits. This is a
very important result despite its apparent negative character. This
may mean that the object is of a new type, i.e. it belongs to a class
that was unknown until now or, at least, to a class that has not
been identified in the classification. Alternatively, it may simply be a
true outlier.
A particularly complex and difficult classification study involving all
of the above cases can be found in the matched papers by Massart
et al. [7] and Esbensen et al. [8]. The first paper explains development
of a new strategy needed for very complex classification tasks (a two-
tier approach: “hierarchical–non-hierarchical” clustering), the second
applies it to front-line research on the global iron meteorite population
(literally way “out-of-this-world” samples) [8]. Although perhaps
esoteric w.r.t. the data type, one should not shy away. Together these
papers give valuable insight into how to formulate objectives,
organise data and how to carry out complex iterative data analytical
classification tasks, starting with application of standard classification
approaches but following through with sensitive regard to the
empirical data structures revealed. This adaptive data modelling
approach has much carry-over potential to virtually all other data types and classification contexts.
Data structure cognition vs recognition
One of the most important scientific potentials of the SIMCA
approach is related to this powerful aspect of “failed” pattern
recognition—one must always be prepared to accept that one or
more objects actually do not comply with the assumed data structure
pattern(s). Clearly it is important to be able to identify such potentially
important “new objects” with some measure of objectivity. At some
point in this pattern cognition process it will become important to be
able to specify the statistical significance level(s) associated with
such a “discovery”. Hence, below follow a few remarks on the use of
the statistical significance level in the SIMCA, and similar,
classification context.

Statistical significance level in SIMCA


Statistical significance tests were discussed in detail in chapter 2 and
were defined as being based on the concern of making mistakes.
Thus, the initial statistical hypothesis, the null hypothesis H0, always
is that a new object belongs to the class in question. The statistical
classification test (an F-test, see further below) checks the risk of
rejecting this hypothesis by mistake (a so-called type I error). This
means that, as always in statistics, one does not test the probability
that an object actually belongs to a particular class! In the setup used
in SIMCA classification, the test carried out quantifies the risk of
concluding that a particular object lies outside a specific model
envelope—even if it “truly” belongs (i.e. that in reality it lies within the
model boundary). This double negative way of stating significance
may, perhaps, appear a little confusing, but this is how statistics
works. As luck would have it, it is easy to show how it works in
practice.
With SIMCA, classification results may be studied using varying
significance levels, usually between 0.1% and 25%. An a priori
significance level must always be declared by the data analyst (who
must understand what goes on here, or... back to chapter 2) from
which the program is designed to determine whether an object falls
within (belongs to) or outside the two closest models at this
significance level. This latter part of the procedure can safely be fully
automated, but as with any approach based on statistics, the most
important aspect is that the significance level is chosen by the data
analyst before any classification is carried out. The significance level
is intimately related to the problem at hand, or at least it should be,
i.e. to the risk one is willing to accept for making a wrong decision.
There are, however, many application studies which rely more-or-less
completely on “standard” statistical rules-of-thumb which may be all
there is in some situations—clearly not a desirable option.
The “normal” statistical significance level used is 5%. In practical
data analytical classification terms this “means” that there is a 5%
risk that a particular object falls outside the class, even if it actually
belongs to it; thus “only” 95% of the objects which truly belong will
fall inside the assigned class.
At the opposite end of the spectrum of significance levels typically
used, these issues may be illustrated in the following manner: A high
significance level (e.g. 25%) means that a stricter interval is imposed
—only highly “certain” objects will belong to the class, and (many)
more “doubtful” cases will be classified as lying outside. Fewer
objects that truly belong will fall inside the class (in this case, 75%).
The risk taken is to reject more objects than would otherwise be accepted, even though the rejected samples may in fact belong to the population. This attitude accepts a larger type I error in order to reduce the type II error.
A low significance level (e.g. 1%), on the other hand, means that a
very "loose" criterion is established for class membership: doubtful
cases may still be classified as belonging to the class, i.e. almost all
of the true member objects (99%) will be classified as members of the
class. The type I error is thus kept low, at the cost of a higher risk of
type II errors (accepting objects that do not in fact belong).
It is critical to understand that the significance test only checks
the object with respect to the "transverse" object-to-model distance,
Si, which is compared to a representative measure of the total Si-
variation exhibited by all the objects making up the class, called S0. A
standard F-test is used; a minimal numerical sketch of this test is
given after the list below. A fixed limit (depending on the class model)
is used for the leverage. The SIMCA approach does not check class
belonging in the "longitudinal dimension" of the local PCA models,
which in data analytical practice is often a matter that can (much)
more easily be assessed in relevant score plot projections. Figure
10.13 provides some typical classification situations in a visual
representation.
In Figure 10.13, two main situations are presented:
1) Object A (represented by the open triangle) and object B
(represented by the open circle) are unique members of classes 1
and 2, respectively.
2) Object C (represented by an open square) is not a member of any
of the classes in the present library.
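
For readers who prefer to see the principle in operational form, the following is a minimal sketch (here in Python, using NumPy, SciPy and scikit-learn; it is not the implementation used in The Unscrambler®) of the transverse membership F-test just described. The choice of degrees of freedom is deliberately simplified and commercial implementations may apply further corrections; the sketch is only meant to convey the idea of comparing Si with S0 at a chosen significance level.

```python
import numpy as np
from scipy.stats import f
from sklearn.decomposition import PCA

def simca_membership_test(X_class, x_new, n_components=2, alpha=0.05):
    """Sketch of the SIMCA transverse (Si vs S0) membership test for one class.

    X_class: calibration objects of the class (n x p); x_new: one new object (p,).
    NOTE: simplified degrees of freedom; real implementations may differ.
    """
    X_class = np.asarray(X_class, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n, p = X_class.shape
    pca = PCA(n_components=n_components).fit(X_class)

    # Pooled residual variance of the calibration objects -> S0^2
    E_cal = X_class - pca.inverse_transform(pca.transform(X_class))
    dof_cal = (n - n_components - 1) * (p - n_components)
    S0_sq = np.sum(E_cal ** 2) / dof_cal

    # Residual variance of the new object -> Si^2
    e_new = x_new - pca.inverse_transform(pca.transform(x_new.reshape(1, -1)))[0]
    dof_new = p - n_components
    Si_sq = np.sum(e_new ** 2) / dof_new

    F_obs = Si_sq / S0_sq
    F_crit = f.ppf(1 - alpha, dof_new, dof_cal)
    return F_obs <= F_crit, F_obs, F_crit   # True: accept membership at level alpha
```

With alpha = 0.05 the test will, by construction, reject roughly 5% of objects that truly belong to the class, exactly as discussed above.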

10.4 Graphical interpretation of classification results

10.4.1 The Coomans’ plot

Purpose

This plot shows the transverse (orthogonal) distance from all new
objects to two selected models (classes) at the same time (class pairs
can be treated in a sequential manner until all classes in a library have
been tested). The critical, cut-off class membership limits (S0)
constitute the thresholds for whether to assign objects as within or
outside the two classes in question. The statistical significance level
to be used in the classifications must be stated a priori, otherwise the
assessment will be subjective and therefore biased. The higher the
significance level, the more strictly the new objects will be judged
with respect to "true" membership. This means that only "certain"
cases will be recognised as belonging to the "nearest class"; the
"doubtful" cases will fall outside. As part of the local model building, a
particular S0 limit may occasionally be changed by the data analyst by
changing the significance level, but importantly only for exploratory
purposes.

Figure 10.13: Typical classification situations when using the method of SIMCA.

The Coomans’ plot shows the object-to-model distances of both the
calibration objects and all new objects for two classes at the same
time. This is a very useful feature when evaluating classification
results.

Interpretation

If an object does belong to a specific class it should fall within the
model membership limit, which is designated by S0 (that is, to the left
of the vertical S0 line or below the horizontal S0 line in this plot, see
Figure 10.14). Objects that are within both lines, i.e. near the origin,
are classified as de facto belonging to both models. The Coomans’
plot looks only at the orthogonal distance of an object with respect to
the model. To achieve a more comprehensive classification, the
leverages should also be studied (the leverage to some extent
considers the longitudinal, i.e. along-component, data structure), e.g.
in the Si vs Hi plot.
If an object falls outside the S0 limits, i.e. in the upper-right
quadrant, it belongs to neither of the models included in the current
Coomans’ plot (it may belong to one or more of the other classes in
the library). Having decided on the significance level before carrying
out the classification, the ensuing classification results must be fully
respected: there is no second round w.r.t. classification with the
SIMCA approach. The only, very powerful, influence on the
classification available to the data analyst is the a priori selection of
the significance level to be used; all data analysts must have solid,
very good reasons for deviating from the 5% level on which the
overwhelming majority of multivariate classification applications are
based (undoubtedly influenced by classical statistics). Basic
competence, and the willingness to assume the relevant responsibility
regarding outlier identification and deletion, shall have been gained
through PCA data analysis before competence w.r.t. SIMCA can be
entertained.
Figure 10.14 provides a generic Coomans’ plot and describes its
interpretation.

Figure 10.14: The Coomans’ plot and its interpretation.

To make interpretation easier, the objects are colour coded in
software packages such as The Unscrambler®. In the Coomans’ plot
of Figure 10.15 for the Fisher Iris data, green objects are the new
objects being classified. Blue objects represent the calibration objects
in model one (in this case Setosa), while red objects represent the
calibration objects from model two (in this case Virginica).
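
A minimal sketch of how such a plot can be produced outside The Unscrambler® is given below (Python, using scikit-learn and matplotlib; the function names and the way the S0-based limits are supplied are illustrative assumptions, not the software's actual routines). In practice the two limits would be derived from the respective S0 values and the F-distribution at the chosen significance level, as described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def distance_to_model(X_class, X_query, n_components=2):
    """Object-to-model RMS residual distance (Si) of query objects to one class PCA model."""
    pca = PCA(n_components=n_components).fit(np.asarray(X_class, float))
    Xq = np.asarray(X_query, float)
    E = Xq - pca.inverse_transform(pca.transform(Xq))
    return np.sqrt(np.mean(E ** 2, axis=1))

def coomans_plot(X_class1, X_class2, X_new, limit1, limit2, n_components=2):
    """Sketch of a Coomans' plot: distance to model 1 (abscissa) vs model 2 (ordinate)."""
    d1 = distance_to_model(X_class1, X_new, n_components)
    d2 = distance_to_model(X_class2, X_new, n_components)
    plt.scatter(d1, d2, c="green", label="new objects")
    plt.axvline(limit1, color="blue", ls="--", label="membership limit, model 1")
    plt.axhline(limit2, color="red", ls="--", label="membership limit, model 2")
    plt.xlabel("Distance to model 1 (Si)")
    plt.ylabel("Distance to model 2 (Si)")
    plt.legend()
    plt.show()
```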

The Si vs Hi plot (distance vs leverage)

Purpose
The Si vs Hi plot can also be called a membership plot w.r.t. one
selected model because it encompasses both the transverse and the
longitudinal types of limits used in the classification, i.e. the distance
to the model (the residual standard deviation) and the leverage
(distance to the model centre) for each object. Objects that fall inside
these limits are very likely to belong to the class (model) at the
chosen significance level. Alternative local models can be tested in
succession.
This plot is similar to the Influence plot, used to detect outliers in
PCA and discussed in chapter 6. See below for an important caveat.

Interpretation
The Si vs Hi plot shows the object-to-model distance and the
leverage for each new object. The leverage can be gauged from the
abscissa and the distance from the ordinate. The class limits are
shown as horizontal threshold lines for the object-to-model distance
and similar vertical demarcations for the leverage limit. The limit for
the object-to-model distance again depends on the significance level
chosen. The leverage limit also depends on the number of PCs used
in the local model and is fixed for a given classification. The leverage
value shows the distance from each object to the model centre. In
one sense, the leverage tries to summarise the complete information
contained in the model, i.e. the variation described by the number of
PCs developed. It pays to remember well that both the Influence plot
(PCA) and the Si vs Hi plot (SIMCA) are but an attempt to simplify
(with a two-dimensional plot) the full multivariate relationships of
original dimensions (N, p). There is always a projection loss that must
be taken into account by the experienced multivariate data analyst—
who should always be looking at the data structures modelled in the
higher-order component space, just to be sure that no important
structural elements could be hiding here.
Figure 10.16 provides a generic Si vs Hi plot and its interpretation.

Figure 10.15: Coomans’ plot for Fisher’s classical Iris data.

Objects near the origin, within both limits, are classified as bona
fide members of the model. Objects outside these lines are classified
as not belonging to the model. The further away from the origin of
the plot objects lie, the more different they are from the model
ensemble. Figure 10.17 shows the Si vs Hi plot for the model Setosa
when a test set of all classes was applied to the SIMCA model for the
Fisher Iris data. Objects close to the abscissa have short distances to
the model and in this case represent Setosa; however, some of these
may still be extreme (they may well have large leverages, reflected
along the abscissa). Objects in the upper-right quadrant (e.g. the
Versicolor and Virginica objects in Figure 10.17) do not belong to the
class Setosa, while all Setosa objects lie well within all limits of the
test. Class membership characteristics such as within/outside a
model and "extreme objects" (objects that may in fact not belong) are
all specified by the significance level chosen. Extreme objects should
be checked based on domain-specific knowledge pertaining to the
problem context.
The above constitutes all the information available from SIMCA
classification. It is now up to the data analyst to decide how to deal
with the specific information, i.e. how to view specific objects,
especially objects that lie close to the appropriate threshold limit etc.
Statistically, these objects are either inside or outside because of the
a priori significance level choice; however, there may be internal data
structures first revealed by such comprehensive plots that may
occasionally lead to a more refined local PCA modelling (in many
situations these internal irregularities can also already be observed in
the first in toto PCA projections).

Figure 10.16: Generic Si vs Hi plot and its interpretation.


There is always a certain leeway available to the ardent data
analyst in the data modelling phase, but in the end, it is questionable,
indeed unscientific, to toggle the limits after a classification has been
carried through because of a specific result that “clearly can be
improved if only the significance level was lowered marginally”. This
is very bad form indeed!
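
The quantities behind the Si vs Hi plot (and the Si/S0 vs Hi plot treated next) can be sketched in a few lines of code. The following Python fragment is only illustrative: conventions for the leverage (e.g. whether the 1/n term is included) and for the distance normalisation differ between implementations.

```python
import numpy as np
from sklearn.decomposition import PCA

def si_and_hi(X_class, X_query, n_components=2):
    """Sketch: object-to-model distance (Si) and leverage (Hi) of query objects
    relative to one class PCA model. Conventions vary between implementations."""
    X_class = np.asarray(X_class, float)
    X_query = np.asarray(X_query, float)
    pca = PCA(n_components=n_components).fit(X_class)
    T_cal = pca.transform(X_class)                 # calibration scores
    T_new = pca.transform(X_query)                 # projected query scores

    # Leverage: (scaled) distance to the model centre in the score space
    inv_TtT = np.linalg.inv(T_cal.T @ T_cal)
    Hi = 1.0 / X_class.shape[0] + np.einsum("ij,jk,ik->i", T_new, inv_TtT, T_new)

    # Transverse distance: RMS residual after projection onto the class model
    E = X_query - pca.inverse_transform(T_new)
    Si = np.sqrt(np.mean(E ** 2, axis=1))
    return Si, Hi   # plot Hi on the abscissa, Si (or Si/S0) on the ordinate
```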

The Si/S0 vs Hi plot

Purpose
The Si/S0 vs Hi plot is quite similar to the Si vs Hi plot above, and is
used the same way. However, the ordinate values in the Si vs Hi plot
are absolute distances. In the present Si/S0 vs Hi plot, the object
distance is expressed relative to the representative average distance
for the model (S0), thereby making it easier to relate the transverse
distance measures to one another.

Interpretation
The plot is interpreted in exactly the same way as the Si vs Hi plot, i.e.
objects in the lower-left corner belong to the class in question within
all pre-set limits etc. The Si/S0 vs Hi plot for the Setosa model is
displayed in Figure 10.18.

Model distance plot

Purpose
The model distance plot is used to visualise the distance between
models, i.e. to quantify if they are really different. A large inter-model
distance indicates clearly separated models. The model distance is
found by first fitting objects from two given classes to their own
model as well as to the alternative model, and this double
classification scheme can be carried out on all data class models in
the library in turn. The model distance measures can then be
calculated from pooled residual standard deviations.
Interpretation
A useful rule-of-thumb states that a model distance greater than 3
indicates models which are significantly different. A model distance
close to 1 suggests that the two models are virtually identical (with
respect to the given data). The distance from a model to itself is of
course, by definition, 1.0. In the distance range 1–3 models overlap to
some degree.

Figure 10.17: The Si vs Hi plot (“membership” plot).

The example in Figure 10.19 is the model distance plot for the
model Versicolor. Using the four measured variables, it was found
that two of the species are very similar. This is reflected in the model
distance plot, where the distance from model Versicolor is shown.
The distance to the first model (Setosa) is large (around 50), but the
distance to the last model (Virginica) is smaller, which is why these
two models overlap to some degree. The second bar in Figure 10.19
is the distance from the Versicolor model to itself, i.e. 1.0.
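
A commonly quoted form of this inter-model distance is the square root of the ratio between the cross-fitted and the self-fitted pooled residual variances of the two classes. The sketch below (Python, assuming this particular form, which may differ in detail from the formula used in The Unscrambler®) illustrates the principle, and correctly returns 1.0 for the distance from a model to itself.

```python
import numpy as np
from sklearn.decomposition import PCA

def _mean_sq_residual(model, X):
    """Mean squared residual of X after projection onto a fitted PCA model."""
    E = X - model.inverse_transform(model.transform(X))
    return np.mean(E ** 2)

def model_distance(X_r, X_s, n_components=2):
    """Sketch of a SIMCA-style model distance between class r and class s."""
    X_r, X_s = np.asarray(X_r, float), np.asarray(X_s, float)
    pca_r = PCA(n_components=n_components).fit(X_r)
    pca_s = PCA(n_components=n_components).fit(X_s)
    cross = _mean_sq_residual(pca_s, X_r) + _mean_sq_residual(pca_r, X_s)
    self_ = _mean_sq_residual(pca_r, X_r) + _mean_sq_residual(pca_s, X_s)
    return np.sqrt(cross / self_)    # ~1: near-identical models, >3: well separated
```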

Variable discrimination power plot

Purpose
A measure analogous to the above model distance, but calculated
from one model to all other alternatives can be plotted for the
individual variables. The discrimination power of a variable thus gives
information about its ability to discriminate between any two models.
The discrimination power is calculated from the residuals obtained
from fitting all objects from one model to all the alternative models—
compared to fitting to the specific class model.
If a poor classification is observed, deleting the variables with
both a low discrimination power and a low modelling power may
sometimes help. The rationale for this specific deletion is, of course,
justified by the fact that variables which do not partake in either the
data structure modelling nor in the inter-class discrimination are not
interesting variables—at least not in a classification perspective (they
may be otherwise interesting of course). However, there is a warning
here. If multiple class models make up a library, the Discrimination
Power is a pairwise test. If unimportant variables for the two models
being compared are removed, this has to be done for all models in the
library. N.B. unimportant variables for one pairwise comparison may
be important for another. It is therefore important that the data
analyst ensures that, for the sake of one class comparison, the
integrity of the rest of the library is not compromised: if unimportant
variables for one class are eliminated to improve its classification
ability, this may be detrimental to the other classes in the library (as
they may be important variables for those classes). In such cases, the
Hierarchical Models discussed in chapter 13 may be a better option.
Figure 10.18: The relative distance-leverage plot, Si/S0 vs Hi (same data as in Figure 10.17).

Interpretation
The discrimination power plot shows the discrimination power of
each variable in a given two-model comparison. A value near 1
indicates no discrimination power at all, while a high value, i.e. >3,
indicates good discrimination for a particular variable.
The example in Figure 10.20 again shows a plot from the Iris
species classification. Here the objects belonging to the Iris setosa
class are projected onto model Iris versicolor. The plot shows which
variables are the most important in discriminating between these two
classes. All the variables have a discrimination power larger than 3
and all are therefore useful in the overall classification.
Figure 10.19: Model distances for the three classical Iris classes.
Figure 10.20: Discrimination power plot for the four classical Iris variables.
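
In the same hedged spirit as the model distance sketch above, the per-variable discrimination power can be illustrated as the ratio of cross-fitted to self-fitted residual variance, computed variable by variable (again, the exact formula in commercial software may include further corrections).

```python
import numpy as np
from sklearn.decomposition import PCA

def discrimination_power(X_r, X_s, n_components=2):
    """Sketch of per-variable discrimination power between two class models."""
    X_r, X_s = np.asarray(X_r, float), np.asarray(X_s, float)
    pca_r = PCA(n_components=n_components).fit(X_r)
    pca_s = PCA(n_components=n_components).fit(X_s)

    def per_variable_residual_variance(model, X):
        E = X - model.inverse_transform(model.transform(X))
        return np.mean(E ** 2, axis=0)

    cross = (per_variable_residual_variance(pca_s, X_r)
             + per_variable_residual_variance(pca_r, X_s))
    self_ = (per_variable_residual_variance(pca_r, X_r)
             + per_variable_residual_variance(pca_s, X_s))
    return np.sqrt(cross / self_)   # ~1: no discrimination power, >3: good discrimination
```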

Modelling power plot

Purpose
The modelling power plot is used to quantify the relevance of a
particular variable in the modelling of the individual local models. The
modelling power provides information on how much of the variable’s
variance is used to describe a particular class.
The modelling power can thus be a useful tool for improving an
individual class model. Even with careful variable selection, some
variables may still contain little or no information about the specific
class properties. Thus, these variables may have a different variation
pattern from the others, and consequently, they may cause the model
to deteriorate. Different variables may show different modelling power
in different models, however, so one must again keep a strict
perspective with respect to the overall classification objective(s) when
dealing with multi-class problems.
Variables with large modelling powers have a large influence on
the model. If the modelling power is low, i.e. below 0.3, the variable
may make the model worse and therefore be a candidate for deletion,
particularly if its discrimination power is also low.

Interpretation
The modelling power is always between 0 and 1. A rule-of-thumb is
that variables with a value equal to or lower than 0.3 are less
important.
In Figure 10.21 the modelling powers for the four Iris variables are
shown w.r.t. the Setosa model. All variables are important for
modelling the class Setosa. It is not a tradition in chemometrics to try
to polish every variable set to the limit; indeed, very often variables
with low modelling power can be kept without loss of information, and
this is where subject matter expertise is paramount.
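
A minimal sketch of one common way of computing the modelling power, namely one minus the ratio of the residual to the total standard deviation of each variable within the class model, is given below; the exact convention used in The Unscrambler® may differ slightly.

```python
import numpy as np
from sklearn.decomposition import PCA

def modelling_power(X_class, n_components=2):
    """Sketch of per-variable modelling power for one class PCA model.
    Values near 1: the variable is well described by the model;
    values at or below ~0.3: the variable contributes little."""
    X = np.asarray(X_class, dtype=float)
    pca = PCA(n_components=n_components).fit(X)
    E = X - pca.inverse_transform(pca.transform(X))
    s_resid = E.std(axis=0, ddof=1)
    s_total = X.std(axis=0, ddof=1)
    return 1.0 - s_resid / s_total
```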

10.5 Partial least squares discriminant analysis (PLS-DA)

Another type of multivariate classification model, known as Partial
Least Squares Discriminant Analysis (PLS-DA), was initially proposed
by Svante Wold [9]. This approach is based on Partial Least Squares
Regression by using so-called “Dummy variables” in the Y-space.
Dummy variables are used to include in the numerical data modelling
the fact that training set objects all have a known class assignment.
This is very powerful information that otherwise is not used in the
training stage. In cases where objects within a class span a relatively
broad range of variation, the method of SIMCA may not always
perform well, as SIMCA, in essence, is based on some marked form
of class homogeneity (Soft Independent Modelling of Class
Analogy). Objects falling far from the centre of a local model may
erroneously be rejected as not belonging to their own class. In such
cases of non-homogeneous classes, one may instead try to approach
classification by taking advantage of the discriminability between
classes. The PLS-DA approach consists of building a global
classification model of all classes, where discrimination is achieved
based on inherent class-wise differences. This method is known as
Discriminant PLSR (PLSR-D, or PLS Discriminant Analysis, PLS-D, or
PLS-DA).

10.5.1 Multivariate classification using class differences, PLS-DA

The philosophy behind PLS-DA is to determine latent variables that
will describe the different classes of objects in relation to their class-
wise differences, but with an ingenious twist: the classes are
modelled jointly by a PLSR model in which the class membership
information is expressed via dummy variables. To do so, a PLS
regression model is calibrated on the full set of class-assigned
training objects. In this model, the set of X-variables is known (or
assumed) to contain enough relevant information for class separation,
while the set of Y-variables are simple dummy variables, typically with
coding 0/1 or –1/1 for each object, signifying hard class membership
(refer to Table 10.2 for an example). For the prevalent two-class
cases, the binary coding –1/+1 appears logical, but PLS-DA can in
fact also deal with any number of classes, simply based on a set of
dedicated dummy variables, each representing a single class. A
recommendation is always to code the variables –1/1 for practical
reasons: even though the coding has no impact on the model (auto-
scaling will take care of this slight asymmetry in the assignment
metric), results interpretation is visually much easier when the
demarcation for predicted class membership is a line through zero,
rather than a line through 0.5.
Figure 10.21: The modelling power plot of Fisher’s Iris data for the Setosa class.

When calibrating a PLS-DA model, the fact that all dummy
variables are appended in the Y-space ensures that all the
information which is relevant for the empirical discrimination training
is ready at hand: an object belongs to class A (1), and at the same
time does not belong to class B or C (–1), etc. Unlike with the SIMCA
approach, a natural consequence of including all classes in one
model is that PLS-DA is well suited to cases where categorical or
ordinal variables may play as important a role for class membership
as do continuous (ratio) variables. This is a very powerful multivariate
classification augmentation. So-called PLS1-DA is used a lot for the
general two-class case, which occurs frequently in multivariate
practice.
PLS-DA operates in a closed modus. This means that the method
only applies when all possible classes are known and defined. Any
new sample is bound to be predicted as one of these classes,
provided these are well separated in the training phase. With PLS-DA
there is no possibility to reject a sample from all classes. For
example, if four food oil types were modelled based on spectroscopic
data, any new oil sample spectrum will be predicted as one of these
four classes, ignoring the fact that a new sample may be, for
example, the odd safflower oil (this problem is detailed later in this
chapter in section 10.8). When a new class is to be added to an
existing model, a whole new PLSR model needs to be calibrated that
includes the new class in addition to the former ones.
Note, however, that an object which is strongly outlying may in
practice be interpreted as one which is unknown to the model, and
therefore as not belonging to any of the defined categories. It is
possible to take full advantage of the standard way one works with
scores and loadings in the X-space during the training calibration of
the special PLS-DA model.
The PLSR model will, as usual, make use only of the variations in
X that are relevant to Y, i.e. in this case the variations in X that best
explain the inherent discriminations between the data structure
classes. In such a model the X-variables that show random variation,
or systematic variation that is not related to class discrimination, will
be modelled in later PLS factors, while the first PLS-factors will
enhance the discriminating variations in X. For this reason, PLS-DA
performs well in situations where some of the X-variables are not
specific or relevant. In the chemometric literature there are many
examples of both simple classifications according to the direct PLS-
DA approach, as well as many more sophisticated variations on this
theme.

Table 10.2: Organisation of a data set for classification in three classes A, B and C with PLS-DA.

                 X-variables                              Y-variables
     Magnesium   Phenols   Flavanoids   …    Class A   Class B   Class C   Class^a
 1      127        2.8        3.06             1         –1        –1        A
 2      122        1.51       1.25            –1         –1         1        C
 3      100        2.65       2.76             1         –1        –1        A
 4      104        1.30       1.22            –1         –1         1        C
 5       88        1.98       0.57            –1          1        –1        B
 6      101        2.05       1.09            –1          1        –1        B
 7      101        2.8        3.24             1         –1        –1        A
 …       …          …          …         …     …          …         …        …

^a The samples in the data set may or may not be sorted by class; this
is without relevance for the model, but will of course help make the
overview of more complex cases clearer. N.B. The largest number of classes
treated with PLS-DA by any of the authors of the present book is 7.

There are even examples of very particular PLS-DA modelling
studies in which the dummy Y-variable guidance of the X-space
decomposition is the actual objective of the data analysis!
The above "easy-to-understand" phenomenological introduction
to PLS-DA should be complemented by the full scholarly treatment of
all methods and technical aspects of PLS-DA by Brereton and Lloyd
[10] as soon as the beginning data analyst starts to use this approach
in earnest; this article is greatly recommended.

Prediction of class memberships


Classification of new samples by PLS-DA is done by directly applying
the PLS prediction model to the new X-variable measurements. The
PLSR model is used to predict values for the dummy variables
representing the classes. The predicted value is a continuous variable,
and the sample is attributed to the closest group. Thus, in the case of
dummy variables coded with –1 and 1, samples predicted above 0 for
a specific class may indicate membership of this class, while values
predicted under 0 indicate the alternative class (or non-membership).
The closer to 1 (resp. –1), the safer the "conclusion", but be aware
that there is no fundamental model or statistical theory behind the
empirical application of PLS-DA, ibid.
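
The mechanics are easy to sketch with any PLS implementation. The fragment below (Python, scikit-learn; an illustration only, not the PLS-DA implementation of The Unscrambler®) codes class membership as –1/+1 dummy columns, fits an ordinary PLS2 regression and assigns each new object to the class with the highest predicted dummy value.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def fit_plsda(X, labels, n_components=2):
    """Sketch of PLS-DA training: one -1/+1 dummy Y-column per class, ordinary PLS2 fit."""
    classes = sorted(set(labels))
    Y = np.array([[1.0 if lab == c else -1.0 for c in classes] for lab in labels])
    pls = PLSRegression(n_components=n_components).fit(np.asarray(X, float), Y)
    return pls, classes

def predict_plsda(pls, classes, X_new):
    """Assign each new object to the class whose dummy variable is predicted highest.
    NOTE the closed-world behaviour: an unknown class can never be rejected here."""
    Y_hat = pls.predict(np.asarray(X_new, float))
    return [classes[i] for i in np.argmax(Y_hat, axis=1)], Y_hat
```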
In an extension of the methodology, statistical boundaries may be
placed around the (–1, +1) target values for class membership
rejection. The assumption here is that since the PLSR model should
obey the principles of least squares fitting, unique classes should
distribute themselves normally around the target values. This is
shown in Figure 10.22.

Figure 10.22: Defining statistical limits around target values in PLS-DA for class rejection.

This is a highly uncertain assumption, however. Upon just a few
minutes’ reflection it is clear that everything regarding the
performance of PLS-DA prediction is decided upon by the training
stage calibration: how representative are the training set objects (of
the future classification situation)? Outliers are even more influential
than usual in the context of PLS-DA. How was the PLS-DA model
validated? If ever there was a situation in which the data analyst
suffers a risk of serious self-deception, it is the case of cross-
validating a PLS-DA model—only scrupulous application of test set
validation will guard against these dangers!
Recently, the PLS-DA approach was subjected to a rigorous
treatment from within the chemometrics community in a highly
recommended scholarly paper, Brereton and Lloyd [10]. Here one can
find all the positive potentialities well described, as well as all the
necessary sceptical warnings.

PLS-DA key points

The following points should be taken into consideration when
developing a PLS-DA model.
Definition of the training data set is critical; it must be
representative of the future classification situation. There is no
magic, rather GIGO!
Class membership is determined by direct prediction from a
conventional PLSR model.
PLS-DA operates in a closed universe: PLS-DA assumes that any
new sample has to belong to one or another of the known classes.
This is the implementation used by The Unscrambler®.
PLS-DA models class differences. It can perform better than
SIMCA for non-homogeneous classes; the same sometimes applies
to asymmetrical classification problems.
PLS-DA is able to handle categorical variable values easily.
Adding new classes requires re-calibrating the PLS-DA model.
PLS-DA is critically dependent upon informed validation practices
(test set validation rather than cross-validation).

10.6 Linear Discriminant Analysis (LDA)


In general terms, all Discriminant Analysis (DA) methods seek to
describe mathematically, but often also graphically, the discriminating
features that separate objects into different classes. Thus, DA finds
“discriminant axes”, linear combinations of the initial k variables that
optimally separate two or more data classes.
Figure 10.23 shows the classical linear DA, known as Linear
Discriminant Analysis (LDA), sometimes known as Fisher's Linear
Discriminant Analysis. LDA finds the line that best separates two
training classes ("dots" vs "triangles" in Figure 10.23). As can be seen
from this schematic illustration, the border between the two classes is
"fuzzy", as reflected by the two individual class distribution curves,
which partially overlap.
LDA is not limited to the straight-line separation case only; the
following represent a small list of potential separators commonly used
in both simple and complex problems:
1) Linear separators.
2) Quadratic separators.
3) Mahalanobis distance.
DA is partly an unsupervised EDA technique, but is also often
used for supervised pattern recognition in a subsequent step. With
DA, knowledge of the class assignment of the objects in the training
data set is required in order to start dividing the objects into two (or
more) classes. However, the overwhelming majority of LDA
applications is concerned with separation in the two-class problem.
LDA may be viewed as a projection onto A = 1 dimension. If the
objects are projected onto the "LDA-axis", this axis can be viewed
as a "component vector" separating the two classes, i.e. a 1-
dimensional representation. This single discriminant axis may be
extended to higher dimensions in more advanced versions of DA (e.g.
quadratic DA), but a great many of the classic methods stay with this
very low dimension of the subspace employed. For situations with
two classes this makes good sense. There are, however, many real-
world data sets where this simple, low-dimensional picture is not
enough; such systems are of sufficiently high complexity that a 1-
dimensional approximation is a gross under-representation.
The basic linear DA assumes that there is a common covariance
structure for both classes. Quadratic discriminant analysis may
perform better in situations where the different groups being
classified have their main variability in different directions, but only
when the training sets used are large and indeed representative. The
Mahalanobis distance (chapter 6, section 6.6.3) option for LDA is a
way of measuring the distance of an observation to the centres of the
groups, and uses ellipses to define the distances.

Figure 10.23: Linear Discriminant Analysis (LDA).

A number of major limitations of LDA must be pointed out, which
are important to take into consideration as they relate to collinearity;
LDA suffers from the same problems as does MLR (refer to chapter
7). A remedy for this is to apply PCA as a pre-step in the analysis,
similar to what is done for PCR (refer to chapter 7); a minimal sketch
of this two-step PCA-LDA approach is given after the list below. As is
known, scores from PCA can be regarded as "super-variables", a
weighted sum of the individual variables for each component. One
then achieves three advantages:
1) The score vectors are orthogonal; thus, inversion of the covariance
matrix poses no problem.
2) With a potentially much lower number of variables as input to LDA,
the requirement that each class has more samples than variables is
much more easily fulfilled.
3) As the noise-part of X is omitted when retaining only the relevant
components, this is a better starting point for the LDA algorithm. In
this case, it may not be so critical if some more components are
included—but as always in multivariate data analysis, proper
validation rules (chapter 8).
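
As a minimal sketch of this two-step PCA-LDA idea (Python, scikit-learn; the choice of two components is an assumption for illustration and must in practice be settled by proper validation):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Compress the (possibly highly collinear) X-variables into a few orthogonal PCA scores,
# then run classical LDA on the scores instead of on the raw variables.
pca_lda = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LinearDiscriminantAnalysis())

# Typical usage (X_train, y_train, X_test are placeholders for the analyst's own data):
# pca_lda.fit(X_train, y_train)
# y_pred = pca_lda.predict(X_test)
```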
Finally, the LDA algorithm is non-specific in the case where a
sample cannot be fitted uniquely to a particular class. Figure 10.24
shows the situation where a quadratic separator is used to separate
the two-class problem.
With reference to Figure 10.24, the data in this case are better
separated using a quadratic separator than a linear separator (for
instance). The distance d1 in this case is the distance from each class
centre to the line of separation; in LDA, this distance is equal for the
two classes along the separator. The new object A clearly does not fit
within the statistical bounds of the two classes described. This
exposes a major pitfall with LDA: the algorithm calculates the
distances d2 and d3 and the new object is assigned to the class with
the smallest distance to its class centre, regardless of how poor the
fit is. The same pitfall applies to the Support Vector Machine
Classification (SVMC) method described in the next section.
Naes et al. [11] is a relatively recent authoritative publication on all
matters regarding classification; it is highly recommended when the
need arises to proceed with more detailed descriptions of these
methods w.r.t. the very brief introduction here.
Figure 10.24: Definition of the Quadratic Separator and handling of a sample that does not fit
a particular class.

10.7 Support vector machine classification


The theory behind Support Vector Machine Classification (SVMC) is
outside the scope of this introductory textbook and the interested
reader is referred to the literature [12] for a more in-depth discussion
of the optimisation algorithms (known as the Kuhn–Tucker method).
SVMC is a pattern recognition method that is used widely in data
mining applications, and provides a means of supervised
classification, in a similar manner to SIMCA and LDA.
SVM was originally developed for the case of linearly separable
data, but is applicable to non-linear data with the use of kernel
functions. Kernel functions implicitly map the original data space into
a new representation (known as the Feature Space) in which linear
separation is more easily achieved. Figure 10.25 provides a graphical
example of how a kernel function works.
In The Unscrambler®, the Kernel types available for separation of
classes can be chosen from the following four options:
Linear: K(xi, xj) = xiT xj
Polynomial: K(xi, xj) = (γ xiT xj + coefficient)^degree
RBF: K(xi, xj) = exp(–γ ||xi – xj||^2)
Sigmoid: K(xi, xj) = tanh(γ xiT xj + coefficient)

Figure 10.25: Using a Kernel Function to map a high-dimensional, complex space into a
simpler, linearly separable space.

As is the case with any multivariate data analysis, start simple and
then build up into more complex solutions only when/if the simpler
methods do not show promise. To aid in the selection of the
parameters used for fine tuning the class boundaries in SVM, many
software applications provide a grid search approach in which
various combinations of parameters are assessed to show the quality
of the final discrimination solution. When evaluating kernel
parameters, always use the smallest numbers possible to avoid
overfitting. As the linear function is not always able to model complex
separation problems, data are sometimes mapped into a new feature
space and a dual representation is used with the data objects
represented by their dot product.
After mapping the original data space into the kernel-induced
feature space, the SVMC algorithm will search for the objects that lie
on the borderline between the classes, i.e. the objects that are ideal
for separating the classes; these objects are named Support Vectors.
The support vectors can be thought of as the reduced subset of the
training data, identified via the kernel, that defines the class boundary.
Figure 10.26 shows this principle where objects marked with + for
the two classes are used to generate the rule for classifying new
objects.
Unlike LDA and SIMCA, where the statistical boundaries are
defined by the standard deviation (or manifestations thereof) of the
samples from the class centres or from the local PCA models, SVMC
does not take into account unequal class variances (which can result
in class overlap and potential ambiguity between classes); it only
takes into account the important objects close to and along the class
separation boundaries, the support vectors.
To assess the quality of the separation, the SVMC also
establishes margins around the line of best separation (these are the
lines labelled H in Figure 10.26). If the margins are clear of any
objects, then the separator is said to be of “high quality” and is
suitable to discriminate between classes. When objects still exist in
the margins, then there is the likelihood of ambiguities when the
SVMC model is applied to new data.
SVM is a classification method based on statistical learning
wherein a function that describes a hyperplane for optimal separation
of classes is determined. It has advantages over other classification
methods such as neural networks, as its outputs are more
transparent and it has less tendency to overfit when compared to
other non-linear classification methodologies.
For these more complex modelling algorithms, which are based
on more specialised assumptions, model validation becomes
progressively more critical as the success factor for avoiding the
dangerous likelihood of overfitting.
In the case of SVMC, a few objects are singled out for the crucial
definition of margins—these better be beyond reproach regarding
representativity! However, SVMs are particularly effective for the
modelling of non-linear data, and they are also relatively insensitive to
variation in the model parameters. SVM uses an iterative training
algorithm to achieve separation of different classes. In the case of
similar classification rates for the training data, the simplest model,
i.e. the one with the most nearly linear kernel and parameters and the
minimum number of support vectors, should be chosen.

Figure 10.26: Separation of data using Support Vectors and definition of margins.

Two SVM classification types are available in The Unscrambler®,
which are based on different means of minimising the error function
of the classification.
c-SVC: also known as Classification SVM Type 1.
nu-SVC: also known as Classification SVM Type 2.
In the c-SVM classification, a capacity factor, C, can be defined.
The value of C should be chosen based on knowledge of the noise in
the data being modelled. Its specific value can be optimised through
cross-validation procedures.* When using nu-SVM classification, the
nu value must be defined (default value = 0.5). nu serves as the upper
bound of the fraction of errors and is the lower bound for the fraction
of support vectors. Increasing nu will allow more errors, while
increasing the margin of class separation.
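
A minimal sketch of these two options, and of the cross-validated grid search over C mentioned above, is given below (Python, scikit-learn; the parameter grid is an arbitrary illustration, not a recommendation).

```python
from sklearn.svm import SVC, NuSVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# c-SVC with a linear kernel; the capacity factor C is tuned by a cross-validated grid search.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
grid = GridSearchCV(pipe, param_grid={"svc__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
# grid.fit(X_train, y_train); print(grid.best_params_, grid.best_score_)

# nu-SVC alternative: nu (default 0.5) bounds the fraction of margin errors from above
# and the fraction of support vectors from below.
nu_clf = make_pipeline(StandardScaler(), NuSVC(kernel="linear", nu=0.5))
```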
Unlike LDA and SIMCA, SVMC is not influenced by
inhomogeneous class variance (as defined above) as it uses samples
close to the separation boundary to define class boundaries and
margins. Like LDA, SVMC is a single-model process, but it suffers
from the disadvantage that if an object cannot be fitted into a unique
class, it is fitted to the closest class. It is, therefore, most useful for
the two-class separation problem when a more powerful method,
such as SIMCA, has already established that the sample belongs to
one of the two classes.
The above disadvantages of LDA and SVMC for developing global
classification libraries can result in potentially fatal outcomes if these
powerful methods are used incorrectly by the undiscerning data
analyst. Consider, for example, a spectroscopic method of material
identification used in the pharmaceutical industry where there are two
materials with similar, sometimes indistinguishable spectra (or even
an unknown material with a structure similar to a known material in
the library). Suppose that, of the two materials, one is used as a pain
reliever and one is used to treat diabetes. If the pain relief material is
mistaken for the diabetes material and a product was accidentally
made (although this would usually be picked up in final end testing of
the product), then the situation could prove fatal to the end user.
Although this is a constructed case and extremely rare for an Active
Pharmaceutical Ingredient (API), the misclassification or non-
identification of an excipient could result in a similar, potentially fatal
situation. This case, admittedly contrived to the limit, serves well,
however, to make the novice data analyst aware of the (many)
potential pitfalls surrounding a too "automated" multivariate
classification approach. It is imperative always, with no exceptions, to
understand fully all of the assumptions behind the method itself
(however complex) and the ever-important validation procedure
needed to guarantee that a specific classifier is indeed working
according to its intended use, i.e. that it is fit for its purpose.
To further demonstrate the outputs and results generated by each
of the supervised methods described in this chapter so far, section
10.9 applies each method to an example used to authenticate
commonly used vegetable oils.

10.8 Advantages of SIMCA over traditional methods and new methods

There are several powerful advantages of the SIMCA approach
compared to methods like, for example, Linear Discriminant Analysis
(LDA), Support Vector Machine Classification (SVMC) and
unsupervised cluster analysis.
First, SIMCA is not restricted to situations in which the number of
objects is significantly larger than the number of variables, as is
invariably required with classical statistical techniques (this has to be
so in order for the pertinent model parameters to be estimated with
statistical certainty). This restriction does not apply to bilinear
methods such as PCA, which are stable with respect to any
significant imbalance in the ratio of objects/variables, be it (very)
many objects with respect to variables or vice versa. Because of the
score-loading outer product nature of bilinear models, the entire data
structure in a particular data matrix will be modelled well even in the
case where one dimension of the data matrix is (very) much smaller
than the other, within reasonable limits of course.
SIMCA is a one-class-membership methodology, but it is in fact
equally applicable to cases where an object belongs to more than
one class. For instance, a food-component compound may be both
salty and sweet by sensory evaluation, and should thus, by rights,
simultaneously fall into both of these two classes. It is relatively easy
to use SIMCA also for this multi-class membership purpose, either by
itself or in connection with using dummy classification Y-variables, as
in the PLS-DA approach. This last issue is treated in full detail in
section 10.5; see also Vong et al. [13].
Another advantage is that all the pertinent results can be
displayed graphically with an exceptional insight regarding the
specific data structures behind the modelled patterns. This was
demonstrated by the many diagnostics tools associated with both
PCA and the SIMCA method itself.

10.9 Application of supervised classification methods to authentication of vegetable oils using FTIR

In this data set, a number of representative samples of each of four
classes of vegetable oil, of the varieties Corn, Olive, Safflower and
Corn Margarine, were obtained and their spectra collected using the
method of Fourier Transform Infrared (FTIR) spectroscopy. The
purpose of the study was to determine whether FTIR could be used
to distinguish between the oil classes and, if successful, to classify a
test set of samples from the four classes measured and some other
oil types not used for model development.

10.9.1 Data visualisation and pre-processing


As always, a good data analyst will plot the data to get a feel for what
is being analysed. Spectra are best plotted as line plots of
absorbance vs wavelengths (wavenumbers) to look for consistency in
sample trends. Figure 10.27 shows the raw spectra for the four
classes of oil scanned for the training data set.
Figure 10.27: Raw FTIR spectra of four different classes of vegetable oil.

The data in Figure 10.27 shows that a consistent spectral profile is
evident from the samples; only a small correction for baseline offset,
and possibly scatter, may be required. The Standard Normal Variate
(SNV) preprocessing method is an ideal candidate as it keeps the
spectral profile intact and reduces both additive and multiplicative
effects (refer to chapter 5 for more details on the SNV algorithm).
The data analyst knew from past experience that the entire
spectrum is not required for extracting all of the information available
for the purposes of classification. It was, therefore, decided to use a
reduced region and apply the SNV preprocessing method. This new
data is shown in Figure 10.28.
The preprocessed spectra now look more consistent in their
profiles and distinct chemical features are now visible, particularly
around 960 cm–1 for Corn Margarine. This data is now ready for EDA
and the most appropriate method in this case is PCA.
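
For completeness, the SNV correction itself is a simple row-wise operation; the sketch below (Python/NumPy) shows the transform as commonly defined, with the spectral-region selection left as an unspecified index range since the exact wavenumber window is not given numerically in the text.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: centre each spectrum (row) to zero mean and scale to unit
    standard deviation, reducing additive (baseline) and multiplicative (scatter) effects."""
    X = np.asarray(spectra, dtype=float)
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, ddof=1, keepdims=True)

# Illustrative use with a reduced spectral region (i_start, i_end are placeholders):
# X_reduced = snv(X_raw[:, i_start:i_end])
```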

10.9.2 Exploratory data analysis

Using the method of PCA on the data shown in Figure 10.28 resulted
in the PCA overview of Figure 10.29.
The main features of the PCA overview are listed as follows:
1) Two PCs describe 99% of the total data variability.
2) The score plot shows distinct clusters of the oil varieties.
3) The loading plots for PCs 1 and 2 show some similar and some
distinct chemical features along both PCs.
4) The influence plot indicates that the corn margarine samples are
extreme. This is because they are the most different and least
represented samples in the training set.
This PCA now justifies the conclusion that the FTIR method is
capable of distinguishing oil varieties of the four classes under
investigation. The process of Supervised Classification can now begin.
Figure 10.28: SNV transformed FTIR data of vegetable oils using a reduced spectral region.

10.9.3 Developing a SIMCA library and application to a test set

For each class present in the PCA score plot, an individual PCA
model was developed and validated for use as a classification rule.
These models were compiled into a library for use as a SIMCA
classification routine in The Unscrambler®. The test set comprised
samples from the four classes investigated and some negative
challenge samples of the following oil types:
1) Walnut Oil,
2) Peanut Oil,
3) Sesame Oil,
4) Soybean Oil.
Negative challenge samples are always suggested when testing a
classification method as they assess the library's ability to
discriminate against classes not present in the library. If these
samples positively identify as a known class in the library, then the
library is not sensitive enough to distinguish between some classes.
The SIMCA library was applied with the significance level set to 5%.
The classification table for this analysis is shown in Figure 10.30.
The main features to note from the Classification Table are the
following:
1) In no case was there more than one asterisk per row. This
indicates that there were no ambiguous classification results
obtained.
2) In each case, the sample in the test set was correctly classified as
the class it belonged to.
3) There are, however, some samples that did not classify into their
own class. This is an indication that the training model does not
have all of the variability it needs to be robust. This will be
confirmed using model diagnostics.
4) Of the samples that had no library available, Soybean Oil classified
as Corn Oil. This result indicates to the developer that in the future,
when developing a Soybean Oil classification model, there could
be ambiguities with Corn Oil. The topic of Hierarchical Models is
discussed in chapter 13 section 13.6.
Figure 10.29: PCA overview of SNV pre-processed vegetable oil FTIR spectra.

10.9.4 SIMCA model diagnostics

The first model diagnostics plot to view is the Coomans’ plot. This is
shown in Figure 10.31 for the closest groups from the original PCA,
being Safflower and Corn oil (refer to the scores plot in Figure 10.29).
The Coomans’ plot reveals the following information based on the
pairwise comparison of Corn and Safflower oils:
1) Since there are no Corn or Safflower Oil samples in the region
bounded by both model limits and the origin, there is no indication
that these oils cannot be separated from each other.
2) In the case of Corn Oil, if a Soybean Oil model were to be added
to the library, some ambiguity issues may result, as already
detected in the Classification Table.
3) All other Oil types are distinctly different from Corn and Safflower
Oils as evidenced by them lying outside the model limit bounds, i.e.
in the top-right quadrant.
Similar results to these were found for all pairwise oil class
comparisons. Figure 10.32 shows the Si vs Hi plot for the oil type
Olive.
Figure 10.32 focuses on the Olive Oil samples only and shows
that the nearest class to it is Peanut Oil. It is now apparent why one
of the samples did not classify as Olive Oil in the Classification Table
of Figure 10.30: one sample has a greater leverage than the model
limit at 5% significance. This suggests that more samples are
required to make the model robust. If this sample lay outside both
model limits, it could be justified as being an outlier and would be
investigated in a root cause analysis.
Figure 10.30: SIMCA Classification Table for Test Set of vegetable oils scanned by FTIR
spectroscopy.

It can now be justified that the SIMCA model is capable of
classifying new oil samples into the classes within the library and
(with the exception of soybean oil) is able to reject the negative test
set applied to it. At this stage, justification is provided to continue
with the development, in an attempt to make the library more robust
and to add more class models as new oil types become available
(provided those samples have been well characterised by a reference
method prior to library development).

10.9.5 Developing a PLS-DA method and application to a test set

Based on the success of the SIMCA method, it was decided to see if
any improvements could be made using PLS-DA. As PLS-DA uses a
dummy variable to model class differences, it is expected that the
PLSR decomposition of the data may yield loading weights that are
better able to discriminate the oil samples. A [0, 1] labelling
convention was applied to each oil class in the training data set and
the PLS-2 algorithm applied as the method of analysis. The PLSR
overview is shown in Figure 10.33 for the training data.
The main observations made from the PLSR overview are as
follows:
1) It requires five PLS Factors to model the data. Since there are four
classes being modelled, this number of factors is not too
unreasonable, however, it may be indicating a situation of over-
fitting.
2) The Factor 1 vs Factor 2 scores show complete data separation in
the first two factors.
3) The Correlation X–Y Loadings show that Olive, Corn Margarine and
Safflower Oils are modelled in the first two PLSR Factors. It was
shown that Factor 3 models Corn Oil (data not shown).
4) Although Olive and Corn Margarine can be modelled well,
Safflower and Corn Oil were not modelled as well. This may be
because there is not enough variability in the data to effectively
model and that two PLS Factors (i.e. Factors 4 and 5) are modelling
noise to achieve better classification. This can be shown by
application of the three-Factor and the five-Factor models to the
Test Set.
To test the suitability of the PLS-DA model, the three-Factor and
the five-Factor models were applied to the Test Set. The classification
results are shown in Table 10.3 for comparison.
Table 10.3 reveals the following regarding the PLS-DA models:
1) The five-Factor model is a better class predictor than the three-
Factor model.
2) All classes in the library are classified into their own unique classes.
3) The negative test samples were classified primarily into the class
Corn Oil. This was the class with the poorest fit to the model.
Figure 10.31: Coomans’ plot of FTIR Vegetable Oil Classification Data.

The PLS-DA method was not found to be superior to SIMCA in
this case and, due to its inability to reject unknown classes, it was
decided at this stage that the SIMCA model could not be replaced by
PLS-DA.

10.9.6 Developing a PCA-LDA method and application to a test set

The original data matrix of spectra contains 147 wavenumber
variables. Even at this small number of variables, there were still not
enough samples to develop a classical LDA model. This is not a
problem, as the method of PCA-LDA can be used: a first PCA model
is developed on all samples in the matrix to extract the relevant score
vectors, which results in a system with a smaller number of variables
than samples.
Unlike SIMCA, LDA does not require a class model to be made for
each sample type. The optimal rank (i.e. the number of PCs to set for
the analysis) can be determined by first performing a PCA, as was
done earlier in this section, where two PCs were found to be optimal.
This is the number of PCs to use in the PCA-LDA.
Figure 10.34 provides the classification plot for the LDA, and
embedded into this plot is the so-called Confusion Matrix. The
Confusion Matrix displays the correct classifications along the main
diagonal of the matrix, while any misclassifications or ambiguous
results involving a particular class are shown in the off-diagonal cells.
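
A confusion matrix is straightforward to reproduce with generic tools; the sketch below (Python, scikit-learn) uses made-up labels purely to show the layout, rows being the true classes and columns the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (not the results of the oil study):
y_true = ["Corn", "Corn", "Olive", "Safflower", "Corn Marg"]
y_pred = ["Corn", "Olive", "Olive", "Safflower", "Corn Marg"]
labels = ["Corn", "Corn Marg", "Olive", "Safflower"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)   # correct classifications on the diagonal, misclassifications off-diagonal
```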
The results in Figure 10.34 show that the PCA-LDA was capable
(as expected from the initial PCA) of separating the data completely.
In this case, a linear separator was used, which indicates that the
separation is simple. As with PLS-DA, the issue with LDA in general is
that when a sample of a type different from those modelled is
classified, the LDA algorithm will fit it to the nearest class. This can be
seen from the data in Table 10.4.
Figure 10.32: Si vs Hi plot for Olive Oil FTIR spectra.
Figure 10.33: PLS overview of the Vegetable Oil FTIR training set.

Table 10.3: Comparison of PLS-DA models for vegetable oils applied to a test set for three- and five-Factor PLSR models.

                Five-factor PLS-DA model            Three-factor PLS-DA model
Sample ID    Corn  Corn Marg  Olive  Safflower    Corn  Corn Marg  Olive  Safflower
tCornA1       1.0     0.0      0.0      0.0        0.7     0.0      0.1      0.2
tCornA2       0.9     0.0      0.1      0.0        0.8     0.0      0.1      0.1
tCornA3       0.9     0.0      0.0      0.1        0.8     0.0      0.0      0.2
tCornB1       0.7     0.0      0.1      0.2        0.7     0.0      0.1      0.2
tCornB2       0.9     0.0      0.0      0.1        0.7     0.0      0.1      0.2
tCornB3       0.8     0.0      0.0      0.1        0.9     0.0      0.0      0.1
tOliveA1      0.0     0.0      1.0      0.0        0.1     0.0      0.9     –0.1
tOliveA2      0.0     0.0      1.0      0.0        0.2     0.0      0.9     –0.1
tOliveA3      0.1     0.0      0.9     –0.1        0.3     0.0      0.9     –0.2
tOliveB1      0.0     0.0      1.0      0.0       –0.2    –0.1      1.1      0.1
tOliveB2      0.0     0.0      1.0      0.0       –0.3     0.0      1.1      0.2
tSaffA4       0.0     0.0      0.0      1.0        0.3     0.0     –0.1      0.7
tSaffA5       0.0     0.0      0.0      1.0        0.2     0.0     –0.1      0.8
tSaffB1       0.1     0.0     –0.1      0.9        0.1     0.0      0.0      1.0
tSaffB2       0.0    –0.1      0.0      1.0       –0.4    –0.1      0.2      1.3
tSaffB3       0.1     0.0     –0.1      0.9        0.0     0.0      0.0      1.0
tCMarg3       0.0     1.0      0.0      0.0       –0.1     1.0      0.0      0.0
tCMarg4       0.0     1.0      0.0      0.0        0.0     1.0      0.0      0.0
tWalnut       1.8    –0.1     –0.6     –0.1        1.5    –0.1     –0.5      0.1
tSesame       2.3     0.0     –0.1     –1.3        1.6     0.0      0.2     –0.7
tPeanut       0.5     0.1      0.6     –0.2        0.3     0.1      0.6      0.0
tSoybean      1.0     0.1     –0.1      0.0        0.6     0.1      0.0      0.3

Figure 10.34: PCA-LDA results for Vegetable Oil classification by FTIR with the confusion
matrix for the analysis.

The results in Table 10.4 show that for the positive tests, the PCA-
LDA model works well, but when it is applied to the sample types not
included in the model, it tries to fit these samples to the existing
classes (which is fundamentally inappropriate for a reliable
classification method).

10.9.7 Developing a SVMC method and application to a test set

The development of a SVMC model can be a very time-consuming
and tedious process when the original structure of the data is not
known a priori. The questions to be answered when developing a
SVMC model include:
1) Which SVM model type to use?
2) Which Kernel function to use?
3) What are the optimal parameters for the model?
Fortunately, in the present case, linear models have been used to
separate the vegetable oil classes and this will be the starting point
for this SVMC. The C-type algorithm will be employed for its
simplicity. All that is left is to optimise the model parameters. This
was performed using a Grid Search; Figure 10.35 shows the grid
search output for a linear kernel and indicates that a C-value of 1 will
result in a model with a training accuracy of 100% and a cross-
validated accuracy of 100%.
The results of the classification are presented in the confusion
matrix shown in Table 10.5.
The results in Table 10.5 show that the SVMC is capable of
classifying the vegetable oil types in the training set using a simple
linear Kernel function. The classification of the test set using this
SVMC model is provided in Table 10.6.
The main observations to be made from Table 10.6 are:
1) As with PCA-LDA, all samples that had class models in the library
were correctly classified.
2) The samples that did not have any class models were classified
into the nearest class and in all cases, these resulted in
misclassifications.

10.9.8 Conclusions from the Vegetable Oil classification

This example investigated the use of four multivariate classification
techniques commonly used in industry and research. Overall, the
SIMCA method performed best even though it requires more effort to
develop (this is a classic case: the easiest path is not always the best
path).
Table 10.4: Results of the classification of new vegetable oil samples (positive and negative test) using a PCA-LDA model.

Sample ID   Classified As      Sample ID   Classified As
tCornA1     Corn               tSaffA4     Safflower
tCornA2     Corn               tSaffA5     Safflower
tCornA3     Corn               tSaffB1     Safflower
tCornB1     Corn               tSaffB2     Safflower
tCornB2     Corn               tSaffB3     Safflower
tCornB3     Corn               tCMarg3     Corn Marg
tOliveA1    Olive              tCMarg4     Corn Marg
tOliveA2    Olive              tWalnut     Safflower
tOliveA3    Olive              tSesame     Corn
tOliveB1    Olive              tPeanut     Olive
tOliveB2    Olive              tSoybean    Corn

Figure 10.35: Grid search dialog for SVMC as implemented in The Unscrambler®.
The three other methods used, PLS-DA, PCA-LDA and SVMC, all
make use of single-step training models (with SVMC requiring a
method choice, a kernel and optimisation of model parameters).
Although these three methods resulted in no misclassifications of test
samples from the class models used to develop the library, they
fundamentally failed when challenged with samples not belonging to
the library.

Table 10.5: Confusion Matrix for vegetable oil SVMC FT-IR data.

              Corn   Olive   Safflower   Corn Marg
Corn            9       0        0           0
Olive           0      15        0           0
Safflower       0       0       11           0
Corn Marg       0       0        0           6

In general, SIMCA has the advantage of being able to classify
samples that do not fit any class model into the null class. In chapter
13 the concept of Hierarchical Models is introduced; in situations
where a global SIMCA model results in two-class ambiguities, a
hierarchical model using PLS-DA, PCA-LDA or SVMC may be a
better option to resolve complex ambiguities, as long as the global
SIMCA model shows that the ambiguity is expected, i.e. a known
ambiguity status is given to the samples.

10.10 Chapter summary


The tools of Multivariate Classification are a very powerful addition to
the data analyst’s weaponry. The ability to detect patterns in complex
data sets allows for greater insights and understanding. As human
beings, there is an instinctive desire to sort things into classes, this is
known as taxonomy. Once sorted, the next step is always to find out
what classes are similar to each other and what classes are different.
This is then extended to objects that do not fit any known class. The
last situation may be representative of what is known as an outlier,
i.e. an object that does not fit its intended class due to some
modification or significant difference from the rest of the population—
or it may in fact be a new class that has been identified for the first
time. The latter situation is emblematic of a scientific approach to exploratory data analysis, to exploratory classification and to exploratory regression modelling. In such cases chemometric
methods are paving the way for new scientific taxonomic and
classification insight.

Table 10.6: Results of the classification of new vegetable oil samples (positive and negative test) using a SVMC model.

Sample ID   Classified as   Sample ID   Classified as
tCornA1     Corn            tSaffA4     Safflower
tCornA2     Corn            tSaffA5     Safflower
tCornA3     Corn            tSaffB1     Safflower
tCornB1     Corn            tSaffB2     Safflower
tCornB2     Corn            tSaffB3     Safflower
tCornB3     Corn            tCMarg3     Corn Marg
tOliveA1    Olive           tCMarg4     Corn Marg
tOliveA2    Olive           tWalnut     Safflower
tOliveA3    Olive           tSesame     Corn
tOliveB1    Olive           tPeanut     Olive
tOliveB2    Olive           tSoybean    Corn

There are two main types of classification: unsupervised and supervised. In a typical situation, a data analyst is faced with the
prospect of an unsupervised classification when first starting out with
a new set of data. The methods of Cluster Analysis allow an analyst
to detect the natural patterns or groupings within a data set and to
determine if class models can, in fact, be developed in the first place.
The ability to distinguish between classes is a two-fold issue:
1) Is there something physical or chemical that distinguishes two
classes in the first place?
2) Do the variables measured on the system correlate to the
differences in a manner that leads to clear and unambiguous
distinction between the classes?
It was pointed out that the best cluster analysis method available
to a data analyst is in fact PCA! This is because PCA provides clear
visual diagnostics of object separation (scores) and the influencing
relationship between the measured variables that result in the object
separations (loadings). PCA is a highly validatable process with great interpretability that can be used to classify new samples in the future; PCA is the most versatile multivariate approach available; all that an analyst needs to do is to become a (very) experienced user!
The ability to use predefined models to classify new objects is
referred to as supervised classification. The Method of Soft
Independent Modelling of Class Analogy (SIMCA) is an excellent
supervised classification approach, where properly validated PCA
models can be collected into a library of models and can be used to
sort out new objects. Each PCA model is validated using a set of
class boundaries, i.e. statistical confidence intervals that are defined
such that, when a new object lies within these boundaries, it is classified with respect to a predefined measure of statistical significance.
Three additional supervised classification methods were also
introduced. Partial Least Squares Discriminant Analysis (PLS-DA)
utilises the power of the PLSR regression approach, this time using
dummy variables as Y-responses to sort objects into their respective
classes. This is performed by modelling the variable structures most
related to the dummy class variable(s). In some cases, PLS-DA can
do a better job than SIMCA, especially when the classes to be
modelled are not entirely homogeneous in composition. PLS-DA in its
classical form cannot handle sample classes that have not been previously modelled and as such, when the model is applied to new objects not in the library, the method perforce classifies them into the closest class. This situation opens the door to many dangers, all
related to the degree of future representativity captured by the
training data set. Advances in the method may be able to provide a
statistical rejection system based on confidence intervals placed
around the –1/+1 (or 0/1) extremes of the model in Y-space, but there
are many provisos.
Linear Discriminant Analysis (LDA) is the classification equivalent
to Multiple Linear Regression (MLR) in regression analysis. It uses
linear, quadratic or other separators to form boundaries between
classes. Its major pitfall is that if it cannot uniquely classify an object
to a specific class, it also reverts to assigning it to the nearest class.
This is clearly an undesirable situation, particularly if a life or death
decision has to be made. The other method of supervised
classification introduced was Support Vector Machines (SVM) which
use advanced non-linear optimisation routines to define boundaries
between classes. The major advantage SVM has over LDA is that
SVM defines class boundaries based on objects close to the interface
of two sets, whereas LDA is influenced by the variability of the entire
set of objects. However, SVM also has the same major pitfall as LDA in that, where an object is not uniquely classified into a distinct class, it will perforce be assigned to the nearest class; see the cautionary comments above.
SIMCA does not have the pitfall of forced (false) classification that the other methods do; however, maintenance of a SIMCA library can be a laborious task when the number of class models becomes large. If an object cannot be uniquely assigned to a SIMCA library class model, it is assigned to the null class, i.e. no classification. A SIMCA library may also become inefficient when too many class models are added to it: in some cases, its ability to uniquely classify new objects may be compromised. In this case, classification ambiguities may
arise. In chapter 13, a model utilisation method known as Hierarchical
Modelling is introduced that allows decision logic to be placed on
classification results in the event of known ambiguities. Unknown
ambiguities are always treated as misclassifications.
There are as many applications for multivariate classification as
there are for multivariate calibration; together the two approaches can
be used to solve many complex classification and discrimination
problems in research and in applied technology and industrial
sectors.

10.11 References
[1] Miller, J.N. and Miller, J.C. (2005). Statistics and Chemometrics
for Analytical Chemistry, 5th Edn. Prentice Hall.
[2] Adams, M.J. (1995). Chemometrics in Analytical Spectroscopy,
Ed by Barnett, N.W. RSC Analytical Spectroscopy Monographs.
[3] Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis,
4th Edn. John Wiley & Sons Inc.
https://doi.org/10.1201/9781420057492.ch10
[4] Romesburg, H.C. (1984). Cluster Analysis for Researchers.
Lifetime Learning Publications, Belmont, California. Reprint
edition Feb 1990, Krieger Publishing Company.
[5] Fisher, R.A. (1936). “The use of multiple measurements in
taxonomic problems”, Ann. Eugenics 7, 179–188.
https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
[6] Wold, S. (1976). “Pattern recognition by means of disjoint
principal components models”, Pattern Recogn. 8, 127–139.
https://doi.org/10.1016/0031-3203(76)90014-5
[7] Massart, D.L., Kaufman, L. and Esbensen, K.H. (1982).
“Hierarchical nonhierarchical clustering strategy and application
to classification of iron meteorites according to their trace
element patterns”, Anal. Chem. 54(6), 911–917.
https://doi.org/10.1021/ac00243a017
[8] Esbensen, K.H., Kaufmann, L and Massart, D.L. (1984).
“Interobject structure of ungrouped iron meteorites as revealed
by advanced clustering: Methods and chemical features”,
Meteorit. Planet. Sci. 19(2), 95–109.
https://doi.org/10.1111/j.1945-5100.1984.tb00032.x
[9] Eriksson, L., Johansson, E., Kettaneh-Wold, N., Trygg, J.,
Wikstrøm, C. and Wold, S. (2006). Multi- and Megavariate Data
Analysis Part I: Basic Principles and Applications. Umetrics Inc.,
Umeå, Sweden.
[10] Brereton, R.G. and Lloyd, G.R. (2014). “Partial least squares
discriminant analysis—taking the magic away”, J. Chemometr.
28, 213–225. https://doi.org/10.1002/cem.2609
[11] Naes, T., Isaksson, T., Fearn, T. and Davies, T. (2017). A User
Friendly Guide to Multivariate Calibration and Classification. IM
Publications, Chichester, UK.
[12] Luts, J., Ojeda, F., Van de Plas, R., De Moor, B., Van Huffel, S.
and Suykens, J.A.K. (2010). “A tutorial on support vector
machine-based methods for classification problems in
chemometrics”, Anal. Chim. Acta 665, 129–145.
https://doi.org/10.1016/j.aca.2010.03.030
[13] Vong, R., Geladi, P., Wold, S. and Esbensen, K. (1988). “Source
contributions to ambient aerosol calculated by discriminant
partial least squares regression (PLS)”, J. Chemometr. 2, 281–
296. https://doi.org/10.1002/cem.1180020406

* Yes, you read correctly: in the presence of a significant noise component in the data, the C is to be fixed by a cross-validation procedure. Based on comprehension of the representativity issues detailed in chapters 3 and 8 (see also chapter 9), one should feel suitably concerned leaving this critical parameter to the vagaries of this mandated validation approach.
Chapter 11. Introduction to Design of Experiments (DoE) Methodology

Carefully selected samples (derived from well-planned sampling strategies, chapter 3) increase the chances of extracting useful information from data sets. When the possibility to actively perturb a system exists, the chances of extracting useful information increase further. The critical part of this process is to decide which variables to change, the intervals for this variation, and the pattern of the experimental points.
Experimental objectives include, but are not limited to, the investigation of some business- or research-critical phenomenon, the creation of a new (or improvement of an existing) product, or the optimisation of a process. Whatever the objective, the only way it can be achieved is by performing experimentation, which leads to knowledge about the way things work. Typically, there are no specific theoretical models that explain what happens when, for example, 20 ingredients are mixed, stirred, heated up then cooled down. Thus, there is a reliance on Empirical Modelling approaches to gain an understanding of the system.
Experiments are usually costly and time-consuming; therefore, the
goal is to minimise the total number of runs performed, while ensuring
that each single run provides as much value for money as possible.
This is the realm of Design of Experiments (DoE).
11.1 Experimental design
Design of Experiments (DoE), also referred to as Experimental Design, encompasses a large number of methods, each specific to a particular experimental situation. This chapter aims to provide a short but comprehensive overview of the various methods and approaches that are available in the Design-Expert® section of The Unscrambler®. For classical overviews of DoE methodology, the interested reader is strongly urged to read the works of Box, Hunter and Hunter [1] and Montgomery [2].

11.1.1 Why is experimental design useful?

DoE is a powerful set of techniques whose main purpose is to design efficient experimental strategies, instead of varying one variable at a time and keeping the rest constant (which is the traditional way of performing experiments). In DoE, many factors are varied simultaneously in a systematic and smart way using the concept of factorial designs. The purposes of DoE are:
Efficiency: get more information from fewer experiments
Focusing: collect only the information that is important to the
objective of the design

11.1.2 The ad hoc approach


In many situations, a single experiment is performed and, based on the outcome, subject matter expertise and experience are used to decide what happens next. Sometimes an experimenter gets lucky and obtains the desired results with the chosen set of parameters; however, this situation is rare, and they are seldom lucky a second time.
There are three major problems with this “scatter gun” approach.
First, experimenters who apply this approach will rarely understand
how their system really works, so it may be difficult to transfer the
knowledge to a new application. Second, since there is usually some
amount of variability in the outcome of each experiment, interpreting
the results of just two successive experiments can be misleading
because a difference due to chance only can be mistaken for a true,
informative difference. Last, there is a risk that their solution is not
optimal. In some situations, this does not matter, but if the aim is to
find a solution close to optimum, or to gain better understanding of
the system, an alternative strategy is recommended.

11.1.3 The traditional approach—vary one variable at a time

At university, experimenters have traditionally been taught to vary only one variable at a time, as this is considered the safest way to control experimental conditions. This approach is typically advocated as the best way of interpreting the results from experiments. This is often a delusion, although it is unfortunately called the “Scientific Method”.
Using a simple example from an investigation of the conditions of bread baking, the dangers of varying one variable at a time are illustrated. Listing the experimental variables, such as ingredients and
process parameters, the experimenter must have some preconceived
idea of what may have an influence on the volume of the final baked
bread, then study each of them separately. An initial study has
indicated the following input parameters to be potentially important:
Type of yeast,
Amount of yeast,
Resting time and
Resting temperature.
First, set the type of yeast, amount of yeast and resting
temperature to arbitrary values (for instance, those most commonly
used: e.g. the traditional yeast, 15 g per kg of flour, with a resting
temperature of 37°C); then study how bread volume varies with time,
by changing resting time from 30 to 60 minutes. A rough graph drawn
through the points leads to the conclusion that under these fixed
conditions, the best resting time is 40 minutes, which gives a volume
of 52 cL for 100 g of dough (Figure 11.1).
The next step may be to start working on the amount of yeast,
with resting time set at its “best” value (40 minutes) and the other
settings unchanged, while changing the amount of yeast from 12 g to
20 g as in Figure 11.2. At this “best” value of resting time, the best
value of the amount of yeast is close to the 15 g typically used, giving
a volume of about 52 cL. Now the conclusion might seem justified
that an overall maximum volume is achieved with the conditions
“amount of yeast = 15 g, resting time = 40 minutes”.

Figure 11.1: Volume vs resting time. First set of experiments, with the traditional yeast, 15g
per kg of flour and a temperature of 37°C.

The graphs show that, if either amount of yeast or resting time is individually increased or decreased from these conditions, volume will be reduced. But they do not reveal what would happen if these variables were changed together, instead of individually!
Figure 11.2: Volume vs yeast. Second set of experiments, with the traditional yeast, a resting
time of 40 minutes and a resting temperature of 37°C.

To understand the possible nature of the synergy, or interaction, between the amount of yeast and resting time, a Response Surface (see Figure 11.3) is required, which shows how bread volume varies for any combination of amount of yeast and resting time within the investigated ranges. It corresponds to the two individual plots of Figures 11.1 and 11.2. However, if the contour plot represents the true relationship between volume and yeast and time, the actual maximum volume will be about 61 cL, not 52 cL! A volume of 52 cL would also be achieved at, for example, 50 minutes and 18 g kg⁻¹, which is quite different from the conditions found by the One-Variable-at-a-Time method. The maximum volume of 61 cL is achieved at 45 minutes and 16.5 g kg⁻¹ of yeast.
Figure 11.3 illustrates that the One Variable at a Time (OVAT)
strategy very often fails because it assumes that finding the optimum
value for one variable is independent from the level of any others
investigated. Usually this is not true.
In the bread baking case, the OVAT approach does not guarantee that the optimal amount of yeast is unchanged when resting time is modified. On the contrary, it is generally the case that the setting of one input parameter may influence the effect of the others: this phenomenon is called an interaction. Using another example, if a sports drink is made that contains both sugar and salt, the perceived sweetness does not only depend on how much sugar it contains, but also on the amount of salt. This is because salt interacts with the perception of sweetness as a function of sugar level.

Figure 11.3: Optimum in response surface. Contour plot showing the values of bread volume
for all possible combinations of resting time and amount of yeast.

To summarise, the classical approach to experimentation typically leads to the following problems:
Interactions are likely to be missed between two (or more) input variables.
Random variations cannot be distinguished from true effects.
A prediction of what would happen for an experiment that has not been run cannot be made.
The number of experiments required to achieve a specific goal is not known in advance, possibly leading to many more runs than are actually required.

11.1.4 The alternative approach

The previous section described how the traditional approach to experimentation is flawed by the assumption that causal effects can only be proven if the potential causes are investigated separately. DoE is based upon a mathematical theory which makes it possible to investigate all potential causes together and still draw safe conclusions about all individual effects, independently from each other. To further justify its use, this mathematical foundation also ensures that the impact of the experimental error on the final results is minimal, provided that all experimental results are interpreted together (and not sequentially as in the classical approach).
In short, DoE has the following advantages:
The number of experiments to perform is known up front.
The individual effects of each potential cause, and the way these
causes interact, can be studied independently from each other
from a single set of designed experiments.
The result of a DoE is a model which enables the prediction of what
would happen for any experiment within the Design Space.
Statistical significance can be assigned to the observed effects, which allows an experimenter to distinguish real experimental changes from random variations.

11.2 Experimental design in practice


The following lists a general workflow for designing rational
experiments.
1) Define the output variables to be studied. These are called
Responses and there will be one value measured per response, per
experimental run.
2) Define the input variables to be varied. These are called
Experimental Factors, or simply Factors and these are typically
controllable, i.e. there is some precision with respect to setting the
levels of the variables.
3) For each factor, define a range of variation for continuous variables
or a list of the levels of categorical variables to be investigated.
4) Define the amount of information to be gained. The alternatives are:
a) find out which variables are the most important (known as
Screening).
b) study the individual effects and interactions of a rather small
number of factors (known as Factor Influence Studies or
Characterisation).
c) find the optimum values of a small number of factors (known as
Optimisation).
5) Choose the type of design which achieves the objective in the most
economical way.
The various types of designs to choose from are detailed in the
next sections. The following is a brief overview of how to analyse the
experimental results:
1) Define which model is compatible with the objective of the
experiment. The alternatives are:
a) to find out which variables are the most important: a linear model
(studies main effects).
b) to study individual effects and interactions: a linear model with
interaction effects.
c) to find an optimum: a quadratic model (including main,
interaction and square effects).
2) Compute the observed effects based on the chosen model, and
conclude on their significance.
3) Interpret the significant effects and use this information to
determine whether the initial goal is achievable.
Building a research and development project should always start
with a brainstorming session. This is where the problem definition and
objectives are outlined in a strategy that will lead to a greater
likelihood of success. The brainstorming session may be the most
important step in the project!
Brainstorming should also be a creative exercise and it is very
useful to involve key subject experts to discuss as many aspects as
possible that can vary in a product or a process. Anything overlooked
at this stage may result in the entire process being repeated from the
beginning. The following is a suggested workflow for defining and
implementing an effective experimental design.

11.2.1 Define stage

What product or processes will the design be applied to?


What output(s) are to be measured? What are the target values for
any response measured?
What is the precision of the devices used to measure the
response(s)?
What can be controlled and what cannot be controlled, for
example, environmental conditions etc.
Consider what main effects are likely to interact with each other up
front based on subject matter expertise.
Do not assume that all chosen levels and combinations will work
and give reasonable responses.

11.2.2 Design stage


If the problem has many factors associated with it, consider a
screening design to isolate a smaller number of influential factors.
Do not waste all of the available budget on the initial designs. Use
the properties of the designs to guide next step experimentation.
As will be seen in later sections, every time a factor can be
eliminated from a design, the power of the design will increase,
thus providing more information about the remaining factors (and
possibly their interactions) without having to perform more
experiments.
If a reasonable number of factors (i.e. between 2 and 7) remains
after screening, either use high resolution fractional factorial
designs or full factorial designs to understand the influence of
factors and their interactions.
Only after a complete factor influence study has been performed
should an optimisation design be considered.
When budget constraints limit the number of experiments that can
be performed, or the levels of some design points cannot be
physically achieved, then consider using a computer-generated
design.

11.2.3 Analyse stage

Since the designs used fit exact mathematical models, it is important to assess whether there is a lack of fit of the data to the model. If there is a significant lack of fit, either look for a different model type or investigate the precision of the measurements using replicates or centre points.
When a significant model with insignificant lack of fit can be
established, interpret the main effects and any significant
interactions. In some cases, a quadratic model may be a better fit
to the data.
Make sure common sense is used when interpreting the output of
an experimental design.
If more than one response variable is being measured per design, analyse the designs for each response and use methods
like Principal Component Analysis (PCA, chapter 6) to understand
the relationship of the responses with each other. This will provide
more interpretability when assessing the model.
Determine if the experimental region contains the optimal point. If
not, use methods such as steepest ascent to determine where the
maximum (or minimum) response lies and perform an optimisation
experiment using the new local maxima/minima as the centre point
of the design.
Validate the model using points not included in the model. This is
the only way of ensuring that the model is fit for purpose.

11.2.4 Improve stage

When a significant model can be validated against new data, then the next step in the process is to implement the changes that were found to be optimal in the model.
found to be optimal in the model.
Monitor the process/product change in the early stages of
implementation to ensure that a stable point was found in the
model.
Use the model as a source of continuous improvement throughout
the lifecycle of a product or process.
By following the Define, Design, Analyse and Improve workflow in a systematic manner, an experimenter will more often than not be led to the correct decisions.

11.2.5 The concept of factorial designs


Factorial designs are the heart of DoE and they define a precise
experimental region to be investigated (sometimes known as a
Design Space). Within the Design Space it is investigated whether a
change in an experimental parameter (a factor) results in a response
that is significantly different from random variations. For example, will
the change of an ingredient or process parameter make the product
quality change? Each factor is studied at only a few levels, usually
two: varying them from a low to a high level.

11.2.6 Full factorial designs

A Full Factorial Design is the easiest design type with which to describe how DoE works in practice. Consider three factors: the amounts of salt (factor 1),
sugar (factor 2) and lemon (factor 3) used in a sports drink recipe.
Each factor is varied from a low (–) to a high (+) level based on a
simple tree diagram (refer to Figure 11.4).
Starting from the origin, factor 1 (salt) is varied at a high (+) level
and a low (–) level. This results in two experimental runs. The next
factor (sugar) is added to the design. Now, for each level of salt, a
high and low level of sugar must be run. This results in four
experimental runs. Finally, for the third factor (lemon) for each run of
combinations of salt and sugar, a high and a low level of lemon must
be run. This results in eight experimental runs in total.
Figure 11.4: Simple tree diagram for constructing a full factorial design.

Using a geometrical representation, if the experimental runs are plotted on a Cartesian coordinate system, the eight experimental runs lie at the corners of a cube, see Figure 11.5.
Three design variables varied at two levels give 2³ = 8 experiments using all combinations. This gives rise to the family of experiments known as the 2ᵏ factorial designs, where k is the number of experimental factors to be studied. The fundamental property that makes a factorial design useful for interpreting experimental results is that the factors in the design are set in such a way that they are orthogonal to each other. Since the method used to construct models is Least Squares (in particular Multiple Linear Regression, chapter 7), the enforced orthogonality of the design means that the model is uniquely interpretable. The table in Figure 11.5 shows the low (–) and the high (+) settings in each experimental run in a systematic order—standard order.
Figure 11.5: Geometric representation of a full factorial design.

Based on purely mathematical reasoning, as k increases, the number of runs increases exponentially. Therefore, in practice, the use of full factorial designs is limited to around five factors (2⁵ = 32 experimental runs), which is already too many runs for many industrial applications. The advantage of Full Factorial Designs is that they can estimate the main effects of all design variables and all interaction effects. There is typically no need to generate designs manually as programs such as Design-Expert® do this automatically. All an experimenter has to do is define which factors to vary by defining the range of low and high levels for each factor.
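For readers who want to see the construction explicitly, the short Python sketch below generates the coded runs of a 2ᵏ full factorial in standard order. It is a generic illustration of the combinatorics, not the algorithm used internally by Design-Expert®.

```python
# Sketch: generate the 2^k full factorial design in standard (Yates) order,
# with coded levels -1 (low) and +1 (high) and the first factor varying fastest.
from itertools import product

def full_factorial(k):
    """Return the 2^k runs as tuples of coded levels."""
    # product() varies the last element fastest, so reverse each tuple to make
    # the first factor the one that alternates, which matches standard order.
    return [tuple(reversed(run)) for run in product([-1, +1], repeat=k)]

for run in full_factorial(3):
    print(run)
# 8 runs: (-1,-1,-1), (+1,-1,-1), (-1,+1,-1), ..., (+1,+1,+1) - the corners of a cube
```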

11.2.7 Naming convention

In Figure 11.5, the names of the experimental runs are presented in the conventional manner. In a design table, the values for the first experimental run are always (–1) when listed in standard order. This run is designated the name (1). The next run always has the factor A as (+1) and all other main factors have the values (–1). In this case the run is labelled (a), as it is the only run with a positive value for factor A. This convention is continued throughout the design until the last run, which in most cases is the run with all (+1) for the main effects. For the 2³ design, this is designated (abc).

Effects

The variation in a response generated by varying a factor from its low to its high level is called the main effect of that factor on a particular
response. It is computed as the linear variation of the response over
the whole range of variation of the factor. There are several ways to
judge the importance of a main effect, for instance significance
testing or use of Pareto charts, normal or half normal probability plot
of effects (see section 11.2.18 for more details).
Some variables need not have an important impact on a response
by themselves to be called important. The reason is that they can
also be involved in a significant interaction. There is an interaction
between two variables when changing the level of one of those
variables modifies the effect of the second variable on the response.
Interaction effects are computed using the products of several
variables (cross-terms). There can be various orders of interaction:
two-factor interactions involve two factors, three-factor interactions
involve three of them etc. The importance of an interaction can be
assessed with the same tools as for main effects.
In general, factors that have an important main effect are
important variables. Variables that participate in an important
interaction, even if their main effects are negligible, are also important
variables.

Effects—definition and calculation

Consider the bread baking experiment where bread volume was investigated by performing different combinations of low (35°C) and high (37°C) temperature and yeast type C1 or C2. Since there are only two experimental factors to be varied, there are 2ᵏ = 4 experimental runs to be performed (k = 2). The measured values of Volume for the four different combinations of the design factors are written inside the resulting design space (in this case, a square). This is shown in Figure 11.6.
A main effect shows the mean change in a response variable for a
change in a given factor, while all other factors are kept at their mean
value. This is calculated as follows:

Main effect of Temperature = Mean response at high Temp – Mean response at low Temp

Figure 11.7 shows the main effects of Temperature and Yeast Type on Bread Volume.

The main effect of temperature


Volume increases from 35 [= (40 + 30) / 2] to 40 [= (60 + 20) / 2] when the temperature is changed from 35°C to 37°C. Thus, the main effect of Temperature on Volume is +5.
Figure 11.6: Calculation of effects for bread baking example.

Figure 11.7: Volume vs yeast and temperature.

The main effect of yeast

Volume increases from 25 [= (30 + 20) / 2] to 50 [= (40 + 60) / 2] when the yeast type is changed from C1 to C2. Thus, the main effect of Yeast on Volume is +25.
An interaction reflects how much the effect of a first factor changes when a second factor is changed from its average value to its high level (which amounts to the same as shifting it halfway between low and high):

The interaction effect between yeast and temperature

The Volume increases by +20 (= 60 – 40) when temperature is changed from 35°C to 37°C using yeast C2. But Volume decreases by 10 (–10 = 20 – 30) when temperature is changed from 35°C to 37°C, using yeast C1.
The interaction effects are shown in Figure 11.8.
The overall interaction effect is an increase with one yeast, but a
decrease with another; the effect of temperature depends on which
yeast is used. So, the interaction effect is
Figure 11.8: Interaction effect.

(1 / 2) × [20 – (–10)] = 30 / 2 = +15.

The interaction effect is illustrated in Figure 11.9.
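The arithmetic above is easy to check in a few lines of code. The Python sketch below recomputes the two main effects and the interaction from the four bread volumes quoted in this section (30, 40, 20 and 60 cL); it is purely a worked check of the sign-averaging rule, with the factors written in coded –1/+1 form.

```python
# Worked check of the bread-volume effects: coded levels are
# -1 = 35 deg C / yeast C1 and +1 = 37 deg C / yeast C2.
temp  = [-1, -1, +1, +1]
yeast = [-1, +1, -1, +1]
vol   = [30, 40, 20, 60]   # volumes (cL) for (C1,35), (C2,35), (C1,37), (C2,37)

def effect(signs, response):
    """Effect = mean response at the (+) level minus mean response at the (-) level."""
    plus  = [r for s, r in zip(signs, response) if s > 0]
    minus = [r for s, r in zip(signs, response) if s < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

interaction = [t * y for t, y in zip(temp, yeast)]   # sign column of the cross-term
print("Main effect of Temperature:", effect(temp, vol))              # +5
print("Main effect of Yeast type:", effect(yeast, vol))              # +25
print("Temperature x Yeast interaction:", effect(interaction, vol))  # +15
```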

11.2.8 Calculating effects when there are many experiments


For the case of three variables using a Full Factorial Design, the full design table, with interactions, is shown in Table 11.1. This table
shows the main effects and interactions in terms of (+) and (–) values.
The design is constructed so that the overall orthogonality of the
design is maintained. Table 11.1 is a continuation of the sports drink
example and now shows the values of sweetness as the response.
This was assessed by a trained sensory panel on a scale between 0
and 9.
Figure 11.9: Interaction effect.

The main effect of Salt (A) on the response Sweetness in the table above is –1.35. This is calculated as follows:
First, take all responses for salt with a (+) sign and take the average value: (2.1 + 2.8 + 3.5 + 4.8) / 4 = 3.3
Next, take all responses for salt with a (–) sign and take the average value: (1.7 + 5.2 + 4.5 + 7.2) / 4 = 4.65
Finally, take the difference between the average (+) value and the average (–) value, giving the main effect for salt: Main Effect of Salt = 3.3 – 4.65 = –1.35
Interpretation: This means that by increasing Salt from its low to its high level, the response Sweetness will decrease by 1.35 units.
This also makes common sense (for those who have experienced the
difference between salt, sugar and lemon tastes).
All other main effects and interactions are calculated in a similar
way by grouping (+) and (–) signs and taking the average difference
for each factor and interaction term in the design table.
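The same grouping of (+) and (–) responses can be automated. The Python sketch below recomputes every effect in Table 11.1 from the coded design and the Sweetness scores; it is only an illustrative check of the calculation, not a substitute for the DoE software.

```python
# Recompute all effects of the 2^3 sports-drink design (Table 11.1) by sign averaging.
design = [  # (A, B, C) in standard order: 1, a, b, ab, c, ac, bc, abc
    (-1, -1, -1), (+1, -1, -1), (-1, +1, -1), (+1, +1, -1),
    (-1, -1, +1), (+1, -1, +1), (-1, +1, +1), (+1, +1, +1),
]
sweetness = [1.7, 2.1, 5.2, 2.8, 4.5, 3.5, 7.2, 4.8]

def effect(signs):
    plus  = [y for s, y in zip(signs, sweetness) if s > 0]
    minus = [y for s, y in zip(signs, sweetness) if s < 0]
    return sum(plus) / len(plus) - sum(minus) / len(minus)

cols = {"A": [r[0] for r in design],
        "B": [r[1] for r in design],
        "C": [r[2] for r in design]}
cols["AB"]  = [a * b for a, b in zip(cols["A"], cols["B"])]
cols["AC"]  = [a * c for a, c in zip(cols["A"], cols["C"])]
cols["BC"]  = [b * c for b, c in zip(cols["B"], cols["C"])]
cols["ABC"] = [ab * c for ab, c in zip(cols["AB"], cols["C"])]

for name, signs in cols.items():
    print(f"Effect {name}: {effect(signs):+.2f}")   # e.g. Effect A: -1.35
```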
The effects and interactions may also be found by fitting the
experimental data to the common regression equation (equation 11.1), i.e. by finding the regression coefficients bi. This can be done using
several methods. MLR is the most usual, but PLSR or PCR (chapter
7) may also be used. If there are three design variables and it is
desired to investigate one response, the following equation will be
used (equation 11.2):

y = b0 + b1A + b2B + b3C + b12AB + b13AC + b23BC + b123ABC   (11.2)
Table 11.1: Complete design for sports drink sweetness.

Effects
Run   I   A   B   C   AB  AC  BC  ABC  Response
1     1  –1  –1  –1   1   1   1   –1   1.7
a     1   1  –1  –1  –1  –1   1    1   2.1
b     1  –1   1  –1  –1   1  –1    1   5.2
ab    1   1   1  –1   1  –1  –1   –1   2.8
c     1  –1  –1   1   1  –1  –1    1   4.5
ac    1   1  –1   1  –1   1  –1   –1   3.5
bc    1  –1   1   1  –1  –1   1   –1   7.2
abc   1   1   1   1   1   1   1    1   4.8

There are eight terms in the above equation. The first term b0 is
the intercept of the model and has a physical value of the mean of all
of the response terms.
The main effects associated with any factor are the average of the
observed difference of the response when that factor is varied from
the low to the high level. The estimated effect equals twice the b-
coefficient for the factor in the regression equation. This is because
the design is run at two levels and by multiplying the coefficient by +1
and –1, this results in a doubling of the range. For the 2³ design, there are three main effects associated with the factors.
An interaction effect, for example AB, means that the influence of changing variable A will depend on the setting of variable B. This is analysed by comparing the effects of A when B is at different levels. If these effects are equal, then there is no interaction effect AB. If they are different, then there is an interaction effect. Estimated interaction effects again equal twice the corresponding b-coefficients. For the 2³ design, there are three two-factor interactions (2FI) and one three-factor interaction (3FI). In the case where there is no interaction between variables, when an effects plot is drawn, the lines from low to high (or vice versa) will be parallel to each other.
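The equivalence between effects and regression coefficients is easy to demonstrate numerically. The sketch below fits the Table 11.1 data by ordinary least squares (NumPy only) and shows that each estimated effect equals twice the corresponding b-coefficient, while b0 equals the grand mean of the responses.

```python
# Least-squares fit of the 2^3 sports-drink data; effect = 2 * b for every model term.
import numpy as np

A = np.array([-1, +1, -1, +1, -1, +1, -1, +1])
B = np.array([-1, -1, +1, +1, -1, -1, +1, +1])
C = np.array([-1, -1, -1, -1, +1, +1, +1, +1])
y = np.array([1.7, 2.1, 5.2, 2.8, 4.5, 3.5, 7.2, 4.8])

# Model matrix: intercept, three main effects, three 2FIs and one 3FI (orthogonal columns)
X = np.column_stack([np.ones(8), A, B, C, A*B, A*C, B*C, A*B*C])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

for name, coef in zip(["I", "A", "B", "C", "AB", "AC", "BC", "ABC"], b):
    print(f"{name:>3}: b = {coef:+.3f}   effect = {2*coef:+.3f}")
# b0 is the grand mean of the responses; doubling the other coefficients
# reproduces the sign-averaged effects calculated above.
```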
In general, to estimate how many main effects and interactions
there are in a factorial design, the use of Pascal’s Triangle is required.
This is shown in Figure 11.10 for designs up to five factors.
Taking the five-factor experiment as an example, the following
main effects and interactions are obtained,
1 × Intercept Term
5 × Main Effects
10 × 2FIs
10 × 3FIs
5 × 4FIs
1 × 5FI
Pascal’s Triangle shows that for the 2⁵ full factorial design, 16 of the 32 experiments are used to determine three-factor or higher interactions. The next section on Fractional Factorial designs provides methods for reducing the number of experiments in the presence of high order interactions.
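The entries of Pascal’s Triangle are binomial coefficients, so the bookkeeping above can be generated directly. The short sketch below tabulates, for up to five factors, how many terms of each order a full factorial supports.

```python
# Count the model terms of a 2^k full factorial: C(k, m) terms of order m
# (m = 0 is the intercept, m = 1 the main effects, m = 2 the 2FIs, and so on).
from math import comb

for k in range(2, 6):
    counts = [comb(k, m) for m in range(k + 1)]
    summary = ", ".join(f"{n} x order-{m}" for m, n in enumerate(counts))
    print(f"2^{k} design ({2**k} runs): {summary}")
# For k = 5: 1 intercept, 5 main effects, 10 2FIs, 10 3FIs, 5 4FIs and 1 5FI.
```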

11.2.9 The concept of fractional factorial designs

As shown by example using Pascal’s Triangle, Full Factorial Designs result in many runs when the number of factors increases. These full factorial experiments generate many experimental runs to determine primarily insignificant interaction effects, but what justifies calling these interactions insignificant? Based on a purely probabilistic argument, the likelihood of a single factor influencing a response is typically high, as too is the likelihood of two factors interacting. A three-factor interaction is less likely, as getting three variables to interact is usually difficult. Only in rare or special cases do four or more factors interact with each other, so the general consensus in DoE is that 2FIs are common, 3FIs are rare and 4FI+ are almost never encountered. This points to the need for a more economical alternative to experimental design for larger numbers of factors.
Figure 11.10: Pascal’s Triangle for up to five factors.

It is not necessary to perform all combinations of experiments to determine the main effects. By choosing a smart subset (a fraction) of the design, fewer experiments can be performed and the estimation of both main effects and interactions is still possible.
Figure 11.11 shows diagrammatically how a fraction of the 2³ factorial design is constructed.
The subset of combinations shown in Figure 11.11 shows the half fraction of the full design. By half fraction, it is meant that the final design results in half of the experimental runs when compared to the original design. This is represented as 2³⁻¹ = 4 experiments. This is a Fractional Factorial Design with a degree of fractionality of one.
There are a number of ways to construct a fractional factorial design; the simplest is to code all combinations of low and high levels for the first two design variables (X1 and X2) with a (+) or (–) sign. Then the sign coding for the third variable, X3, is found by multiplying the signs for the first two. For instance, in the first experiment, multiply a negative sign for X1 with a negative sign for X2 to find the sign for X3: a positive sign (refer to Figure 11.11). For larger designs, this process can become very laborious and fortunately software does this process in practice.
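As an illustration of this sign-multiplication rule (and not of any particular software implementation), the sketch below builds the half fraction of the 2³ design by writing out the full design in X1 and X2 and generating X3 = X1 × X2.

```python
# Construct the half fraction of the 2^3 design from the generator X3 = X1*X2
# (equivalently, the defining relation I = X1X2X3).
from itertools import product

half_fraction = []
for x1, x2 in product([-1, +1], repeat=2):   # full 2^2 design in X1 and X2
    x3 = x1 * x2                             # sign of X3 is the product of the first two
    half_fraction.append((x1, x2, x3))

for run in half_fraction:
    print(run)
# Four of the eight cube corners: (-1,-1,+1), (-1,+1,-1), (+1,-1,-1), (+1,+1,+1)
```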

11.2.10 Confounding
The price to be paid for performing fewer experiments is called confounding, which means that some terms cannot be distinguished from each other. This happens because of the way the fractional designs are built: some of the resources that would otherwise have been devoted to the study of interactions are now used to study the main effects of more variables instead. Confounding is sometimes also referred to as Aliasing.

Run   X1: Sugar   X2: Salt   X3: Sugar × Salt
a         +1          –1          –1
b         –1          +1          –1
c         –1          –1          +1
abc       +1          +1          +1

Figure 11.11: Construction of a fractional factorial design.

To illustrate confounding, an alternative method for constructing a fractional factorial design, using the half fraction of the 2⁴ design, will be presented. Table 11.2 provides the full 2⁴ design and its interaction terms.
To find the half fraction of the 2⁴ design, the design is sorted by the highest interaction term, in this case the 4FI X1X2X3X4 column. The primary fraction is the column of all (+) signs while the alternative fraction contains all (–) signs. The primary fraction is provided in Table 11.3.
Inspection of Table 11.3 shows that column X1X2 is the same as column X3X4, i.e. they are confounded or aliased with each other. The following confounding pattern is observed in Table 11.4.
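The confounding pattern can also be found computationally by building the primary fraction and looking for effect columns that have become identical. The sketch below does this for the half fraction of the 2⁴ design with defining relation I = X1X2X3X4; it is a generic illustration rather than a reproduction of any software output.

```python
# Detect aliased (confounded) terms in the half fraction of the 2^4 design.
from itertools import combinations, product
from math import prod

factors = ["X1", "X2", "X3", "X4"]
# Primary fraction: the 8 runs of the full 2^4 design for which X1*X2*X3*X4 = +1
runs = [r for r in product([-1, +1], repeat=4) if prod(r) == +1]

def column(term):
    """Sign column of a term such as ('X1', 'X2'): elementwise product of its factors."""
    idx = [factors.index(f) for f in term]
    return tuple(prod(run[i] for i in idx) for run in runs)

terms = [t for order in range(1, 5) for t in combinations(factors, order)]
cols = {"".join(t): column(t) for t in terms}

# Terms whose sign columns are identical cannot be distinguished (they are aliased)
groups = {}
for name, col in cols.items():
    groups.setdefault(col, []).append(name)
for aliased in groups.values():
    if len(aliased) > 1:
        print(" = ".join(aliased))
# Prints: X1 = X2X3X4, X2 = X1X3X4, X3 = X1X2X4, X4 = X1X2X3,
#         X1X2 = X3X4, X1X3 = X2X4, X1X4 = X2X3
# (X1X2X3X4 itself is the defining relation, aliased with the intercept I.)
```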
One thing to note about the half fraction of the 2⁴ design is that, in the confounding pattern, main effects are confounded with 3FIs only and 2FIs are confounded with each other. As a consequence, if a
main effect is found to be significant, then a decision has to be made
whether the main effect is important, or the interaction. In many
cases, 3FIs are usually not important and therefore a decision can be
made to justify the importance of the main effect over the interaction.
The situation is more complicated when a 2FI is found to be
significant, as they cannot be separated from each other. There are
two alternatives in this case,
1) Use Subject Matter Expertise to decide which one of the
interactions is likely to be important or
2) Use the half fraction as justification to run the alternative fraction
and determine which of the interactions is important from the full
factorial design.
The degree of confounding is also known as Design Resolution,
with the higher resolution designs being less confounded. Table 11.5
provides a description of some common design resolutions and the
terminology used to describe them.
Table 11.4 provides examples of other types of fractional factorial designs. When the degree of fractionality is 2, for example the 2⁵⁻² design, this is the quarter fraction of the full design, i.e. five factors analysed in eight experimental runs. This is the reason why it is a Resolution III design, since there are so few runs to fully analyse the design space. These lower resolution designs can be useful for isolating main effects only and ignoring any interactions; however, their widespread use is discouraged and they should be used mainly for robustness testing. In a low-resolution screening design, for example, seven factors can be analysed in eight experiments using the 2⁷⁻⁴ design. In the event that more information is required, a fold-over design can be used that performs the next fraction of the design, i.e. changes the sixteenth fraction to an eighth fraction, which is a resolution IV design. Thus, all main effects will be confounded with 3FIs and 2FIs will be confounded with each other.

Table 11.2: Complete 2⁴ full factorial design.


Term I X1 X2 X3 X4 X1X2 X1X3 X1X4 X2X3 X2X4 X3X4 X1X2X3 X1X2X4 X1X3X4 X2X3X4 X1X2X3X4

1 + – – – – + + + + + + – – – – +

a + + – – – – – – + + + + + + – –

b + – + – – – + + – – + + + – + –

ab + + + – – + – – – – + – – + + +

c + – – + – + – + – + – + – + + –
ac + + – + – – + – – + – – + – + +

bc + – + + – – – + + – – – + + – +

abc + + + + – + + – + – – + – – – –

d + – – – + + + – + – – – + + + –

ad + + – – + – – + + – – + – – + +

bd + – + – + – + – – + – + – + – +

abd + + + – + + – + – + – – + – – –

cd + – – + + + – – – – + + + – – +

acd + + – + + – + + – – + – – + – –

bcd + – + + + – – – + + + – – – + –

abcd + + + + + + + + + + + + + + + +

Table 11.3: Primary fraction of the 2⁴ full factorial design.


Term I X1 X2 X3 X4 X1X2 X1X3 X1X4 X2X3 X2X4 X3X4 X1X2X3 X1X2X4 X1X3X4 X2X3X4 X1X2X3X4

1 + – – – – + + + + + + – – – – +

ab + + + – – + – – – – + – – + + +

ac + + – + – – + – – + – – + – + +

bc + – + + – – – + + – – – + + – +

ad + + – – + – – + + – – + – – + +

bd + – + – + – + – – + – + – + – +

cd + – – + + + – – – – + + + – – +

abcd + + + + + + + + + + + + + + + +

An even more effective design for analysing many factors in few experiments is the Plackett–Burman design. These designs are part of a family of designs called Screening Designs and will be discussed in more detail in section 11.8.1.

Table 11.4: Confounding pattern of the half fraction of the 2⁴ factorial design.

Main effects       Interactions
X1 = X2X3X4        X1X2 = X3X4
X2 = X1X3X4        X2X3 = X1X4
X3 = X1X2X4        X1X3 = X2X4
X4 = X1X2X3        X1X2X3X4 = I
11.2.11 Types of variables encountered in DoE

There are typically two main types of variable encountered in DoE, plus one intermediate variable type:
1) Continuous Variables.
2) Categorical Variables.
3) Discrete Numeric (for example, diameters of balls during a milling operation).
Only the first two variable types are discussed in more detail in the following.
Continuous Variables: Have numerical values that are divisible on
an infinite scale. Examples of continuous variables are: temperature,
concentrations of ingredients (in g kg⁻¹ or %…), pH, length (in mm),
age (in years) etc.
The variations of continuous factors are usually set within a pre-
defined range. For two-level factorial designs, the ranges of the
factors are set from a lower level to an upper level. At least those two
levels have to be specified when defining a continuous design
variable and these should be set based on subject matter expertise
such that they induce a significant change in a response variable,
wherever possible.
Since continuous variables can be infinitely divided, the mid-point
between the two factor levels can serve as a goodness of fit point
when applying a linear model to the data. These points are known as
Centre points and are used to diagnose non-linearities. Centre points
allow the comparison of the response values of these points with the
calculated average response over all of the experiments in the design.
If they are not equal (statistically), this may indicate that the
relationship is non-linear. Centre points are only used for diagnostics
checks in linear models and do not contribute to calculating model
terms. In the case of high curvature, a new design build may have to
be considered for describing a quadratic relationship (or higher order
polynomial), for example the Central Composite or Box–Behnken
designs described in section 11.2.16. An alternative is to reduce the
range of the continuous variable such that the fitted model is more
linear in nature.
Figure 11.12 shows how centre points can be used to detect non-
linearities when fitting a linear model to the data. If the centre point
lies on the line defined by the two levels of the design point, then the
assumption of linear modelling cannot be rejected. In the case where
the point lies above or below the line and the distance can be shown
to be different from random variations, then the model form is not
linear and an alternative fit is required.
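One simple version of this centre-point check compares the mean of the replicated centre points with the mean of the factorial (corner) points, using the spread of the centre-point replicates as the noise estimate. The sketch below illustrates the idea with invented numbers; it is a basic t-type comparison, not the exact curvature diagnostic reported by any particular DoE package.

```python
# Illustrative curvature check: do the centre points fall on the plane through
# the corner points? All response values below are invented for illustration.
import numpy as np
from scipy import stats

factorial_responses = np.array([30.0, 40.0, 20.0, 60.0])   # corner runs of a 2^2 design
centre_responses    = np.array([38.5, 39.8, 40.6, 39.1])   # replicated centre points

curvature = factorial_responses.mean() - centre_responses.mean()

# Standard error of that difference, with the centre-point replicates as the error estimate
s  = centre_responses.std(ddof=1)
se = s * np.sqrt(1 / len(factorial_responses) + 1 / len(centre_responses))
t  = curvature / se
p  = 2 * stats.t.sf(abs(t), df=len(centre_responses) - 1)

print(f"curvature = {curvature:.2f}, t = {t:.2f}, p = {p:.3f}")
# A small p-value suggests the centre points do not lie on the fitted plane,
# i.e. a quadratic (or other higher-order) model should be considered.
```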
Centre points also have another important role in DoE. Since replicating every point in the experimental space can be expensive or time-consuming, a replicated centre point can be added to calculate the experimental variability. It can thus be used to check both the reproducibility of the experiments (at least in the centre of the design) and possible non-linearities. The assumption is that the variance is constant over the entire experimental space, and the variance of the centre points is likely to be close to the average variability of the design.

Table 11.5: Design resolution and definitions.

Design resolution   Definition
Resolution III      Main Effects are confounded with 2FIs.
Resolution IV       Main Effects are not confounded with 2FIs. 2FIs are confounded with each other.
Resolution V        Main Effects are not confounded with 2FIs. 2FIs are not confounded with each other, but are confounded with 3FIs.
Figure 11.12: Centre samples.

The practice of including three to four centre points in a design is highly recommended. Not only is it good practice to add randomised centre points to a design to understand the experimental variability but also, if the experiments are to be performed over a number of days or raw material lots etc., centre points are a useful check of bias. Experimental bias can also be assessed using the method of Blocking, which will be explained in section 11.2.15. In the case where centre points are used in blocks of experiments, methods like the t-test or ANOVA (chapter 2) can help isolate whether a significant change in conditions has occurred while performing the experiment.
Categorical Variables: In general, all non-continuous variables are
called categorical variables. Their levels can be named, but not
measured quantitatively. Examples of category variables are: colour
(Blue, Red, Green), type of texture agent (starch, xanthan gum, corn
starch), supplier (Hoechst, Dow Chemicals, Unilever).
A special case of category variables is represented by binary
variables, which have only two levels. Binary variables symbolise an
alternative. Examples of binary variables are: use of a catalyst
(Yes/No), recipe (New/Old), type of sweetener (Artificial/Natural).
Ordinal variables are categorical in nature; the main difference is that they have an order associated with them, i.e. Level 1, Level 2, Level 3 etc.
For each categorical factor included in a design, all levels of this factor must be specified. Since there is a kind of quantum jump from one level to another (there is no intermediate level in-between), centre points can only be specified if there is at least one continuous variable present, otherwise no centre points are possible.
The following examples discuss the use of centre points in a
design when a combination of continuous and categorical variables is
present.

Case 1: 2² design with one categorical variable


In this case, one variable is continuous and one variable is
categorical. Only the continuous variable is allowed to have a centre
point, so the overall centre of the design cannot be assessed, this is
shown in Figure 11.13.
In the presence of the categorical variable, the number of
minimum centre points has doubled. Replicating the centre points
can add to the cost of experimentation in this case in order to assess
whether the variance is equivalent (statistically) between the two
levels of the categorical variable.

Case 2: 2³ design with one categorical variable

When going to the 2³ design with one categorical variable, it is now possible to define an overall centre point for the two continuous variables. This is shown in Figure 11.14.
As was the situation in case 1, the single categorical variable
results in the need for two sets of centre points, one for each level of
the categorical variable.

Case 3: 2³ design with two categorical variables

When the 2³ design has two categorical variables, there is now the requirement for two sets of centre points for each of the categorical variables. This is shown in Figure 11.15.
Every time a new categorical variable is added to a design, the
number of centre points required in the design doubles. This must be
taken into account when designing experiments in order not to use
the entire project budget just measuring these points. The concept of
categorical centre points extends to all designs and is typically
handled by software during the design process.

Figure 11.13: Centre samples for the 2² design with one categorical factor.

11.2.12 Ranges of variation for experimental factors

Setting the ranges of variation for all factors in the design is a critical step that will govern the success of the experimentation. In all
cases, make the range of variation large enough to induce an effect
and small enough to be realistic. If it is suspected that two of the
designed experimental runs will give extreme, opposite results,
perform those runs first (although this breaks the principles of random
sampling, it makes practical sense to do this). If the two results are
indeed different from each other, this means that enough variation
has been generated. If they are too far apart, too much variation has
been generated, and reduction of the ranges must be considered. If
they are too close, try a centre sample: this will indicate the presence
of a very strong curvature!

Figure 11.14: Centre samples for the 2³ design with one categorical factor.

Figure 11.15: Centre samples for the 2³ design with two categorical factors.

In later sections of this chapter, the various types of experimental design will be presented in more detail. The following describes the typical procedure for setting ranges for each design type.
For screening designs, covering as large a region as possible is the
main objective. Since there is no information available in the
regions between the levels of the experimental factors, curvature
cannot be assessed. Selecting the adequate levels is a trade-off
between these two aspects. However, do not choose only a range
that gives “good” quality! Bad results are useful to understand how
the system works. The range for each factor must be chosen that
spans all important variations!
For factor influence designs, the main objective is to fit a linear
model to the data in order to fully understand main effects and
important interactions. In this case, the range of variation must be
chosen such that there is no significant curvature detected in the
model. Model curvature is discussed in more detail in section
11.9.7.
For optimisation designs, these are usually built after a screening or
factor influence study has been performed. At this stage, an
experimenter usually knows in what region of the design the
optimum lies. Depending on the optimisation design chosen the
region of variation may need to be increased (when looking at
models that fit higher order polynomials) or kept the same. This is
highly situation specific.
When in doubt and wherever possible, perform a few pre-trials to
get a better understanding of how the factors behave when they are
varied. Sometimes there are combinations of low and high values for
some variables that cannot be accomplished. In this way, it is easy to
check that the chosen range is wide enough. If the responses are
about the same for the extreme experiments, these factors may have
no effect on the responses, or the range is too narrow. Pre-trials
should also be conducted and responses measured as planned so
that the possibility of altering the procedure or the test plan can be
done before wasting any resources. These initial experiments should
thus also be used to check the reproducibility and the measurement
errors.

11.2.13 Replicates

Replicates, in the sense of DoE, are experiments performed several times and should not be confused with repeated measurements, where the samples are only prepared once but the measurements are performed several times on each.
Each observed point in the design has a certain amount of
imprecision that will reflect on the parameters of the fitted line and
can thus lead to an imprecise model. Replicates allow the opportunity
to collect data to estimate the variability in the response and therefore
place statistical significance on the final model. This is shown in
Figure 11.16.
By replicating a design point the precision of the measured
response can be assessed. “One replicate” means that each
experiment is performed only once, while “two replicates” means that
the experiment has been performed twice (duplicated). Whether the
whole design is replicated or not depends on cost and reproducibility.
As discussed previously, replicated centre points are also useful for
the assessment of the precision of a response. If there is a lot of
uncontrolled or unexplained variability in the experiments, it may be
wise to replicate the whole design.
Overall, the aim of replicate measurements is to provide an
estimation of the experimental error (or pure error). This is useful for
the following reasons:
It gives information about the average experimental error.
It enables the comparison of response variations due to controlled
causes (i.e. due to variation in the factors) with uncontrolled
response variations. If the “explainable” variation in a response is
no larger than its random variation, this means that the changes in
this response cannot be related to the levels of the design
variables.
When no centre samples can be defined (because of category
variables) consideration must be given to replicating the entire design
in order to gain an estimate of experimental uncertainty.
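A minimal sketch of how the pure (experimental) error can be pooled from replicated runs is given below; the replicate values and group names are invented purely for illustration.

```python
# Pool the pure error from replicated runs: within-group sums of squares divided
# by the pooled degrees of freedom (n_i - 1 summed over the replicate groups).
import numpy as np

replicate_groups = {
    "centre point": [39.8, 40.6, 39.1, 40.2],   # invented replicate responses
    "corner (1)":   [30.5, 29.2],
}

ss_pe = sum(np.sum((np.array(v) - np.mean(v)) ** 2) for v in replicate_groups.values())
df_pe = sum(len(v) - 1 for v in replicate_groups.values())
pure_error_sd = np.sqrt(ss_pe / df_pe)

print(f"pure error SS = {ss_pe:.3f} on {df_pe} degrees of freedom")
print(f"pure error standard deviation = {pure_error_sd:.3f}")
```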

11.2.14 Randomisation

Randomisation of experiments is important in order to avoid modelling cumulative effects. For example, suppose the experiments in the design were performed in standard order and the experimentation was started in the morning. If an uncontrollable variable, such as the outside temperature, is affecting the experiments, i.e. the laboratory or factory is heating up as the sun reaches its highest point, then this may be confused in the model as being an effect, when in fact it is not.
By randomising the experimental order, external effects, such as the temperature increase in the laboratory, can be “confused” among the experimental runs and therefore this minimises the risk of the external influence becoming a significant effect. In the case where uncontrollable effects are confused by randomisation, this will also decrease the precision of the responses, as the external variation has to be partitioned into the noise component of the model.
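Randomising the run order is trivially done in software. The sketch below shuffles the standard-order runs of a 2³ design into a random execution order (the fixed seed is only there to make the listing reproducible).

```python
# Randomise the execution order of a 2^3 full factorial.
import random
from itertools import product

standard_order = [tuple(reversed(r)) for r in product([-1, +1], repeat=3)]

run_order = standard_order[:]   # copy; keep the standard order for the later analysis
random.seed(11)                 # fixed seed purely so the printed order is reproducible
random.shuffle(run_order)

for i, settings in enumerate(run_order, start=1):
    print(f"Run {i}: coded factor settings {settings}")
```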

Figure 11.16: Using replicates to estimate model uncertainty.

In the theoretical sense, randomisation is performed in a similar way to drawing “balls out of a hat”. However, in some industrial situations, setting the process parameters in random combinations
situations, setting the process parameters in random combinations
may not be practical, for example, if the equipment being studied is a
blast furnace, changing the temperature from a low to a high value
may also correspond to long waiting times for the furnace to
equilibrate. If the randomisation requires fluctuating the furnace
temperature, this may not be economical in itself. In these cases, the
method of Incomplete Randomisation is used whereby, taking the
example above, the temperature of the furnace is set to its low level
and all other factors are randomised at low temperature. Then the
furnace temperature is raised to its high level and all other factors are
randomised around the high value of temperature.
It should also be noted that blocking is an example of incomplete
randomisation. Blocking is covered in more detail in the next section.

11.2.15 Blocking in designed experiments

In many cases, a complete design cannot be performed in one day, or using one lot of raw material, and therefore has to be split across days or lots. The question then arises, "how do days or material lots influence the model?" To quantify whether any such effect influences the model, blocking is used.
In order to partition a design into rational blocks, a similar process is used as for generating a fractional factorial design. Taking the 2³ full factorial design as an example, the first step is to set the blocking variable as the highest order interaction, in this case the 3FI. Table 11.5 shows the full 2³ design with the (+) signs in the term X1X2X3 blocked (i.e. the block is highlighted in the table).
The shaded rows in Table 11.5 show the experimental runs that
will be performed in block 1. The un-shaded rows are to be
performed in block 2. The column X1X2X3 now becomes the block
effect. Based on the principles of statistical analysis, blocking violates
the laws of randomisation and as such, cannot be tested for
significance. In order to assess if blocking has an influence on the
design, the magnitude of the Mean Square value (section 11.9.1)
must be compared to the effects found to be significant. If the
blocking variable is found to be of the same order of magnitude as
noise, it can be considered insignificant and in this case, the
combined blocks may be analysed as one design. However, if the
blocking effect is of the same order of magnitude as significant model
effects, the combined design cannot be analysed as the results have
been shown to be biased.
When the number of experiments becomes larger, more blocks may be required. Design-Expert® is programmed to provide the blocked design with the least amount of confounding between the model terms and the block effect. Again, the magnitude of the block
effects must be compared to the magnitude of significant model
effects in order to assess whether there is any bias caused by
blocking.
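A minimal sketch of this blocking rule in Python (an illustration only; it simply assigns runs to blocks according to the sign of the X1X2X3 column, mirroring the pattern of Table 11.5):

from itertools import product

runs = list(product([-1, +1], repeat=3))            # the eight runs of the 2^3 design
blocks = {1: [], 2: []}
for x1, x2, x3 in runs:
    block = 1 if x1 * x2 * x3 == +1 else 2          # block 1 where X1X2X3 = (+)
    blocks[block].append((x1, x2, x3))

for b, members in blocks.items():
    print("Block", b, members)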

11.2.16 Types of experimental design

Screening designs
At the start of a new project, there is usually a large number of
potentially important variables to consider. At this stage, the aim of
any experimentation is to find out which are the most important
variables. For a first screening, the most important rule is: do not
leave out a variable that might have an influence on the responses,
unless it is known beforehand that it cannot be controlled in practice.
It is more costly to have to include one more variable at a later stage than to include one extra variable in the first screening design.
For a more extensive screening, variables that are known not to interact with other variables can be left out. If such a variable has a negligible linear effect, it can be set to a constant value (for instance, the least expensive level). If it has a significant linear effect, it should be fixed at the most suitable level to obtain the desired effect on the response.

Table 11.5: 2³ design table showing blocking pattern.

Term I X1 X2 X3 X1X2 X1X3 X2X3 X1X2X3

1 + – – – + + + –
a + + – – – – + +

b + – + – – + – +

ab + + + – + – – –

c + – – + + – – +

ac + + – + – + – –

bc + – + + – – + –

abc + + + + + + + +

The main purpose of the screening design is to isolate main effects only. This means that designs that put much emphasis on
interaction terms are not desired. This leaves design types such as
resolution III fractional factorial designs, some resolution IV designs
and the Plackett–Burman designs. These are discussed in more detail
in the following sections.

Low-resolution fractional factorial designs


These designs typically have a high degree of confounding; however, this makes them ideal for use in screening designs. For example, if 10 factors are to be assessed for importance, a 2¹⁰ full factorial design will result in 1024 experiments being performed. Using Pascal's Triangle, the design would be partitioned as follows,

1 × Intercept Term
10 × Main Effects
45 × 2FIs
120 × 3FIs
210 × 4FIs
252 × 5FIs
210 × 6FIs
120 × 7FIS
45 × 8FIs
10 × 9FIs
1 × 10FIs

This model would use 968 experimental runs in order to estimate potentially insignificant effects. Using a 2¹⁰⁻⁶ (Resolution III) fractional factorial design, 10 factors can be screened in 16 experiments (or in one sixty-fourth of the effort). This would then provide information regarding main effects; however, the biggest assumption made with low resolution screening designs is that an apparent main effect may not in fact be the main effect itself, but rather a confounded higher order interaction. This again is the price to be paid for performing fewer experiments.
An example of the use of a resolution IV design for screening is the analysis of 15 factors in 32 experimental runs using the 2¹⁵⁻¹⁰ design, in which all main effects are confounded with 3FIs or greater and all 2FIs are confounded with each other.

Design redundancy
One of the biggest advantages of DoE over other approaches to
experimentation is known as Design Redundancy. If an experimenter
takes a risk by using a low resolution fractional factorial design, the
more factors that can be eliminated as insignificant, the more
powerful the design becomes. To illustrate design redundancy, the 2³ design of main effects only will be used. This is shown in Table 11.6.
Looking at column X3, the first 4 rows are (–) signs and the last
four rows are (+) signs. Now, looking at column X2, the first 4 rows
consist of 2 (–) signs and 2 (+) signs all carried out at X3 at its (–) level.
The same pattern of (–) and (+) signs is repeated in column X2 for X3 at
its (+) level, i.e. X2 is replicated for the levels of X3 present. But what if
X3 is insignificant? If a factor is found to be insignificant, it can be
removed from the design, thus the design table now looks like the
one shown in Table 11.7.
When X3 (or for that matter any factor) can be removed from a
design, the resulting table is replicated and can be analysed without
the insignificant factor(s), thus providing more confidence in the
precision of the measurements. Taking the example further, if any of
X1 or X2 is also found to be insignificant, then the resulting design
table is replicated twice.
Relating this back to a low-resolution screening design, the same principles as above hold when insignificant factors can be removed from the design. For example, if a 2⁷⁻⁴ Resolution III design is selected, this assesses seven factors in eight experiments. Now, if three factors are found to be insignificant, then the resulting design becomes four factors in eight experiments, i.e. the design is now the 2⁴⁻¹ Resolution IV half fraction design, where main effects can be assessed and 2FIs are confounded with each other. In this case, should there be a need to resolve any 2FI confounding patterns, the design can be extended by performing the alternate fraction of the half fraction design.
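A minimal sketch of design redundancy (illustrative only): projecting the 2³ design onto X1 and X2 after dropping X3 shows that every remaining factor combination is duplicated, as in Table 11.7.

from itertools import product
from collections import Counter

runs_23 = list(product([-1, +1], repeat=3))
projected = [(x1, x2) for x1, x2, x3 in runs_23]     # ignore the dropped X3 column

counts = Counter(projected)
print(counts)     # each (X1, X2) combination appears twice: a replicated 2^2 design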

Table 11.6: 2³ design table for main effects only.

Term X1 X2 X3

1 – – –

a + – –

b – + –

ab + + –

c – – +

ac + – +

bc – + +

abc + + +

Table 11.7: Replicated 2² design table after X3 is removed.

Term X1 X2

1 – –

a + –
b – +

ab + +

c – –

ac + –

bc – +

abc + +

Plackett–Burman designs

Plackett–Burman designs are a special class of low resolution design built on a cyclic pattern of (–) and (+) signs. They are based on a
mathematical theory which makes it possible to study the main
effects of a large number (n – 1) of factors in n experiments, where n
is a multiple of four: 8, 12, 16, ... Plackett–Burman designs are used
when,
a) There is the need to study the effects of a very large number of
variables (up to 47), with as few experiments as possible.
b) All design variables have two levels (either Low and High limits of a
continuous range, or two categories, e.g. Starch A / Starch B).
As with factorial designs, each design variable is combined with
the others in a balanced way. Unlike Fractional Factorial Designs,
however, they have complex confounding patterns where each main
effect can be confounded with an irregular combination of several
interactions. Thus, there may be some doubt about the interpretation
of significant effects. Therefore, it is recommended not to base final
conclusions on a Plackett–Burman design only, but follow it up with a
more precise investigation with a Fractional Factorial Design of
Resolution IV or higher.
A general word of caution about using Plackett–Burman designs: use them only when there are no other alternatives available. Always aim to use a Fractional Factorial Design, ideally Resolution IV, in preference to a Plackett–Burman Design.
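The cyclic construction can be sketched as follows (illustrative only). A generating row of signs is cyclically shifted n – 2 times and a final row of all (–) signs is appended; the 12-run generating row used here is the commonly tabulated one and should be checked against a published table before use.

gen = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]   # assumed generating row for n = 12

n = len(gen) + 1
design = []
row = gen[:]
for _ in range(n - 1):
    design.append(row[:])
    row = [row[-1]] + row[:-1]        # cyclic shift by one position
design.append([-1] * len(gen))         # final row of all (-) signs

for r in design:
    print(r)                           # 12 runs for up to 11 two-level factors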
Factor influence designs

After a large number of factors have been screened down to a manageable number, a Factor Influence Study is performed using
either High Resolution Fractional Factorial Designs or Full Factorial
Designs. The purpose of a factor influence study is to fully understand
the main effects and interactions of the variables found to be
important.
For example, if the original screening design was able to reduce
10 potentially significant factors down to 5 important factors, there
are a number of designs available for better understanding these
factors. These designs are,
1) The 2⁵ Full Factorial Design in 32 experimental runs.
2) The 2⁵⁻¹ Resolution V half fraction in 16 experimental runs.
3) The 2⁵⁻² Resolution III quarter fraction in 8 experimental runs.
The Resolution III design will only be useful if at least one more
factor can be shown to be insignificant and therefore is a risky option,
but one that will pay off if there are more insignificant factors to be
eliminated. The full factorial design spends many runs estimating potentially insignificant higher-order terms. Therefore, the best choice in this situation would be the Resolution V design, where all main effects and 2FIs can be analysed free of major confounding. Again, with the 2⁵⁻¹ design, if at least one factor is found to be insignificant, then the design collapses to the 2⁴ full factorial design.
Factor influence studies are the precursor to optimisation designs.
They are typically used for generating a predictive model that can be
used to map the design space to a high level of confidence. They can
be used to extrapolate outside the original design space using the method of Steepest Ascent. Using the model generated, the factor values are stepped along the direction of maximum (or minimum) predicted response, depending on the objective of the design, and a new experiment is run at each predicted point. If the observed and predicted values are close, another point further along the path is selected and the comparison repeated. When the observed and predicted values start to differ, this indicates that a new local maximum (or minimum) region has been reached, and a new factor influence or optimisation design can be performed to better understand the stability of the new design centre.
The principle of Steepest Ascent is shown in Figure 11.17.
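A minimal sketch of a steepest ascent path in coded units (illustrative only; the coefficients and step size below are hypothetical, not taken from any example in this chapter):

b = {"X1": 2.0, "X2": -0.5}            # hypothetical fitted main-effect coefficients
step = 0.5                              # step size in coded units of the largest effect
largest = max(abs(v) for v in b.values())

point = {name: 0.0 for name in b}       # start at the design centre
path = []
for _ in range(5):
    point = {name: point[name] + step * b[name] / largest for name in b}
    path.append(dict(point))

for p in path:
    print(p)     # run an experiment at each point; stop when the response falls away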

Optimisation designs
Once the majority of important variables have been isolated and
analysed, the next step in the DoE process is usually to find the
“sweet spot” of the process/product. The purpose of an optimisation
design is to investigate the remaining variables at more than two
levels, so that more complex models, which account for curvature,
can reveal the nature and stability of the optimal point.
Since the purpose of such designs is to study what happens
anywhere within a given range of variation, optimisation designs can
only investigate factors which vary over a continuous range. As a consequence, if any categorical variables have been found to be significant, the best level of each such variable is chosen and fixed at the optimisation stage.
Optimisation designs consist of a set of experiments which study
at least three levels of the factors in a balanced way. The two most
common approaches to optimisation designs are,
Figure 11.17: The principle of steepest ascent.

Central Composite Designs (CCD), which use five levels of each factor.
Box–Behnken (BB) designs which use three levels of each factor.
The reason optimisation designs contain three to five levels is
because the intent is to fit models such as quadratic, cubic or quartic
polynomials to generate a Response Surface. The model contains the
following elements:
A linear part which consists of the main effects of the design
variables.
An interaction part which consists mainly of the two-variable
interactions.
A higher order polynomial part which describes the curvature of the
response surface.
A Response Surface is a map of the variables that can reveal
where local and global maxima (or minima) response values lie. They
are the key diagnostic plot of optimisation designs and can also be
used for optimising multiple responses simultaneously.
Central composite designs

By their definition, Central Composite Designs (CCD) are a composite of a regular factorial design augmented with what are known as Star Points that extend the design to five levels. Since the design can fit
Points that extend the design to five levels. Since the design can fit
quadratic through to quartic models, the centre points of the model
now take part in the model equation. This type of design extends the
factorial region beyond the original boundaries; therefore, pre-
planning must be done before a CCD is to be constructed. In
particular, if mass is a factor in the original factorial design and a
lower boundary of zero is placed on mass in the original factorial
design, then it will be impossible to extend this design to a CCD since
mass can never be negative.
A CCD is useful in the following situations,
a) For the optimisation of one or several responses with respect to
two to six design factors.
b) To extend the experiments performed using an existing full factorial
or high resolution fractional factorial design.
As mentioned previously, the design consists of two sets of
experiments:
The Cube and Centre samples from a Full Factorial or High-
Resolution Factorial Design.
Star samples, which provide the additional levels necessary to compute a higher order polynomial model.
The star samples combine the centre level of all variables but one,
with an extreme level of the last variable. The star levels (Low star and
High star) are respectively lower than Low cube and higher than High
cube. Usually, these star levels are such that the star samples have
the same distance from the centre as the cube samples. In other
words, all experiments in a CCD are located on the surface of a
circle/sphere around the centre points.
This property is called rotatability. It ensures that each experiment
contributes equally to the total information. As a consequence, the
model will have the same precision in all directions from the centre.
Figure 11.18 shows how the CCD is built.
The number of experiments in a CCD is fixed according to the
number of factors in the factorial design part. The number of centre
samples can be tuned between an “economical” value and an
“optimal” value. Table 11.8 provides the number of experimental runs
for some common CCD’s.
As mentioned above, there may be situations where, either due to
poor planning or by design, the factorial design cannot be extended
to a CCD due to a constraint regarding the possible values of the
factors. There is an optimisation design type, where star levels will be
the same as the cube levels. The star samples are located on the
centre of the faces of the cube. This is known as a Face Centred
Central Composite Design (FCC) and an example is shown in Figure
11.19.
There is a disadvantage, however, to changing the distance
between star samples and the model centre. Since the star samples
are no longer on the surface of a circle/sphere defined by the cube
samples, the design is no longer rotatable and as a result, the
precision of the face centred points is different from the cube points.
The two design types (CCD and FCC) offer a unique advantage to
the experimenter in the fact that they can be sequentially built. The
factorial design part is usually built first as part of a screening or
factor influence study. To extend the designs to a CCD (or FCC) the
star points can be added as a new block to the design. Blocking was
previously discussed in section 11.2.15.
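A minimal sketch of how a rotatable CCD is assembled in coded units (illustrative only; the function name and the choice of three centre points are assumptions):

from itertools import product

def ccd(k, n_centre=3):
    cube = [list(p) for p in product([-1.0, 1.0], repeat=k)]   # full factorial cube points
    alpha = (2 ** k) ** 0.25                                    # star distance for rotatability
    star = []
    for i in range(k):
        for s in (-alpha, +alpha):
            pt = [0.0] * k
            pt[i] = s
            star.append(pt)
    centre = [[0.0] * k for _ in range(n_centre)]
    return cube + star + centre

design = ccd(3)
print(len(design), "runs")    # 8 cube + 6 star + 3 centre = 17, cf. Table 11.8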
Figure 11.18: Geometrical representation of a central composite design with three variables.

When considering extending a factorial design into an optimisation design, it is important to add at least three centre points to the original design. The method for building the optimisation design is as follows,
The first block contains all Cube samples and half of the Centre
samples.
The second block contains all Star samples and the other half of
the Centre samples.
This is shown in Figure 11.20.

Table 11.8: Number of runs in a central composite design.

Design variables  Cube  Star  Centre  Total runs
2  4  4  3–5  11–13
3  8  6  3–6  17–20
4  16  8  3–7  27–31
5  32  10  3–10  45–52
6  64  12  3–15  79–91

Figure 11.19: Central composite design where low star = low cube, high star = high cube.

Box–Behnken designs

The Box–Behnken (BB) designs are an economical optimisation option, used primarily where extreme factor combinations cannot be physically run in a full factorial design. The BB designs have
physically measured in a full factorial design. The BB designs have
the “corners cut off” to allow extreme situations to be avoided. These
designs are built on three levels of the factors and therefore the
highest order polynomial fit is a quadratic model.
A BB design is useful in the following situations,
a) For the optimisation of one or several responses with respect to 3 to 21 factors.
b) For remaining inside the "cube" while still having a rotatable (or near-rotatable) design.
c) For building an economical optimisation design up front, knowing that a quadratic model is the highest order fit possible.
Figure 11.21 shows that the corners of the cube are not included
in the design. All experiments actually lie on the centres of the edges
of the cube. As a consequence, all experiments lie on a sphere
around the centre and this is why the design is rotatable.
The number of experiments in a BB design is fixed according to
the number of factors. The number of centre samples can be tuned
between an “economical” value and an “optimal” value. Table 11.9
provides the number of experiments to run for some common BB
Designs.
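A minimal sketch of the Box–Behnken construction for three to five factors (illustrative only; the function name and centre-point count are assumptions): for every pair of factors a 2² factorial is run with all other factors held at their centre level, and replicated centre points are appended. Larger BB designs (six or more factors) use a different incidence structure and are not covered by this sketch.

from itertools import combinations, product

def box_behnken(k, n_centre=3):
    runs = []
    for i, j in combinations(range(k), 2):          # every pair of factors
        for a, b in product([-1, 1], repeat=2):
            pt = [0] * k                            # remaining factors at centre level
            pt[i], pt[j] = a, b
            runs.append(pt)
    runs += [[0] * k for _ in range(n_centre)]      # replicated centre points
    return runs

print(len(box_behnken(3, n_centre=5)))   # 12 edge points + 5 centre points = 17, cf. Table 11.9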

Figure 11.20: Blocking with a central composite design.

11.2.17 Which optimisation design to choose in practice

The following are some useful rules when deciding which optimisation design to use.
In many cases, the data from a previous factorial design can be extended by using the Central Composite Design. In this case, blocking is required.
In order to avoid extreme situations (because they are likely to be
difficult to handle, or because the optimum is known not to reside
at the extremes of a cube design) the Box–Behnken design is
preferable.

Table 11.9: Number of runs in a Box–Behnken design.

Design variables  "Cube"  Centre  Total runs
3  12  5  17
4  24  5  29
5  40  3–6  43–46
6  48  3–6  51–54

Figure 11.21: Geometrical representation of a Box–Behnken design with three design variables.

In the case where the factors are limited to reside within a cube but
a rotatable design is desired, the Box–Behnken design is the only
one with that combination of properties.
The Box–Behnken design is more economical than the Central
Composite Design in terms of experiments that have to be
performed.
In the case where a factorial design was first constructed, but
provision was not previously made to extend to an optimisation
design (due to a lower boundary on one or more factors), then a Face Centred Central Composite (FCC) design may be the only choice available.

Analysis of factorial designs


Analysis of variance (ANOVA)
Analysis of Variance (ANOVA) is the main diagnostic table generated for nearly all designed experiments. In the case of DoE, the ANOVA model is the regression model [in particular the Multiple Linear Regression (MLR) model] of equation 11.3, and the total variability is partitioned as

SSTotal = SSReg + SSError

where SSTotal = Total Sum of Squares, the total variability that can be explained in the data set; SSReg = Regression Sum of Squares, the proportion of the total variation described by fitting a model to the data; and SSError = Residual Sum of Squares, the variability remaining after fitting a regression model.
In general, the total variability must be partitioned over what can
be explained in the data (SSReg) and what cannot be explained by the
model (SSError). Significant models have a much higher value of SSReg
compared to SSError.
Since a regression model is made up of the individual regressors
(i.e. the effects of the experimental factors), SSReg is further broken
down into terms for each effect in the ANOVA table. To test the
significance of a particular effect, the response’s variance accounted
for by that effect must be compared to the residual variance which
summarises experimental error. If the “structured” variance (due to
the effect) is no larger than the “random” variance (error), then the
effect can be considered negligible. Otherwise it is regarded as
significant.
In practice, the generation and analysis of an ANOVA table is
achieved through a series of successive computations.
First, several sources of variation are defined. For instance, if the
purpose of the ANOVA model is to study the main effects of all
factors, each factor is a source of variation. Experimental error is
also a source of variation.
Each source of variation has a limited number of independent ways
to cause variation in the data. This number is called the degrees of
freedom (ν).
Response variation associated to a specific source is measured by
a Sum of Squares (SS).
Response variance associated to the same source is then
computed by dividing the sum of squares by the number of
degrees of freedom. This ratio is called Mean Square (MS).
Once mean squares have been determined for all sources of
variation, F-ratios associated to every tested effect are computed
as the ratio of MSeffect to MSerror. These ratios, which compare
structured variance to residual variance, have a statistical
distribution which is used for significance testing. The higher the
ratio, the more important the effect.
Under the null hypothesis (chapter 2) that an effect’s true value is
zero, the F-ratio has a Fisher distribution. This makes it possible to
estimate the probability of getting such a high F-ratio under the null
hypothesis. This probability is called the p-value; the smaller the p-
value, the more likely it is that the observed effect is not due to
chance. Usually, an effect is declared significant if the p-value <
0.05 (significant at the 5% level). Other classical thresholds are
0.01 and 0.005.
The ANOVA results are traditionally presented in the format shown
in Table 11.10.
The calculations associated with the ANOVA table are discussed
in later sections of this chapter, however, before these calculations
can be described, a discussion on important effects is provided in the
next section.

11.2.18 Important effects

Before the ANOVA table can be constructed, a prior analysis of effects is typically performed. There are three common tools used to assess the importance of effects,
The Pareto Chart.
The Normal Probability Plot of Effects.
The Half Normal Probability Plot of Effects.

The Pareto chart

The Pareto Chart shows the t-values of each effect as a bar plot.
Figure 11.22 provides an example Pareto Chart for the Sports Drink
data provided in Table 11.1. In this chart, the t-value limit for the effects is shown as the lower limit and the Bonferroni Limit (a family-wise adjustment of the t-value limit that accounts for testing multiple effects simultaneously) is plotted as the upper limit.
Effects that lie between the two limits should be tested for
significance when selected for addition to the model. In fact, effects
that exceed the Bonferroni limit are almost certainly significant. In the
case of the Sports Drink example, the response is sweetness and
from the Pareto Chart, Sugar is the most contributing factor to
sweetness. Even to a non-expert, this outcome makes sense.
Furthermore, all three main effects and the Salt × Sugar
interaction contribute to sweetness. All other interaction terms do not
contribute to the model. This means that the ingredients Salt and Sugar also combine, antagonistically, in their effect on sweetness (since the AB interaction term is negative).

Table 11.10: Schematic representation of the ANOVA table.

Source of variation  DoF (ν)  SS  MS  F-Ratio  p-Value
A  νA = # levels – 1  SSA  MSA = SSA / νA  MSA / MSError  Calculated from standard tables
B  νB = # levels – 1  SSB  MSB = SSB / νB  MSB / MSError  Calculated from standard tables
C  νC = # levels – 1  SSC  MSC = SSC / νC  MSC / MSError  Calculated from standard tables
Residual  νError = N – νA – νB – νC – 1  SSError = SSTotal – SSA – SSB – SSC  MSError = SSError / νError

Note: In the Pareto Chart (and the Normal Probability/Half Normal Probability plots discussed below), Design-Expert® uses an orange colour bar when a factor contributes positively to the response and a blue bar when the factor contributes negatively to the response.
It is suggested to use the Pareto Chart as a confirmatory step only, after assessing the Normal Probability and Half Normal Probability plots, which are discussed as follows.

Normal probability plot of effects

When the ordered values of a sample are plotted against the expected ordered values from a normal population, normally distributed data will lie approximately on a straight line (chapter 2). Usually only a few effects turn out to be important. They show up as outliers on the normal probability plot, i.e. they do not lie on the straight line formed by the insignificant effects.
For two-level factorial designs, this plot can be used to choose
significant effects. An example Normal Probability Plot for the Sports
Drink data are shown in Figure 11.23.
From the Normal Probability Plot, the effects chosen in the Pareto
Chart lie off the straight line, with Sugar the most important
contributor.
Figure 11.22: Pareto chart of effects for sports drink example.

Half normal probability plot of effects


This is the most commonly used tool in DoE for selecting significant
effects. In this plot, effects are plotted on an absolute scale and the
straight-line fit should be adjusted to encompass only those lesser
effects (in the bottom left hand part of the plot). Any effects found to
lie to the right and below the straight line should be selected for
testing in the model. Figure 11.24 provides an example Half Normal
Probability Plot for the Sports Drink example.
To further show how the Half Normal Probability Plot works, a
theoretical Normal Distribution curve is plotted with the data in Figure
11.24. It should be visualised in the following way. The insignificant
effects lie within the 95% confidence intervals of the normal
distribution, i.e. their effects cannot be distinguished from zero. The
selected effects lie outside of the 95% confidence interval of the
normal distribution centred around zero, therefore, these effects are
significantly different from zero and their p-values < 0.05.
With Normal Probability and Half-Normal Probability Plots, a
Shapiro–Wilk test of normality is performed on the insignificant terms
to assess whether they come from a normal distribution of samples.
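A minimal sketch of constructing a half-normal plot of effects (illustrative only; the effect values are hypothetical, and the plot here places |effect| on the vertical axis, whereas Design-Expert® draws the transposed layout described above):

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

effects = {"A": 1.35, "B": 2.05, "C": 2.05, "AB": -1.05,
           "AC": 0.10, "BC": -0.12, "ABC": 0.08}             # hypothetical effect estimates

names = sorted(effects, key=lambda k: abs(effects[k]))
abs_eff = np.array([abs(effects[k]) for k in names])
m = len(abs_eff)
probs = (np.arange(1, m + 1) - 0.5) / m
quantiles = stats.halfnorm.ppf(probs)                         # theoretical half-normal quantiles

plt.scatter(quantiles, abs_eff)
for q, e, n in zip(quantiles, abs_eff, names):
    plt.annotate(n, (q, e))
plt.xlabel("Half-normal quantile")
plt.ylabel("|Effect|")
plt.show()      # effects far above the line through the small effects are candidates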

11.2.19 Hierarchy of effects

In the above example for the Sports Drink analysis, all main effects
and one interaction term were found to be significant, but what if a
significant interaction term was found which contains an insignificant
main effect?
If a significant interaction term contains an insignificant main
effect, the main effect must also be included in the model so as to
calculate the significant interaction term. This is called Hierarchy of
Effects and programs such as Design-Expert® will always include
insignificant main effects into the model when they take part in a
selected significant interaction.

11.2.20 Model significance


When all of the significant terms in the model have been selected
using one of the effects plots previously described, the next step is to
fit a regression model to the data based on the terms selected. For 2-
level factorial designs, there are two main model types that can be
fitted,
Figure 11.23: Normal probability of effects for sports drink example.
Figure 11.24: Half normal probability plot of effects for sports drink example.

Linear Model: contains only main effects.
Linear Model with Interactions: takes into account significant interactions.
These model types will be described with reference to the 2³ full factorial design.

Linear model: 2³ full factorial design

y = b0 + b1X1 + b2X2 + b3X3 + ɛ

where b0 is the intercept (or mean response) of the model; bi are the
regression coefficients to be estimated by least squares for each
factor in the model; Xi are the main terms selected in the model; and ɛ
is the residual term (i.e. what the model cannot explain).

Linear model with interaction terms: 2³ full factorial design

y = b0 + b1X1 + b2X2 + b3X3 + b12X1X2 + b13X1X3 + b23X2X3 + b123X1X2X3 + ɛ

where all terms are the same as for the linear model, except for, bij
are the regression coefficients to be estimated by least squares for
each interaction term and XiXj are the interaction terms selected in the
model.
Note: In the interaction terms, i ≠ j (in the case of the 3FI, i ≠ j ≠ k)
The following sections describe the general calculations involved
in generating the ANOVA table.

11.2.21 Total sum of squares (SStotal)

SStotal is the total variation around the mean response that can be
explained by a model. The larger SStotal is, the more likely a significant
model will result. If SStotal is small, this may be an indication that,
1) The range of the factors used was not large enough to induce a
detectable response or
2) The factors used contain no information related to the response.
SStotal is calculated using equation 11.4,

SSTotal = Σ(yi – ȳ)²     (11.4)

where the sum runs over all n responses yi and ȳ is the mean response. This value represents 100% of the information available to be explained by the model.
11.2.22 Sum of squares regression (SSReg)

SSReg is the variation described by the fitted model and accounts for a
specific proportion of SStotal. The nature of the fitted model can
change, based on the number of terms added to the equation. In this
case, each term in the model accounts for a specific proportion of
SSReg. The form of the model is determined by the design and the fit
of the model to the data. In particular,
1) The form of a screening design is linear as only the most significant
main effects are sought.
2) The form of a factor influence design is either purely linear or linear
with interaction terms.
3) The form of an optimisation design is anything from linear to quartic
in nature.
SSReg is calculated using equation 11.5,

SSReg = bᵗXᵗy – (Σyi)² / n     (11.5)

In equation 11.5, bᵗXᵗy is the variation described by the fitted model; therefore, if this term is large compared to the mean result [(Σyi)² / n], then the fitted model describes the data. However, if the fitted model only describes as much as the mean of the data, the difference is close to zero and the model does not adequately describe the data.
If model terms are added to (or removed from) the model, SSReg will change accordingly (as the terms in bᵗ change in equation 11.5). This is why SSReg is highly dependent on the terms used to fit the model. Adding terms that do not contribute to the model adds little to SSReg while consuming degrees of freedom and, conversely, leaving out important terms decreases SSReg. This is why SSReg is further broken down into individual sums of squares for each term in the model. This is performed according to equation 11.6, in which equation 11.5 is applied using only the b-coefficient and design column of the term in question (for a coded two-level factorial, the sum of squares of term j works out to N bj², where N is the number of runs and bj the coefficient of the term).
This process is performed for all terms used in the fitted model and the balance defined in equation 11.7 should exist if all calculations are performed correctly: SSReg equals the sum of the individual term sums of squares.
More details on SSReg will be presented by example in a later section of this chapter.

11.2.23 Residual sum of squares (SSError)

The variance in the data set that cannot be described by SSReg is known as the Residual. In general, the residual is the difference between SStotal and SSReg. This value is known as the residual sum of squares, SSError. The size of SSReg is compared to SSError and the ratio of the two values defines the signal-to-noise ratio of the model compared to random noise.
SSError is calculated according to equation 11.8,

SSError = SSTotal – SSReg     (11.8)

11.2.24 Model degrees of freedom (ν)


Degrees of Freedom (ν) are defined as the number of independent
ways a data set can vary. For a two-level factorial design (2ᵏ) each term in the model can vary between a high level (+) and a low level (–).
If the average response for a model factor is known and all but one of
the levels is unknown, then the missing level can be estimated. This
means that for each term (factor) in the model, there is associated
with it one degree of freedom.
Taking as an example the 2³ Full Factorial design, there are eight
experimental runs in total. Again, if the average response and all but
one response is known, then the missing response can be calculated.
This means that there are N – 1 independent ways the data can vary
and in this case, ν = 8 – 1 = 7 degrees of freedom for the un-replicated 2³ design. The full model is repeated below for
convenience,

y = b0 + b1X1 + b2X2 + b3X3 + b12X1X2 + b13X1X3 + b23X2X3 + b123X1X2X3

Excluding the intercept term, there are seven model terms and
each take up one degree of freedom. This results in a saturated
design, i.e. there are no degrees of freedom left to estimate the model
error term. So, how can more degrees of freedom be generated?
There are three options available to the experimenter,
1) Replicate the entire design: For small designs (i.e. k ≤ 4) this may
be possible, however, for large designs, or where there is cost
sensitivity, this option may not be a viable solution. For the 2³ design, a full replicate will result in 16 experiments, with 15 total
degrees of freedom and 15 – 7 = 8 degrees of freedom to estimate
the error contribution.
2) Add Centre Points: As described earlier in section 11.2.11, centre
points do not contribute to calculating model terms in factorial
designs, but serve the dual purpose of allowing the detection of
model curvature and providing degrees of freedom for calculating
error terms in saturated designs. In the case of the 2³ design, adding three centre points provides 3 – 1 = 2 degrees of freedom
for estimating model error.
3) Removing insignificant model terms: When a model term is found to be insignificant, it does not contribute to estimating SSReg. In this case, by removing the term from the model, the model becomes simpler and the excluded term's sum of squares is partitioned into SSError.
This is the simplest way to add degrees of freedom to a saturated
model as it does not require additional experimentation.
The only drawback to point 3 above occurs when every term
contributes to the model. In this case one of the approaches of points
1 and 2 can be performed to estimate model error and therefore
model significance. This is the topic of the next section.

Using F-test and p-values to determine significant effects

The F-test
Fisher's F-test was described in detail in chapter 2 for determining whether one source of variance is significantly different from another. In the case of DoE, the F-test is set up to determine whether the variance described by the fitted model is significantly different from the residual (i.e. the portion of the variance not described by the model). This situation is defined by equation 11.9,

FReg = (SSReg / νReg) / (SSError / νError)     (11.9)

FReg is compared with the F-distribution with three parameters: νReg, νError and the significance level (typically 5%). The critical F-value (Fcrit) is found in a statistical table, or is calculated by Design-Expert®. If FReg > Fcrit, then the model is regarded as significant with respect to random noise. The terms in equation 11.9, SSReg / νReg and SSError / νError, are known as the Mean Squares of Regression and Residual, respectively. Mean squares convert sums of squares into variances so that they can be compared statistically to each other using parametric tests such as the F-test.
Since each model term has one degree of freedom associated
with it, the mean square of a model term is its sum of squares. In a
similar approach described above, each model term mean square
can be compared to SSError to determine the significance of the terms
used in the model for describing the response. When a model term is
found to be insignificant (in the statistical sense), it can be removed
from the fitted model and partitioned into the degrees of freedom for
error.
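A minimal sketch of this F-test (illustrative only; the sums of squares used here are those of the sports drink model described later in this section, and scipy's generic F distribution is assumed to be available):

from scipy import stats

ss_reg, df_reg = 22.68, 4          # regression sum of squares and degrees of freedom
ss_err, df_err = 0.48, 3           # residual sum of squares and degrees of freedom

ms_reg = ss_reg / df_reg
ms_err = ss_err / df_err
f_ratio = ms_reg / ms_err
p_value = stats.f.sf(f_ratio, df_reg, df_err)    # upper-tail probability of the F distribution

print(f_ratio, p_value)            # roughly 35 and 0.008, cf. Table 11.13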

p-value
A complementary measure to the F-value is the p-value. The p-value is the probability of observing an F-ratio at least as large as the one obtained if the true effect of the model or a particular model term were zero. For example, pTerm A = 0.05 means that the effect of term A is significant at the 95% confidence level. p-values are best described with reference
to a histogram (see Figure 11.25).
If the p-value obtained is less than the significance level stated,
then the effect is considered to be significantly different from zero on
a statistical basis. Unfortunately, there are a few pitfalls that are
generally not known by many practitioners, these include,
The p-values are totally irrelevant if the number of degrees of
freedom for the estimate of the error is small, i.e. there were too
few replicates to estimate the error.
All p-values become low, i.e. all effects appear to be significant, if
the error is very small. This may occur if the reference method is
very accurate.

11.2.25 Example: building the ANOVA table for a 2³ full factorial design

The un-replicated 2³ Full Factorial design has seven model terms and
an intercept term. The model terms use all of the degrees of freedom
available (if all terms are used in the model). The sports drink example
(refer to Table 11.1) will be used to demonstrate the construction of
an ANOVA table.
In Figure 11.24, the half normal plot of effects showed that the Salt, Sugar and Lemon main effects are important and that the AB interaction term is also important; the remaining terms were statistically insignificant. This means that the final model is the interaction model consisting of four terms, and the other model terms provide degrees of freedom for calculating SSError. The form of the model is provided as follows,

Sweetness = b0 + b1 × Salt + b2 × Sugar + b3 × Lemon + b12 × Salt × Sugar

Figure 11.25: p-values as related to the significance of an effect.

It is now a matter of applying Least Squares fitting to solve for the parameters b0, b1, b2, b3 and b12. The total degrees of freedom is equal to seven and the model terms account for four degrees of freedom; this means that three degrees of freedom are available to calculate the residual sum of squares.
Using equation 11.4, the total sum of squares is calculated to be,

SSTotal = 23.16

This value represents the total corrected sum of squares that can be explained around the mean of the response value (corrected implies that the data are mean centred). To calculate SSReg, equation 11.5 is used and, based on the normal probability plot, only four of the available seven model terms are used to calculate this value,

SSReg = 22.68

When compared to SSTotal, the value of SSReg accounts for a high proportion of the total variability that can be explained. This is a first indication that the fitted model adequately describes the response variable. SSReg can be further broken down into the contributions made by each term to the model. This is performed using a modification of equation 11.5 (i.e. equation 11.6), where only the b-coefficients of the effects are used to obtain the individual sums of squares. Table 11.11 provides the sums of squares for each model term.
Since each term only has one degree of freedom associated with
it, the sums of squares in Table 11.11 are also the mean squares for
each effect. The magnitude of each effect can be compared to each
other in order to determine the effects with the most influence on the
model. In this case, terms B and C (Sugar and Lemon) are the most important for describing sweetness. Salt is the next most important term and the least important is the Salt–Sugar interaction. The sums of squares of the effects in the model should add up to SSReg, i.e.

SSReg = 3.65 + 8.41 + 8.41 + 2.21 = 22.68

The Residual Sum of Squares (SSError) is calculated as the difference between SSTotal and SSReg, i.e.

Table 11.11: Model term sums of squares for the sports drink
example.

Model term Sum of squares

A 3.65

B 8.41

C 8.41

AB 2.21
SSError = 23.16 – 22.68 = 0.48

When compared to SSTotal, the residual sum of squares is small, therefore it can be concluded that the model describes the majority of the variance in the data, as evidenced by SSReg.
Now that the sums of squares have been calculated, the next step
is to calculate the Mean Squares for each term in the ANOVA table.
The degrees of freedom associated with each term is listed as
follows,
Total: 7 (8 experiments – 1).
Model: 4 (one for each term in the model).
Residual: 3 (Total ν – Model ν)
The respective Mean Squares for the model terms are listed in
Table 11.12.
Since Mean Squares can be considered as variances, the model and its corresponding terms can be tested for statistical significance using an F-test. The F-test compares the mean squares of the model and of its terms to the mean square of the residual. If this ratio is significantly different from one, then the null hypothesis that the variances are equal is rejected in favour of the alternative hypothesis that the variances are different, and therefore the model describes a significant amount of the total variance with respect to noise.
The corresponding p-values are calculated for each F-Ratio using
software such as Design-Expert® and are typically displayed
alongside the F-statistics in the ANOVA table. Table 11.13 shows the
complete ANOVA table for the sports drink example.
Any p-value < 0.05 is considered to be statistically significant at
the 95% confidence level.
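As a minimal sketch (illustrative only, assuming scipy is available), the term-by-term F-ratios and p-values of Table 11.13 can be reproduced from the sums of squares of Table 11.12; small differences from the printed values arise from rounding of the sums of squares.

from scipy import stats

ms_error, df_error = 0.48 / 3, 3                 # residual mean square and degrees of freedom
terms = {"Model": (22.68, 4), "A": (3.65, 1), "B": (8.41, 1),
         "C": (8.41, 1), "AB": (2.21, 1)}        # sum of squares and degrees of freedom

for name, (ss, df) in terms.items():
    ms = ss / df
    f = ms / ms_error
    p = stats.f.sf(f, df, df_error)
    print(f"{name:>5}  MS = {ms:5.2f}  F = {f:6.2f}  p = {p:.4f}")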

11.2.26 Supplementary statistics

The ANOVA table is the central point of DoE as it defines how information and noise are partitioned within the data. To help in the
interpretation of the model, there are a number of supplementary
statistics and graphical tools available to further evaluate and
interpret the model. A full discussion of all of the statistics available
would require an entire text on its own and only the most commonly
used statistics will be discussed here. The interested reader is
referred to the excellent textbooks by Montgomery [2] or Anderson
and Whitcomb [3] for more information of the statistics available for
the analysis of a designed experiment.
The output of a software package usually provides the following
statistics along with the ANOVA table,

Table 11.12: Mean squares for the sports drink example.

Term Sum of squares Degrees of freedom Mean square

Model 22.68 4 5.67

A 3.65 1 3.65

B 8.41 1 8.41

C 8.41 1 8.41

AB 2.21 1 2.21

Residual 0.48 3 0.16

Table 11.13: ANOVA table of the sports drink example.

Source of variation  Sum of squares  Degrees of freedom  Mean squares  F-Ratio  p-value

Model 22.67 4 5.67 34.69 0.0076

A 3.65 1 3.65 22.32 0.0180

B 8.41 1 8.41 51.46 0.0056

C 8.41 1 8.41 51.46 0.0056

AB 2.21 1 2.21 13.50 0.0349

Error 0.48 3 0.16


Total 23.16 7

Standard Deviation: This is a measure of the spread in the response data and, like the sums of squares, defines the amount of information available to model.
Mean: This is the arithmetic mean of the response data.
% Coefficient of Variation: This is the ratio of the Standard
Deviation and the Mean of the response expressed as a
percentage of total variation around the mean value.
Predicted Residual Sum of Squares (PRESS): This is a measure of how well a particular model fits each point in the design. The coefficients for the model are calculated using Leave One Out cross validation (refer to chapter 8) and the new model's prediction is subtracted from the "deleted" observation to find the predicted residual. PRESS is calculated using equation 11.10 (also refer to chapter 7 on multivariate regression),

PRESS = Σ [ei / (1 – hii)]²     (11.10)

where ei is the residual for sample i and hii is the leverage of sample i.
R-Squared: This is the proportion of the model sum of squares (SSReg) divided by the total sum of squares (SSTotal). The general form of the expression is provided in equation 11.11,

R² = SSReg / SSTotal     (11.11)

Adjusted R-Squared: This is a measure of the amount of variation around the mean explained by the model, adjusted for the number of terms in the model. The adjusted R-Squared decreases as the number of terms in the model increases if those additional terms do not add further information to the model. The general form of the expression is provided in equation 11.12,

R²Adj = 1 – (SSError / νError) / (SSTotal / νTotal)     (11.12)

Predicted R-Squared: This is a measure of the amount of variation in new data explained by the model. Predicted R-Squared is calculated using equation 11.13,

R²Pred = 1 – PRESS / SSTotal     (11.13)

As a general rule, R²Adj and R²Pred should be within 0.2 of each other. If this is not the case, then the analyst is advised to check the data for outliers, or to revise the form/fit of the model.
Adequate Precision: This is a signal-to-noise ratio that compares the range of the predicted values at the design points to the average prediction error (equation 11.14),

Adequate Precision = [max(ŷ) – min(ŷ)] / √(p σ² / n)     (11.14)

where p = number of model parameters [including the intercept (b0) and any block coefficients], σ² = residual mean square from the ANOVA table and n = number of runs in the experiment. As a general rule, ratios greater than 4 indicate adequate model discrimination. A short computational sketch of these statistics is given below.
The supplementary statistics provide estimates of goodness of fit
of the selected model to the data and should be used in conjunction
with the ANOVA table such that a decision to use or reject the model
is made.
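A minimal sketch of these supplementary statistics (illustrative only; the observed and fitted responses below are hypothetical placeholders, and the leverage is taken as p/n, which holds for an orthogonal two-level design):

import numpy as np

y     = np.array([4.1, 5.9, 6.8, 7.2, 3.9, 6.1, 6.6, 7.4])   # hypothetical observed responses
y_hat = np.array([4.0, 6.0, 6.7, 7.3, 4.0, 6.0, 6.7, 7.3])   # hypothetical fitted values
p, n  = 5, len(y)                                             # parameters (incl. intercept), runs

residuals = y - y_hat
leverage  = np.full(n, p / n)            # hat-matrix diagonal, equal for an orthogonal design

ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum(residuals ** 2)
ss_reg   = ss_total - ss_error

press   = np.sum((residuals / (1 - leverage)) ** 2)
r2      = ss_reg / ss_total
r2_adj  = 1 - (ss_error / (n - p)) / (ss_total / (n - 1))
r2_pred = 1 - press / ss_total

print(press, r2, r2_adj, r2_pred)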
To this point, the ANOVA table and the statistics associated with
models have not contained curvature or blocking effects. When these
effects are present, the form of the ANOVA table and supplementary
statistics change slightly.

Blocking and the ANOVA table


The concept of Blocking was introduced in section 11.2.15 and its contribution to the ANOVA table will now be discussed. In the statistical literature, blocking breaks the rule of random sampling, which is one of the fundamental assumptions of least squares model fitting. This is because when a design is blocked, samples are chosen systematically from the design (based on higher order interactions) and assigned to the number of blocks chosen. The best an analyst can do in this case is to randomise the runs within each block.
Therefore, in the ANOVA table, if a blocking effect is calculated, its
sum of squares (and consequently the Mean Squares) cannot be
tested for statistical significance. Only the size of the Sum of Squares
can be subjectively compared to other effects in the model to
determine whether the magnitude of the blocking effect contributes
as much as any significant effect. If the effect of blocking is found to
be large, then the assumption made is that the experiment could not
be identically performed over the course of the blocks. In this case,
adding the blocks together and analysing the model may lead to bias
in the results and therefore a misleading diagnosis.
To demonstrate the way a blocking effect is calculated, an
example using a 2⁴ full factorial design in two blocks will be used. It
was found in the design of a certain chemical process that a
particular raw material (labelled A) was not available in sufficient
quantity in one batch to perform all experiments. However, if the
experiment was to be performed in two equal sets of runs, then two
lots of material A were available to perform the experimental design.
In this case, the design will be blocked based on raw material lot and
the effect of different lots will be assessed by the block effect.
Some background information about the chemical process is
provided as follows. There are four potentially influencing factors
associated with product yield for this process. These were
determined mainly from past experience with the process, however,
like most industries, process shutdowns and incomplete
understanding of business-critical processes are starting to show up
in a bad light on the bottom line of the company’s financial
statements. Rather than do what the majority of businesses would
do, i.e. bring too many senior managers into the meeting room to talk
the problem to death and then decide to cut a cost somewhere, this
company has decided to let its technical organisation run with the
DoE methodology to improve the process, without the need of
process modification or blindly reducing operational costs.
The process engineer knows that it will require 16 experimental runs (2⁴) plus centre points to perform a full factorial design; however, the half fraction design would only require 8 experimental runs. The risk
here is that if all four factors are significant, more experiments will
have to be performed. With the mandate of the company to fix the
problem, the engineer has the opportunity to complete the full
factorial experiment.

Experimental approach—Define stage


After performing a brainstorming session with the personnel with the
most intimate knowledge of the process, the four controllable factors
of Temperature, Pump Pressure, pH and Stirring were defined as
those that will have the most impact on product yield. After a
discussion with the control system engineers and production
operators, the ranges of testing of the four factors are provided in
Table 11.14.
In the case of the 2⁴ full factorial design, the highest order interaction term is the 4FI term ABCD. This term will be aliased with
the block, with the (-) values of this term being partitioned into block 1
and the (+) terms partitioned into block 2. The partitioning of the
experiment and the responses for yield of product are provided in
Table 11.15.
Table 11.15 shows the blocking variable ABCD and its relationship
with the column labelled Block, i.e. when ABCD = –1, then these
experiments are performed in block 1 and vice versa. The half normal
plot of effects for this design is shown in Figure 11.26 where it was
found that the effect of Pressure and its interactions are all
insignificant. This means that all terms containing pressure can now
be partitioned into degrees of freedom for error.
The first thing to note about Figure 11.26 is that the block term
(ABCD) does not have a high sum of squares value. This is the first
step in deciding that the block effect is not influencing the analysis of
the design.
The corresponding ANOVA table for these data is provided in Table 11.16.
The Block Sum of Squares, and hence the corresponding Mean Square, is 6.25. When compared to the least significant model term (pH, with a sum of squares of 322), the effect of the block is small, and the conclusion is that the effect of different lots of raw material has no influence on the yield; therefore, different lots can be used per batch of reaction.
SSTotal is now the total of all sums of squares, i.e.

SSTotal = SSReg + SSError + SSBlock

Whereas before, SSBlock would have been partitioned into SSError, it is now separated out, resulting in a reduction of SSError.

Centre points, curvature and the ANOVA table


The concept of Centre Points and Curvature was introduced in section 11.2.11 and their contribution to the ANOVA table will now be discussed. As previously stated, centre points do not contribute to the estimation of the factorial model terms, since they would add an extra level and make the design non-orthogonal. They are used for two main purposes,
1) As a source of degrees of freedom for calculating error and
2) As a Lack of Fit (LoF) test of the linear model.
When centre points are available, an ANOVA table can be
constructed with a Curvature term included for reference only, i.e. the
original model can only fit a linear model and since centre points do
not contribute to the model calculation, any form of curvature (non-
linear) terms will only provide an indicative estimate of lack of fit. The
Sum of Squares due to Curvature is calculated using equation 11.15.

Table 11.14: Ranges of factor variation for the chemical process investigation.

Factor  High level  Low level  Reasoning
Temperature (°C)  100  80  Below 80°C the reaction may start to slow down to the point where nothing occurs, while above 100°C, impurities are likely to form.
Pressure (psi)  400  300  Below 300 psi, recirculation issues could occur and above 400 psi, unnecessary wear and tear on the pump may occur.
pH  8  6  Above and below these pH ranges may lead to unwanted impurities.
Stirring rate (rpm)  250  150  Below 150 rpm, it is believed that not enough mixing energy is available to maintain the reaction, and above 250 rpm unnecessary load is put on the mixing impeller.

Table 11.15: 2⁴ Full factorial design in two blocks.

Run  Temp (°C)  Pressure (psi)  pH  Stirring (rpm)  ABCD  Block  Yield (%)

a 1 –1 –1 –1 –1 1 69.5

b –1 1 –1 –1 –1 1 48.6

c –1 –1 1 –1 –1 1 66.8

abc 1 1 1 –1 –1 1 64.1

d –1 –1 –1 1 –1 1 44.1

abd 1 1 –1 1 –1 1 99.5
acd 1 –1 1 1 –1 1 83.2

bcd –1 1 1 1 –1 1 68.6

1 –1 –1 –1 –1 1 2 45.9

ab 1 1 –1 –1 1 2 64.1

ac 1 –1 1 –1 1 2 59.5

bc –1 1 1 –1 1 2 77.7

ad 1 –1 –1 1 1 2 95.9

bd –1 1 –1 1 1 2 45.9

cd –1 –1 1 1 1 2 73.2

abcd 1 1 1 1 1 2 92.3

Equation 11.15 is

SSCurvature = nF nC (ȳF – ȳC)² / (nF + nC)     (11.15)

where nF is the number of Factorial Points in the design, nC is the number of Centre Points, ȳF is the mean response for all Factorial Points and ȳC is the mean response for all Centre Points.
Equation 11.15 aims to detect whether there is a significant difference between the average of the factorial points and the centre points by estimating a quadratic term. This is different from interaction terms in the model, which introduce "twisting" of the model surface out of plane. The calculation of curvature uses a single degree of freedom and therefore a Mean Square for Curvature can be calculated.
For the Chemical Process data in Table 11.15, six centre points
were added to the design, three were assigned to each of the two
blocks. This is done to ensure that a comparison between the two
blocks can be made using real, replicated data. The values of the
centre points are provided in Table 11.17.
A first inspection of the data in Table 11.17 shows that the centre
points within the blocks have similar variance. The mean response for
block 1 is 68.2 with a standard deviation of 2.95 and the mean
response for block 2 is 67.5 with a standard deviation of 2.35. The means of the responses lie within one standard deviation of each other, and it can be shown using a t-test (chapter 2) that these centre point replicates are not significantly different across the two blocks.

Figure 11.26: Half normal probability plot of effects, chemical reaction data.

Table 11.16: ANOVA table of the chemical process data with one blocking variable.

Source of variation  Sum of squares  Degrees of freedom  Mean squares  F-ratio  p-value
Block  6.25  1  6.25
Model  4575  5  915  53.13  <0.0001
Temp (A)  1546  1  1546  89.76  <0.0001
pH (C)  322  1  322  18.72  0.0019
Stirring (D)  707  1  707  41.05  0.0001
AC  1086  1  1086  63.05  <0.0001
AD  913  1  913  53.05  <0.0001
Error  155  9  17
Total  4736  15

The SSCurvature term is calculated (Equation 11.15) as

SSCurvature = 16 × 6 × (68.68 – 67.83)² / (16 + 6) ≈ 3.1

This is also the value of MSCurvature, since only one degree of freedom is used to calculate curvature. For this particular analysis, the Total Sum of Squares is the summation of the following,

SSTotal = SSReg + SSBlock + SSCurvature + SSError

SSError is calculated by difference to be 187.8. The next step is to determine the number of degrees of freedom for the residual. This changes from the original design since centre points have been added, and these points also contribute to SSTotal as they provide an estimate of pure error in the analysis. The following is a systematic workflow for calculating degrees of freedom for the chemical process data.

Step 1: Total degrees of freedom.


Factorial Points = 16
Centre Points = 6
Total Degrees of Freedom = 16 + 6 – 1 = 21

Step 2: Model degrees of freedom.


From Table 11.16, the terms used in the model are Temperature, pH, Stirring and the two interaction terms Temperature × pH and Temperature × Stirring. This results in a total of five degrees of freedom for regression.

Table 11.17: Centre points collected on the chemical process data with one blocking variable.

Block Centre point replicate (yield %)

1 71.2

68.1

65.3

2 65.2

67.3

69.9

Step 3: Block degrees of freedom.


There are two blocks in the model, and the four-factor interaction term ABCD is used to assess the block effect. Each term in the model uses one degree of freedom, which results in a total of one degree of freedom for blocks.

Step 4: Curvature degrees of freedom.


There is a single degree of freedom used to assess curvature.

Step 5: Calculate total non-residual degrees of freedom.

νNon-residual = νReg + νBlock + νCurvature = 5 + 1 + 1 = 7

Note: If any of the Block or Curvature terms are not required in the
calculation, they can be removed as necessary. In the case of no
blocks and curvature, the standard partitioning of sum of squares
between model and residuals holds.
Step 6: Residual degrees of freedom.
The residual degrees of freedom can now be calculated by difference,

νError = νTotal – νNon-residual = 21 – 7 = 14

This results in 14 degrees of freedom to estimate the uncertainty


of the fitted model.
The ANOVA table, including curvature, can now be constructed and is provided in Table 11.18.
In Table 11.18, the p-value for curvature was calculated to be
0.6314. This indicates that curvature is not significant and there is no
reason to suggest that the linear model is not an appropriate fit to the
data. Since the curvature term cannot be formally used when fitting a
purely linear model, the assessment of curvature is a first lack of fit
test. If the curvature was found to be significant, then the analysis
would stop at this point and either the construction of a quadratic
model would be considered, or the ranges of the factor values would
be reconsidered in order to produce a linear response model.
In this example, since curvature is insignificant, the formal ANOVA
table for this model is provided in Table 11.19, with curvature
removed.
In Table 11.19, when the curvature term is removed, its contribution is not repartitioned into the model; instead it is pooled into the error term, which adds an extra degree of freedom to the residual.
11.2.27 Pure error and lack of fit assessment

When a design is either replicated, or a number of centre points have


been added to a model, the opportunity exists to calculate the
replicate error in the design. In many practical situations, replicating
an entire design may not be possible due to budget or experimental
condition constraints and this is another reason why centre points are
a popular design choice. The main assumption regarding centre
points is that the variance at the centre of the design is the same as
the extremes of the design. If that assumption holds, then by
performing a number of replicated centre points, a term for Pure Error
(PE) can be calculated.
Note: Pure Error can only be calculated for points that have been
replicated in the design.
To this point, when an ANOVA table has been presented, the
residual has been presented as a single error term. When replicates
are present, the residual can be further partitioned as follows,

SSError = SSPE + SSLoF

where SSPE is the Sum of Squares due to PE and is calculated using


equation 11.16.

It is important to note that equation 11.16 only holds for centre point calculations of pure error in non-blocked designs. For Pure Error calculations involving replicated or blocked designs, the interested reader is referred to the text by Myers and Montgomery [4].
SSLoF is the Sum of Squares due to the model not fitting its
intended form, i.e. if a linear model is fitted to quadratic data, then
this would result in an inflation in the error due to Lack of Fit (LoF).
The Sum of Squares due to Lack of Fit is calculated using equation
11.17.
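The display equations are not reproduced here; the standard textbook forms that equations 11.16 and 11.17 correspond to are sketched below, where yi are the nC centre point responses with mean ȳC, and ȳj and ŷj are the observed mean and the model prediction at each of the m distinct design levels with nj replicates (notation assumed).

```latex
% Assumed standard forms of the pure error and lack-of-fit sums of squares
SS_{PE}  = \sum_{i=1}^{n_C} \left( y_i - \bar{y}_C \right)^2
           \qquad \text{(cf. equation 11.16, centre-point replicates only)}

SS_{LoF} = \sum_{j=1}^{m} n_j \left( \bar{y}_j - \hat{y}_j \right)^2
           \qquad \text{(cf. equation 11.17)}
```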

Table 11.18: ANOVA table of the chemical process data with one
blocking variable and curvature.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F-ratio   p-value

Block                  2.77             1                    2.77
Model                  4575.0           5                    915.0         68.2      <0.0001
Temp (A)               1546.0           1                    1546.0        115.2     <0.0001
pH (C)                 322.0            1                    322.0         24.0      0.0002
Stirring (D)           707.0            1                    707.0         52.7      <0.0001
AC                     1086.0           1                    1086.0        81.0      <0.0001
AD                     914.0            1                    914.0         68.1      <0.0001
Curvature              3.23             1                    3.23          0.24      0.6314
Error                  188.0            14                   13.41
Total                  4769.0           21

Table 11.19: ANOVA table of the chemical process with curvature removed.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F-ratio   p-value

Block                  2.77             1                    2.77
Model                  4575.0           5                    915.0         68.2      <0.0001
Temp (A)               1546.0           1                    1546.0        115.2     <0.0001
pH (C)                 322.0            1                    322.0         24.0      0.0002
Stirring (D)           707.0            1                    707.0         52.7      <0.0001
AC                     1086.0           1                    1086.0        81.0      <0.0001
AD                     914.0            1                    914.0         68.1      <0.0001
Error                  191.0            15                   12.74
Total                  4769.0           21

Equation 11.17 is the weighted Sum of Squares between the mean response ȳi at each of the m levels of the design and the corresponding predicted value ŷi from the fitted model.
The concepts of PE and Lack of Fit are shown in Figure 11.27.
From Figure 11.27, the main goal is to determine whether the pure error portion is so large that a real lack of fit in the model cannot be detected. If the PE term is too large, it will mask any real deviations of the predictions from the model and make R2 and related fit statistics unreliable. If PE is small with respect to LoF, then an objective assessment of model fit can be made.
In order to perform a formal test of significance for error that
determines the highest contribution to its source, the degrees of
freedom for PE and LoF must be calculated. For PE resulting from
centre points in an unblocked design, the degrees of freedom are just
the number of points minus 1, i.e.

νPE = nC – 1

Just as the total residual degrees of freedom are obtained by difference, so too can the degrees of freedom for SSLoF be calculated by difference,

νLoF = νError – νPE

The formal statistical test for LoF is the F-ratio defined in equation 11.17a,

F = MSLoF / MSPE

where MSLoF and MSPE are the respective mean squares for LoF and PE. Equation 11.17a shows that the LoF test compares the deviation of the design-level means from the fitted model with the within deviation of the replicate measurements.
The calculation of LoF and PE will be demonstrated using the chemical process data, this time with no blocking variable, so that the six centre points are treated together with the factorial runs as a single unblocked data set. The ANOVA table for this design (this time excluding the block effect calculated previously) is presented in Table 11.20.
From Table 11.20, the p-value for LoF is 0.1557, which is insignificant. The conclusion is that there is no reason to suggest that the linear model is an inappropriate fit to the data and that the five terms in the model contribute significantly to predicting reaction yield.
The Mean Square of PE is calculated from the pooled centre point replicates as

MSPE = SSPE / νPE = 29.0 / 5 ≈ 6.0

Figure 11.27: Distinction between pure error and lack of fit.

The Sum of Squares for LoF is calculated by difference,

SSLoF = SSError – SSPE = 194.0 – 29.0 ≈ 164.0

(the tabulated values are rounded), giving MSLoF = 164.0 / 11 ≈ 15.0. The formal test of significance is then F = MSLoF / MSPE = 2.55, with the corresponding p-value of 0.1557 reported in Table 11.20.
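A minimal Python sketch of this pure error / lack of fit partition, using the pooled centre point replicates and the residual line of Table 11.20; the exact figures differ slightly from the rounded tabulated values.

```python
# Sketch of the pure error / lack-of-fit partition for the unblocked chemical process model.
import numpy as np
from scipy import stats

centre = np.array([71.2, 68.1, 65.3, 65.2, 67.3, 69.9])   # Table 11.17, pooled over blocks

ss_pe = np.sum((centre - centre.mean()) ** 2)   # ~29.3 (Table 11.20 reports 29.0)
df_pe = len(centre) - 1                         # 5

ss_error, df_error = 194.0, 16                  # residual line of Table 11.20
ss_lof = ss_error - ss_pe
df_lof = df_error - df_pe                       # 11

ms_lof, ms_pe = ss_lof / df_lof, ss_pe / df_pe
f_lof = ms_lof / ms_pe
p_lof = stats.f.sf(f_lof, df_lof, df_pe)
print(f"F = {f_lof:.2f}, p = {p_lof:.3f}")      # F ~ 2.5, p ~ 0.16, cf. Table 11.20
```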

This section described how to construct an ANOVA table for the analysis of a linear model, based on a full factorial design, under a number of different circumstances. Describing every situation is outside the scope of this introductory text; however, with a little experience, the fundamentals described in this chapter can easily be adapted to more complex situations. Of course, software is used to perform these calculations on a routine basis, but a fundamental understanding of how the calculations are performed is still necessary.
The next sections discuss the graphical tools and diagnostics used to confirm the model's fit to the data after the ANOVA table has been assessed.

Table 11.20: ANOVA table of the chemical process showing LoF and
PE.

Source of variation   Sum of squares   Degrees of freedom   Mean square   F-ratio   p-value

Model                  4575.0           5                    915.0         75.5      <0.0001
Temp (A)               1546.0           1                    1546.0        127.6     <0.0001
pH (C)                 322.0            1                    322.0         26.6      <0.0001
Stirring (D)           707.0            1                    707.0         58.4      <0.0001
AC                     1086.0           1                    1086.0        89.7      <0.0001
AD                     914.0            1                    914.0         75.4      <0.0001
Error                  194.0            16                   12.0
LoF                    164.0            11                   15.0          2.55      0.1557
PE                     29.0             5                    6.0
Total                  4769.0           21

11.2.28 Graphical tools used for assessing designed experiments

To date, the ANOVA table has been described as the main point of
the analysis of a designed experiment. There are a number of highly
important graphical tools also available to supplement the numerical
outputs of the ANOVA table. For the purpose of this text, the
following diagnostics tools will be described,
1) The Normal Probability plot of Residuals;
2) The Residual vs Predicted plot;
3) The Residual vs Run plot; and
4) The Predicted vs Actual plot.
To supplement the above diagnostics tools, there are a number of
model interpretation tools available. The following will be discussed in
more details in this section,
1) One-Factor and Interaction plots;
2) Contour and Response Surface plots; and
3) Cube plots.
Model diagnostics plots

Since the analysis of designed experiment data is based on least squares model fitting, there are a number of statistical assumptions that must hold when fitting the data, none more important than the assumption of normality of the residuals. The residuals should be normally distributed around zero with constant variance along the entire regression line. If this assumption holds, the least squares fit is considered appropriate for the form of the model chosen.

Normal probability plot of residuals


To assess the normality of residuals, there are a number of statistical
tests available (Kolmogorov–Smirnov or Shapiro–Wilk etc., see
chapter 2). While these tests provide a numerical value for the test of
normality, the human brain is more attuned to graphical outputs. The
simplest graphical tool available for assessing the normality of
residuals is the Normal Probability plot. In section 11.9.2, the Normal Probability plot was introduced as a means of selecting experimental factors whose effects significantly differed from zero. In the case of model diagnostics, large deviations of the residuals indicate lack of fit of the data
to the model. The Normal Probability plot was also described in
chapter 2 and it is a plot of the Expected Probability of the residual
(on a log scale) vs either the residual, or some transformation of the
residuals (known as standardisation or Studentisation).
When the actual residuals are plotted against their expected
values, if a straight line is observed, then the assumption of normality
cannot be rejected. The Normal Probability plot of residuals for the
Chemical Process data is shown in Figure 11.28.
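As a generic sketch (not the Design-Expert® output), such a plot can be produced for any vector of model residuals with scipy and matplotlib; the residuals below are placeholders.

```python
# Sketch: normal probability plot of model residuals.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# placeholder residuals for illustration; substitute the residuals of the fitted model
residuals = np.random.default_rng(0).normal(loc=0.0, scale=3.5, size=22)

stats.probplot(residuals, dist="norm", plot=plt)   # ordered residuals vs theoretical quantiles
plt.title("Normal probability plot of residuals")
plt.show()
```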
When it can be shown that the residuals are normally (or near
normally) distributed, other diagnostics tools can then be assessed to
further understand and interpret the model. In the case where there is
an apparent non-normality, the following diagnostics tools are useful
for identifying the cause.

Residuals vs predicted plot


The assumption of normality should hold over the entire regression
line. In order to test that this assumption is valid, the Residuals vs
Predicted plot is used to detect deviations. When the variance is
constant over the entire line, this is known as homoscedasticity.
Since the optimal value of a residual is zero (i.e. perfect fit of the
model to the data), a good model fit should display small (yet
constant) residuals for all fitted data. When the residuals start to “fan”
out at the extremes of the line, this may indicate a violation of the
principles of least squares fitting. This is known as heteroscedasticity
and when this occurs, it is usually an indication of the measurement
equipment’s lack of precision over the range of values assessed by
the model. In this case, it is the job of the scientist or engineer to
either replace the current measurement system with a more suitable
one, or use multiple measurement systems that are optimised for
specific regions.
The Residuals vs Predicted plot for the chemical process data is provided in Figure 11.29.
In Figure 11.29, there is no suggestion that the residuals have
non-constant variance and Figure 11.30 confirms this by plotting the
Studentised Residuals vs Predicted values. In this plot, limits are
placed at ± 3 standard deviations around the zero residual. If any
residual were to exceed this limit, there would be evidence enough to
suggest that there is either a violation of the least squares fit of the
data, or that at least one outlier is present in the data set.
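A hedged sketch of how internally studentised residuals and the ±3 reference limits could be computed from a design matrix X and a residual vector e, using the common definition rᵢ = eᵢ / (s·√(1 − hᵢᵢ)) with hᵢᵢ the hat-matrix leverages; the function and names are illustrative, not the Design-Expert® implementation.

```python
# Sketch: internally studentised residuals with +/-3 standard deviation reference limits.
import numpy as np

def studentised_residuals(X, e, df_error):
    """r_i = e_i / (s * sqrt(1 - h_ii)), where h_ii are the leverages of X."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    h = np.diag(H)
    s = np.sqrt(np.sum(e ** 2) / df_error)        # residual standard deviation
    return e / (s * np.sqrt(1 - h))

# usage (illustrative): r = studentised_residuals(X, e, df_error)
# any |r_i| > 3 would flag a possible outlier or a violation of the least squares assumptions
```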
Figure 11.28: Normal probability plot of residuals, chemical reaction data.

Residuals vs run plot


Another important assumption of least squares fitting is the
assumption of random sampling. This is why designed experiments
are typically performed in a random (or near random) order. In section
11.2.14 some reasons for performing experimental runs in a random
order were given and these are repeated here for clarity. If the
experimental runs in a design were carried out in a purely systematic
order, effects such as operator fatigue, material carryover etc. may be
confused with real effects. This is also true regarding temperature
changes in a laboratory or a factory, for example, if a designed
experiment is run in standard order during the morning of a hot day,
then as the day progresses to noon, the outside temperature will
usually rise to its maximum value. If the measuring equipment is
temperature sensitive, then this rise in temperature may be
confounded with a main model effect, thus a false interpretation may
occur.
Even when the experimental runs are performed in random order, it is still very important to assess whether a systematic effect has found its way into the data. The most effective tool available to the analyst for detecting systematic trends is the Residuals vs Run plot. This plots the residuals versus the actual run order of the experiment and it should display a random distribution of points around the zero line. If any upward, downward or even cyclic trending is apparent, then the analyst must be careful when interpreting the results and must also make a careful effort to isolate and understand the cause of the trending.
Figure 11.31 shows the Residual vs Run plot for the chemical
process data. For this plot, the residuals have been Studentised.
The plot of Residuals vs Run shows that there was no systematic
trending in the data for the chemical process and that the points are
randomly distributed around zero.

Predicted vs actual plot


Once it can be assumed that the principles of least squares have not been violated, the final diagnostic plot to review is the Predicted vs Actual plot. Ideally this plot should have a slope close to 1 for a well fitted model and the R2 value should also be close to 1. In section 11.2.26, R2 was calculated as the ratio of SSReg and SSTotal. If SSReg describes most of SSTotal, then SSError (i.e. the residuals) must be small. When this is the case, the points will lie close to the regression line and the model can be considered reliable for future predictions of new values.
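For reference, a minimal sketch of the common definitions of these fit statistics (R² = 1 − SSError/SSTotal, the degrees-of-freedom adjusted R², and a PRESS-based predicted R² computed from the leave-one-out residuals eᵢ/(1 − hᵢᵢ)); this is a generic sketch, not necessarily the exact formulas used by Design-Expert®.

```python
# Sketch: R-squared, adjusted R-squared and predicted R-squared (PRESS-based).
import numpy as np

def fit_statistics(X, y, y_hat, n_params):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    n = len(y)
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_error = np.sum((y - y_hat) ** 2)
    r2 = 1.0 - ss_error / ss_total
    r2_adj = 1.0 - (ss_error / (n - n_params)) / (ss_total / (n - 1))
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)        # leverages
    press = np.sum(((y - y_hat) / (1.0 - h)) ** 2)       # leave-one-out prediction error
    r2_pred = 1.0 - press / ss_total
    return r2, r2_adj, r2_pred
```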
Figure 11.29: Residuals vs predicted plot, chemical reaction data.
Figure 11.30: Studentised residuals vs predicted plot, chemical reaction data.
Figure 11.31: Studentised residuals vs run plot, chemical reaction data.

The Predicted vs Actual plot is shown in Figure 11.32 for the


chemical process data. The fit of the data seems to be adequate for
the model selected.
Overall, model diagnostic plots are used to assess whether the fitted model violates the principles of least squares fitting. This is important since the intent of the model development is to use the results for making business-critical or research-based decisions. It also provides better assurance that interpretations are based on sound data and correctly fitted models. The next section describes the plots used for interpreting the main effects and interactions found to be significant in the ANOVA table.
11.2.29 Model interpretation plots

Once normality (or near normality) of the residuals has been established, the next step is to interpret the model, understand how it works and finally put it into practice. Model interpretation plots form the basis of
what most people have come to expect from DoE. In particular, the
Response Surface plot is typically the flagship plot used to show how
DoE works. The following subsections discuss the various
interpretation plots and their usage.

One-factor and interaction plots

After the significant main effects and interactions have been isolated and the final model is built, the change in the response when a factor moves from its low level to its high level can be investigated. The One-Factor plots in Design-Expert® allow an analyst to investigate the influence each factor has on the response variable and, most importantly, to visualise how the effect of one factor changes when the other factors are changed.
Figure 11.32: Predicted vs actual plot, chemical reaction data.

Using the Chemical Process data, the final model used factors
Temperature, pH and Stirring and the two-factor interactions of each
factor with Temperature. This means that if Temperature is to be
studied by itself, since it is influenced also by pH and Stirring, its
impact on Yield should also change as these other factors are
changed. Three scenarios are provided in Figure 11.33.
Since there are so many permutations, only the effect of pH on the temperature response is shown in Figure 11.33. In Figure 11.33a) the effect of temperature is shown when factors pH and Stirring are at their centre values. It shows that when all other factors are held constant, changing Temperature from its low to its high value has a positive influence on yield.
Figure 11.33b) shows how temperature affects yield when pH is moved to its low level (keeping stirring constant). This change in pH has a detrimental effect on yield when temperature is at its low level, but has little effect on yield when temperature is at its high level.
Figure 11.33c) shows how temperature affects yield when pH is moved to its high level (keeping stirring constant). In this case, slightly improved yields at low temperature are achieved compared to pH at its centre point, while when both temperature and pH are high, the yield is slightly reduced.
One-Factor plots are typically useful when main effects are not
involved in significant interactions. Factors involving significant
interactions are best interpreted using the Interaction Plot.
The same information as shown in Figure 11.33(a–c) can be summarised in the Interaction Plot of Figure 11.34.
Figure 11.35 shows the two-factor interaction plot for the
Temperature Stirring interaction.
The interpretation of Figure 11.35 is as follows: when stirring is set to its low value, its interaction with temperature is weak and there is no real change in yield; however, when temperature and stirring are both at their high levels, there is a marked synergistic effect and the yield increases far more than the two factors acting independently would suggest.
When the number of significant interactions becomes large, the
use of One-Factor and Interaction plots can become difficult and
sometimes confusing. What is required is a map that shows an
analyst the path to the optimal point. This is usually achieved with
Contour Plots and Response Surfaces.

Figure 11.33: One-factor plots for factor temperature, chemical reaction data.
Contour plots and response surfaces

These plots are the heart and soul of DoE. Much as a mountaineer uses a map to navigate through terrain, Contour plots (also known as Response Surfaces) provide a "top down" view of the experimental space. The size and shape of the contours are dictated solely by the form of the model used. In particular,
1) When there are no interactions between factors, the response
surface is a flat plane with straight contours.
2) When there are significant interaction terms, the plane becomes
twisted and the response surface starts to introduce curved
contours into the system.
3) When higher order models are used in Optimisation Experiments,
the response surfaces can become highly curved (since the models
are now based on quadratic, cubic or even quartic polynomials).
The response surface is a plot that shows how the response
variable changes over the experimental space studied for two
variables only. However, software packages such as Design-Expert®
allow the visualisation of response surfaces for the two variables
plotted and show how these change when other variables that are not
plotted are changed.
Figure 11.34: Interaction plot for factors temperature and pH, chemical reaction data.

Returning to the chemical process data, based on the two-factor interaction plots it was found that the best yields are obtained when Temperature is set to its high value, pH to its low value and Stirring to its high value. The response surface plot for these data is shown in Figure 11.36, plotting Temperature and pH together while maintaining stirring at its high level.
Figure 11.35: Interaction plot for factors temperature and stirring, chemical reaction data.

From Figure 11.36, there is a stable region where >90% yield can
be obtained when Temperature and Stirring are kept high and pH is
kept below its mid-point. Since response surfaces are mainly used for
Optimisation designs, much emphasis will be placed on their
interpretation. When used in a Factor Influence Study, these plots
show the best direction to take for locating the optimal point.
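A minimal sketch of how such a contour map could be drawn for two coded factors once a model has been fitted; the coefficients below are purely hypothetical placeholders and do not come from the chemical process model.

```python
# Sketch: contour (response surface) plot for two coded factors of a linear + interaction model.
import numpy as np
import matplotlib.pyplot as plt

# hypothetical coded-unit coefficients: intercept, factor A, factor C, AC interaction
b0, bA, bC, bAC = 70.0, 10.0, 3.0, 8.0

A, C = np.meshgrid(np.linspace(-1, 1, 101), np.linspace(-1, 1, 101))
y_hat = b0 + bA * A + bC * C + bAC * A * C        # predicted response over the coded region

cs = plt.contourf(A, C, y_hat, levels=15)
plt.colorbar(cs, label="Predicted response")
plt.xlabel("Factor A (coded)")
plt.ylabel("Factor C (coded)")
plt.title("Contour plot sketch (hypothetical coefficients)")
plt.show()
```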

Cube plots

Cube Plots are unique to the 2^k designs as a consequence of the orthogonality of the design. They are built based on the number of factors studied, i.e. a one-factor model would result in a straight line, two factors in a square, three factors in a cube, etc.
They are particularly useful for studying up to four factors simultaneously, but beyond that they can become tedious very quickly. The Cube Plot for the chemical process data is shown in Figure 11.37.

Figure 11.36: Response surface plot, chemical reaction data.

In Figure 11.37, the arrow shows the direction of maximum


response for the chemical process data. This plot also confirms the
results found using the other model interpretation plots (i.e. High
Temperature, Low pH and High Stirring) and can serve as a quick
summary of the entire model.
There are numerous other statistics and model diagnostics
available for an analyst to assess the quality of a model, however, for
the purposes of an introductory text, these are left to the interested
reader to follow up after the tools in this section have been mastered.

11.2.30 The chemical process as a fractional factorial design

Design stage
Following on from the chemical process design of Table 11.15, the process engineer, being a clever person, has decided to use DoE to build the design and perform the analysis. After opening the Design-Expert® package, a number of design options were presented. However, this process engineer has done all of the necessary homework and decides on a 2^(4–1) fractional factorial design. The design table is presented in Table 11.21.
Figure 11.37: Cube plot, chemical reaction data.

The Half-Normal plot of Effects is provided in Figure 11.38.


Comparing this figure with Figure 11.26 for the full factorial design, it
can be seen that the same main effects and interactions can be found
using the fractional factorial design.
There are three significant main effects and two significant interaction terms: Temperature, pH, Stirring, Temperature × pH and Temperature × Stirring. The engineer now recalls the risk of using the resolution IV fractional factorial design, i.e. that all two-factor interactions (2FIs) are confounded with each other. Using sound logic, since the main effect of Pressure is insignificant and all significant interaction terms contain important main effects, it is highly likely that the Temperature × pH interaction, rather than its alias Pressure × Stirring, is the significant term, and that the Temperature × Stirring interaction, rather than its alias Pressure × pH, is the other significant interaction. In this case, the process engineer has been successful with the risk taken. The factor Pressure and its interactions now provide degrees of freedom for the residual sum of squares and the design has the required power to analyse the results.
Frame 11.1 shows the ANOVA table generated by Design-Expert®
for this analysis.
Since the curvature term is insignificant, the ANOVA table without curvature can now be interpreted. This is provided in Frame 11.2.
From the above ANOVA table, the process engineer concludes
that a significant model has been developed with an insignificant lack
of fit term. The supplementary statistics for this analysis are provided
in Frame 11.3.

Table 11.21: 2^(4–1) fractional factorial design table for the chemical process data.

Run    Temp (°C)   Pressure (psi)   pH    Stirring (rpm)   Yield (%)

a        1          –1             –1      –1              69.5
b       –1           1             –1      –1              48.6
c       –1          –1              1      –1              66.8
abc      1           1              1      –1              64.1
d       –1          –1             –1       1              44.1
abd      1           1             –1       1              99.5
acd      1          –1              1       1              83.2
bcd     –1           1              1       1              68.6
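The eight runs in Table 11.21 correspond to the block-1 half of Table 11.15, i.e. the fraction with ABCD = –1 (generator D = –ABC). A minimal sketch of how such a 2⁴⁻¹ design could be generated in coded units is shown below; the run labels follow the usual Yates notation.

```python
# Sketch: generate the 2^(4-1) half-fraction with defining relation I = -ABCD (i.e. D = -A*B*C).
from itertools import product

runs = []
for A, B, C in product((-1, 1), repeat=3):    # full 2^3 design in A, B and C
    D = -A * B * C                            # generator reproducing the fraction in Table 11.21
    runs.append((A, B, C, D))

for A, B, C, D in runs:
    label = "".join(f for f, level in zip("abcd", (A, B, C, D)) if level == 1) or "(1)"
    print(f"{label:>4}: A={A:+d}  B={B:+d}  C={C:+d}  D={D:+d}")
```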
Figure 11.38: Half-normal plot of effects for the fractional factorial design for the chemical process data.

The R-Squared, Adj R-Squared and the Pred-R-squared are all


similar and within 0.2 of each other. This is an indication of good
model fit. This is also supported by an Adequate Precision of 30.843,
which is much larger than 4. Figure 11.39 provides the Normal Plot of
Residuals for this analysis.
There is no reason to suggest that there are any outliers in the
dataset, so the next step is to assess the Residuals vs. Predicted and
the Residuals vs Run plots. These are shown in Figures 11.40a) and
b), respectively.
The Residuals vs Predicted plot shows that no outliers are present in the data, and although there appears to be a small downward trend in the Residuals vs Run plot, the points are still randomly distributed around zero and no residual exceeds the approximate three-standard-deviation limits.
The Predicted vs Actual plot (Figure 11.41) shows that the fit of
the data is well described by the linear model, therefore there is no
reason to suggest that there is lack of fit and based on the R-
Squared, Adjusted R-Squared and Predicted R-Squared values, the
model could be used as a predictor of future responses.

Frame 11.1: ANOVA table for reduced chemical process model with measure of curvature.
Frame 11.2: ANOVA table for reduced chemical process model with measure of curvature
removed.

Frame 11.3: Supplementary statistics for reduced chemical process model.

Now that a valid model has been established, the process


engineer’s job is to interpret the results, hopefully optimise the
process and implement the changes into the manufacturing plant. To
do this, a Response Surface was used. By setting Pressure to its mid
value, and setting Stirring to its minimum position, the Response
Surface of Figure 11.42 was generated.
The response surface shows that when Stirring Rate is at its
lowest value, the surface is very stable and a maximum yield of 70%
is achievable. However, based on the centre point (i.e. the normal
operating conditions), this maximum is only what can be achieved
now (the mean of the centre points is 68.2%). Setting stirring to its
maximum position, the response surface of Figure 11.43 was
obtained.
When Stirring Rate is maximised, there is a stable region at high
temperature and low pH where yields well above 90% are possible.

Figure 11.39: Normal plot of residuals for chemical process data.

Improve stage
The process engineer has been successful in finding a set of conditions that will maximise the yield of the chemical process. The risk of performing a fraction of the full design has paid off and a predictive model has been established. The regression coefficients obtained from the model define this predictive model, and the output from Design-Expert® is shown in Frame 11.4.
The response surface for this model indicates that 95% yield is
possible when the following conditions are set for the process.
Temperature = 100°C
pH = 6
Stirring = 250 rpm
Pressure = 300 psi
By keeping the insignificant factor pressure at its low level, the
company can lower its energy costs and also reduce the scheduled
maintenance of the pump due to its lower workload. The process
engineer would now set these conditions up, run a few batches with
the improved settings to assess the batch to batch variability and
when satisfied with the results, would then monitor the future batch
yields using Statistical Process Control (SPC) tools and maybe
implement some Multivariate Statistical Process Control (MSPC) tools
to ensure the process set points for the input variables are always
maintained to produce maximum yield.

Figure 11.40: a) Residuals vs predicted and b) residuals vs run plots for chemical
process data.

Figure 11.41: Predicted vs actual plot for chemical process data.


Figure 11.42: Response surface of chemical process data with stirring set to its minimum
value.
Figure 11.43: Response surface of chemical process data with stirring set to its maximum
value.

Analysis of optimisation designs

In section 11.2.16 the construction of three common types of design for optimisation was discussed. These are,
1) The Central Composite Design (CCD): useful for extending full and
high resolution fractional designs.
2) The Face Centred Cube Design (FCC): useful when a lower
boundary on a factor cannot be extended beyond that already
specified for the design.
3) Box–Behnken (BB) designs: useful when extreme conditions
cannot be measured using a full or fractional factorial design and
an optimisation is required up front.
The main difference in the analysis of optimisation designs compared to factorial designs is the order of the polynomial fitted. In some cases a linear model is the outcome of an optimisation, but in many cases a quadratic or cubic model is the result.
The general form of the quadratic model for a CCD is provided in
equation 11.18,

Frame 11.4

There are three parts to the above equation, the linear part, the
quadratic part and the interaction part. After the data for an
optimisation design have been collected, it is the task of the analyst
to use the information from the ANOVA table to determine which
model form (if any) best fits the data in a similar way to the factorial
designs discussed previously in this chapter.
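The display equation is not reproduced here; the general second-order polynomial that equation 11.18 corresponds to is, in its standard form (with k factors), made up of the three parts just described:

```latex
% Assumed standard second-order (quadratic) response surface model
y = \beta_0
    + \sum_{i=1}^{k} \beta_i x_i          % linear part
    + \sum_{i=1}^{k} \beta_{ii} x_i^2     % quadratic (square) part
    + \sum_{i<j} \beta_{ij} x_i x_j       % two-factor interaction part
    + \varepsilon
```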
The analysis of an optimisation design will be presented by
example using data from a bread baking process. There is something
special about the smell and taste of freshly baked bread and if the
loaf is plump and soft, it also has visual appeal. A baker wishes to optimise a process to simultaneously obtain the maximum possible loaf volume while retaining a certain level of firmness. On reflection, the baker realises that these two objectives are diametrically opposed: if the volume of the loaf is too large, the bread will contain a lot of air and therefore be softer, while a loaf of low volume will be denser and possibly much harder. The steps taken by the baker to optimise the loaf characteristics are described in the following sections.

Optimisation—define stage
With subject matter knowledge, the baker was able to define which
factors would most likely influence loaf volume and firmness. These
are listed in Table 11.22 along with a short reasoning for their
selection.
The factors in Table 11.22 are all continuous and an overall centre
point is available for the design. Since it is possible to make many
loaves of bread in a single day, it was decided to run the experiment
in one block, using a CCD with the following objective for the two
response variables defined in Table 11.23.
Firmness is measured using a sensory scale, so it is prone to
more error compared to volume. With these objectives defined, an
appropriate design can be constructed.

Optimisation—design stage
Using Design-Expert®, a Response Surface model was selected and the Central Composite Design was used. The three experimental factors were set up as defined in Table 11.22, using three centre points. This results in a total of 17 experimental runs. Two responses were added to the design table.
The design table in standard order with all of the responses
entered after performing the experimental runs is provided in Figure
11.44.
The structure of the design table in standard order is described as
follows,
1) Experimental runs 1 to 8 define the 2^3 full factorial section of the design.
2) Experimental runs 9–14 define the star points added to the factorial
design.
3) Experimental runs 15–17 define the centre points added to the
design.

Optimisation—analyse stage
The first step in the analysis stage is to determine the form of the model to use; software such as Design-Expert® provides a summary table for each response being analysed. The results of this summary for the response Volume are provided in Figure 11.45. Since there are two responses being analysed, the second response, Firmness, will be summarised after Volume has been analysed.
The software suggests that a quadratic model is the best fit, based on the p-value for the model being significant using the Whitcomb Score, the lack of fit being just insignificant and this model having the highest Adjusted and Predicted R-Squared values.

Table 11.22: Ranges of factor variation for the bread baking investigation.

Factor                      High level   Low level   Reasoning
Fermentation time (min)     95           25          The amount of time the yeast has to work through the dough in order to break down the starches into simpler sugars and liberate carbon dioxide for volume purposes.
Dough mixing speed (rpm)    130          70          The speed of the mixer dictates how the yeast and moisture are distributed through the dough.
Work input (Watts)          14.5         6.5         Influences the elasticity of the dough and its ability to rise during the resting stage.

Table 11.23: Experimental objectives of the bread baking investigation.

Experimental objective   Target
Volume                   Maximum (300 cL is desired)
Firmness                 Average (between 2.2 and 2.7)

The full ANOVA table for the analysis of Volume is provided in


Figure 11.46.
From the ANOVA table, the model is significant. Even though the main effect of Work Input is insignificant, it is involved in significant interactions and its square term is significant. There is one insignificant interaction, AB (Fermentation Time × Mixing Speed); removing this term made the lack of fit significant, so it was left in the model. The residual, and in particular the lack of fit of the model, is insignificant. This means that the diagnostic plots can now be investigated.
The Normal Probability Plot of Residuals is shown in Figure 11.47.
There is no reason to suggest that the residuals are not normally
distributed.
The Residuals vs Predicted and the Residuals vs Run plots [Figure
11.48a) and b)] also show that there is no deviation from normality.
Figure 11.44: Central composite design (CCD) generated by Design-Expert for the bread
baking process.

The Predicted vs Actual plot for the response Volume is shown in


Figure 11.49.
The supplementary statistics for the analysis of Volume are
provided in Frame 11.5.
The R-Squared value of 0.9080 shows a good fit of the points to the line. The Adjusted R-Squared value is lower, reflecting the number of terms used to fit the model. The Predicted R-Squared value, however, indicates that this model may not be reliable as a predictor of future results and may only be used as an indicative model of the response Volume. Nevertheless, the Adequate Precision value is > 4, so the baker takes the risk that the model will be useful for its purpose.

Figure 11.45: Model fit statistics for the response volume in the bread baking process.

Figure 11.46: Full ANOVA table for bread baking process fitting a quadratic model.

The Response Surface is provided in Figure 11.50 for the


response Volume.
To generate this surface, the factor Work Input was set at its low
level. It can be seen from the Response Surface that the objective of
baking bread with a volume > 300 cL is possible at low fermentation
times, as long as the mixing speed is kept high.
There are two responses to be simultaneously optimised for this
problem, the analysis of the response Firmness is provided in Figure
11.51.
The ANOVA table indicates that the main effect B (Mixing Speed),
its interactions and square term were all insignificant, so these were
all removed from the analysis, as too were all two-factor interaction
terms. The model was found to be significant, residuals and lack of fit
were insignificant and all diagnostics plots indicated that the
residuals were normally distributed. The Response Surface for the
response Firmness is shown in Figure 11.52.
The shape of this surface is what is known as a Saddle Point and
this is best described when viewing the Response Surface in three-
dimensions (Figure 11.53).
Figure 11.47: Half normal probability plot of bread volume.
Figure 11.48: a) Residuals vs predicted and b) residuals vs run plots for bread baking
process.

Figure 11.49: Predicted vs actual plot of bread volume, bread baking process data.

The surface can be viewed as being similar to the shape of a


saddle used for horse riding. From this surface, it is possible to meet
the criteria for firmness in the bread baking process.

An introduction to graphical optimisation


In many industrial and research applications there is a requirement to optimise two or more responses simultaneously. In the case of the baking process optimisation, there is a need to satisfy the joint criteria for volume and firmness, but what is the process for finding this joint optimal solution? There are two main methods available to solve such problems, these being,

Frame 11.5: Regression coefficients of reduced chemical process model.

1) Numerical optimisation, which utilises desirability functions to find


the best solution, should it exist and,
2) Graphical optimisation that places constraints on the developed
Response Surfaces and shows the area on the surface where the
joint solution exists.
For the purposes of this introductory text, the method of Graphical
Optimisation will be presented using the baking process data.
The optimisation criteria for the bread baking experiments were listed in Table 11.23. Design-Expert® provides the ability to graphically optimise a response. To optimise the loaf size, only a lower limit of 300 was set on Volume; since the aim is to maximise Volume, an upper limit is not required. The response Firmness was set such that the lower limit was 2.2 and the upper limit was 2.7.
This results in what is known as the overlay plot. By adjusting the
level of Work Input to a value of around 9 Watts, the maximum area of
Response Surface overlay is obtained for the joint criteria to be met,
as shown in Figure 11.54.
Figure 11.50: Response surface plot of volume, bread baking process.

The yellow shaded area in Figure 11.54 represents the region of


joint optimisation for the two responses Volume and Firmness. The
best solution is one that has a high fermentation time, mixing set to
above its midpoint and work input at 9.
Graphical optimisation is simple in its approach and it is highly
intuitive for a new user, or a user that wants a fast answer. Numerical
optimisation is a much more mathematical approach using numerical
optimisation routines and the interested reader is referred to the
literature for more information, Montgomery [2].
Figure 11.51: ANOVA table for response firmness, bread baking process.

Optimisation—improve stage
When there is more than one response being measured, methods
such as numerical and graphical optimisation are required. For an
even more in-depth analysis of the joint optimisation procedure, the
method of Partial Least Squares Regression (PLSR, chapter 7) is also
suggested as it relates the responses directly to the input factors and
to each other.
In the case of the bread baking process, the optimal solution for
the joint optimisation of loaf volume and loaf firmness was found to
occur when the parameters in Table 11.24 were set for the process.
The baker would now run a number of trial batches at these
settings to establish consistency over batches, raw materials and
other bakers making the same bread as part of robustness testing.
Figure 11.52: Response surface of the response firmness, bread baking process.
Figure 11.53: Three-dimensional response surface of the response firmness, bread baking
process.
Figure 11.54: Overlay plot for joint optimisation of volume and firmness, bread baking
process.

This section provided an introduction to optimisation designs and


showed how the entire process of data collection and analysis could
be applied to a jointly optimised system. The final sections of this
chapter deal with situations that are limited by physical constraints on
both the input and output variables of a design.

11.2.31 An introduction to constrained designs


The topic of constrained designs would warrant a textbook all on its
own to fully describe the designs and their analysis, so this text will
only focus on a small section of constrained factorial design
configurations and will put more emphasis on a special class of
designs, known as mixture designs.
These topics will be presented using an example to show how
they work with emphasis placed on the analysis of such designs.

The situation when a factorial design won’t fit


To illustrate when factorial designs do not fit a particular constrained situation, an example from the consumer food market will be taken. "Freezer meals" is a generic term for convenience foods bought from a supermarket. They are typically pre-prepared such that they can either be reheated in a microwave oven or prepared in a sequence of easy-to-follow steps, allowing busy families to serve food without having to spend time cooking.

Table 11.24: Experimental objectives and optimal process settings for the bread baking investigation.

Experimental objective   Predicted value   Fermentation time (min)   Mixing speed (rpm)   Work input (Watts)
Volume                   301.76 cL         90.6                      105                  9
Firmness                 2.24

A manufacturer of prepared foods wants to investigate the impact


of several processing parameters on the sensory properties of
cooked, marinated meat. The meat is to be first immersed in a
marinade, then steam-cooked and finally deep-fried. The steaming
and frying temperatures are fixed; the marinating and cooking times
are the process parameters of interest.
The food scientist wants to investigate the effect of the three
process variables within the following ranges of variation defined in
Table 11.25.
A full factorial design would lead to the following “cube”
experiments defined in Table 11.26.
After investigating the factor settings in the full factorial design,
the food scientist expresses strong doubts that the design can be of
any help. This is because if the meat is steamed then fried for 5 min
each it will not be cooked, and at 15 min each it will be overcooked
and burned on the surface. In either case, no valid sensory ratings are
possible, because the products will be far beyond the ranges of
acceptability.
This is where a constrained design may help. In order for the meat
to be suitably cooked, the sum of the two cooking times was
determined to be optimal between 16 and 24 minutes for all
experiments. This type of restriction is called a multi-linear
constraint. In the current case, it can be written in a mathematical
form requiring two equations, as follows:

Steam + Fry ≥ 16 and Steam + Fry ≤ 24

The impact of these constraints on the shape of the experimental


region is shown in Figures 11.55a and b.
The constrained experimental region is no longer a cube! As a consequence, it cannot be analysed as a factorial design and points must be chosen that simultaneously,
1) span the maximum design space and
2) maintain as closely as possible the near orthogonality of the design points.

Table 11.25: Ranges of the process variables for the cooked meat
design.

Process variable Low High

Marinating time 6 hours 18 hours

Steaming time 5 min 15 min

Frying time 5 min 15 min

The new design table, taking into account the constraints imposed
on the system is provided in Table 11.27.
Figure 11.55 b) shows that the “corners” of the factorial design
have been cut off for steaming and frying compared to the
experimental region represented by the full factorial design.
Depending on the number and complexity of the multi-linear
constraints to be taken into account, the shape of the experimental
region can be more or less complex. In the worst case, it may be
almost impossible to imagine! Therefore, building a design to screen
or optimise variables linked by multi-linear constraints requires
special methods of construction in order to be analysed using
methods such as MLR.

Table 11.26: The cooked meat full factorial design.

Sample Marinating time Steam time Fry time

1 6 5 5

2 18 5 5

3 6 15 5

4 18 15 5

5 6 5 15

6 18 5 15

7 6 15 15

8 18 15 15
Figure 11.55: The cooked meat experimental region. a) Without and b) with constraints.

The principles of design optimality


Alphabetical Optimal designs are generated by a class of
mathematical optimisation routines, generally based on row-
exchange algorithms. These methods result in design spaces that
cover the widest range of the experimental space, for a preselected
number of samples that maintains the highest degree of orthogonality
between the points.

Table 11.27: The cooked meat constrained design.

Sample Marinating time Steam time Fry time

1 6 5 11

2 6 5 15

3 6 9 15

4 6 11 5

5 6 15 5

6 6 15 9

7 18 5 11
8 18 5 15

9 18 9 15

10 18 11 5

11 18 15 5

12 18 15 9

There are a number of common alphabetical designs available,


three of which are supported in Design Expert. These are briefly
described as follows,
1) I-Optimal Designs: This algorithm chooses runs that minimise the
integral of the prediction variance across the design space. This
criterion is recommended for building response surface designs
where the goal is to optimise the factor settings, requiring greater
precision in the estimated model.
2) D-Optimal Designs: This algorithm chooses runs that minimise the
determinant of the variance–covariance matrix. This has the effect
of minimising the volume of the joint confidence ellipsoid for the
coefficients. These criteria are recommended for building factorial
designs where the goal is to find factors important to the system
being studied. It is commonly used to create fractional general
factorial experiments.
3) A-Optimal Designs: This design minimises the trace of the
variance–covariance matrix. This has the effect of minimising the
average prediction variance of the polynomial model coefficients.
These designs are built algorithmically to meet this criterion.
For the purposes of this introductory text, only the D-Optimal
designs will be described.

D-Optimal designs
Returning to the freezer meal example introduced earlier in this section, two constraints were placed on the variables steaming time and frying time. Referring to Figure 11.55 b), the design has the corners cut off the cube due to the two constraints, but how was this design generated? In such cases, one of the alphabetic optimal designs is required to select the best possible subset of points to include. The D-optimal principle, which consists in enclosing the maximum volume within the selected points, is described here; it is accepted that a selection of points will not exactly reconstitute the entire experimental region of interest, but only the smallest possible volume will be excluded.
When multi-linear constraints are introduced among the design
variables, it is no longer possible to build an orthogonal design. This
introduces a certain degree of dependency amongst the design
factors and as soon as the variations in one of the design variables
are linked to those of another design variable, orthogonality cannot
be achieved. These designs require a computer to generate them and
the analyst must have some idea of the number of points to include in
the design, up-front, based on an assessment of the shape that the
constraints impose on the design space.
The design is generated from a grid of candidate points spaced over the design region. Since the D-Optimal design aims to maximise the volume that the selected points enclose, mathematically this is equivalent to minimising the determinant defined in equation 11.18, i.e. the determinant of the variance–covariance matrix (X′X)⁻¹ (equivalently, maximising |X′X|).

From equation 11.18 it can be shown that a D-Optimal design minimises the volume of the joint confidence region of the regression coefficients, Montgomery [2]. A measure of the relative efficiency of two designs, based on the volume the selected points enclose, is known as the D-Efficiency. This is defined by equation 11.19, where X1 and X2 are the X matrices for the two designs being compared and p is the number of model parameters.
Computer-generated designs employ exchange algorithms: they start from an initial, sometimes randomly selected, design with the desired number of points and then exchange points in the design with other candidate points. Using equation 11.19, the efficiency of each exchange is compared with that of the previous design in order to improve the optimality (and therefore the orthogonality).
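A hedged, minimal sketch of such a point-exchange search is given below. It uses the cooked meat constraint (16 ≤ steam + fry ≤ 24) to build the candidate set, a simple main-effects model matrix, and a greedy exchange that maximises |X′X| (equation 11.19 is commonly defined as a relative D-efficiency based on |X′X|^(1/p)); this toy routine is for illustration only and is not the algorithm implemented in Design-Expert®.

```python
# Toy greedy point-exchange search for a (near) D-optimal subset: maximise det(X'X).
import itertools
import numpy as np

# candidate grid over the cooked meat factors, filtered by the multi-linear constraint
marinate = [6, 12, 18]
steam = fry = [5, 7, 9, 11, 13, 15]
candidates = np.array([(m, s, f) for m, s, f in itertools.product(marinate, steam, fry)
                       if 16 <= s + f <= 24], dtype=float)

def model_matrix(points):
    """Main-effects model: intercept plus the three linear terms."""
    return np.column_stack([np.ones(len(points)), points])

def log_det(points):
    X = model_matrix(points)
    sign, value = np.linalg.slogdet(X.T @ X)
    return value if sign > 0 else -np.inf

rng = np.random.default_rng(0)
n_runs = 12
design = list(rng.choice(len(candidates), size=n_runs, replace=False))

improved = True
while improved:                                # exchange one design point at a time
    improved = False
    for i in range(n_runs):
        for j in range(len(candidates)):
            if j in design:
                continue
            trial = design.copy()
            trial[i] = j
            if log_det(candidates[trial]) > log_det(candidates[design]) + 1e-9:
                design, improved = trial, True

print(candidates[design])                      # the selected (near) D-optimal runs
```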
In the ideal case, if all extreme vertices are included into the
design, it has the highest D-efficiency. If that solution is too
expensive, i.e. the design requires too many runs to generate the
optimal design, then a smaller number of points has to be
considered. The automatic consequence is that the D-efficiency will
decrease and the enclosed volume will decrease. This is illustrated by
Figure 11.56.

D-Optimal designs for screening and factor influence stages


If the purpose of the design is to evaluate the main effects, and
optionally to describe some or all of the interactions among them
when constraints are present, a linear model or a linear model with
interaction effects must be constructed.
The set of candidate points for the generation of the D-optimal
design will then consist mostly of the extreme vertices of the
constrained experimental region. If the number of variables is small
enough, edge centres and higher order centroids can also be
included. In addition, centre samples can be included in the design
(whenever they apply).

D-Optimal designs for optimisation purposes


When the purpose of the design is to investigate the effects of factors with enough precision to describe a response surface accurately, a quadratic model is required. The design must therefore include intermediate points (situated somewhere between the extreme vertices) so that there are enough levels for the square effects to be computed.
The set of candidate points for a D-optimal optimisation design
will thus include:
all extreme vertices;
all edge centres;
all face centres and constraint plane centroids.

Figure 11.56: With only eight points, the enclosed volume is not optimal.

To imagine the result in three dimensions, a combination of a


Box–Behnken design (which includes all edge centres) and a Face
Centred Cubic design (with all corners and all face centres) could be
envisioned. The main difference is that the constrained region is not a
cube, but a more complex polyhedron, which may (or may not) be
rotatable.

Mixture designs

A Mixture Design is a special class of constrained design, commonly


related to the amounts of components making up a mixture. In these
designs, the response is no longer a function of e.g. process
variables, but is now a function of components making up the
mixture. An example situation may be taken from pharmaceutical
development studies where three polymers are considered for a slow
release coating for a tablet. Slow release coatings are an essential
part of drug formulation for materials that may react with stomach
acid in undesirable ways. The formulation of coatings that remain
intact in the stomach but can break down slowly in the alkaline
environment of the small intestine provide the drug with the
characteristics required for treating a particular ailment. By adjusting
the proportions of each polymer in the mixture, the rate of release can
be studied.
So, the question is, “what makes a mixture design different from
factorial designs?” In a mixture of physically weighed out
components, there is one major constraint; mass can never be less
than zero, so the first constraint can be stated as,

0 ≤ xi ≤ 1,  i = 1, 2, …, p

where the xi are the proportions of the p individual components making up the mixture. In this case, the upper bound represents a proportion of the mixture and a single component on its own will always have a proportion of 1. It is also possible to state the upper bound of the mixture in terms of percentages or a target final mass. The next constraint on a mixture design is a mass balance constraint. When dealing in proportions, the total sum of the mixture components must equal 1 (or, in the case of percentages, 100%). This can be expressed mathematically as

x1 + x2 + … + xp = Σxi = 1

These two constraints confine the design space to what is known as a Simplex. A simplex is the simplest geometric figure that can represent p components in a (p – 1)-dimensional space, i.e. in a binary mixture, if the proportion of one component is known, the other component can be determined by difference, see Figure 11.57.
In the case of a binary mixture, p = 2, the dimension it can be described in is p – 1 = 1. Therefore, a straight line is the simplex that describes a binary mixture. What happens for a ternary mixture, i.e. three components? These have to be described in p – 1 = 2 dimensions, but how? To show how the simplex is built for this system, reference is made to the 2^3 full factorial design shown in Figure 11.58.
Using the 2^3 full factorial design as a basis, the maximum proportion of any one component is constrained to 1. The (–1) levels cannot exist and the lower constraint on each axis is set to zero. The result is the equilateral triangular space shown. Within this space, all mixture compositions are defined and this, by definition, means that the mixture components are dependent on each other, i.e. if one component in the mixture changes, all other components change simultaneously to maintain the mass balance. This is why the simplex for p = 3 components can be described by a two-dimensional space.
It is interesting to observe from Figure 11.58 that when one
component is set to zero, the simplex reduces to a binary mixture.
Consider component x3 in the right-hand figure. At its apex, it has a
proportion of 1, which naturally implies that x1 and x2 are at zero
proportions. If the vertical line joining x3 to the centroid and then to
the midpoint of the base of the triangle is followed (this is also known
as the Piepel Direction), x3 gradually decreases and at the overall
centroid of the design, there are equal proportions of all three
components. At the base of the triangle, x3 = 0 and if the point
chosen is the midpoint, then x1 and x2 are present in the mixture in
0.5 : 0.5 proportions. This principle applies to all components that
make up the mixture.
Figure 11.57: Defining mixture proportions in a simplex.

Figure 11.58: The construction of a simplex for p = 3 components.


Figure 11.59: The p = 4 simplex design space.

In the case of p = 4, the simplex has a three-dimensional (p – 1 =


3) shape and this is shown in Figure 11.59.
The first thing to note in this design is that each face of the tetrahedron is the p = 3 simplex, i.e. when one component is set to zero, the design collapses to the simplex for the p – 1 case. One triangular face is shaded in Figure 11.59 to show this and also to show that each face has a centroid of 1/3 : 1/3 : 1/3 for its three components. The overall centroid of the design space now changes to 1/4 : 1/4 : 1/4 : 1/4 to account for the extra component added to the design. Beyond p = 4 components, designs cannot be shown as a single graphic; however, the same principles of construction apply for p > 4.

General mixture models


Due to the dependency of the mixture components on each other, the standard models used for factorial designs cannot be applied directly to mixture designs because of the constraint Σxi = 1. To accommodate the design spaces associated with mixture designs, the so-called Scheffé models are employed to analyse the resulting data. For a more detailed discussion of Scheffé models and mixture designs in general, the interested reader is referred to the texts by Cornell [5] and Smith [6].
There is one major similarity in the way factorial designs and
mixture designs are generated and analysed. This relates to the three
main purposes of the designs,
1) Screening Designs: To isolate which components have a
significant impact on a response.
2) Component Influence Designs: Used to understand the synergistic
(positive) or the antagonistic (negative) interactions of components
on a response.
3) Optimisation Designs: Used to find the best and most robust
mixture of components for a product or formulation.
The so-called Scheffé polynomials are presented below and the
designs that they can be applied to are discussed in the sections that
follow.
For mixture designs that are used for screening or when there is
no interaction between components, the linear mixture model of
equation 11.20 can be used,

y = β1x1 + β2x2 + … + βpxp (11.20)

Note that in equation 11.20 and the following equations for Scheffé
models there is no intercept term and the regression coefficients β
are direct multipliers of the components. This is again due to the
dependency of the components on each other, as defined by the
constraint Σxi = 1; the regression coefficients absorb the intercept
term to reflect this dependency.
For component influence studies, linear and interaction terms are
important, particularly for identifying synergistic or antagonistic
blending processes. In a system that displays synergistic blending
characteristics, the response for such a mixture would be recorded to
be significantly higher than just the average of the two components
mixing together and vice versa for antagonistic blending. This
principle is described later in this section.
The component influence model is defined in equation 11.21, where
both interaction and linear terms are calculated. Note that in a
mixture design, quadratic terms are defined by mixture interactions,
again due to the mass balance constraint,

y = Σi βixi + Σi<j βijxixj (11.21)

Equation 11.21 requires that there be at least three levels of each
design variable present in the design to accommodate the quadratic
model.
For optimisation designs the Full Cubic and Special Cubic models
(along with quadratic models) are used to define more complex
response surfaces. These are defined by equations 11.22 and 11.23
respectively,

y = Σi βixi + Σi<j βijxixj + Σi<j δijxixj(xi – xj) + Σi<j<k βijkxixjxk (11.22)

y = Σi βixi + Σi<j βijxixj + Σi<j<k βijkxixjxk (11.23)

Note that in equation 11.22 there is a cubic term, δijxixj(xi – xj), based
on only two components and the difference between them, which
accounts for the dependency between the components. The last term
in equation 11.22 is the purely cubic interaction term of the
components. Equation 11.23 is a simplification of equation 11.22 that
only retains the purely cubic interaction terms of the components.
These equations require a minimum of four component levels in the
design in order to calculate cubic effects.
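To make the structure of these models concrete, the sketch below builds the corresponding Scheffé model matrix (linear, quadratic or special cubic terms) from a set of mixture proportions; this is an illustrative helper written for this text, not a function from Design-Expert® or The Unscrambler®.

```python
import numpy as np
from itertools import combinations

def scheffe_model_matrix(X, model="quadratic"):
    """Build the Scheffé model matrix for mixture proportions X (n runs x p
    components). Linear terms x_i are always included; the quadratic model
    adds the binary blending terms x_i*x_j and the special cubic model adds
    the ternary terms x_i*x_j*x_k. No intercept column is used, because of
    the mixture constraint sum(x_i) = 1."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    cols = [X[:, i] for i in range(p)]                                 # linear terms
    if model in ("quadratic", "special cubic"):
        cols += [X[:, i] * X[:, j] for i, j in combinations(range(p), 2)]
    if model == "special cubic":
        cols += [X[:, i] * X[:, j] * X[:, k]
                 for i, j, k in combinations(range(p), 3)]
    return np.column_stack(cols)
```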
The designs associated with each of the models described above
are defined in the next section.

General mixture designs


Depending on the objective stated in the Define stage, the following
design types are applicable based on the classical simplex shaped
nature of the space required. The designs described in the following
sections are all based on there being p = 3 components available and
thus resulting in a triangular space.

Simplex lattice designs


By their definition, these designs are based on a grid type lattice
structure within the design space. A [p, m] simplex lattice design for p
components is a design in which each proportion takes up the m + 1
equally spaced values from 0 to 1,

xi = 0, 1/m, 2/m, …, 1

and all combinations of these values that sum to 1 are used. Taking
the example for p = 3 and m = 2, the [3, 2] Simplex Lattice design has
the coordinates (x1, x2, x3) = (1, 0, 0), (0, 1, 0), (0, 0, 1), (0.5, 0.5, 0),
(0.5, 0, 0.5) and (0, 0.5, 0.5).
What does this mean in practice? Each of the p components in the
mixture will take on a value of 0, 0.5 and 1, i.e. the zero mixture, the
0.5 : 0.5 mixture with another component and the pure component.
Figure 11.60 shows the Simplex Lattice designs for the [3, 1], [3, 2]
and [3, 3] situations.
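A [p, m] simplex lattice can be enumerated directly from its definition; the following sketch is an illustration only (DoE software generates such designs automatically).

```python
from itertools import product

def simplex_lattice(p, m):
    """Enumerate the [p, m] simplex lattice: every combination of the m + 1
    equally spaced proportions 0, 1/m, ..., 1 over the p components whose
    proportions sum to exactly 1 (the mass-balance constraint)."""
    points = []
    for combo in product(range(m + 1), repeat=p):
        if sum(combo) == m:
            points.append(tuple(level / m for level in combo))
    return points

print(simplex_lattice(3, 2))
# 6 points: the 3 pure components and the 3 binary 0.5 : 0.5 blends
```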
Consider the [3, 2] design and in particular consider the binary
mixture along the x1, x2 direction. There are three design points to
which a model can be fitted and there are a number of possible ways
this fit can occur.

Synergistic blending
Figure 11.61 shows how a theoretical response y might change as a
function of x1, x2 composition for a [3, 2] Simplex Lattice design.
The dotted straight line in Figure 11.61 is the theoretical linear fit
of the data based on there being no interaction between the
components x1 and x2 during blending. When the mixture
components blend in a synergistic manner, the components
positively enhance each other’s contribution to the response and thus
a non-linear situation occurs. There are now enough points in this
design to calculate a quadratic model.

Antagonistic blending
Antagonistic blending is the opposite of synergistic blending and its
effect on a theoretical response is shown in Figure 11.62.
Figure 11.60: Some simplex lattice designs for p = 3, linear, quadratic and cubic model
structures.

Cubic blending
There are certain situations where the nature of blending between two
or more components changes from synergistic to antagonistic as the
proportions change. This is an example of cubic (or even higher
polynomial) blending. Figure 11.63 shows the situation of cubic
blending for a theoretical response for the [3, 3] Simplex Lattice
design.
Figure 11.61: Synergistic blending in a [3, 2] simplex lattice design with a quadratic model.

From this discussion, it can be seen that the number of points in a
design determines the degree of the model that can be fitted to the
data. This is a function of the number of components p and the
polynomial degree m, and must be taken into account before any
mixture design problem is investigated.
Figure 11.62: Antagonistic blending in a [3, 2] simplex lattice design with a quadratic model.

The simplex centroid designs


Closer investigation of the [3, 2] Simplex Lattice design shows one
major flaw when trying to fit a quadratic model: the design contains
no points in the interior of the design (refer to Figure 11.60). Even
though it supports a quadratic model, the [3, 2] Simplex Lattice
design provides no information about three-component blending; it is
purely a component interaction model for binary mixtures. To
overcome this shortfall, a special class of design known as the
Simplex Centroid design was developed. By its own definition, this
design places points at all centroids in the design space; two
common variants, the usual Simplex Centroid and the Augmented
Simplex Centroid design, are shown in Figure 11.64 for p = 3.
Figure 11.63: Cubic blending in a [3, 3] simplex lattice design.

The Simplex Centroid design is an economical design for fitting up
to the Special Cubic model defined in equation 11.23. It can be used
for screening, component interaction and optimisation applications
and it generates 2ᵖ – 1 design points. In the case of p = 3, there are
seven points in the design: the three pure components, the three
binary mixtures and the overall centroid. For more details regarding
the construction of higher order Simplex Centroid designs, the
interested reader is referred to the texts by Myers and Montgomery
[4], Cornell [5] and Smith [6].
An extension of the Simplex Centroid design is to add what are
known as axial check blends to the design. These points lie in the
design's interior, midway between the centroid and a pure
component. These points are not used in the model building, but are
used to validate that the model containing the centroid is generating
valid results.
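The construction of the Simplex Centroid points and the axial check blends follows directly from these definitions, as in the minimal sketch below (illustration only, not taken from any DoE package).

```python
from itertools import combinations

def simplex_centroid(p):
    """The 2^p - 1 points of the Simplex Centroid design: for every non-empty
    subset of the p components, the blend with equal proportions of the subset
    members and zero for all other components."""
    points = []
    for k in range(1, p + 1):
        for subset in combinations(range(p), k):
            point = [0.0] * p
            for i in subset:
                point[i] = 1.0 / k
            points.append(tuple(point))
    return points

def axial_check_blends(p):
    """Axial check blends: points midway between the overall centroid and each
    pure component (used for validation, not for model building)."""
    centroid = [1.0 / p] * p
    blends = []
    for i in range(p):
        pure = [0.0] * p
        pure[i] = 1.0
        blends.append(tuple((c + v) / 2 for c, v in zip(centroid, pure)))
    return blends

print(len(simplex_centroid(3)))   # 7 points for p = 3
print(axial_check_blends(3))      # 3 interior check points
```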

Analysis of mixture designs

Once a design has been selected and the responses collected by
experimentation, the analysis of a mixture design proceeds as was
described for factorial and optimisation designs (section 11.2.16).
The analysis of a three-component mixture design will be used as an
example of how to analyse a dual response problem.

Mixture designs—define stage


A wine producer wants to blend three different types of wines
together: Carignan, Grenache and Syrah. All three types can vary
between 0% and 100% in proportion. The aim is to find out in what
proportion the three blends produce an acceptable wine, based on
sensory assessment. An associated objective is to simplify the
production process by determining if blending only two types of wine
together will produce an acceptable product. The final objective is
economical in nature and therefore the final product should be
produced at the lowest cost possible.
In quantitative terms, the objective of this design is to meet the
criteria listed in Table 11.28.

Figure 11.64: Simplex centroid and augmented simplex centroid designs for p = 3
components.

Mixture designs—design stage


As per the objectives, the aim is to find a binary mixture of wine
blends that best suits the market need and, at the same time, is
affordable with respect to both production cost and consumer
affordability. This naturally leads to the Simplex Lattice designs, as
they are most suited to assessing synergistic and antagonistic
blending between two components. This is not an optimisation design
in the strict sense of the word; it is more about finding a combination
of two wines that enhances the sensory appeal to the consumer more
than the single wine types alone.
In this case, the [3, 3] Simplex Lattice design in 11 runs was
chosen with one replicated centre point such that any cubic mixing
could be detected for binary blends. This means that models up to
the Special Cubic model can be used to assess the data.
The preference data recorded for this experimental design was
averaged over 83 male and 30 female consumers aged between 25
and 60. The scale used ranged from 1 to 3. A major group of 99
consumers with a similar preference was clearly identified in a
previously performed principal component analysis (PCA).
The production cost was computed for each blended sample from
the amounts of the individual wine types used, via a linear equation.
All response data was collected into Table 11.29 for the purposes of
this analysis.

Table 11.28: Wine blend design objectives.

Response      Experimental objective
Preference    Greater than 2 is preferable, with the aim to maximise this response
Cost          A production cost under $3 is preferable

The actual design space and the precision of estimates over the
design for this study are provided in Figure 11.65. The precision at
the centroid is lower than at the edges of the design, but this is to be
expected from a Simplex Lattice design, since it puts little emphasis
on analysing the interior of the design and more emphasis on the
edges of the design.

Mixture designs—analyse stage


After the data were entered into Design-Expert®, the suggested
model form for the response preference was a quadratic model and
for the response cost, a linear model was suggested (since cost is a purely
additive response). The ANOVA table for preference is shown in
Figure 11.66.
The ANOVA table for preference shows that the model is
significant and that there are at least two significant quadratic terms,
both of which include the blend component Syrah. The
corresponding supplementary statistics for preference are provided in
Frame 11.6.

Table 11.29: Design table for wine blending design.

Std   Run   Component 1    Component 2    Component 3   Response 1    Response 2
            A: Carignan    B: Grenache    C: Syrah      Preference    Cost ($)
1     9     1              0              0             2.8           3.50
2     10    0              1              0             1.4           2.50
3     8     0              0              1             1.8           4.00
4     3     0.67           0.33           0             2.4           3.17
5     5     0.67           0              0.33          2.8           3.67
6     1     0.33           0.67           0             1.9           2.83
7     11    0.33           0.33           0.33          2.4           3.33
8     6     0.33           0              0.67          2.7           3.83
9     7     0              0.67           0.33          1.8           3.00
10    2     0              0.33           0.67          2.0           3.50
11    4     0.33           0.33           0.33          2.5           3.33
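As a complement to the Design-Expert® analysis that follows, the quadratic Scheffé model for preference can be sketched from the data in Table 11.29 using ordinary least squares; this illustrates the model form only and is not intended to reproduce the exact Design-Expert® output.

```python
import numpy as np

# Mixture proportions (Carignan, Grenache, Syrah) and preference scores
# transcribed from Table 11.29 (standard order).
X = np.array([[1.00, 0.00, 0.00], [0.00, 1.00, 0.00], [0.00, 0.00, 1.00],
              [0.67, 0.33, 0.00], [0.67, 0.00, 0.33], [0.33, 0.67, 0.00],
              [0.33, 0.33, 0.33], [0.33, 0.00, 0.67], [0.00, 0.67, 0.33],
              [0.00, 0.33, 0.67], [0.33, 0.33, 0.33]])
preference = np.array([2.8, 1.4, 1.8, 2.4, 2.8, 1.9, 2.4, 2.7, 1.8, 2.0, 2.5])

# Quadratic Scheffé model matrix: x1, x2, x3, x1x2, x1x3, x2x3 (no intercept).
M = np.column_stack([X[:, 0], X[:, 1], X[:, 2],
                     X[:, 0] * X[:, 1], X[:, 0] * X[:, 2], X[:, 1] * X[:, 2]])

coef, *_ = np.linalg.lstsq(M, preference, rcond=None)
fitted = M @ coef   # fitted preference values for the 11 blends
```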

The various R-Squared values indicate a good model fit to the data
and the Adequate Precision is greater than four, so the model can be
taken to the diagnostics phase.
The ANOVA table for the response cost is provided in Figure
11.67.
This model is naturally significant by design as the results were
generated using a linear equation and therefore, no further
diagnostics assessment will be performed.
The Normal Probability plot of Residuals is provided in Figure
11.68 for the response preference.
There is one possible point that is different from the rest of the
population; however, the residuals still form a straight-line fit and an
assessment of the Studentised residuals showed that no point was
outside the critical limits, therefore the design can now be
interpreted.
The 3-Dimensional Response Surface plot is shown in Figure
11.69.
This plot shows the quadratic nature of the Response Surface.
Note that the Response Surface is simplex shaped compared to the
corresponding response surfaces for factorial designs. The Predicted
vs Actual plot for this design (Figure 11.70) confirms that this model
can be used as a predictor of new results.
For clarity, the Response Surfaces for preference and cost are
shown together as contour plots in Figures 11.71a) and b).
A number of observations become apparent immediately,
1) Carignan is the most preferred wine, as are its blends with Syrah.
2) Syrah has the highest cost, but not the highest preference.
3) Grenache is “cheap and nasty”.
In order to meet the criteria set out in Table 11.28, Graphical
Optimisation will be used. The results of this analysis are shown in
Figure 11.72.
The Overlay plot of preference and cost shows that there is a
small region of the design space that meets the criteria of Table
11.28. In particular, this plot shows that there is the possibility of
blending a binary mixture of wines to meet the objective. This is
shown in the flag in the Overlay plot and the composition of the wine
blend is listed in Table 11.30.
Figure 11.65: Standard error of design for the [3, 3] simplex lattice for wine blending analysis.

Mixture designs—improve stage


Now that the design has met its objectives, the wine producer will set
the blend headers in the process to mix the desired proportions of
Carignan and Grenache and perform some validation runs to see if
the blending of a number of test batches meets the requirements. If
so, the producer has gained the following benefits of experimental
design,
Figure 11.66: ANOVA table for the response preference, wine mixture design.

Frame 11.6

Figure 11.67: ANOVA table for the response cost, wine mixture design.
Figure 11.68: Normal plot of residuals for response preference, wine mixture design.
Figure 11.69: Response surface for response preference, wine mixture design.

1) The expensive and not so preferable Syrah, at $4 a bottle, has been
eliminated.
2) Adequate preference and cost were met with a binary blend of
Carignan and Grenache with the lower cost component making up
most of the blend.
3) The production and storage facilities are now more efficient due to
the removal of one component and that component could now be
used in another product offering. This also reduces logistical strain
in the event Syrah is not available during one season.
There are both tangible and intangible gains from the application
of DoE to most problems. These are the points that process engineers
and scientists must emphasise to senior managers in their work
environment: spending a little upfront on quality will lead to greater
savings without cutting corners.

Figure 11.70: Predicted vs actual plot for response preference, wine mixture design.
Figure 11.71: Response Surface plots for responses a) preference and b) cost, wine mixture
design.

Table 11.30: Optimal binary composition of wine blends to meet the
required acceptance and cost criteria, wine blend data.

Blend component   Composition in blend (%)   Preference   Cost ($)
Carignan          40                         2.1          2.90
Grenache          60

Constrained mixture spaces


There are certain practical situations that require even further
constraints on a mixture in order for it to be of any practical value to
an end user. A good example is found in gasoline blending. Most
gasoline blends are a mixture of a few common blend stock
components, including:
1) Light Straight Run (LSR): These are unbranched alkanes with low
Octane Numbers but are typically produced in vast quantities, so it
is the objective of the oil refineries to blend as much of these as
possible for storage and other economic reasons.
2) Aromatics: These are high Octane Number feed stocks and are
formed in reasonable quantity. However, environmental and health
& safety regulations stipulate that only a certain proportion of
gasoline can contain aromatics.
3) Isomerates: These are the products of branching of LSR or the
reduction of aromatics into branched hydrocarbons with higher
Octane Number than the original feed stocks. These require an
additional process step and can increase production costs.
4) Alkylates: These are the highest Octane Number components,
typically found in premium gasoline blends and these are produced
under highly dangerous conditions. These are the most expensive
of all the feedstocks.
Figure 11.72: Graphical optimisation plot of responses preference and cost, wine mixture
design.

When designing a new gasoline, consideration must be given to
cost, performance, environmental regulations, availability and
process equipment downtime. Consider a three-component gasoline
to be blended as a regular product for most motorists, who do not
have too much regard for the running of their car. This will contain a
large quantity of LSR, thus a lower and even an upper limit need to be
placed on this component. Aromatics can only be present to a
maximum of 5% in any gasoline blend under the laws of most
environmentally conscious countries. The remainder has to be made
up of expensive Isomerates to boost the Octane Number such that the
engine performs as it should. The following constraints were placed
on the gasoline blend.

55 ≤ LSR ≤ 70
0 ≤ Aromatics ≤ 5
30 ≤ Isomerate ≤ 40

What does this design space look like? Starting with the p = 3
simplex, the limits imposed on LSR are first plotted in Figure 11.73.
The first thing to note is that the design space with the lower and
upper bounds on LSR imposed is no longer simplex shaped, i.e. the
design region cannot be modelled using the standard Scheffé models.
This is not a concern at the moment, because there are situations
where, once all constraints are imposed, the design space becomes a
simplex again. The upper bound on Aromatics is now shown in Figure
11.74.
Again, the upper bound results in a non-simplex shape. Figure
11.75 shows the design space when the Isomerate bounds are
applied.
The final design space is trapezoidal in shape and thus cannot be
analysed using standard mixture polynomials. More on the
development and analysis of these designs will be provided in a later
section of this chapter. The following sections present some general
rules regarding the shape of the design space when lower and upper
limits are imposed on the mixture space.

Lower bounds in the mixture space


As the name suggests, a lower bound is a limit that does not allow
the component(s) of interest to become zero. A good example is the
case of making pancake mixture consisting of three components,
Flour, Sugar and Eggs. It would be a pretty awful pancake if it
contained 100% of any ingredient so the entire mixture space must
be bounded in some way so that viable mixtures can be produced
that would result in acceptable pancakes. Theoretically, the
proportion of water used to make the mixture could be added as a
fourth mixture component, but for this example, the p = 3 simplex will
be used for simplicity.

Figure 11.73: Simplex design space for p = 3. LSR with a lower and an upper bound.

Although just putting a lower limit on any one component would
still not result in an acceptable mixture, the application of a number of
lower bounds will imply upper bounds in the design by default. This is
shown in Figure 11.76 for the pancake mixture design, where lower
bounds are applied successively to all components.
In Figure 11.76, it is important to note that in all cases, lower
bounds result in a simplex shaped design space. This means that in
any case where only lower bounds are imposed, the Scheffé models
can be applied to the design. Another point to note from Figure 11.76
is that the upper bound of 100% is underlined for each component.
The first bound, on Flour, is straightforward: the lower boundary bounds
the space for Flour between 60% and 100%. This has implications for
both Egg and Sugar due to the dependency between the
components. If only the Flour is bounded, the following constraints
are immediately imposed on the other components,

Figure 11.74: Simplex design space for p = 3. LSR with a lower and an upper bound,
aromatics with an upper bound.
60% ≤ Flour ≤ 100%
0% ≤ Egg ≤ 40%
0% ≤ Sugar ≤ 40%

Figure 11.75: Simplex design space for p = 3. All bounds applied to the components in the
gasoline design.

This is purely to balance the overall proportions to 100%, i.e. if
Flour = 60%, then Egg could be 40% and Sugar 0% (or any
combination that balances the Egg and Sugar proportions to make up
the remaining 40% of the mixture).
When the Egg lower bound is imposed at 10%, this has an
immediate impact on the upper bound on Flour. In this case, having
10% Egg in the mixture would reduce the maximum amount of Flour
to 90% if Sugar was at its lower bound of 0%. The constraints can
now be recalculated as follows,
Figure 11.76: Successive lower bounds imposed on the p = 3 simplex space, pancake
mixture. Note that in all cases, lower bounds result in a simplex shape.

60% ≤ Flour ≤ 90%
10% ≤ Egg ≤ 40%
0% ≤ Sugar ≤ 30%

The upper bound on Egg remains the same at 40% and this
balances since the lower bound on Sugar is still 0% at this stage. The
Sugar upper bound is now implicitly changed to 30% for mass
balance if Flour is set at 60% and Egg at 10%. When the last lower
bound is placed on Sugar, this changes the entire set of upper
bounds, while all lower bounds remain intact. When Egg and Sugar
are both at their lower bounds, this accounts for 10% + 15% = 25%
of the total mixture, which implies that the upper bound on Flour is now
75% to account for there always being a minimum proportion of Egg
and Sugar in the mixture.
When there is 75% Flour in the mixture and Egg is at its lower
bound of 10%, the remaining 15% must be Sugar; this is exactly the
lower bound imposed on Sugar. Now, if Flour is set to 60% and the
lower bound on Sugar is 15%, the upper bound on Egg is implicitly
changed to 25%. The upper bound on Sugar remains the same. The
final constraints placed on the design are provided as follows.
Figure 11.77: Pancake mixture design constrained by three lower bounds in real units.

60% ≤ Flour ≤ 75%
10% ≤ Egg ≤ 25%
15% ≤ Sugar ≤ 30%

These lower (and implied upper) bounds form a simplex shaped
design region, which is shown in Figure 11.77 in Real Units.
The centroid of this design is calculated as follows:
Difference between upper and lower bound Flour = 75 – 60 = 15%
Difference between upper and lower bound Egg = 25 – 10 = 15%
Difference between upper and lower bound Sugar = 30 – 15 = 15%
Number of components p = 3.
Amount added to each lower bound to reach the centroid = 15 / 3 = 5%
Centroid composition:

Flour = 60 + 5 = 65%
Egg = 10 + 5 = 15%
Sugar = 15 + 5 = 20%
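This centroid calculation generalises to any purely lower-bounded mixture: each component receives its lower bound plus an equal share of whatever remains up to 100%. A minimal sketch (illustration only) follows.

```python
def lower_bound_centroid(lower_bounds):
    """Centroid of a mixture region defined purely by lower bounds (in %):
    each component gets its lower bound plus an equal share of the remaining
    percentage, so the proportions still sum to 100%."""
    remaining = 100.0 - sum(lower_bounds.values())
    share = remaining / len(lower_bounds)
    return {name: lb + share for name, lb in lower_bounds.items()}

print(lower_bound_centroid({"Flour": 60, "Egg": 10, "Sugar": 15}))
# {'Flour': 65.0, 'Egg': 15.0, 'Sugar': 20.0} -- matches the calculation above
```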

It is now the task of the experimenter to perform the defined
experiments and determine if there is one formulation that results in
optimal responses such as cooking time, taste, appearance, cost etc.

Upper bounds in the mixture space
In general, upper bounds result in non-simplex design spaces. There
are situations where upper bounds can result in a simplex region, but
this is the exception rather than the norm. Figure 11.78 shows the
mixture design space for p = 3 with one, two and three upper bounds
applied, respectively.
After the first bound is imposed on the design, the design space is
already non-simplex and the full set of bounds on the design is,

0% ≤ Flour ≤ 60%
0% ≤ Egg ≤ 100%
0% ≤ Sugar ≤ 100%

When the constraint for Egg is placed on the design, the following
further constraints are imposed,

0% ≤ Flour ≤ 60%
0% ≤ Egg ≤ 20%
20% ≤ Sugar ≤ 100%

When Flour is at 60% and Egg is at 20%, the lower bound of 20%
is imposed on Sugar. Now when the upper bound is imposed on
Sugar, the following constraints are calculated,

0% ≤ Flour ≤ 40%
0% ≤ Egg ≤ 20%
60% ≤ Sugar ≤ 100%
Figure 11.78: Mixture design space for p = 3 components for one, two and three upper
bounds imposed.

A computer is required to generate this design, using methods
such as the D-Optimal design algorithm discussed above. An
example of a D-Optimal design space for the constraints listed above,
in 11 experimental runs, is provided in Figure 11.79.
The design space in Figure 11.79 shows the standard error for the
candidate points selected. The configuration of points is chosen to be
as orthogonal as possible and also to minimise the standard errors
over the entire design space.

The U-simplex: a special case of upper bounds


In some cases, when two or more upper bounds are imposed on
component proportions, a special type of inverted simplex, known as
the U-Simplex (Crosier [7]), can be generated. This occurs when the
condition of equation 11.24 is met,

ΣUi – Umin ≤ 1 (11.24)

where Ui are the individual upper bounds on the p components
and Umin is the minimum upper bound imposed on the design.
Looking at the previous example of the pancake mixture with only
upper bounds, application of equation 11.24 yields a value of (40 + 20 +
100) – 20 = 160 (or 1.6 in proportions) and, as shown graphically, the
design space is not simplex shaped. However, consider the design
space with the following bounds,
Figure 11.79: Design space for the constrained pancake mixture design generated using an
11-point D-Optimal design.

10% ≤ Flour ≤ 40%
30% ≤ Egg ≤ 60%
0% ≤ Sugar ≤ 30%

Application of equation 11.24 results in the value (40 + 60 + 30) – 30
= 100 (or 1 in proportions). This design should now have a
simplex shape and this is shown in Figure 11.80.
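The condition of equation 11.24 is easy to check programmatically; the following minimal sketch (illustration only) reproduces the two calculations discussed here, working in proportions rather than percentages.

```python
def is_u_simplex(upper_bounds):
    """Crosier's condition: the region defined by upper bounds alone is an
    inverted (U-) simplex when the sum of the upper bounds minus the smallest
    upper bound does not exceed 1."""
    value = sum(upper_bounds) - min(upper_bounds)
    return value, value <= 1.0

print(is_u_simplex([0.40, 0.20, 1.00]))   # (1.6, False) -- not simplex shaped
print(is_u_simplex([0.40, 0.60, 0.30]))   # (1.0, True)  -- U-simplex
```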
Now that the design spaces for the various types of mixture
designs have been defined and their analysis has been discussed, the
following example shows how mixture design constraints can be
developed and analysed when formulating a product for commercial
manufacture.
Example: design of a fruit juice beverage

This exercise is adapted from Cornell [5] to illustrate the basic
principles and specific features of constrained mixture designs.

Define stage
A fruit punch is to be prepared by blending three types of fruit juice:
watermelon, pineapple and orange. The purpose of the manufacturer
is to use their large supplies of watermelons by introducing
watermelon juice, of little value by itself, into a blend of fruit juices.
Therefore, the fruit punch has to contain a substantial amount of
watermelon—at least 30% of the total. Pineapple and orange have
been selected as the other components of the mixture, since juices
from these fruits are easily obtainable and relatively inexpensive.

Figure 11.80: The U-simplex design space for the pancake mixture design.

The objective of this design is to include as much watermelon as
possible into the juice and, at the same time, keep the acceptance
level above 6.5.

Design stage
The manufacturer decides to use experimental design to find out
which combination of the three ingredients maximises consumer
acceptance of the taste of the beverage. The ranges of variation
selected and the upper and lower bounds for the experiment are
listed in Table 11.31.
Since it is the purpose of this experiment to optimise the fruit juice
blends, a design that covers the interior, including axial check blends,
will be used. The design chosen was the Simplex Centroid design in
10 experimental runs. The design contains only one lower bound, which
means that the design space is simplex shaped and the usual Scheffé
polynomials can be applied. To generate this design, Design-Expert®
was used. The design table was sorted by Standard (Std) Order. The
experiments were run and the design table now includes all of the
responses for analysis (Figure 11.81).
The Real Contour plot of Figure 11.82 shows the actual design
space and the standard errors associated with the design. It also
shows the constrained design space with respect to the overall
unconstrained space.

Table 11.31: Ranges of variation for the fruit beverage design.

Ingredient    Low    High    Centroid
Watermelon    30%    100%    54%
Pineapple     0%     70%     23%
Orange        0%     70%     23%

Analyse stage
The Fit Summary suggests a Quadratic Model is the best fit for the
response Acceptance (Figure 11.83).
The ANOVA table for the quadratic model is provided in Figure
11.84.
It is noted in the ANOVA table that the model is significant. Only
the term Watermelon × Orange (AC) is insignificant. Note that in a
mixture model, main effects cannot be removed from the design due
to their dependencies on each other. This is why the Linear Mixture
effect is stated as the total main component contributions to the
model.
The term AC was removed and the ANOVA table for this modified
quadratic model is provided in Figure 11.85.
All terms are significant in the model and the residual is small
compared to the model sums of squares. This is reflected in the
supplementary statistics provided by the model (see Frame 11.7).
The Normal Probability plot of Residuals is shown in Figure 11.86.

Figure 11.81: Design table for fruit punch example.


Figure 11.82: Real mixture contour plot of fruit beverage design.

This plot indicates that there are no outliers. The Residuals vs
Predicted and the Residuals vs Run plots (data not shown) also
indicated that there were no outliers or trending in the data. The
Predicted vs Actual plot of Figure 11.87 shows that the model is also
highly predictive (as also evidenced by the high R-Squared, Adjusted
R-Squared and Predicted R-Squared values).
The Contour Response Surface plot for this analysis is provided in
Figure 11.88.
Figure 11.83: Model fit details for response, acceptance, fruit beverage design.

Returning to the objective of the experiment, the goal was to
include as much watermelon juice as possible and keep the
acceptance above 6.5. To do this, the method of graphical
optimisation will be used. The results of the optimisation are provided
in Figure 11.89.
From this Overlay plot, a good starting point for the fruit juice
recipe would be the axial check blend point at,

Figure 11.84: ANOVA table for response, acceptance, fruit beverage design.
Figure 11.85: ANOVA table for response, acceptance, after the insignificant term AC was
removed, fruit beverage design.

Frame 11.7: Supplementary statistics for fruit beverage mixture model.

Watermelon = 42%
Pineapple = 12%
Orange = 46%

This maximises the usage of Watermelon, has a good proportion
of the highly desirable Orange flavour and balances out the rest with
the tropical taste of Pineapple. This will produce a product that meets
the minimum acceptance criteria (i.e. Acceptance = 6.7 compared to
the minimum of 6.5).

Improve stage
The power of experimental design has been demonstrated again
using this case study. A constrained design, which forces there to be
a minimum amount (above zero) of a single component, has shown
that an acceptable recipe for a fruit beverage is possible and can be
formulated from the three components used.
It is now the task of the company to make this recipe again (and
possibly use a smaller designed experiment around this target
formulation) to test the robustness of the recipe to small, but
deliberate changes. These would be used as a blind taste test
strategy on both a trained and untrained sensory panel to see if the
acceptance data can be validated. From there the company’s marketing
team can promote this new product to the general public.

Using partial least squares regression for analysing non-standard designs

One of the main issues with analysing computer-generated designs is
that, typically, the method of MLR is used. MLR is highly useful when
the X-matrix is orthogonal, as the regression coefficients generated
are independent of each other. Although computer-generated designs
are as close as possible to orthogonal, there is still a risk that using
MLR may lead to covariance issues.
Bi-linear methods such as PCR and PLSR (chapter 7) are highly
useful for analysing such non-orthogonal designs as they reduce the
data to orthogonal components/factors. The only real disadvantage
with respect to standard methods of DoE analysis is that PCR/PLSR
do not provide the ANOVA table output that is usually expected.
However, this disadvantage is offset by a major advantage of PLSR
that is observed when multiple responses are being analysed: the
PLSR loadings plot not only shows the correlations between the
predictors and the responses, but also allows the correlations between
the responses themselves to be evaluated. The method of MLR
and other DoE methods, in general, tend to treat all responses as
independent of each other.
Some problems measure in the order of 10–50 responses per
design. Analysing these one at a time equates to an OVAT situation in
the responses and it may therefore be difficult to see the
interrelationships between them. The PLSR loadings (and corresponding
Correlation Loadings) plots provide a map of factors and responses.
Rather than using the traditional statistics of DoE, the statistics of
PLSR are used to make decisions. This approach will be presented
using an example.
A formulator of a pharmaceutical tablet wants to assess the
effects of Granulation Time and Lubrication Time on three Critical
Quality Attributes (CQAs),
Dissolution Time after 30 mins (%)
Friability (%)
Hardness (kP)
In this experiment, running both experimental factors at their high
and low settings will not produce meaningful results. The ranges of
variation for the two factors are provided in Table 11.32.

Figure 11.86: Normal probability plot of residuals for response acceptance, fruit beverage
design.
Figure 11.87: Predicted vs actual plot for response acceptance, fruit beverage design.
Figure 11.88: Response surface plot of response acceptance, fruit beverage design.
Figure 11.89: Graphical optimisation of response acceptance, fruit beverage design.

To avoid these extreme conditions, the following constraint was
placed on the design,

7 ≤ Granulation Time + Lubrication Time ≤ 12

The design space shown in Figure 11.90 was generated for this
experiment.
The points in Figure 11.90 are chosen to best cover the entire
design space and result in a space that is most orthogonal under the
given constraint. The standard error for this design is spread evenly
over the design space. This design could now be analysed using
standard DoE methods; however, it will be analysed using PLSR to
show how the multiple responses relate to each other under the
constraint placed on the design space.
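To picture how such a constraint restricts the factor space, the sketch below filters a candidate grid against the constraint; the factor ranges used (2–8 min for both times) are assumptions chosen purely for illustration, since the actual ranges are those of Table 11.32, and the sketch is not the D-Optimal algorithm itself.

```python
import numpy as np

# Assumed, illustrative factor ranges; the real ranges are given in Table 11.32.
granulation = np.linspace(2, 8, 25)   # Granulation Time (min), assumed
lubrication = np.linspace(2, 8, 25)   # Lubrication Time (min), assumed
GT, LT = np.meshgrid(granulation, lubrication)
candidates = np.column_stack([GT.ravel(), LT.ravel()])

# Keep only the candidate points that satisfy 7 <= GT + LT <= 12.
total = candidates.sum(axis=1)
feasible = candidates[(total >= 7) & (total <= 12)]

# A D-Optimal algorithm would then select a small subset of these feasible
# candidates that (approximately) maximises the determinant of X'X.
print(feasible.shape)
```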
A first model with only linear terms was calculated and the PLSR
Overview generated by The Unscrambler® is shown in Figure 11.91.
The Explained Variance plot is a first indicator of lack of fit of the
linear model. The model was validated using a test set based on the
points replicated in the design. This plot shows that the validation set
models better than the calibration set and that, at most, about 60% of
the information can be extracted from the data by the linear model.
The t1 vs t2 scores plot shows that the approximate shape of the
design space (refer to Figure 11.90) is apparent in the data. This is a first
indicator that the responses have been measured correctly by the
reference method. The red validation scores show the replicated
points used for the test set. There are no indications that any
replicate run is different from the original calibration run.
The loadings are shown as correlation loadings. This is the major
benefit of PLSR for analysing multiple responses: not only is it
possible to see which X-variables relate to the individual Y-variables,
but the Y-variable covariance structure is revealed. In this case
Friability and Hardness are correlated. Dissolution is not correlated
well to any X-variable for PLSR factors 1 and 2. This is indicated by
the poor model fit in the Predicted vs Reference plot. Since there are
three responses, the Predicted vs Reference plots for Friability and
Hardness for the linear model are provided in Figure 11.92.
Figure 11.90: Design space for the tablet formulation study.
Figure 11.91: PLSR overview of the tablet data for a purely linear model.

The data in Figure 11.92 show a better fit of the linear model for
Friability and Hardness compared to Dissolution; however, there is
still a distinct curvature that requires a higher order model.
The form of the linear model for this analysis is,

y = β0 + β1GT + β2LT

where y is one of the responses Dissolution, Friability or Hardness, GT
is the Granulation Time and LT is the Lubrication Time.
It is possible to add Interaction and Square effects to a PLSR
model, where the model takes the form,

y = β0 + β1GT + β2LT + β12GT·LT + β11GT² + β22LT²
The PLSR Overview of this model is provided in Figure 11.93.


In this case, the Explained variance plot shows that a model that
explains >95% of the Y-variables is possible for three PLS factors.
With the inclusion of the interaction and square terms, the number of
X-variables increases from two to five. Three PLSR factors is therefore
a satisfactory number as this would not lead to overfitting and can
account for the two main effects and the curvature in the system.
The Scores plot now shows a different shape compared to the
linear model to account for samples that lead to curvature. The
Correlation loading plot shows that PLSR factors 1 and 2 are most
important for modelling Friability and Hardness. Friability is mostly
correlated to PLSR factor 1, while Hardness is correlated to PLSR
factors 1 and 2 equally. Since the variable Lubrication Time (and its
square term) lies in the same space as Friability and Hardness,
increasing this time also increases these responses. The opposite
holds for Granulation Time: if it is increased, both Friability and Hardness
are decreased.
The Predicted vs Reference plots for Friability and Hardness are
provided in Figure 11.94.
The predicted vs reference plots for Friability and Hardness show
excellent fit to the quadratic model.

Figure 11.92: Predicted vs reference plots for friability and hardness for tablet data.

Dissolution is described by PLSR factor 3. This is shown in the
factor 1 vs factor 3 correlation loadings plot of Figure 11.95.
The correlation loadings plot of Figure 11.95 shows that
Dissolution is correlated to Granulation Time in PLSR factor 3. It can
now be shown that,
Friability is primarily correlated to PLSR factor 1
Hardness is correlated to PLSR factor 1 but mainly correlated to
PLSR factor 2
Dissolution is correlated to PLSR factor 1 but mainly correlated to
PLSR factor 3
This is also an indication of the design being close to orthogonal
as there are three PLS factors to describe three responses.
The design was also analysed using standard DoE methods and in
all cases, the quadratic model was found to be significant for all
responses. The objective of the design was to develop a process that
resulted in tablets with the following performance characteristics
(CQAs),

Dissolution ≥ 75% after 30 mins
Friability ≤ 0.5%
Hardness ≥ 10 kP

The method of Graphical Optimisation was used to determine
whether process conditions could be established to consistently
manufacture tablets with the above CQAs. This is shown in Figure
11.96.
The old saying that time equals money should be taken into
account for this example. The ability to reduce both Granulation Time
and Lubrication Time can, in some situations, lead to better process
equipment utilisation and improved productivity, whilst
maintaining a high level of quality. A Granulation Time of 5 min and a
Lubrication Time of 4 min would result in a tablet with all performance
characteristics in their acceptable ranges. Table 11.33 provides the
predicted values of the design points in Figure 11.96 that lie in the
optimal (yellow) region and the suggested point defined previously
using the PLSR model.
From Table 11.33, the suggested optimal point number 1 is the
best option as it lowers the overall manufacturing time and results in
a tablet with all of the performance characteristics in specification.
The regression coefficients for each response are provided as
follows,

Figure 11.93: PLSR overview of the interaction and squares model for the tablet data.

These equations contain the interpretable part of the model. For
example, taking the model for Dissolution, the sizes of the regression
coefficients are of similar magnitude, indicating that all terms
contribute roughly equally. Granulation Time and Lubrication Time both
have positive influences on Dissolution, while their square terms
negatively impact the response. The interaction term Granulation
Time × Lubrication Time has a synergistic effect on Dissolution. The
effects of different Granulation Times and Lubrication Times were
presented in Table 11.33.
When analysing non-orthogonal designs with multiple responses,
it is suggested that the PLSR method be used in conjunction with
traditional DoE analysis. This ensures that the covariance structure
between the responses can easily be visualised and also, the PLSR
algorithm results in orthogonal components which provide better
interpretation of the final model.

11.3 Chapter summary


Design of Experiments (DoE) is a highly valuable set of tools that each
experimental scientist or process engineer should have in their skill
set. Rather than performing ad hoc, one-variable-at-a-time (OVAT)
experiments, which rarely lead to valid conclusions, DoE
simultaneously varies the experimental factors in a systematic way,
allowing for better coverage of the design space in the minimal
number of experiments.
Figure 11.94: Predicted vs reference plots for friability and hardness, interaction and squares
model, tablet data.
Figure 11.95: Correlation loadings plot for PLSR loadings 1 and 3 showing the relationship
between granulation time and dissolution, tablet data.

Not only is the experimental design space covered in a minimum
of experiments, but a maximum amount of information can be
obtained from the data due to the mathematical models used in its
analysis. This means that there are exact models for the design
spaces generated that provide independent assessment of not only
the main effects of the experimentally varied factors, but also their
interactions and higher order terms (quadratic, cubic etc.).
Designed experiments can be used for a number of purposes, the
main purposes being,
Screening Designs: Used to isolate a small number of factors from
a large starting set in a small number of experiments.
Factor Influence Designs: Used to take a small number of
important factors to better understand their interaction with each
other.
Optimisation Designs: Used to provide a detailed understanding of
a localised area of the design space to better understand its
stability when factors are slightly changed in the region of most
importance.
Corresponding to each purpose listed above are specific classes
of designs for reaching the objectives; these are,
Plackett–Burman and Low Resolution Fractional Factorial designs
that put more emphasis on main effects than interactions. There is
a risk of important interactions being mistaken for main effects;
however, when a large number of factors has to be screened, this
is a risk that is most often accepted.
Full Factorial and High Resolution Fractional Factorial designs are
used to better understand main effects and Two-Factor
Interactions (2FIs). These typically result in a predictive model that
can be used to isolate local and global maxima and minima in the
design space.
Central Composite and Box–Behnken designs are useful for
optimising responses based on the location of the model maxima
or minima defined by a factor influence study. These models
typically fit quadratic or higher order models to better understand
the shape of the optimal region.
Designed experiments are typically sequential in the way they are
built. This gives them excellent properties in terms of the amount of
experimentation that needs to be performed. For example, a well-
trained professional in DoE will typically start an investigation with a
typically large set of factors (maybe 5–10 initial factors; in some
cases, this could be in the 50s) and take the risk of using a low
resolution fractional factorial design to isolate main effects. In most
cases, many unimportant factors are eliminated, resulting in the
ability to reanalyse the data without having to perform more
experiments.
Figure 11.96: Graphical optimisation overlay plot of dissolution, friability and hardness
responses based on the performance characteristics defined for the tablet data.

Table 11.33: Predicted performance characteristics for selected
design points and an optimal point using the PLSR model, tablet data.

Point   Granulation   Lubrication   Total process   Dissolution   Friability   Hardness
        time (min)    time (min)    time (min)      (%)           (%)          (kP)
1       5.00          4.00          9.00            84.5          0.29         11.36
2       4.12          3.27          7.39            71.2          0.38         12.98
3       4.80          4.87          9.67            82.9          0.38         11.60
4       5.40          3.50          8.90            83.3          0.27         10.78
5       6.52          3.50          10.02           76.4          0.28         10.10
6       5.30          6.70          12.00           80.7          0.62         10.37
7       7.00          5.00          12.00           87.3          0.28         10.83

In some cases, an extra set of experiments may need to be run if
there are interactions that cannot be resolved; however, this is just an
addition to the work already done, not a repeat. When the main
effects and their interactions can be understood, the design can be
extended to an optimisation design, again without having to repeat
any previous work.
When designs are extended, the influence of days, operators,
materials etc. may sometimes be uncontrollable and a means of
assessing whether any systematic effects, not associated with the
experimental factors, are present should be employed. This is achieved
through blocking, which allows an experimenter to take higher order model
terms (Three-Factor Interactions, or higher) and confound them with
the block effect. If the block effect is found not to influence the
design, the entire combined data set can be analysed in one model. If the
block effect is large, then the systematic effect introduced by
performing the experiment in more than one block is too large to
generate an interpretable model. Blocking can also be assessed
using centre points. These do not take part in most factorial models,
but are used to assess experimental error and also curvature in the
response. When curvature is present, a higher order
model (quadratic etc.) is often required.
The ANOVA table is the main diagnostic tool used to assess the
quality of DoE models. It partitions the initially unexplained variability
into a part that can be explained by the fitted model (Sum of Squares
for Regression) and a part that cannot be explained by the fitted
model (Sum of Squares for Residual). Each model term can be
assessed for significance using an F-Test that ratios the Model Mean
Squares with the Residual Mean Squares. Once the model is found to
be statistically significant and the residuals are found to be
statistically insignificant, the model can be diagnosed and interpreted
using the many graphical procedures associated with the DoE
methodology.
One of the most important graphical tools associated with DoE is
the Response Surface. This is either a 2D or a 3D contour plot that
shows how the response changes over the entire design space.
When multiple responses are being modelled, the overlay of a
number of response surfaces using the method of Graphical (or
Numerical) Optimisation can be used, where simultaneous limits can
be put on responses and factors to determine if there is a region
where all responses exist that meets some predefined optimal point
in a process or product. This is another powerful advantage of the
DoE methodology over other types of experimentation and analysis
procedures (in particular the ad hoc “scientific method”).
When the factors are proportions of materials in a formulation, the
standard factorial designs cannot be applied in their current form. In
these cases, the so-called Mixture Designs are used. Like their
factorial counterparts, mixture designs can be used for screening,
component influence and optimisation. The major difference between
mixture designs and factorial designs is that mixture components are
constrained and the component values are dependent on each other,
e.g. in a three component mixture, if the proportions of two
components are known, the third component can be calculated by
difference. There are two main outcomes of these constraints on the
design space,
1) The equations fitted now take into account the dependency of the
components on each other. These are known as Scheffé models.
2) The response surfaces are now simplex shaped, i.e. they are the
simplest shapes that describe p-components in p – 1 space.
These models are interpreted in the same way as factorial models.
The typical ANOVA and diagnostics outputs for factorial models are
also available for mixture models.
When situations occur that require the design to be constrained to
a specific region of space that is not supported by the exact DoE
models (available for orthogonal designs), Computer Generated
designs are used for both factorial and mixture situations; these
produce design spaces that are almost orthogonal in structure. This
means that the traditional models of DoE can be fitted with a slight risk
of co-dependency between the regression coefficients. If this
situation is likely to cause concern to the analyst, multivariate
regression methods such as PCR or PLSR can be used to analyse
over all other methodologies in that it reveals the covariance structure
between the experimental factors and the response and also the
covariance structure between the responses.
DoE is a methodology that is universal in its approach when
experimental factors are controllable. This is why it has found
widespread usage in the food & beverage sector and in recent times,
has become a fundamental tool in the pharmaceutical and related
industries in the Quality by Design initiative (chapter 13). It is
unfortunate that many universities still preach the OVAT approach to
experimentation and that so many undergraduate and post-graduate
students come out of these institutions with bad habits instilled into
their mindset. The adoption of DoE is a fundamental paradigm shift
from this limited mindset into situations where better decisions can
be made.
Once optimal conditions have been established, multivariate
methods can then be employed in the form of
Multivariate Statistical Process Control (MSPC) to maintain the
conditions set in the design space. In the remaining chapters of this
book a larger synthesis between DoE, PAT, process sampling
(TOS) and QbD, among other topics, will be presented, carefully laid
out for the learning experience of the reader.

11.4 References
[1] Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978). Statistics for
Experimenters, An Introduction to Design, Data Analysis and
Model Building. John Wiley & Sons.
[2] Montgomery, D.C. (2005). Design and Analysis of Experiments,
6th Edn. John Wiley & Sons.
[3] Anderson, M.J. and Whitcomb, P.J. (2015). DOE Simplified:
Practical Tools for Effective Experimentation, 3rd Edition. CRC
Press. https://doi.org/10.1201/b18479
[4] Myers, R.H. and Montgomery, D.C. (2002). Response Surface
Methodology, Process and Product Optimization Using
Designed Experiments, 2nd Edn. John Wiley & Sons.
[5] Cornell, J. (2002). Experiments with Mixtures, Designs, Models
and the Analysis of Mixture Data, 3rd Edn. John Wiley & Sons.
https://doi.org/10.1002/9781118204221
[6] Smith, W.F. (2005). Experimental Design for Formulation.
American Statistical Association and the Society for Industrial
and Applied Mathematics.
https://doi.org/10.1137/1.9780898718393
[7] Crosier, R.B. (1984). “Mixture experiments: geometry and
pseudo-components”, Technometrics 26, 209–216.
https://doi.org/10.1080/00401706.1984.10487957
Chapter 12. Factor rotation and
multivariate curve resolution—
introduction to multivariate data
analysis, tier II

This chapter will introduce the reader to selected topics from tier II of
multivariate data analysis; issues and methods that definitely cannot
be understood, far less mastered, without significant experience with
the introductory curriculum in all preceding chapters, especially the
demonstration data analyses and comprehensive hands-on
exercises. This chapter can be viewed as a bridge between
the current ideas in this book and more advanced multivariate data
analysis methods, which the authors strongly suggest that the reader
pursue after mastering the basics.

12.1 Simple structure


Possibly the most comprehensive overview of principal component
analysis (PCA) is the book by Jackson [1]. In his book, Jackson
introduces the concept of “simple structure”. In particular, when PCA
is performed, the components may not always be physically or
chemically interpretable in a direct and easy fashion (this is not the
case in very many situations, but it does merit appropriate solutions).
As the components in PCA are estimated on the basis of explaining
the variance in the one data set only, it does not necessarily mean
that the model reflects the true underlying phenomena in a broader
sense. Principal Components (PCs) are in certain cases abstract
variables that may not have real meaning in the physical sense. An
example of this situation arises in the analysis of spectroscopic data
when the spectra are measured on a positive absorbance scale
between 0 and 2 absorbance units. When PCA is applied to such
data and negative loadings are generated, these have no specific
meaning with respect to the physics behind the original data;
however, they certainly relate to real correlations within the variables
that result in the observed variations as seen in scores and loadings.
In some situations, one may get the understanding that the
underlying structure is a combination of, e.g., PC1 and PC2. In this
case, the overall variation in the system cannot be attributed to a
single underlying phenomenon (a single interpretable component) but
to a combination of phenomena that needs to be interpreted
collectively. In such a case, a “rotation” of the primary PCA
components may lead to alternative components that may result in
simpler, yet more intelligible possibilities for causal interpretation of
the system. Such a rotated system will then lead to what is termed
“simple structure”. After the application of a rotation to the PCA
scores, the rotated loadings will express the exact same fraction total
variance modelled as the primary solution, but now rotated along the
PC directions with as little overlap as possible. In this way, the
original variables are as correlated with one-another and/or as
independent from each other as possible.

12.2 PCA rotation


When a primary PCA model is established, it is emphasised that the
model dimensionality should be decided upon before any rotation is
performed. As described in chapter 4, the dimensionality or rank of
the model is to be understood as the number of components that
contain the greatest possible structural information given the
application situation (and is discarding just the right proportion of
noise for this purpose), the optimal number of PCs, Aopt. It is thus
always necessary to first run a standard PCA and use the validation
variance curve, score and loading plots and background knowledge
to decide on the optimal number of PCs. On this background,
optional rotation approaches may then come to the fore.
Harman [2] has listed useful criteria associated with a simple
structure.
1) Each row of the factor matrix should have at least one zero. A
factor matrix is the rotated PCA scores and loadings combination
that leads to simple structure and therefore is different from the
original scores and loadings decomposition. In particular, scores
and loadings are not necessarily independent of each other.
2) If there are m common factors, each column of the factor matrix
should have at least m zeros.
3) For every pair of columns of the factor matrix there should be
several variables whose entries vanish in one column but not in the
other.
4) For every pair of columns of the factor matrix, a large proportion of
the variables should have vanishing entries in both columns when
there are four or more factors.
5) For every pair of columns of the factor matrix there should be only
a small number of variables with non-vanishing entries in both
columns.
Put simply, these criteria relate to a process known as Orthogonal
rotation, which takes the original PC data structures and attempts to
find a rotation that better aligns the components with the dominant
variance directions of the original variables. Orthogonal rotation is used
a lot in the field of factor analysis. The rotated PCs are here known as
factors and, since they no longer coincide with the original PCA axes,
they may lead to more useful interpretation in the physical, although not
necessarily in the mathematical sense. Observe that the set of
primary PCs are mutually orthogonal—and so is the new set of
rotated components, hence the name: orthogonal rotation is a rigidly
rotated set of orthogonal Aopt PC axes.
There are other types of rotations, known as oblique rotation
methods, that attempt to rotate the PC components independently
of one another. The method of independent component analysis
(ICA) is one such oblique rotation approach, however, its description
is outside of the scope of the current book and the interested reader
is referred to the excellent chapter by Westad and Kermit [3] on this
topic.
This chapter presents the concepts behind various orthogonal
rotation methods. The concept of orthogonal rotation is shown in
Figure 12.1.
Mathematically, rigid rotations can be performed by multiplying an
original matrix with a matrix that defines, in this case, an orthogonal
rotation. Consider a matrix U = [u1 u2] that is defined to have orthogonal
columns. Now consider an orthogonal rotation transform matrix, Θ, which
for the two-column case can be written

Θ = [ cos(θ)  −sin(θ)
      sin(θ)   cos(θ) ]

in which the angle θ represents the rotation to be applied to the
column vectors in U. The resulting rotated matrix U′ can be
calculated as follows,

U′ = UΘ

Figure 12.1: The concept of orthogonal rotation.

Two cases in particular are instructive.
When θ = 0, cos(θ) = 1 and sin(θ) = 0; Θ is then the identity matrix and
no rotation is applied, so U′ = U.
When θ = π/2, cos(θ) = 0 and sin(θ) = 1; the matrix is rotated by 90° and
the result is

U* = UΘ = [ u2  −u1 ]

where U* represents the rotated matrix.
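To make the rotation operation concrete, the following numerical sketch (Python with NumPy; the matrix U and the angle are arbitrary illustrations, not data from this book) applies the 2 × 2 rotation matrix above to a matrix with two orthonormal columns, and verifies that the rotated columns remain orthonormal and that θ = π/2 simply swaps the columns with a sign change.

import numpy as np

theta = np.pi / 6                         # an arbitrary 30 degree rotation
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                # two orthonormal columns (hypothetical example)

# Orthogonal (rigid) rotation matrix for the two-column case
Theta = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

U_rot = U @ Theta                         # U' = U * Theta

# The rotated columns are still orthonormal: U'^T U' is the identity matrix
print(np.round(U_rot.T @ U_rot, 10))

# theta = pi/2 swaps the columns (with a sign change), i.e. a 90 degree rotation
Theta_90 = np.array([[0.0, -1.0],
                     [1.0,  0.0]])
print(U @ Theta_90)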

12.3 Orthogonal rotation methods


At the outset, it is essential to state that orthogonal rotation does not
work for every situation and should not form a routine operation after
every primary PCA. It is used to simplify interpretation, and the new
rotated components are no longer uncorrelated—so it should be used
only when needed. The orthogonally rotated model explains the same
proportion of the original data set variance as did the primary PCA,
but the individual rotated components will, in general, not explain the
exact same proportions (some will be larger, others smaller). The
partition of the total variance has changed, while the total model
variance explained remains exactly the same; it has to, because the
same number of PCs is used to model the data structure. The rotated
variance axes (rotated components) are what produce the possibility
of augmented interpretability.
In order to derive rotated PCs (factors), four well-known methods
can be used. These are listed below and will be described in more
detail in the following sections,
1) Varimax rotation
2) Quartimax rotation
3) Equimax rotation
4) Parsimax rotation

12.3.1 Varimax rotation


Varimax rotation is the most commonly used method and was first
proposed by Kaiser [4] in 1958. Varimax rotation allows for alignment
of the PCs with the most important variables, by maximising the
variance of the squared loadings along the rotated PCs. Mardia, Kent
and Bibby [5] define Varimax rotation as the process of generating
axes with a few large loadings and as many ~zero loadings in the
rotated direction as possible. Thus, one may directly interpret the
rotated PCs as the directions along which the most significant
variables are to be found.
Rotation can be defined so as to maximise the variance of the
squared loadings, given by the variance measure ν in equation 12.1,
where n is the number of samples, t is the scores, h a normalisation
factor and γ a scaling factor defining different types of rotation.
For Varimax rotation, the value of γ in equation 12.1 is 1.
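As a concrete illustration of how such a rotation can be computed, the sketch below (Python with NumPy) implements the well-known iterative, SVD-based orthomax algorithm for rotating a loadings matrix. It is a generic illustration of the criterion family discussed here and in the following subsections (γ = 1 Varimax, γ = 0 Quartimax, γ = A/2 Equimax), not the specific implementation used by The Unscrambler®.

import numpy as np

def orthomax_rotation(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    # gamma = 1 -> Varimax, gamma = 0 -> Quartimax, gamma = A/2 -> Equimax
    p, k = loadings.shape
    R = np.eye(k)                          # start from "no rotation"
    criterion = 0.0
    for _ in range(max_iter):
        L = loadings @ R                   # currently rotated loadings
        # update matrix derived from the orthomax criterion
        G = loadings.T @ (L**3 - (gamma / p) * L @ np.diag(np.sum(L**2, axis=0)))
        u, s, vt = np.linalg.svd(G)
        R = u @ vt                         # nearest orthogonal rotation matrix
        new_criterion = np.sum(s)
        if new_criterion < criterion * (1.0 + tol):
            break                          # criterion no longer improving
        criterion = new_criterion
    return loadings @ R, R

# Hypothetical usage on an already fitted PCA model with loadings P and scores T:
# P_rot, R = orthomax_rotation(P, gamma=1.0)   # Varimax-rotated loadings
# T_rot = T @ R                                # rotate the scores with the same R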

12.3.2 Quartimax rotation

Quartimax rotation was first introduced by Neuhaus and Wrigley [6]


and is more likely to produce a “general component” than varimax.
This is because Quartimax attempts to simplify the rows in the so-
called pattern matrix. Refer to Darton [7] for more details on this.
For Quartimax rotation, the value of γ in equation 12.1 is 0.

Malinowski and Howery [8] further describe Quartimax rotation as
a method that preserves the angles between the eigenvector axes
and thus searches for a set of orthogonal axes that groups clusters
of points about each axis, with each point having either a high or a
low loading on each axis. A disadvantage of Quartimax rotation is that it
tends to overload the first factor, producing one large general factor
and many small and insignificant subsequent factors.

12.3.3 Equimax rotation

By changing γ to (Num PCs/2), the Equimax rotation of Saunders [9]


is obtained. Equimax was developed to improve on Varimax rotation
whereby important variables are more evenly distributed over factors,
as opposed to the situation for Quartimax rotation described above
(in a sense, Equimax is a hybrid of Varimax and Quartimax).

12.3.4 Parsimax rotation


The Parsimax rotation was first described by Crawford and Ferguson
[10]. The Parsimax criterion is an extension of the Equimax principle
and states that weights can be chosen to maximise parsimony, i.e.
simple structure, regardless of the number of factors rotated.
For an excellent overview of the principles of factor rotation and all
of the methods described above in the behavioural sciences, the
interested reader is referred to the review by Browne [11].

12.4 Interpretation of rotated PCA results


The results of a rotated PCA are to be interpreted in a similar way as
that of normal PCA. In practice, one would first study the original PCA
model and diagnose it with respect to outlier screening, and then find
the optimal number of PCs. Once this has been established, an
orthogonal rotation (one of the methods defined in section 12.3) can
be applied to the data based on some background knowledge of the
application of the particular method. Software such as The
Unscrambler® will present the rotated PCA overview, which typically
contains the same plots as given for a PCA model without rotation,
i.e. scores, loadings and influence plots can be interpreted
collectively, quite as usual. The residuals and variance plots,
however, contain only “Calibration” results, since no validation is
performed at the rotation stage.
Rotation is perhaps best described using the example provided in
the next section.

12.4.1 PCA rotation applied to NIR data of fish samples


Quality control in the fishing industry is a highly important application
as it may determine the market the fish go into, which in some cases,
can demand a premium price depending on the quality. Technologies
such as NIR spectroscopy provide a means of rapid, non-destructive
assessment of fat content, which can be performed, at the point of
receipt, for grading. There are some limitations of the method though,
most prominently one related to non-linearities associated with the fat
signals, primarily influenced by the high moisture content of biological
samples. These effects must be addressed through use of
appropriate preprocessing of the spectra (chapter 5).
In the following analysis, the application of PCA will be used for
understanding the variance structure of the data collected on a
training set of selected fish samples followed by varimax rotation to
see if there are more “interpretable” directions apparent in the same
data set after rotation.
For the development of a calibration method for predicting fat in
fish, an extensive training set of 66 representative samples was
selected and NIR spectra were collected in transmission mode over
the wavelength region 700–1100 nm on properly prepared analytical
aliquots. The raw spectra are provided in Figure 12.2.
The data in Figure 12.2 are affected by both scatter and offset
effects. This is evidenced by the scatter effects plot (Figure 12.3),
which shows a “fanning” effect in the data; since the curves do not
overlay each other, this is also indicative of baseline effects (refer to
chapter 5 for more details on baseline effects and their correction).
Figure 12.2: Raw spectra of fish samples collected by NIR transmission spectroscopy
over the wavelength region 700–1100 nm.

Figure 12.3: Scatter effects plot of the fish NIR spectra.

The scatter effect plot is a useful tool for determining which


preprocessing method to apply to the data. The first choice of
preprocessing was the use of multiplicative scatter correction (MSC,
chapter 5, section 5.3.6) because the general spectral profile of all
samples was similar. The MSC-transformed spectra for this data are
provided in Figure 12.4.
Figure 12.4: MSC preprocessed NIR spectra of fat in fish samples.

In Figure 12.4, it was observed that the MSC preprocessing was


not capable of removing the spectral non-linearity observed in the
scatter plot in the 830–930 nm region.
Another common preprocessing method applied to NIR spectra
are derivatives (refer to chapter 5, section 5.3.5). The Savitzky–Golay
second derivative was applied to the NIR spectra of fish and
preprocessed data are shown in Figure 12.5.
Initial inspection of the second derivative spectra reveals some
highly systematic structure in the data compared to the raw spectra,
and the PCA overview of these data is shown in Figure 12.6.
It is noted in Figure 12.6 that two PCs describe 98% of the total
data variance. The loadings are plotted in two ways in the PCA
overview. The loading line plot shows the spectral weightings of the
two components and reveals a high correlation between the two
loadings around 950 nm. This explains why the data are distributed
over PCs 1 and 2 in the score plot (which has been sample grouped
based on the absorbances at 944 nm). The two-dimensional loading
plot shows that there are two main directions of the variables, which
are in fact oblique w.r.t. PCs 1 and 2 (these have been highlighted
using lines drawn “manually”). These lines of major variation visually
give the impression of being “causally orthogonal to each other”, and
it would seem at first sight that an orthogonal rotation method such
as Varimax may result in a simpler structure of the loadings, and
therefore of the scores, for easier interpretation.
After application of the Varimax rotation, the rotated PCA
overview is provided in Figure 12.7.
The Varimax rotation has now rotated the scores and loadings
such that they better line up with the PC1 and PC2 directions, with
rotated PC1 direction now describing the moisture content in the
samples (note the large loading around 950 nm) and the rotated PC2
direction now describing the fat content of the fish samples (as
described by the large loading around 880 nm).

Figure 12.5: Savitzky–Golay second derivative spectra of fat in fish.


Figure 12.6: PCA overview of second derivative fat in fish spectra.

Figure 12.7: Varimax rotated PCA of fat in fish data.

The rotated loadings line plot shows that the original correlation
between the PCs around 944 nm is now minimised, indicating that in
the original PCA, fat and moisture were correlated. After rotation, the
two factors separate these effects, revealing that a simple structure
of fat and moisture content is present in the data set. This example
will be described in more detail in section 12.9, where the method of
multivariate curve resolution (MCR) will be applied to the data.
Overall, rotation methods can lead to simple structure, which in
some cases is demonstrably easier to interpret than the original PCA
space. The rotated loadings may still not have real physical meaning;
however, in the case of the fat in fish example, the loadings meet
the Varimax criterion of a small number of large loadings and a large
number of near-zero loadings for each factor. These rotated factors
have better physical meaning in the sense that fat and moisture are
typically correlated in samples of biological nature.
The next section describes the effect of PC rotation under certain
predefined constraints known as multivariate curve resolution (MCR).

12.5 An introduction to multivariate curve


resolution (MCR)

12.5.1 What is multivariate curve resolution?

Multivariate curve resolution (MCR) methods have been introduced to


implement certain constraints on the scores and loadings, such that
they have an exact physical interpretation [12]. Taking the above
concepts a little further, PCA produces an orthogonal bilinear data set
(matrix) decomposition, where components or factors are obtained in
a sequential way each explaining a maximum, decreasing proportion
of the total data variance. Using these design criteria plus
normalisation (of the loadings), PCA produces unique solutions.
These standard PCA solutions are very helpful in deducing the
number of different sources of variation present in the data and they
usually allow for their identification and interpretation. Most of this
book has so far guided the reader along this standard path. However,
these solutions can be “abstract” in the sense that they are not
necessarily the “true” underlying factors (with an interpretable
physico–chemical meaning) causing the data variation in all similar
data sets.
MCR methods are a group of techniques that intend to recover
concentration profiles and spectral profiles from an unresolved
mixture using a minimal number of assumptions about the nature and
composition of the mixtures. These methods are also known as
“Blind source separation” or “Self-modelling mixture analysis”
methods. MCR methods can be easily extended to the analysis of
many other types of data including, for example, multiway and image
data. Most of the data examples analysed up to now have been
standard two-way “flat” data tables.

12.5.2 How multivariate curve resolution works

PCA is again found as the underlying workhorse, this time also of


MCR. PCA is used to define the rank of the data matrix investigated
and this forms the basis for determining the number of MCR
components to resolve. The master equation for a PCA model is
shown here again for convenience, equation 12.2:

X = TPT + E     (12.2)

MCR works on an adaptation of equation 12.2 by rotating the
abstract components (under a set of predefined constraints), such
that the data decomposition is described by equation 12.3:

X = CST + E     (12.3)

in which C contains the estimated concentration profiles of the
constituents measured in the data matrix X, and ST contains the
estimated spectral responses corresponding to the profiles in C.
It is easy to see, by analogy, that the scores T in PCA, when
rotated by MCR become the estimated concentration profiles, C—
and the loadings PT become the estimated spectral profiles ST. Thus,
under the conditions of constrained MCR rotation, the scores and
loadings go from being abstract to being physically interpretable
entities (in the language of spectroscopy). The process of MCR is
shown graphically in Figure 12.8.

Figure 12.8: Graphical representation of the MCR process.

12.5.3 Data types suitable for MCR

Any type of spectral/signal data that is a linear combination of several


individual components can be analysed by MCR, however, the
method works especially well when there is some form of evolution of
the system, characterised by the data over time, for example when
applied to spectroscopic data collected for a developing chemical
reaction, or when elucidating the pure components of a mixture using
high performance liquid chromatography (HPLC) and collecting full
ultra violet/visible (UV/vis) spectra for each sample over time.
MCR has also been used to resolve pure components of
hyperspectral imaging data for the assessment of the components in
the state they exist in the mixture being measured. This has major
impacts regarding interpretation possibilities. For example, if the pure
component spectrum of a single constituent is measured by a
spectroscopic technique of choice, this is just the as-is information of
the material, disregarding the “environment” of the mixture to be
measured. When a component is instead resolved using MCR in the
state in which it exists in the mixture, subtle changes (particularly
hydrogen bond shifts) can be detected, providing greater insights into
dynamic reaction chemistry and solids mixing phenomena.
As an example, the data shown in Figure 12.9 were collected using
UV/vis spectroscopy on a chemical reaction. In this case, the data
were collected on an evolving process with reaction kinetics
described by equation 12.4:

A → B* → C     (12.4)

In equation 12.4, the initial constituent A is converted to product C
via a transition state B*. The raw UV/vis spectra for the reaction are
shown in Figure 12.9.
The first observation regarding this data is that the absorbance
scale is positive between 0 and 1 absorbance units. An initial PCA
was applied to the data and the scores and loadings for the important
components are shown in Figure 12.10.
PCA suggests that there are two sources of variation in this data.
PC1 describes the majority of the variability (97%) and PC2 the
remaining variability (3%). Note the shape of the loadings. For each of
the PCs used to describe the data, there are negative loadings. While
these can be used to interpret the relative changes in the wavelengths
in the data as the reaction progresses, there is no real physical
explanation of the loadings since the original scale of the data was on
a positive scale exclusively.
What is shown here is not that PCA is useless, or meaningless.
PCA is an excellent approach for interpreting the relative changes in
the absorbance spectra as the reaction proceeds and can be
interpreted in a mechanistic way. However, the loadings are in this
case clearly “abstract components” with a mathematical, but not a
physical meaning. Before performing the MCR on the same data set,
the next section describes a number of constraints that can be
placed on the data to provide the kind of physical interpretations
required.
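To illustrate the kind of bilinear data structure that such a reaction produces, the short simulation below (Python with NumPy) builds first-order concentration profiles for A → B* → C, invents three Gaussian “pure spectra” and mixes them into a data matrix X = CST plus noise. The rate constants and band positions are purely hypothetical; they are not the values behind Figure 12.9.

import numpy as np

t = np.linspace(0, 60, 121)               # time axis (e.g. minutes)
k1, k2 = 0.15, 0.05                       # hypothetical first-order rate constants

# Closed-form kinetics for A -> B* -> C with initial [A] = 1
A = np.exp(-k1 * t)
B = k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
P = 1.0 - A - B                           # the product (named P to avoid clashing with C below)
C = np.column_stack([A, B, P])            # concentration profiles, time x 3

wl = np.linspace(250, 600, 200)           # wavelength axis (nm)
def band(centre, width, height=1.0):      # hypothetical Gaussian absorption band
    return height * np.exp(-0.5 * ((wl - centre) / width) ** 2)
St = np.vstack([band(300, 25), band(380, 30, 0.8), band(450, 35, 1.2)])   # 3 x wavelengths

X = C @ St + 0.005 * np.random.default_rng(0).normal(size=(len(t), len(wl)))
print(X.shape)                            # (121, 200): one spectrum per time point

A PCA of such a simulated X again yields abstract components with negative loadings, whereas a suitably constrained MCR should recover profiles resembling C and St.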

12.6 Constraints in MCR


Constraints are mathematical bounds placed on the data analysis
such that the data structure is forced to respect predefined criteria.
Although curve resolution methods are not reliant on external
information in general, the use of external information, when it exists,
can direct the analysis to an exact solution. This is done in order to
minimise ambiguity in the data decomposition and in the results
obtained. Rotational ambiguity is one of the major pitfalls in the use of
MCR and is discussed in greater detail below in section 12.6.5.
Figure 12.9: Raw UV/vis spectra of a chemical reaction.
Figure 12.10: PCA scores and loadings of chemical reaction data.

Constraints should only be applied when there is absolute


certainty about the validity and the resulting effects of the constraint
being applied to a particular data set. Misuse of constraints can in
some cases lead to nonsensical and meaningless resolutions which
can completely damage interpretation. However, when well
implemented, constraints will direct the MCR process to the right
solution with correct interpretability.
There exist a number of commonly used constraints in MCR,
based on either chemical or mathematical features of the data set. In
terms of implementation, there is typically a choice between equality
constraints and inequality constraints. An equality constraint sets the
elements in a profile to be equal to a certain value, whereas an
inequality constraint forces the elements in a profile to be higher or
lower than a certain threshold. The following sections define three
commonly used constraint types.

12.6.1 Non-negativity constraints

The non-negativity constraint is applied when it can be assumed that


the measured values in an experiment will always be non-negative.
This constraint forces the values in a profile to be equal to or greater
than zero. Non-negativity constraints may be applied independently
of each other to:
Concentrations (the elements in each row of the C matrix)
Response profiles (the elements in each row of the ST matrix)
To be physically interpretable, concentration profiles should never
be negative (i.e. mass cannot be negative). The result is a set of
estimated spectral profiles that describes the chemical species in the
system as they exist in the mixture being analysed.
This constraint, however, must be put into context. For example, if
the raw spectra were preprocessed using a derivative, the data will
now be a combination of positive and negative absorbance bands. In
this case, imposing a non-negativity constraint on the derivatised
data has absolutely no meaning and will definitely lead to nonsensical
results for the estimated spectra.
12.6.2 Uni-modality constraints

The concept of uni-modality relates to the assumption that there


should only be one peak maximum (or plateau) per concentration
profile. A typical example of a uni-modality constraint is when MCR is
applied to complex chromatographic data where unique separations
are not physically possible. By implementing a uni-modality
constraint, the MCR will proceed such that only one peak is resolved
in the concentration profile per component treated. In this way, the
corresponding estimated spectra can be interpreted for physical
meaning and the presence or absence of ambiguity confirmed.
In the case of monotonic reaction profiles reflecting the increase
or decrease of particular species in the reaction, the uni-modality
constraint will often result in a plateau, showing that one species is
consumed by the reaction while another is produced, until there are
no more reactants left to be converted.
considered along with the closure constraint defined below, when
applying MCR to chemical reaction data. Examples of uni-modality
constraints are provided in Figure 12.11.

12.6.3 Closure constraints


The closure constraint is related to a mass balance situation in which
the units of measurement are relative, i.e. they are forced to sum to a
constant (1.0 or 100% are the constants most often met with). Be
advised that the familiar chemical concentrations are not absolute but
relative (the sum of all concentrations is in fact 100%).
Likewise, the sum of the concentrations of all the species involved
in a chemical reaction is forced to a constant value (e.g. 100%); this
is an example of an equality constraint. This constraint is typically
applied to the concentration profiles as these are related to the mass
balance requirement of a closed system (no material is entering or
leaving the system). It is important to verify that the system is indeed
closed before application of this particular constraint. As an example,
a continuous fed batch fermentation reaction cannot be considered a
closed system as material (nutrients, etc.) is fed into the system while
products (in some cases, e.g. perfusion reactions) are taken out.
However, a fixed volume closed reaction system would represent a
system where mass balance can be assumed.
One should be aware that there is a particular potential danger
regarding multivariate data analysis on so-called closed arrays, i.e.
data matrices in which each row is subject to the specific equality
constraint = 100%.
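As a small illustration of how closure can be imposed during an iterative resolution, the sketch below (Python with NumPy, using a hypothetical intermediate concentration estimate) rescales each row of a concentration matrix so that the species sum to a constant total.

import numpy as np

def apply_closure(C, total=1.0):
    # Equality (closure) constraint: each row (sample / time point) sums to 'total'
    row_sums = C.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0          # guard against division by zero
    return total * C / row_sums

C = np.array([[0.2, 0.9],                  # hypothetical intermediate estimate
              [0.5, 0.5],
              [0.8, 0.1]])
print(apply_closure(C).sum(axis=1))        # every row now sums to 1.0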

12.6.4 Other constraints

In addition to the three constraints discussed above, other types can


also be applied. The constraints issue is not one that novice data analysts
should delve into; considerable experience is required. The interested
reader is referred to the excellent work by Tauler and de Juan [12]
and Maeder [13] in this field of application.

Local rank constraints

Particularly important for the correct resolution of two-way data


systems are the so-called local rank constraints: selectivity and zero-
concentration windows. These types of constraints are associated
with describing how the number and distribution of components
varies locally along the data set time evolution. The key constraint
within this context is selectivity. Selectivity constraints can be used in
concentration and spectral windows where only one component is
present to completely suppress the ambiguity linked to the
complementary profile in the system. Thus, selective concentration
windows provide unique spectra of the associated components and
vice versa. The powerful effect of these types of constraints and their
direct link with the corresponding concept of chemical selectivity
explains their early and wide application in chromatographic
resolution problems [14]. Not so common, but equally recommended,
is the use of other local rank constraints in iterative resolution
methods. These can be used to describe which components are
absent in data set windows by setting the number of components
inside windows smaller than the total rank. This approach always
improves the resolution of profiles and minimises the rotational
ambiguity in the final results. Refer to the literature on evolving factor
analysis (EFA) and other related methods, particularly the work of
Tauler [14].

Figure 12.11: Some examples of the uni-modality constraint.

Physico–chemical constraints
One of the most recent advances in chemical constraints is the
implementation of a physico–chemical model into the multivariate
curve resolution domain. In this approach, the concentration profiles
of compounds involved in a kinetic or a thermodynamic process are
shaped according to a relevant chemical law. Such a strategy has
been used to reconcile the separate worlds of hard- and soft-
modelling and has enabled the mathematical resolution of chemical
systems that could not be successfully tackled by either of the two
pure methodologies alone. The strictness of a hard model constraint
dramatically decreases the ambiguity of the constrained profiles and
provides fitted parameters of physico–chemical and analytical
interest, such as equilibrium constants, kinetic rate constants and
total analyte concentrations. The soft part of the algorithm allows for
modelling of complex systems, where the central reaction system
evolves in the presence of, and as a function of, absorbing interferences.
Refer to the literature on kinetic modelling and other related methods,
particularly the work of Maeder [13].
Returning to PCA of the chemical reaction data, MCR is now
applied to the data set using the constraints of non-negativity on both
concentration profiles and spectra, as well as the constraint of
uni-modality based on subject-matter expertise regarding the underlying
reaction kinetics (as always, the more domain-specific knowledge
available, the better and the more relevant data analysis possible).
This constrained MCR overview is provided in Figure 12.12.
First inspection of the constrained MCR indicates that the three-
component model now results in significantly minimised sample
residuals. The component concentrations plot shows that for
component 1, there is a monotonic decrease (uni-modality) in the
concentration of this component as time progresses. This is
indicative of a starting material being consumed by the reaction. The
blue curve in the component spectra plot provides the estimated
spectra for the starting material. As the reaction progresses, the
concentration of component 3 rises until it reaches a plateau where
no more product is being generated; its spectrum is defined by the
green curve in the component spectrum plot.
It is the generation of the second component that establishes
MCR as the more relevant data analysis approach for these data vs
PCA. In the PCA overview, Figure 12.10, only two sources of
variability were observed—these can be understood as PCA
attempting to find the greatest sources of variability, in this case, the
consumption of the reactant and the generation of the product,
precisely the design feature of PCA as variance maximisation. In
contrast, a suitably constrained MCR was not only able to describe
the consumption and formation kinetics, but was also able to reveal
the transition state spectrum (the red curve) reflecting a transition
state compound in the reaction. In a sense MCR “discovered” this
third component, which was hidden or swamped by the
overpowering other two component variances.
The appropriately constrained MCR now provides the relevant
basis for a detailed mechanistic assessment of the reaction profile,
and gives a spectral assessment of the transition state that may not
have been separable by any other means, whether physical or
mathematical.*

Figure 12.12: MCR overview of the chemical reaction data using UV/vis spectroscopy.

12.6.5 Ambiguities and constraints in MCR

The goal of curve resolution methods is to unravel the “true”


underlying sources of data variation, often from complex data “lifted”
from dynamic processes. In this context MCR is a very powerful tool
that can sometimes also be used on “similar” types of data that are
not necessarily process data in the same fashion, e.g. hyperspectral
image data.
does not necessarily generate unique solutions, unless external
information is provided during the data decomposition.
When the goals and demands of curve resolution are met, the
resulting understanding of the system under study is markedly
improved, and in many cases, avoids the use of enhanced and more
complex experimental techniques. Through MCR methods, the
ubiquitous mixture analysis problem in chemistry (and other scientific
fields) can be solved directly by mathematical and software tools
instead of using costly analytical chemistry and instrumental tools,
e.g. sophisticated hyphenated mass spectrometry–chromatographic
methods. But what are the remaining ambiguities, and how are they
resolved?

Rotational and intensity ambiguities in MCR


From a mathematical perspective, throughout the history of research
into curve resolution, decomposition of a single data matrix has been
known to be subject to ambiguities, no matter the method used.
What does this mean in terms of the results generated? In general,
there will be many alternative C- and ST-type matrices that can
reproduce the original data set with the same quality of fit. This
means that there may be potentially infinitely many component
profiles differing in shape (rotational ambiguity) or in magnitude
(intensity ambiguity) that can reproduce the same original data.
Mathematically, rotational ambiguities can be shown using
equation 12.5:

X = CST = (CT)(T–1ST) = C′S′T     (12.5)

where C′ = CT and S′T = T–1ST describe the X matrix equally as
correctly as the true C and ST matrices do, though C′ and S′T are not
the desired solutions.
Due to the rotational ambiguity problem, data analysis can provide
as many results as there exist T matrices. The only way to enforce a
unique solution in both C and ST is to apply certain restrictions on
them. Even for the theoretically ideal solution obtained from the
application of MCR to a data set, an intensity (scale) ambiguity
remains, as shown in equation 12.6:

X = Σi=1..n (ci / ki)(ki siT)     (12.6)

where ki are scalars and n refers to the number of components. Each
concentration profile of the new C′ matrix would have the same
shape as the real one, but be ki times smaller, whereas the related
spectra of the new S′ matrix would be equal in shape to the real
spectra, though ki times more intense.
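The rotational ambiguity of equation 12.5 is easy to demonstrate numerically. The sketch below (Python with NumPy, using arbitrary simulated profiles and an arbitrary invertible transformation T) shows that C′ = CT and S′T = T–1ST reproduce X exactly as well as C and ST do, even though the transformed profiles are different.

import numpy as np

rng = np.random.default_rng(1)
C = np.abs(rng.normal(size=(50, 2)))       # "true" concentration profiles (simulated)
St = np.abs(rng.normal(size=(2, 80)))      # "true" spectra (simulated)
X = C @ St

T = np.array([[1.0, 0.4],                  # any invertible transformation matrix
              [0.2, 1.0]])
C_alt = C @ T                              # C'
St_alt = np.linalg.inv(T) @ St             # S'^T

print(np.allclose(X, C_alt @ St_alt))      # True: same fit to X
print(np.allclose(C, C_alt))               # False: different profiles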

12.7 Algorithms used in multivariate curve


resolution
Multivariate curve resolution methods can be classified into two
types,
1) Iterative
2) Non-iterative
A general description of two commonly used MCR algorithms will
be presented below, however, focus will be given to the most
commonly used and most versatile method known as multivariate
curve resolution–alternating least squares (MCR–ALS).

12.7.1 Evolving factor analysis (EFA)


Evolving factor analysis (EFA) [15] is a non-iterative method that
forms the basic foundation of the curve resolution methods. EFA is
mainly applicable to sequentially evolving systems, such as
chromatographic profiles or chemical reactions (although it has also
been used with some limited success in the resolution of mixtures
and image data). Known as a local rank method, EFA applies PCA to
an evolving data set by adding one row at a time to the matrix
(reflecting the next time unit) and calculating the singular values (SVs)
for each row addition. The premise of the method is that the highest
SVs relate to the chemical compound being generated and that lesser
SVs have no relation to the component of interest. By performing this
procedure in the direction of process evolution, the aim of EFA is to
detect the (first) appearance of each constituent being eluted/formed.
This is shown in Figure 12.13.
The next step in EFA is to perform the PCA in the reverse
direction, such that the first SV in the forward direction (representing
the formation of the first component) is represented by the (n – 1)th SV
in the reverse direction. In this way, by mathematically fitting the
forward and reverse evolving factors, the underlying concentration
profiles of the system can be found. EFA is only briefly described
here; the interested reader is referred to the important and very
comprehensive work by Tauler and his group [16] for a full and
detailed explanation on EFA. Figure 12.13 provides a graphical
description of the principle behind EFA.
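A bare-bones sketch of the forward and backward EFA scans is given below (Python with NumPy). For each growing submatrix of X (rows added in the direction of process evolution, then in the reverse direction) the singular values are computed; plotting their logarithms against time gives the classical EFA diagram used to locate where each component appears and disappears. It is an illustration of the principle only, not a full EFA implementation.

import numpy as np

def efa_scan(X):
    # Forward and backward evolving singular values of X (rows = time order)
    n, p = X.shape
    forward = np.full((n, p), np.nan)
    backward = np.full((n, p), np.nan)
    for i in range(1, n + 1):
        s_f = np.linalg.svd(X[:i, :], compute_uv=False)        # first i rows
        s_b = np.linalg.svd(X[n - i:, :], compute_uv=False)    # last i rows
        forward[i - 1, :len(s_f)] = s_f
        backward[n - i, :len(s_b)] = s_b
    return forward, backward

# Usage (e.g. with a simulated evolving data set X such as the one in section 12.5.3):
# fwd, bwd = efa_scan(X)
# Plot np.log10 of the first few columns of fwd and bwd against time and note where
# each new singular value rises above the noise level.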

12.7.2 Multivariate curve resolution–alternating least squares


(MCR–ALS)

Multivariate curve resolution–alternating least squares (MCR–ALS)


uses an iterative approach to find the matrices of concentration
profiles and instrumental responses (spectra). In this approach,
neither the C nor ST matrices have priority but both are optimised at
each iterative cycle. In general, MCR–ALS follows the steps listed
below [17].
1) Determine the number of compounds in the original matrix X.
Figure 12.13: Graphical description of the EFA principle (figure taken from reference
[18]).

2) Calculate an initial estimate, typically the estimated concentrations


C.
3) Using this estimate of C, calculate the estimated spectra ST under
the specific constraints chosen.
4) Using the newly calculated estimated spectra ST, recalculate the
estimated concentration profiles C under the specific constraints
chosen.
5) From the product of C and ST found in the iterative cycles above,
calculate an estimate of the original matrix X, i.e. multiply the two
matrix estimates with one another.
6) Repeat steps 3, 4 and 5 until convergence.
Mathematically, the main purpose of MCR–ALS is to solve the
following least squares problem under problem-dependent,
appropriate constraints:

min ||X̂ – ĈŜT||

where X̂ is the reproduction of X by the PCA model using the chosen
number of components. In general, this norm of the residuals between
the PCA reproduction of X and the MCR reproduction of X is minimised
first keeping C constant—and then keeping ST constant. The least
squares solution for ŜT is provided in equation 12.7:

ŜT = Ĉ+X̂     (12.7)

In this case, Ĉ+ is the pseudoinverse of the concentration matrix,
which for a full rank model is described by equation 12.8,

Ĉ+ = (ĈTĈ)–1ĈT     (12.8)

The least squares solution for Ĉ is provided in equation 12.9:

Ĉ = X̂(ŜT)+     (12.9)

In this case, (ŜT)+ is the pseudoinverse of the spectra matrix, which
for a full rank model is described by equation 12.10,

(ŜT)+ = Ŝ(ŜTŜ)–1     (12.10)

Convergence of the alternating least squares optimisation


procedure is based on a comparison of the fit between two
consecutive iterations; when this difference falls below a specified,
low threshold (or in some cases a maximum number of iterations
allowed), the algorithm stops.
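The steps above translate almost directly into code. The following deliberately simplified sketch (Python with NumPy) alternates the two least squares steps and imposes non-negativity by clipping negative values to zero; real implementations normally use non-negative least squares and further constraints (closure, uni-modality, equality constraints), so this should be read as an illustration of the alternating structure only.

import numpy as np

def mcr_als(X, C_init, max_iter=200, tol=1e-8):
    # Simplified MCR-ALS with non-negativity on both C and S^T imposed by clipping
    C = np.clip(np.asarray(C_init, dtype=float), 0, None)
    previous_fit = np.inf
    for _ in range(max_iter):
        St = np.linalg.pinv(C) @ X          # step 3: S^T = C+ X, then constrain
        St = np.clip(St, 0, None)
        C = X @ np.linalg.pinv(St)          # step 4: C = X (S^T)+, then constrain
        C = np.clip(C, 0, None)
        fit = np.linalg.norm(X - C @ St)    # step 5: reproduce X from the product C S^T
        if abs(previous_fit - fit) < tol * max(fit, 1.0):
            break                           # step 6: convergence reached
        previous_fit = fit
    return C, St

# Usage (e.g. with the simulated reaction data X from section 12.5.3):
# C_init = np.random.default_rng(0).random((X.shape[0], 3))   # crude initial guess
# C_hat, St_hat = mcr_als(X, C_init)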

12.7.3 Initial estimates for MCR-ALS

MCR requires a set of initial estimates: the iterative optimisation of
the profiles in C or ST requires a matrix, or a set of profiles, of
dimensions matching the starting matrices and resembling the
concentration profiles or spectra that are expected. In general, the use
of non-random estimates helps shorten the number of iterations and
can avoid convergence to local optima different from the objective final
solution, which can be a substantial problem (a hole one cannot get
out of...). Chemically meaningful estimates will always result in better
resolutions, and the choice between initial estimates of a C-type or an
ST-type matrix depends on which profiles show less overlap, which
direction of the matrix (rows or columns) carries more information, or
simply on the direction of the process as judged by the analyst.
Obviously, experience is a must here.

12.7.4 Computational parameters of MCR–ALS


In the MCR–ALS procedure, the constraint settings (non-negative
concentrations, non-negative spectra, uni-modality, closure) have
already been described in section 12.6, and the sensitivity to pure
components is usually set using software packages like The
Unscrambler®.
knowledge available (through domain expertise) such as
understanding of constraints that apply to the application at hand and
the nature of the data before building the MCR model.
In many cases, the nature and application of constraints are
typically unknown, but there is perhaps a little information about the
relative order of magnitude of the estimated pure components upon
the first attempt at curve resolution. For instance, one of the products
of the reaction may be dominating, but detection and identification of
possible by-products may also be of interest. If some of these by-
products are synthesised in a very small amount compared to the
initial chemicals present in the system and the main product of the
reaction, the MCR computations will have trouble distinguishing
these by-products’ “signature” from mere noise in the data.

12.7.5 Tuning the sensitivity of the analysis to pure


components

After developing a first MCR, and from a review of a PCA of the initial
data set, it is important that the analyst checks the estimated number
of pure components and attempts an initial interpretation of the
profiles of those components. Software programs typically provide an
option to adjust the analysis to be more or less sensitive to the
component estimates added to the analysis. There are a number of
situations that can occur that will require tuning of this sensitivity so
that the analysis does not produce meaningless results. Some cases
are listed as follows.
Case 1: The estimated number of pure components is larger than
expected. In this situation, the best action is to reduce the sensitivity
of the analysis, using software, such that a more meaningful result is
obtained.
Case 2: When there are no prior expectations about the number of
pure components, but some of the extracted profiles look very noisy
and/or two of the estimated spectra are very similar, this indicates
that the actual number of components is probably smaller than that
estimated. In this case, the best action again is to reduce the
sensitivity.
Case 3: When it is known that there are at least n different
components with varying concentrations in the system, and the
estimated number of pure components is smaller than n, this is a
case where the best action is to increase sensitivity to pure
components such that the analysis results in the expected number of
components.
Cases 1–3 above require the use of software to adjust sensitivities
and this is a feature of The Unscrambler® in the MCR functionality.
The default value is 100% sensitivity and this value can be adjusted
either side of this default depending on the situation encountered
with the data being modelled.

12.8 Main results of MCR


Contrary to what happens when building a PCA model, the number of
components (n) computed in an MCR model cannot be chosen. The
optimal number of components necessary to resolve the data is
estimated by the system, and the total number of components saved
in the MCR model is typically set to n + 1.
For each number of components k between 2 and n + 1, the
typical results provided by an MCR model are listed as follows:
Residuals: these are error measures that describe how much
variation remains in the data after k components have been
estimated.
Estimated concentrations (C): these describe the estimated pure
components’ profiles across all of the samples included in the
model. These could be estimated chromatograms or reaction
profiles in kinetics studies.
Estimated spectra (ST): these describe the instrumental properties
(e.g. spectra) of the estimated pure components.
A more detailed explanation of these parameters is provided in the
following sections.

12.8.1 Residuals

The residuals are a measure of the fit (or rather, lack-of-fit) of the
model. The smaller the residuals, the better the fit. MCR residuals can
be studied from three different points of view.
Variable residuals: these are a measure of the variation remaining
in each variable after k components have been estimated.
Sample residuals: these are a measure of the distance between
each sample and its model approximation.
Total residuals: these are a measure of how much variation in the
data remains to be explained after k components have been
estimated. Their role in the interpretation of MCR results is similar
to that of variances in PCA.

12.8.2 Estimated concentrations


The estimated concentrations show the profile of each estimated
pure component across the samples included in the MCR model.
These are typically plotted as a line plot where the abscissa shows
the samples/time stamp etc., and each of the k pure components is
represented by a single curve. The k estimated concentration profiles
can be interpreted as k new “samples” showing how much each of
the original samples “contain” of each estimated pure component.

12.8.3 Estimated spectra

The estimated spectra show the instrumental profile (e.g. spectrum)


of each pure component across the X-variables included in the
analysis. They are typically plotted as a line plot where the abscissa
shows the X-variables, and each of the k pure components is
represented by a single curve. These k estimated spectra can be
interpreted as the spectra of the pure components generated by the
model for a particular data set. Comparison of the spectra of the
original samples to the estimated spectra may be useful in
determining which of the actual samples are closest to the pure
components.
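One simple way of comparing estimated pure spectra with measured sample spectra (or with a spectral library) is via a correlation coefficient, as in the generic sketch below (Python with NumPy). This is only one of several possible similarity measures.

import numpy as np

def correlation_match(estimated_spectra, reference_spectra):
    # Rows of both inputs are spectra; returns the matrix of Pearson correlations
    def standardise(M):
        M = np.asarray(M, dtype=float)
        M = M - M.mean(axis=1, keepdims=True)
        return M / np.linalg.norm(M, axis=1, keepdims=True)
    A = standardise(estimated_spectra)
    B = standardise(reference_spectra)
    return A @ B.T          # entry (i, j): correlation of estimate i with reference j

# The reference (or sample) spectrum with the highest correlation to a given
# estimated component is the best candidate for its identity.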

12.8.4 Practical use of estimated concentrations and


spectra and quality checks

Once a satisfactory MCR model has been established, interpretation


of the results is absolutely necessary due to the rotational ambiguity.
The results can be interpreted from three different points of view:
1) Assess or confirm if the number of pure components in the system
under study is realistic for the problem being investigated.
2) Identify the extracted components, using the estimated spectra
and verify the results are meaningful in a relevant physico–chemical
sense.
3) Quantify variations across samples, using the estimated
concentrations.
The following provides a few general rules and principles that may
help in the interpretation of an MCR model:
1) Always cross-check the MCR model with a PCA, try different
settings for the sensitivity to pure components, to ensure a
physically interpretable result is obtained.
2) The spectral profiles obtained may be compared to a library of
similar spectra in order to identify the nature of the pure
components that were resolved.
3) Estimated concentrations are relative values within an individual
component itself. Estimated concentrations of a sample do not
reflect its real composition; however, by appropriate scaling to
known concentrations of pure components, quantitative MCR is
possible.

12.8.5 Outliers and noisy variables in MCR

As is the case for all multivariate analysis, the available data may be
more or less “clean” or “precise” when building the first curve
resolution model. The main tool for diagnosing outliers in MCR
consists of using the plot of sample residuals. Any sample that stands
out on the plots of sample residuals (either with MCR fitting or PCA
fitting) is a possible outlier.
To find out more about such a sample (Why is it outlying? Is it an
influential sample? Is that sample disturbing the model?), it is
recommended to run a PCA on the data. If an outlier is justifiably
removed, recalculation of the MCR model without that sample may
lead to a more interpretable analysis. In MCR, some of the available
variables—even if they are no more “noisy” than the others—may
also contribute poorly to the resolution, or even disturb the results.
The two main cases are:
Non-targeted wavelength regions: these variables carry virtually
no information that can be of use to the model;
Highly overlapped wavelength regions: several of the estimated
components have simultaneous peaks in those regions, so that
their respective contributions are difficult to disentangle (difficult to
unscramble).
The main tool for diagnosing noisy variables in MCR consists of
the plots of variable residuals. Any variable that stands out on such a
plot (either with MCR fitting or PCA fitting) may be disturbing the
model, thus reducing the quality of the resolution; try recalculating the
MCR model without such suspect variables.
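The sample, variable and total residual measures discussed in this section can be computed directly from the residual matrix E = X – CST of a fitted MCR model. The sketch below (Python with NumPy) is one straightforward way of doing so; the exact residual statistics reported by a given software package may be scaled differently.

import numpy as np

def mcr_residuals(X, C, St):
    # E holds the part of the data that the MCR model with these C and S^T does not explain
    E = X - C @ St
    sample_residuals = np.sqrt(np.mean(E**2, axis=1))    # one value per sample (row)
    variable_residuals = np.sqrt(np.mean(E**2, axis=0))  # one value per variable (column)
    total_residual = np.sqrt(np.mean(E**2))              # overall lack of fit
    return sample_residuals, variable_residuals, total_residual

# Samples or variables that stand out in these vectors are the candidates for
# inspection (and possible removal) discussed above.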

12.9 MCR applied to fat analysis of fish


As a continuation of the case study on fish fat determination
presented in section 12.4.1, a more detailed data treatment is provided
below, incorporating the use of advanced preprocessing methods and
MCR to develop a more specific regression model than could be
achieved without the use of external information.
It was observed in Figure 12.3 that when the individual spectra
were plotted against the mean, there were some non-linearities in the
plot, indicative of systematic chemical variability in the data. As was
observed in the gluten–starch data presented in chapter 5, section
5.4.2, indiscriminate use of MSC can destroy the internal data
structure, resulting in a data set that has less information than the
original. In some applications, the use of external information to
develop a scatter correction model can be a more scientific way of
describing the variability of the data, but what happens when no such
external data is available? MCR may be of use in this situation.
Using the raw NIR spectra of a calibration set made up of 44
spectra, MCR was applied to this data resulting in the overview in
Figure 12.14. The two constraints used were non-negativity
constraints on both concentration profiles and spectra, since the
original data were measured on a positive scale and concentrations
are always expected to be positive.
It was believed after an assessment of the component spectra
that the first estimated spectrum (blue curve) was the fat spectrum as
it existed in the fish samples and the second estimated spectrum
(red) was the moisture in the samples. The estimated fat spectrum
was used as a good spectrum in the EMSC algorithm (refer to
chapter 5, section 5.3.6) and the resulting spectra of Figure 12.15
were obtained.
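A rough sketch of how an MCR-estimated analyte spectrum can be used as a “good” spectrum in an EMSC-type correction is given below (Python with NumPy). It follows the general EMSC model form discussed in chapter 5 (baseline, linear and quadratic wavelength terms, a multiplicative term against the mean spectrum, plus the analyte spectrum as an extra regressor whose contribution is retained). It is a schematic illustration under these assumptions, not the exact algorithm used to produce Figure 12.15, and the names X_raw and fat_spectrum_from_mcr are hypothetical.

import numpy as np

def emsc_with_good_spectrum(X, good):
    # Regress each spectrum (row of X) on a baseline, linear/quadratic wavelength
    # terms, the mean spectrum and a "good" (analyte) spectrum; remove the additive
    # terms and divide by the multiplicative coefficient.
    n, p = X.shape
    wl = np.linspace(-1.0, 1.0, p)                           # scaled wavelength axis
    m = X.mean(axis=0)                                       # reference (mean) spectrum
    M = np.column_stack([np.ones(p), wl, wl**2, m, good])    # EMSC-type model matrix
    coefs, *_ = np.linalg.lstsq(M, X.T, rcond=None)          # 5 x n coefficient matrix
    baseline = M[:, :3] @ coefs[:3, :]                       # additive terms to remove
    b = coefs[3, :]                                          # multiplicative coefficients
    return ((X.T - baseline) / b).T                          # corrected spectra

# X_corr = emsc_with_good_spectrum(X_raw, fat_spectrum_from_mcr)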
Figure 12.14: MCR overview of fat in fish data, NIR transmission spectra.

Figure 12.15: EMSC-corrected spectra of fat in fish using estimated fat spectrum from
MCR.
Figure 12.16: PLSR model of fat in fish using EMSC/MCR preprocessing.

Figure 12.17: Updated PLSR model of fat in fish after the removal of justified outliers.

Comparison of the EMSC preprocessed data to the MSC
preprocessed data in Figure 12.4 shows that the inclusion of the
external MCR information helps greatly to account for the non-linear
behaviour observed in the scatter plot of Figure 12.3. Application of
the PLSR algorithm to the EMSC-corrected spectra resulted in the
PLSR overview of Figure 12.16.
The resulting PLSR model required three PLS factors to obtain a
valid calibration. The model exposed the existence of two reference
chemistry outliers, marked in the score and predicted vs reference
plots for convenience. The score plot was sample grouped using the
reference chemistry values as a grouping variable split into five
ranges. Outlier 1 was predicted at 15.9% whereas its reference
chemistry was estimated at 11.5%, while outlier 2 was predicted at
22.5% and its reference chemistry was estimated at 15.9%.
plot shows that the relative positions of the outliers are in the same
space as their predicted results, it can be concluded that the two
samples are reference chemistry outliers and should indeed be
removed from the analysis. The updated PLSR overview is provided
in Figure 12.17.

Figure 12.18: a) MSC preprocessed and b) EMSC preprocessed fat in fish calibration
models.

Table 12.1: Comparison of some fat in fish calibration models using
different preprocessing strategies.

Model ID   Preprocessing        Number of factors   SEC    SEP    R2 (validation)
1          None                 5                   2.33   2.95   0.92
2          Second derivative    3                   2.19   1.63   0.98
3          MSC                  5                   2.92   2.45   0.95
4          EMSC                 3                   1.46   1.58   0.98
5          SNV                  4                   2.96   2.22   0.96
6          mEMSC                3                   1.02   1.00   0.99

Now that problem-dependent correction has been applied, the


question is, why is this model better than those based on the simpler
preprocessing alternatives?
The predicted vs reference plots for the MSC and simple EMSC
models (with outliers removed) are provided in Figures 12.18a and b,
respectively, and the model development statistics using test set
validation of a number of models are provided in Table 12.1.
This example shows well how there may not always be one,
singular best preprocessing approach. Rather, the data analyst will
typically be left with a few roughly equally good alternatives—and any
of these could be used as the final solution.
compound model has been achieved, it is no longer a matter of
decimal advantages in one or other particular validation (or modelling)
statistics—it is much more about what meaningful arguments can be
produced for the specific solution developed. As always in the
multivariate data modelling domain, full understanding of the
methods deployed is of paramount importance (both regarding the
preprocessing used as well as the modelling strategy) as is a solid
foundation regarding the inherent data uncertainties embedded in the
original data (X), the latter guarding against over-fitting.
After a careful analysis of the results presented in Table 12.1, it
can be seen that the mEMSC model produces both the smallest
calibration and validation precision statistics with the smallest number
of components. This is what justifies its selection. Then, and only
then, can the R2 value be considered (the relevant error estimates are
provided for the sake of completeness).
This example highlights some of the advanced techniques the
data analyst can try on their data sets once a comprehensive
understanding of the fundamental principles of multivariate analysis
has been gained. The application of MCR and advanced preprocessing
in this example was founded on interpretability and validation but,
most importantly, on the requirement that the final result be simple,
i.e. the lowest number of PLSR factors resulting in the best precision
estimates.

12.10 Chapter summary


This chapter introduced two advanced topics in multivariate analysis:
factor rotation and multivariate curve resolution (MCR). These topics
are related to each other in the sense that they attempt to find
physically interpretable components after the application of PCA, or
to transform the data out of a problem zone. Factor rotation attempts
to find this “simple structure” through the use of orthogonal rotation
matrices. In this way, the original PC space is rotated such that major
sources of variability line up with the established PC axes. Orthogonal
rotation works best when it is observed that the variance described in
the PCA is distributed across two PCs and thus, using rotation, the
underlying (and hopefully physically interpretable) structure is better
aligned with the PC axes.
Even after the application of orthogonal rotation to a PCA, the
rotated loadings may still be physically uninterpretable. To overcome
these situations, MCR can be used. MCR performs PC rotation under
a set of predefined constraints. In the particular case of
spectroscopic data analysis, if the original scale of the measurements
is between 0 and 2 absorbance units, then a constraint such as non-
negativity of spectra will force all estimated spectral profiles to have
values ≥0, thereby providing a basis for physical interpretation (where
this is possible).
MCR will not always lead to an interpretable situation either, due
to ambiguity in the method. It is very important, when interpreting
estimated concentrations and spectral profiles from MCR, to check
that the results are interpretable and have physical meaning (typically
through comparison of the estimated spectral profiles with known
reference spectra); otherwise false conclusions, and problems in the
long term, will result.
Extraction of pure spectral profiles using MCR is also a very useful
tool for the data analyst’s toolkit, particularly when looking at suitable
preprocessing methods to apply to a new data set. Pure component
spectra are not always available (particularly when biological matrices
are being investigated), and when MCR can
extract useful information, these extracted profiles (and potentially
also interferent spectra) can be used as “good” or “bad” spectra in
preprocessing methods such as mEMSC.
An introduction to MCR was provided that is by no means
exhaustive and if this topic is of interest to the reader, the excellent
publications of the Tauler group [16] should be investigated. MCR
was applied to two-way data structures in this chapter, but it has
equal applicability in N-way structures and imaging data (both
textural and hyperspectral), naturally on a more complex level than
the 2-way data sets presented in this chapter.

12.11 References
[1] Jackson, J.E. (1991). A User’s Guide to Principal Components.
John Wiley & Sons. https://doi.org/10.1002/0471725331
[2] Harman, H.H. (1976). Modern Factor Analysis, 3rd Edn.
University of Chicago Press, Chicago.
[3] Westad, F. and Kermit, M. (2009). Independent Component
Analysis, in Comprehensive Chemometrics. Elsevier Science, Ch.
2.14, pp. 227–248.
[4] Kaiser, H.F. (1958). “The varimax criterion for analytic rotation in
factor analysis”, Psychometrika 23, 187–200.
https://doi.org/10.1007/BF02289233
[5] Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate
Analysis. Academic Press, London.
[6] Neuhaus, J.O. and Wrigley, C. (1954). “The quartimax method:
an analytic approach to orthogonal simple structure”, Brit. J.
Stat. Psych. 7(2), 81–91.
https://doi.org/10.1111/j.2044-8317.1954.tb00147.x
[7] Darton, R.A. (1980). “Rotation in factor analysis”, The
Statistician 29, 167–194. https://doi.org/10.2307/2988040
[8] Malinowski, E.R. and Howery, D.G. (1980). Factor Analysis in
Chemistry. John Wiley & Sons.
[9] Saunders, D.R. (1953). “An analytic method for rotation to
orthogonal simple structure”, Princeton, Educational Testing
Service Research Bulletin 53–10.
[10] Crawford, C.B. and Ferguson, G.A. (1970). “A general rotation
criterion and its use in orthogonal rotation”, Psychometrika
35(3), 321–332. https://doi.org/10.1007/BF02310792
[11] Browne, M.W. (2001). “An overview of analytic rotation in
exploratory factor analysis”, Multivariate Behavioural Research
36(1), 111–150.
[12] Tauler, R. and de Juan, A. (2006). “Multivariate curve
resolution”, in Practical Guide to Chemometrics, 2nd Edition.
Taylor & Francis, Ch. 11.
https://doi.org/10.1201/9781420018301.ch11
[13] Maeder, M. and Neuhold, Y.M. (2006). “Kinetic modeling and
multivariate measurements with non-linear regression”, in
Practical Guide to Chemometrics, 2nd Edition. Taylor & Francis,
Ch. 7.
[14] Rutan, S.C., de Juan, A. and Tauler, R. (2009). “Introduction to
multivariate curve resolution”, in Comprehensive Chemometrics,
Vol. 2. Elsevier, Ch. 2.15, pp. 227–248.
[15] Maeder, M. and Zilian, A. (1988). “Evolving factor analysis, a
new multivariate technique in chromatography”, Chemometr.
Intell. Lab. Syst. 3, 205–213.
https://doi.org/10.1016/0169-7439(88)80051-0
[16] Multivariate curve resolution web page,
http://www.cid.csic.es/homes/rtaqam/ [accessed 11 January
2017].
[17] Tauler, R. (1995). “Multivariate curve resolution applied to
second order data”, Chemometr. Intell. Lab. Syst. 30, 133–146.
https://doi.org/10.1016/0169-7439(95)00047-X
[18] de Juan, A. and Tauler, R. (2003). “Chemometrics applied to
unravel multicomponent processes and mixtures. Revisiting
latest trends in multivariate resolution”, Anal. Chim. Acta 500,
195–210. https://doi.org/10.1016/S0003-2670(03)00724-4

* PCA, and MCR even more so, have been likened to “mathematical data chromatography”. In this analogy, PCA goes about its work with a standard column length and a fixed temperature setup, while MCR uses a flexible column length, advanced temperature programming etc.
Chapter 13. Process analytical
technology (PAT) and its role in the
quality by design (QbD) initiative

13.1 Introduction
The pharmaceutical and related industries have been given incentive
to adopt state-of-the-art process monitoring and control strategies,
much like other industries have been doing for many years.
Traditional arguments from within the industry, such as "the pharmaceutical industry is different from other industries", have sometimes stifled the opportunity to become innovative; however, this situation is gradually changing. Is the pharmaceutical industry different from other industries? "Yes, absolutely"; it deals with the treatment of sick people, and the need for high-quality products that do their job is imperative. Is the pharmaceutical industry different from other industries from a manufacturing perspective? "Absolutely not": all industries share the same issues regarding product quality and process efficiency. Within the industry, contrary arguments that try to maintain the status quo may be blocking the opportunity to improve; typically it is cited that a regulatory agency "will not accept" substantial changes to the process or product, sometimes without even "testing the water".
Indeed, it used to be a very expensive process to change the market dossier of a product if any process or product changes were to be made post approval, and this increased the resistance to making any meaningful process changes, even when they were necessary. In the early 2000s the US FDA, under the guidance of Dr Ajaz Hussain, analysed all of the warning letters issued to companies based on process deviations and instituted what is known today as current Good Manufacturing Practices for the 21st Century (cGMPs for the 21st Century [1]). The resulting concept and final report were called the "Scientific, Risk-Based Approach" to pharmaceutical
manufacturing. Dr Hussain highlighted the notable lack of innovation
in pharmaceutical manufacturing at a conference in Singapore in
2007, where he stated that the manufacturers of M&Ms have much
tighter controls over their coating process than the pharmaceutical
industry has on tablet coating processes. Should this not raise
concern? Is not the pharmaceutical industry held up as the gold
standard of product quality and manufacturing excellence in the
public eye?
Hussain’s main focus was to encourage industry to adopt a
paradigm shift from an 18th century approach to quality to a 21st
century approach. The area of precision agriculture has utilised state-
of-the-art technology for many years for fertilisation management and
irrigation planning for crops using near infrared (NIR) spectroscopy
and chemometrics, methods only recently adopted to any great
degree by the pharmaceutical industry. cGMPs for the 21st Century
was a concerted attempt to help industry realise that innovation does
not stop after the R&D stage and should continue throughout the
entire product's lifecycle. In order to make innovation work, a new mindset is required throughout an entire organisation, where a "can do" attitude is adopted rather than a reactive one. The reality is that "quality costs", and cannot be considered a red line on an accountant's ledger. Paying for quality upfront will
naturally lead to cost-effective manufacture of the highest standard of
product. To aid industry in the implementation of better quality
systems, the Quality by Design (QbD) initiative was established such
that scientists and engineers could implement the latest advances in
process monitoring and control systems with a regulatory framework
to support such implementation. QbD therefore requires advanced
process sensor technology and modern quality systems to enable
such implementations. This is where the general premise of Process
Analytical Technology (PAT) comes to the fore.
This chapter aims to provide both new and existing practitioners
with an overview of the QbD and PAT initiatives, their interrelationship
and how this all ties into the key analysis methods of DoE,
chemometrics and TOS. The most important guidance documents
will be reviewed in a pragmatic manner along with a practical,
implementation approach to the pharmaceutical quality system
(PQS). QbD is the embodiment of all of the concepts discussed in this
textbook so far—from sampling, to appropriate technology, to the
design of rational experiments right through to better process
understanding and finally a system for monitoring and controlling
processes to meet the highest possible levels of validity.

13.2 The Quality by Design (QbD) initiative


A key statement from the cGMPs for the 21st Century guidance is the
following,
“Quality cannot be tested into products, it should be built in, or by
design”
This statement powerfully outlines a paradigm shift from “Quality
by Testing” to “Quality by Design”. While quality control (QC)
practices are highly important for many aspects of a product's release, they are not all-encompassing, and the results generated are typically taken from a non-representative sampling scheme (chapter 3). The situation is described in full, and adequate solutions are offered, in Esbensen et al. [2].
As a key example, there is an essential mixing step in almost any
solid dose pharmaceutical manufacturing process, batch or
continuous. In spite of intense efforts over more than 20 years, the
current state of affairs regarding adequacy and verifiability of
pharmaceutical mixing and tablet homogeneity is at an impressive
standstill. The situation is characterised by two draft guidance
documents, one of which has been withdrawn, and the second never
approved. Esbensen et al. [2] analysed the contemporary regulatory,
scientific and technological situation and suggested a radical way out
calling for a paradigm shift regarding sampling for QC of
pharmaceutical blends. In synergy with the QbD/PAT efforts, blend
uniformity testing should only be performed with properly designed
sampling approaches that can guarantee representativity—in contrast
to current regulatory demands for severely deficient thief sampling.
This was shown to be the only way to develop the desired in-process
specifications and control for content uniformity and dosage units
meeting desired regulatory specifications. Their exposé shows how
process sampling based on TOS constitutes a new asset for meeting
the requirements of section 211.110 of the current Good
Manufacturing Practices regulations [3]. This approach was called
upon to establish the desired science-based, in-process
specifications allowing independent approval or rejection by the
quality control unit. A strategy for guaranteed representative sampling and monitoring, with a "built-in" automated measurement system check (variographic analysis), was shown to facilitate comprehensive quality control of pharmaceutical processes and products.
It has been the authors' experience that in some companies, when a single failure has been detected in a sample, the instinctive reaction is to keep on sampling and testing in an effort to retain the batch. This is the ugly face of quality by testing and does not adhere to the principles of QbD.
So then, how is QbD achieved?
Three key terms have resulted from the QbD initiative and these
are,
Critical process parameters (CPPs), which have been determined
to have the most impact on product quality. Methods such as
factorial designs and optimisation designs (chapter 11) can be used
to understand the main effects and interactions of the CPPs that
influence quality.
Critical quality attributes (CQAs), which are the key product
performance and efficacy characteristics of a product that make it
effective for its intended purpose. These are typically the response
variables of a designed experiment or a multivariate quality
approach and can, in many cases, be attained using PAT.
Quality target product profile (QTPP): A prospective summary of
the quality characteristics of a drug product that ideally will be
achieved to ensure the desired quality, taking into account safety
and efficacy of the drug product, ICH Q2(R1) [4].
These critical features lead to another major concept in QbD,
namely the Design space. The definition of design space comes from
the important guidance document issued by the International
Conference on Harmonisation (ICH) entitled Q8, Pharmaceutical
Development ICH Q8(R2) [5],
“The multidimensional combination and interaction of input
variables and process parameters that have been demonstrated to
provide assurance of quality”
Stated in a manner that is consistent with this book, the definition
of design space may be rewritten as follows,
“The use of methods such as Design of Experiments, Multivariate
Analysis and Statistical Process Control that have established the
effects and interactions of the CPPs such that the CQAs have been
assured at the point of manufacture in real time”
This means that a process-oriented rather than a product-oriented approach to quality is required, which has been stated to be the
underlying premise of PAT. Following on from the definition of design
space, the concept of desired state is defined as follows from ICH
Q8(R2),
“Product Quality and Performance achieved and assured by
Design of Effective and Efficient Manufacturing Processes”
Stated in a different way: within the design space lies the "desired state".
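To make the idea of "demonstrated assurance of quality" a little more concrete, the following minimal sketch (in Python) assumes that a response-surface model for a single CQA has already been fitted from a DoE over two CPPs; all variable names, coefficients and the acceptance limit are purely illustrative and not taken from any real product. The design space is then explored by evaluating the model over a grid of CPP settings and flagging the combinations for which the predicted CQA meets its acceptance criterion.

import numpy as np

# Hypothetical response-surface model for one CQA (e.g. dissolution, %),
# assumed to have been fitted from a DoE on two CPPs; coefficients are illustrative only.
def predict_cqa(impeller_speed, liquid_rate):
    x1 = (impeller_speed - 300) / 100        # centred/scaled CPP 1 (rpm)
    x2 = (liquid_rate - 1.5) / 0.5           # centred/scaled CPP 2 (L/min)
    return 85 + 4 * x1 - 3 * x2 - 2 * x1 * x2 - 1.5 * x1 ** 2

# Evaluate the model over a grid of CPP settings and flag the combinations for
# which the predicted CQA meets an (illustrative) acceptance limit of 80%.
speeds = np.linspace(200, 400, 21)
rates = np.linspace(1.0, 2.0, 21)
S, R = np.meshgrid(speeds, rates)
inside = predict_cqa(S, R) >= 80.0

print(f"{100 * inside.mean():.0f}% of the explored CPP grid lies inside this illustrative design space")

In a real application the model, its uncertainty and the acceptance limits would of course come from the documented DoE and risk-assessment work, not from assumed numbers as above.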
It is now apparent that the key concepts of QbD can be achieved through the use of DoE and MVA; other approaches can also be used, but may not be as effective as these. QbD, like PAT, is not a single approach or methodology. It is the development of a new skillset that can be modified and adapted based on the problem at hand. The embodiment of QbD has recently been realised in continuous manufacturing systems (CMS) currently approved by the US FDA for the manufacture of solid dose products in a real-time release (RtR) environment. Each product/process combination has to be solved in its own unique way, and reliance on a single technology to solve every problem is not an option. This is exactly the right attitude, and it is one of the first lessons taught in the PAT curriculum, Dickens [6].

13.2.1 The International Conference on Harmonisation (ICH) guidance

To further bolster the pragmatic guidance for implementing a QbD strategy, the four documents issued by ICH listed below form the foundation of an excellent framework,
ICH Q8, Pharmaceutical Development [5]: This document outlines
the key aspects of utilising the tools of QbD primarily for secondary
manufacturing. It defines the design space and provides practical
examples for implementation.
ICH Q9, Quality Risk Management [7]: This document defines a number of tools to be used to mitigate the risk that the many input and output variables of a process will cause serious harm to the end user of the product. It is an effective strategy for defining the CPPs and CQAs that then need designed experiments to assess main effects and interactions, but its most valuable use is to pre-screen out low-risk factors so that valuable experimentation budget is not spent on unimportant variables.
ICH Q10, Pharmaceutical Quality System (PQS) [8]: This document defines strategies to support IT and control engineers when implementing a real-time QbD system for process monitoring or control. It is most useful when combined with the guidance of GAMP®5 [9], particularly for the validation of computerised systems. The PQS is the central theme behind continuous improvement (CI), corrective and preventive action (CAPA), overall equipment effectiveness (OEE) and early event detection (EED).
ICH Q11, Development and Manufacture of Drug Substances (Chemical Entities and Biotechnological/Biological Entities) [10]: This document is the primary manufacturing equivalent of ICH Q8, which extends the principles of QbD to biotechnology/biological products and their manufacturing processes.

Figure 13.1. Interrelationships of the ICH guidance documents specific to QbD.

Figure 13.1 shows the interrelationship between the ICH Q8–Q11 documents.
As with most good initiatives in the pharmaceutical and related industries, an ocean of guidance documents has appeared from everywhere, and many groups/societies have produced so much documentation that the overall initiative is at risk of becoming a great talk fest rather than a pragmatic step towards better manufacturing. An anecdote coined by the authors of this text a number of years ago was "PAT is not idle chat".

13.2.2 US FDA process validation guidance


Possibly the single most important guidance document to support the
QbD initiative is the US FDA’s 2011 Process Validation Guidance [11].
Condensed to its simplest form, this guidance has two main
focusses,
1) All new submissions to the US FDA must be based on the QbD
approach.
2) The three batch validation approach is no longer acceptable and
continuous verification is now a requirement.
The only real way to achieve continuous verification is through the implementation of an effective PQS, which monitors and controls CPPs and CQAs through the use of PAT and methods such as statistical or multivariate statistical process control (SPC/MSPC), themselves established through methods such as DoE and risk mitigation. Under the three-batch validation process used until only recently, time to market was typically the main driver of the validation effort. With quality taking a back seat, the validation effort was typically biased in such a way as to ensure absolute success. To explain: during a validation effort, a company would require its best process operators to manufacture the validation batches, the raw material suppliers were asked to provide their known best batches of material, and the laboratory personnel generating the analytical data were typically the most experienced analysts. What resulted was a "best case" picture of manufacturing, which did not reflect routine, non-ideal situations, i.e. inexperienced operators following standard operating procedures (SOPs) to the letter would end up with non-conforming batches. Why? There are many things that can go wrong in batch production that are out of the control of most companies when a serious material issue comes up; however, when a validation effort is biased, the process has not been tested for robustness and batch issues will typically follow, either in the operation at hand or in downstream operations.

13.3 Process analytical technology (PAT)


The classical paragraph cited from the 2004 PAT Framework
Guidance document [12] is the definition of PAT stated as follows,
“The Agency considers PAT to be a system for designing,
analysing, and controlling manufacturing through timely
measurements (i.e., during processing) of critical quality and
performance attributes of raw and in-process materials and
processes, with the goal of ensuring final product quality.”
A common misconception regarding PAT is that it is a means of
bringing the laboratory to the process. This misconception is no
different to “quality by testing”. The answer to what PAT is can be
found by careful inspection of the definition above. PAT is an enabler
of QbD in the sense that correct technology adoption can reveal
insights into new and existing processes and these insights allow the
modification or even replacement of existing equipment in favour of
equipment that will minimise the risk of manufacturing failures. This is
performed by analysing the data obtained and making scientific, risk-
based decisions driven by objective data that will allow the
implementation of a control strategy, particularly through the development of a PQS; refer to section 13.4 for more details on the PQS and its construction.
Therefore, PAT is not necessarily a single spectrometer generating quantitative data for a quality control test; it is a complete, holistic means of continuous improvement and early event detection such that proactive, rather than reactive, quality decisions can be made. This is the intended meaning of "timely measurements". It must also be noted that valid (i.e. representative) raw material characterisation is one of the most important aspects of a true PAT initiative and correspondingly of a QbD system. Many companies have adopted NIR and Raman spectroscopy for identification of raw materials. This in itself does not constitute PAT. It is only the replacement of a compendial monograph test by an alternative identification test. What can an identification test reveal about the material's processability? Absolutely nothing! In addition comes the fact that NIR and Raman probe heads only see a very small fraction of the material flow; this may, in many circumstances, lead to a significant fundamental sampling error (FSE) a.o., see chapter 3 and Esbensen and Paasch-Mortensen [13].
It is only when the technology is used to predict the material's quality attributes and characterise its processability that this usage can be considered a PAT. This information then has to be used as a CQA in some form that will be an input (or a CPP) to another unit operation. The authors have used the following "jack-in-the-box" analogy when speaking about raw material understanding. If a raw material's inherent variability is high, this is like stuffing a spring into a box and closing the lid. Since the current way of thinking is to keep all processing parameters fixed, the process is not allowed to adapt to the raw material's characteristics. Therefore, when the box is opened, the spring expands violently, and this expansion of variability is what typically happens to a product when it is manufactured based on a fixed process model understanding—which is unrealistic. This
situation is depicted in the top pane of Figure 13.2.
The QbD/PAT approach is first to understand the material and devise better campaigning strategies of materials to products. Then, based on the raw material characteristics, the process is allowed to become flexible and able to adapt to the de facto existing material variants; when the final product is then released, it should have the lowest variability in performance characteristics and the highest quality possible, as depicted in the bottom pane of Figure 13.2.

13.3.1 At-line, online, inline or offline: what is the difference?

These definitions have caused much confusion to practitioners over the years, and this section aims to provide, once and for all, a definition of what these terms mean and why they are implemented as such.
1) At-Line: The physical taking of samples from a process line via a
pre-established protocol to a measurement system that is in close
proximity to the process for the quasi-real-time assessment of a
series of samples for quality characterisation and detection of
process deviations. These samples are typically not returned to the
main product stream after analysis. This approach may or may not
be of sufficient speed to perform the needed PAT role.
Figure 13.2: Product variability using traditional and QbD approaches to manufacturing.

2) On-line: The introduction of a system that can bypass the main product flow, or a scaled-down fraction hereof, and hold a sample
in a stationary manner such that a longer analysis time can be used, or so that the sample can be preconditioned, before an analysis method is applied. These samples may or may not be
returned to the main product stream, depending on whether any
modification to the product’s integrity has been made. The
functioning of the bypass valve is the critical element in this
approach. It is necessary to have demonstrated that the bypass
flow is indeed representative of the main flow, an issue often
overlooked or actively suppressed, see chapter 3 and below.
3) In-line: The placement of the analysis system directly in the main
product stream that has been demonstrated to produce
representative (or fit-for-purpose representative) measurements of
the product as it exists in the process. Theoretically, this approach
is meant to minimise the major sampling errors for such
measurements as no samples are extracted from the line, unless a
sampling port has been designed such that the sample measured
is the one collected. It is often overlooked that reference samples
must be extracted from the same flow in order to provide for a
bona fide multivariate calibration. The requirement that in-line measurement systems have indeed eliminated all sensor sampling errors is very often overlooked or suppressed. This
critical issue is analysed and exposed in full in Esbensen and
Paasch-Mortensen [13].
4) Off-line: The physical taking of samples from a process, via a pre-established protocol, to a remote location (usually a QC laboratory) for detailed analysis using a number of analytical and physical tests. Results are typically nowhere near real time and are typically not used for process correction, except in certain fast cases.
5) Remote sensing: The first impulse on reading "remote sensing" may well be a satellite platform equipped with appropriate sensors (LANDSATs, or the many more advanced Earth Observation satellites from NASA and NOAA), space probes (e.g. New Horizons) or planetary rovers, e.g. Curiosity, which is equipped with a ChemCam (chemical camera) that a.o. employs The Unscrambler® software. However, with PAT a pathlength of 800 km is not being used, but one of cm to mm only. Any sensor system interrogating the process material through a non-contact interaction is a remote sensing approach. A prime example would be a NIR camera located 80 cm above a conveyor belt transporting wood shards, the analyte in question being "instant moisture determination" (at least from the uppermost few mm of the lot material being carried through the field-of-view of the camera). Other applications concern, e.g., NAA (neutron activation analysis) for density determination, or "clamp-on" impedance sensors intended to characterise the flow regime of compound oil/water/gas flows in pipelines. For a broad catalogue of PAT modalities in the present context, see Bakeev [14].

13.3.2 Enablers of PAT

NIR spectroscopy has been the major PAT used until recently and
has enjoyed the status of being the preferred technology for use in
the pharmaceutical industry, mainly due to its versatility and non-
destructive sampling. In more recent times, there has been an
emergence of other spectroscopic and non-spectroscopic tools that
have found their way into the PAT practitioners’ toolkit including
Raman spectroscopy, terahertz spectroscopy, improvements in mid-
infrared spectroscopy, particle size analysis (PSA) and many more
tools (particularly based around imaging) are becoming available all
the time.

Near infrared (NIR) spectroscopy for PAT

NIR spectroscopy has found widespread usage as a means of raw
material identification. This is because of the speed aspects of the
technology and for lot sizes of 100+ containers per delivery,
laboratory testing times could be reduced by up to 90% compared to
traditional pharmacopoeia monograph testing. However, since NIR is
also sensitive to the physical characteristics of materials, Plugge and
Van der Vlies [15], raw material performance attributes were soon
being predicted from the identification scans. This is the differentiating factor that turns a simple ID test into a PAT. The instrumentation available in the early days of pharmaceutical NIR was primarily holographic grating-based, Swarbrick [16], which meant it was not amenable to simple implementation into a process environment. Many studies
were performed on pilot scale equipment to show that NIR could
provide detailed insights into processes such as solid dose blending,
fluid bed drying a.o. However, it was not until the emergence of diode
array (DA)-based instruments that PAT took the next major step towards monitoring processes in real time.
The DA instruments offered distinct advantages over both grating-based and Fourier transform (FT) instruments: speed and no moving parts. This made them more robust to manufacturing conditions; however, the long-term stability of these early instruments
was poor compared to the research-grade instruments, and their
early adoption was limited to a small number of PAT groups.
Unfortunately, only a small number of the most progressive
companies allowed risk-based implementation of NIR into production
equipment.
There were a number of groups who developed elaborate systems
using grating-based and FT systems for rotating blenders and
stationary driers. With the FT instruments proving to be more reliable
when scanning moving powder samples, Berntssen [17], these
systems became the first choice for implementation and fibre optical
cable interfaces to multiplexed spectrometers became a popular
choice. With the birth of the age of mobile phones, wireless connectivity soon made its way into NIR analysers. This opened up many new opportunities to monitor powder blending processes in rotating blenders. Whatever the configuration (IBC, V-blender,
double cone blender etc.) interfacing the NIR to the vessel(s) is
possible using standard sanitary fittings and the inclusion of a
sapphire sight window at the optimal sampling point. Methods such
as PCA and moving block standard deviation (MBSD, and related
methods) have been successfully used to monitor and stop the
blending process when the endpoint has been reached [18, 19].
However, keeping the reader on his/her toes w.r.t. the lessons
learned in chapters 3 and 9, what is the typical size of the analytical
volume relative to the full volume of the vessel in this scenario? It is
not a priori given that many spectra obtained from a sensor grab
sampling approach necessarily averages up to a full representative
signal of the entire flow in front of the window (but it is always
possible to test this hypothesis specifically by a replication
experiment, see chapter 9 for more details).
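To make the MBSD idea mentioned above a little more tangible, the following minimal sketch (in Python) assumes a matrix of consecutive NIR spectra collected during blending (rows = time points, columns = wavelengths) and computes, for each sliding block of spectra, the standard deviation over time at each wavelength averaged across wavelengths; a persistently low MBSD value is commonly taken as an indication that the blend has reached its endpoint. The data source, block size and threshold are all hypothetical.

import numpy as np

def moving_block_std(spectra, block_size=10):
    # spectra: 2-D array of consecutive NIR spectra (rows = time, columns = wavelengths)
    n = spectra.shape[0]
    mbsd = []
    for start in range(n - block_size + 1):
        block = spectra[start:start + block_size, :]
        # standard deviation over time at each wavelength, averaged over wavelengths
        mbsd.append(block.std(axis=0).mean())
    return np.array(mbsd)

# Hypothetical usage: declare the endpoint reached when the MBSD profile stays
# below a validated threshold for the last few blocks (threshold illustrative only).
# spectra = ...  # e.g. loaded from the blender's NIR data logger
# profile = moving_block_std(spectra, block_size=10)
# endpoint_reached = np.all(profile[-5:] < 0.002)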
The PAT aspects of monitoring blending operations stem from the fact that powder mixing processes are the least understood of all mixing phenomena, Muzzio [20]. Sampling inside the bed is typically achieved using a method known as thief sampling (a fancy name for grab sampling, chapter 3) that a.o. results in forced segregation of the blend; all thief sampling therefore principally yields non-representative data for blend uniformity, Muzzio [21]. In the case of dynamic blending
systems, the principle of mixing is based on cascade flow, where the
powder bed folds over itself in the blender and eventually uniformity
is supposed to be achieved. Sampling in this case by NIR is relatively
simple, since the cascading powder blend forms a front (similar to the
crest of a wave as it breaks on a beach), placing the instrument into
the blender at any point in the direction of rotation will lead to a
measurement of the powder as it exists in the process. Blanco et al.
[18] have reviewed the methods used for determining blending
endpoints and have found that methods based on PCA are the most
robust for assessment. This is because PCA can separate major
sources of variation (i.e. macro-mixing phenomena due to overall
blending of a mixture) from minor, but still important, sources of
variation (i.e. micro-mixing phenomena that are highly important for
blends that contain a small amount of the active ingredient). This
information is achieved through the spectral loadings generated in
PCA and their interpretation. The complete, very complex issue of assumptions vs myths vs facts regarding mixing processes was analysed by Esbensen et al. [2].
Again, as was stated previously, PAT is not a single tool or
approach to all problems and in many cases the running of multiple
endpoint models on a single process may lead to better
understanding. The following important information can be obtained
through the application of NIR into a dynamic blender,
1) The macro (i.e. large-scale blending) uniformity of the mixture.
2) The micro (i.e. interstitial blending) uniformity of important blend
ingredients.
3) The attrition that may occur and leads to process issues
downstream.
Continuing from point 3 above, when NIR spectra are allowed to
be collected on a process for an extended period, a typical sinusoidal
pattern of blending may be observed. This cyclic behaviour is the
result of mixing/de-mixing processes that can either be attributed to
attrition (i.e. the breaking down of the particle size of the mixture
ingredients and their re-distribution) or is simply the end result of
what mixing can achieve on a mixture that contains differently sized
particles; the extra little mixing achieved is immediately nullified by
counteracting segregation resulting in a non-vanishing steady-state
situation characterised by a significant residual heterogeneity, i.e.
homogeneity cannot be achieved completely regardless of mixing
time [2].
If the particle size becomes too fine, some powder blends are
more likely to segregate (others are not), but more dust is produced
and if/when the powder is being compressed, issues such as punch
sticking can cause production issues, more downtime and less
process efficiency.
For many years, the ultimate goal of NIR was to monitor the
content uniformity of tablets as they come off the tablet press as a
100% inspection system. While this application would prove to be an
excellent way to enhance batch traceability and allow a reject system
of tablets that did not conform to specification, is this really a PAT
implementation or is it just bringing the QC lab to the process? In any
case, the tablet ejection speeds are just too fast for a reliable
NIR measurement to be taken and a new strategy was sought for this
application. At the top of the punches of a tablet press, there is a
system known as the feed frame. This is where powder from a
container is either gravity fed or vacuum fed to the press and the
powder is fed into the tablet dies. There is an excellent opportunity to
place an NIR probe (particularly the micro-instrumentation based on
linear variable filter (LVF) technology, Swarbrick [16], which has a
small instrument footprint and is robust to the dusty conditions of a
tablet press). Placing such a sensor just above the powder may lead
to quasi-100% inspection of the tablets. A tablet unit dose is
considered to be a 100% statistically representative sample (this is
because a tablet press is (for all intents and purposes) a large
spinning riffler and if each tablet was to be tested at-line, or on-line
using an external system, the process would take weeks to complete.
The purpose of the feed frame system is to ensure uniformity of
delivery to the press and if any deviations are observed, this
information must be related back to the flow characteristics of the
powder, either in gravity fed operations (through mechanisms such as
rat holing or percolation) or vacuum transfers where segregation may
be influenced by static or other factors. This is the PAT aspect of the
tablet press monitoring application and the information obtained is
used for process improvement, not a replacement of QC testing (this
is a secondary benefit of PAT and should always be viewed in this
way).

Case study, NIR for fluid bed drying monitoring and control
(real-world PAT implementation)
In one implementation, an FT instrument was coupled to two fluid bed
driers (FBD) using a multiplexer in a manufacturing facility of generic
products for one of its product formulations. The manufacturer was
experiencing a major bottleneck and downstream processing issues
for this product and isolated the FBD as the root cause of the
problem.
Careful analysis of the data being generated by QA suggested
that the loss on drying (LOD) of the product was not only missing target more often than not, but also that the moisture distribution in the powder bed was non-uniform. This issue suggested that the
initial three-batch validation approach was not robust enough to pick
up this process flaw! The secret to this processing issue was found
through the use of NIR. After a strict design of experiments procedure
was carried out to optimise the position of the fibre optic probe in the
drier, initial trials on real production batches were conducted. The
method of PCA targeted at the 1930 nm (moisture) region of the
spectrum was observed in scores and loadings space. This showed
that the product was dry in 10 min (compared to the validated 40 min
specification). It was interesting to note that the SOP for the process
stated that after 10 min, the process should be stopped, the granules
should be remixed and the bed placed back into the FBD for
continued drying. This raised concerns as the need for extra manual
handling could introduce potential contamination.
After a review of the data generated by NIR, when the bed was
spatially sampled (unfortunately grab sampled in this case, chapter
3), the data revealed that the side walls, where the NIR probe was
located, were typically much drier than the inner sections of the bed.
This indicated a lack of fluidisation in the process and this was why
there was a need for manual remixing of the granules. The next step
was to perform an engineering study on the process to better
understand why the bed was not fluidising in the first 10 min.
In a modern FBD, the air vents to the bed are usually vertically
oriented with respect to the FBD column, providing the most efficient
airflow to the wet mass. Also, these modern FBD systems have a
dehumidification system that controls the moisture level of the heated
air that dries the powder. In this particular case, the FBD was an older
system with a side air vent and no dehumidification system. These
were the major causes of the lack of robustness in the process and
an engineering solution was required to initiate fluidisation without
manual intervention. This was provided in the form of a feature of the
drier known as the “product loosening” button that automatically
induced a fluidised bed; however, it had not been used in the past as there was no way of knowing when to apply the button, until NIR came along.
To assess if this functionality was the key to the problem, NIR was
used to monitor when the bed reached its “dry” state and then the
button was manually pressed. The function of the product loosening
button is to create an instantaneous pressure and release cycle that
redistributes the powder without manual intervention. The
observation was that the moisture monitored by NIR rose sharply and
clear fluidisation was visible in the sight glass of the drier. The powder
bed then dropped to near dry around the 20 min point where the
product loosening button was used again. Only a small increase in
moisture was observed at this time point and the product reached a
stable endpoint after only 25 min. Although the reduction of the
drying time by 15 min compared to the validated process did not
seem to be a great gain in time saving, it represented a situation of
greater efficiency where a product of desired state was attained.
However, the real benefits of the NIR method are as follows,
1) The FBD can be operated in a more efficient manner without
complete system reengineering.
2) NIR provided a key insight into the process mechanism and
allowed the engineers to understand the root cause of the problem.
3) Although only 15 min was gained on the process efficiency, this
does not take into account that the bed required manual
remixing twice (10 min per remix) and if the product did not meet
LOD specification after 40 min, it had to be returned to the FBD,
dried for a further 5 min and an LOD taken again. The LOD test
required 10 min and if it had to be performed twice with re-drying,
a total time of 30 min was added to the process. NIR therefore
allowed a reduction of over an hour per mix compared to the
current state and with 8 mixes per batch, the math speaks for itself.
4) NIR allowed the operators to monitor the product to its desired
state without any manual intervention. This resulted in fewer process issues downstream compared to the current implementation.
5) Greater quality was built into the process by design.
It was a common occurrence with the product that the re-drying
step was required for each mix and therefore 8 h (or the equivalent of
one working shift) was required just to account for a poorly validated
process. The NIR was implemented as part of a control system that
automatically implemented the product loosening feature at the
appropriate time, thus allowing improved granule formation and improved material flow, producing fewer issues downstream and, as an added benefit to the organisation, allowing four extra batches to be produced per month without the need for factory expansion.
The following represents a typical cost justification for
implementing PAT in a QbD environment.
Initial system cost and development time (including salaries): 300 K
USD
Operator costs and energy consumption estimate (per hour): 500
USD
Typical cost per batch (eight mixes, current system): 5 K USD
Cost to manufacture four batches per month (less materials): 16
K USD
Operating cost per batch (eight mixes, using NIR): 1.5 K USD
Market value of batch (internal): 200 K USD
Revenue increase through four extra batches: 800 K USD
Increased production less initial equipment outlay: +500 K USD
Payback period: 1 month
These figures are based on costs and available instrumentation at
the time of this development, however, with miniaturisation of
instrumentation (and subsequently lower costs), the figures stated
above are achievable and realistic for this type of implementation.
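The arithmetic behind these figures is simple enough to lay out explicitly; the minimal sketch below (in Python) only reproduces the illustrative numbers quoted above and is not a costing template.

import math

initial_outlay_usd = 300_000             # system cost and development time
batch_value_usd = 200_000                # internal market value per batch
extra_batches_per_month = 4

monthly_revenue_increase = extra_batches_per_month * batch_value_usd       # 800 K USD
net_gain_first_month = monthly_revenue_increase - initial_outlay_usd       # +500 K USD
payback_months = math.ceil(initial_outlay_usd / monthly_revenue_increase)  # 1 month

print(f"Net gain in the first month: {net_gain_first_month / 1e3:+.0f} K USD")
print(f"Payback period: {payback_months} month(s)")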
For a more complete description of the NIR method and its
applications, the reader is referred to the excellent handbook by
Burns and Ciurczak [22] and the concise reviews by Swarbrick [16, 23].

Raman spectroscopy for PAT


Raman spectroscopy has enjoyed a renaissance as an analytical tool
in the pharmaceutical (and other) industries over the past decade. In a
nutshell, Raman spectroscopy offers the sharp spectral bands typical
of the mid-IR region with the sample preparation simplicity of NIR.
Unlike mid-IR and NIR, Raman is a scattering phenomenon, not an
absorption phenomenon and the Raman effect is many orders of
magnitude lower in sensitivity compared to absorption. The Raman
effect in some cases also has to compete with absorption,
particularly in the NIR region and as such, the instrumentation
involved in Raman spectroscopy is much more complex than its
infrared cousins.
Since the development of the notch filter [24], charge coupled
devices (CCDs) and diode lasers, Raman spectroscopy has found
much more application as a process tool, particularly in operations
such as active pharmaceutical ingredient (API) crystallisation
processes where it is extensively used to monitor the formation of
polymorphs [24]. Many new portable Raman instruments have also come onto the market in the past few years for raw
material identification. In particular, the method of spatially offset
Raman spectroscopy (SORS) [24] has been implemented as a means
to measure materials through packaging such as paper or plastic
used to contain the materials. Using Raman as a raw material
identification method does not qualify it as a PAT tool as only
identification is possible. Where Raman finds usage is in situations
where the specificity of NIR is not good enough to distinguish
between chemical species and in situations where the system being
measured is highly aqueous.
Due to the high-powered lasers used to induce the Raman effect,
these systems must be built with the highest possible occupational
health and safety (OH&S) regulations in mind as exposure to the laser
can cause irreparable eye damage, even blindness. When installed as
a PAT tool, Raman spectroscopy is typically interfaced to a process
using fibre optic cables and the implementation of Raman into a
dynamic blending system is not possible, purely based on current
hardware limitations.
There are a number of camps that have arisen in the
pharmaceutical and biopharmaceutical industry in recent times supporting Raman over NIR (and mid-IR) and claiming one is better than the other. It is the authors' experience, through conducting parallel studies of Raman and infrared systems on a low-concentration aqueous chemical reaction, that all technologies have
the same limit of detection and quantification. The only real case
where this breaks down is for surface enhanced Raman spectroscopy
(SERS) [24]; however, this technique is useless for real-time process
monitoring due to the long periods of time it takes for the material of
interest to adsorb onto the substrate before a detectable signal can
be observed. SERS is capable of measuring nanogram-scale
concentrations of materials showing Raman scattering and is an
essential tool in drug development and understanding metabolic
pathways. Returning to the camps advocating one technology over the other: such advocacy is completely unfounded, as the Raman and NIR techniques are complementary and should be used as such. This is in
alignment with the premise of PAT, i.e. use the right technology for
the application, one size does not fit all.
Finally, Raman spectroscopy can suffer from the effects of
fluorescence, even for low concentrations of contaminants in a
system and the effect of the fluorescence is highly laser wavelength
dependent. Many modern Raman systems offer a range of laser
excitation sources from those in the NIR region right through to the
UV/vis region. This means that Raman instruments tend to be single
purpose for a particular application, but when they perform that
application, they perform it extremely well.

Case study: Raman spectroscopy for quantitation of API in a wet massing process

Wet massing (or granulation) is the process of building up the particle size of smaller or "fluffy" APIs (that typically do not have good
blending characteristics) with a binder that is added in liquid form.
The wet granulation equipment typically consists of a stainless-steel
bowl that has a large impeller at its base that moves the powder
mass around in front of a chopper that rotates at high speed to
regulate the particle size of the final granules. Optimisation of the
granulation process can be easily performed using Design of
Experiments (DoE, chapter 11), by means of some form of factorial
and optimisation design. Typical controllable factors include,
1) Impeller speed (rpm)
2) Chopper speed (rpm)
3) Rate of liquid addition (L min–1)
4) Spray vs direct liquid addition
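As a minimal illustration of how the controllable factors listed above might be laid out in a two-level factorial design (chapter 11), the sketch below (in Python) enumerates all 16 factor-level combinations of a two-level full factorial; the low/high levels are illustrative placeholders, not recommended settings.

from itertools import product

impeller_speed = [200, 400]          # rpm (illustrative low/high levels)
chopper_speed = [1000, 2000]         # rpm
liquid_rate = [0.5, 1.5]             # L/min
addition_mode = ["spray", "direct"]  # categorical factor

design = list(product(impeller_speed, chopper_speed, liquid_rate, addition_mode))
for run, settings in enumerate(design, start=1):
    print(run, settings)             # one experimental run per factor combination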
By optimising these factors, consistent granulations are
achievable, but these are macro properties of the system; what
happens inside the granulator at the particle level? It has only been in
recent times, driven by the PAT initiative, that key insights into
powder mixing processes can be gained in situ. Technologies such as Raman, NIR and focussed beam reflectance measurement (FBRM) for
particle size analysis can now be inserted via fibre optic cables into
the granulator, effectively putting a microscope into the process and
gaining real-time mixing information.
In this case study, the aim was to predict hydrate formation in the API present in the granules, as hydrate formation can lead to processing issues downstream.
Raman spectroscopy in the past has not been a reliable method for
quantitation on a macro level due to the typically very small beam
spot size measured by the laser. Recent technology has improved
this situation through the development of process capable probes
that measure over a larger sample area [25].
Mounting the smaller spot size probes into the powder bed of a
granulator can lead to sticking and therefore fouling of the probe
surface. This is because the probe needs to be in close contact with
the powder mass in order to generate an acceptable signal for
analytical measurements. The use of a non-contact probe with a large
sampling window can minimise such fouling and provide more
reproducible spectra in a process environment.
The quantitative results of the two probes are provided in Figure
13.3, which shows again (as is the theme of this textbook) how
important sampling is at all levels and aspects of any problem.
For a more complete description of the Raman method and its
applications, the reader is referred to the excellent handbook by
Lewis and Edwards [24].

Other technologies for PAT, a brief overview


This section only provides a brief overview of some other PAT tools
that have been used to monitor pharmaceutical processes. For a
more detailed explanation of each of these methods, the interested
reader is referred to the textbook on PAT, Bakeev [14].

Mid-IR for reaction monitoring


The mid-IR region of the electromagnetic spectrum lies in the lower-energy region just below the NIR. It is the source of the frequencies measured in the NIR region, i.e. the mid-IR contains the fundamental frequencies of the overtones and combination bands that are observed in the NIR region.
Classically, mid-IR was the method of choice in the QC laboratory
for raw material identification, but due to its high level of sample
preparation and detailed analysis, it was not considered a feasible
option for rapid ID methods. Also, its implementation into processes
requires expensive fibre optic cables and elaborate sampling devices.
It does, however, find use in reaction monitoring of APIs, particularly
in non-aqueous environments. The detailed information found in the
fingerprint region is very useful for understanding reaction
mechanisms and there are a number of commercial systems in use
for this purpose, Coates [26].

Focussed beam reflectance measurement (FBRM) for particle size analysis

Focussed beam reflectance measurement (FBRM) is an in-line
method of analysis for measuring the particle size distribution of
predominantly solid materials. Particularly useful for monitoring
granulation or milling operations, FBRM allows for the detection of
excessive fine material in the powder samples and can help in the
real-time engineering of particle characteristics.
It is typically used in conjunction with a method such as in-line
NIR for measuring multiple characteristics simultaneously and is
finding use in continuous manufacturing systems (CMS) operations.
See section 13.7 for more details on CMS.
Figure 13.3: Comparison of prediction results between a small spot and a large spot
Raman probe.

Other than mid-IR and FBRM, methods such as UV/visible
spectroscopy, terahertz spectroscopy and acoustics have been used
for monitoring processes and all of these utilise multivariate methods
for their interpretation and model building, Bakeev [14].

13.4 The link between QbD and PAT


QbD represents a radical paradigm shift for
pharmaceutical/biopharmaceutical and even medical device
manufacturing at the beginning of the 21st century. It represents an attempt by regulatory authorities to minimise their dictatorial role in
product and process development by giving manufacturers the
freedom to become innovative and to teach the authorities how they
are making their products. This is achieved through the design space,
but the question that is being raised most often is “what is a design
space?" Again, analysis paralysis can take over in companies at the beginning of their QbD journey, and risk assessments are
performed that essentially block any possible progress. So, for the
record, this is the most simplistic explanation of the design space,
“Measure only what is critical to quality, using the appropriate
technology that will allow changes to be made in a proactive, not a
reactive manner”
The term “timely measurements” was used in the fundamental
definition of the PAT initiative and instils a mindset of proactive
process control. Therefore, PAT is a key enabler of QbD and in many
ways the two are not mutually exclusive. This is particularly true when
it comes to the PQS outlined in ICH Q10 (also refer to Figure 13.1).
The four key elements of the PQS are defined as follows,
Process performance and product quality monitoring system
Corrective action and preventive action (CAPA) system
Change management system
Management review of process performance and product quality
Process performance and product quality monitoring system
refers to computerised systems that collect data on CPPs, batch
identifiers (including unit operation identifiers), environmental
conditions and any other data deemed necessary for the manufacture
of high quality products. These data management systems are a key
component of the PQS and more will be discussed in the section on
continuous manufacturing (section 13.7).
Corrective action and preventive action (CAPA) system in the case
of QbD is a proactive system usually based on advanced process
control (APC) platforms. These systems manage and control the
process based on measurements obtained from the process performance and product quality monitoring system, and they also have an intangible aspect not explored by many companies: they can provide an estimate of mean time before failure (MTBF). This is
particularly useful for defining maintenance schedules that will ensure
process equipment will run at its most efficient state, which leads to
quality assurance. Such CAPA systems are typically based on
multivariate statistical process control (MSPC) and more will be
detailed on this in section 13.5.
Change management system in the case of QbD is determined by
the design space established for the process, be it a holistic overview
or a granular (unit operation) management system. As per the
definition of design space,
“Working within the design space is not considered as a change.
Movement out of the design space is considered to be a change and
would normally initiate a regulatory post approval change process.”
This definition in itself provides a more flexible approach to
manufacturing. Gone are the days of a fixed process for variable raw
materials. The process can now be developed to adapt to raw
material and intermediate material changes as long as they are within
the bounds defined by the design space, which is a measure of the
knowledge management of the company implementing the PQS. As
always, deviation from the design space (which has been shown to
indicate an “edge of failure” point of the process/product) can now be
assessed using multivariate controls that not only point to where the
root cause of the failure occurs, but also allow a corrective process to
be implemented before failure occurs. If the process significantly
deviates from the design space, usual regulatory change control
procedures must be initiated in order to determine the root cause of
the problem.

Figure 13.4: Basis of the PQS for pharmaceutical production.

Management review of process performance and product quality
is better implemented through a PQS as timely information can be
retrieved even during the manufacturing process. Annual reporting is
now a matter of compiling the computerised results into a report
template, but knowledge management only takes place if the
outcomes of the reports are acted upon in a reflected continuous
improvement strategy. Any deviations and conclusions can then be
put into a designed experiment strategy for greater process
knowledge and understanding.
This now raises an important question: what constitutes the PQS? From the authors' experience, the PQS must start with as much data collection and automation as possible, through the use of an advanced manufacturing execution system (MES) platform and a supervisory control and data acquisition (SCADA) system connected
to the processing equipment. From there, the other parts build upon
this base. This is shown schematically in Figure 13.4.
The various elements of the PQS are outlined as follows,
PAT level: The correct and validated technology capable of
producing meaningful CQA data and for controlling CPPs.
Manufacturing level: Equipment that has been engineered or
modified to manufacture consistently high-quality product with
minimal downtime and maintenance requirements.
Execution level: A high level system that collects data from many
systems and is capable of adjusting a process in real time such
that proactive quality control is implemented.
Control level: An advanced software platform that can take
compiled data from the execution level and PAT level systems,
apply MVA/DoE models to the data and feed this information back
to the execution level for APC. This level also stores data in a secure database for modelling or retention. It can be linked to the
office network for annual reviews or to a LIMS system for product
traceability.
Analysis level: Allows access to qualified data analysts to develop
process control models or gain further insights into process
mechanisms for continuous improvement strategies.
Overall, this may be considered the complete knowledge management system, and it meets all the requirements of ICH Q10. It is a system that allows continuous verification to be implemented as per the US FDA Process Validation Guidance and can be holistically qualified and validated as per the suggested guidance of GAMP®5. The system shown in Figure 13.4 is generic enough to be used as a blueprint that can be implemented in any manufacturing facility; the detail lies in choosing the right technology for the PAT tools, the frequency of measurement, the corrective action system and how to utilise the generated data for continuous improvement.
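To visualise how the PAT, control and execution levels interact in such a system, the following highly simplified sketch (in Python) shows one pass through a proportional feedback loop; the functions read_pat_sensor, predict_cqa and send_setpoint are hypothetical stand-ins for real instrument, model and MES/SCADA interfaces, not an actual API.

def control_loop_step(read_pat_sensor, predict_cqa, send_setpoint,
                      cqa_target, gain=0.1):
    spectrum = read_pat_sensor()             # PAT level: acquire a measurement
    cqa_estimate = predict_cqa(spectrum)     # control level: apply the MVA/DoE model
    error = cqa_target - cqa_estimate
    adjustment = gain * error                # simple proportional correction
    send_setpoint(adjustment)                # execution level: act on the process
    return cqa_estimate, adjustment

Any real implementation would of course add the data retention, audit trailing and validation requirements discussed above; the sketch only illustrates the direction of data flow between the levels.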

13.5 Chemometrics: the glue that holds QbD and PAT together

Big data, mega data and more data: that is what modern process and control systems generate. It is hopefully apparent that manufacturing systems generate multivariate, time-series data. Philosophies such as Six Sigma or lean manufacturing have attempted to present an over-simplistic means of controlling processes; however, they cannot provide the necessary detail or the insights that multivariate analysis can provide.
The currently in vogue big data solutions have provided senior
managers with a dashboard approach to viewing their data.
Unfortunately, this is really no different to using spreadsheet
applications for a simplistic data overview, just with pretty pictures.
Process scientists and engineers require a much more diverse toolkit
that combines the simplistic views of the big data solutions with
the more complex analyses required to understand not only main
influences, but also their interactions, as described in great detail in
chapter 11 on Design of Experiments (DoE).
To further elaborate, the data analysis approach for QbD is shown
in Figure 13.5.
Starting from the base of the triangle and moving up through the hierarchy: as outlined in chapter 3, representative sampling is key to any-and-all data analyses and modelling, and this competence must lay the foundation of any QbD/PAT initiative. The
next layer up consists of univariate data collection (this includes
collection of spectra as well, even though these are multivariate in
nature) and this level is used either as input to a DoE strategy or to an MVA approach, which may be considered the pinnacle of the data
hierarchy.

Figure 13.5: Hierarchy of data analysis for QbD.

The multivariate approach to data analysis is all-encompassing
and allows a “helicopter” view of the overall data landscape. When
used effectively, MVA can help reveal parameters useful for DoE
studies when applied to non-designed data, but most importantly, the
MVA approach both isolates single-variable inputs, when these are operative as such, and reveals their interactions with (many) other variables. When the parameters of interest have been isolated, a
focussed univariate assessment can be made (refer to the top-down
hierarchy shown in Figure 13.5). This “top-down” approach to data
analysis, based on a solid sampling foundation, is the only way to
fully understand data structures and therefore to better understand
the processes the data was generated from.
PAT analysers typically generate multivariate data that is modelled
using chemometric approaches. Predictive models of CQAs may be
developed and used to provide understanding of the trending of a
process over time. Process data typically presents itself in three
major forms,
1) Steady-state processes: Where a constant state of quality parameters is maintained over the entire manufacturing period (but
there can be substantial deviations along the time line, none of
which will change the general constant level). This data includes
processes such as tabletting, milling etc.
2) Evolving processes: Where the process dynamically changes over
time and is typical of biological fermentations, drying operations
and coating processes etc.
3) More irregular processes: Processes characterised by variable
loads, inputs and processing conditions (as a complex function of
raw material compositions... and much more)
The chemometric models used to analyse these types of data are
very different in their approaches and require profound subject-
matter knowledge of the system being modelled. In recent times, the
term “process signature” has gained popularity in the pharmaceutical
and related industries for better understanding of how a process
progresses over its course. Using methods such as PCA (refer to
chapters 4 and 6), multivariate data is reduced to single points in
space, defined by their scores. When PC scores are plotted over time
in an evolving process, the aim of the PCA is to determine if there is a
consistent pattern from batch to batch. For a steady-state process, it is expected that there will be no systematic patterns in the data, as any such pattern would indicate the presence of systematic influences changing the steady-state nature of the process.
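As a minimal illustration of this idea (not the authors' implementation), the sketch below fits a PCA model to a simulated table of evolving process variables with scikit-learn and plots the first PC score against batch time; a consistent score trajectory from batch to batch is the kind of "process signature" referred to above. The simulated variables and all names are hypothetical.

```python
# Minimal sketch: a "process signature" viewed as a PC score trajectory over time.
# Assumes numpy, matplotlib and scikit-learn are available; the data are simulated.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
time = np.linspace(0.0, 1.0, 200)                  # relative batch time
X = np.column_stack([
    np.exp(-3.0 * time),                           # e.g. a moisture-like decay
    60.0 + 20.0 * (1.0 - np.exp(-3.0 * time)),     # e.g. a temperature rise
    1.0 + 0.5 * time,                              # e.g. a slow pressure drift
]) + rng.normal(scale=0.02, size=(200, 3))         # measurement noise

scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

plt.plot(time, scores[:, 0])                       # PC1 trajectory = the signature
plt.xlabel("relative batch time")
plt.ylabel("PC1 score")
plt.show()
```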
13.5.1 A new approach to batch process understanding:
relative time modelling

When PC scores are plotted against each other for time-series-based data, the time dimension is removed from the data analysis, although
now embedded in the “connecting line” progression linking one
object (process state) to the next. In a recent pioneering publication,
Westad et al. [27] developed a method known as relative time
modelling (RTM) for the establishment of process signatures in
evolving processes. Concerned by the mathematical distortions imposed on batch data by other available algorithms, the RTM method exploits the time-independent nature of plotting scores against each other, thus allowing the definition of a relative batch starting point and a corresponding relative end point for any consistently performing process. An approach with the exact same objective is that of Jørgensen et al. [28], who also ventured to morph unequal batch process times onto a common basis through a multi-stage PLS approach. In some ways, this approach is a precursor of RTM.
Batch processes are widely used in many industries, usually in the
form of chemical reactors, biological fermentations and many others.
In these situations, the quality of final products is a function of the
initial raw material inputs and how the process is adapted to
accommodate this variability. In the past (i.e. before QbD), processes were not allowed to be adjusted and the final product quality was much more a matter of luck than of good process management.
There have been numerous attempts in the past to model and
monitor evolving batch processes and these typically start using
three-dimensional data structures such as those shown in Figure
13.6.
In the top-left of Figure 13.6, the three-dimensional data structure
is represented by the data cube (Variables × Time × Batch) which can
be analysed in various ways. The first way is to retain the three-dimensional structure of the data and use methods such as parallel
factor analysis (PARAFAC [29]) or the so-called Tucker 3 models [30].
This modelling strategy decomposes the data into three main loading
directions and assesses the three-way interactions of each direction
in the data set. The discussion of multiway methods is outside of the
scope of the current text; suffice to say that they are relatively
complex and work best when the length of each batch dimension in
the matrix is equal (a situation rarely attained in practice without the
use of mathematical manipulation).
Figure 13.6 also shows that three-dimensional data can be
unfolded (more correctly matricised) in two distinct ways leading to
two other methods currently available for the analysis of such data.
These methods are,
Unfolding the data along the time direction leads to the batch
modelling approach first presented by MacGregor [31] which
involves the use of dynamic time warping [32] to establish equal
batch lengths.
Figure 13.6: Typical batch data structure.

Unfolding along the variable direction was proposed by Wold et al. [33] and eliminates the restriction of equal batch lengths by
creating a two-dimensional matrix of super variables. These are regressed against a so-called maturity index, and the resulting model is used to determine the endpoint of the process.
There are fundamental physical and chemical limitations on both
of these approaches if the analyst is not wary. In the case of the time-wise unfolding, the aim of the warping is to create a situation where
each batch starts at a fixed time zero. Taking for example the process
of fluid bed drying (FBD), the initial moisture state of the powder mass
is hopefully consistent based on the process operators following
good standard operating procedures (SOPs), however, experience
has shown that there can be up to ±5% moisture variation between
granulations. Now consider the development of a batch model using
DTW for a data set containing the extremes, i.e. target ± 5%. Even if
there are a number of training batches in the data set at target, the
best the DTW can do is warp the –5% moisture batch back to target
and compress the +5% moisture batch to target, but is this
procedure chemically viable? Absolutely not. The chemistry of the
system cannot be mathematically manipulated to be something it is
not. This situation is even more pronounced when biological systems are monitored, where the initial chemical/biological state of the materials cannot be controlled in the way other processes can. Yet forcing all batches onto a common time base is exactly what DTW aims to achieve, and this is only acceptable when it can be assured that the initial state of the material in the process does not deviate greatly from the golden target.
The maturity index approach also suffers a major flaw from the chemical/biological point of view. The maturity index is a list of ordered integers used to map batch state but, by definition, the model is then trying to regress a potentially non-linear system using a linear index as the response variable. While this approach has merits in some situations, its scope is limited to processes whose progress can be linearly modelled; otherwise multiple phase models must be developed. The maturity index approach also requires the starting point of the materials to show only small variability
compared to the target. While this sounds like it should be the case in
a pharmaceutical/biopharmaceutical environment, experience
dictates that monitoring biological processes is akin to analysing
“soup”.
The method of RTM is not influenced by unequal batch lengths,
does not require an absolute time zero and can handle non-identical
residence time (i.e. does not require equal time point spacing like the
alternative methods). This is because the time dimension is
completely removed from the procedure and a new relative time scale
is back-projected to the original process time scale through the use
of PCA. Subsequently, based on the technology used to measure the
process, the original time scale is replaced by a chemical/biological
timescale that best represents the current state of the materials in the
process.
The theory behind the RTM approach is simple to explain (a minimal code sketch of steps 3–5 is given after this list),
1) Representative data are collected on acceptable batches using one
(or many) sensors suitably aligned (refer to section 13.6.1) for data
analysis.
2) Data is categorised as a two-dimensional table in an unfolded
manner with batch defined as the unfolding variable.
3) Run PCA on each batch and overlay each PC score trajectory on
top of each other and look for consistency. Only if the process
signatures overlay to a high degree can a batch model be
developed. If the PCA score trajectories do not delineate a
common structure, this is an indication of two possible events,
a) The technology being used to monitor the batches is not capable
of defining a stable, useful batch signature for the process, or
b) The processing conditions are so highly variable that a re-engineering of the process may be required!
4) Using a grid search method, a common start and endpoint of the
process is defined. This defines the relative start and endpoint of
the model. The endpoint samples must be analysed by a reference
method to ensure that the model is predicting the final state of the
product.
5) Using the grid procedure, establish the mean trajectory
representative of the individual batch trajectories.
6) Define upper and lower statistical bounds on the process trajectory
that are used to indicate whether the process is progressing as
expected or is about to deviate from the established design space.
7) Validate the method using new batches that are normal and
wherever possible are capable of testing the edges of failure and
even a failure state of the process. Note that this is test set
validation at the helicopter level of test batches.
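The sketch below illustrates the core overlay-and-average idea behind steps 3–5 under stated assumptions: per-batch score matrices (e.g. t1/t2 trajectories from PCA) are already available, each trajectory is resampled onto a common relative scale, and the resampled trajectories are averaged. The published RTM algorithm [27] uses a grid search and spline interpolation and is more involved; all function and variable names here are hypothetical.

```python
# Minimal sketch of the RTM "overlay and average" idea (steps 3-5 above).
# Per-batch (k_i x 2) score trajectories are assumed available; names are hypothetical.
import numpy as np

def resample_trajectory(scores, n_points=100):
    """Resample a (k x 2) score trajectory onto a common relative scale,
    here based on normalised cumulative arc length (a stand-in for relative time)."""
    seg = np.linalg.norm(np.diff(scores, axis=0), axis=1)
    s = np.concatenate([[0.0], np.cumsum(seg)])
    s /= s[-1]                                     # 0..1 relative progress
    grid = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(grid, s, scores[:, j]) for j in range(2)])

def mean_trajectory(batch_scores, n_points=100):
    """Average the resampled trajectories of all acceptable training batches."""
    resampled = np.stack([resample_trajectory(b, n_points) for b in batch_scores])
    centre = resampled.mean(axis=0)                # common process signature
    spread = resampled.std(axis=0, ddof=1)         # basis for statistical limits
    return centre, spread
```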
Consider data taken from a chemical synthesis process, where
the input variables are temperature (two probes positioned to
measure different parts of the process) and the pressure in the
reactor vessel. It is assumed here that these are the three CPPs
capable of defining batch quality and although there are only three
input variables involved, it still represents a multivariate process
control situation. It also provides a good case of illustrating the
complexity of a simple system.
Figure 13.7 shows the individual temperature probe 1 readings for
the four batches used to develop the model.
The first conclusion drawn from Figure 13.7 is that the batches are
all different based on this one variable. A more careful inspection of the data, however, indicates a lateral shift of the batches with respect to each other rather than a physical/chemical difference. This is the problem with
analysing batch data in the time domain. Each batch could be
warped such that they all overlay in the natural time axis, but this is
an unnecessary manipulation that renders the data meaningless in
the chemical sense, just to fit the form of a preconceived model.
Figure 13.8 shows the result of analysing all three measured variables using PCA. Due to a high degree of redundancy between the sensors, the underlying dimensionality of this system is two. The t1 vs t2 scores plot shows
that all batch data overlay to a high degree when time is taken out of
the picture.
Using a grid system, the batch is broken down into component
grids where a spline interpolation algorithm is used to define the
common batch trajectory (process signature). The final batch
trajectory and its design space are then calculated based on the interpolation algorithm. This is shown in Figure 13.9.
Figure 13.7: Variables measured in real time show offsets when compared on a batch-to-
batch basis.

The model shown in Figure 13.9 is representative of the batch in terms of its dynamics as it evolves. Since the limits are based on the
standard deviations of the batches around the mean trajectory,
significance levels can be used to assess the batch. A new batch is
projected onto the PC space defined by the batch model using the usual score representation of equation 13.1, tnew = xnewP, where the loading matrix P represents the common process signature.

From the newly projected score, the distance to the mean trajectory can be calculated from equation 13.2,

DTrajectory = ||tnew − tnew⊥trajectory||

where DTrajectory is the orthogonal distance from the new score to its projected position on the trajectory, tnew is the new score calculated from the batch model and tnew⊥trajectory is the projected position on the trajectory.

Figure 13.8: t1 vs t2 scores plot for the chemical reaction data.

During the monitoring phase of the process, the following steps are implemented into a control system such as those described in section 13.4 (a minimal code sketch of these steps is given after the list).
Figure 13.9: Process trajectory and limits for the chemical reaction process.

1) Preprocess, centre and scale the new data before the batch model
can be applied.
2) Estimate the new scores tnew,a = xnewpa for the a components used
to develop and validate the batch model.
3) Project these scores onto the trajectory for estimation of the
relative time, distance to the trajectory and distance to the model.
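A minimal sketch of these three monitoring steps is given below, assuming the pieces of a validated batch model (preprocessing parameters, loadings P and the mean trajectory from the sketch above) are available; the projection onto the trajectory is simplified to the nearest trajectory point, and all names are hypothetical.

```python
# Minimal sketch of the monitoring steps: preprocess, score, project onto trajectory.
# x_mean, x_std, P (variables x A) and trajectory (n_points x A) are assumed to come
# from the validated batch model; names are hypothetical.
import numpy as np

def monitor_point(x_new, x_mean, x_std, P, trajectory):
    x = (np.asarray(x_new) - x_mean) / x_std          # 1) centre and scale new data
    t_new = x @ P                                     # 2) t_new,a = x_new p_a
    d = np.linalg.norm(trajectory - t_new, axis=1)    # distances to all trajectory points
    idx = int(np.argmin(d))                           # 3) projected (relative time) position
    d_trajectory = d[idx]                             # distance to the trajectory
    return t_new, idx, d_trajectory                   # compare d_trajectory with the limits
```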
Figure 13.10 shows the projection of a new batch onto the
developed batch model and the details of the projection are
discussed in the text that follows.
The main observations are made as follows,
1) The new batch started prior to the common starting point of the
batch model. This indicates that the conditions were immature
compared to the common situation, therefore the batch model
knows, through projection, that it has to wait until a point projects
into the design space of the batch model before monitoring and
control begins.
2) There were a few points which transgressed the limits of the design space (these deviations were deliberately induced); however, through the use of APC, the batch can be corrected before any major quality issues occur.
3) Note that the spacing between the points is not even. This is the major advantage of RTM over other batch modelling approaches: if the reaction slows down, stalls or even reverses (as shown at score coordinates (–1, 0)), as long as the batch remains within the derived common limits there is no reason to suggest that the batch is deviating.

Figure 13.10: Projection of a new batch onto the chemical reaction process batch
model.

4) Precise estimation of the true endpoint is possible without the risk of over-processing the batch.
When used in conjunction with an APC system, process scripts
can be written to automatically correct a batch in a proactive manner
before the batch limits are broken. Extension of the method to spectroscopic and other data is possible as the approach is intuitive but, most importantly, it is scientifically, not mathematically, grounded; it therefore fits well into the QbD philosophy of a scientific, risk-based approach to batch modelling.
Finally, with respect to one-dimensional score trajectories and projection back to the original variables: returning to Figure 13.7, which showed the misalignment of the temperature values for probe 1, the RTM method can also back-project into the original variable space. After developing the batch model, when the temperature values of probe 1 are projected into relative time space they all overlay, as shown in Figure 13.11.
This simple example shows the merits of RTM over alternative methods of batch analysis, as it allows back-projection into one-dimensional plots that are consistent with the data views currently accepted in statistical process control (SPC) applications. Figure 13.12 shows the F-residuals plot for RTM, which can also be used as a multivariate statistical process control (MSPC) plot when the number of PCs in a batch model becomes large.

Figure 13.11: Projection of temperature probe-1 values into relative time space for
chemical reaction process.

In terms of QbD/PAT, this approach forms the cornerstone of evolving process understanding and control. Combined with data
fusion techniques and the PQS, monitoring and controlling evolving
processes in primary and secondary manufacturing situations will be
based on scientific, risk-based methods as encouraged by the latest
regulatory guidance documentation and will help companies become,
More efficient
Less energy consuming
More proactive towards quality
Less wasteful in terms of scrap and batch rejection
More able to detect root causes of problems and define a course of
action based on the outputs of the PQS.
For pharmaceutical and related industries: first to market in the
development of quality medications based on more sound
regulatory submissions.
On the last point, it is estimated that the time to bring a new drug
substance/entity to market from phase 0 is approximately 12 years at
a cost of over 1 billion USD. This is because the traditional methods
used and the data analysis methods employed are all old and based
on univariate statistics. Given that the populations used for the study
of new drugs suffer from participant dropouts and mortality rates, the
significance levels used to assess the effectiveness and safety of the
drug are all very outdated. Chemometric and DoE methods offer the
much-needed empirical approach to data analysis where group
models can be developed and used to project onto new populations
for better understanding of internal correlations. This is applicable not
only to patient studies, but as shown in this book, to many other
process and product development tasks. In formulation studies, DoE
can be used to assess the simultaneous effect of multiple excipients
on the stability of a drug substance. This leads to designing
processes that maintain the integrity and composition of the
formulation. This translates to early event detection systems that can
correct a process/product deviation before it becomes an issue,
which leads to the development of robust quality assurance and
quality control programs.

Figure 13.12: RTM F-residuals plot for chemical reaction process.

13.5.2 Hierarchical modelling

In a recent article by Swarbrick [34], the question was asked, "what is the future of chemometrics?" The answer was quite simply described
by what is known as “model utilisation”. The methods of PCA, PLSR
and other basic methods discussed in this book have been the
mainstay and look to be the mainstay of chemometrics methods well
into the future. The challenge now for the chemometrician,
particularly when addressing QbD implementations, is “how can
models be better utilised in a hierarchical manner”. To illustrate this
point, consider the early days of raw material identification using NIR: when the libraries became too large, classification ambiguities became more likely, but how are such ambiguities dealt with?
A hierarchical model is one that allows logical statements to be
included in the decision process, thus leading to a unique decision
being made, without the risk of circular (infinite) looping. There are
three main types of hierarchical model that can be considered and
these are defined as follows,
1) The Classification–Classification Hierarchy: Starting from a global
classification method (and this is typically a SIMCA model, see
chapter 10), a developer will address any ambiguities in the global
model by assigning them as a known ambiguity. Any samples
uniquely identified are then sent to a reporting level and the
process is stopped. However, in the case that a known ambiguity is
encountered, a second level model can be implemented that is
directed by the result from the first level. The next level model can
be a SIMCA model or in the case of a two-class classification
resolution, the use of SVM or LDA models may be acceptable.
2) The Classification–Prediction Hierarchy: This type of model starts
with a global classification library (and any ambiguity resolution
levels) and based on a unique classification, a quantitative model in
the form of PLSR, PCR, MLR or SVMR can be applied. The final
result is an identification of the material and an associated
quantitative prediction of material properties.
3) The Prediction–Prediction Hierarchy: This approach is useful when
a regression model does not yield linear predictions over the entire
calibration range. The regression line can be broken down into
smaller regions where linear models can be developed. Based on
the region where an initial, approximate prediction is made, this
guides the hierarchy to use the quantitative model that is most
linear for a more precise final prediction.
These models and their applications are discussed in the following
sections.

13.5.3 Classification–classification hierarchies

Using a simple example of a SIMCA library for raw material identification consisting of libraries for five materials, assume that
unique classification is possible for three materials (materials 1, 2 and
3) and that two materials (materials 4 and 5) cannot be separated by
the global classification model. This situation is described visually in
Figure 13.13.
In Figure 13.13, Materials 1–3 have no regions of overlap in
multivariate space, therefore, they can be uniquely classified by the
global model. Materials 4–5 do overlap in multivariate space and
cannot be uniquely separated. The confusion matrix for this situation
is shown in Table 13.1 for the five materials in this example.
The last two rows of Table 13.1 show the output of a SIMCA
model in the presence of an ambiguity. In a hierarchical model, this
would be defined as an “allowed ambiguity” or a “known ambiguity”.
These would trigger the model to go to the next level of hierarchy
where a more refined, or different class of model would be
implemented to resolve the ambiguity.

Figure 13.13: Classification ambiguity in a SIMCA model.


As a logic script, this would be equivalent to the following,
If Material Class = 1,0,0,0,0 then “Assign Sample to Material 1”
Similar situations for Materials 2 and 3
If Material Class = 0,0,0,1,1 then “Use Classification Model for
Separating Materials 4 and 5” else “Report as Unknown
Ambiguity”
If Material Class = 0,0,0,0,0 then “Report No Classification”
The above logic ensures that the possibility of infinite looping will
never occur.
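A hedged Python rendering of the logic script above is sketched below. The 0/1 class-membership vector is assumed to come from the global (e.g. SIMCA) classification model, and resolve_4_vs_5 stands in for the second-level model used to resolve the known ambiguity; both names are hypothetical.

```python
# Sketch of the hierarchy logic above: unique hits are reported directly, the
# known Material 4/5 ambiguity is passed to a second-level model, everything
# else is reported without looping back to the global model.
def classify_sample(membership, resolve_4_vs_5):
    membership = tuple(membership)                 # e.g. (0, 0, 0, 1, 1)
    if sum(membership) == 1:                       # unique classification
        return f"Material {membership.index(1) + 1}"
    if membership == (0, 0, 0, 1, 1):              # known (allowed) ambiguity
        return resolve_4_vs_5()                    # second-level model decides
    if membership == (0, 0, 0, 0, 0):
        return "No classification"
    return "Unknown ambiguity"                     # reported, never re-entered
```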
To provide the overall workflow of the classification–classification
hierarchy described by the logic above, the flowchart of Figure 13.14
shows how the classification scheme works.

13.5.4 Classification–prediction hierarchies

This situation would represent the most commonly used hierarchy in a modern control system. The use of real-time spectroscopy provides
industry with an opportunity to continuously verify the identity of what
is being manufactured, thus providing greater assurance and
traceability. It also ensures that the correct prediction model is always
being applied to the data thereby reducing any risk of error due to
wrong model application, thus eliminating human intervention. This
situation will be presented as a case study taken from a
petrochemical blending operation to show how the hierarchy is
developed and how it works.

Table 13.1: Confusion matrix for five-material classification library.

             Material 1   Material 2   Material 3   Material 4   Material 5
Material 1        1            0            0            0            0
Material 2        0            1            0            0            0
Material 3        0            0            1            0            0
Material 4        0            0            0            1            1
Material 5        0            0            0            1            1

Figure 13.14: Generic flowchart of the hierarchy defined for the classification–
classification case for the five raw material library.

Petrochemical refineries produce a large quantity of finished gasoline product every day, no matter how big or small the operation
is. Since the volume per minute of product going to tanks is large,
systems must adapt quickly to correct gasoline composition to
ensure the final product meets the specifications it is designed to
meet. A classification–prediction hierarchy can be used that first
classifies the grade of the gasoline before its properties are
predicted.
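A minimal sketch of this dispatch idea is given below: the grade returned by the classification model selects which quantitative (e.g. PLSR) model is applied to the same spectrum. The classifier, the per-grade models and their names are assumptions for illustration only.

```python
# Sketch of a classification-prediction hierarchy: classify the grade first,
# then apply the corresponding quantitative property model to the spectrum.
def classify_then_predict(spectrum, classifier, grade_models):
    grade = classifier(spectrum)                  # e.g. "premium", "standard", "ethanol"
    if grade not in grade_models:
        return grade, None                        # unknown grade: report, do not predict
    properties = grade_models[grade](spectrum)    # e.g. octane number, olefin content
    return grade, properties
```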
In this example, three gasoline products, a premium grade, a
standard grade and a grade with ethanol are being manufactured at a
particular refinery. NIR spectroscopy is an excellent tool for
monitoring properties such as octane number, olefin content and
other characteristics of the gasoline, typically employing on-line
sampling and measurement systems. After the product is graded, this
information is recorded to the sample and the corresponding
prediction model is applied. This information is then used by the
master batch system to make any changes during blending in order
to maintain a uniform product. This has two major impacts on the
company,
1) It ensures that the company is producing the correct product that
is being used by consumers.
2) Any increase in octane number above the grade of gasoline being blended can result in profit losses of 6–10 US cents per barrel. While this does not seem much, considering that a large refinery can produce 1 million barrels per day, the lost profit in a single day can be of the order of USD 100,000; now it may be considered a significant loss.

Figure 13.15: Flowchart of the classification–prediction hierarchy for the gasoline blending example.

The flowchart of the classification–prediction hierarchy is shown in Figure 13.15.

13.5.5 Prediction–prediction hierarchies


In some calibration developments, a large span of constituent values
may require modelling. For example, in the case where a constituent
is to be measured between 0% and 100%, the overall model may
show curvature over the entire span. The modelling strategies
presented in this textbook are based on linear models being fit to the
data (refer to chapter 7 on multivariate regression). In some cases,
the use of preprocessing (chapter 5) can be used to eliminate
curvature, but this is not always the case.
To resolve this issue, it may be possible to model “linear” regions
in the overall curved response. In order to apply one such linear
model to a new sample, the original curved response has to be
separated into particular zones where the modelling process can be
guided such that prediction is made with the correct linear model.
This is called guided locally weighted regression (GLWR) [34]. Figure
13.16 shows how a curved response can be separated into three
zones.
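The sketch below illustrates the prediction–prediction idea under stated assumptions: an approximate global prediction is used only to locate the relevant zone, and the zone-specific linear model then produces the final prediction. The zone boundaries, the models and their names are hypothetical, and the published GLWR method [34] is more elaborate than this.

```python
# Sketch of the prediction-prediction (guided) idea: a global, approximate
# prediction selects one of several locally linear models for the final result.
def guided_prediction(x, global_model, zone_models, zone_edges):
    y_approx = global_model(x)                        # first, approximate prediction
    for (lo, hi), local_model in zip(zone_edges, zone_models):
        if lo <= y_approx < hi:                       # locate the "linear" zone
            return local_model(x)                     # final, more precise prediction
    return y_approx                                   # outside all zones: fall back
```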
The workflow associated with the prediction–prediction hierarchy
is provided in the flowchart in Figure 13.17.
There are other possible combinations of the hierarchical models discussed above; however, the overall simplicity or complexity is left to the imagination of the model developer and the complexity of the situation to be solved. It must be remembered, though, that it takes time and effort to build such a model and, after it is validated, any changes require revalidation of the entire hierarchy. In order to make this as painless as possible, it is always good to have in place,
1) A positive challenge data set to show that any modifications to the
model have not affected its ability to characterise a known set used
for the original validation.
2) A negative challenge set to show that the model rejects cases that
it has not been trained to model.
Figure 13.16: Guided locally weighted regression as defined by the prediction–
prediction hierarchy.

Figure 13.17: Flowchart of the prediction–prediction hierarchy.

3) A specificity set that is used solely to show that the modification to the model is specific to the changes implemented.
If this process is followed, a robust and reliable model will result
that should produce many years of process knowledge and
understanding.

13.5.6 Continuous pharmaceutical manufacturing: the embodiment of QbD and PAT

Recently, the US FDA approved the first commercial continuous solid dose manufacturing operation in Boston, MA for a dual active drug
product used for the treatment of cystic fibrosis (Brennan [35]). This approval marked a pioneering step for the industry, but why was it successful? Success was determined by the commitment of the entire organisation, from the CEO right through to the janitor, to think in a QbD way. Continuous manufacturing (CM) represents a quantum leap
forward for the industry in so many ways,
All unit operations (considered in the past to be sub-batch
operations) are now linked together to form a steady state process.
Quality assurance is performed at the point of manufacture,
allowing for rejection of “micro-batches” of product that are found
to deviate.
CM represents up to a 500-times faster rate of product manufacture and release; in particular it allows real-time release (RtR) to be implemented.
CM can take place in a single room and thus has a 90% smaller footprint than a traditional batch manufacturing facility, while also using 60% less energy.
By far the most important aspect of CM is that it is a QbD
process, by definition. This is because a CM process can be controlled to adapt to raw material and intermediate material changes, it produces only the highest quality end product and it can be run continuously over multiple "batch numbers". The concept of a batch is different in CM, in that a batch is now considered to be an exact number of final packaged goods so that, for instance, when 100,000 packages of Batch A are finalised, Batch B automatically starts up. It is this aspect of CM that ensures each batch only
contains product meeting the QTTP and CQAs related to the desired
state.
Although CM operations can be run without PAT implemented, this is not a good idea, as the opportunity to implement early event detection systems based on spectroscopy is then voluntarily abandoned. There is very little reason for such self-curtailing attitudes, especially when there is so much more to be gained. Figure 13.18 provides a generic CM implementation
with PAT implemented and a full description of the system is
provided based on the technologies used.
The steps in the CM process in Figure 13.18 are defined as
follows,
1) Raw materials are assessed by NIR for identity and conformity. The
conformity data are used to determine the feed rates and other
properties of the material flows for optimal manufacturing.

Figure 13.18: Generic CM implementation with PAT.

2) Since NIR is useful for understanding blend uniformity of powder materials, placing an analyser at the output of the continuous pre-
blender provides an assessment of what is being input into the
continuous granulation step.
3) Continuous granulation is performed using a twin-screw granulator
where binder solution is added to the pre-blend and the granulated
material is fed into a continuous fluid bed drier (FBD). The
implementation of PAT into the continuous granulator has proven to be a challenge; however, the development of redundant micro near infrared systems, where one can be taken offline for cleaning while a second system takes over, may provide a solution to the current challenges.
4) Continuous drying is achieved through a segmented drying system.
As a micro-batch of material is added to one of the chambers of
the drier, the whole drier rotates so as to collect a new micro-batch
such that the process is maintained in a steady state. This is the rate-limiting step of the process and the timing of the entire process is typically based on this step. NIR has been used in the
past for the monitoring of the FBD unit operations, however, since
the drier rotates, implementation becomes difficult and could
require the use of multiple, micro instruments. Risk-based
approaches have been taken to eliminate the need for multiple
sensors at the FBD stage, as described in the next step.
5) After the granules have been dried, they are vacuum transferred to
an in-line continuous mill where NIR and FBRM are used in
conjunction to monitor moisture content and particle size
distribution. In step 2, the pre-blend uniformity was verified and the
material allowed to go through the granulation and drying stages.
Post analysis at the mill assesses whether the granulation and
drying steps were performed as expected and any non-conforming
material can be rejected to waste at this point. The assessment of
moisture content and particle size distribution ensures that there
are no processing issues at the blending and compression stages
and also minimises the risk of microbiological growth due to high
moisture content.
6) The next step in the process is continuous blending of the granules
with other excipients and the lubricant before compression. Again,
NIR is used to assess blend uniformity before compression and this
provides a pseudo measurement of tablet uniformity.
7) During compression, a 100% inspection system cannot be implemented, so to provide an added assurance of tablet quality, samples are automatically extracted from the line at 2-min intervals and tested for hardness, thickness and diameter using mechanical testing, while active content is assessed using a large-spot-size Raman (or NIR) spectrometer. This testing is performed as a verification that the
tablets are acceptable before the coating process is performed.
8) Coating thickness is assessed by Raman spectroscopy to ensure
uniformity and in some cases, the coating may be enteric in nature,
so the thickness is important for drug performance, including
dissolution (QTTP). The coating system is semi-batch with two
coaters operating simultaneously in an offset manner. As tablets
come off the tablet press, they are loaded into hoppers (where a
relaxation time is imposed) that contain a check weighing system.
When the mass in the hopper reaches its target mass, the tablets
are loaded into the first coater while a second hopper is loaded, the
process is continued until all tablets are coated and moved onto
packaging.
As described, only through the use of PAT can CM be truly run in
the QbD sense. This now requires the implementation of a PQS that
can handle large volumes of data in an extremely fast manner and
can provide feed forward/feedback data in a holistic fashion to the
entire process. This is outside of the scope of the current book;
suffice to say that the PQS is similar to the generic situation shown in
Figure 13.4. This system manages the operation of the processing
equipment, applies the right chemometric model to the right analyser
at the right time, uses the predicted data for making quality decisions
and keeps the entire process in a steady state.

13.6 An introduction to multivariate statistical process control (MSPC)
In recent times, the pharmaceutical and many other industries have started to adopt the principles of Six Sigma [36] into business-critical operations in an attempt to better control variability in manufacturing.
Six Sigma in itself is an ideology that encompasses a number of
disciplines in order to better understand the common and special
causes of variation that products and processes may experience.
In chapter 2, the basic principles of statistical process control
(SPC) were introduced. These methods form a major part of the Six
Sigma toolkit; however, they are of limited use in complex processes
as they have a focus on one-variable-at-a-time (OVAT). As has been
discussed in this book to date, only the multivariate approach to data
analysis can provide a complete picture of what is happening in
complex systems. Six Sigma was found to be lacking in this regard and, as an offshoot of the initiative, Design for Six Sigma (DFSS), Yang et al. [37], was born. DFSS has a much greater
emphasis on “Greenfield” developments that aim not to tweak
existing processes (as per the Six Sigma principles) but to re-design
them so that they are fit for purpose (this is just a simpler approach to
the overall QbD ideology).
The positive outcome of the Six Sigma initiative is that it raised the importance of using correct and reliable measurement systems for monitoring processes and of understanding the variability arising from the "gauges" used. This is a highly critical practice to implement
if reliable designed experiments are to be analysed or process
models using multiple sensors are to be used for process control.
However, SPC is univariate in its approach and can only be used as a
confirmation tool after a multivariate analysis or projection has been
applied.
Multivariate statistical process control (MSPC) encompasses
methods that utilise multivariate tools to monitor and control complex
processes. As in nearly all methods used in multivariate analysis, PCA
is the workhorse for extracting information out of data and using that
information for process understanding and control. MSPC is also
aligned with the definition of design space in that this is typically a
multivariate space that is being monitored or controlled. Examples
already presented in this chapter have discussed typical multivariate
sensors such as spectrometers for measuring materials as they exist
in the process in real time, but multivariate data does not have to be
generated by a single analyser or sensor. In fact, multivariate data
can be generated based on a number of univariate sensors
simultaneously monitoring the same sample (or sample area in the
temporal sense) to provide even more information on the state of the
process. Before a detailed discussion of MSPC is provided, the next
section looks at the important topic of data fusion.

13.6.1 Aspects of data fusion

When dealing with multiple "individual" sensors and even multivariate sensors, representative measurements can only be generated when
the data from each sensor is collected and aligned in a short period
of time. Individual sensors generate data at their own natural
frequencies and if these frequencies are different to each other, this
makes data alignment and fusion a highly challenging task,
particularly when there is a large time gap between one sensor’s
outputs and the majority of the other sensor outputs.
As the old saying goes, “a chain is as strong as its weakest link”
and drawing from this analogy, data fusion is only as good as the
slowest measurement system to be aligned. For example, if a
temperature probe outputs data every second and a Raman
spectrum is collected every 30 min, aligning this type of data to the
frequency of the temperature probe, or to even every minute of
measurement is still a non-representative situation as a spectrum was
only collected every 30 min. In this case, it is best to set the data
polling rate at 30 min to match the spectrometer, otherwise, increase
the frequency of spectroscopic measurements (if at all possible).
Continuing on from this point, the collection of spectral data at a
higher frequency may not always be possible. For example, if the
spectrometer is part of an on-line measurement system and the
sample requires 25 min of conditioning before a scan is made, then
this is the limiting factor that will inhibit higher frequency of
simultaneous measurements throughout. Another factor (however,
not as relevant now as in the past) was the physical limitation of
memory and data storage that can restrict the frequency of
measurement.
The polling rate is the critical measure for successful data
alignment and fusion. This is the critical time, as determined by
subject matter expertise, that will detect the onset of a deviation and
will allow sufficient time to correct the issue before it becomes a
problem, i.e. implementing a polling time that allows proactive, not
reactive approaches to quality control and assurance. Once the
polling rate is defined, this can be the benchmark for setting the
slowest time for the most complex, or challenging system in the
sensor chain. The following scenarios best describe the data fusion
process for obtaining representative data from multiple sources.

Univariate sensor fusion


In this scenario, multiple univariate sensor outputs are to be aligned
and fused to generate a representative multivariate table that can be
modelled, or used as input into a multivariate model. This is shown
diagrammatically in Figure 13.19.
In Figure 13.19 there are four sensors; however, there can be from 1 to N (where N ≥ 2), each outputting data at its own natural frequency. The data can be filtered to remove the effects of shock
noise generated by most industrial processes before being sent to an
accumulation layer. The accumulation layer stores the collected data
until it is polled by the polling layer. When polled, the system
averages whatever is in the accumulation layer and then clears the
data to start the process again. It is assumed (or even verified) that
the time between successive pollings is not likely to result in data
trending within the accumulation layer such that when the data are
averaged, the mean is representative of the central location of the
data and the data are usually distributed in a symmetrical manner
around the calculated mean (but, of course, any reader of this book
would check this assumption carefully).
The order of the fused data in the final array is of paramount
importance (unless methods such as OPC tagging are employed) as
this is the order the data is fed into the multivariate model. If the data
are misaligned in the final array, the predicted results or other output
statistics generated by application of the model will be in error.
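A minimal sketch of this accumulate-and-average scheme is given below, using pandas resampling to stand in for the accumulation and polling layers described above. The sensor names, the 30 min polling rate and the assumption that each stream is a time-indexed pandas Series are illustrative only.

```python
# Sketch of univariate sensor fusion: each stream is accumulated and averaged
# at a common polling rate, then joined into one array with a fixed column order.
import pandas as pd

def fuse_sensors(sensor_streams, polling_rate="30min", column_order=None):
    """sensor_streams: dict of {sensor name: pd.Series with a DatetimeIndex}."""
    fused = pd.DataFrame({
        name: stream.resample(polling_rate).mean()   # average the accumulated readings
        for name, stream in sensor_streams.items()
    })
    if column_order is not None:                     # order fed to the multivariate model
        fused = fused[column_order]
    return fused.dropna()                            # drop polls where a sensor is missing
```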

Univariate and multivariate sensor fusion


This scenario is just an extension of the previous one where
univariate sensor output is fused with either a single or multiple
multivariate sensors. The situation is shown diagrammatically in
Figure 13.20 and the same explanations as for scenario 1 hold for
scenario 2.
Figure 13.19: Data fusion for the case of multiple univariate sensors.

Advanced univariate and multivariate sensor fusion


Following on from scenario 2, there are a number of variations that can be applied, giving this scenario the greatest flexibility, adaptability and power in any process monitoring situation. As in
scenarios 1 and 2, the data and their order are set in an array.
However, an analyst may want to extract the relevant multivariate
information out of a group of sensors (i.e. their covariance structure),
or they may only want a table with the relevant predicted values or
scores from the application of a multivariate model to the data prior
to building the fused array.
This scenario is the most challenging of the data fusion strategies
and requires the use of a powerful data agglomeration system that
can handle all situations in a fast and accurate manner. The
agglomeration system must also be powerful enough to back-project
any information such that the actual root cause variables found in a
run time system can be detected and controlled effectively.
Figure 13.20: Data fusion for the case of multiple univariate and multivariate sensors.

Figure 13.21: Data fusion for the case of a complex process system.

Figure 13.21 shows a generic implementation of scenario 3; however, the form and complexity of this approach is only limited by
the process scientist/engineer’s imagination.

Univariate and multivariate sensor fusion for feed forward-feedback systems
This scenario has most applicability to the continuous manufacturing systems (CMS) discussed previously in section 13.5.6. In this scenario,
a number of unit operations are joined into a continuous chain and
the objective of the data alignment is to provide a “process
chronology” of discrete units as they pass through the entire process.
This results in a single fused array that is completely time aligned
across all unit operations and can be used for both local and holistic
PAT monitoring. In the case of local unit operation monitoring, the
outputs of one operation can be fed forward as inputs into a
subsequent unit operation, or the outputs of a downstream unit
operation can be used in a feedback manner to control an upstream
unit operation.
Scenario 4 combines all aspects of scenarios 1–3 and puts them
into a dynamic data acquisition process. This is shown
diagrammatically in Figure 13.22 for a generic situation.

13.6.2 Multivariate statistical process control (MSPC) principles

The use of MSPC in industry is increasing, mainly due to initiatives such as QbD and the definition of the PQS in the pharmaceutical
sector. Rather than monitoring the trends in single variables (as is
done in SPC), MSPC takes the outputs generated by the projection of
new data onto an existing model that has been validated such that
multivariate confidence intervals contain samples indicative of a
process in overall control.
As discussed in section 13.5.1, where multivariate projection was presented for the batch modelling case, outputs such as scores and outlier statistics can be monitored using run charts (chapter 2), scores plots (chapter 4) or Hotelling's T2 (section 6.6.1). Outlier statistics such as leverage (section 6.6.2), influence plots (section 6.6.4) and statistically based residuals such as Q-residuals (section 7.11.2) and F-residuals (section 7.11.3) are used to assess the state of the process, where statements of confidence can be made about the projected values.
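A minimal numpy sketch of two of these MSPC statistics is given below for new observations projected onto an existing PCA model: Hotelling's T2 is computed from the scores and the Q-statistic (squared X-residuals) from what the model leaves unexplained. The model pieces (mean, scale, loadings P, calibration score variances) and their names are assumed to be available from the validated model; the control limits themselves would come from its validation.

```python
# Sketch of MSPC statistics for new data projected onto an existing PCA model:
# Hotelling's T2 from the scores, Q-statistic from the X-residuals.
import numpy as np

def mspc_statistics(X_new, x_mean, x_std, P, score_var):
    Xc = (X_new - x_mean) / x_std                 # same preprocessing as the model
    T = Xc @ P                                    # scores of the new observations
    t2 = np.sum(T**2 / score_var, axis=1)         # Hotelling's T2 per observation
    E = Xc - T @ P.T                              # X-residuals after the model
    q = np.sum(E**2, axis=1)                      # Q-statistic per observation
    return t2, q                                  # compare against the validated limits
```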
Figure 13.22: Data fusion for a continuous manufacturing system.

The influence plot is possibly the best graphical tool available to an analyst or process operator for assessing the state of a process. It can be used for both evolving processes and steady-state processes. It is always a two-dimensional plot, since the statistics displayed are a condensation over the number of components in the model (i.e. even if 10 PCs are used to generate a model, the influence plot remains two-dimensional), and this provides two extremely useful properties for process monitoring and control. There is a great need for education in this space and this chapter has attempted to lay out a learning path that, steep as it may be, can nevertheless be followed by everybody. In summary, for influence charts used for MSPC,
1) The definition of various process states can be assessed in a single plot and, when samples fall outside the defined limits, drilldown to the root cause is possible.
2) The plot can be broken down into quadrants and these can be an
excellent indicator of process state to operators.
An example influence plot is shown in Figure 13.23 for a fluid bed
drier (FBD) process showing how it can be used to better detect the
endpoint of the drying.
13.6.3 Total process measurement system quality control
(TPMSQC)

All of the approaches previously reviewed in this chapter are based on the critical requirement that the process measurement system
implemented, i.e. all elements as well as their structural fusion
(univariate, multivariate and combination analytical sensor systems) is
documented, tested and verified as reliable under all conditions of
use. There is a critical success factor termed total process
measurement system quality control (TPMSQC) in play here. As it
happens this does not involve any new methodological approach,
save process sampling in general, variographic data analysis.
Variographic analysis was introduced in chapter 3; a full introduction
can be found in Esbensen et al. [2] or Esbensen and Paasch-
Mortensen [13].
The latter analysed the contemporary scientific and technological
situation regarding on-line process monitoring and suggested a way
forward by calling for a shift regarding process sampling (both
physical process sampling as well as PAT). In fact, TOS constitutes a
new asset for including quality control of the total process
measurement system simultaneously while acquiring data for, e.g.,
PAT and MSPC. TOS’ command of the necessary sampling principles
allows the suppression of adverse sampling and sensor errors and
variographic analysis is able to facilitate dedicated improvement of
both manufacturing and processing monitoring in technology and
industry. TOS comprises all the necessary tools to isolate the
effective total sampling error (TSE), including the total analytical error
(TAE) from the variability of the industrial process itself, which is the
true objective of QbD/PAT. This variance source decoupling can be
used to establish fit-for-purpose in-process specifications. A strategy
for guaranteed representative sampling (TOS) and monitoring with
“built in” automated measurement system checks allows
development of comprehensive quality control of both processes and
products. The following points are highlighted for clarity,
Figure 13.23: Using MSPC charts to better detect the endpoint of a fluid bed dryer (FBD)
unit operation.

MSPC is incomplete without TOS, because of sampling errors. PAT sensor errors are no different in this regard to physical sampling
methods and are of identical nature and potential magnitudes; both
are usually much larger than TAE.
MSPC data analysis and modelling is therefore fraught with
unnecessarily inflated errors.
TOS allows decomposition of the error components reflecting the
measurement system itself.
TOS/PAT/MSPC allows on-line measurement system QC/QA
based on variographics.
No standard process monitoring system today incorporates on-line variographic modelling.
Figure 13.24 illustrates how this approach works, using variograms outlining progressively less adequate total process measurement system characteristics in the form of the "sum total of all sampling + analysis errors", termed the nugget effect (n.e.) or RSV1-dim, expressed relative to the sill level.
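As a minimal illustration (simplified relative to the full TOS variographic formulation in chapter 3 and Esbensen et al. [2]), the sketch below computes an empirical variogram for a process measurement series: the value at lag j is half the mean squared difference between measurements j increments apart. The intercept extrapolated to lag zero approximates the nugget effect (total sampling + analysis error) while the plateau approximates the sill; the function name and usage are hypothetical.

```python
# Sketch of an empirical process variogram: v(j) = 0.5 * mean((x[i+j] - x[i])^2).
# A low lag-zero intercept (nugget effect) relative to the sill indicates a
# measurement system able to resolve the true process variation.
import numpy as np

def empirical_variogram(series, max_lag):
    x = np.asarray(series, dtype=float)
    lags = np.arange(1, max_lag + 1)
    v = np.array([np.mean((x[j:] - x[:-j]) ** 2) / 2.0 for j in lags])
    return lags, v

# Usage: lags, v = empirical_variogram(process_measurements, max_lag=30)
```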
A particularly relevant application of this TPMSQC approach on a
pharmaceutical mixing process was given by Esbensen et al. [2].

13.7 Model lifecycle management


A topic of recent interest at the International Federation of Process
Analytical Chemistry (IFPAC) conferences and one that is widely
applicable to practitioners of QbD and PAT implementations has
been that of model lifecycle management. The notion of making one model that will last forever is simply not a valid one, for a number of reasons,
1) Processes do change over time (mechanical wear and tear etc.).
2) Instrumentation drifts or needs maintenance.
3) Raw materials change.
A multivariate model is a snapshot of the current knowledge at the
time of its development. With careful planning and equipment
maintenance, the longevity of the model can be maintained for some
time, but it must be borne in mind that a multivariate model is a living
entity and requires updating on a periodic basis. The question is then,
when does a model need updating? This is where an effective
lifecycle strategy is required.

Figure 13.24: Variogram expression of a situation in which the total process measurement system gets progressively worse and is no longer able to depict the true, underlying process variation (sill). For QC of the measurement system, RSV1-dim is 20, 60 and 90% of the sill, respectively. The vertical bar represents the true process variation, progressively more and more swamped by measurement system failure. This system check is obtained by variographic analysis of the process data themselves, i.e. it can be carried out in an on-line, real-time mode.

Using an analogy from the realm of physics, during the times of Sir Isaac Newton, the laws of classical mechanics held well for the
world that could be measured at that time. As more knowledge and
insights were gained into the sub-atomic world, theories changed and
were updated to be more relevant to the quantum mechanics world.
The same applies to empirical models: as times and conditions change, the model's performance can change if it is not robust, and therein lies the key: to build a strategy for model development and maintenance that ensures robustness.

13.7.1 The iterative model building cycle


In their fundamental textbook on DoE, Box, Hunter and Hunter [38]
show the iterative model development process adapted in Figure
13.25.
As information and knowledge is gathered from a system, a series
of deductions and inductions are made that are constantly being
refined. The process starts with the conception of an idea that is
tested for feasibility and developed into a model. As described in
detail in chapter 4 on PCA and chapter 7 on multivariate calibration,
the characteristics of a reliable model are its ability to be interpreted
and validated, but only when the model is implemented into a real
system can the learning process begin.

Idea to feasibility
There are many opportunities to implement PAT into a process, but
are all valid? The answer is no. The PAT implemented must be fit for
purpose and this is where a small, but comprehensive feasibility trial
is required. It has been the authors’ experience over many years that
too many companies run endless R&D trials to gain confidence in the
technology. When these trials exceed two to three weeks, it can be
safely assumed that this will never reach a real process and is just a
“tyre kicking” exercise. As with any analytical method development,
the production of a small, but concise protocol to accept/reject the
technology is required that has the buy in of all key stakeholders.
Modifications to the protocol are allowed, based on unanticipated
findings, however, this feasibility trial must be performed with the end
intent of implementing the technology and not to keep it going forever
in a repetitive loop.

Figure 13.25: The iterative model building process.

Feasibility to model development


The development of a reliable model requires a well thought out plan
for success. Considerations for model construction include,
1) The number of samples available from as wide a manufacturing
timeframe as possible (this typically includes taking samples up to the oldest batch that is still within its expiry date).
2) The batches selected should span the greatest variation in lot
numbers of active and excipient materials.
3) To expand the span of concentrations in the set, a well-planned
experimental design must be employed to vary the excipient
content within the variability expected from typical manufacturing
situations, i.e. ±10% variability is a good starting point.
4) The concentration span of active should be those stated in typical
pharmacopoeia content uniformity limits, i.e. 75–125% label claim
for a quantitative method.
5) The verification that samples made in the laboratory are
representative of those made in production. This is where making a
100% label claim sample set in the laboratory is highly useful, as it
can be used to see if it lies in the same multivariate space as the
production samples when measured by the PAT.
6) Once the sample set has been defined, a typical starting number of
samples is 180 (in a pharmaceutical context and also depending on
the complexity of the matrix being analysed). In this case, 120
samples are selected for calibration and 60 samples set aside for
internal validation. This step can only be performed when the
samples are analysed using the PAT first and the same sample sent
for reference analysis. The internal validation set must span the
calibration span, but must never exceed it. The internal validation
set defines the working region of the calibration model.
7) The final steps are to optimise the preprocessing of the data to
enhance the detection of systematic changes in the product matrix
(refer to chapter 5) and then look at the specifics of calibration
model development as per guidelines such as ICH Q2(R1) [4], i.e.
range, linearity, precision, accuracy and robustness.
The above seven steps constitute the first steps of modelling after feasibility and, as with all model development, repeated and replicated measurements must be performed in order to show statistical equivalence of the PAT method to the reference method.

Model development to interpretation and validation


In the pharmaceutical and related industries, a model can only be
used if it is interpretable and validatable. Interpretability comes from
simplicity. This is a direct consequence of the method developer’s
ability to engineer the entire calibration and validation set. The
feasibility study has proven that the PAT will do its job, the internal
validation set should show that the model is capable of predicting
accurately over the span of the calibration set. Therefore, to confirm
specificity and selectivity, the loading weights (PLSR), scores and
regression coefficients must be related to the active material, or an
interpretable change in excipient composition (Swarbrick et al. [39]).
The number of components/factors used in the model must be
commensurate with the complexity of the system, i.e. if it takes 10
PLSR factors to model a single active material in a three-component
system, then there is something seriously wrong with the
development. Each loading weight has to be interpreted as
contributing to the model, otherwise, if a serious incident was to
occur from the use of the model and the model construction cannot
be justified in a court of law, then there will certainly be legal
ramifications.
The final validation of the model is a targeted sample set known
as the external validation set. This set consists of new batches of
product representing target values only. There is no way to express
these results on a predicted vs reference line as there should be no
spread of the data. In this case, the residuals of the predicted and the
reference data should be plotted on a normal probability plot and a
residuals vs reference plot to show that they are normally distributed
around zero, with a standard deviation less than or equal to the
RMSEP/SEP of the calibration model and also less than or equal to
1.4 × SEL (refer to chapter 7).
When the above criteria can be met, the model has been shown to
be fit for purpose for its intended application.
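A minimal numpy sketch of the external-validation residual checks described above is given below: the residuals (predicted minus reference) should scatter around zero with a standard deviation no larger than the calibration RMSEP/SEP and no larger than 1.4 × SEL. The function name and the simple zero-mean check are assumptions for illustration; in practice a normal probability plot and a residuals vs reference plot would also be inspected.

```python
# Sketch of the external validation acceptance checks on the residuals.
import numpy as np

def external_validation_ok(y_pred, y_ref, rmsep, sel):
    residuals = np.asarray(y_pred) - np.asarray(y_ref)
    sd = residuals.std(ddof=1)
    # crude check that the residuals are centred near zero (within ~2 standard errors)
    centred = abs(residuals.mean()) <= 2.0 * sd / np.sqrt(len(residuals))
    return centred and sd <= rmsep and sd <= 1.4 * sel
```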

Validation to implementation and learning


This is the key step of the model lifecycle process. Only when a
model can be implemented and used on a regular basis on real
manufacturing processes can learning begin. If a model developer
had a crystal ball, then all would be known before a model is made.
However, this is not the case in reality and the only crystal ball
substitute is the use of the model.
A conservative approach can be taken at first by running the model alongside the reference method for the first few batches until confidence is gained. The frequency of reference analysis can then be gradually reduced until only periodic checks are made. It is important to always keep the PAT equipment in a high state of maintenance and to run regular operational and performance qualifications (OQ/PQ) to ensure the measurement system has not drifted or its performance deteriorated.
The model has to be used in two ways moving forward,
1) As a means of predicting CQAs and assessing the quality of the
data using model diagnostics.
2) To detect new samples that can make the calibration model more
robust in the future.
The model can detect changes in the overall matrix that are not related to the active material of interest. These cases are typified by the predicted result being in specification, but the diagnostics related to X-residuals showing that there is a difference in the overall matrix. Such samples must be sent for further investigation using the reference and other confirmatory methods. In some cases, an entire batch of material may show that every sample measured is an X-outlier. This may be indicative of a real change in the product, which may still be acceptable, but will trigger the need for model updating.
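The diagnostic logic described above, a prediction in specification combined with X-residuals outside the calibration limit, can be illustrated with the following sketch. The loadings, Q-residual limit and specification limits are placeholders; in practice they come from the fitted calibration model and the product specification.

# Illustrative check of a new sample against the X-space of the calibration model.
import numpy as np

rng = np.random.default_rng(1)
P = np.linalg.qr(rng.normal(size=(50, 3)))[0]   # orthonormal "loadings": 50 variables, 3 factors
q_limit = 1.5                                   # assumed Q-residual (X-residual) limit
spec_low, spec_high = 95.0, 105.0               # assumed specification limits for the CQA

def q_residual(x_new, loadings):
    """Sum of squared X-residuals after projection onto the model subspace."""
    scores = x_new @ loadings
    return float(np.sum((x_new - scores @ loadings.T) ** 2))

x_new = rng.normal(size=50)                     # placeholder preprocessed spectrum
y_pred = 100.2                                  # placeholder predicted CQA value

q = q_residual(x_new, P)
if spec_low <= y_pred <= spec_high and q > q_limit:
    print(f"Prediction in spec ({y_pred}) but Q = {q:.1f} > limit: matrix change, send for reference analysis")
else:
    print(f"Q = {q:.1f}: sample consistent with the calibration matrix")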

13.7.2 A general procedure for model updating

With a lifecycle model established for a PAT method, constant monitoring of its performance becomes the first part of its continuous improvement strategy. Any signs of model drift, after establishing that the PAT instrument is performing within its specifications, are only attributable to a change in product as, after all, the model is static. If the product being measured has changed, within the design space of course, this is a first indication that the model may need updating to handle the new variability. At this stage, it is suggested to remove about 5% of the oldest batch samples from the model and replace them with the new samples. After the recalibration process, apply the new and old models to a new set of data (i.e. a test set), followed by reference chemistry. If it can be shown that,
1) there is no significant difference between the updated method and
the reference method (paired t-test) and,
2) there is a significant difference between the old method and both
the reference method and updated method (paired t-test).
Then there is justification that the updated method can replace the
old method, so long as a similar number of components/factors are
used in the new model and the loading weights (PLSR), or equivalent,
of the model are still specific and selective for the active material.
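A sketch of the two paired t-tests listed above is given below, using SciPy. The assay values are invented placeholders; in practice they are the reference results and the predictions of the old and updated models on the same test-set samples.

# Paired t-tests supporting (or not) replacement of the old model by the updated one.
import numpy as np
from scipy import stats

reference = np.array([99.5, 100.1, 100.8, 99.2, 100.5, 101.0, 99.8, 100.3])
updated   = np.array([99.6, 100.0, 100.7, 99.3, 100.6, 100.9, 99.9, 100.2])
old       = np.array([98.9,  99.4, 100.1, 98.6,  99.8, 100.2, 99.1,  99.6])

alpha = 0.05
_, p_updated_vs_ref = stats.ttest_rel(updated, reference)
_, p_old_vs_ref     = stats.ttest_rel(old, reference)
_, p_old_vs_updated = stats.ttest_rel(old, updated)

cond1 = p_updated_vs_ref > alpha                              # condition 1 above
cond2 = p_old_vs_ref <= alpha and p_old_vs_updated <= alpha   # condition 2 above

print(f"updated vs reference: p = {p_updated_vs_ref:.3f}")
print(f"old vs reference: p = {p_old_vs_ref:.4f}, old vs updated: p = {p_old_vs_updated:.4f}")
print("Justification to replace the old model:", cond1 and cond2)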
Utilising model diagnostics in real time, i.e. influence plots with inlier/outlier limits, new samples can be assessed for model inclusion. Inlier statistics help to detect samples that can “fill holes” in the calibration space; these samples help to make the model more robust. High-leverage samples can also be considered for inclusion, as long as they represent an extension of the calibration line and, when included in the new model, do not influence it to the point that the new samples dominate a particular component/factor of the model.
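Leverage screening of candidate samples can be sketched as follows, using the common score-based leverage h_i = 1/n + Σ_k t_ik²/(t_k′t_k). The scores are placeholders, and the warning limit shown is one frequently quoted rule of thumb, not a regulatory requirement.

# Leverage of candidate samples relative to the calibration scores.
import numpy as np

rng = np.random.default_rng(2)
T_cal = rng.normal(size=(120, 3))            # calibration scores: 120 samples x 3 factors (placeholder)
t_new = rng.normal(scale=1.8, size=(5, 3))   # projected scores of 5 candidate samples (placeholder)

n, a = T_cal.shape
ss_scores = (T_cal ** 2).sum(axis=0)                     # t_k' t_k per factor
leverage = 1.0 / n + ((t_new ** 2) / ss_scores).sum(axis=1)

warning_limit = 3 * (a + 1) / n                          # rule-of-thumb limit (assumption)
for i, h in enumerate(leverage):
    flag = "high leverage: check it extends, not distorts, the model" if h > warning_limit else "inlier/normal"
    print(f"candidate {i}: leverage = {h:.3f} ({flag})")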

13.7.3 Summary of model lifecycle management

Unlike classical analytical methods employed by a QC laboratory in the pharmaceutical and related industries, PAT methods are typically calibrated in a first major effort and the calibration model is expected to remain valid for a long period of time. In contrast, a method such as HPLC is validated for specificity and selectivity during its development; however, each time the instrument is used, it requires a calibration procedure using reference standards to establish the basis on which quantitative values are obtained.
The validity of multivariate models developed for PAT applications is determined by the extensive diagnostic methods available for assessing the quality of the X-variables used to predict a new sample. In this manner, a confidence statement about the predicted Y-value can be made based on how closely the X-variables of the new sample lie within the design space of the original calibration model.
In the event an X-outlier is detected, but the predicted Y-value is
in an acceptable range, this is typically an indication that the matrix
has changed, but the constituent of interest is still in specification.
This may be due to a raw material change or a change in the process
equipment efficiency. These samples may be used to make a model
more robust. During the initial development of a PAT-based model, it
may not be possible to capture the widest possible variation. In this
case, the model can be used to find “holes” in the calibration set by
the use of inlier statistics. These samples should be isolated, sent for
reference analysis and added to the model to make it more robust.
The key term in model lifecycle management is robustness, and the continual pursuit of making a model more robust over its lifecycle. Figure 13.26 provides a graphical overview of the model lifecycle management process in the form of the V-model commonly used in GAMP®5 [9].
The V-model also shows one more aspect of model lifecycle management, that of model retirement. As with many things, models become obsolete over time; therefore, the lifecycle model should also include a section on how the model will be phased out when a new technique is to be implemented.

13.8 Chapter summary


QbD/PAT is not a single tool or methodology; it is the condensation of representative sampling, the adoption of state-of-the-art sensor technology, the utilisation of DoE and chemometric methods, the integration with modern control systems and the utilisation of data rather than merely its long-term storage (BIG DATA). Summed up, QbD is common sense, but as the well-known book title reads, Making Common Sense Common Practice, Moore [40], this is the major step that needs to be taken for success.
In too many cases, it has been the authors’ experience that when
faced with the opportunity to implement long-term strategies for
improvement and success, those who have decision making authority
typically take one of the following paths,
Figure 13.26: V-Model of lifecycle management process of multivariate calibration models.

1) In the “old ways” approach, learning stopped at school and the requirement to take on new approaches is too far out of their comfort zone. The same mistakes are made over and over and no thought is given to learning and improvement. This situation is to be avoided by progressively thinking companies.
2) There is an instant reaction to implement QbD, so long as it fits in
the box of the old regime and therefore, when the system does not
perform as expected, the blame game is played to make the QbD
implementation seem inferior to the current practices employed.
3) It is decided to implement QbD, but assignment of this task is
given to some poor soul who is already overworked and has to fit a
new paradigm into an old system (i.e. fitting a square peg in a
round hole). When little progress is made, the QbD approach
“obviously does not work” and is scrapped for the even worse
original approach.
4) A firm decision is made to change the culture and practices in the company. A well-thought-out strategy is implemented based on a core group of champions who can effect the necessary culture change. The culture is taught and spread throughout the organisation and real changes to quality are implemented that can be rolled out to other products and to other facilities in the organisation. These people should be treated like gold as they can think laterally, are willing to challenge the system and have the subject matter knowledge that allows them to succeed. Don’t lose these people or your organisation is likely to suffer.
QbD/PAT is held together by multivariate data analysis and
continuous verification of process data. The latter is provided by
diligent use of TOS process sampling competence. It has to be this
way because the outputs of processes and sensors (both univariate
and multivariate) are of little use unless the embedded information is
extracted from them and can be relied upon with certainty (data
relevance and representativity). PAT tools typically require
chemometrics models for data extraction and the outputs of these
models are fed directly into advanced control systems where actions
can be taken on the process before an issue occurs. This is called
proactive quality control. In order for this to work properly, all
calibration/prediction models must have been thoroughly vetted by
appropriate chemometric validation approaches; there are no shortcuts, such as cross-validation, Esbensen and Paasch-Mortensen [13].
PAT is not a means of bringing the laboratory to the process,
contrary to popular belief. It is a method for gaining better
understanding of the process that is transferred via knowledge
management into actions that improve quality. Bringing a QC
laboratory to the process is an after-the-fact event, i.e. once the
quality assessment is made and the result is a failure, it is too late to
do anything (if you sneeze, you already have a cold). PAT is meant to be predictive, i.e. to act as an early event detection system for corrective action before an issue arises, so long as the corrections are performed within the design space.
Design spaces are typically defined by multivariate and DoE approaches. The models allow multiple inputs to be used to define the space within which the QTPP of the product lies. Data fusion methods are very important in this context as they determine the relevancy of the measurements being made with respect to the actual state of the sample. This is highly important because, if the data are not aligned within a tight window, some of the measurements may represent one process state while other measurements represent a different state. Data alignment must be performed based on a predefined polling time. This time is based on whether a critical change can be detected in time for a correction to be successfully made. Polling time is a function of sensor measurement frequency and available data storage capacity.
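As a small illustration of alignment to a predefined polling time, the following pandas sketch resamples two asynchronous sensor streams onto a common 30-second grid before fusion. The tag names, measurement rates and polling interval are assumptions for illustration only.

# Align an NIR prediction stream (10 s) and a temperature stream (20 s) to a 30 s polling time.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
idx_nir = pd.date_range("2024-01-01 08:00", periods=120, freq="10s")
idx_temp = pd.date_range("2024-01-01 08:00", periods=60, freq="20s")

nir = pd.Series(rng.normal(1.0, 0.02, len(idx_nir)), index=idx_nir, name="nir_pred")
temp = pd.Series(rng.normal(45.0, 0.5, len(idx_temp)), index=idx_temp, name="temp_C")

polling = "30s"   # predefined polling time
fused = pd.concat([nir.resample(polling).mean(), temp.resample(polling).mean()], axis=1)
print(fused.head())   # each row now represents one, and only one, process state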
QbD/PAT can be used for both new and legacy products, but be
warned, if this methodology is applied to legacy products, don’t be
surprised if/when it is discovered that the current process may
actually be far from optimal. There is an old saying, “Processes are
always perfect, they produce as many failures as they were designed
to”. With this in mind, it must be realised that a process was
developed using the best available knowledge at the time. If a crystal
ball was available, all processes would be flawless, but this is
certainly not the case on planet Earth. This fact should not deter the
implementation of QbD/PAT as it provides a regulatory framework to
acknowledge gaps and define best practices to mitigate risk. It also
allows a company’s scientists and engineers to do what they did their
tertiary education for, to be innovative, not just to become glorified
mechanics fixing problems or putting out fires on a constant basis.
Continuous manufacturing systems (CMS) are the ultimate
culmination of a QbD process. They are designed to adapt to the
material used in the process, they use technology to detect issues
and fix them before they become serious, they reject sub-batches off the line when out-of-specification events occur and they ensure that the final product meets all of its efficacy, safety and performance characteristics (i.e. QTPP). The way a CMS works is through the use of multivariate methods to model the product and the use of the outputs in a feed-forward/feedback manner to control both
upstream and downstream operations. CMS is recognised by the US
FDA as the future of solid dose manufacturing and with the approval
of the first system recently, Brennan [35], interest in CMS will
definitely grow in the years to come.
It is the authors’ hope that, after reading this chapter (and for that matter, the whole book), the inspiration to take a new view on processes and products will result. There is a learning curve involved, and it must be climbed, otherwise progress cannot be made. The best strategy for
implementing QbD/PAT is for new products as this represents a
“greenfield” situation. However, many generic manufacturers work
with legacy products and the QbD approach is also highly applicable
to them, as long as a “greenfield” mindset is taken into the project by
all parties concerned.
QbD is achievable for those organisations that are willing to invest
the appropriate time into,
1) The qualification and/or correction of inferior process sampling
(TOS).
2) The right analytical technology(ies) for the problem(s) at hand. This
includes having implemented on-line total process measurement
system quality control (TPMSQC).
3) Performing DoE and using knowledge management to implement
control strategies.
4) Using the results of DoE and other validated studies to develop
multivariate models for use in early detection systems.
5) Integrating an appropriate pharmaceutical quality system (PQS)
based on the ICH Q8-Q11 approach.
6) Utilising the data obtained to implement continuous improvement
strategies.
7) Developing a culture that thinks QbD/PAT and does not look to the
older, yet easier, approaches to solve complex problems. Actually,
experience shows that the old ways are a “band-aid” approach that just hides an issue instead of fixing it.
If points 1–7 above are followed and become the culture in the
company, the QbD initiative has a far greater possibility of success
compared to any current, traditional approach.

13.9 References
[1] US FDA (2004). Pharmaceutical cGMPs for the 21st Century – A
Risk Based Approach, Final Report.
http://www.fda.gov/drugs/developmentapprovalprocess/manufacturing/que
[accessed 3 June 2015].
[2] Esbensen, K.H., Roman-Ospino, A.D., Sanchez, A. and
Romanach, R.J. (2016). “Adequacy and verifiability of
pharmaceutical mixtures and dose units by variographic
analysis (Theory of Sampling) – A call for a regulatory paradigm
shift”, Int. J. Pharm. 499, 156–174.
https://doi.org/10.1016/j.ijpharm.2015.12.038
[3] Code of Federal Regulations 21 CFR Part 211, Current Good
Manufacturing Practice for Finished Pharmaceuticals, Food and
Drug Administration. Department of Health and Human
Services.
[4] ICH Harmonized Tripartite Guideline Q2(R1) (1997). “Validation
of analytical procedures: text and methodology”, Federal
Register 62(96), 27463–27467.
[5] ICH Harmonized Tripartite Guideline Q8(R2) (2009).
“Pharmaceutical development”, Federal Register 71(98).
[6] Dickens, J.E., (2010). “Overview of process analysis and PAT”,
in Process Analytical Technology, Ed by Bakeev, K. John Wiley
& Sons, NY, pp. 1–14.
https://doi.org/10.1002/9780470689592.ch1
[7] ICH Harmonized Tripartite Guideline Q9 (2009).
“Pharmaceutical risk management”, Federal Register 71(106).
[8] ICH Harmonized Tripartite Guideline Q10 (2009). “Pharmaceutical quality system”, Federal Register 74(66).
[9] GAMP®5: A Risk-Based Approach to Compliant GxP
Computerized Systems.
http://www.ispe.org/publications/gamp4togamp5.pdf [accessed
6 January 2017].
[10] ICH Harmonized Tripartite Guideline Q11 (2009). “Development
and manufacture of drug substances (chemical entities and
biotechnological/biological entities)”.
[11] US FDA, Guidance for Industry – Process Validation: General
Principles and Practices.
http://www.fda.gov/downloads/Drugs/Guidances/UCM070336.pdf
[accessed 6 January 2017].
[12] US FDA (2004). Guidance for Industry: PAT – a framework for
innovative pharmaceutical manufacturing and quality assurance.
http://www.fda.gov/downloads/drugs/guidancecomplianceregulatoryinform
[13] Esbensen, K.H. and Paasch-Mortensen, P. (2010). “Process
sampling: theory of sampling, the missing link in process
analytical technologies (PAT)”, in Process Analytical
Technology, Ed by Bakeev, K. John Wiley & Sons, NY, pp. 37–
80. https://doi.org/10.1002/9780470689592.ch3
[14] Bakeev, K. (Ed.) (2010). Process Analytical Technology, 2nd
Edn. John Wiley & Sons, NY.
[15] Plugge, W. and van der Vlies, C. (1992). “The use of near
infrared spectroscopy in the quality control laboratory of the
pharmaceutical industry”, J. Pharm. Biomed. Anal. 10(10–12),
797–803. https://doi.org/10.1016/0731-7085(91)80083-L
[16] Swarbrick, B. (2016). “Near infrared (NIR) spectroscopy and its
role in scientific and engineering applications”, in Handbook of
Measurement in Science and Engineering, Vol. 3, Ed by Kutz, M.
John Wiley & Sons, pp. 2583–2608.
https://doi.org/10.1002/9781119244752.ch71
[17] Berntsson, O., Danielsson, L.-G. and Folestad, S. (2001).
“Characterization of diffuse reflectance fiber probe sampling on
moving solids using a Fourier transform near-infrared
spectrometer”, Anal. Chim. Acta 431(1), 125.
https://doi.org/10.1016/S0003-2670(00)01313-1
[18] Blanco, M., Gozalez Bano, R. and Bertran, E. (2002).
“Monitoring powder blending in pharmaceutical processes by
use of near infrared spectroscopy”, Talanta 56, 203–212.
https://doi.org/10.1016/S0039-9140(01)00559-8
[19] Besseling, R., Damen, M., Tran, T., Nguyen, T., van den Dries,
K., Oostra, W. and Gerich, A. (2015). “An efficient, maintenance
free and approved method for spectroscopic control and
monitoring of blend uniformity: The moving F-test”, J. Pharm.
Biomed. Anal. 114, 471–481.
https://doi.org/10.1016/j.jpba.2015.06.019
[20] Muzzio, F.J., Alexander, A., Goodridge, C., Shen, E., Shinbrot,
T., Manjunath, K., Dhodapkar, S. and Jacob, K. (2004). “Solids
mixing”, in Handbook of Industrial Mixing: Science and Practice,
Ed by Paul, E.L., Atiemo-Obeng, V.A. and Kresta, S.M. John
Wiley & Sons.
[21] Muzzio, F.J., Robinson, P., Wightman, C. and Brone, D. (1997).
“Sampling practices in powder blending”, Int. J. Pharm. 155,
153–178. https://doi.org/10.1016/S0378-5173(97)04865-5
[22] Burns, D.A. and Ciurczak, E.W. (2008). Handbook of Near-
Infrared Analysis, 3rd Edn. CRC Press.
[23] Swarbrick, B. (2014). “Advances in instrumental technology,
industry guidance and data management systems enabling the
widespread use of near infrared spectroscopy in the
pharmaceutical/biopharmaceutical sector”, J. Near Infrared
Spectrosc. 22, 157. https://doi.org/10.1255/jnirs.1121
[24] Lewis, I.R. and Edwards, H.G.M. (2001). Handbook of Raman
Spectroscopy, From the Research Laboratory to the Process
Line. CRC Press.
[25] http://www.kosi.com/raman-spectroscopy/phat-probe-
head.php [accessed 16 June 2016].
[26] Coates, J.P. (2010). “Infrared spectroscopy for process
analytical applications”, in Process Analytical Technology, Ed by
Bakeev, K. John Wiley & Sons, NY, pp. 157–193.
https://doi.org/10.1002/9780470689592.ch6
[27] Westad, F., Gidskehaug, L., Swarbrick, B. and Flåten, G.-R.
(2015). “Assumption free modelling and monitoring of batch
processes”, Chemometr. Intell. Lab. Syst. 149(B), 66–72.
https://doi.org/10.1016/j.chemolab.2015.08.022
[28] Jørgensen, P., Pedersen, J., Jensen, E.P. and Esbensen, K.H.
(2004). “On-line batch fermentation process monitoring –
introducing biological process time”, J. Chemometr. 18, 81–91.
https://doi.org/10.1002/cem.850
[29] Bro, R. (1997). “PARAFAC. Tutorial and applications”,
Chemometr. Intell. Lab. Syst. 38, 149–171.
https://doi.org/10.1016/S0169-7439(97)00032-4
[30] Smilde, A., Bro, R. and Geladi, P. (2004). Multiway Analysis,
Applications in the Chemical Sciences. John Wiley & Sons.
https://doi.org/10.1002/0470012110
[31] Nomikos, P. and MacGregor, J.F. (1994). “Monitoring batch
processes using multiway principal component analysis”,
AICHE J. 40, 1361–1375.
https://doi.org/10.1002/aic.690400809
[32] Kassidas, A., MacGregor, J. and Taylor, P. (1998).
“Synchronization of batch trajectories using dynamic time
warping”, AICHE J. 44, 864–875.
https://doi.org/10.1002/aic.690440412
[33] Wold, S., Kettaneh, N., Friden, H. and Holmberg, A. (1998).
“Modelling and diagnostics of batch processes and analogous
kinetic experiments”, Chemometr. Intell. Lab. Syst. 44, 331–340.
https://doi.org/10.1016/S0169-7439(98)00162-2
[34] Swarbrick, B. (2016). “Chemometrics for near infrared
spectroscopy”, NIR news 27(1), 39–40.
https://doi.org/10.1255/nirn.1584
[35] Brennan, Z. (2016). FDA Allows First Switch From Batch to Continuous
Manufacturing for HIV Drug. http://raps.org/Regulatory-
Focus/News/2016/04/12/24739/FDA-Allows-First-Switch-From-
Batch-to-Continuous-Manufacturing-for-HIV-
Drug/#sthash.0eGkzDLr.dpuf [accessed 18 June 2016].
[36] George, M.L., Rowlands, D., Price, M. and Maxey, J. (2005). The
Lean Six Sigma Pocket Toolbook. McGraw Hill.
[37] Yang, K. and El-Haik, B. (2003). Design for Six Sigma, A Roadmap for Product Design. McGraw Hill.
[38] Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978). Statistics for
Experimenters, An Introduction to Design, Data Analysis and
Model Building. John Wiley & Sons.
[39] Swarbrick, B., Grout, B. and Noss, J. (2005). “The rapid, at-line
determination of starch content in sucrose-starch blends using
near-infrared reflectance spectroscopy; a process analytical
technology initiative”, J. Near Infrared Spectrosc. 13, 1–8.
https://doi.org/10.1255/jnirs.451
[40] Moore, R. (2013). Making Common Sense Common Practice.
Reliability Web Publishing.
Index
A
Abstract variables 391
Accuracy 59, 216
Additive effects 123, 124, 129
Advanced process control (APC) 429
Alternative hypothesis 26
Ambiguity resolution 438
Analysis level 431
Analysis of variance (ANOVA) 18, 319
Analyte 66
At-line 421
Autocorrelation 50
Auto-scaling 107, 108, 129
Averaging 109

B
Baseline shift 122
Batch trajectory 435
Bias 15, 197
Big data solutions 431
Bilinear methods 10
Blind source separation 399
Block effect 313
Box–Behnken. See Design of experiments (DOE)
Box plot 24

C
Calibration set 163
Proper training data 235
Requirements 163
Training set 163
Training stage 265
Categorical variables 309
Category variables. See Variables
Central limit theorem 33, 225
Centring 77, 93
Chromatography 114
Classification 9, 263
Basic steps 266
Classification stage 265
Distance metric 257
Modelling power 276
Pattern cognition 255
Pattern recognition 267
Si vs Hi 270
Supervised classification 255
Unsupervised classification 255
Classification table 266
Closed modus 277
Cluster analysis 255
Hierarchical cluster analysis 258
Hierarchical cluster analysis (HCA)
Agglomerative methods 258
Dendrogram 259
Divisive methods 258
k-Means clustering 256
Nearest neighbour distance 256
Ward’s method 259
Coefficient of variation 246, 328
Collinearity 169, 281
Composite sample 66
Confusion matrix 289
Continuous improvement 419
Continuous manufacturing systems (CMS) 419, 429, 442
Control charts 45
Control level 430
Control limits 47
Corrective and preventative maintenance (CAPA) 419
Correct sampling 246
Correlation 4, 48, 73, 142
Covariance 4, 48, 142
Critical process parameters (CPP) 418
Critical quality attributes (CQA) 418
Cross-validation. See Validation
Cross-validation vs test set validation 237

D
Data
Description 9
Fusion 445
Matrix 69
Dimension 70
Representativity 223
Structure 7, 93
Decomposition 79, 93
Deflation 178
Degrees of freedom (DF) 17, 36, 37, 320, 324
Derivation 122
Descriptive statistics 25, 139, 256
Design for Six Sigma 13, 444
Design of experiments (DOE) 233
Blocking 313, 329
Centre samples 309
Cube plots 343
Cubic blending 364
Curvature 330
D-Optimal designs 358
Effects
Definition and calculation 303
Hierarchy of effects 322
Interaction 302
Interaction effect 306
Main 302
Main effect 302, 306
Factor influence designs 311
Full factorial designs 301
Full cubic 363
Mixture designs 360
Antagonistic blending 364
Piepel direction 361
Scheffe polynomials 362
Simplex 360
Simplex centroid designs 365
Simplex lattice designs 363
Synergistic blending 364
Model diagnostics plots 337
Multi-linear constraint 357
Constrained designs 356
Constrained mixture spaces 371
Optimality 358
Optimisation designs 311, 316
Analysis 317
Box–Behnken 317, 318
Central composite design (CCD) 317, 349
Face centred design (FCD) 349
Which to choose 319
Resolution 307
Response optimisation
Graphical optimisation 353
Numerical optimisation 353
Response surfaces 342
Contour plots 342
Rotatable 318
Screening designs 311, 313
Fractional factorial designs 307
Plackett–Burman designs 314, 315
Special cubic 363
Star samples 317
Statistics
Adequate precision 328
Adjusted R-squared 328
Mean square 320
Mean squares of regression 325
Predicted R-squared 328
Pure error 334
Residual sum of squares 324
R-squared 328
Sum of squares regression 324
Total sum of squares 323
Steepest ascent 316
U-simplex 375
Design space 419, 429, 444
Desired state 419
Dimension reduction 79
Dimensions 166
Direct observations 2
Discrimination 9
Disjoint modelling 264
Dynamic time warping (DTW) 432

E
Early event detection (EED) 419
Edge of failure 429
Effective dimension 73
E-matrix 95
Empirical modelling 297
Engineering process control (EPC) 48
Estimate 161
Evolving factor analysis 403, 405
Evolving processes 432
Execution level 430
Expensive measurement 3
Experimental design. See Design of experiments (DOE)
Explained variance 95
Explained variance plot 144
Extrapolation 164
Extreme objects 151

F
Factor analysis 392
Factor matrix 392
Factor rotation
Equimax rotation 394
Parsimax rotation 394
Quartimax rotation 393
Varimax rotation 393
Orthogonal rotation 392
Factor influence study 316
Feature space 281
Fisher’s iris data 260
Fluid bed drying (FBD) 425
Focussed beam reflectance measurement (FBRM) 428
Formal statistical tests 30
F-residuals 150, 191

G
GAMP®5 431
Garbage in/Garbage out 138, 264
Gaussian 22, 51
General mixture models 362
Global estimation error (GEE) 58, 61, 223
Grab sample 56, 66
Grouping and segregation 247
Guided locally weighted regression 441

H
Half normal probability plot 322
Heterogeneity 15, 54, 223, 424
Heteroscedasticity 337
Hidden structures 6
Hierarchical cluster analysis (HCA). See Cluster analysis
Hierarchical modelling 255, 438
Classification–classification hierarchy 438
Classification–prediction hierarchy 438
Prediction–prediction hierarchy 438
Hierarchical replication 248
Histogram 15, 20
Homoscedasticity 337
Hotelling’s T² 149, 150, 151, 156, 191

I
Incorrect sampling errors (ISE) 221, 246
Increment 54, 66
Independent component analysis (ICA) 392
Oblique rotation 392
Independently and identically distributed (iid) 20
Indirect observations 2
Inexpensive measurement 3
Inferential statistics 14
Bonferroni limit 320
Confidence intervals 30
Empirical distribution function 31
Equivalence of means 38
False negative 27
False positive 27
F-test 35, 325
Hypothesis testing 26
F-test
F-ratio 320
Snedecor F-table 36
Kolmogorov–Smirnov 31, 51
Lilliefor’s correction 31
t-test
Paired t-test 196, 251, 453
t-tests 38, 51
Pooled standard deviation 41
Standard deviation 17, 18, 328
Variance 3, 17, 73, 75
Influence plot 150, 192, 205, 448
Inlier statistic 206, 453
In-line 422
Inner relationships 226
Interaction effects 302
International Conference on Harmonisation (ICH) 250, 419
Interquartile range 109
Iterative model development process 450

J
Joint confidence interval 50

K
Kernel functions 281
Kinetic modelling 403
Knowledge management system 431
Kurtosis 23
L
Lack of fit 93, 330, 334, 408
Lean manufacturing 431
Least squares fit 74
Leave-one-object-out. See Validation, Full cross-validation
Leverage 149, 150
Linear discriminant analysis (LDA) 279
Loading plots
Interpretation 83, 142
Loadings 77, 158, 179
Loading weights 179
Lot 65
Lot dimensionality transformation 60

M
Mahalanobis distance 150, 280
Manufacturing execution system (MES) 430
Manufacturing level 430
Martens’ uncertainty test. See Significance testing
Maturity index 433
Mean 14, 15, 328
Mean centre 77
Mean time before failure (MTBF) 429
Measurement error 35, 252
Measurement noise 113
Measurement uncertainty 53, 62, 66, 68, 245
Median 16
Mode 17
Model centre 77
Model diagnostics 190
Model lifecycle management 449
Modelling ability 164
Models 93
Interpretation 179
Model stability 210
Model updating 452
Model utilisation 438
Moving average 113
MS. See Mean square
Multiple linear regression (MLR) 167, 168, 169
Multiplicative 124, 129
Multiplicative effects 123
Multivariate calibration 161
Multivariate curve resolution (MCR) 398
Ambiguity
Intensity ambiguity 405
Rotational ambiguity 405
Constraints
Closure constraint 402
Estimated concentrations 408
Estimated spectra 408
Local rank constraints 402
Non-negativity constraint 401
Uni-modality constraint 401
Estimated concentrations 399
Estimated spectra 399
Multivariate curve resolution–alternating least squares (MCR-ALS)
405
Multivariate data analysis 1
Multivariate image analysis (MIA) 237
Multivariate statistical process control (MSPC) 50, 149, 150, 420, 429,
444

N
Negative challenge 286
Net analyte signal 187
NIPALS 94, 176
Noise 7, 122
Normal distribution 14, 20, 22, 322
Normal probability plot 24, 321
of residuals 337
Nugget effect 64
Null hypothesis 26, 268

O
Objectivity 257
Object residuals 78, 96
Object-to-model distance 268
Off-line 422
One-sample t-test. See Inferential statistics
One-sided 29, 36
One variable at a time (OVAT) 13, 299
On-line 422
Outliers 16, 139, 140, 141, 142, 146, 147, 148, 149, 150, 151, 153,
154, 156, 157, 158, 267
Example 154
Overall equipment effectiveness (OEE) 419

P
Paired t-test. See Inferential statistics
Parallel factor analysis (PARAFAC) 432
Pareto chart 320
Parsimonious model 209
Parsimony 258
Partial least squares regression 174
Partial least squares discriminant analysis (PLS-DA) 276
PLS1 175, 178
PLS2 175
Partial replication 248
Partition error 256
Pascal’s Triangle 306
PAT level 430
PCA. See Principal component analysis
PCA rotation. See Factor analysis
PC-models 93
PCR. See Principal component regression
PCs 73, 76, 97, 158
number of 76
Pharmaceutical quality management system (PQS) 418, 419
PLS. See Partial least squares regression
Polling rate 445
Population 13
Power 28, 42, 43
Precision 14, 35, 59
Intermediate precision 251
Repeatability 251
Reproducibility 251
Precision agriculture 417
Predicted residual sum of squares (PRESS) 196, 328
Predicted vs actual plot 338
Predicted vs reference 173
Prediction 10, 161, 162
Prediction ability 164
Prediction error 166
Prediction variance 166
Preprocessing 107, 130
Baseline correction 115
Correlation optimisation warping (COW) 127
Derivative
Second derivative 119
Derivatives 117
First derivative 117, 118, 119, 122
Quadratic baseline effects 127
Savitzky–Golay derivative 120
Second derivative 118, 119, 122, 128, 129
Segment-gap derivative 119
Detrending 126
Normalisation 114
Scaling
Logarithmic transformations 109
Mean centring 107
Median centring 109
Minimum centring 109
Spherising 109
Variance scaling 107, 108
Scatter correction
Common amplification model 125
Common offset 116
Common offset model 125
Extended multiplicative scatter correction (EMSC) 126
Multiplicative scatter correction 124, 144
Standard normal variate 123
Smoothing 109, 113, 129
Gaussian filter 113
Median filter 113
Moving block smoothing 113
Savitzky–Golay smoothing 114
Smoothing window 113
PRESS 196
Principal component analysis 69, 144
Algorithm 74
Summary 158
Principal component models 93
Principal component (PC) 73
number of 76, 94, 97, 166
Principal component regression 169
Principles of validation. See Validation
Probability density function 22
Problem-dependent 65
Problem reasons 218
Process analytical technology (PAT) 19, 418, 421
Process chronology 447
Process signature 435
Process validation guidance 420
Projection 151, 158, 447
Projection models 93
Projections 10
Proper process sampling 235
p-value 27, 28, 320, 325

Q
Q-residuals 191
Quality by design (QbD) 418
Quality risk management 419
Quality target product profile (QTPP) 419
Quantitative structure activity relationship (QSAR) 162

R
Random sampling 18
Range 109
Rank 70
Real time release 419, 442
Reference method 163
Regression 10, 161
Regression coefficients 113, 305
Relative coefficient of variation 61
Relative time modelling (RTM) 432
Remote sensing 423
Removing outliers 192
Replication 243, 245
Full replication 248
Replicate analysis 244
Replicate error 334
Replicate primary samples 246
Replicates 109
Replication experiment 61, 67, 246
Representative 163
Representative sampling 18, 66
Residual matrix 93
Residuals 74, 78, 95, 158, 408
Residuals vs predicted plot 337
Residuals vs run plot 338
Residual variance 95
Robust 16
Root mean square errors 195
Root mean square error of cross validation (RMSECV) 195, 229
Root mean square error of prediction (RMSEP) 166, 195, 232
Root mean square of calibration (RMSEC) 195

S
Sample 14, 65
Sampling bias 15
Sampling increment 56
Sampling unit operation 67
Sampling unit operations 54, 59
Savitzky–Golay 113
Scatter effects 122, 123
Scatter effects plot 130
Scatter plot 70
Scores 158
Projected scores 205
Score plots 80
Interpretation 80, 140, 142
Validation scores 152
Score trajectory 434
Score vectors 78
Self-modelling mixture analysis 399
Signal 7
Significance 26
Significance levels 268
Significance test 208
Significance testing
Martens’ uncertainty test 199, 208
Signs of problems 217
Simple structure 391
Singular value decomposition (SVD) 178
Singular values 405
Six Sigma 13, 431, 444
Skewness 17, 23
Soft Independent modelling of class analogy (SIMCA) 263
Coomans’ Plot 268
Discrimination power 265, 273
Membership plot 270
Model distance plot 272
Modelling power 276
Si/S0 vs Hi 272
Si vs Hi 270
Variable discrimination power plot 273
Span 164
Spatial lot coverage 58
Special cause 48
Specimen 65
Spectroscopy 87
Absorbance 111
Beer–Lambert’s law 111, 144
Diffuse reflectance 123
Full width at half maximum (FWHM) 121
Mid infrared (MIR) 111, 428
Molar absorptivity (ε) 111
Near infrared (NIR) 19, 110
Overlapping spectra 168
Raman spectroscopy 111, 426
Reflectance 111
Specular 123
Resolution 121
Spectroscopic data 110, 143
Transformations 111
Transmission 111
UV-Vis 111
Variables 107
Square error of prediction 195
SS 320
Stability plot 153
Stability test 208
Standard error 32
Standard deviation of differences (SDD) 43, 196, 251
Standard error of laboratory (SEL) 195, 251
Standard error of prediction (SEP) 197
Standard operating procedure (SOP) 138
Statistical methodologies 13
Statistical process control (SPC) 13, 47, 51, 444
Statistical sampling 222
Steady state processes 432
Studentisation 193, 337
Student’s t-test 149. See Inferential statistics
Subjectivity 257
Subject matter expertise (SME) 307
Sum of squares 18, 320
Supervisory control and data acquisition (SCADA) 430
Support vector machine classification 281
Support vectors 282
Suspicious data 157
Synergy 299

T
Theory of Sampling (TOS) 18, 53, 65, 221
Fundamental sampling error 247
Fundamental sampling principle 58, 67
Grouping and segregation error 59
Heterogeneity
Constitutional heterogeneity 54, 55
Distributional heterogeneity 54, 55
Spatial heterogeneity 245
Total analytical error (TAE) 58, 243, 449
Total analytical uncertainty 243
Total sampling error 58, 449
Unified sampling responsibility 68
Time series 45
Total process measurement system quality control 448
Total residual variance 96
Troodos 154
Tucker 3 models 432
Two-sample t-test. See Hypothesis testing
Two-sided 29
Type I error 27, 43, 268
Type II error 27, 43, 268

U
Uncertainty test 208
Univariate methods 1
Univariate regression 167
Univariate statistics 13

V
Validation 164
Cross-validation 165, 225
Categorical cross-validation 235
Full cross-validation 225
Leverage validation 165
Random cross-validation 234
Segmented cross-validation 225
Systematic cross-validation 235
Internal validation set 234
Principles of proper validation 221
Test set validation 164, 224
External validation set 234
Validation objectives 223
Variability 13, 14, 297
Variables
Category 310
Continuous 309
Dependent (y-variables) 161
Discrete variables 107, 310
Discrete numeric 309
Dummy 276
Independent (x-variables) 161
Indicator 310
Selective 6
Time series 107
Variable space 70
Variance
definition 3
Variographic analysis 63, 448
Nugget effect 67
Sill 67
Variography 67
Visualisation 255
V-model 453

X
X-residual 151, 190
x-variable. See Variables
X–Y relation outlier plots 197

Y
y-residual 193
y-variable. See Variables
