MATHEMATICAL FOUNDATIONS OF MACHINE LEARNING (NMAG469, FALL TERM 2020-2021)
HÔNG VÂN LÊ ∗
Contents
1. Basic mathematical problems in machine learning
1.1. Learning, inductive learning and machine learning
1.2. A brief history of machine learning
1.3. Main tasks of current machine learning
1.4. Main types of machine learning
1.5. Basic mathematical problems in machine learning
1.6. Conclusion
2. Mathematical models for supervised learning
2.1. Discriminative model of supervised learning
2.2. Generative model of supervised learning
2.3. Empirical Risk Minimization and overfitting
2.4. Conclusion
3. Mathematical models for unsupervised learning
3.1. Mathematical models for density estimation
3.2. Mathematical models for clustering
3.3. Mathematical models for dimension reduction and manifold learning
3.4. Conclusion
4. Mathematical model for reinforcement learning
4.1. A setting of a reinforcement learning
4.2. Markov decision process
4.3. Existence and uniqueness of the optimal policy
4.4. Conclusion
5. The Fisher metric and maximum likelihood estimator
5.1. The space of all probability measures and total variation norm
5.2. The Fisher metric on a statistical model
5.3. The Fisher metric, MSE and Cramér-Rao inequality
5.4. Efficient estimators and MLE
5.5. Consistency of MLE
5.6. Conclusion
Date: October 28, 2020.
∗ Institute of Mathematics of ASCR, Zitna 25, 11567 Praha 1, email: [email protected].
It is not knowledge, but the act of learning ... which grants the greatest
enjoyment.
Carl Friedrich Gauss
Recommended Literature.
1. S. Shalev-Shwartz and S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
2. L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
3. M. Mohri, A. Rostamizadeh, A. Talwalkar, Foundations of Machine Learning, MIT Press, 2012.
Why did such a move from artificial intelligence to machine learning hap-
pen?
The answer is that we are able to formalize most concepts and model problems of artificial intelligence in mathematical language, and to represent and unify them in such a way that we can apply mathematical methods to solve many of these problems by algorithms that machines are able to perform.
As a final remark on the history of machine learning I would like to note that data science, much hyped in 2018, has the same goal as machine learning: data science seeks actionable and consistent patterns for predictive uses.4
Now I shall briefly describe the main tasks of current machine learning and the methods to perform these tasks.
4 According to Dhar, V. (2013), Data science and prediction; see also https://en.wikipedia.org/wiki/Data_science
5 The term “regression” was coined by Francis Galton in the nineteenth century to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean of a population). For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context: movement toward the mean of a statistical population. Galton's method of investigation was non-standard at that time: first he collected the data, then he guessed the relationship model of the events.
6 He founded the world's first university statistics department at University College London in 1911, as well as the Biometrical Society and Biometrika, the first journal of mathematical statistics and biometry.
7 Fisher introduced the main models of statistical inference in the unified framework of
parametric statistics. He described different problems of estimating functions from given
data (the problems of discriminant analysis, regression analysis, and density estimation)
as the problems of parameter estimation of specific (parametric) models and suggested the
maximum likelihood method for estimating the unknown parameters in all these models.
Last week we discussed the concept of learning and examined several examples. Today I shall specify the concept of learning by presenting basic mathematical models of supervised learning. A model for machine learning must be able to make predictions and improve its ability to make predictions in light of new data.
The model of supervised learning I present today is based on Vapnik’s
statistical learning theory, which starts from the following concise concept
of learning.
Definition 2.1. ([Vapnik2000, p. 17]) Learning is a problem of function
estimation on the basis of empirical data.
There are two main model types for machine learning: discriminative models and generative models. They are distinguished by the type of functions we want to estimate in order to understand the features of observables.
such that $h_S(x)$ predicts the label of an (unseen) instance $x$ with the least error.
(2.2) $R^L_\mu(h) := \int_{\mathcal{X}\times\mathcal{Y}} L(x, y, h)\, d\mu$
Note that the RHS of (2.6) coincides with the error of estimation in (1.1), i.e., our empirical risk function $R^{L(d_n)}_{\mu_{\delta_S}}$ defined on $\mathcal{H} = \mathbb{R}^{d+1}$ coincides with the error function $R$ in (1.1).
Example 2.5 (0-1 loss). Let $\mathcal{H} = Meas(\mathcal{X}, \mathcal{Y})$. The 0-1 instantaneous loss function $L^{(0-1)} : \mathcal{X}\times\mathcal{Y}\times\mathcal{H} \to \{0, 1\}$ is defined as follows:
$L^{(0-1)}(x, y, h) := d(y, h(x)) = 1 - \delta^y_{h(x)}.$
The corresponding expected 0-1 loss is the probability that the answer $h(x)$ does not coincide with the label $y$:
(2.7) $R^{(0-1)}_{\mu_{\mathcal{X}\times\mathcal{Y}}}(h) = \mu_{\mathcal{X}\times\mathcal{Y}}\{(x, y)\in\mathcal{X}\times\mathcal{Y}\mid h(x)\neq y\} = 1 - \mu_{\mathcal{X}\times\mathcal{Y}}(\{(x, h(x))\mid x\in\mathcal{X}\}).$
Example 2.6. Assume that x ∈ X is distributed by a probability measure
µX and its feature y is defined by y = h(x) where h : X → Y is a measurable
mapping. Denote by Γh : X → X × Y, x 7→ (x, h(x)), the graph of h. Then
(x, y) is distributed by the push-forward measure µh := (Γh )∗ (µX ), where
(2.8) $(\Gamma_h)_*\mu_{\mathcal{X}}(A) = \mu_{\mathcal{X}}\big(\Gamma_h^{-1}(A)\big) = \mu_{\mathcal{X}}\big(\Gamma_h^{-1}(A\cap\Gamma_h(\mathcal{X}))\big).$
$E_\mu(h(x)) = \begin{cases} 1 - \mu_{\mathcal{Y}|\mathcal{X}}(\{y = 1\}\,|\,x) & \text{if } h(x) = 1,\\ \mu_{\mathcal{Y}|\mathcal{X}}(\{y = 1\}\,|\,x) & \text{if } h(x) = 0.\end{cases}$
Given $x\in\mathcal{X}$, our aim is to minimize $E_\mu(h(x))$. If $1 - \mu_{\mathcal{Y}|\mathcal{X}}(\{y = 1\}|x)\le\mu_{\mathcal{Y}|\mathcal{X}}(\{y = 1\}|x)$ we set $h(x) := 1$, and $h(x) := 0$ otherwise. Rearranging the inequality into the form $\mu_{\mathcal{Y}|\mathcal{X}}(\{y = 1\}|x)\ge 1/2$, we recover the original definition of the Bayes optimal predictor $f_\mu(x)$.
$L^2(\mathcal{X}, (\Pi_{\mathcal{X}})_*\mu) = \{f\in\mathbb{R}^{\mathcal{X}}\mid i_1(f)\in L^2(\mathcal{X}\times\mathbb{R}, \mu)\},$
The expected risk $R^L_\mu$ is called the $L^2$-risk, also known as the mean squared error (MSE). Show that the regression function $r(x) := E_\mu(i_2(Y)|X = x)$ belongs to $\mathcal{F}$ and minimizes the $L^2(\mu)$-risk.
(2.19) $\hat R^L_S(h) := \frac{1}{n}\sum_{i=1}^n L(x_i, y_i, h).$
If L is fixed, then we also omit the superscript L.
The empirical risk is a function of two variables: the “empirical data” S
and the predictor h. Given S a learner can compute R̂S (h) for any function
h : X → Y. A minimizer of the empirical risk should also “approximately”
minimize the expected risk. This is the empirical risk minimization princi-
ple, abbreviated as ERM.
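The following Python sketch illustrates the ERM principle on a toy binary labeling problem; the threshold hypothesis class and the data-generating rule are illustrative assumptions, not taken from the notes.

```python
# A minimal sketch of the ERM principle for a finite hypothesis class of
# threshold classifiers h_t(x) = 1_{x > t} (an assumption for illustration).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic labeled sample S = ((x_1, y_1), ..., (x_n, y_n)).
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = (x > 0.4).astype(int)  # true labeling function, unknown to the learner

# Finite hypothesis class H of threshold classifiers.
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """Empirical 0-1 risk (2.19): fraction of misclassified sample points."""
    return np.mean((x > t).astype(int) != y)

# ERM: pick the hypothesis minimizing the empirical risk over H.
risks = np.array([empirical_risk(t) for t in thresholds])
t_erm = thresholds[np.argmin(risks)]
print(f"ERM threshold: {t_erm:.2f}, empirical risk: {risks.min():.3f}")
```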
Remark 2.13. We note that
(2.20) $\hat R^L_S(h) = R^L_{\mu_S}(h),$
where $\mu_S$ is the empirical measure on $\mathcal{X}\times\mathcal{Y}$ associated to $S$, cf. (2.6). If $h$ is fixed, then by the weak law of large numbers, see e.g. Proposition B.2 in the Appendix, the RHS of (2.20) converges in probability to the expected risk $R^L_\mu(h)$; so we could hope to find a condition under which the RHS of (2.20), for a sequence of $h_S$ instead of a fixed $h$, converges to $R^L_{\mu,\mathcal{H}}$.
Example 2.14. In this example we shall show the failure of ERM in certain cases. We shall consider a discriminative model $(\mathcal{X}, \mathcal{Y}, \mathcal{H} = \mathcal{Y}^{\mathcal{X}}, L, P(\mathcal{X}\times\mathcal{Y}))$ where $\mathcal{Y} = \{0, 1\}$ and $L$ is the 0-1 loss function, see Example 2.5. Then the empirical 0-1 risk $\hat R^{0-1}_S$ is defined as follows:
(2.21) $\hat R^{0-1}_S(h) := \frac{|\{i\in[n] : h(x_i)\neq y_i\}|}{n}$
for a training data $S = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ and a function $h : \mathcal{X}\to\mathcal{Y}$. We also often call $\hat R^{0-1}_S(h)$ the training error or the empirical error.
Now we assume that labeled data $(x, y)$ are generated by a map $f : \mathcal{X}\to\mathcal{Y}$, i.e., $y = f(x)$, and furthermore, $x$ is distributed by a measure $\mu_{\mathcal{X}}$ on $\mathcal{X}$. Then $(x, f(x))$ is distributed by the measure $\mu_f = (\Gamma_f)_*(\mu_{\mathcal{X}})$, see Example 2.6. Since $\mathcal{H} = \mathcal{Y}^{\mathcal{X}}$, we have $f\in\mathcal{H}$, and clearly $R^{0-1}_{\mu_f}(f) = 0$. Next, for any given $\varepsilon > 0$ and any $n\in\mathbb{N}^+$, we shall find a map $f\in\mathcal{Y}^{\mathcal{X}}$, a measure $\mu_{\mathcal{X}}$,
Remark 2.17. (1) We omit the ERM principle for generative models of supervised learning and refer the reader to [Vapnik1998, §1.9.2, p. 36-37].
(2) The ERM principle was the point of departure from classical parametric statistics to the statistical learning theory founded by Vapnik [Vapnik1998, p. 7].
2.4. Conclusion. A discriminative model for supervised learning consists of a quintuple $(\mathcal{X}, \mathcal{Y}, \mathcal{H}, L, P_{\mathcal{X}\times\mathcal{Y}})$ where $\mathcal{H}\subset Meas(\mathcal{X}, \mathcal{Y})$ and $P_{\mathcal{X}\times\mathcal{Y}}\subset P(\mathcal{X}\times\mathcal{Y})$. In this model the aim of a learner is to find a prediction rule $A : S\mapsto h_S\in\mathcal{H}$ for any sequence of i.i.d. labeled training data $S = S_n\in(\mathcal{X}\times\mathcal{Y})^n$ of size $n$ such that $(x, h_S(x))$ approximates the training data best, i.e., $R^L_\mu(h_S)$ is as small as possible. The ERM principle suggests that we could choose $h_S$ to be the minimizer of the empirical risk $\hat R^L_S$ instead of the unknown function $R^L_\mu$, and we hope that as the size of $S$ increases the expected error $R^L_\mu(h_S)$ converges to the optimal performance error $R^L_{\mu,\mathcal{H}}$. Without further conditions on $\mathcal{H}$ and $L$ the ERM principle does not work.
A generative model of supervised learning is a discriminative model $(\mathcal{X}, P(\mathcal{Y}), \mathcal{H}, L, P_{\mathcal{X}\times P(\mathcal{Y})})$ whose training data are of the form $\{(x_1, \delta_{y_1}), \cdots, (x_n, \delta_{y_n})\}$. Discriminative models can be regarded as “limit” cases of generative models.
Finally I would like to remark that there is another popular type of gen-
erative models, called Bayesian models, which we shall learn at the end of
our course.
14 Note that in this example the instantaneous loss function depends only on two variables $x\in\mathcal{X}$, $\theta\in\Theta$.
15 Note that the log-likelihood function $\log p_\theta(x)$ does not depend on the choice of a dominating measure $\mu_0$, i.e., if we replace $\mu_0$ by $\mu_0'$ that also dominates $p(\theta)$ then $\log\frac{dp(\theta)}{d\mu_0} = \log\frac{dp(\theta)}{d\mu_0'}$.
where $p^n_\theta(S_n)$ is the density of the probability measure $\mu^n_\theta$ on $\mathcal{X}^n$. It follows that the minimizer $\theta$ of the empirical risk $\hat R^L_{S_n} = R^L_{\mu_{S_n}}$ is the maximizer of the log-likelihood function $\log[p^n_\theta(S_n)]$, cf. (2.5).
According to the ERM principle, which is inspired by the weak law of large numbers, the minimizer $\theta$ of $\hat R^L_{S_n}$ should provide an “approximation” of the density $p_u$ of the unknown probability measure $\mu_u$ that governs the distribution of the i.i.d. $x_i\in\mathcal{X}$.
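As a concrete illustration of this correspondence between ERM with the minus log-likelihood loss and MLE, the following sketch checks numerically that the closed-form MLE minimizes the empirical risk; the Gaussian family is an assumption made for illustration, not part of the notes.

```python
# Sketch: for the Gaussian family, the minimizer of the empirical risk
# (average minus log-likelihood) is the MLE (sample mean, sample std).
import numpy as np

rng = np.random.default_rng(1)
S_n = rng.normal(loc=2.0, scale=1.5, size=1000)  # i.i.d. sample from mu_u

def empirical_risk(mu, sigma):
    """(1/n) sum_i -log p_theta(x_i) for theta = (mu, sigma)."""
    return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                   + (S_n - mu)**2 / (2 * sigma**2))

# Closed-form MLE for the Gaussian family.
mu_hat, sigma_hat = S_n.mean(), S_n.std()
# Any other parameter value has larger empirical risk.
assert empirical_risk(mu_hat, sigma_hat) <= empirical_risk(2.1, 1.4)
print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
```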
Remark 3.3. (i) The instantaneous loss function $L(x, y, h)$ in supervised learning measures the discrepancy between the value of a prediction $h(x)$ and a possible feature $y$, see (2.3).
(ii) In the density estimation problem it is harder to figure out a correct choice of an instantaneous loss function $L : \mathcal{X}\times\Theta\to\mathbb{R}$, or more generally, an instantaneous loss function $L : \mathcal{X}\times\Theta\times P(\Theta)\to\mathbb{R}$ as in the next example of non-parametric estimation, see (3.12). (In Theorem 5.19 we shall give a justification of the instantaneous loss function given by the minus log-likelihood function (3.1), using the notion of an efficient estimator and the Cramér-Rao inequality.)
For $\mathcal{X} = \mathbb{R}$ and $\mu_0 = dx$ the Lebesgue measure, we have another justification of the choice of the risk function as the expected log-likelihood function [Vapnik1998, p. 30]. Namely, this risk function has the following nice properties.
(1) The minimum of the risk function $R^L_{\mu_u} : \Theta\to\mathbb{R}$ (if it exists) is attained at $\theta^*$ such that the density function $p(\theta^*)$ may differ from $p_u$ only on a $\mu_0$-null set.
(2) The risk function $R^L_{\mu_u}$ satisfies the Bretagnolle-Huber inequality
$\int_{\mathcal{X}} |p(x, \theta) - p_u(x)|\, dx \le 2\sqrt{1 - \exp\big(R^L_{\mu_u}(\theta_u) - R^L_{\mu_u}(\theta)\big)}.$
(iii) Note that minimizing the expected minus log-likelihood function $R^L_{\mu_u}(\theta)$ is the same as minimizing the following modified risk function [Vapnik2000, p. 32]
(3.4) $R^*_{\mu_u}(\theta) := R^L_{\mu_u}(\theta) + \int_{\mathcal{X}}\log p_u(x)\, p_u(x)\, d\mu_0 = -\int_{\mathcal{X}}\log\frac{p_\theta(x)}{p_u(x)}\, p_u(x)\, d\mu_0.$
The expression on the RHS of (3.4) is the Kullback-Leibler divergence $KL(p_\theta\mu_0\,|\,\mu_u)$ that is used in statistics for measuring the divergence between $p_\theta\mu_0$ and $\mu_u = p_u\mu_0$. The Kullback-Leibler divergence $KL(\mu|\mu')$ is defined for probability measures $(\mu, \mu')\in P(\mathcal{X})\times P(\mathcal{X})$ such that $\mu\ll\mu'$, see also Remark 5.9 below. It is a divergence, i.e., it satisfies the following properties:
(3.5) $KL(\mu|\mu') \ge 0$ and $KL(\mu|\mu') = 0$ iff $\mu = \mu'$.
By (3.4), a minimizer of the expected risk function $R^L_{\mu_u}(\theta)$ on $\Theta$ minimizes the KL-divergence $KL(\mu_\theta|\mu_u)$ regarded as a function on $\Theta$. The relations in (3.5) justify the choice of the expected risk function $R^L_{\mu_u}$.
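The following small numerical sketch illustrates the divergence properties (3.5) on a finite sample space; the concrete distributions are illustrative assumptions.

```python
# Sketch: KL divergence on a finite set, checking nonnegativity,
# asymmetry, and KL(p|p) = 0 as in (3.5).
import numpy as np

def kl(p, q):
    """KL(p|q) = sum_x p(x) log(p(x)/q(x)) for strictly positive p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(kl(p, q), kl(q, p))   # both positive, and asymmetric in general
print(kl(p, p))             # 0 iff the two measures coincide
```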
To measure the accuracy of the estimator $\hat p^{PR}_{S_n}(h;\cdot)$ we first need to define an expected loss function $R^L$ on the hypothesis class $\mathcal{H}$ of possible densities of probability measures we are interested in. In the given case we choose $\mathcal{H}$ to be defined as follows
(3.10) $\mathcal{H} = P(\beta, L) := \{p\in\mathbb{R}^{\mathbb{R}}_{\ge0}\cap\Sigma_{\mathbb{R}}(\beta, L)\ \big|\ \int_{\mathbb{R}} p(x)\, dx = 1\}$
where $\Sigma_{\mathbb{R}}(\beta, L)$ is the Hölder class on $\mathbb{R}$,16 the set of $l = \lfloor\beta\rfloor$ times differentiable functions $f : \mathbb{R}\to\mathbb{R}$ whose derivative $f^{(l)}$ satisfies
(3.11) $|f^{(l)}(x) - f^{(l)}(x')| \le L|x - x'|^{\beta - l}, \quad\forall x, x'\in\mathbb{R}.$
Let $P_{\mathbb{R}}\subset P(\mathbb{R})$ be a statistical model of possible distributions $\mu' = p'dx$ that contains our unknown distribution $\mu_u = p_u dx$. By definition $P_{\mathbb{R}} = \mathcal{H}$. A natural candidate for an instantaneous loss function
(3.12) $L : \mathbb{R}\times\mathcal{H}\times P_{\mathbb{R}}\to\mathbb{R}$
is the square distance function
(3.13) $L(x, p, p') := |p(x) - p'(x)|^2.$
This induces the expected loss at $x_0\in\mathbb{R}$ function as follows
(3.14) $R^L := R^L_{|x_0} : \mathcal{H}\times P_{\mathbb{R}}\to\mathbb{R},\ (p, p')\mapsto|p(x_0) - p'(x_0)|^2.$
Let $A_n\subset\mathcal{H}^{\mathbb{R}^n}$ be the set of possible estimators $\hat p_n : \mathbb{R}^n\to\mathcal{H}$, $S_n\mapsto\hat p_{S_n}\in\mathcal{H}$. Then we define the expected accuracy at $x_0$ function
$MR^L_{|x_0} : A_n\to\mathbb{R}$
by
(3.15) $MR^L_{|x_0}(\hat p_n) := MSE_{|x_0}(\hat p_n) := E_{\mu^n_u}\big(R^L_{|x_0}(\hat p_n(S_n), p_u)\big),$
where $\mu_u = p_u dx$ and $S_n$ is distributed by $\mu^n_u$.
For instance, for the Parzen-Rosenblatt estimator $\hat p^{PR}_n(h;\cdot)$, $S_n\mapsto\hat p^{PR}_{S_n}(h;\cdot)$, we have
(3.16) $MSE_{x_0}(\hat p^{PR}_n(h;\cdot)) = E_{\mu^n_u}\big[(\hat p^{PR}_{S_n}(h;\cdot)(x_0) - p_u(x_0))^2\big].$
Note that the RHS of (3.16) measures the expected accuracy at $x_0$ of the estimator $\hat p^{PR}_{S_n}$, averaging over $S_n\in\mathbb{R}^n$. This is an important concept of the accuracy of an estimator in the presence of uncertainty.
It is known that under certain constraints on the kernel function $K$, and assuming that the infinite dimensional statistical model $P = \mathcal{H}$ of density functions is given by (3.10), the mean expected risk $MSE_{x_0}(\hat p^{PR}_n(h;\cdot))$ converges to zero uniformly on $\mathbb{R}$ as $n$ goes to infinity and $h$ goes to zero [Tsybakov2009, Theorem 1.1, p. 9].
16 $\lfloor\beta\rfloor$ denotes the greatest integer strictly less than $\beta\in\mathbb{R}$.
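A sketch of the Parzen-Rosenblatt estimator with a Gaussian kernel, together with a Monte Carlo estimate of the expected accuracy (3.16); the kernel, the target density, and all numerical parameters are assumptions made for illustration.

```python
# Sketch: Parzen-Rosenblatt estimator and Monte Carlo MSE at a point x0.
import numpy as np

rng = np.random.default_rng(2)

def parzen_rosenblatt(sample, x0, h):
    """hat p_{S_n}^{PR}(h; x0) = (1/(n h)) sum_i K((x0 - x_i)/h), Gaussian K."""
    u = (x0 - sample) / h
    return np.mean(np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)) / h

# Unknown density p_u: standard normal; evaluate accuracy at x0 = 0.
x0, n, h = 0.0, 500, 0.2
p_u_x0 = 1 / np.sqrt(2 * np.pi)

# Average the squared error over many samples S_n ~ mu_u^n, cf. (3.16).
errors = [(parzen_rosenblatt(rng.normal(size=n), x0, h) - p_u_x0)**2
          for _ in range(200)]
print(f"estimated MSE at x0: {np.mean(errors):.2e}")
```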
Remark 3.4. We have considered two models for density estimation, using different instantaneous loss functions to measure the accuracy of our estimators in terms of an expected loss function. In the first model of density estimation $(\mathcal{X}, \mathcal{H} = \Theta, L : \mathcal{X}\times\mathcal{H}\to\mathbb{R}, P_{\mathcal{X}} = \mathcal{H})$ we let $L(x, \theta) := -\log p_\theta(x)$. As in the mathematical model of supervised learning, the ERM principle says that a minimizer of the empirical risk function $R^L_{\mu_{S_n}}(\theta) = -\frac{1}{n}\log[p^n_\theta(S_n)]$ should be the best estimator, i.e., it should minimize the expected loss function $R^L_{\mu_u} : \Theta\to\mathbb{R}$. Note that a minimizer of $R^L_{\mu_{S_n}}$ is the MLE estimator: $\theta$ must maximize the value $\log p^n_\theta(S_n)$ given data $S_n\in\mathcal{X}^n$.
In the second model for density estimation $(\mathcal{X} = \mathbb{R}^m, \mathcal{H}, L, P_{\mathcal{X}})$ we use an instantaneous loss function $L : \mathcal{X}\times\mathcal{H}\times P_{\mathcal{X}}\to\mathbb{R}$. Using $L$ we define a new concept, the expected accuracy function $MR^L(\hat p_n)$ of an estimator $\hat p_n : \mathcal{X}^n\to\mathcal{H}$, which measures the accuracy of $\hat p_n$, averaging over observable data $S_n\in\mathcal{X}^n$ with the help of the unknown distribution generating our sample data on $\mathcal{X}$ and hence on $\mathcal{X}^n$. In the second model we don't discuss how to find a “best” estimator $\hat p_n : \mathcal{X}^n\to\mathcal{H}$; we find the Parzen-Rosenblatt estimator by a heuristic argument. The notion of a mean expected loss function is an important concept in the theory of the generalization ability of machine learning, which gives an answer to Question 1.5: how to quantify the difficulty/complexity of a learning problem.
3.2. Mathematical models for clustering. Given an input measurable
space X , clustering is the process of grouping similar objects x ∈ X together.
There are two possible types of grouping in machine learning: a partitional
clustering, where we partition the objects into disjoint sets; and a hierarchi-
cal clustering, where we create a nested tree of partitions. We propose the
following unified notion of a clustering.
Let Ωk denote a finite sample space of k elements ω1 , · · · , ωk .
Definition 3.5. A clustering of $\mathcal{X}$ is a measurable mapping $\chi_k : \mathcal{X}\to\Omega_k$. A clustering $\chi_k : \mathcal{X}\to\Omega_k$ is called hierarchical, if there exists a number $l\le k-1$ such that $\chi_k = \chi_{lk}\circ\chi_l : \mathcal{X}\xrightarrow{\chi_l}\Omega_l\xrightarrow{\chi_{lk}}\Omega_k$.
A probabilistic clustering of $\mathcal{X}$ is a probabilistic morphism $\chi_k : \mathcal{X}\rightsquigarrow\Omega_k$, i.e. a measurable mapping $\chi_k : \mathcal{X}\to P(\Omega_k)$. A probabilistic clustering $\chi_k : \mathcal{X}\rightsquigarrow\Omega_k$ is called hierarchical, if there exists a number $l\le k-1$ such that $\chi_k = \chi_{lk}\circ\chi_l : \mathcal{X}\stackrel{\chi_l}{\rightsquigarrow}\Omega_l\stackrel{\chi_{lk}}{\rightsquigarrow}\Omega_k$.
Now we can define a model of clustering similarly to the case of density estimation. To formalize the notion of similarity we need a distance function $d : \mathcal{X}\times\mathcal{X}\to\mathbb{R}^+$, i.e., $d$ is nonnegative, symmetric, satisfies the triangle inequality, and $d(x, y) = 0$ iff $x = y$. Then we pick a set $\mathcal{C}$ of possible clusterings, i.e., $\mathcal{C}$ is a subset of all measurable/probabilistic morphisms $\cup_{k=1}^\infty\{\chi_k : \mathcal{X}\rightsquigarrow\Omega_k\}$, and the goal of a clustering algorithm is to find a clustering $\chi_k\in\mathcal{C}$ of minimal cost
(3.17) $G : (\mathcal{X}, d)\times\mathcal{C}\to\mathbb{R}.$
The cost function $G$ is also called the objective function.
Example 3.6. The k-means objective function is one of the most popular clustering objectives. In k-means, given a clustering $\chi_k : \mathcal{X}\to\Omega_k = \{\omega_1, \cdots, \omega_k\}$, the data is partitioned into disjoint sets $C_1 = \chi_k^{-1}(\omega_1), \cdots, C_k = \chi_k^{-1}(\omega_k)$, and each $C_i$ is represented by its centroid $c_i := c(C_i)\in(\mathcal{X}', d')$, where $\mathcal{X}\subset\mathcal{X}'$ and $d = d'_{|\mathcal{X}}$. The centroid $c_i$ is defined as the point of $\mathcal{X}'$ that minimizes the average distance to points in $C_i$ using a measure $\mu$, i.e.:
$c_i := \arg\min_{x'\in\mathcal{X}'}\int_{C_i} d(x, x')\, d\mu(x).$
Note that we assume that $\mu$ is known and that it governs the distribution of $x$ in $\mathcal{X}$. Then we define the k-means objective function $G_k$ as follows:
(3.18) $G_k((\mathcal{X}, d), \chi_k) := \sum_{i=1}^k\int_{C_i} d(x, c_i)\, d\mu.$
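The following sketch runs Lloyd's algorithm, the standard heuristic for decreasing the k-means objective (3.18), with $d$ taken to be the squared Euclidean distance and $\mu$ the empirical measure; the synthetic data and the choice of $k$ are illustrative assumptions.

```python
# Sketch: Lloyd's algorithm for the k-means objective (3.18).
import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
k = 2

centroids = X[rng.choice(len(X), k, replace=False)]  # initial centroids
for _ in range(20):
    # Assignment step: chi_k maps each x to the nearest centroid's cluster.
    labels = np.argmin(((X[:, None, :] - centroids[None, :, :])**2).sum(-1), axis=1)
    # Update step: each centroid c_i minimizes the average distance in C_i.
    centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])

G_k = sum(((X[labels == i] - centroids[i])**2).sum() for i in range(k)) / len(X)
print("centroids:\n", centroids, "\nobjective G_k:", round(G_k, 4))
```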
Let $\mathrm{Proj}_g(\mathbb{R}^d, \mathbb{R}^n)$ denote the set of all orthogonal projections from $\mathbb{R}^d$ to $\mathbb{R}^n$ and $\mathrm{Emb}_g(\mathbb{R}^n, \mathbb{R}^d)$ the set of all isometric embeddings from $\mathbb{R}^n$ to $\mathbb{R}^d$. Let $\mathcal{F}\subset\mathrm{Proj}_g(\mathbb{R}^d, \mathbb{R}^n)\times\mathrm{Emb}_g(\mathbb{R}^n, \mathbb{R}^d)$ be the subset of all pairs $(W, U)$ of transformations such that $W\circ U = \mathrm{Id}_{|\mathbb{R}^n}$. Exercise 3.9 implies that any minimizer $(W, U)$ of $\hat R_{S_m}$ is an element of $\mathcal{F}$.
Exercise 3.10. ([SSBD2014, Theorem 3.23, p. 325]) Let $C(S_m)\in\mathrm{End}(\mathbb{R}^d)$ be defined as follows
$C(S_m)(v) := \sum_{i=1}^m\langle x_i, v\rangle x_i.$
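The following sketch, our illustration rather than part of the notes, relates the operator $C(S_m)$ to the optimization above: the top-$n$ eigenvectors of the matrix of $C(S_m)$ span the image of an optimal embedding $U$, with $W = U^T$ satisfying $W\circ U = \mathrm{Id}$.

```python
# Sketch: PCA via the operator C(S_m)(v) = sum_i <x_i, v> x_i.
import numpy as np

rng = np.random.default_rng(4)
d, n, m = 5, 2, 300
X = rng.normal(size=(m, d)) @ np.diag([3.0, 2.0, 0.3, 0.2, 0.1])

C = X.T @ X                     # matrix of C(S_m)
eigvals, eigvecs = np.linalg.eigh(C)
U = eigvecs[:, -n:]             # isometric embedding R^n -> R^d (columns)
W = U.T                         # orthogonal projection R^d -> R^n; W U = Id_n

reconstruction_error = np.sum((X - X @ U @ W)**2)
print("top eigenvalues:", np.round(eigvals[-n:], 2))
print("reconstruction error:", round(reconstruction_error, 2))
```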
[Diagram: the interaction loop of reinforcement learning; the Agent sends an Action to the Environment, and the Environment returns a Reward and a new State to the Agent.]
Interactions between the agent and the environment are usually assumed to happen at discrete time steps but can also be continuous in time. For today's lecture we shall assume that the agent's actions are taken at discrete time steps, called decision epochs, in the set $E := \{0, \cdots, T\}$ of all decision epochs. This model can be straightforwardly generalized to a continuous-time setting where actions are taken at arbitrary points in time.
(a) At each time $t\in E$, the agent observes a state $s_t$ in the set $S$ of possible states of the environment, and selects an action $a_t$ in the set $A$ of all possible actions.
(b) The environment moves to a new state $s_{t+1}\in S$ with conditional probability $Pr(s_{t+1}|a_t, s_t)$, and an immediate reward $r_t$ is given to the agent.17
17 $r_t = r(s_t, a_t, s_{t+1})$ is the immediate reward given to the agent at the state $s_t$ when it selects action $a_t$ and the environment moves to state $s_{t+1}$.
• A path is a sequence $h = [s_1, a_1, \cdots, s_T, a_T, s_{T+1}]$.
• A policy $\pi : S\rightsquigarrow A$ is defined uniquely by its value $\pi(a_t|s_t)$.
• Given a policy $\pi$ and a path $h$ of states
$h = [s_1(h), a_1(h), \cdots, s_T(h)],$
the return of $\pi$ along the path $h$ is defined as
(4.1) $R(\pi, h) := \sum_{\tau=0}^{T-1}\gamma^\tau r(s_{1+\tau}(h), \pi(s_{1+\tau}(h)))$
(4.3) $V_\pi(s) = E[r(s, \pi(s))] + \gamma\sum_{s'} Pr(s'|s, \pi(s))\, V_\pi(s').$
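A sketch of iterative policy evaluation, i.e. fixed-point iteration for the Bellman equation (4.3), on a toy two-state MDP; all transition probabilities and rewards are illustrative assumptions.

```python
# Sketch: policy evaluation for (4.3) on a two-state MDP.
import numpy as np

gamma = 0.9
# P[s, s'] = Pr(s' | s, pi(s)) under a fixed policy pi; r[s] = E[r(s, pi(s))].
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
r = np.array([1.0, -0.5])

V = np.zeros(2)
for _ in range(500):                 # fixed-point iteration for (4.3)
    V = r + gamma * P @ V

# Compare with the exact solution V = (I - gamma P)^{-1} r.
print(V, np.linalg.solve(np.eye(2) - gamma * P, r))
```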
(3) The tangent cone fibration CX (resp. the tangent fibration T X ) is the
union ∪x∈X Cx X (resp. ∪x∈X Tx X ), which is a subset of V × V and therefore
it is endowed with the induced topology from V × V .
Example 5.5. Let us consider a mixture family PX of probability measures
pη µ0 on X that are dominated by µ0 ∈ P(X ), where the density functions
pη are of the following form
(5.6) $p_\eta(x) := g^1(x)\eta_1 + g^2(x)\eta_2 + g^3(x)(1 - \eta_1 - \eta_2)$ for $x\in\mathcal{X}$.
Here g i , for i = 1, 2, 3, are nonnegative functions on X such that Eµ0 (g i ) = 1
and η = (η1 , η2 ) ∈ Db ⊂ R2 is a parameter, which will be specified as follows.
Let us divide the square D = [0, 1] × [0, 1] ⊂ R2 in smaller squares and color
them in black and white like a chessboard. Let Db be the closure of the
subset of D colored in black. If η is an interior point of Db then Cpη PX = R2 .
If η is a boundary point of Db then Cpη PX = R. If η is a corner point of Db ,
then Cpη PX consists of two intersecting lines.
Exercise 5.6. (cf. [AJLS2018, Theorem 2.1]) Let P be a statistical model.
Show that any v ∈ Cξ P is dominated by ξ. Hence the logarithmic represen-
tation of v
log v := dv/dξ
is an element of L1 (X , ξ).
Next we want to put a Riemannian metric on $P$, i.e., to put a positive quadratic form $\mathfrak{g}$ on each tangent space $T_\xi P$. By Exercise 5.6, the logarithmic representation $\log(C_\xi P)$ of $C_\xi P$ is a subspace of $L^1(\mathcal{X}, \xi)$. The space $L^1(\mathcal{X}, \xi)$ does not have a natural metric, but its subspace $L^2(\mathcal{X}, \xi)$ is a Hilbert space.
Definition 5.7. (1) A statistical model $P$ that satisfies
(5.7) $\log(C_\xi P)\subset L^2(\mathcal{X}, \xi)$
for all $\xi\in P$ is called almost 2-integrable.
(2) Assume that $P$ is an almost 2-integrable statistical model. For each $v, w\in C_\xi P$ the Fisher metric on $P$ is defined as follows
(5.8) $\mathfrak{g}(v, w) := \langle\log v, \log w\rangle_{L^2(\mathcal{X},\xi)} = \int_{\mathcal{X}}\log v\cdot\log w\, d\xi.$
open subset in $\mathbb{R}^n$. It follows from (5.8) that the Fisher metric on $P$ has the following form
(5.9) $\mathfrak{g}_{|p(\theta)}(dp(v), dp(w)) = \int_{\mathcal{X}}\frac{\partial_v p_\theta}{p_\theta}\cdot\frac{\partial_w p_\theta}{p_\theta}\, p_\theta\, d\mu_0,$
for any $v, w\in T_\theta\Theta$.
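A numerical sketch of formula (5.9) for the Gaussian location family $p_\theta = N(\theta, 1)$, whose Fisher metric in the coordinate $\theta$ is the constant 1; the finite-difference scheme and the quadrature grid are assumptions of this illustration.

```python
# Sketch: numerical evaluation of the Fisher metric (5.9) for N(theta, 1).
import numpy as np

theta, eps = 0.7, 1e-5
x = np.linspace(-15, 15, 20001)
dx = x[1] - x[0]

def p(t):  # density of N(t, 1)
    return np.exp(-(x - t)**2 / 2) / np.sqrt(2 * np.pi)

# partial_v p_theta / p_theta via a central finite difference in theta.
score = (p(theta + eps) - p(theta - eps)) / (2 * eps) / p(theta)
g = np.sum(score**2 * p(theta)) * dx      # the integral in (5.9)
print(g)  # approximately 1, the Fisher information of the location family
```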
Remark 5.9. (1) The Fisher metric was defined by Fisher in 1925 to characterize the “information” of a statistical model. One of the most notable applications of the Fisher metric is the Cramér-Rao inequality, which measures our ability to have a good density estimator in terms of the geometry of the underlying statistical model, see Theorem 5.16 below.
(2) The Fisher metric $\mathfrak{g}_{p(\theta)}(dp(v), dp(v))$ of a parameterized statistical model $P$ of dominated measures in Example 5.8 can be obtained from the Taylor expansion of the Kullback-Leibler divergence $I(p(\theta), p(\theta + \varepsilon v))$, assuming that $\log p_\theta$ is continuously differentiable in $\theta$ up to order 3. Indeed we have
$I(p(\theta), p(\theta + \varepsilon v)) = \int_{\mathcal{X}} p_\theta(x)\log\frac{p_\theta(x)}{p_{\theta+\varepsilon v}(x)}\, d\mu_0$
(5.10) $= -\varepsilon\int_{\mathcal{X}} p_\theta(x)\,\partial_v\log p_\theta(x)\, d\mu_0$
(5.11) $\quad -\frac{\varepsilon^2}{2}\int_{\mathcal{X}} p_\theta(x)\,(\partial_v)^2\log p_\theta(x)\, d\mu_0 + O(\varepsilon^3).$
Since $\log p_\theta(x)$ is continuously differentiable in $\theta$ up to order 3, we can apply differentiation under the integral sign, see e.g. [Jost2005, Theorem 16.11, p. 213], to (5.10), which then must vanish, and integration by parts to (5.11). Hence we obtain
$I(p(\theta), p(\theta + \varepsilon v)) = \frac{\varepsilon^2}{2}\,\mathfrak{g}_{p(\theta)}(dp(v), dp(v)) + O(\varepsilon^3),$
as required.
5.3. The Fisher metric, MSE and Cramér-Rao inequality. Given a
statistical model P ⊂ P(X ), we wish to measure the accuracy (also called
the efficiency) of an estimator σ̂ : X → P via MSE. For this purpose we need
further formalization by “linearizing” σ̂ via a composition with a mapping
$\varphi : P\to V$, where $V$ is a real Hilbert vector space with a scalar product $\langle\cdot,\cdot\rangle$ and the associated norm $\|\cdot\|$.
Remark 5.10. Our definition of an estimator agrees with the notion of an estimator in nonparametric statistics. In parametric statistics we consider a statistical model $P\subset P(\mathcal{X})$ that is parameterized by a parameter space $\Theta$, and an estimator is a map from $\mathcal{X}$ to $\Theta$.
Definition 5.11. A ϕ-estimator is a composition of an estimator σ̂ : X → P
and a map ϕ : P → V .
Remark 5.13. For $\hat\sigma\in L^2_\varphi(\mathcal{X}, P)$ the mean square error $MSE^\varphi_\xi(\hat\sigma)$ of the $\varphi$-estimator $\varphi\circ\hat\sigma$ is defined by
(5.13) $MSE^\varphi_\xi(\hat\sigma) := E_\xi(\|\varphi\circ\hat\sigma - \varphi(\xi)\|^2).$
The RHS of (5.13) is well-defined, since $\hat\sigma\in L^2_\varphi(\mathcal{X}, P)$ and therefore $\langle\varphi\circ\hat\sigma(x), \varphi\circ\hat\sigma(x)\rangle\in L^1(\mathcal{X}, \xi)$ and $\langle\varphi\circ\hat\sigma(x), \varphi(\xi)\rangle\in L^2(\mathcal{X}, \xi)$.
(5.15) $V^\varphi_\xi(\hat\sigma) = \sum_{i=1}^\infty V^\varphi_\xi[\hat\sigma](v_i, v_i).$
Then we define the MSE and variance of the $\varphi_k$-estimator $\varphi_k\circ\hat\sigma_k$, which can be estimated using the Cramér-Rao inequality. It turns out that the efficient unbiased estimator w.r.t. MSE is the MLE. The notion of a $\varphi$-estimator allows us to define the notion of a consistent sequence of $\varphi$-estimators, which formalizes the notion of asymptotically accurate estimators. Since the concept of a $\varphi$-estimator also encompasses the Parzen-Rosenblatt estimator, the notion of the accuracy of $\hat\sigma^*_k$ defined in (5.22) also agrees with the notion of accuracy of estimators defined in (3.15).
21 Vapnik considered the ERM algorithm also for the case where the function $\hat R^L_{S_n}$ may not attain its infimum on $\mathcal{H}$, using slightly different language from ours. In [SSBD2014] the authors assumed that a minimizer of $\hat R^L_{S_n}$ always exists.
we have
$\mu^m\{S\in\mathcal{Z}^m : |R^L_\mu(A_{erm}(S)) - R^L_{\mu,\mathcal{H}}| \le 2\varepsilon\} \ge$
$A : S\mapsto A_S$, any $f\in\mathcal{Y}^{\mathcal{X}}$ we have
(6.10) $\int_{\mathcal{Y}^{\mathcal{X}}}\int_{(\mathcal{X}\times\mathcal{Y})^n} R^{(0-1)}_{\mu_f}(A_S)\, d(\mu^n_f)(S)\, d\mu_{\mathcal{Y}^{\mathcal{X}}}(f) \ge \Big(1 - \frac{1}{\#\mathcal{Y}}\Big)\Big(1 - \frac{n}{\#\mathcal{X}}\Big).$
Proof of Lemma 6.10. For $S\in(\mathcal{X}\times\mathcal{Y})^n$ we set
$Pr_i : (\mathcal{X}\times\mathcal{Y})^n\to\mathcal{X},\ \big((x_1, y_1), \cdots, (x_n, y_n)\big)\mapsto x_i\in\mathcal{X},$
$X_S := \bigcup_{i=1}^n Pr_i(S).$
Note that $S$ is distributed by $\mu^n_f$ means that $S = \{(x_1, f(x_1)), \cdots, (x_n, f(x_n))\}$, so $S$ is essentially distributed by the uniform probability measure $(\mu^{\mathcal{X}}_{\mathcal{X}})^n$. Let us compute and estimate the double integral in the LHS of (6.10) using (2.9) and the Fubini theorem.
$E_{\mu_{\mathcal{Y}^{\mathcal{X}}}}E_{\mu^n_f} R^{(0-1)}_{\mu_f}(A_S) = E_{\mu_{\mathcal{Y}^{\mathcal{X}}}}E_{(\mu^{\mathcal{X}}_{\mathcal{X}})^n}\,\frac{1}{\#\mathcal{X}}\sum_{x\in\mathcal{X}}\big(1 - \delta^{A_S(x)}_{f(x)}\big)$
$\ge E_{\mu_{\mathcal{Y}^{\mathcal{X}}}}E_{(\mu^{\mathcal{X}}_{\mathcal{X}})^n}\,\frac{1}{\#\mathcal{X}}\sum_{x\notin X_S}\big(1 - \delta^{A_S(x)}_{f(x)}\big)$
$= E_{(\mu^{\mathcal{X}}_{\mathcal{X}})^n}\,\frac{1}{\#\mathcal{X}}\sum_{x\notin X_S}E_{\mu_{\mathcal{Y}^{\mathcal{X}}}}\big(1 - \delta^{A_S(x)}_{f(x)}\big)$
$= E_{(\mu^{\mathcal{X}}_{\mathcal{X}})^n}\,\frac{1}{\#\mathcal{X}}\,\#[\mathcal{X}\setminus X_S]\cdot\Big(1 - \frac{1}{\#\mathcal{Y}}\Big)$
(6.11) $\ge \Big(1 - \frac{1}{\#\mathcal{Y}}\Big)\Big(1 - \frac{n}{\#\mathcal{X}}\Big),$
since $\#[\mathcal{X}\setminus X_S]\ge\#\mathcal{X} - n$. This completes the proof of Lemma 6.10.
Continuation of the proof of Theorem 6.9. It follows from Lemma 6.10 that there exists $f\in\mathcal{Y}^{C[2m]}$ such that, denoting $\mu := \mu_f$, we have
(6.12) $\int_{(\mathcal{X}\times\mathcal{Y})^m} R^{(0-1)}_\mu(A_S)\, d\mu^m = \int_{(C[2m]\times\mathcal{Y})^m} R^{(0-1)}_\mu(A_S)\, d(\mu^m)(S) \ge \frac14.$
Since $0\le R^{(0-1)}_\mu\le 1$ we obtain from (6.12)
$\mu^m\{S\in(\mathcal{X}\times\mathcal{Y})^m\ \big|\ R^{(0-1)}_\mu(A(S))\ge\frac18\} > \frac18.$
This implies that (6.3) does not hold for $(\varepsilon, \delta) = (1/8, 1/8)$, for any $m$ and $\mu(m) = \mu_f$ with $f\in\mathcal{Y}^{C[m]}$. This proves Theorem 6.9.
Remark 6.11. In the proof of Theorem 6.9 we showed that if there is a
subset C ⊂ X of size 2m and the restriction of H to C is the full set of
functions in {0, 1}C then (6.3) does not hold for any learning algorithm A,
m ∈ N and (ε, δ) = (1/8, 1/8). This motivates the following
Definition 6.12. (1) A hypothesis class $\mathcal{H}\subset\{0,1\}^{\mathcal{X}}$ shatters a finite subset $C\subset\mathcal{X}$ if $\#\mathcal{H}_{|C} = 2^{\#C}$.
(2) The VC-dimension of a hypothesis class $\mathcal{H}\subset\{0,1\}^{\mathcal{X}}$, denoted by $VC\dim(\mathcal{H})$, is the maximal size of a set $C\subset\mathcal{X}$ that can be shattered by $\mathcal{H}$. If $\mathcal{H}$ can shatter sets of arbitrarily large size we say that $\mathcal{H}$ has infinite VC-dimension.
Example 6.13. (1) A hypothesis class $\mathcal{H}$ shatters a subset of one point $x_0\in\mathcal{X}$ if and only if there are two functions $f, g\in\mathcal{H}$ such that $f(x_0)\neq g(x_0)$.
(2) Let $\mathcal{H}$ be the class of intervals in the real line, namely,
$\mathcal{H} = \{1_{(a,b)} : a < b\in\mathbb{R}\},$
where $1_{(a,b)} : \mathbb{R}\to\{0, 1\}$ is the indicator function of the interval $(a, b)$. Take the set $C = \{1, 2\}$. Then $\mathcal{H}$ shatters $C$, since all the functions in the set $\{0, 1\}^{\{1,2\}}$ can be obtained as the restriction of some function from $\mathcal{H}$ to $C$. Hence $VC\dim(\mathcal{H})\ge 2$. Now take an arbitrary set $C = \{c_1 < c_2 < c_3\}$ and the corresponding labeling $(1, 0, 1)$. Clearly this labeling cannot be obtained by an interval: any interval $h_{(a,b)}$ that contains $c_1$ and $c_3$ (and hence labels $c_1$ and $c_3$ with the value 1) must contain $c_2$, and hence it labels $c_2$ with 1 instead of the required 0. Hence $\mathcal{H}$ does not shatter $C$. We therefore conclude that $VC\dim(\mathcal{H}) = 2$. Note that $\mathcal{H}$ has infinitely many elements.
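The brute-force check below replays Example 6.13 (2) computationally; the helper functions and the search grid are our assumptions, not part of the notes.

```python
# Sketch: checking shattering by intervals over a finite grid of endpoints.
from itertools import product

def interval_labels(C, a, b):
    """Restriction of the hypothesis 1_(a,b) to the finite set C."""
    return tuple(int(a < c < b) for c in C)

def shattered(C, grid):
    """Can intervals (a, b) with a, b in grid realize all labelings of C?"""
    realized = {interval_labels(C, a, b) for a in grid for b in grid if a < b}
    return all(lab in realized for lab in product([0, 1], repeat=len(C)))

grid = [x / 2 for x in range(-2, 10)]
print(shattered([1, 2], grid))      # True: VC dim >= 2
print(shattered([1, 2, 3], grid))   # False: labeling (1, 0, 1) unrealizable
```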
Exercise 6.14 (VC-dimension of threshold functions). Consider the hypothesis class $\mathcal{F}\subset\{-1, 1\}^{\mathbb{R}}$ of all threshold functions $\mathrm{sign}_b : \mathbb{R}\to\{-1, 1\}$, where $b\in\mathbb{R}$, defined by
$\mathrm{sign}_b(x) := \mathrm{sign}(x - b).$
Show that $VC\dim(\mathcal{F}) = 1$.
Exercise 6.15. Let us reconsider the toy Example 2.2, where we want to predict skin diseases by examination of skin images. Recall that $\mathcal{X} = \cup_{i=1}^5 I_1\times I_2\times I_3\times\{A_i\}$ and $\mathcal{Y} = \{\pm1\}$. Recall that a function $f : \mathcal{X}\to\mathcal{Y}$ can be identified with a subset $f^{-1}(1)\subset\mathcal{X}$. Let us consider the hypothesis class $\mathcal{H}\subset\mathcal{Y}^{\mathcal{X}}$ consisting of cubes $[a_1, b_1]\times[a_2, b_2]\times[a_3, b_3]\times\{A_i\}\subset\mathcal{X}$, $i = 1, \cdots, 5$. Prove that $VC\dim(\mathcal{H}) < \infty$.
In Remark 6.11 we observed that the finiteness of V C dim H is a necessary
condition for the existence of a uniformly consistent learning algorithm on
a unified learning model (X , H, L(0−1) , PX (X × Y)). In the next section we
shall show that the finiteness of V C dim H is also a sufficient condition for
the uniform consistency of Aerm on (X , H, L0−1 , PX (X × Y)).
Outline of the proof. Note that the “only if” assertion of Theorem 6.16
follows from Remark 6.11. Thus we need only to prove the “if” assertion. By
Lemma 6.5 it suffices to show that if V C dim(H) = k < ∞ then mH (ε, δ) <
∞ for all (ε, δ) ∈ (0, 1)2 . In other words we need to find a lower bound for
the LHS of (6.5) in terms of the VC-dimension, which is an upper bound of
the RHS of (6.5), when ε ∈ (0, 1) and m is sufficiently large. This shall be
done in three steps.
23 In [CS2001, p. 8] the authors used the $L_\infty$-norm, but they considered only the subspace of continuous functions.
In the second step, we reduce the problem of estimating upper bound for
the sample complexity mH to the problem of estimating upper bound for
the sample complexities mDj , where {Dj | j ∈ [1, l]} is a cover of H, and
using the covering number. Namely we have the following easy inequality
(7.3) $\rho^m\{S\in Z^m\mid\sup_{f\in\mathcal{H}}|MSE_\rho(f) - MSE_S(f)|\ge\varepsilon\}\le\sum_{j=1}^l\rho^m\{S\in Z^m\mid\sup_{f\in D_j}|MSE_\rho(f) - MSE_S(f)|\ge\varepsilon\}.$
Combining the last relation with (7.3), we derive the following desired upper estimate for the sample complexity $m_{\mathcal{H}}$.
Proposition 7.4. Assume that for all $f\in\mathcal{H}$ we have $|f(x) - y|\le M$ $\rho$-a.e. Then for all $\varepsilon > 0$ we have
$\rho^m\{z\in Z^m : \sup_{f\in\mathcal{H}}|MSE_\rho(f) - MSE_z(f)|\le\varepsilon\}\ge 1 - N\Big(\mathcal{H}, \frac{\varepsilon}{8M}\Big)\, 2\exp\Big(-\frac{m\varepsilon^2}{4(2\sigma^2 + \frac13 M^2\varepsilon)}\Big)$
where $\sigma^2 = \sup_{f\in\mathcal{H}} V_\rho(f^2_Y)$.
This completes the proof of Theorem 7.1.
Exercise 7.5. Let L2 denote the instantaneous quadratic loss function in
(2.16). Derive from Theorem 7.1 an upper bound for the sample complexity
mH (ε, δ) of the learning model (X , H ⊂ Cn (X ), L2 , PB (X ×Rn )), where H is
compact and PB (X × Rn ) is the space of Borel measures on the topological
space X × Rn .
Remark 7.6. If the hypothesis class $\mathcal{H}$ in Theorem 7.1 is convex, then Cucker-Smale obtained an improved estimate of the sample complexity $m_{A_{erm}}$ [CS2001, Theorem C*].
7.2. Rademacher complexities and sample complexity. Rademacher complexities of a learning model $(\mathcal{Z}, \mathcal{H}, L, P)$ are refined versions of the sample complexity $m_{\mathcal{H}}$, obtained by adding a parameter $S\in\mathcal{Z}^n$ for the empirical Rademacher complexity, and by adding a parameter $\mu\in P$ for the expected version, the (expected) Rademacher complexity. Rademacher complexities are designed for data-dependent estimation of upper bounds of the sample complexity $m_{A_{erm}}$ of the ERM algorithm.
For a learning model $(\mathcal{Z}, \mathcal{H}, L, P)$ we set
(7.4) $\mathcal{G}^L_{\mathcal{H}} := \{g_h : \mathcal{Z}\to\mathbb{R},\ g_h(z) = L(z, h)\mid h\in\mathcal{H}\}.$
Definition 7.7 (Rademacher complexity). Let $S\in\mathcal{Z}^n$. The empirical Rademacher complexity of a family $\mathcal{G}\subset\mathbb{R}^{\mathcal{Z}}$ w.r.t. a sample $S$ is defined as follows
$\hat{\mathcal{R}}_S(\mathcal{G}) := E_{(\mu^{Z_2}_{Z_2})^n}\Big[\sup_{g\in\mathcal{G}}\frac{1}{n}\sum_{i=1}^n\sigma_i g(z_i)\Big],$
where $\{\sigma_i\in Z_2\mid i\in[1, n]\}$ and $\mu^{Z_2}_{Z_2}$ is the counting measure on $Z_2$, see (6.9).
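The following Monte Carlo sketch estimates the empirical Rademacher complexity of a finite class of threshold functions on a fixed sample; the class, the sample, and the number of trials are illustrative assumptions.

```python
# Sketch: Monte Carlo estimate of the empirical Rademacher complexity.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=50)                       # fixed sample Pr(S)
H = [np.sign(x - b) for b in np.linspace(-1, 1, 21)]  # threshold functions

def empirical_rademacher(H, n_trials=5000):
    """hat R_S(H) = E_sigma sup_h (1/n) sum_i sigma_i h(x_i)."""
    n = len(H[0])
    sups = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)       # Rademacher variables
        sups.append(max(float(sigma @ h) / n for h in H))
    return float(np.mean(sups))

print(f"empirical Rademacher complexity: {empirical_rademacher(H):.3f}")
```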
(7.5) $\hat{\mathcal{R}}_S(\mathcal{G}^{(0-1)}_{\mathcal{H}}) = \frac12\hat{\mathcal{R}}_{Pr(S)}(\mathcal{H}).$
Using the identity
$L^{(0-1)}(x, y, h) = 1 - \delta^y_{h(x)} = \frac12(1 - y\, h(x))$
we compute
$\hat{\mathcal{R}}_S(\mathcal{G}^{(0-1)}_{\mathcal{H}}) = E_{(\mu^{Z_2}_{Z_2})^m}\Big[\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m\sigma_i\big(1 - \delta^{y_i}_{h(x_i)}\big)\Big]$
$= E_{(\mu^{Z_2}_{Z_2})^m}\Big[\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m\sigma_i\frac{1 - y_i h(x_i)}{2}\Big]$
$\overset{E\sigma_i = 0}{=}\ \frac12\, E_{(\mu^{Z_2}_{Z_2})^m}\Big[\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m -\sigma_i y_i h(x_i)\Big]$
$= \frac12\, E_{(\mu^{Z_2}_{Z_2})^m}\Big[\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^m\sigma_i h(x_i)\Big] = \frac12\hat{\mathcal{R}}_{Pr(S)}(\mathcal{H}),$
which is what was required to prove.
We have the following relation between the empirical Rademacher complexity and the Rademacher complexity, using the McDiarmid concentration inequality, see (B.7) and [MRT2012, (3.14), p. 36]:
(7.6) $\mu^n\{S\in\mathcal{Z}^n\mid\mathcal{R}_{n,\mu}(\mathcal{G}^L_{\mathcal{H}})\le\hat{\mathcal{R}}_S(\mathcal{G}^L_{\mathcal{H}}) + \sqrt{\tfrac{\ln(2/\delta)}{2n}}\}\ge 1 - \delta/2.$
Theorem 7.9. (see e.g. [SSBD2014, Theorems 26.3, 26.5, p. 377-378]) Assume that $(\mathcal{Z}, \mathcal{H}, L, \mu)$ is a learning model with $|L(z, h)| < c$ for all $z\in\mathcal{Z}$ and all $h\in\mathcal{H}$. Then for any $\delta > 0$ and any $h\in\mathcal{H}$ we have
(7.7) $\mu^n\{S\in\mathcal{Z}^n\mid R^L_\mu(h) - \hat R^L_S(h)\le\mathcal{R}_{n,\mu}(\mathcal{G}^L_{\mathcal{H}}) + c\sqrt{\tfrac{2\ln(2/\delta)}{n}}\}\ge 1 - \delta,$
(7.8) $\mu^n\{S\in\mathcal{Z}^n\mid R^L_\mu(h) - \hat R^L_S(h)\le\hat{\mathcal{R}}_S(\mathcal{G}^L_{\mathcal{H}}) + 4c\sqrt{\tfrac{2\ln(4/\delta)}{n}}\}\ge 1 - \delta,$
(7.9) $\mu^n\{S\in\mathcal{Z}^n\mid R^L_\mu(A_{erm}(S)) - R^L_\mu(h)\le 2\hat{\mathcal{R}}_S(\mathcal{G}^L_{\mathcal{H}}) + 5c\sqrt{\tfrac{2\ln(8/\delta)}{n}}\}\ge 1 - \delta,$
(7.10) $E_{\mu^n}\big(R^L_\mu(A_{erm}(S)) - R^L_{\mu,\mathcal{H}}\big)\le 2\mathcal{R}_{n,\mu}(\mathcal{G}^L_{\mathcal{H}}).$
From (7.10), using the Markov inequality, we obtain the following bound for the sample complexity $m_{A_{erm}}$ in terms of the Rademacher complexity:
(7.11) $\mu^n\{S\in\mathcal{Z}^n\mid R^L_\mu(A_{erm}(S)) - R^L_{\mu,\mathcal{H}}\le\tfrac{2\mathcal{R}_{n,\mu}(\mathcal{G}^L_{\mathcal{H}})}{\delta}\}\ge 1 - \delta.$
Remark 7.10. (1) The first two assertions of Theorem 7.9 give an upper bound of a “half” of the sample complexity $m_{\mathcal{H}}$ of a unified learning model $(\mathcal{Z}, \mathcal{H}, L, \mu)$ by the (empirical) Rademacher complexity of the associated family $\mathcal{G}^L_{\mathcal{H}}$. The last assertion of Theorem 7.9 is derived from the second assertion and the Hoeffding inequality.
(2) For the binary classification problem $(\mathcal{X}\times\{0,1\}, \mathcal{H}\subset\{0,1\}^{\mathcal{X}}, L^{(0-1)}, P(\mathcal{X}\times\{0,1\}))$ there exists a close relationship between the Rademacher complexity and the growth function $\Gamma_{\mathcal{H}}(m)$, see [MRT2012, Lemma 3.1, Theorem 3.2, p. 37] for a detailed discussion.
7.3. Model selection. The choice of the right prior information in machine learning is often interpreted as the choice of the right class $\mathcal{H}$ in a learning model $(\mathcal{Z}, \mathcal{H}, L, P)$, which is also called model selection. A right choice of $\mathcal{H}$ should strike a balance between the approximation error and the estimation error of $\mathcal{H}$, defined in the error decomposition of $\mathcal{H}$.
7.3.1. Error decomposition. We assume that the maximum domain of the
expected loss function RµL is a subspace HL,µ ⊃ H, given L and a probability
measure µ ∈ P .
We define the Bayes risk of the learning problem $R^L_\mu$ on the maximal domain $\mathcal{H}_{L,\mu}$ as follows:
$R^L_{b,\mu} := \inf_{h\in\mathcal{H}_{L,\mu}} R^L_\mu(h).$
Recall that $R^L_{\mu,\mathcal{H}} := \inf_{h\in\mathcal{H}} R^L_\mu(h)$ quantifies the optimal performance of a learner in $\mathcal{H}$. Then we decompose the difference between the expected risk of a predictor $h\in\mathcal{H}$ and the Bayes risk as follows:
(7.12) $R^L_\mu(h) - R^L_{b,\mu} = \big(R^L_\mu(h) - R^L_{\mu,\mathcal{H}}\big) + \big(R^L_{\mu,\mathcal{H}} - R^L_{b,\mu}\big).$
The first term in the RHS of (7.12) is called the estimation error of h, cf.
(7.1), and the second term is called the approximation error. If h = Aerm (S)
is a minimizer of the empirical risk R̂SL , then the estimation error of Aerm (S)
is also called the sample error [CS2001, p. 9].
original SVM algorithm is the hard SVM algorithm, which was invented by
Vapnik and Chervonenkis in 1963. The current standard incarnation (soft
margin) was proposed by Cortes and Vapnik in 1993 and published in 1995.
8.1. Linear classifier and hard SVM. For $(w, b)\in V\times\mathbb{R}$ we set
(8.1) $f_{(w,b)}(x) := \langle w, x\rangle + b.$
Definition 8.1. A linear classifier is a function $\mathrm{sign}\, f_{(w,b)} : V\to Z_2$, $x\mapsto\mathrm{sign}\, f_{(w,b)}(x)\in\{-1, 1\} = Z_2$.
We identify each linear classifier with the half space $H^+_{(w,b)} := (\mathrm{sign}\, f_{(w,b)})^{-1}(1) = f_{(w,b)}^{-1}(\mathbb{R}_{\ge0})\subset V$ and set $H_{(w,b)} := f_{(w,b)}^{-1}(0)\subset V$. Note that each hyperplane $H_{(w,b)}\subset V$ defines $H^+_{(w,b)}$ up to a reflection of $V$ around $H_{(w,b)}$ and therefore defines the affine function $f_{(w,b)}$ up to a multiplicative factor $\lambda\in\mathbb{R}^*$. Let
$\mathcal{H}_A(V) := \{H_{(w,b)}\subset V\mid (w, b)\in V\times\mathbb{R}\}$
be the set of all hyperplanes in the affine space $V$. Then $\mathcal{H}_{lin}(V)$ is a double cover of $\mathcal{H}_A(V)$ with the natural projection $\pi : \mathcal{H}_{lin}(V)\to\mathcal{H}_A(V)$ defined above.
Definition 8.2. A training sample $S = ((x_1, y_1), \cdots, (x_m, y_m))\in(V\times\{\pm1\})^m$ is called separable, if there is a half space $H^+_{(w,b)}\subset V$ that correctly classifies $S$, i.e. for all $i\in[1, m]$ we have $x_i\in H^+_{(w,b)}$ iff $y_i = 1$. In other words, the linear classifier $\mathrm{sign}\, f_{(w,b)}$ is a minimizer of the empirical risk function $\hat R^{0-1}_S : \mathcal{H}_{lin}(V)\to\mathbb{R}$ associated to the 0-1 loss function $L^{(0-1)}$.
Remark 8.3. (1) A half space $H^+_{(w,b)}$ correctly classifies $S$ if and only if the empirical risk function $\hat R^{(0-1)}_S(f_{(w,b)}) = 0$ and $\mathrm{sign}\, f_{(w,b)}$ is a linear classifier associated to $H^+_{(w,b)}$.
(2) Write $S = S_+\cup S_-$ where
$S_\pm := \{(x, y)\in S\mid y = \pm1\}.$
Let $Pr : (V\times\{\pm1\})^m\to V^m$ denote the canonical projection. Then $S$ is separable if and only if there exists a hyperplane $H_{(w,b)}$ that separates $[Pr(S_+)]$ and $[Pr(S_-)]$, where recall that $[(x_1, \cdots, x_m)] = \cup_{i=1}^m\{x_i\}\subset V$. In this case we say that $H_{(w,b)}$ correctly separates $S$.
(3) If a training sample $S$ is separable then the separating hyperplane is not unique, and hence there are many minimizers of the empirical risk function $\hat R^{(0-1)}_S$. Thus, given $S$, we need to find a strategy for selecting one of these ERMs, or equivalently for selecting a separating hyperplane $H_{(w,b)}$, since the associated half-space $H^+_{(w,b)}$ is defined by $H_{(w,b)}$ and any training value $(x_i, y_i)$. The standard approach in the SVM framework is to choose $H_{(w,b)}$ that maximizes the distance to the closest points $x_i\in[Pr(S)]$. This approach is called the hard SVM rule. To formulate the hard SVM rule we need a formula for the distance of a point to a hyperplane $H_{(w,b)}$.
$A_{hs} : \cup_m(V\times Z_2)^m\to\mathcal{H}_{lin},$
$A_{hs}(S)\in\pi^{-1}(A^*_{hs}(S))\quad\text{and}\quad A_{hs}(S)\cap\{(x, -1)\mid x\in V\} = \emptyset.$
Definition 8.5. A hard SVM is a learning machine $(V\times Z_2, \mathcal{H}_{lin}(V), L^{(0-1)}, P(V\times Z_2), A_{hs})$.
The domain of the optimization problem in (8.4) is HS , which is not easy
to determine. So we replace this problem by another optimization problem
over a larger convex domain as follows.
Lemma 8.6. For $S = \{(x_1, y_1), \cdots, (x_m, y_m)\}$ we have
(8.5) $A_{hs}(S) = H^+_{(w,b)}$ where $(w, b) = \arg\max_{(w,b):\|w\|\le1}\min_i y_i(\langle w, x_i\rangle + b).$
Proof. If $H_{(w,b)}$ separates $S$ then $\rho(S, H_{(w,b)}) = \min_i y_i(\langle w, x_i\rangle + b)$. Since the constraint $\|w\|\le 1$ does not affect $H_{(w,b)}$, which is invariant under a positive rescaling, (8.3) implies that
(8.6) $\max_{(w,b):\|w\|\le1}\min_i y_i(\langle w, x_i\rangle + b)\ge\max_{H_{(w,b)'}\in\mathcal{H}_S}\rho(S, H_{(w,b)'}).$
produces a solution $(w, b) := (w_0/\|w_0\|, b_0/\|w_0\|)$ of the optimization problem (8.5).
Proof. Let $(w_0, b_0)$ be a solution of (8.7). We shall show that $(w_0/\|w_0\|, b_0/\|w_0\|)$ is a solution of (8.5). It suffices to show that the margin of the hyperplane $H_{(w_0,b_0)}$ is greater than or equal to the margin of the hyperplane associated to a (and hence any) solution of (8.5).
Let $(w^*, b^*)$ be a solution of Equation (8.5). Set
$\gamma^* := \min_i y_i(\langle w^*, x_i\rangle + b^*),$
which is the margin of the hyperplane $H_{(w^*,b^*)}$ by (8.3). Therefore for all $i$ we have
$y_i(\langle w^*, x_i\rangle + b^*)\ge\gamma^*,$
or equivalently
$y_i\Big(\Big\langle\frac{w^*}{\gamma^*}, x_i\Big\rangle + \frac{b^*}{\gamma^*}\Big)\ge 1.$
Hence the pair $(\frac{w^*}{\gamma^*}, \frac{b^*}{\gamma^*})$ satisfies the condition of the quadratic optimization problem in (8.7). It follows that
$\|w_0\|\le\Big\|\frac{w^*}{\gamma^*}\Big\| = \frac{1}{\gamma^*}.$
Hence for all $i$ we have
$y_i\Big(\Big\langle\frac{w_0}{\|w_0\|}, x_i\Big\rangle + \frac{b_0}{\|w_0\|}\Big) = \frac{y_i(\langle w_0, x_i\rangle + b_0)}{\|w_0\|}\ge\frac{1}{\|w_0\|}\ge\gamma^*.$
This implies that the margin of $H_{(w_0,b_0)}$ satisfies the required condition. This completes the proof of Proposition 8.7.
The loss function for the soft SVM learning machine is the hinge loss function $L^{hinge} : \mathcal{H}_{lin}(V)\times(V\times\{\pm1\})\to\mathbb{R}$ defined as follows:
(8.12) $L^{hinge}(h_{(w,b)}, (x, y)) := \max\{0, 1 - y(\langle w, x\rangle + b)\}.$
Hence the empirical hinge risk function is defined as follows for $S = \{(x_1, y_1), \cdots, (x_m, y_m)\}$:
$R^{hinge}_S(h_{(w,b)}) = \frac{1}{m}\sum_{i=1}^m\max\{0, 1 - y_i(\langle w, x_i\rangle + b)\}.$
Lemma 8.11. The Equation (8.10) with constraint (8.11) for $A_{ss}$ is equivalent to the following regularized risk minimization problem, which does not depend on the slack variables $\xi$:
(8.13) $A_{ss}(S) = \arg\min_{f_{(w,b)}}\big(\lambda\|w\|^2 + R^{hinge}_S(f_{(w,b)})\big)\in\mathcal{H}_{lin}.$
Proof. Let us fix $(w_0, b_0)$ and minimize the RHS of (8.10) under the constraint (8.11). It is straightforward to see that $\xi_i = L^{hinge}\big((w, b), (x_i, y_i)\big)$. Using this and comparing (8.10) with (8.13), we complete the proof of Lemma 8.11.
From Lemma 8.11 we immediately obtain the following
Corollary 8.12 (Definition). A soft SVM is a learning machine $(V\times Z_2, \mathcal{H}_{lin}, L^{hinge}, P(V\times Z_2), A_{rlm})$.
Remark 8.13. The hinge loss function $L^{hinge}$ enjoys several good properties that justify the preference for $L^{hinge}$ as a loss function over the zero-one loss function $L^{(0-1)}$, see [SSBD2014, Subsection 12.3, p. 167] for a discussion.
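The following sketch minimizes the regularized hinge risk (8.13) by subgradient descent; the synthetic data, the step size, and the regularization constant are illustrative assumptions, and the subgradient used is the standard one for the hinge loss.

```python
# Sketch: soft SVM (8.13) via subgradient descent on lambda ||w||^2 + hinge risk.
import numpy as np

rng = np.random.default_rng(6)
m = 200
X = rng.normal(size=(m, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable labels

lam, eta, T = 0.01, 0.1, 2000
w, b = np.zeros(2), 0.0
for t in range(T):
    margins = y * (X @ w + b)
    active = margins < 1                          # points with nonzero hinge loss
    # Subgradient of lambda ||w||^2 + (1/m) sum_i max(0, 1 - y_i(<w,x_i> + b)).
    grad_w = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / m
    grad_b = -y[active].sum() / m
    w, b = w - eta * grad_w, b - eta * grad_b

hinge = np.mean(np.maximum(0.0, 1 - y * (X @ w + b)))
print("w:", np.round(w, 3), "b:", round(b, 3), "hinge risk:", round(hinge, 4))
```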
8.3. Sample complexities of SVM.
Exercise 8.14. Prove that $VC\dim\mathcal{H}_{lin}(V) = \dim V + 1$.
Hint. To show that $d + 1$ points $S_{d+1}$ in general position in $\mathbb{R}^d$ can be shattered by $\mathcal{H}_{lin}(V)$, it suffices to show that there is a half space separating any subset in $S_{d+1}$. Next we show that there is a configuration $S_{d+2}$ of $d + 2$ points in
(9.5) $\arg\min_{\alpha\in\mathbb{R}^m}\Big[f\Big(\sum_{j=1}^m\alpha_j G_{j1}, \cdots, \sum_{j=1}^m\alpha_j G_{jm}\Big) + R\Big(\sqrt{\sum_{i,j=1}^m\alpha_i\alpha_j G_{ji}}\Big)\Big].$
To compute (9.6) we need to know only the kernel function $K$, and not the mapping $\psi$ nor the inner product $\langle\cdot,\cdot\rangle$ on the Hilbert space $W$. This motivates the following question.
Problem 9.3. Find a sufficient and necessary condition for a kernel function, also called a kernel, $K : \mathcal{X}\times\mathcal{X}\to\mathbb{R}$, such that $K$ can be written as $K(x, x') = \langle\psi(x), \psi(x')\rangle$ for a feature mapping $\psi : \mathcal{X}\to W$, where $W$ is a real Hilbert space.
Definition 9.4. If K satisfies the condition in Problem 9.3 we shall say
that K is generated by a (feature) mapping ψ. The target Hilbert space is
also called a feature space.
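Anticipating the answer to Problem 9.3 (positive semi-definiteness, treated in the next subsection), the following sketch checks numerically that the Gram matrix of the Gaussian kernel is PSD; the kernel choice and the test points are assumptions of this illustration.

```python
# Sketch: the Gram matrix of the Gaussian kernel is positive semi-definite,
# consistent with K(x, x') = <psi(x), psi(x')> for some feature map psi.
import numpy as np

rng = np.random.default_rng(7)

def gaussian_kernel(x, xp, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

points = rng.normal(size=(20, 3))
G = np.array([[gaussian_kernel(p, q) for q in points] for p in points])

eigenvalues = np.linalg.eigvalsh(G)   # all >= 0 (up to rounding) <=> PSD
print("smallest eigenvalue of the Gram matrix:", eigenvalues.min())
```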
9.2. PSD kernels and reproducing kernel Hilbert spaces.
Hence the evaluation map evx extends to a bounded linear map on the
completion H(K) of H0 (K). It follows that H(K) is a RKHS.
To show the uniqueness of a RKHS $\mathcal{H}$ such that $K$ is the reproducing kernel of $\mathcal{H}$, we assume that there exists another RKHS $\mathcal{H}'$ such that for all $x, y\in\mathcal{X}$ there exist $k_x, k_y\in\mathcal{H}'$ with the following properties:
$K(x, y) = \langle k_x, k_y\rangle\quad\text{and}\quad f(x) = \langle f, k_x\rangle\ \text{for all}\ f\in\mathcal{H}'.$
We define a map $g : \mathcal{H}(K)\to\mathcal{H}'$ by setting $g(K_x) = k_x$. It is not hard to see that $g$ is an isometric embedding. To show that $g$ extends to an isometry it suffices to show that the set $\{k_x\mid x\in\mathcal{X}\}$ spans a dense subspace of $\mathcal{H}'$. Assume the opposite, i.e., there exists $f\in\mathcal{H}'$, $f\neq 0$, such that $\langle f, k_x\rangle = 0$ for all $x$. But this implies that $f(x) = 0$ for all $x$ and hence $f = 0$, a contradiction. This completes the proof of Theorem 9.11.
24 In other words, the vector structure on $\mathcal{H}$ is induced from the vector structure on $\mathbb{R}$ via the evaluation map.
9.4. Conclusion. In this section we learned the kernel trick, which simplifies the algorithms solving the hard SVM and soft SVM optimization problems by embedding patterns into a Hilbert space. The kernel trick is based on the theory of RKHS and has many applications, e.g., for defining a feature map $\varphi : P(\mathcal{X})\to V$, where $V$ is a RKHS, see e.g. [MFSS2017]. The main difficulty of the kernel method is that we still have no general method of selecting a suitable kernel for a concrete problem. Another open problem is to improve the upper bound for the sample complexity of SVM algorithms, i.e., to find new conditions on $\mu\in P$ such that the sample complexity of $A_{hk}$, $A_{sk}$ computed w.r.t. $\mu$ is bounded.
• The $i$-th input node gives the output $x_i$. If the input space is $\mathbb{R}^n$ then we have $n + 1$ input nodes, one of which is the “constant” neuron, whose output is 1.
• There is a neuron in the hidden layer that has no incoming edges. This neuron outputs the constant $\sigma(0)$.
• A feedforward neural network (FNN) has an underlying acyclic directed graph. Each FNN $(V, E, w, \sigma)$ represents a multivariate multivariable function $h_{V,E,\sigma,w} : \mathbb{R}^n\to\mathbb{R}^m$ which is obtained by composing the computing instructions of each neuron on directed paths from input neurons to output neurons; see the sketch after this list. For each architecture $(V, E, \sigma)$ of a FNN we denote by
$\mathcal{H}_{V,E,\sigma} = \{h_{V,E,\sigma,w} : w\in\mathbb{R}^E\}$
the underlying hypothesis class of functions from the input space to the output space of the network.
• A recurrent neural network (RNN) has an underlying directed graph with a cycle. By unrolling cycles in a RNN in discrete time $n\in\mathbb{N}$, a RNN defines a map $r : \mathbb{N}^+\to\{\text{FNN}\}$ such that $[r(n)]\subset[r(n+1)]$, where $[r(n)]$ is the underlying graph of $r(n)$, see [GBC2016, §10.2, p. 368] and [Graves2012, §3.2, p. 22]. Thus a RNN can be regarded as a sequence of multivariate multivariable functions which serves as a discriminative model for supervised sequence labelling.
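The sketch announced above: a forward pass $h_{V,E,\sigma,w}$ of a one-hidden-layer FNN with sigmoid activation, where the layer sizes and the random weights are illustrative assumptions.

```python
# Sketch: forward pass of a small FNN with one hidden layer.
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

def fnn_forward(x, W1, b1, W2, b2):
    """Compose the computing instruction of each layer on directed paths
    from the input neurons (plus the constant neuron) to the outputs."""
    hidden = sigma(W1 @ x + b1)   # b1 plays the role of the constant neuron
    return W2 @ hidden + b2

rng = np.random.default_rng(9)
n_in, n_hidden, n_out = 3, 4, 2
W1, b1 = rng.normal(size=(n_hidden, n_in)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), rng.normal(size=n_out)
print(fnn_forward(np.array([0.5, -1.0, 2.0]), W1, b1, W2, b2))
```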
It follows from Proposition 10.9 that the sample complexity of the ERM algorithm for $(V\times Z_2, \mathcal{H}_{V,E,\mathrm{sign}}, L^{(0-1)}, P(V\times Z_2))$ is finite. But the running time of the ERM algorithm in a neural network $\mathcal{H}_{V,E,\mathrm{sign}}$ is non-polynomial and
The set of subgradients of f at w is also called the differential set and denoted
by ∂f (w).
Exercise 11.2. (1) Show that if $f$ is differentiable at $w$ then $\partial f(w)$ contains a single element, $\nabla f(w)$.
(2) Find a subgradient of the generalized hinge loss function $f_{a,b,c}(w) = \max\{a, 1 - b\langle w, c\rangle\}$, where $a, b\in\mathbb{R}$, $w, c\in\mathbb{R}^N$, and $\langle\cdot,\cdot\rangle$ is a scalar product.
Remark 11.3. It is known that a subgradient of a function f on a con-
vex open domain S exists at every point w ∈ S iff f is convex, see e.g.
[SSBD2014, Lemma 14.3].
• The gradient descent algorithm discretizes the solution of the gradient flow equation (11.2). We begin with an arbitrary initial point $w_0\in\mathbb{R}^N$. We set
(11.4) $w_{n+1} = w_n - \gamma_n\nabla f(w_n),$
where $\gamma_n\in\mathbb{R}^+$ is a constant, called a “learning rate” in machine learning, to be optimized. This algorithm can be slightly modified. For example, after $T$ iterations we set the output point $\bar w_T$ to be
(11.5) $\bar w_T := \frac{1}{T}\sum_{i=1}^T w_i,$
or
(11.6) $\bar w_T := \arg\min_{w_i,\ i\in[1,T]} f(w_i).$
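A sketch of the update (11.4) with the averaged output (11.5) on a simple convex function; the objective and the learning rate are illustrative assumptions.

```python
# Sketch: gradient descent (11.4) with averaged output (11.5).
import numpy as np

def f(w):
    return float((w**2).sum())       # convex, minimized at w = 0

def grad_f(w):
    return 2 * w

w = np.array([3.0, -2.0])            # arbitrary initial point w_0
gamma = 0.1                          # constant learning rate gamma_n
iterates = []
for n in range(100):
    w = w - gamma * grad_f(w)        # update (11.4)
    iterates.append(w)

w_bar = np.mean(iterates, axis=0)    # averaged output (11.5)
print("last iterate:", w, "averaged output:", w_bar, "f:", f(w))
```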
In particular, for every $r, \rho > 0$, if for all $t\in[1, T]$ we have $\|v_t\|\le\rho$ and if we set $\eta = (r/\rho)T^{-1/2}$, then for $\|w^*\|\le r$ we have
(11.10) $\frac{1}{T}\sum_{t=1}^T\langle w_t - w^*, v_t\rangle\le\frac{r\rho}{\sqrt T}.$
$f(\bar w_T) - f(w^*) = f\Big(\frac{1}{T}\sum_{t=1}^T w_t\Big) - f(w^*)$
$\overset{\text{since } f \text{ is convex}}{\le}\ \frac{1}{T}\sum_{t=1}^T f(w_t) - f(w^*) = \frac{1}{T}\sum_{t=1}^T\big(f(w_t) - f(w^*)\big)$
$\overset{\text{by }(11.7)}{\le}\ \frac{1}{T}\sum_{t=1}^T\langle w_t - w^*, \nabla f(w_t)\rangle.$
Next we apply the gradient flow algorithm described above to $\nabla\hat R^L_S(h)$. The weak law of large numbers ensures the convergence in probability of $\nabla\hat R^L_S(h)$ to the RHS of (11.11), and heuristically the convergence of the empirical gradient descent algorithm to the gradient descent of the expected risk function $R^L_\mu$.
Exercise 11.7. Find an upper bound for the sample complexity of the SGD
in Proposition 11.6.
In the case of 0-1 loss function L(0−1) the value RA (T ) is called the number
of mistakes that A makes after T rounds.
Definition 11.10 (Mistake Bounds, Online Learnability). ([SSBD2014, Definition 21.1, p. 288]) Given any sequence $S = (x_1, h^*(x_1)), \cdots, (x_T, h^*(x_T))$, where $T$ is any integer and $h^*\in\mathcal{H}$, let $M_A(S)$ be the number of mistakes $A$ makes on the sequence $S$. We denote by $M_A(\mathcal{H})$ the supremum of $M_A(S)$ over all sequences of the above form. A bound of the form $M_A(\mathcal{H})\le B < \infty$ is called a mistake bound. We say that a hypothesis class $\mathcal{H}$ is online learnable if there exists an algorithm $A$ for which $M_A(\mathcal{H})\le B < \infty$.
Remark 11.11. 1) Similarly we also have the notion of a successful online learner in regression problems [SSBD2014, p. 300], and within this concept online gradient descent is a successful online learner whenever the loss function is convex and Lipschitz.
2) In the online learning setting the notion of certainty, and therefore the notion of a probability measure, is absent. In particular we do not have the notion of expected risk. So it is an open question whether we can justify the online learning setting using statistical learning theory.
11.4. Conclusion. In this section we studied stochastic gradient descent as a learning algorithm, which works if the loss function is convex. To apply stochastic gradient descent as a learning algorithm in FNNs, where the loss function is not convex, one needs to experimentally modify the algorithm so that it does not get stuck at a critical point which is not a minimizer of the empirical risk function. One also trains NNs with online gradient descent, for which we need a new concept of online learnability that has not yet been interpreted within a probability framework.
the model, which is meant to capture our beliefs about the situation before seeing the data $x\in\mathcal{X}$. After observing some data, we apply Bayes' Theorem A.13 to obtain a posterior distribution for these unknowns, which takes account of both the prior and the data. From this posterior distribution we can compute predictive distributions $P(z^{n+1}\mid z^1, \cdots, z^n)$ for future observations using (12.1).
To predict the value of an unknown quantity $z^{n+1}$, given a sample $(z^1, \cdots, z^n)$ and a prior distribution $\mu_\Theta$, one uses the following formula:
(12.1) $P(z^{n+1}\mid z^1, \cdots, z^n) = \int_\Theta P(z^{n+1}\mid\theta)\, P(\theta\mid z^1, \cdots, z^n)\, d\mu_\Theta.$
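A conjugate Beta-Bernoulli sketch of the predictive distribution (12.1), where the posterior and the integral are available in closed form; the prior hyperparameters and the data are illustrative assumptions.

```python
# Sketch: posterior predictive (12.1) for the Beta-Bernoulli model.
import numpy as np

a, b = 1.0, 1.0                  # Beta(a, b) prior on theta = P(z = 1)
z = np.array([1, 0, 1, 1, 0, 1]) # observed sample z^1, ..., z^n

# Posterior P(theta | z^1..z^n) is Beta(a + #ones, b + #zeros); the
# integral (12.1) then gives P(z^{n+1} = 1 | z^1..z^n) in closed form.
a_post, b_post = a + z.sum(), b + (len(z) - z.sum())
predictive = a_post / (a_post + b_post)
print(f"P(z^(n+1) = 1 | data) = {predictive:.3f}")
```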
12.4. Conclusion. In our lecture we considered the main ideas and some applications of Bayesian methods in machine learning. Bayesian machine learning is an emerging, promising trend in machine learning that is well suited for solving complex problems on the one hand, and consistent with most basic techniques of non-Bayesian machine learning on the other. There are several problems in implementing the Bayesian approach, for instance translating our subjective prior beliefs into a mathematically formulated model and prior. There may also be computational difficulties with the Bayesian approach.
Remark A.5. It follows from (A.5) that for any $f\in L^1(\mathcal{X}, \mu)$ and any $B\in\mathcal{B}$ we have
(A.6) $\int_B f\, d\mu = \int_B E^{\mathcal{B}}_\mu f\, d\mu.$
(A.8) $E^{\mathcal{B}}_\mu(f) = \Pi_{L^2(\mathcal{X},\mathcal{B},\mu)}(f)$
II]. There is an example showing that the existence of a disintegration does not imply the existence of regular conditional measures.
(2) The existence of a disintegration $\mu(\cdot, x)$ with $\Sigma_x = \Sigma_0$ for all $x\in\mathcal{X}$ is equivalent to the existence of conditional measures w.r.t. $\mathcal{B}$ in the sense of Doob [Doob1953, p. 26]. The existence of disintegrations under dominating measures has been discussed in [CP1997].
and noting that the LHS of (B.1) is equal to $f_*(\mu)(t, +\infty)$, which is a monotone function in $t$, we obtain the Markov inequality (B.1) immediately.
(2) For any monotone function $\varphi : (\mathbb{R}_{\ge0}, dt)\to(\mathbb{R}_{\ge0}, dt)$, applying (B.1) we have
(B.4) $f_*(\mu)(t, +\infty) = \varphi_*\circ f_*(\mu)(\varphi(t), +\infty)\le\frac{E_\mu(\varphi\circ f)}{\varphi(t)}.$
The Chebyshev inequality (B.2) follows from (B.4), replacing $f$ in (B.4) by $|g - E_\mu(g)|$ for $g\in L^1(\mathcal{X}, \mu)$, and setting $\varphi(t) := t^2$.
B.4. Hoeffding’s inequality. ([Hoeffding1963]) Let θ = (θ1 , · · · , θn ) be a
sequence of i.i.d. R-valued random variables on Z and µ ∈ P(Z). Assume
that Eµ (θi (z)) = θ̄ for all i and µ{z ∈ Z| [ai ≤ θi (z) ≤ b]} = 1. Then for
any ε > 0 we have
m
m m
1 X −2mε2
(B.5) µ {z ∈ Z : θi (zi ) − θ̄ > ε} ≤ 2 exp ,
m (b − a)2
i=1
where z = (z1 , · · · , zm ).
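A simulation sketch comparing the empirical deviation probability with the Hoeffding bound (B.5) for Bernoulli variables bounded in $[a, b] = [0, 1]$; the sample size and the number of trials are assumptions of this illustration.

```python
# Sketch: empirical check of Hoeffding's inequality (B.5) for Bernoulli(1/2).
import numpy as np

rng = np.random.default_rng(8)
m, eps, trials = 100, 0.1, 20000
theta_bar = 0.5

samples = rng.integers(0, 2, size=(trials, m))        # theta_i in {0, 1}
deviations = np.abs(samples.mean(axis=1) - theta_bar) > eps
empirical = deviations.mean()
bound = 2 * np.exp(-2 * m * eps**2)                   # RHS of (B.5)
print(f"empirical probability: {empirical:.4f}, Hoeffding bound: {bound:.4f}")
```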
29 We thank J.P. Vigneaux for suggesting that we use “probabilistic morphism” instead of “probabilistic mapping”.
References
[Amari2016] S. Amari, Information Geometry and its applications, Springer, 2016.
[AJLS2015] N. Ay, J. Jost, H. V. Lê, and L. Schwachhöfer, Information geometry
and sufficient statistics, Probability Theory and related Fields, 162 (2015), 327-
364, arXiv:1207.6736.
[AJLS2017] N. Ay, J. Jost, H. V. Lê and L. Schwachhöfer, Information Geometry,
Springer, 2017.
[AJLS2018] N. Ay, J. Jost, H. V. Lê and L. Schwachhöfer, Parametrized measure
models, Bernoulli vol. 24 Nr 3 (2018), 1692-1725, arXiv:1510.07305.
[BHMSY2019] S. Ben-David, P. Hrubes, S. Moran and A. Yehudayoff, Learnability
can be undecidable, Nature Machine intelligence, 1(2019), 44-48.
[Billingsley1999] P. Billingsley, Convergence of Probability Measures, 2nd edition, John Wiley and Sons, 1999.
[Bishop2006] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[Bogachev2007] V. I. Bogachev, Measure theory I, II, Springer, 2007.
[Bogachev2010] V. I. Bogachev, Differentiable Measures and the Malliavin Calculus,
AMS, 2010.
[Bogachev2018] V. I. Bogachev, Weak convergence of measures, AMS, 2018.
[Borovkov1998] A. A. Borovkov, Mathematical statistics, Gordon and Breach Science
Publishers, 1998.
[CM2010] G. Carlsson and F. Memoli, Classifying clustering schemes, arXiv:1011.527.
[CP1997] J. T. Chang and D. Pollard, Conditioning as disintegration, Statistica Neer-
landica. 51(1997), 287-317.
[Chentsov1972] N. N. Chentsov, Statistical Decision Rules and Optimal Inference,
Translation of mathematical monographs, AMS, Providence, Rhode Island, 1982,
translation from Russian original, Nauka, Moscow, 1972.
[CS2001] F. Cucker and S. Smale, On mathematical foundations of learning, Bulletin
of AMS, 39(2001), 1-49.
[Cybenko1989] G. Cybenko, Approximation by superpositions of a sigmoidal function,
Mathematics of Control, Signals, and Systems, vol. 2(1989), pp. 303-314.
[DGL1997] L. Devroye, L. Györfi and G. Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996.
[Doob1953] J. L. Doob, Stochastic Processes, John Wiley and Sons, 1953.
[Faden1985] A.M. Faden, The existence of regular conditional probabilities: necessary
and sufficient conditions. Ann. Probab. 13(1985), 288 - 298.
[Fisher1925] R. A. Fisher, Theory of statistical estimation, Proceedings of the Cambridge
Philosophical Society, 22(1925), 700-725.
[FHIBP2018] V. François-Lavet, P. Henderson, R. Islam, M.G. Bellemare, J. Pineau, An Introduction to Deep Reinforcement Learning, Foundations and Trends in Machine Learning, 11 (3-4) (2018), 219-354, arXiv:1811.12560.
[Fritz2019] T. Fritz, A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics, arXiv:1908.07021.
[GBC2016] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT, 2016.
[Ghahramani2013] Z. Ghahramani, Bayesian nonparametrics and the probabilistic ap-
proach to modelling. Philosophical Transactions of the Royal Society A 371 (2013),
20110553.
[Ghahramani2015] Z. Ghahramani, Probabilistic machine learning and artificial intelli-
gence, Nature, 521(2015), 452-459.