In statistical classification, the Bayes classifier is the classifier having the smallest probability of misclassification of all classifiers using the same set of features.[1]
Definition
Suppose a pair $(X, Y)$ takes values in $\mathbb{R}^d \times \{1, 2, \dots, K\}$, where $Y$ is the class label of an element whose features are given by $X$. Assume that the conditional distribution of $X$, given that the label $Y$ takes the value $r$, is given by

$$(X \mid Y = r) \sim P_r \quad \text{for } r = 1, 2, \dots, K,$$

where "$\sim$" means "is distributed as", and where $P_r$ denotes a probability distribution.
A classifier is a rule that assigns to an observation $X = x$ a guess or estimate of what the unobserved label $Y = r$ actually was. In theoretical terms, a classifier is a measurable function $C : \mathbb{R}^d \to \{1, 2, \dots, K\}$, with the interpretation that $C$ classifies the point $x$ to the class $C(x)$. The probability of misclassification, or risk, of a classifier $C$ is defined as

$$\mathcal{R}(C) = \operatorname{P}\{C(X) \neq Y\}.$$

The Bayes classifier is

$$C^{\text{Bayes}}(x) = \underset{r \in \{1, 2, \dots, K\}}{\operatorname{argmax}} \operatorname{P}(Y = r \mid X = x).$$
In practice, as in most of statistics, the difficulties and subtleties are associated with modeling the probability distributions effectively; in this case, the posterior probabilities $\operatorname{P}(Y = r \mid X = x)$. The Bayes classifier is a useful benchmark in statistical classification.
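When the priors and the class-conditional distributions are known exactly, the rule above can be written directly in code. The following Python sketch assumes two one-dimensional Gaussian class conditionals with equal priors (all illustrative choices, not part of the definition) and computes the posterior through Bayes' rule up to the common evidence factor:

```python
# A minimal sketch of the Bayes classifier when the class-conditional
# distributions P_r and the priors P(Y = r) are known exactly.
# The two 1-D Gaussians with equal priors are illustrative assumptions.
from scipy.stats import norm

priors = {1: 0.5, 2: 0.5}                      # P(Y = r)
conditionals = {1: norm(loc=-1.0, scale=1.0),  # P_1 = distribution of X | Y = 1
                2: norm(loc=+1.0, scale=1.0)}  # P_2 = distribution of X | Y = 2

def bayes_classifier(x):
    """Return argmax_r P(Y = r | X = x); by Bayes' rule the posterior is
    proportional to P(Y = r) * p_r(x), so the evidence term can be dropped."""
    return max(priors, key=lambda r: priors[r] * conditionals[r].pdf(x))

print(bayes_classifier(-0.3))  # 1: the point lies closer to the mean of P_1
print(bayes_classifier(0.8))   # 2
```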
The excess risk of a general classifier $C$ (possibly depending on some training data) is defined as

$$\mathcal{R}(C) - \mathcal{R}(C^{\text{Bayes}}).$$

Thus this non-negative quantity is important for assessing the performance of different classification techniques. A classifier is said to be consistent if the excess risk converges to zero as the size of the training data set tends to infinity.[2]
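The excess risk of a competing rule can be estimated by simulation. A minimal sketch, assuming the same two-Gaussian model as above: it compares a deliberately suboptimal threshold rule with the Bayes rule by Monte Carlo, and the gap, an estimate of the excess risk, should be non-negative up to sampling noise.

```python
# Monte Carlo estimate of the excess risk R(C) - R(C^Bayes) under the
# two-Gaussian model used above (an illustrative assumption).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
y = rng.integers(1, 3, size=n)                       # labels 1 or 2, equal priors
x = rng.normal(loc=np.where(y == 1, -1.0, 1.0), scale=1.0)

bayes_pred = np.where(x < 0.0, 1, 2)   # Bayes rule: threshold at the midpoint 0
other_pred = np.where(x < 0.7, 1, 2)   # an arbitrary suboptimal threshold

risk_bayes = np.mean(bayes_pred != y)
risk_other = np.mean(other_pred != y)
print(f"excess risk ≈ {risk_other - risk_bayes:.4f}")  # non-negative up to noise
```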
Considering the components $x_1, \dots, x_d$ of $x$ to be mutually independent, we get the naive Bayes classifier, where

$$C^{\text{Bayes}}(x) = \underset{r}{\operatorname{argmax}} \operatorname{P}(Y = r) \prod_{i=1}^{d} P_r(x_i).$$
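The factorised rule is straightforward to implement. A minimal sketch, assuming two classes whose coordinates are independent Gaussians (the per-coordinate parameters are illustrative):

```python
# A minimal naive Bayes sketch: the coordinates of x are treated as
# independent, so the class-conditional density factorises into a product
# over coordinates. The Gaussian per-coordinate model is illustrative.
import numpy as np
from scipy.stats import norm

priors = {1: 0.5, 2: 0.5}
# per-class, per-coordinate (mean, std) for a 2-dimensional feature vector
params = {1: [(-1.0, 1.0), (0.0, 2.0)],
          2: [(+1.0, 1.0), (1.0, 2.0)]}

def naive_bayes(x):
    """argmax_r P(Y = r) * prod_i p_r(x_i), the naive Bayes rule."""
    def score(r):
        return priors[r] * np.prod([norm(m, s).pdf(xi)
                                    for (m, s), xi in zip(params[r], x)])
    return max(priors, key=score)

print(naive_bayes([-0.5, 0.2]))  # 1
```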
Properties
The proof that the Bayes classifier is optimal and that the Bayes error rate is minimal proceeds as follows.
Define the variables: risk $R(h)$, Bayes risk $R^*$, and the set of possible classes $\{0, 1\}$ to which a point can be classified. Let the posterior probability of a point belonging to class 1 be $\eta(x) = \Pr(Y = 1 \mid X = x)$. Define the classifier $h^*$ as

$$h^*(x) = \begin{cases} 1 & \text{if } \eta(x) \geq 0.5, \\ 0 & \text{otherwise.} \end{cases}$$

Then we have the following results:

(a) $R(h^*) = R^*$, i.e. $h^*$ is a Bayes classifier;
(b) for any classifier $h$, the excess risk satisfies $R(h) - R^* = 2\,\mathbb{E}_X\!\left[\,|\eta(X) - 0.5| \cdot \mathbb{I}_{\{h(X) \neq h^*(X)\}}\right]$;
(c) $R^* = \mathbb{E}_X\!\left[\min(\eta(X), 1 - \eta(X))\right]$;
(d) $R^* = \frac{1}{2} - \frac{1}{2}\,\mathbb{E}\!\left[\,|2\eta(X) - 1|\,\right]$.
Proof of (a): For any classifier $h$, we have

$$R(h) = \mathbb{E}_{XY}\!\left[\mathbb{I}_{\{h(X) \neq Y\}}\right] = \mathbb{E}_X \mathbb{E}_{Y \mid X}\!\left[\mathbb{I}_{\{h(X) \neq Y\}} \mid X\right] = \mathbb{E}_X\!\left[\eta(X)\,\mathbb{I}_{\{h(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h(X) = 1\}}\right],$$

where the second equality was derived through Fubini's theorem. Notice that the integrand

$$\eta(x)\,\mathbb{I}_{\{h(x) = 0\}} + (1 - \eta(x))\,\mathbb{I}_{\{h(x) = 1\}}$$

is minimised pointwise by taking $h(x) = 1$ whenever $\eta(x) \geq 0.5$ and $h(x) = 0$ otherwise, i.e. by $h(x) = h^*(x)$. Therefore the minimum possible risk is the Bayes risk, $R^* = R(h^*)$.
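The pointwise minimisation step can be checked numerically: for every fixed value of $\eta(x)$, the thresholded choice attains the smaller of the two conditional risks. A small sketch:

```python
# Numeric check of the pointwise step in the proof of (a): for every value
# of eta, the thresholded decision h* achieves the smaller conditional risk.
import numpy as np

for eta in np.linspace(0.0, 1.0, 11):
    risk_if_0 = eta            # conditional risk when h(x) = 0
    risk_if_1 = 1.0 - eta      # conditional risk when h(x) = 1
    h_star = 1 if eta >= 0.5 else 0
    chosen = risk_if_1 if h_star == 1 else risk_if_0
    assert chosen == min(risk_if_0, risk_if_1)
print("h* attains the pointwise minimum for every eta checked")
```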
Proof of (b): Using the expression for $R(h)$ from (a) for both $h$ and $h^*$,

$$R(h) - R^* = \mathbb{E}_X\!\left[\eta(X)\left(\mathbb{I}_{\{h(X) = 0\}} - \mathbb{I}_{\{h^*(X) = 0\}}\right) + (1 - \eta(X))\left(\mathbb{I}_{\{h(X) = 1\}} - \mathbb{I}_{\{h^*(X) = 1\}}\right)\right].$$

The integrand vanishes on the event $\{h(X) = h^*(X)\}$, and on the complementary event it equals $|2\eta(X) - 1|$, so

$$R(h) - R^* = \mathbb{E}_X\!\left[\,|2\eta(X) - 1| \cdot \mathbb{I}_{\{h(X) \neq h^*(X)\}}\right] = 2\,\mathbb{E}_X\!\left[\,|\eta(X) - 0.5| \cdot \mathbb{I}_{\{h(X) \neq h^*(X)\}}\right].$$
Proof of (c): By (a),

$$R^* = R(h^*) = \mathbb{E}_X\!\left[\eta(X)\,\mathbb{I}_{\{h^*(X) = 0\}} + (1 - \eta(X))\,\mathbb{I}_{\{h^*(X) = 1\}}\right],$$

and since $h^*(X) = 1$ exactly when $\eta(X) \geq 0.5$, the integrand equals $\min(\eta(X), 1 - \eta(X))$, giving $R^* = \mathbb{E}_X\!\left[\min(\eta(X), 1 - \eta(X))\right]$.
Proof of (d): Applying the identity $\min(a, 1 - a) = \frac{1}{2} - \frac{1}{2}|2a - 1|$ to (c) gives

$$R^* = \mathbb{E}_X\!\left[\tfrac{1}{2} - \tfrac{1}{2}|2\eta(X) - 1|\right] = \tfrac{1}{2} - \tfrac{1}{2}\,\mathbb{E}\!\left[\,|2\eta(X) - 1|\,\right].$$
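Identities (b)-(d) are easy to sanity-check by simulation. The sketch below assumes a known posterior $\eta(x)$ (a logistic curve, chosen purely for illustration) with $X \sim N(0, 1)$, draws labels accordingly, and compares both sides of each identity:

```python
# Monte Carlo sanity check of identities (b)-(d) under an assumed posterior
# eta(x) = sigmoid(2x) with X ~ N(0, 1); these modelling choices are
# illustrative, not part of the statement.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000
x = rng.normal(size=n)
eta = 1.0 / (1.0 + np.exp(-2.0 * x))      # eta(x) = P(Y = 1 | X = x)
y = (rng.random(n) < eta).astype(int)     # draw labels from the posterior

h_star = (eta >= 0.5).astype(int)         # Bayes classifier
h = (x >= 0.4).astype(int)                # an arbitrary competing classifier

R_star = np.mean(h_star != y)
R_h = np.mean(h != y)

# (b): excess risk formula
lhs_b = R_h - R_star
rhs_b = 2.0 * np.mean(np.abs(eta - 0.5) * (h != h_star))
# (c) and (d): two closed forms for the Bayes risk
rhs_c = np.mean(np.minimum(eta, 1.0 - eta))
rhs_d = 0.5 - 0.5 * np.mean(np.abs(2.0 * eta - 1.0))

print(f"(b) {lhs_b:.4f} ≈ {rhs_b:.4f}")
print(f"(c) {R_star:.4f} ≈ {rhs_c:.4f}")
print(f"(d) {R_star:.4f} ≈ {rhs_d:.4f}")
```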
General case
The general case, in which the Bayes classifier minimises classification error when each element can belong to any of $n$ categories, proceeds by towering expectations as follows:

$$\mathbb{E}_Y\!\left[\mathbb{I}_{\{Y \neq \hat{y}(X)\}}\right] = \mathbb{E}_X\,\mathbb{E}_{Y \mid X}\!\left[\mathbb{I}_{\{Y \neq \hat{y}(X)\}} \mid X = x\right] = \mathbb{E}_X\!\left[\sum_{r=1}^{n} \Pr(Y = r \mid X = x)\,\mathbb{I}_{\{\hat{y}(x) \neq r\}}\right].$$

This is minimised by simultaneously minimising every term of the expectation, using the classifier

$$h(x) = \underset{r \in \{1, \dots, n\}}{\operatorname{argmax}} \Pr(Y = r \mid X = x)$$

for each observation $x$.
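In code, the general rule is the same argmax, and the towering argument can be checked pointwise: for each $x$, the conditional error of guessing $\hat{y}$ is $1 - \Pr(Y = \hat{y} \mid X = x)$, which the argmax minimises. A toy sketch with an assumed three-class posterior table:

```python
# For each observation x, the conditional error of guessing y_hat is
# sum_r P(Y = r | X = x) * I{y_hat != r} = 1 - P(Y = y_hat | X = x),
# so it is minimised by y_hat = argmax_r P(Y = r | X = x).
# The three-class posterior table below is an illustrative assumption.
import numpy as np

posterior = {  # P(Y = r | X = x) for a toy discrete feature x
    "a": np.array([0.7, 0.2, 0.1]),
    "b": np.array([0.1, 0.3, 0.6]),
}

for x, p in posterior.items():
    cond_err = 1.0 - p                # conditional error of each guess r
    best = int(np.argmin(cond_err))
    assert best == int(np.argmax(p))  # the argmax rule minimises the error
    print(f"x={x!r}: classify as class {best + 1}")
```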
See also
References