New Advances in Machine Learning
Edited by
Yagang Zhang
In-Tech
intechweb.org
Published by In-Teh
Olajnica 19/2, 32000 Vukovar, Croatia
Abstracting and non-profit use of the material is permitted with credit to the source. Statements and
opinions expressed in the chapters are those of the individual contributors and not necessarily those of
the editors or publisher. No responsibility is accepted for the accuracy of information contained in the
published articles. The publisher assumes no responsibility or liability for any damage or injury to persons
or property arising out of the use of any materials, instructions, methods or ideas contained inside. After
this work has been published by In-Teh, authors have the right to republish it, in whole or part, in any
publication of which they are an author or editor, and to make other personal use of the work.
© 2010 In-teh
www.intechweb.org
Additional copies can be obtained from:
[email protected]
ISBN 978-953-307-034-6
Preface
The purpose of this book is to provide an up-to-date and systematic introduction to the
principles and algorithms of machine learning. The definition of learning is broad enough to
include most tasks that we commonly call "learning" tasks, as we use the word in daily life.
It is also broad enough to encompass computers that improve from experience in quite
straightforward ways.
Machine learning addresses the question of how to build computer programs that improve
their performance at some task through experience. It attempts to automate the estimation
process by building machine learners based upon empirical data. Machine learning algorithms
have proven to be of great practical value in a variety of application domains, such as:
data mining problems, where large databases may contain valuable implicit regularities that
can be discovered automatically; poorly understood domains, where humans might not have
the knowledge needed to develop effective algorithms; and domains where the program must
dynamically adapt to changing conditions.
Machine learning is inherently a multidisciplinary field. It draws on results from artificial
intelligence, probability and statistics, computational complexity theory, control theory,
information theory, philosophy, psychology, neurobiology, and other fields. The goal of
this book is to present the important advances in the theory and algorithms that form the
foundations of machine learning.
This book presents a large body of knowledge about machine learning, mainly including:
classification, support vector machines, discriminant analysis, multi-agent systems,
image recognition, ant colony optimization, and so on.
The book will be of interest to industrial engineers and scientists as well as academics who
wish to pursue machine learning. It is intended for both graduate and postgraduate
students in fields such as computer science, cybernetics, systems science, engineering,
statistics, and the social sciences, and as a reference for software professionals and practitioners.
The wide scope of the book provides readers with a good introduction to many approaches to
machine learning, and it is also a source of useful bibliographical information.
Editor:
Yagang Zhang
Contents
Preface V
7. From Feature Space to Primal Space: KPCA and Its Mixture Model 105
Haixian Wang
15. Mahalanobis Support Vector Machines Made Fast and Robust 227
Xunkai Wei, Yinghong Li, Dong Liu and Liguang Zhan
16. On-line learning of fuzzy rule emulated networks for a class of unknown
nonlinear discrete-time controllers with estimated linearization 251
Chidentree Treesatayapun
17. Knowledge Structures for Visualising Advanced Research and Trends 271
Maria R. Lee and Tsung Teng Chen
19. Concept Mining and Inner Relationship Discovery from Text 305
Jiayu Zhou and Shi Wang
21. A Hebbian Learning Approach for Diffusion Tensor Analysis & Tractography 345
Dilek Göksel Duru
22. A Novel Credit Assignment to a Rule with Probabilistic State Transition 357
Wataru Uemura
Introduction to Machine Learning 1
1. Introduction
At present, getting a computer to carry out any task requires a set of specific
instructions or the implementation of an algorithm that defines the rules to be
followed. Present-day computer systems have no ability to learn from past experience
and hence cannot readily improve on the basis of past mistakes. So, instructing a computer
or a computer-controlled programme to perform a task requires one to define a
complete and correct algorithm for the task and then programme that algorithm into the
computer. Such activities involve tedious and time-consuming effort by a specially trained
teacher or programmer. Jaime et al. (Jaime G. Carbonell, 1983) also explained that present-day
computer systems cannot truly learn to perform a task through examples or through
previously solved tasks, and they cannot improve on the basis of past mistakes or acquire new
abilities by observing and imitating experts. Machine learning research endeavours to open
the possibility of instructing computers in such a new way, and thereby promises to ease
the burden of hand-writing programmes and the growing problem of complex information
in computers.
When approaching a task-oriented acquisition task, one must be aware that the resultant
computer system must interact with humans and therefore should closely match human
abilities. A learning machine or programme will have to interact with the computer users
who make use of it, and consequently the concepts and skills it acquires - if not
necessarily its internal mechanisms - must be understandable to humans.
Alpaydin (Alpaydin, 2004) also stated that, with advances in computer technology, we
currently have the ability to store and process large amounts of data, as well as access it from
physically distant locations over computer networks. Most data acquisition devices are
now digital and record reliable data. Consider, for example, a supermarket chain that has
hundreds of stores all over the country selling thousands of goods to millions of customers.
The point-of-sale terminals record the details of each transaction: date, customer identification
code, goods bought and their amounts, total money spent, and so forth. This typically amounts
to gigabytes of data every day. This stored data becomes useful only when it is analysed and
turned into information that can be used or from which predictions can be made.
We do not know exactly which people are likely to buy a particular product, or which author
to suggest to people who enjoy reading Hemingway. If we knew, we would not need any
analysis of the data; we would just go ahead and write down the code. But because we do not,
we can only collect data and hope to extract the answers to these and similar questions from
the data. We can construct a good and useful approximation. That approximation may not
explain everything, but it may still be able to account for some part of the data. Although
identifying the complete process may not be possible, we can still detect certain patterns or
regularities. This is the niche of machine learning. Such patterns may help us understand
the process, or we can use those patterns to make predictions: assuming that the future, at
least the near future, will not be much different from the past when the sample data was
collected, future predictions can also be expected to be right.
Machine learning is not just a database problem; it is also a part of artificial intelligence. To be
intelligent, a system that is in a changing environment should have the ability to learn. If the
system can learn and adapt to such changes, the system designer need not foresee and
provide solutions for all possible situations. Machine learning also helps us find solutions to
many problems in vision, speech recognition, and robotics. Let us take the example of
recognising faces: this is a task we do effortlessly; we recognise family members and
friends by looking at their faces or at their photographs, despite differences in pose,
lighting, hairstyle, and so forth. But we do it unconsciously and are unable to explain how we
do it. Because we are not able to explain our expertise, we cannot write the corresponding
computer program. At the same time, we know that a face image is not just a random collection
of pixels: a face has structure; it is symmetric. The eyes, the nose, and the mouth are located
in certain places on the face. Each person's face is a pattern composed of a particular
combination of these. By analysing sample face images of a person, a learning program
captures the pattern specific to that person and then recognises that person by checking for
the pattern in a given image. This is one example of pattern recognition.
Machine learning is programming computers to optimise a performance criterion using
example data or past experience. We have a model defined up to some parameters, and
learning is the execution of a computer program to optimise the parameters of the model
using the training data or past experience. The model may be predictive, to make predictions
in the future, or descriptive, to gain knowledge from data, or both. Machine learning uses the
theory of statistics in building mathematical models, because the core task is making
inferences from a sample. The role of learning is twofold: first, in training, we need efficient
algorithms to solve the optimisation problem, as well as to store and process the massive
amounts of data we generally have. Second, once a model is learned, its representation and
the algorithmic solution for inference need to be efficient as well. In certain applications, the
efficiency of the learning or inference algorithm, namely its space and time complexity, may
be as important as its predictive accuracy.
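The idea of a model defined up to some parameters, whose parameters are then optimised against training data, can be sketched in a few lines (a minimal illustration, not drawn from this chapter; the data, learning rate, and "true" process y = 3x are invented for the example):

```python
# Fit a one-parameter model y = w * x by gradient descent on squared error.
def fit_slope(xs, ys, lr=0.01, epochs=500):
    w = 0.0  # the model is defined up to the parameter w
    for _ in range(epochs):
        # gradient of sum((w*x - y)^2) with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad  # move w against the gradient
    return w

# Training data generated by the "true" process y = 3x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]
w = fit_slope(xs, ys)
print(round(w, 3))  # close to 3.0
```

Here "learning" is nothing more than running an optimisation over the free parameter until the model reproduces the training data.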
• Some tasks cannot be defined well except by example; that is, we might be able to
specify input and output pairs but not a concise relationship between inputs and
desired outputs. We would like machines to be able to adjust their internal
structure to produce correct outputs for a large number of sample inputs and thus
suitably constrain their input-output function to approximate the relationship
implicit in the examples.
• It is possible that hidden among large piles of data are important relationships and
correlations. Machine learning methods can often be used to extract these
relationships (data mining).
• Human designers often produce machines that do not work as well as desired in
the environments in which they are used. In fact, certain characteristics of the
working environment might not be completely known at design time. Machine
learning methods can be used for on-the-job improvement of existing machine
designs.
• The amount of knowledge available about certain tasks might be too large for
explicit encoding by humans. Machines that learn this knowledge gradually might
be able to capture more of it than humans would want to write down.
Brain Models: Non-linear elements with weighted inputs have been suggested as
simple models of biological neurons. Networks of these elements have been
studied by several researchers, including (Rajesh P. N. Rao, 2002). Brain modellers
are interested in how closely these networks approximate the learning phenomena
of living brains. We shall see that several important machine learning techniques are
based on networks of nonlinear elements, often called neural networks. Work
inspired by this school is sometimes called connectionism, brain-style computation,
or sub-symbolic processing.
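Such a nonlinear element with weighted inputs can be sketched in a few lines (an illustrative toy, not from the chapter; the weights and threshold are invented for the example):

```python
def neuron(inputs, weights, threshold):
    """A simple threshold unit: fires (returns 1) when the
    weighted sum of its inputs exceeds the threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

# With weights (1, 1) and threshold 1.5 the unit computes logical AND.
print(neuron([1, 1], [1.0, 1.0], 1.5))  # 1
print(neuron([1, 0], [1.0, 1.0], 1.5))  # 0
```

A network of such units, with weights adjusted by a learning rule rather than set by hand, is the basic object studied by the connectionist school.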
Artificial Intelligence: From the beginning, AI research has been concerned with
machine learning. Samuel developed a prominent early program that learned the
parameters of a function for evaluating board positions in the game of checkers. AI
researchers have also explored the role of analogies in learning and how future
actions and decisions can be based on previous exemplary cases. Recent work has
been directed at discovering rules for expert systems using decision tree methods
and inductive logic programming. Another theme has been saving and
generalizing the results of problem solving using explanation-based learning
(Mooney, 2000), (Y. Chali, 2009).
Evolutionary Models
In nature, not only do individual animals learn to perform better, but species
evolve to be better fitted to their individual niches. Since the distinction between
evolving and learning can be blurred in computer systems, techniques that model
certain aspects of biological evolution have been proposed as learning methods to
improve the performance of computer programs. Genetic algorithms and genetic
programming (Oltean, 2005) are the most prominent computational techniques for
evolution.
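A genetic algorithm can be sketched in miniature as follows (a toy illustration, not taken from the chapter; the fitness function - maximising the number of 1-bits in a string - and all parameters are invented):

```python
import random

def evolve(pop_size=20, length=12, generations=60, seed=0):
    """Maximise the number of 1-bits via selection, crossover and mutation."""
    rng = random.Random(seed)
    fitness = lambda bits: sum(bits)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the fitter half of the population as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        # Crossover and mutation produce the next generation.
        children = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            if rng.random() < 0.1:  # occasional point mutation
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        pop = children
    return max(fitness(ind) for ind in pop)

print(evolve())  # best fitness found (the optimum is 12)
```

No individual "learns" here; improvement emerges across generations, which is exactly the blurring of evolving and learning the text describes.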
2. References
Allix, N. M. (2003, April). Epistemology And Knowledge Management Concepts And
Practices. Journal of Knowledge Management Practice .
Alpaydin, E. (2004). Introduction to Machine Learning. Massachusetts, USA: MIT Press.
Anderson, J. R. (1995). Learning and Memory. Wiley, New York, USA.
Anil Mathur, G. P. (1999). Socialization influences on preparation for later life. Journal of
Marketing Practice: Applied Marketing Science , 5 (6,7,8), 163 - 176.
Ashby, W. R. (1960). Design of a Brain, The Origin of Adaptive Behaviour. John Wiley and Son.
Batista, G. &. (2003). An Analysis of Four Missing Data Treatment Methods for Supervised
Learning. Applied Artificial Intelligence , 17, 519-533.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford, England: Oxford
University Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and
Statistics). New York, New York: Springer Science and Business Media.
Block H, D. (1961). The Perceptron: A Model of Brain Functioning. 34 (1), 123-135.
Machine Learning Overview 9
Although many research efforts strive primarily towards one of these objectives, progress in
one objective often leads to progress in another. For example, in order to investigate the
space of possible learning methods, a reasonable starting point may be to consider the only
known example of robust learning behaviour, namely humans (and perhaps other biological
systems). Similarly, psychological investigations of human learning may be helped by theoretical
analysis that suggests various possible learning models. The need to acquire a particular
form of knowledge in some task-oriented study may itself spawn new theoretical analysis or
pose the question: "How do humans acquire this specific skill (or knowledge)?" The
existence of these mutually supportive objectives reflects the entire field of artificial
intelligence, where expert system research, cognitive simulation, and theoretical studies
provide cross-fertilization of problems and ideas (Jaime G. Carbonell, 1983).
1) Knowledge Acquisition
2) Skill refinement
When it is said that someone has learned mathematics, it means that this person has acquired
the concepts of mathematics and understood their meaning and their relationship to each other
as well as to the world. The essence of learning in this case is the acquisition of knowledge,
including descriptions and models of physical systems and their behaviours,
incorporating a variety of representations, from simple intuitive mental models,
examples, and images to fully tested mathematical equations and physical laws. A
person is said to have learned more if this knowledge explains a broader scope of situations,
is more accurate, and is better able to predict the behaviour of the world (Allix,
2003). This form of learning applies to a large variety of situations and is generally
termed knowledge acquisition. Therefore, knowledge acquisition is defined as learning new
information coupled with the ability to apply that information in an effective manner.
The second form of learning is the gradual improvement of motor and cognitive skills
through practice - learning by practice.
Acquiring all the textbook knowledge on how to perform such activities represents only
the initial phase in developing the required skills. The major part of the
learning process consists of refining the acquired skill, improving mental or motor
coordination by repeated practice and the correction of deviations from the
desired behaviour. This form of learning is often called skill refinement, and it differs in many
ways from knowledge acquisition. Whereas knowledge acquisition may be a conscious process
whose result is the creation of new representative knowledge structures and mental models,
skill refinement is learning from example or from repeated practice without
concerted conscious effort. Jaime (Jaime G. Carbonell, 1983) explained that most human
learning appears to be a mixture of both activities, with intellectual endeavours favouring
the former and motor coordination tasks favouring the latter. Present machine learning
research focuses on the knowledge acquisition aspect, although some investigations,
specifically those concerned with learning in problem-solving and transforming declarative
instructions into effective actions, touch on aspects of both types of learning. Whereas
knowledge acquisition clearly belongs in the realm of artificial intelligence research, a case
could be made that skill refinement comes closer to non-symbolic processes such as those
studied in adaptive control systems. Hence, perhaps both forms of learning - knowledge
acquisition and skill refinement - can be captured in artificial intelligence models.
2. References
Allix, N. M. (2003, April). Epistemology And Knowledge Management Concepts And
Practices. Journal of Knowledge Management Practice .
Alpaydin, E. (2004). Introduction to Machine Learning. Massachusetts, USA: MIT Press.
Anderson, J. R. (1995). Learning and Memory. Wiley, New York, USA.
Anil Mathur, G. P. (1999). Socialization influences on preparation for later life. Journal of
Marketing Practice: Applied Marketing Science , 5 (6,7,8), 163 - 176.
Ashby, W. R. (1960). Design of a Brain, The Origin of Adaptive Behaviour. John Wiley and Son.
Batista, G. &. (2003). An Analysis of Four Missing Data Treatment Methods for Supervised
Learning. Applied Artificial Intelligence , 17, 519-533.
Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford, England: Oxford
University Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics).
New York, New York: Springer Science and Business Media.
Block H, D. (1961). The Perceptron: A Model of Brain Functioning. 34 (1), 123-135.
Carling, A. (1992). Introducing Neural Networks . Wilmslow, UK: Sigma Press.
D. Michie, D. J. (1994). Machine Learning, Neural and Statistical Classification. Prentice Hall Inc.
Fausett, L. (1994). Fundamentals of Neural Networks. New York: Prentice Hall.
Forsyth, R. S. (1990). The strange story of the Perceptron. Artificial Intelligence Review , 4 (2),
147-155.
Friedberg, R. M. (1958). A learning machine: Part, 1. IBM Journal , 2-13.
Ghahramani, Z. (2008). Unsupervised learning algorithms are designed to extract structure
from data. 178, pp. 1-8. IOS Press.
Gillies, D. (1996). Artificial Intelligence and Scientific Method. OUP Oxford.
Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. New York: Macmillan
Publishing.
Hodge, V. A. (2004). A Survey of Outlier Detection Methodologies. Artificial Intelligence Review,
22 (2), 85-126.
Holland, J. (1980). Adaptive Algorithms for Discovering and Using General Patterns in
Growing Knowledge Bases Policy Analysis and Information Systems. 4 (3).
Hunt, E. B. (1966). Experiments in Induction.
Ian H. Witten, E. F. (2005). Data Mining: Practical Machine Learning Tools and Techniques
(Second ed.). Morgan Kaufmann.
Jaime G. Carbonell, R. S. (1983). Machine Learning: A Historical and Methodological Analysis.
Association for the Advancement of Artificial Intelligence , 4 (3), 1-10.
Kohonen, T. (1997). Self-Organizing Maps.
González, L. (2005). Unified dual for bi-class SVM approaches. Pattern Recognition, 38 (10),
1772-1774.
McCulloch, W. S. (1943). A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biophysics , 115-133.
Mitchell, T. M. (2006). The Discipline of Machine Learning. Machine Learning Department
technical report CMU-ML-06-108, Carnegie Mellon University.
Mooney, R. J. (2000). Learning Language in Logic. In L. N. Science, Learning for Semantic
Interpretation: Scaling Up without Dumbing Down (pp. 219-234). Springer Berlin /
Heidelberg.
Mostow, D. (1983). Transforming declarative advice into effective procedures: a heuristic search
example. In R. S. Michalski (Ed.). Tioga Press.
Nilsson, N. J. (1982). Principles of Artificial Intelligence (Symbolic Computation / Artificial
Intelligence). Springer.
Oltean, M. (2005). Evolving Evolutionary Algorithms Using Linear Genetic Programming. 13
(3), 387 - 410 .
Orlitsky, A., Santhanam, N., Viswanathan, K., & Zhang, J. (2005). Convergence of profile based
estimators. Proceedings of International Symposium on Information Theory. Proceedings.
International Symposium on, pp. 1843 - 1847. Adelaide, Australia: IEEE.
Patterson, D. (1996). Artificial Neural Networks. Singapore: Prentice Hall.
R. S. Michalski, T. J. (1983). Learning from Observation: Conceptual Clustering. TIOGA Publishing
Co.
Rajesh P. N. Rao, B. A. (2002). Probabilistic Models of the Brain. MIT Press.
Rashevsky, N. (1948). Mathematical Biophysics:Physico-Mathematical Foundations of Biology.
Chicago: Univ. of Chicago Press.
Richard O. Duda, P. E. (2000). Pattern Classification (2nd Edition ed.).
Richard S. Sutton, A. G. (1998). Reinforcement Learning. MIT Press.
Ripley, B. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain . Psychological Review , 65 (6), 386-408.
Russell, S. J. (2003). Artificial Intelligence: A Modern Approach (2nd Edition ed.). Upper Saddle
River, NJ, NJ, USA: Prentice Hall.
Ryszard S. Michalski, J. G. (1983). Machine Learning: An Artificial Intelligence Approach (Volume
I). Morgan Kaufmann.
Selfridge, O. G. (1959). Pandemonium: a paradigm for learning. In The mechanisation of thought
processes. London: H.M.S.O.
Sleeman, D. H. (1983). Inferring Student Models for Intelligent CAI. Machine Learning. Tioga Press.
Tapas Kanungo, D. M. (2002). A local search approximation algorithm for k-means clustering.
Proceedings of the eighteenth annual symposium on Computational geometry (pp. 10-18).
Barcelona, Spain : ACM Press.
Timothy Jason Shepard, P. J. (1998). Decision Fusion Using a Multi-Linear Classifier . In
Proceedings of the International Conference on Multisource-Multisensor Information
Fusion.
Mitchell, T. (1997). Machine Learning. McGraw Hill.
Trevor Hastie, R. T. (2001). The Elements of Statistical Learning. New york, NY, USA: Springer
Science and Business Media.
Widrow, B. W. (2007). Adaptive Inverse Control: A Signal Processing Approach. Wiley-IEEE Press.
Y. Chali, S. R. (2009). Complex Question Answering: Unsupervised Learning Approaches and
Experiments. Journal of Artificial Intelligent Research , 1-47.
Yu, L., & Liu, H. (2004, October). Efficient Feature Selection via Analysis of Relevance and
Redundancy. JMLR, 5, 1205-1224.
Zhang, S. Z. (2002). Data Preparation for Data Mining. Applied Artificial Intelligence. 17, 375 -
381.
Types of Machine Learning Algorithms 19
• Supervised learning --- where the algorithm generates a function that maps inputs
to desired outputs. One standard formulation of the supervised learning task is the
classification problem: the learner is required to learn (to approximate the behavior
of) a function which maps a vector into one of several classes by looking at several
input-output examples of the function.
• Unsupervised learning --- which models a set of inputs: labeled examples are not
available.
• Semi-supervised learning --- which combines both labeled and unlabeled examples
to generate an appropriate function or classifier.
• Reinforcement learning --- where the algorithm learns a policy of how to act given
an observation of the world. Every action has some impact in the environment, and
the environment provides feedback that guides the learning algorithm.
• Transduction --- similar to supervised learning, but does not explicitly construct a
function: instead, it tries to predict new outputs based on training inputs, training
outputs, and new inputs.
• Learning to learn --- where the algorithm learns its own inductive bias based on
previous experience.
Supervised learning is the most common technique for training neural networks and
decision trees. Both of these techniques are highly dependent on the information given by
the pre-determined classifications. In the case of neural networks, the classification is used
to determine the error of the network and then adjust the network to minimize it; in
decision trees, the classifications are used to determine which attributes provide the most
information that can be used to solve the classification puzzle. We will look at both of these in
more detail, but for now it should be sufficient to know that both of these examples thrive
on having some "supervision" in the form of pre-determined classifications.
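The idea that the pre-determined classifications determine which attributes are most informative can be made concrete with an entropy calculation (a small sketch of the information-gain measure used by decision-tree learners; the toy dataset and its attribute names are invented):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attr):
    """Reduction in label entropy from splitting `examples` on `attr`.
    Each example is a pair (attributes_dict, label)."""
    labels = [label for _, label in examples]
    remainder = 0.0
    for v in {attrs[attr] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

# Toy data: 'outlook' perfectly predicts the label, 'windy' does not.
data = [({"outlook": "sun", "windy": True}, "play"),
        ({"outlook": "sun", "windy": False}, "play"),
        ({"outlook": "rain", "windy": True}, "stay"),
        ({"outlook": "rain", "windy": False}, "stay")]
print(information_gain(data, "outlook"))  # 1.0 bit: a perfect split
print(information_gain(data, "windy"))    # 0.0 bits: no information
```

A decision-tree learner repeatedly picks the attribute with the highest gain, which is exactly how the supervision in the labels drives the choice of split.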
Inductive machine learning is the process of learning a set of rules from instances (examples
in a training set) or, more generally speaking, creating a classifier that can
be used to generalize to new instances. The process of applying supervised ML to a real-
world problem is described in Figure 1. The first step is collecting the dataset. If a requisite
expert is available, then s/he could suggest which fields (attributes, features) are the most
1 http://www.aihorizon.com/essays/generalai/supervised_unsupervised_machine_learning.htm
2 http://www.cis.hut.fi/harri/thesis/valpola_thesis/node34.html
3 http://www.aihorizon.com/essays/generalai/supervised_unsupervised_machine_learning.htm
informative. If not, then the simplest method is "brute force", which means
measuring everything available in the hope that the right (informative, relevant) features
can be isolated. However, a dataset collected by the brute-force method is not directly
suitable for induction. In most cases it contains noise and missing feature values, and
therefore requires significant pre-processing (Zhang, 2002).
The second step is data preparation and data pre-processing. Depending on the
circumstances, researchers have a number of methods to choose from to handle missing data
(Batista, 2003). Hodge et al. (Hodge, 2004) have recently introduced a survey of
contemporary techniques for outlier (noise) detection, identifying each technique's
advantages and disadvantages. Instance selection is used not only to handle
noise but also to cope with the infeasibility of learning from very large datasets. Instance
selection in these datasets is an optimization problem that attempts to maintain the mining
quality while minimizing the sample size. It reduces the data and enables a data mining
algorithm to function effectively with very large datasets. There is a variety of
procedures for sampling instances from a large dataset. See figure 2 below.
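One of the simplest missing-data treatments surveyed in this literature, mean imputation, can be sketched as follows (a minimal illustration, not from the chapter; the column of ages is invented and `None` marks a missing value):

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [20.0, None, 30.0, None, 40.0]
print(impute_mean(ages))  # [20.0, 30.0, 30.0, 30.0, 40.0]
```

More careful treatments (per-class means, model-based imputation, or discarding incomplete instances) trade bias against data loss, which is the comparison Batista (2003) analyses.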
Feature subset selection is the process of identifying and removing as many irrelevant and
redundant features as possible (Yu, 2004). This reduces the dimensionality of the data and
enables data mining algorithms to operate faster and more effectively. The fact that many
features depend on one another often unduly influences the accuracy of supervised ML
classification models. This problem can be addressed by constructing new features from the
basic feature set, a technique called feature construction/transformation. These newly
generated features may lead to the creation of more concise and accurate classifiers. In
addition, the discovery of meaningful features contributes to better comprehensibility of the
produced classifier and a better understanding of the learned concept.
Speech recognition using hidden Markov models and Bayesian networks relies on some
elements of supervision as well, in order to adjust parameters to, as usual, minimize the error
on the given inputs.
Notice something important here: in the classification problem, the goal of the
learning algorithm is to minimize the error with respect to the given inputs. These inputs,
often called the "training set", are the examples from which the agent tries to learn. But
learning the training set well is not necessarily the best thing to do. For instance, if I tried to
teach you exclusive-or but only showed you combinations consisting of one true and one
false, never both false or both true, you might learn the rule that the answer is always
true. Similarly, with machine learning algorithms, a common problem is over-fitting the
data: essentially memorizing the training set rather than learning a more general
classification technique. As you might imagine, not all training sets have their inputs
classified correctly. This can lead to problems if the algorithm used is powerful enough to
memorize even the apparently "special cases" that do not fit the more general principles. This,
too, can lead to over-fitting, and it is a challenge to find algorithms that are both powerful
enough to learn complex functions and robust enough to produce generalisable results.
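The exclusive-or illustration can be made concrete: a learner shown only mixed-truth examples happily adopts a rule that fails on the unseen cases (a toy sketch; the "learner" here is just a lookup table with a majority-answer fallback, invented purely for illustration):

```python
def train_lookup(examples):
    """'Learn' by memorizing the training set; fall back to the
    majority training answer for inputs never seen in training."""
    table = dict(examples)
    answers = list(table.values())
    default = max(set(answers), key=answers.count)
    return lambda x: table.get(x, default)

# Training set for XOR that omits (True, True) and (False, False).
train = [((True, False), True), ((False, True), True)]
xor = train_lookup(train)

print(xor((True, False)))  # True  - memorized correctly
print(xor((True, True)))   # True  - wrong: real XOR gives False
```

The learner has zero training error yet generalises badly, which is exactly the over-fitting failure described above.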
(Figure: the process of supervised machine learning - problem, identification of data, data
pre-processing, algorithm selection, training with parameter tuning, and a yes/no evaluation
of the resulting classifier.)
actions and punished for doing others. Often, a form of reinforcement learning can be used
for unsupervised learning, where the agent bases its actions on the previous rewards and
punishments without necessarily even learning any information about the exact ways that
its actions affect the world. In a way, all of this information is unnecessary because by
learning a reward function, the agent simply knows what to do without any processing
because it knows the exact reward it expects to achieve for each action it could take. This can
be extremely beneficial in cases where calculating every possibility is very time consuming
(even if all of the transition probabilities between world states were known). On the other
hand, it can be very time consuming to learn by, essentially, trial and error. But this kind of
learning can be powerful because it assumes no pre-discovered classification of examples. In
some cases, for example, our classifications may not be the best possible. One striking
exmaple is that the conventional wisdom about the game of backgammon was turned on its
head when a series of computer programs (neuro-gammon and TD-gammon) that learned
through unsupervised learning became stronger than the best human chess players merely
by playing themselves over and over. These programs discovered some principles that
surprised the backgammon experts and performed better than backgammon programs
trained on pre-classified examples. A second type of unsupervised learning is called
clustering. In this type of learning, the goal is not to maximize a utility function, but simply
to find similarities in the training data. The assumption is often that the clusters discovered
will match reasonably well with an intuitive classification. For instance, clustering
individuals based on demographics might result in a clustering of the wealthy in one group
and the poor in another. Although the algorithm won't have names to assign to these
clusters, it can produce them and then use those clusters to assign new examples into one or
the other of the clusters. This is a data-driven approach that can work well when there is
sufficient data; for instance, social information filtering algorithms, such as those that
Amazon.com uses to recommend books, are based on the principle of finding similar groups
of people and then assigning new users to groups. In some cases, such as with social
information filtering, the information about other members of a cluster (such as what books
they read) can be sufficient for the algorithm to produce meaningful results. In other cases, it
may be the case that the clusters are merely a useful tool for a human analyst.
Unfortunately, even unsupervised learning suffers from the problem of overfitting the
training data. There's no silver bullet to avoiding the problem because any algorithm that
can learn from its inputs needs to be quite powerful.
Unsupervised learning algorithms, according to Ghahramani (Ghahramani, 2008), are
designed to extract structure from data samples. The quality of a structure is measured by a
cost function which is usually minimized to infer optimal parameters characterizing the
hidden structure in the data. Reliable and robust inference requires a guarantee that
extracted structures are typical for the data source, i.e., similar structures have to be
extracted from a second sample set of the same data source. Lack of robustness is known as
overfitting in the statistics and machine learning literature. Ghahramani characterizes the
overfitting phenomenon for a class of histogram clustering models which play a
prominent role in information retrieval, linguistics and computer vision applications.
Learning algorithms with robustness to sample fluctuations are derived from large
deviation results and the maximum entropy principle for the learning process.
24 New Advances in Machine Learning
• Linear Classifiers
Logistic Regression
Naïve Bayes Classifier
Perceptron
Support Vector Machine
• Quadratic Classifiers
• K-Means Clustering
• Boosting
• Decision Tree
Random Forest
• Neural networks
• Bayesian Networks
Linear Classifiers: In machine learning, the goal of classification is to group items that have
similar feature values into groups. Timothy et al (Timothy Jason Shepard, 1998) stated that
a linear classifier achieves this by making a classification decision based on the value of
the linear combination of the features. If the input feature vector to the classifier is
a real vector x, then the output score is

y = f(w · x) = f( Σ_j w_j x_j ),

where w is a real vector of weights and f is a function that converts the dot product of the
two vectors into the desired output. The weight vector w is learned from a set of labelled
training samples. Often f is a simple function that maps all values above a certain threshold
to the first class and all other values to the second class. A more complex f might give the
probability that an item belongs to a certain class.
For a two-class classification problem, one can visualize the operation of a linear classifier as
splitting a high-dimensional input space with a hyperplane: all points on one side of the
hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is
often used in situations where the speed of classification is an issue, since it is often the
fastest classifier, especially when x is sparse. However, decision trees can be faster. Also,
linear classifiers often work very well when the number of dimensions in x is large, as
in document classification, where each element in x is typically the number of counts of a
word in a document (see document-term matrix). In such cases, the classifier should be well-
regularized.
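As a concrete sketch of the scoring rule above (an illustration, not code from the chapter), the following Python fragment classifies an input by thresholding the dot product w · x; the weight values are hypothetical:

```python
def linear_classify(w, x, threshold=0.0):
    """Linear classifier sketch: compute the score from the dot product
    w . x, with a simple threshold function f mapping the score to one
    of two classes."""
    score = sum(wi * xi for wi, xi in zip(w, x))  # dot product w . x
    return 1 if score > threshold else 0

# Hypothetical weights, as if learned from labelled training samples:
w = [0.4, -0.2, 0.1]
print(linear_classify(w, [1.0, 0.5, 2.0]))  # score 0.5, above the threshold
print(linear_classify(w, [0.0, 2.0, 1.0]))  # score -0.3, below the threshold
```

A more complex f, such as the logistic function, would instead return the probability that the item belongs to the first class.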
A Two-Dimensional Example
Before considering N-dimensional hyperplanes, let’s look at a simple 2-dimensional
example. Assume we wish to perform a classification, and our data has a categorical target
variable with two categories. Also assume that there are two predictor variables with
continuous values. If we plot the data points using the value of one predictor on the X axis
and the other on the Y axis we might end up with an image such as shown below. One
category of the target variable is represented by rectangles while the other category is
represented by ovals.
In this idealized example, the cases with one category are in the lower left corner and the
cases with the other category are in the upper right corner; the cases are completely
separated. The SVM analysis attempts to find a 1-dimensional hyperplane (i.e. a line) that
separates the cases based on their target categories. There are an infinite number of possible
lines; two candidate lines are shown above. The question is which line is better, and how do
we define the optimal line.
The dashed lines drawn parallel to the separating line mark the distance between the
dividing line and the closest vectors to the line. The distance between the dashed lines is
called the margin. The vectors (points) that constrain the width of the margin are the support
vectors. The following figure illustrates this.
An SVM analysis (Luis Gonz, 2005) finds the line (or, in general, hyperplane) that is
oriented so that the margin between the support vectors is maximized. In the figure above,
the line in the right panel is superior to the line in the left panel.
If all analyses consisted of two-category target variables with two predictor variables, and
the cluster of points could be divided by a straight line, life would be easy. Unfortunately,
this is not generally the case, so SVM must deal with (a) more than two predictor variables,
(b) separating the points with non-linear curves, (c) handling the cases where clusters
cannot be completely separated, and (d) handling classifications with more than two
categories.
In this chapter, we shall explain three main machine learning techniques with their
examples and how they perform in reality. These are:
• K-Means Clustering
• Neural Network
• Self Organised Map
K-means (Bishop C. M., 1995) and (Tapas Kanungo, 2002) is one of the simplest
unsupervised learning algorithms that solve the well-known clustering problem. The
procedure follows a simple and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids,
one for each cluster. These centroids should be placed in a cunning way, because different
locations cause different results. So, the better choice is to place them as far away from
each other as possible. The next step is to take each point belonging to a given data set and
associate it to the nearest centroid. When no point is pending, the first step is completed and
an early grouping is done. At this point we need to re-calculate k new centroids as
barycentres of the clusters resulting from the previous step. After we have these k new
centroids, a new binding has to be done between the same data set points and the nearest
new centroid. A loop has been generated. As a result of this loop we may notice that the k
centroids change their location step by step until no more changes are made; in other words,
the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error
function:

J = Σ_{j=1..k} Σ_{i=1..n} || x_i^(j) − c_j ||²,

where || x_i^(j) − c_j ||² is a chosen distance measure between a data point x_i^(j) and the
cluster centre c_j; the objective function is an indicator of the distance of the n data points
from their respective cluster centres.
The algorithm is composed of the steps shown in figure 4.
Although it can be proved that the procedure will always terminate, the k-means algorithm
does not necessarily find the optimal configuration, corresponding to the global
objective function minimum. The algorithm is also significantly sensitive to the initial
randomly selected cluster centres. The k-means algorithm can be run multiple times to
reduce this effect. K-means is a simple algorithm that has been adapted to many problem
domains. As we are going to see, it is a good candidate for extension to work with fuzzy
feature vectors.
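The loop described above can be sketched in a few lines of Python; this is an illustrative implementation (not code from the chapter), with the initial centroids chosen at random from a hypothetical data set:

```python
import random

def kmeans(points, k, iterations=100):
    """Illustrative k-means: choose k initial centroids from the data,
    associate every point with its nearest centroid, recompute each
    centroid as the barycentre of its cluster, and loop until the
    centroids no longer move."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign p to the nearest centroid (squared Euclidean distance)
            nearest = min(range(k),
                          key=lambda j: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[j])))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated 2-D blobs (hypothetical data):
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0), (5.1, 4.9)]
centroids, clusters = kmeans(data, k=2)
```

Note that the empty-cluster case is handled by keeping the old centroid, one of the implementation annoyances mentioned below.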
An example
Suppose that we have n sample feature vectors x1, x2, ..., xn all from the same class, and we
know that they fall into k compact clusters, k < n. Let mi be the mean of the vectors in cluster
i. If the clusters are well separated, we can use a minimum-distance classifier to separate
them. That is, we can say that x is in cluster i if || x - mi || is the minimum of all the k
distances. This suggests the following procedure for finding the k means: make initial
guesses for the means m1, m2, ..., mk; then, until there are no changes in any mean, use the
estimated means to classify the samples into clusters, and replace each mi with the mean of
the samples in cluster i.
Here is an example showing how the means m1 and m2 move into the centers of two
clusters.
This is a simple version of the k-means procedure. It can be viewed as a greedy algorithm
for partitioning the n samples into k clusters so as to minimize the sum of the squared
distances to the cluster centers. It does have some weaknesses:
• The way to initialize the means was not specified. One popular way to start is to
randomly choose k of the samples.
• The results produced depend on the initial values for the means, and it frequently
happens that suboptimal partitions are found. The standard solution is to try a
number of different starting points.
• It can happen that the set of samples closest to mi is empty, so that mi cannot be
updated. This is an annoyance that must be handled in an implementation, but that
we shall ignore.
• The results depend on the metric used to measure || x - mi ||. A popular solution
is to normalize each variable by its standard deviation, though this is not always
desirable.
• The results depend on the value of k.
This last problem is particularly troublesome, since we often have no way of knowing how
many clusters exist. In the example shown above, the same algorithm applied to the same
data produces the following 3-means clustering. Is it better or worse than the 2-means
clustering?
Unfortunately there is no general theoretical solution to find the optimal number of clusters
for any given data set. A simple approach is to compare the results of multiple runs with
different k classes and choose the best one according to a given criterion.
The number of hidden units to use is
far from clear. As good a starting point as any is to use one hidden layer, with the
number of units equal to half the sum of the number of input and output units.
Again, we will discuss how to choose a sensible number later.
• Training Multilayer Perceptrons: Once the number of layers, and number of units
in each layer, has been selected, the network's weights and thresholds must be set
so as to minimize the prediction error made by the network. This is the role of
the training algorithms. The historical cases that you have gathered are used to
automatically adjust the weights and thresholds in order to minimize this error.
This process is equivalent to fitting the model represented by the network to the
training data available. The error of a particular configuration of the network can
be determined by running all the training cases through the network, comparing
the actual output generated with the desired or target outputs. The differences are
combined by an error function to give the network error. The most
common error functions are the sum squared error (used for regression problems),
where the individual errors of output units on each case are squared and summed,
and the cross entropy functions (used for maximum likelihood
classification).
In traditional modeling approaches (e.g., linear modeling) it is possible to
algorithmically determine the model configuration that absolutely minimizes this
error. The price paid for the greater (non-linear) modeling power of neural
networks is that although we can adjust a network to lower its error, we can never
be sure that the error could not be lower still.
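The two error functions mentioned above can be written down directly; this is a hedged sketch for vectors of target and actual output values, not code from the chapter:

```python
import math

def sum_squared_error(targets, outputs):
    """Sum squared error: square the individual output errors on each
    case and sum them (typically used for regression problems)."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

def cross_entropy(targets, outputs):
    """Two-class cross-entropy for outputs in (0, 1), as used for
    maximum-likelihood classification."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

print(sum_squared_error([1.0, 0.0], [0.8, 0.3]))  # 0.2**2 + 0.3**2, about 0.13
```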
A helpful concept here is the error surface. Each of the N weights and thresholds of the
network (i.e., the free parameters of the model) is taken to be a dimension in space.
The N+1th dimension is the network error. For any possible configuration of weights the
error can be plotted in the N+1th dimension, forming an error surface. The objective of
network training is to find the lowest point in this many-dimensional surface.
In a linear model with sum squared error function, this error surface is a parabola (a
quadratic), which means that it is a smooth bowl-shape with a single minimum. It is
therefore "easy" to locate the minimum.
Neural network error surfaces are much more complex, and are characterized by a number
of unhelpful features, such as local minima (which are lower than the surrounding terrain,
but above the global minimum), flat-spots and plateaus, saddle-points, and long narrow
ravines.
It is not possible to analytically determine where the global minimum of the error surface is,
and so neural network training is essentially an exploration of the error surface. From an
initially random configuration of weights and thresholds (i.e., a random point on the error
surface), the training algorithms incrementally seek the global minimum. Typically, the
gradient (slope) of the error surface is calculated at the current point, and used to make a
downhill move. Eventually, the algorithm stops in a low point, which may be a local
minimum (but hopefully is the global minimum).
The algorithm is also usually modified by inclusion of a momentum term: this encourages
movement in a fixed direction, so that if several steps are taken in the same direction, the
algorithm "picks up speed", which gives it the ability to (sometimes) escape local minima,
and also to move rapidly over flat spots and plateaus.
The algorithm therefore progresses iteratively, through a number of epochs. On each epoch,
the training cases are each submitted in turn to the network, and target and actual outputs
compared and the error calculated. This error, together with the error surface gradient, is
used to adjust the weights, and then the process repeats. The initial network configuration is
random, and training stops when a given number of epochs elapses, or when the error
reaches an acceptable level, or when the error stops improving (you can select which of
these stopping conditions to use).
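The gradient-plus-momentum update can be sketched on a one-dimensional quadratic error surface; the error function, learning rate, and momentum values below are illustrative assumptions, not taken from the chapter:

```python
def descend(grad, w0, rate=0.1, momentum=0.9, epochs=200):
    """Gradient descent with a momentum term: each step combines the
    downhill gradient with a fraction of the previous step, so repeated
    moves in the same direction accumulate speed."""
    w, velocity = w0, 0.0
    for _ in range(epochs):
        velocity = momentum * velocity - rate * grad(w)
        w += velocity
    return w

# Hypothetical error surface E(w) = (w - 3)^2, gradient 2(w - 3), minimum at w = 3:
w = descend(lambda w: 2 * (w - 3), w0=0.0)
```

With momentum the trajectory overshoots and oscillates before settling, which is exactly the behaviour that lets it roll through flat spots and shallow local minima.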
Consider, for example, the polynomials

y = 2x + 3
y = 3x² + 4x + 1
Different polynomials have different shapes, with larger powers (and therefore larger
numbers of terms) having steadily more eccentric shapes. Given a set of data, we may want
to fit a polynomial curve (i.e., a model) to explain the data. The data is probably noisy, so we
don't necessarily expect the best model to pass exactly through all the points. A low-order
polynomial may not be sufficiently flexible to fit close to the points, whereas a high-order
polynomial is actually too flexible, fitting the data exactly by adopting a highly eccentric
shape that is actually unrelated to the underlying function. See figure 4 below.
Neural networks have precisely the same problem. A network with more weights models a
more complex function, and is therefore prone to over-fitting. A network with fewer weights
may not be sufficiently powerful to model the underlying function. For example, a network
with no hidden layers actually models a simple linear function. How then can we select the
right complexity of network? A larger network will almost invariably achieve a lower error
eventually, but this may indicate over-fitting rather than good modeling.
The answer is to check progress against an independent data set, the selection set. Some of
the cases are reserved, and not actually used for training in the back propagation algorithm.
Instead, they are used to keep an independent check on the progress of the algorithm. It is
invariably the case that the initial performance of the network on training and selection sets
is the same (if it is not at least approximately the same, the division of cases between the two
sets is probably biased). As training progresses, the training error naturally drops, and
provided training is minimizing the true error function, the selection error drops too.
However, if the selection error stops dropping, or indeed starts to rise, this indicates that the
network is starting to overfit the data, and training should cease. When over-fitting occurs
during the training process like this, it is called over-learning. In this case, it is usually
advisable to decrease the number of hidden units and/or hidden layers, as the network is
over-powerful for the problem at hand. In contrast, if the network is not sufficiently
powerful to model the underlying function, over-learning is not likely to occur, and neither
training nor selection errors will drop to a satisfactory level.
The problems associated with local minima, and decisions over the size of network to use,
imply that using a neural network typically involves experimenting with a large number of
different networks, probably training each one a number of times (to avoid being fooled by
local minima), and observing individual performances. The key guide to performance here
is the selection error. However, following the standard scientific precept that, all else being
equal, a simple model is always preferable to a complex model, you can also select a smaller
network in preference to a larger one with a negligible improvement in selection error.
A problem with this approach of repeated experimentation is that the selection set plays a
key role in selecting the model, which means that it is actually part of the training process.
Its reliability as an independent guide to performance of the model is therefore
compromised - with sufficient experiments, you may just hit upon a lucky network that
happens to perform well on the selection set. To add confidence in the performance of the
final model, it is therefore normal practice (at least where the volume of training data allows
it) to reserve a third set of cases - the test set. The final model is tested with the test set data,
to ensure that the results on the selection and training set are real, and not artifacts of the
training process. Of course, to fulfill this role properly the test set should be used only once -
if it is in turn used to adjust and reiterate the training process, it effectively becomes
selection data!
This division into multiple subsets is very unfortunate, given that we usually have less data
than we would ideally desire even for a single subset. We can get around this problem by
resampling. Experiments can be conducted using different divisions of the available data
into training, selection, and test sets. There are a number of approaches to this division,
including random (monte-carlo) resampling, cross-validation, and the bootstrap. If we make
design decisions, such as the best configuration of neural network to use, based upon a
number of experiments with different subset examples, the results will be much more
reliable. We can then either use those experiments solely to guide the decision as to which
network types to use, and train such networks from scratch with new samples (this removes
any sampling bias); or, we can retain the best networks found during the sampling process,
but average their results in an ensemble, which at least mitigates the sampling bias.
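A single random (monte-carlo) division into training, selection, and test sets might be sketched as follows; repeating it with different seeds gives the resampled experiments described above (the 60/20/20 proportions are an assumption for illustration):

```python
import random

def split_cases(cases, train=0.6, selection=0.2, seed=None):
    """One random division of the available cases into training,
    selection, and test subsets; the remainder after the training and
    selection fractions becomes the test set."""
    rng = random.Random(seed)
    shuffled = cases[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_sel = int(len(shuffled) * selection)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_sel],
            shuffled[n_train + n_sel:])

train_set, selection_set, test_set = split_cases(list(range(100)), seed=1)
```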
To summarize, network design (once the input variables have been selected) follows a
number of stages:
• Select an initial configuration (typically, one hidden layer with the number of
hidden units set to half the sum of the number of input and output units).
• Iteratively conduct a number of experiments with each configuration, retaining
the best network (in terms of selection error) found. A number of experiments are
required with each configuration to avoid being fooled if training locates a local
minimum, and it is also best to resample.
• On each experiment, if under-learning occurs (the network doesn't achieve an
acceptable performance level) try adding more neurons to the hidden layer(s). If
this doesn't help, try adding an extra hidden layer.
• If over-learning occurs (selection error starts to rise) try removing hidden units
(and possibly layers).
• Once you have experimentally determined an effective configuration for your
networks, resample and generate new networks with that configuration.
• Data Selection: All the above stages rely on a key assumption. Specifically, the
training, selection and test data must be representative of the underlying model
(and, further, the three sets must be independently representative). The old
computer science adage "garbage in, garbage out" could not apply more strongly
than in neural modeling. If training data is not representative, then the model's
worth is at best compromised. At worst, it may be useless. It is worth spelling out
the kind of problems which can corrupt a training set:
The future is not the past. Training data is typically historical. If circumstances have
changed, relationships which held in the past may no longer hold. All eventualities must be
covered. A neural network can only learn from cases that are present. If people with
incomes over $100,000 per year are a bad credit risk, and your training data includes nobody
over $40,000 per year, you cannot expect it to make a correct decision when it encounters
one of the previously-unseen cases. Extrapolation is dangerous with any model, but some
types of neural network may make particularly poor predictions in such circumstances.
A network learns the easiest features it can. A classic (possibly apocryphal) illustration of
this is a vision project designed to automatically recognize tanks. A network is trained on a
hundred pictures including tanks, and a hundred not. It achieves a perfect 100% score.
When tested on new data, it proves hopeless. The reason? The pictures of tanks are taken on
dark, rainy days; the pictures without tanks on sunny days. The network learns to distinguish the
(trivial matter of) differences in overall light intensity. To work, the network would need
training cases including all weather and lighting conditions under which it is expected to
operate - not to mention all types of terrain, angles of shot, distances...
Unbalanced data sets. Since a network minimizes an overall error, the proportion of types of
data in the set is critical. A network trained on a data set with 900 good cases and 100 bad
will bias its decision towards good cases, as this allows the algorithm to lower the overall
error (which is much more heavily influenced by the good cases). If the representation of
good and bad cases is different in the real population, the network's decisions may be
wrong. A good example would be disease diagnosis. Perhaps 90% of patients routinely
tested are clear of a disease. A network is trained on an available data set with a 90/10 split.
It is then used in diagnosis on patients complaining of specific problems, where the
likelihood of disease is 50/50. The network will react over-cautiously and fail to recognize
disease in some unhealthy patients. In contrast, if trained on the "complainants" data, and
then tested on "routine" data, the network may raise a high number of false positives. In
such circumstances, the data set may need to be crafted to take account of the distribution of
data (e.g., you could replicate the less numerous cases, or remove some of the numerous
cases), or the network's decisions modified by the inclusion of a loss matrix (Bishop C. M.,
1995). Often, the best approach is to ensure even representation of different cases, then to
interpret the network's decisions accordingly.
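Replicating the less numerous cases, as suggested above, can be sketched as follows (an illustration under the assumption that each item is a (features, label) pair; the data values are hypothetical):

```python
import random

def balance(dataset, seed=None):
    """Even out a 900-good / 100-bad style imbalance by replicating the
    less numerous class at random until every class has as many items
    as the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for item in dataset:
        by_label.setdefault(item[1], []).append(item)
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced += items
        # replicate randomly chosen cases to reach the target count
        balanced += [rng.choice(items) for _ in range(target - len(items))]
    return balanced

data = [((i,), "good") for i in range(9)] + [((0,), "bad")]
print(len(balance(data, seed=0)))  # 18: nine of each class
```

The alternative mentioned in the text, a loss matrix, reweights errors instead of resampling cases.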
• Select the winning neuron (the one whose centre is nearest to the input case);
• Adjust the winning neuron to be more like the input case (a weighted sum of the
old neuron centre and the training case).
The algorithm uses a time-decaying learning rate, which is used to perform the weighted
sum and ensures that the alterations become more subtle as the epochs pass. This ensures
that the centres settle down to a compromise representation of the cases which cause
that neuron to win. The topological ordering property is achieved by adding the concept of
a neighbourhood to the algorithm. The neighbourhood is a set of neurons surrounding the
winning neuron. The neighbourhood, like the learning rate, decays over time, so that
initially quite a large number of neurons belong to the neighbourhood (perhaps almost the
entire topological map); in the latter stages the neighbourhood will be zero (i.e., consists
solely of the winning neuron itself). In the Kohonen algorithm, the adjustment of neurons is
actually applied not just to the winning neuron, but to all the members of the current
neighbourhood.
The effect of this neighbourhood update is that initially quite large areas of the network are
"dragged towards" training cases - and dragged quite substantially. The network develops a
crude topological ordering, with similar cases activating clumps of neurons in
the topological map. As epochs pass the learning rate and neighbourhood both decrease, so
that finer distinctions within areas of the map can be drawn, ultimately resulting in fine-
tuning of individual neurons. Often, training is deliberately conducted in two distinct
phases: a relatively short phase with high learning rates and neighbourhood, and a long
phase with low learning rate and zero or near-zero neighbourhoods.
Once the network has been trained to recognize structure in the data, it can be used as a
visualization tool to examine the data. The Win Frequencies Datasheet (counts of the number
of times each neuron wins when training cases are executed) can be examined to see if
distinct clusters have formed on the map. Individual cases are executed and the topological
map observed, to see if some meaning can be assigned to the clusters (this usually involves
referring back to the original application area, so that the relationship between clustered
cases can be established). Once clusters are identified, neurons in the topological map are
labelled to indicate their meaning (sometimes individual cases may be labelled, too). Once
the topological map has been built up in this way, new cases can be submitted to the
network. If the winning neuron has been labelled with a class name, the network can
perform classification. If not, the network is regarded as undecided.
SOFM networks also make use of the accept threshold, when performing classification.
Since the activation level of a neuron in a SOFM network is the distance of the neuron from
the input case, the accept threshold acts as a maximum recognized distance. If the activation
of the winning neuron is greater than this distance, the SOFM network is regarded as
undecided. Thus, by labelling all neurons and setting the accept threshold appropriately, a
SOFM network can act as a novelty detector (it reports undecided only if the input case is
sufficiently dissimilar to all radial units).
SOFM networks as expressed by Kohonen (Kohonen, 1997) are inspired by some known
properties of the brain. The cerebral cortex is actually a large flat sheet (about 0.5m squared;
it is folded up into the familiar convoluted shape only for convenience in fitting into the
skull!) with known topological properties (for example, the area corresponding to the hand
is next to the arm, and a distorted human frame can be topologically mapped out in two
dimensions on its surface).
the n-dimensional data (here it would be colour, which is 3-dimensional) into something
that can be better understood visually (in this case, a 2-dimensional image map).
In this case one would expect the dark blue and the greys to end up near each other on a
good map, and yellow close to both the red and the green. The second component of SOMs
is the weight vectors. Each weight vector has two components, which I have here
attempted to show in the image below. The first part of a weight vector is its data, which is of
the same dimensions as the sample vectors; the second part of a weight vector is its
natural location. The good thing about colour is that the data can be shown by displaying
the color, so in this case the color is the data, and the location is the x, y position of the pixel
on the screen.
In this example, a 2D array of weight vectors was used, and it would look like figure 5 above.
This picture is a skewed view of a grid where you have the n-dimensional array for each
weight and each weight has its own unique location in the grid. Weight vectors don’t
necessarily have to be arranged in 2 dimensions; a lot of work has been done using SOMs of
1 dimension, but the data part of the weight must be of the same dimensions as the sample
vectors. Weights are sometimes referred to as neurons, since SOMs are actually neural
networks.
SOM Algorithm. The way that SOMs go about organizing themselves is by competing for
representation of the sample vectors: the weight closest to a given sample wins it, where
closeness is measured by the Euclidean distance

dist = sqrt( Σ_i (x[i] − w[i])² ),

where x[i] is the data value at the ith data member of a sample and n is the number of
dimensions to the sample vectors.
In the case of colour, we can think of the colours as 3D points, each component being an axis.
If we have chosen green, which has the value (0, 6, 0), the colour light green (3, 6, 3) will be
closer to green than red at (6, 0, 0).
So light green is now the best matching unit, but this operation of calculating distances and
comparing them is done over the entire map, and the weight with the shortest distance to the
sample vector is the winner and the BMU. The square root is not computed in the java
program in this section, as a speed optimization.
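The colour example can be checked numerically; this sketch computes the Euclidean distance and picks the BMU (illustrative code, distinct from the java program mentioned above):

```python
import math

def distance(sample, weight):
    """Euclidean distance between a sample vector and a weight vector:
    dist = sqrt( sum_i (x[i] - w[i])^2 )."""
    return math.sqrt(sum((x - w) ** 2 for x, w in zip(sample, weight)))

def best_matching_unit(sample, weights):
    """The BMU is the weight with the shortest distance to the sample."""
    return min(weights, key=lambda w: distance(sample, w))

green, light_green, red = (0, 6, 0), (3, 6, 3), (6, 0, 0)
print(best_matching_unit(green, [light_green, red]))  # light green wins
```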
• C. Scale Neighbors
1. Determining Neighbors
There are actually two parts to scaling the neighboring weights: determining which
weights are considered as neighbors, and how much each weight can become more
like the sample vector. The neighbors of a winning weight can be determined using
a number of different methods. Some use concentric squares, others hexagons; I
opted to use a gaussian function where every point with a value above zero is
considered a neighbor.
As mentioned previously, the number of neighbors decreases over time. This is
done so samples can first move to an area where they will probably be, and then
jockey for position. This process is similar to coarse adjustment followed by fine
tuning. The function used to decrease the radius of influence does not really matter
as long as it decreases; we just used a linear function.
Figure 8 above shows a plot of the function used. As time progresses, the base moves
towards the centre, so there are fewer neighbours as time progresses. The initial radius is set
very high, to some value near the width or height of the map.
2. Learning
The second part to scaling the neighbours is the learning function. The winning
weight is rewarded with becoming more like the sample vector. The neighbours
also become more like the sample vector. An attribute of this learning process is
that the farther away the neighbour is from the winning vector, the less it learns.
The rate at which the amount a weight can learn decreases and can also be set to
whatever you want. I chose to use a gaussian function. This function will return a
value ranging between 0 and 1, where each neighbor is then changed using the
parametric equation. The new color is:

colour_new = colour_current + t · (colour_sample − colour_current)
So in the first iteration, the best matching unit will get a t of 1 for its learning
function, so the weight will then come out of this process with the same exact
values as the randomly selected sample.
Just as the number of neighbors a weight has falls off, the amount a weight can learn also
decreases with time. On the first iteration, the winning weight becomes the sample vector,
since t has its full range from 0 to 1. Then, as time progresses, the winning weight becomes
only slightly more like the sample, as the maximum value of t decreases. The rate at which
44 New Advances in Machine Learning
the amount a weight can learn falls off linearly. To depict this visually, in the previous plot
the amount a weight can learn is equivalent to how high the bump is at its location. As
time progresses, the height of the bump decreases. Adding this function to the
neighbourhood function results in the height of the bump going down while the base of
the bump shrinks.
So once a weight is determined to be the winner, the neighbours of that weight are found, and
each of those neighbours, in addition to the winning weight, changes to become more like the
sample vector.
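The winner-selection and neighborhood-update steps described above can be sketched in a few lines. This is a minimal illustration, not the chapter's actual Java program; the function names and the 2x2 toy map are hypothetical, and as in the text the square root is omitted when finding the winner.

```python
import math
import random

def best_matching_unit(weights, sample):
    """Return the grid coordinates of the weight vector closest to the sample.
    Squared distance is enough: omitting the square root gives the same winner."""
    best, best_dist = None, float("inf")
    for cell, w in weights.items():
        d = sum((wi - si) ** 2 for wi, si in zip(w, sample))
        if d < best_dist:
            best, best_dist = cell, d
    return best

def update(weights, sample, bmu, radius, learn_rate):
    """Move the BMU and its Gaussian neighborhood toward the sample:
    new = current + t * (sample - current), with t largest at the BMU."""
    bi, bj = bmu
    for (i, j), w in weights.items():
        grid_dist2 = (i - bi) ** 2 + (j - bj) ** 2
        # Gaussian neighborhood: every cell with a value above ~0 is a neighbor
        t = learn_rate * math.exp(-grid_dist2 / (2.0 * radius ** 2))
        weights[(i, j)] = [wi + t * (si - wi) for wi, si in zip(w, sample)]

# tiny 2x2 map of RGB weights (toy example)
random.seed(0)
weights = {(i, j): [random.random() for _ in range(3)]
           for i in range(2) for j in range(2)}
sample = [1.0, 0.0, 0.0]  # pure red
bmu = best_matching_unit(weights, sample)
update(weights, sample, bmu, radius=1.0, learn_rate=1.0)
```

With learn_rate 1, t is exactly 1 at the BMU, so after the update the winning weight equals the sample, matching the first-iteration behaviour described in the text. In a full run, both radius and learn_rate would shrink linearly with time.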
There is a very simple method for displaying where similarities lie and where they do not.
To compute this, we go through all the weights and determine how similar the
neighbors are. This is done by calculating the distance between the weight vector of
each weight and those of each of its neighbors. A color is then
assigned to that location based on the average of these distances. This procedure is located in
Screen.java and is named public void update_bw().
If the average distance is high, then the surrounding weights are very different, and a
dark color is assigned to the location of the weight. If the average distance is low, a lighter
color is assigned. In the center of a blob the colours are the same, so it should
be white, since all the neighbors are the same color. In areas between blobs where there are
similarities, it should be not white but a light grey. Areas where the blobs are physically
close to each other but are not similar at all should be black. See Figure 8 below.
As shown above, the ravines of black show where colours may be physically close to each
other on the map, but very different from each other in the actual values
of the weights. Areas with a light grey between the blobs represent a true
similarity. In the pictures above, in the bottom right there is black surrounded by colours
that are not very similar to it. When looking at the black-and-white similarity SOM, it
shows that black is not similar to the other colours, because there are lines of black
representing no similarity between those colours. Also, in the top corner there is pink, and
nearby is a light green; these are not very near each other in reality, but they are near each other on
the colored SOM. Looking at the black-and-white SOM, it clearly shows that the two are not
very similar, by having black in between the two colours.
With the average distances used to make the black-and-white map, we can actually assign
each SOM a value that determines how well the image represents the similarities of the
samples, by simply adding these averages together.
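The average-distance map can be sketched as follows. This is a hypothetical re-implementation of the idea behind update_bw(), not the chapter's Java code: each cell averages the distance to its grid neighbors, and a high average maps to a dark (low) gray level.

```python
import math

def similarity_map(weights, rows, cols):
    """For each cell, average the distance to its grid neighbors and turn it
    into a gray level: 1.0 = white (similar neighbors), 0.0 = black."""
    avg = {}
    for i in range(rows):
        for j in range(cols):
            dists = []
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    dists.append(math.dist(weights[(i, j)], weights[(ni, nj)]))
            avg[(i, j)] = sum(dists) / len(dists)
    peak = max(avg.values()) or 1.0
    return {cell: 1.0 - d / peak for cell, d in avg.items()}

# toy map: left two columns red-ish, right column blue-ish (two "blobs")
weights = {
    (0, 0): [1.0, 0.0, 0.0], (0, 1): [0.9, 0.0, 0.0], (0, 2): [0.0, 0.0, 1.0],
    (1, 0): [1.0, 0.1, 0.0], (1, 1): [0.9, 0.1, 0.0], (1, 2): [0.0, 0.1, 1.0],
}
gray = similarity_map(weights, rows=2, cols=3)
```

Cells inside the red blob come out near white, while the cells on the red/blue boundary come out dark, which is exactly the black-ravine effect the text describes. Summing the averages gives the single quality number mentioned above.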
• Probably the best thing about SOMs is that they are very easy to understand. It's very
simple: if two areas are close together and there is grey connecting them, then they are
similar. If there is a black ravine between them, then they are different. Unlike
Multidimensional Scaling or N-land, people can quickly pick up on how to use
them in an effective manner.
• Another great thing is that they work very well. As I have shown, they classify
data well and can then easily be evaluated for their own quality, so you can actually
calculate how good a map is and how strong the similarities between objects are.
• One major problem with SOMs is getting the right data. Unfortunately, you need a
value for each dimension of each member of the samples in order to generate a map.
Sometimes this simply is not possible, and often it is very difficult to acquire all of
this data, so this is a limiting feature of the use of SOMs, often referred to as missing
data.
• Another problem is that every SOM is different and finds different similarities
among the sample vectors. SOMs organize sample data so that, in the final product,
the samples are usually surrounded by similar samples; however, similar samples
are not always near each other. If you have a lot of shades of purple, you will not
always get one big group with all the purples in one cluster; sometimes the
clusters will be split and there will be two groups of purple. Using colours, we
could tell that those two groups are in reality similar and just got split,
but with most data those two clusters will look totally unrelated. So a lot of maps
need to be constructed in order to get one final good map.
• The final major problem with SOMs is that they are very computationally
expensive. This is a major drawback: as the dimensionality of the data increases,
dimension-reduction visualization techniques become more important, but
unfortunately the time to compute them also increases. For calculating the black-
and-white similarity map, the more neighbours you use to calculate the distance,
the better the similarity map you will get, but the number of distances the algorithm
needs to compute increases exponentially.
Methods for Pattern Classification
1. Introduction
Pattern classification is the task of classifying an object into one of a set of given categories,
called classes. For a specific pattern classification problem, a classifier is a piece of computer
software developed so that objects (x) are classified correctly with reasonably good accuracy.
Through training with input-output pairs, a classifier acquires decision functions that classify
an input datum into one of the given classes (ω_i). In pattern recognition applications we rarely,
if ever, have complete knowledge of the probabilistic structure of the problem, such as the prior
probability P(ω_i) or the class-conditional density p(x | ω_i). In a typical case we merely have some vague,
general knowledge about the situation, together with a number of design samples or training
data: particular representatives of the patterns we want to train the classifier to classify. The problem,
then, is to find some way to use this information to design or train the classifier.
This chapter is organized to address first those cases where a great deal of information
about the models is known, and to move toward problems where the forms of the distributions
are unknown and even the category membership of the training patterns is unknown. We begin in
Sec.2 (Bayes Decision Theory) by considering the ideal case in which the probability structure
underlying the categories is known perfectly. In Sec.3 (Maximum Likelihood) we address the
case when the full probability structure underlying the categories is not known, but the
general forms of the distributions are; the uncertainty about a probability
distribution is thus represented by the values of some unknown parameters, and we seek to
determine these parameters to attain the best categorization. In Sec.4 (Nonparametric
Techniques) we move yet further from the Bayesian ideal, and assume that we have no prior
parameterized knowledge about the underlying probability structure; in essence, our
classification will be based on information provided by training samples alone. Classic
techniques such as the nearest-neighbor algorithm and potential functions play an important
role here. We then, in Sec.5 (Support Vector Machine), consider linear discriminant functions
and the classifiers built from them. Next, in Sec.6 (Nonlinear Discriminants
and Neural Networks) we see how some of the ideas from such linear discriminants can be
extended to a class of very powerful algorithms, such as backpropagation and others for
multilayer neural networks; these neural techniques have a range of useful properties that
have made them a mainstay in contemporary pattern recognition research. In Sec.7 (Stochastic
Methods) we discuss simulated annealing, the Boltzmann learning algorithm and other
stochastic methods, and explore the behaviour of such algorithms with regard to the
local minima that can plague other neural methods. We conclude in Sec.8 (Unsupervised Learning and
Clustering) by addressing the case when the input training patterns are not labelled, so that our
recognizer must determine the cluster structure itself.
P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)   (1)

p(x) = Σ_{j=1}^{c} p(x | ω_j) P(ω_j)   (2)
In Eq. (1), p(x) is a scale factor, unimportant for our problem. By using Eq. (1), we can
instead express the rule in terms of the conditional and prior probabilities, noting that
P(ω_1 | x) + P(ω_2 | x) = 1. By eliminating this scale factor, we obtain the following completely
equivalent decision rule: decide ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2), and otherwise decide ω_2.
While the two-category case is just a special instance of the multi-category case, it has
traditionally received separate treatment. Indeed, a classifier that places a pattern in one of
only two categories has a special name: a dichotomizer. Instead of using two
discriminant functions g_1 and g_2 and assigning x to ω_1 if g_1 > g_2, it is more common
to define a single discriminant function
Methods for Pattern Classification 51
g(x) ≡ g_1(x) − g_2(x),   (5)

and decide ω_1 if g(x) > 0; otherwise decide ω_2.
P(ω_j | x) = p(x | ω_j) P(ω_j) / p(x)   (8)

p(x) = Σ_{j=1}^{c} p(x | ω_j) P(ω_j)   (9)
A Bayes classifier is easily and naturally represented in this way. For the minimum-error-
rate case, we can simplify things further by taking g_i(x) = P(ω_i | x), so that the maximum
discriminant function corresponds to the maximum posterior probability.
Clearly, the choice of discriminant functions is not unique. We can always multiply all the
discriminant functions by the same positive constant or shift them by the same additive
constant without influencing the decision. More generally, if we replace every g_i(x) by
f(g_i(x)), where f(·) is a monotonically increasing function, the resulting classification is
unchanged. This observation can lead to significant analytical and computational
simplifications. In particular, for minimum-error-rate classification, any of the following
choices gives identical classification results, but some can be much simpler to understand or
to compute than others:
g_i(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j) P(ω_j)   (10)
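A minimum-error-rate classifier built from these discriminant functions can be sketched as follows. This is a toy illustration with one-dimensional Gaussian class-conditional densities; the function names and the particular class parameters are assumptions, not from the chapter. The log of Eq. (10)'s numerator is used, a monotone transform that leaves the decision unchanged.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate class-conditional density p(x | omega_i)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def classify(x, classes):
    """classes: list of (prior, mu, sigma) tuples.
    Returns the index i maximizing g_i(x) = ln p(x|omega_i) + ln P(omega_i),
    a monotonically increasing transform of the posterior in Eq. (10)."""
    scores = [math.log(gaussian_pdf(x, mu, s)) + math.log(p)
              for p, mu, s in classes]
    return scores.index(max(scores))

# two classes with equal priors and symmetric means
classes = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
pred_left = classify(-2.0, classes)
pred_right = classify(2.0, classes)
```

With equal priors and equal variances, the decision boundary falls midway between the means, so points left of zero go to the first class and points right of zero to the second.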
3. Maximum-likelihood Method
It is important to distinguish between supervised learning and unsupervised learning. In
both cases, samples x are assumed to be obtained by selecting a state of nature ω_i with
probability P(ω_i), and then independently selecting x according to the probability law
p(x | ω_i). The distinction is that with supervised learning we know the state of nature (class
label) for each sample, whereas with unsupervised learning we do not. As one would expect,
the problem of unsupervised learning is the more difficult one. In this section we shall
consider only the supervised case, deferring consideration of unsupervised learning to
Section 8.
The problem of parameter estimation is a classical one in statistics, and it can be approached
in several ways. We shall consider two common and reasonable procedures, maximum
likelihood estimation and Bayesian estimation. Although the results obtained with these two
procedures are frequently nearly identical, the approaches are conceptually quite different.
Maximum likelihood and several other methods view the parameters as quantities whose
values are fixed but unknown. The best estimate of their value is defined to be the one that
maximizes the probability of obtaining the samples actually observed. In contrast, Bayesian
methods view the parameters as random variables having some known a priori distribution.
Observation of the samples converts this to a posterior density, thereby revising our opinion
about the true values of the parameters. In the Bayesian case, we shall see that a typical
effect of observing additional samples is to sharpen the a posteriori density function,
causing it to peak near the true values of the parameters. This phenomenon is known as
Bayesian learning. In either case, we use the posterior densities for our classification rule, as
we have seen before.
If the samples in D are drawn independently, then

p(D | θ) = Π_{k=1}^{n} p(x_k | θ).   (13)

Let θ = (θ_1, …, θ_p)^t, and let ∇_θ denote the gradient operator

∇_θ = (∂/∂θ_1, …, ∂/∂θ_p)^t.   (14)

We also define the log-likelihood function

L(θ) = ln p(D | θ).   (15)

We can then write our solution formally as the argument θ that maximizes the log-
likelihood, i.e.,

θ̂ = arg max_θ L(θ)   (16)
where the dependence on the data set D is implicit. Thus we have from Eq. (13)

L(θ) = Σ_{k=1}^{n} ln p(x_k | θ)   (17)

and

∇_θ L = Σ_{k=1}^{n} ∇_θ ln p(x_k | θ).   (18)

Thus, a set of necessary conditions for the maximum likelihood estimate for θ can be
obtained from the set of p equations

∇_θ L = 0.   (19)
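For a univariate Gaussian, Eq. (19) can be solved in closed form: setting the gradient of the log-likelihood to zero gives the sample mean and the (biased) sample variance as the maximum likelihood estimates. A minimal sketch, with a hypothetical function name:

```python
def gaussian_mle(samples):
    """Closed-form solution of the likelihood equations (19) for a
    univariate Gaussian: mu_hat = sample mean, var_hat = mean squared
    deviation (the biased MLE of the variance, dividing by n, not n-1)."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, var

mu, var = gaussian_mle([2.0, 4.0, 6.0])
```

For the three samples above, the estimates are mu = 4 and var = 8/3; with more samples, both estimates sharpen around the true parameter values, in line with the discussion of maximum likelihood in the text.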
P(ω_i | x, D) = p(x | ω_i, D) P(ω_i | D) / Σ_{j=1}^{c} p(x | ω_j, D) P(ω_j | D)   (20)
As this equation suggests, we can use the information provided by the training samples to
help determine both the class-conditional densities and the a priori probabilities.
Although we could maintain this generality, we shall henceforth assume that the true values
of the a priori probabilities are known or obtainable from a trivial calculation; thus we
substitute P(ω_i) = P(ω_i | D). Furthermore, since we are treating the supervised case, we can
separate the training samples by class into c subsets D_1, …, D_c, with the samples in D_i
belonging to ω_i. As we mentioned when addressing maximum likelihood methods, in most
cases of interest (and in all of the cases we shall consider), the samples in D_i have no
influence on p(x | ω_j, D) if i ≠ j. This has two simplifying consequences. First, it allows us
to work with each class separately, using only the samples in D_i to determine p(x | ω_i, D).
Used in conjunction with our assumption that the prior probabilities are known, this allows
us to write Eq. (20) as
P(ω_i | x, D) = p(x | ω_i, D_i) P(ω_i) / Σ_{j=1}^{c} p(x | ω_j, D_j) P(ω_j)   (21)
Second, because each class can be treated independently, we can dispense with needless
class distinctions and simplify our notation. In essence, we have c separate problems of the
following form: use a set D of samples drawn independently according to the fixed but
unknown probability distribution p(x) to determine p(x | D). This is the central problem of
Bayesian learning.
4. Nonparametric Techniques
In the last section we treated supervised learning under the assumption that the forms of the
underlying density functions were known. But in most pattern recognition applications the
common parametric forms rarely fit the densities actually encountered. In this section we shall
examine nonparametric procedures that can be used with arbitrary distributions and without the
assumption that the forms of the underlying densities are known.
There are several types of nonparametric methods of interest in pattern recognition. One is
to estimate the density functions p(x | ω_j) from samples; the estimates can then be substituted
for the true densities. Another is to estimate the a posteriori probabilities P(ω_j | x) directly,
as in the nearest-neighbor rule. Finally, there are nonparametric procedures for transforming the
feature space in the hope that it may be possible to employ parametric methods in the
transformed space.
Suppose the region containing x has volume V_n and captures k_n of the n samples. The
following is an obvious estimate for p(x):

p_n(x) = (k_n / n) / V_n.   (22)

If the region is a d-dimensional hypercube with edge length h_n, its volume is

V_n = h_n^d,   (23)

and we can define the window function

φ(u) = 1 if |u_j| ≤ 1/2 for j = 1, …, d; 0 otherwise.   (24)

The number of samples falling in the hypercube centred at x is then

k_n = Σ_{i=1}^{n} φ((x − x_i) / h_n),   (25)

and substituting into Eq. (22) gives

p_n(x) = (1/n) Σ_{i=1}^{n} (1/V_n) φ((x − x_i) / h_n).   (26)
Eq. (26) expresses the estimate for p(x) as an average of functions of x and the samples x_i. In
essence, the window function is being used for interpolation, each sample contributing to
the estimate in accordance with its distance from x.
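The Parzen-window estimate of Eq. (26) can be sketched directly. This is a one-dimensional toy, so V_n = h_n; the hypercube window of Eq. (24) reduces to an indicator on |u| ≤ 1/2. The function names and sample values are illustrative assumptions.

```python
def phi(u):
    """Hypercube window of Eq. (24), in one dimension."""
    return 1.0 if abs(u) <= 0.5 else 0.0

def parzen_estimate(x, samples, h):
    """Eq. (26): p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h),
    with V_n = h in one dimension."""
    n = len(samples)
    return sum(phi((x - xi) / h) for xi in samples) / (n * h)

samples = [0.0, 0.1, 0.2, 1.0]
p = parzen_estimate(0.1, samples, h=0.5)
```

At x = 0.1 with h = 0.5, the window of width 0.5 catches the three clustered samples but not the outlier at 1.0, so the estimate is 3 / (4 × 0.5) = 1.5. Shrinking h sharpens the estimate into the "noisy" pulses described below; enlarging h smooths it into the "out-of-focus" form.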
It is natural to ask that the estimate p_n(x) be a legitimate density function. We can require
that

φ(x) ≥ 0   (27)

and

∫ φ(u) du = 1.   (28)

If we maintain the relation V_n = h_n^d, then p_n(x) also satisfies these conditions.
Define δ_n(x) by

δ_n(x) = (1/V_n) φ(x / h_n).   (29)

Then p_n(x) can be written as the average

p_n(x) = (1/n) Σ_{i=1}^{n} δ_n(x − x_i).   (30)
Since V_n = h_n^d, h_n clearly affects both the amplitude and the width of δ_n(x). If h_n is very
large, the amplitude of δ_n(x) is small, and x must be far from x_i before δ_n(x − x_i) changes
much from δ_n(0). In this case, p_n(x) is the superposition of n broad, slowly changing
functions and is a very smooth, "out-of-focus" estimate of p(x). On the other hand, if h_n is
very small, the peak value of δ_n(x − x_i) is large and occurs near x = x_i. In this case p_n(x) is the
superposition of n sharp pulses centred at the samples: an erratic, "noisy" estimate. For
any value of h_n, the distribution is normalized, that is,
∫ δ_n(x − x_i) dx = (1/V_n) ∫ φ((x − x_i) / h_n) dx = ∫ φ(u) du = 1.   (31)
If we let V_n slowly approach zero as n increases, p_n(x) converges to the unknown density
p(x). The estimate p_n(x) has some mean p̄_n(x) and variance σ_n²(x), and p_n(x) converges to p(x) if

lim_{n→∞} p̄_n(x) = p(x)   (32)

and

lim_{n→∞} σ_n²(x) = 0.   (33)
To prove convergence we must place conditions on the unknown density p(x), on the
window function φ(u), and on the window width h_n. In general, continuity of p(·) at x is
required, and the conditions imposed by Eqs. (27) & (28) are customarily invoked. With care,
it can be shown that the following additional conditions assure convergence:

sup_u φ(u) < ∞   (34)

lim_{||u||→∞} φ(u) Π_{i=1}^{d} u_i = 0   (35)

lim_{n→∞} V_n = 0   (36)

lim_{n→∞} n V_n = ∞   (37)

Equations (34) & (35) keep φ(·) well behaved, and are satisfied by most density functions that
one might think of using for window functions. Equations (36) & (37) state that the volume
V_n must approach zero, but at a rate slower than 1/n. We shall now see why these are the
basic conditions for convergence.
A related approach is to let the cell volume be a function of the training data: to estimate
p(x) from n samples, centre a cell about x and let it grow until it captures k_n samples, where
k_n is some specified function of n. Then

p_n(x) = (k_n / n) / V_n,   (38)

and the conditions lim_{n→∞} k_n = ∞ and lim_{n→∞} k_n / n = 0 are necessary and sufficient for
p_n(x) to converge to p(x) in probability at all points where p(x) is continuous. If we take
k_n = √n and assume that p_n(x) is a reasonably good approximation to p(x), we then see from
Eq. (38) that V_n ≈ 1 / (√n p(x)). Thus V_n again has the form V_1 / √n, but the initial volume V_1
is determined by the nature of the data rather than by some arbitrary choice on our part. Note
that there are nearly always discontinuities in the slopes of these estimates, and these lie away
from the prototypes themselves.
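The k_n-nearest-neighbor estimate of Eq. (38) can be sketched in one dimension, where the "volume" of the cell grown about x is the length 2r, with r the distance to the k-th nearest sample. The function name and sample data are illustrative assumptions.

```python
def knn_density(x, samples, k):
    """Eq. (38): grow the cell about x until it captures k samples.
    In one dimension the cell volume is 2*r, where r is the distance
    to the k-th nearest sample."""
    dists = sorted(abs(x - xi) for xi in samples)
    r = dists[k - 1]
    return (k / len(samples)) / (2 * r)

samples = [0.0, 0.2, 0.4, 1.0, 2.0]
p = knn_density(0.1, samples, k=2)
```

At x = 0.1 the two nearest samples are 0.0 and 0.2, so r = 0.1, V = 0.2, and the estimate is (2/5) / 0.2 = 2.0. The estimate is continuous but its slope jumps wherever the identity of the k-th nearest sample changes, matching the remark above about discontinuities in slope.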
5. Support Vector Machine
Consider a decision function that is linear in x:

g_i(x) = w^T x + b,   (39)

where w is an m-dimensional vector and b is a bias term. If one class lies on the positive
side of the hyperplane, i.e., g_i(x) > 0, and the other class lies on the negative side, the given
problem is said to be linearly separable.
For a multi-class problem, x is classified into class arg max_i g_i(x), where arg returns the
subscript with the maximum value of g_i(x). If more than one decision function takes the same
maximum value for x, namely, x is on a class boundary, it is not classifiable.
In the following we discuss several methods to obtain the direct decision functions for
multi-class problems.
The first approach is to determine the decision functions by the one-against-all formulation.
We determine the i-th decision function g_i(x), i = 1, …, n, so that when x belongs to class i,

g_i(x) > 0,   (41)

and when x belongs to one of the classes i + 1, …, n,

g_i(x) < 0.   (44)

In classifying x, starting from g_1(x), we find the first positive g_i(x) and classify x into
class i. If there is no such i among the g_i(x) (i = 1, …, n), we classify x into class n.
The second approach is the pairwise formulation. For classes i and j we determine the
decision function g_ij(x) (i ≠ j; i, j = 1, …, n) so that

g_ij(x) > 0   (45)

when x belongs to class i, and

g_ij(x) < 0   (46)

when x belongs to class j. For each class we then compute

g_i(x) = Σ_{j≠i, j=1}^{n} sign(g_ij(x)),   (47)

where

sign(x) = 1 for x ≥ 0, and −1 for x < 0,   (48)

and we classify x into the class with the maximum g_i(x). By this formulation also,
unclassifiable regions exist if g_i(x) takes the maximum value for more than one class. These
can be resolved by a decision-tree formulation or by introducing membership functions.
The fourth approach is to use error-correcting codes for encoding the outputs. The one-against-all
formulation is a special case of error-correcting codes with no error-correcting capability, and
so is the pairwise formulation, if "don't care" bits are introduced.
The fifth approach is to determine the decision functions all at once. Namely, we determine the
decision functions g_i(x) so that x is classified into class i when

g_i(x) > g_j(x) for j ≠ i, j = 1, …, n.   (49)

In this formulation we need to determine the n decision functions all at once. This results in
solving a problem with a larger number of variables than the previous methods. Unlike the one-
against-all and pairwise formulations, there is no unclassifiable region.
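The sequential one-against-all rule described above can be sketched directly from Eqs. (39), (41) and (44). This is an illustrative toy with hand-chosen linear decision functions on a line, not a trained classifier; the function names are assumptions.

```python
def g(w, b, x):
    """Linear decision function of Eq. (39): g(x) = w^T x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify_one_against_all(decision_fns, x):
    """Sequential one-against-all rule: starting from g_1, classify x into
    the first class whose g_i(x) is positive; if none is positive,
    classify x into the last class n."""
    for i, (w, b) in enumerate(decision_fns):
        if g(w, b, x) > 0:
            return i
    return len(decision_fns)  # fallback: class n

# three classes on a line: class 0 for x < -1, class 1 for -1 <= x < 1,
# class 2 (the fallback) otherwise
decision_fns = [([-1.0], -1.0),   # g_0(x) = -x - 1 > 0  iff  x < -1
                ([-1.0], 1.0)]    # g_1(x) = -x + 1 > 0  iff  x < 1
```

Because classification is sequential, g_1 is only consulted after g_0 is negative, so its positive region is effectively clipped to −1 ≤ x < 1; points with no positive decision function fall into the last class, and no unclassifiable region remains.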
Determination of decision functions using input-output pairs is called training. In training a
multilayer neural network for a two-class problem, we can determine a direct decision
function if we set one output neuron instead of two. But because for an n-class problem we
set n output neurons with the i th neuron corresponding to the class i decision function, the
obtained functions are indirect. Similarly, decision functions for fuzzy classifiers are indirect
because membership functions are defined for each class. Conventional training methods
determine the indirect decision functions so that each training input is correctly classified
into the class designated by the associated training output. Assuming that the circles and
rectangles are training data for Classes 1 and 2, respectively, even if the decision function
g_2(x) moves to the right, as shown by the dotted curve, the training data are still correctly
classified. Thus there are infinitely many positions of the decision functions that
correctly classify the training data. Although the generalization ability is directly affected by
these positions, conventional training methods do not consider this.
Consider a two-class problem with M training pairs (x_i, y_i), where y_i = 1 if x_i belongs to
Class 1 and y_i = −1 if it belongs to Class 2, and the decision function

D(x) = w^T x + b.   (50)

If the training data are linearly separable,

w^T x_i + b > 0 for y_i = 1, and w^T x_i + b < 0 for y_i = −1.   (51)

Because the training data are linearly separable, no training data satisfy w^T x_i + b = 0. Thus,
to control separability, instead of (51) we consider the following inequalities:

w^T x_i + b ≥ 1 for y_i = 1, and w^T x_i + b ≤ −1 for y_i = −1.   (52)

Equation (52) is equivalent to

y_i (w^T x_i + b) ≥ 1 for i = 1, …, M.   (53)

The hyperplane

D(x) = w^T x + b = c for −1 < c < 1   (54)

forms a separating hyperplane that correctly separates the two classes.
The distance from the separating hyperplane to the nearest training sample is called the
margin. Assuming that the hyperplane D(x) = 0 has the maximum margin δ, all the training
data must satisfy

y_k D(x_k) / ||w|| ≥ δ for k = 1, …, M.   (56)

If we impose the normalization

δ ||w|| = 1,   (57)

then from (56) and (57), to find the optimal separating hyperplane we need to find the w with
the minimum Euclidean norm that satisfies (52). Therefore, the optimal separating hyperplane
can be obtained by minimizing

Q(w) = (1/2) ||w||²   (58)

with respect to w and b, subject to the constraints

y_i (w^T x_i + b) ≥ 1 for i = 1, …, M.   (59)

Here, the square of the Euclidean norm ||w|| in (58) is used to make the optimization problem
quadratic programming. The assumption of linear separability means that there exist w and
b that satisfy (59). We call the solutions that satisfy (59) feasible solutions. Because the
optimization problem has a quadratic objective function with inequality constraints,
even if the solutions are non-unique, the value of the objective function is unique.
Thus non-uniqueness is not a problem for support vector machines. This is one of the
advantages of support vector machines over neural networks, which have numerous local
minima.
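In practice, the constrained problem (58)-(59) is usually solved by quadratic programming. As a self-contained sketch, the same linear decision boundary can be approximated by subgradient descent on the regularized hinge loss, a standard soft-margin relaxation of (58)-(59); this substitute method, and all names and hyperparameters below, are assumptions for illustration, not the chapter's procedure.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(data, labels, lam=0.001, eta=0.05, epochs=500):
    """Subgradient descent on lam/2 * ||w||^2 + average hinge loss,
    a soft-margin approximation to minimizing (58) subject to (59)."""
    w, b = [0.0] * len(data[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(data, labels):
            decay = 1.0 - eta * lam           # shrink w: gradient of lam/2*||w||^2
            if y * (dot(w, x) + b) < 1.0:     # margin violated: hinge subgradient
                w = [wj * decay + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
            else:
                w = [wj * decay for wj in w]
    return w, b

# a small linearly separable set, labels y_i in {-1, +1} as in (51)
data = [[-2.0, 0.0], [-1.5, 0.5], [1.5, -0.5], [2.0, 0.0]]
labels = [-1, -1, 1, 1]
w, b = train_linear_svm(data, labels)
preds = [1 if dot(w, x) + b > 0 else -1 for x in data]
```

On separable data with a small regularizer, the minimizer pushes all margins y_i(w^T x_i + b) toward at least 1, approximating the maximum-margin hyperplane; unlike a neural network, the underlying objective is convex, so there are no local minima to worry about.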
The method is still effective for classifying samples drawn from distributions with distinct
variances. Let us consider a new simulation data set, in which the randomizer produces 100
samples. We are interested in the locations of the incorrectly classified samples; Figure 4
shows this information.
6. Neural Networks
For classification, we will have c output units, one for each of the categories, and the signal
from each output unit is the discriminant function g_k(x):

g_k(x) = z_k = f( Σ_{j=1}^{nH} w_kj f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 ).   (61)
This, then, is the class of functions that can be implemented by a three-layer neural network.
An even broader generalization would allow transfer functions at the output layer to differ
from those in the hidden layer, or indeed even different functions at each individual unit.
Kolmogorov proved that any continuous function g(x) defined on the unit hypercube
I^n (I = [0,1] and n ≥ 2) can be represented in the form

g(x) = Σ_{j=1}^{2n+1} Ξ_j( Σ_{i=1}^{d} ψ_ij(x_i) )   (62)

for properly chosen functions Ξ_j and ψ_ij.
Training proceeds by minimizing the squared-error criterion

J(w) = (1/2) Σ_{k=1}^{c} (t_k − z_k)² = (1/2) ||t − z||²,   (63)
where t and z are the target and the network output vectors of length c ; w represents all
the weights in the network.
The backpropagation learning rule is based on gradient descent. The weights are initialized
with random values, and are changed in a direction that will reduce the error:

Δw = −η ∂J/∂w,   (64)

or in component form

Δw_mn = −η ∂J/∂w_mn,   (65)
where η is the learning rate, and merely indicates the relative size of the change in weights.
This iterative algorithm requires taking a weight vector at iteration m and updating it as
w(m + 1) = w(m) + Δw(m).
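The gradient-descent updates (64)-(65) for a three-layer network of the form (61) can be sketched end to end. This toy trains a single-output network with two hidden sigmoid units on the logical AND function; the architecture, targets, seed and learning rate are illustrative assumptions.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2):
    """One hidden layer of sigmoids feeding one sigmoid output, as in Eq. (61)."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    z = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, z

rng = random.Random(1)
n_hidden = 2
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
b2 = 0.0
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 0.0], 0.0), ([1.0, 1.0], 1.0)]
eta = 0.5

for _ in range(10000):
    for x, t in data:
        h, z = forward(x, W1, b1, W2, b2)
        # output sensitivity: dJ/dnet = (z - t) * z * (1 - z), from Eq. (63)
        dz = (z - t) * z * (1.0 - z)
        # hidden sensitivities, computed before W2 is changed
        dh = [dz * W2[j] * h[j] * (1.0 - h[j]) for j in range(n_hidden)]
        # gradient-descent updates of Eq. (65)
        W2 = [W2[j] - eta * dz * h[j] for j in range(n_hidden)]
        b2 -= eta * dz
        for j in range(n_hidden):
            W1[j] = [W1[j][i] - eta * dh[j] * x[i] for i in range(2)]
            b1[j] -= eta * dh[j]

preds = [round(forward(x, W1, b1, W2, b2)[1]) for x, _ in data]
```

After training, the rounded network outputs reproduce the AND truth table. The key point of backpropagation is visible in the `dh` line: each hidden unit's error is the output error propagated backward through the weight connecting it to the output, scaled by the derivative of its own sigmoid.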
7. Stochastic Search
Search methods based on evolution—genetic algorithms and genetic programming —
perform highly parallel stochastic searches in a space set by the designer. The fundamental
representation used in genetic algorithms is a string of bits, or chromosome; the
representation in genetic programming is a snippet of computer code. Variation is
introduced by means of crossover, mutation and insertion. As with all classification methods,
the better the features, the better the solution. There are many heuristics that can be
employed and parameters that must be set. As the cost of computation continues to decline,
computationally intensive methods, such as Boltzmann networks and evolutionary methods,
should become increasingly popular.
In a Boltzmann network, the probability of finding the system in a configuration γ with
energy E_γ is

P(γ) = e^{−E_γ / T} / Z(T),

where Z is a normalization constant. The numerator is the Boltzmann factor and the
denominator the partition function, the sum over all possible configurations:

Z(T) = Σ_{γ'} e^{−E_{γ'} / T},   (68)

which guarantees that P(γ) represents a true probability. The number of configurations is very
large, 2^N, and in physical systems Z can be calculated only in simple cases. Fortunately, we
need not calculate the partition function, as we shall see.
where the expectation is taken over the current generation and T is a control parameter loosely
referred to as a temperature. Early in the evolution the temperature is set high, giving all
chromosomes roughly equal probability of being selected. Late in the evolution the
temperature is set lower, so as to find the chromosomes in the region of the optimal classifier.
We can express such a search by analogy to biology: early in the search the population
remains diverse and explores the fitness landscape in search of promising areas; later, the
population exploits the specific fitness opportunities in a small region of the space of
possible classifiers.
When a pattern recognition problem involves a model that is discrete, or of such high
complexity that analytic or gradient descent methods are unlikely to work, we may employ
stochastic techniques: ones that at some level rely on randomness to find model
parameters. Simulated annealing, based on the physical annealing of metals, consists in
randomly perturbing the system and gradually decreasing the randomness to a low final
level, in order to find an optimal solution. Boltzmann learning trains the weights in a network
so that the probability of a desired final output is increased. Such learning is based on
gradient descent in the Kullback-Leibler divergence between two distributions of visible
states at the output units: one distribution describes these units when clamped at the known
category information, and the other when they are free to assume values based on the
category information, and the other when they are free to assume values based on the
activations throughout the network. Some graphical models, such as hidden Markov models
and Bayes belief networks, have counterparts in structured Boltzmann networks, and this
leads to new applications of Boltzmann learning.
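Simulated annealing as described above can be sketched on a one-dimensional energy landscape with two minima. The Metropolis acceptance rule uses the same Boltzmann factor e^{−ΔE/T} as Eq. (68); the cooling schedule, step size and the double-well test function are illustrative assumptions.

```python
import math
import random

def anneal(f, x0, t0=2.0, cooling=0.995, steps=2000, seed=3):
    """Metropolis-style simulated annealing: randomly perturb the state and
    gradually lower the temperature, tracking the best state seen."""
    rng = random.Random(seed)
    x, t = x0, t0
    best_x, best_f = x0, f(x0)
    for _ in range(steps):
        x_new = x + rng.uniform(-0.5, 0.5)          # random perturbation
        delta = f(x_new) - f(x)
        # accept downhill moves always; uphill with probability e^(-delta/T)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            x = x_new
        if f(x) < best_f:
            best_x, best_f = x, f(x)
        t *= cooling                                 # decrease the randomness
    return best_x, best_f

def double_well(x):
    """Energy with a local minimum near x = 1 and a global one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

x_best, f_best = anneal(double_well, x0=1.0)
```

Starting in the shallow well at x = 1, the high early temperature lets the walker accept uphill moves over the barrier at x = 0, so the search can reach the deeper well near x = −1; as T decreases, the walker settles into a minimum, which is exactly the coarse-then-fine behaviour the section attributes to stochastic methods.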
mixture density is not trivial. Furthermore, in situations where we have relatively little prior
knowledge about the nature of the data, the assumption of particular parametric forms may
lead to poor or meaningless results. Instead of finding structure in the data, we would be
imposing structure on it.
One alternative is to use one of the nonparametric methods to estimate the unknown
mixture density. If accurate, the resulting estimate is certainly a complete description of
what we can learn from the data. Regions of high local density, which might correspond to
significant subclasses in the population, can be found from the peaks or modes of the
estimated density.
If the goal is to find subclasses, a more direct alternative is to use a clustering procedure.
Roughly speaking, clustering procedures yield a data description in terms of clusters, or
groups of data points, that possess strong internal similarities. Formal clustering
procedures use a criterion function, such as the sum of the squared distances from
the cluster centres, and seek the grouping that extremizes the criterion function. Because
even this can lead to unmanageable computational problems, other procedures have been
proposed that are intuitively appealing but that lead to solutions having few if any
established properties. Their use is usually justified on the grounds that they are easy to
apply and often yield interesting results that may guide the application of more rigorous
procedures.
One way to obtain invariance to displacement and scale changes is to translate and scale the
axes so that all of the features have zero mean and unit variance: that is, to standardize the
data. To obtain invariance to rotation, one might rotate the axes so that they coincide with the
eigenvectors of the sample covariance matrix. This transformation to principal components
can be preceded and/or followed by normalization for scale.
However, we should not conclude that this kind of normalization is necessarily desirable.
Consider, for example, the matter of translating and whitening—scaling the axes so that
each feature has zero mean and unit variance. The rationale usually given for this
normalization is that it prevents certain features from dominating distance calculations
merely because they have large numerical values, much as we saw in networks trained with
backpropagation. Subtracting the mean and dividing by the standard deviation is an
appropriate normalization if this spread of values is due to normal random
variation; however, it can be quite inappropriate if the spread is due to the presence of
subclasses.
Instead of scaling axes, we can change the metric in interesting ways. For instance, one
broad class of distance metrics is of the form
d(x, x') = ( Σ_{k=1}^{d} |x_k − x'_k|^q )^{1/q},   (70)

the Minkowski metric. Alternatively, the normalized inner product
s(x, x') = x^t x' / (||x|| ||x'||)   (71)
may be an appropriate similarity function. This measure, which is the cosine of the angle
between x and x ,is invariant to rotation and dilation, though it is not invariant to
translation and general linear transformations.
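The invariance properties of the cosine similarity (71) can be checked numerically; in this sketch (invented vectors) a rotation and a dilation leave the similarity unchanged, while a translation does not:

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y, as in Eq. (71)."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([3.0, 1.0])
y = np.array([1.0, 2.0])
s = cosine_similarity(x, y)

# A rotation R leaves the similarity unchanged (rotation invariance) ...
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
s_rot = cosine_similarity(R @ x, R @ y)

# ... as does a dilation (scaling both vectors) ...
s_dil = cosine_similarity(2.0 * x, 3.0 * y)

# ... while a translation generally changes it.
t = np.array([5.0, 5.0])
s_shift = cosine_similarity(x + t, y + t)
```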
The mean vector of the $i$-th cluster is

$$ \mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in D_i} \mathbf{x} \qquad (72) $$

and the sum-of-squared-error criterion is

$$ J_e = \sum_{i=1}^{c} \sum_{\mathbf{x} \in D_i} \|\mathbf{x} - \mathbf{m}_i\|^2 , \qquad (73) $$

which can be rewritten as

$$ J_e = \frac{1}{2} \sum_{i=1}^{c} n_i \bar{s}_i \qquad (74) $$

where

$$ \bar{s}_i = \frac{1}{n_i^2} \sum_{\mathbf{x} \in D_i} \sum_{\mathbf{x}' \in D_i} \|\mathbf{x} - \mathbf{x}'\|^2 . \qquad (75) $$
Equation (75) leads us to interpret $\bar{s}_i$ as the average squared distance between points in the
$i$-th cluster, and emphasizes the fact that the sum-of-squared-error criterion uses Euclidean
distance as the measure of similarity. It also suggests an obvious way of obtaining other
criterion functions. For example, one can replace $\bar{s}_i$ by the average, the median, or perhaps
the maximum distance between points in a cluster. More generally, one can introduce an
appropriate similarity function $s(\mathbf{x}, \mathbf{x}')$ and replace $\bar{s}_i$ by functions such as

$$ \bar{s}_i = \frac{1}{n_i^2} \sum_{\mathbf{x} \in D_i} \sum_{\mathbf{x}' \in D_i} s(\mathbf{x}, \mathbf{x}') \qquad (76) $$

or

$$ \bar{s}_i = \min_{\mathbf{x}, \mathbf{x}' \in D_i} s(\mathbf{x}, \mathbf{x}') . \qquad (77) $$
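The equivalence of (73) and (74)-(75) is easy to verify numerically; the sketch below (NumPy, with made-up cluster data) computes $J_e$ both from the cluster means and from the pairwise distances:

```python
import numpy as np

def Je_from_means(clusters):
    """Sum-of-squared-error criterion, Eq. (73)."""
    return sum(float(((D - D.mean(axis=0)) ** 2).sum()) for D in clusters)

def Je_from_pairs(clusters):
    """Same criterion via Eqs. (74)-(75): Je = 1/2 * sum_i n_i * s_i."""
    total = 0.0
    for D in clusters:
        n = len(D)
        diff = D[:, None, :] - D[None, :, :]   # all pairwise differences
        s_bar = (diff ** 2).sum() / n**2       # Eq. (75): average squared distance
        total += 0.5 * n * s_bar               # Eq. (74)
    return total

rng = np.random.default_rng(0)
clusters = [rng.normal(0, 1, (10, 2)), rng.normal(5, 1, (12, 2))]
a = Je_from_means(clusters)
b = Je_from_pairs(clusters)
```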
Suppose there is a large gap between the similarity values for the levels corresponding to
c = 3 and to c = 4 clusters. In such a case, one can argue that c = 3 is the most natural
number of clusters.
Hierarchical clustering can also proceed top-down, iteratively splitting into smaller clusters,
each time seeking the subclusters that are most dissimilar. The resulting hierarchical structure
is revealed in a dendrogram. A large disparity in the similarity measure for successive
cluster levels in a dendrogram usually indicates the “natural” number of clusters.
Alternatively, the problem of cluster validity—knowing the proper number of clusters —can
also be addressed by hypothesis testing. In that case the null hypothesis is that there are
some number c of clusters; we then determine if the reduction of the cluster criterion due to
an additional cluster is statistically significant.
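The "large disparity" heuristic for reading a dendrogram can be seen on toy data with two well-separated groups: the final merge distance dwarfs all earlier ones. A sketch using SciPy's hierarchical clustering (the data points are invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight groups far apart: six 1-D points.
X = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])

# Single-linkage agglomeration; each row of Z records one merge:
# (cluster_a, cluster_b, merge_distance, new_cluster_size).
Z = linkage(X, method="single")
merge_distances = Z[:, 2]

# The last merge joins the two distant groups, so its distance is far
# larger than the within-group merges, suggesting c = 2 "natural" clusters.
```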
Competitive learning is an on-line neural network clustering algorithm in which the cluster
center most similar to an input pattern is modified to become more like that pattern. In
order to guarantee that learning stops for an arbitrary data set, the learning rate must decay.
Competitive learning can be modified to allow for the creation of new cluster centers, if no
center is sufficiently similar to a particular input pattern, as in leader-follower clustering and
Adaptive Resonance. While these methods have many advantages, such as computational
ease and tracking gradual variations in the data, they rarely optimize an easily specified
global criterion such as sum-of-squared error.
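A minimal sketch of the winner-take-all competitive learning update just described, with a decaying learning rate so that learning eventually stops (the data, seeding, and schedule are invented for illustration):

```python
import numpy as np

def competitive_learning(X, centers, n_epochs=50, eta0=0.5):
    """Move the most similar (winning) center toward each input pattern."""
    centers = centers.copy()
    step = 0
    for _ in range(n_epochs):
        for x in X:
            eta = eta0 / (1 + step)  # decaying learning rate guarantees convergence
            j = np.argmin(((centers - x) ** 2).sum(axis=1))  # winning center
            centers[j] += eta * (x - centers[j])  # make it more like the pattern
            step += 1
    return centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
init = np.array([X[0], X[20]])  # seed one center inside each region
centers = competitive_learning(X, init)
```

Note that, as the text says, nothing here optimizes a global criterion such as the sum-of-squared error; the centers simply track the data.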
Component analysis seeks to find directions or axes in feature space that provide an
improved, lower-dimensional representation for the full data space. In (linear) principal
component analysis, such directions are simply the eigenvectors of the covariance matrix of
the full data having the largest eigenvalues; this optimizes a sum-squared-error criterion.
Nonlinear principal components, for instance as learned in an internal layer of an
auto-encoder neural network, yield curved surfaces embedded in the full d-dimensional
feature space, onto which an arbitrary pattern x is projected. The goal in independent
component analysis, which uses gradient descent on an entropy criterion, is to determine the
directions in feature space that are statistically most independent. Such directions may
reveal the true sources (assumed independent) and can be used for segmentation and blind
source separation.
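Linear PCA as just described, i.e. the top eigenvectors of the sample covariance matrix, can be sketched in a few NumPy lines (the data, stretched along the direction (1, 1), is invented):

```python
import numpy as np

def principal_components(X, k):
    """Return the k covariance-matrix eigenvectors with largest eigenvalues."""
    Xc = X - X.mean(axis=0)               # centre the data
    C = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]     # re-sort descending
    return eigvecs[:, order[:k]], eigvals[order[:k]]

rng = np.random.default_rng(2)
t = rng.normal(0, 3, 200)
X = np.column_stack([t, t]) + rng.normal(0, 0.1, (200, 2))
W, lam = principal_components(X, 1)
pc1 = W[:, 0]   # should align (up to sign) with (1, 1)/sqrt(2)
```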
Two general methods for dimensionality reduction are self-organizing feature maps and
multidimensional scaling. Self-organizing feature maps can be highly nonlinear, and
represent points close in the source space by points close in the lower-dimensional target
space. In preserving neighborhoods in this way, such maps are also called "topologically
correct." The source and target spaces can be of very general shapes, and the mapping will
depend upon the distribution of samples within the source space. Multidimensional
scaling similarly learns a nonlinear mapping that, too, seeks to preserve neighborhoods, and
is often used for data visualization. Because the basic method requires all the inter-point
distances for minimizing a global criterion function, its space complexity limits the
usefulness of multidimensional scaling to problems of moderate size.
9. Conclusion
One approach to this problem is to use the samples to estimate the unknown probabilities
and probability densities, and to use the resulting estimates as if they were the true values.
In typical supervised pattern classification problems, the estimation of the prior probabilities
presents no serious difficulties. However, estimation of the class-conditional densities is
quite another matter. The number of available samples always seems too small, and serious
problems arise when the dimensionality of the feature vector x is large. If we know the
number of parameters in advance and our general knowledge about the problem permits us
to parameterize the conditional densities, then the severity of these problems can be reduced
significantly. Suppose, for example, that we can reasonably assume that $p(\mathbf{x} \mid \omega_i)$ is a normal
density with mean $\boldsymbol{\mu}_i$ and covariance matrix $\Sigma_i$, although we do not know the exact values
of these quantities. This knowledge simplifies the problem from one of estimating an
unknown function $p(\mathbf{x} \mid \omega_i)$ to one of estimating the parameters $\boldsymbol{\mu}_i$ and $\Sigma_i$.
In general there are two approaches to developing classifiers: a parametric approach and a
nonparametric approach. In a parametric approach, a priori knowledge of the data
distributions is assumed; otherwise, a nonparametric approach is employed. Neural
networks, fuzzy systems, and support vector machines are typical nonparametric classifiers.
10. References
Abe, S. (2005). Support Vector Machines for Pattern Classification, Springer, 1852339292, USA
Duda, R. O.; Hart, P. E. & Stork, D. G. (1999). Pattern Classification, John Wiley & Sons,
Incorporated, 0471056693, USA
Haykin, S. S. (2008). Neural Networks and Learning Machines, Prentice Hall, 0131471392
Joachims, T. (2002). Learning to Classify Text Using Support Vector Machines. Dissertation,
Kluwer
Vapnik, V. (1998). Statistical Learning Theory. Chichester, UK: Wiley
Yizhang, G. & Zhifeng, H. (2007). Using SVMs Method to Detect Abrupt Change, Proceedings
of 2007 International Conference on Machine Learning and Cybernetics, pp. 3298-3301,
1-4244-0972-1, Aug. 2007, Hong Kong
Classification of support vector machine and regression algorithm
1. Introduction
Support vector machine (SVM), originally introduced by Vapnik, has been successfully
applied because of its good generalization ability. It is a learning mechanism based on
statistical learning theory, and a kernel-based technology for solving the problem of
learning from samples. The support vector machine was presented in the 1990s, and since
then it has been researched deeply and applied extensively in practice, for example in text
categorization, handwriting recognition, and image classification. The support vector
machine provides optimal learning capacity, and has been established as a standard tool in
machine learning and data mining. However, learning from samples is an ill-posed
problem, which can be transformed into a well-posed problem by regularization. The
reproducing kernel (RK) and its corresponding reproducing kernel Hilbert space (RKHS)
play important roles in the theory of function approximation and regularization. Different
function approximation problems need different sets of approximating functions, and SVMs
with different kernels can solve different practical problems, so it is very significant to
construct an RK function that reflects the characteristics of the relevant class of
approximating functions.
In kernel-based methods, a map embeds the input data into a higher-dimensional space.
The kernel plays a crucial role in solving the convex optimization problem of the SVM. How
to choose a kernel function with good reproducing properties is a key issue of data
representation, and it is closely related to the choice of a specific RKHS. Whether a better
performance could be obtained by adopting the RK theory is a valuable question, and it has
attracted great interest among researchers. In order to take advantage of the RK, we propose
an LS-SVM based on the RK and develop a framework for regression estimation in this
paper. Simulation results are presented to illustrate the feasibility of the proposed method;
the model gives better experimental results than the Gauss kernel on regression problems.
We have changed the symbol from $L(\mathbf{w}, b, \boldsymbol{\alpha})$ to $Q(\boldsymbol{\alpha})$ to reflect the final transformation.
The expression (7) is called the Lagrange dual objective function. Under the constraint
conditions

$$ \sum_{i=1}^{l} \alpha_i y_i = 0 , \qquad (8) $$

$$ \alpha_i \geq 0, \quad i = 1, 2, \ldots, l , \qquad (9) $$

we find the $\alpha_i$ that maximize the function $Q(\boldsymbol{\alpha})$. A sample is a support vector when its $\alpha_i$ is
not zero.
Similar to the previous section, we can get the dual form of the optimization problem:

$$ \max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \qquad (11) $$

subject to

$$ \sum_{i=1}^{l} \alpha_i y_i = 0 , \qquad (12) $$

$$ 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, l . \qquad (13) $$

In general, the solution of the optimization problem is characterized by the majority of the $\alpha_i$
being zero; the support vectors are the samples corresponding to the nonzero $\alpha_i$.
We can obtain the calculation formula for $b$ from the KKT conditions as follows:

$$ y_i \left( \sum_{j=1}^{l} \alpha_j y_j K(\mathbf{x}_j, \mathbf{x}_i) + b \right) - 1 = 0, \quad \alpha_i \in (0, C) . \qquad (14) $$

So we can find the value of $b$ from any one of the support vectors. For numerical stability,
we can also compute $b$ from all support vectors and take the average of the values.
Finally, we obtain the discriminant function:

$$ f(\mathbf{x}) = \mathrm{sgn} \left( \sum_{i=1}^{l} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b \right) . \qquad (15) $$
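For a tiny linearly separable problem the dual solution is known in closed form, so the recipe above, averaging $b$ over the support vectors via (14) and evaluating (15), can be sketched directly (a toy example with invented data, not from the original text):

```python
import numpy as np

# Two 1-D training points: x1 = -1 (y = -1), x2 = +1 (y = +1), linear kernel.
X = np.array([[-1.0], [1.0]])
y = np.array([-1.0, 1.0])
K = X @ X.T  # linear kernel matrix

# For this toy problem the dual maximum of Q(alpha) is alpha = (1/2, 1/2)
# (both points are support vectors); we take that solution as given here.
alpha = np.array([0.5, 0.5])

# Eq. (14): for each support vector, b = y_i - sum_j alpha_j y_j K(x_j, x_i);
# averaging over all support vectors improves numerical stability.
sv = alpha > 1e-9
b_values = [y[i] - np.sum(alpha * y * K[:, i]) for i in np.where(sv)[0]]
b = float(np.mean(b_values))

def f(x):
    """Eq. (15): the discriminant function."""
    k = (X @ np.atleast_1d(x)).ravel()  # kernel values K(x_i, x)
    return int(np.sign(np.sum(alpha * y * k) + b))
```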
subject to

$$ \sum_{i=1}^{l} \alpha_i = 1 , \qquad (20) $$

$$ 0 \leq \alpha_i \leq C, \quad i = 1, \ldots, l . \qquad (21) $$

We can obtain $\boldsymbol{\alpha}$ by solving (19). Usually, the majority of the $\alpha_i$ will be zero; the samples
corresponding to $\alpha_i \neq 0$ are again the so-called support vectors.
According to the KKT conditions, the samples corresponding to $0 < \alpha_i < C$ satisfy

$$ R^2 - K(\mathbf{x}_i, \mathbf{x}_i) + 2 \sum_{j=1}^{l} \alpha_j K(\mathbf{x}_j, \mathbf{x}_i) - \|\mathbf{a}\|^2 = 0 , \qquad (22) $$

where $\mathbf{a} = \sum_{i=1}^{l} \alpha_i \mathbf{x}_i$ is the centre of the sphere. Thus, according to (22), we can find the
value of $R$ from any support vector. The squared distance from a point $\mathbf{z}$ to the centre is

$$ f(\mathbf{z}) = \|\mathbf{z} - \mathbf{a}\|^2 = K(\mathbf{z}, \mathbf{z}) - 2 \sum_{i=1}^{l} \alpha_i K(\mathbf{z}, \mathbf{x}_i) + \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j K(\mathbf{x}_i, \mathbf{x}_j) . $$
Define the $p$-norm distance between two hyperplanes $H_1$ and $H_2$ as

$$ d_p(H_1, H_2) = \min_{\mathbf{x} \in H_1, \, \mathbf{y} \in H_2} \|\mathbf{x} - \mathbf{y}\|_p , \qquad (23) $$

where

$$ \|\mathbf{x}\|_p = \left( \sum_{i=1}^{d} |x_i|^p \right)^{1/p} . \qquad (24) $$

Choosing a point $\mathbf{y} \in H_2$ arbitrarily, the distance between the two hyperplanes can be
written as

$$ d_p(H_1, H_2) = \min_{\mathbf{x} \in H_1} \|\mathbf{x} - \mathbf{y}\|_p . \qquad (25) $$

Translating the two parallel hyperplanes so that $H_2$ passes through the origin yields
hyperplanes at the same distance:

$$ H_1: \langle \mathbf{w}, \mathbf{x} \rangle = b_1 - b_2 \neq 0 , \qquad H_2: \langle \mathbf{w}, \mathbf{x} \rangle = 0 . $$

If the chosen point $\mathbf{y}$ is the origin, then the distance between the two hyperplanes is

$$ d_p(H_1, H_2) = \min_{\mathbf{x} \in H_1} \|\mathbf{x}\|_p . \qquad (26) $$

Here the $L_\infty$ norm, the dual norm of $L_1$, is defined as

$$ \|\mathbf{w}\|_\infty = \max_j |w_j| . \qquad (32) $$
Suppose $H_1: \langle \mathbf{w}, \mathbf{x} \rangle + b = 1$ and $H_2: \langle \mathbf{w}, \mathbf{x} \rangle + b = -1$ are the hyperplanes established
through the two classes of support vectors. The constraint is

$$ y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, l . \qquad (35) $$

Therefore we obtain the following linear programming problem:

$$ \min \; a \qquad (36) $$

subject to

$$ y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, l , \qquad (37) $$

$$ a \geq w_j, \quad j = 1, \ldots, d , \qquad (38) $$

$$ a \geq -w_j, \quad j = 1, \ldots, d , \qquad (39) $$

$$ a, b \in \mathbb{R}, \quad \mathbf{w} \in \mathbb{R}^d . \qquad (40) $$

This is a linear optimization problem, which is much simpler than the quadratic
optimization.
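The linear program (36)-(40) can be handed to an off-the-shelf LP solver; the sketch below uses SciPy's `linprog` on a trivial 1-D two-point problem (invented data; the decision variables are stacked as (a, b, w)):

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: x = -1 with y = -1 and x = +1 with y = +1 (so d = 1).
X = np.array([[-1.0], [1.0]])
y = np.array([-1.0, 1.0])
l, d = X.shape

# Variable vector z = (a, b, w_1, ..., w_d); minimize a, Eq. (36).
c = np.zeros(2 + d)
c[0] = 1.0

A_ub, b_ub = [], []
# Eq. (37): y_i(<w, x_i> + b) >= 1  ->  -y_i * b - y_i * <x_i, w> <= -1
for i in range(l):
    row = np.zeros(2 + d)
    row[1] = -y[i]
    row[2:] = -y[i] * X[i]
    A_ub.append(row); b_ub.append(-1.0)
# Eqs. (38)-(39): a >= w_j and a >= -w_j  ->  +/- w_j - a <= 0
for j in range(d):
    up = np.zeros(2 + d); up[0] = -1.0; up[2 + j] = 1.0
    dn = np.zeros(2 + d); dn[0] = -1.0; dn[2 + j] = -1.0
    A_ub += [up, dn]; b_ub += [0.0, 0.0]

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * (2 + d))  # a, b, w all free, Eq. (40)
a_opt = res.x[0]  # here the optimum is a = ||w||_inf = 1 (w = 1, b = 0)
```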
If the distance between the two hyperplanes is instead defined with the $L_\infty$ norm, we
obtain another linear optimization formulation. In this case the distance between the two
hyperplanes is

$$ d(H_1, H_2) = \frac{|b_1 - b_2|}{\|\mathbf{w}\|_1} . \qquad (41) $$

For the linearly separable case, the distance between the two support hyperplanes is

$$ d(H_1, H_2) = \frac{|1 - b - (-1 - b)|}{\|\mathbf{w}\|_1} = \frac{2}{\sum_j |w_j|} . \qquad (42) $$

The constraint is

$$ y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, l . \qquad (44) $$

Therefore the optimization problem is

$$ \min \; \sum_{j=1}^{d} a_j \qquad (45) $$

subject to

$$ y_i (\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \geq 1, \quad i = 1, \ldots, l , \qquad (46) $$

$$ a_j \geq w_j, \quad j = 1, \ldots, d , \qquad (47) $$

$$ a_j \geq -w_j, \quad j = 1, \ldots, d . \qquad (48) $$
The primal one-class optimization problem is

$$ \min \; \frac{1}{2} \|\mathbf{w}\|^2 - \rho + C \sum_{i=1}^{l} \xi_i \qquad (49) $$

subject to

$$ \langle \mathbf{w}, \mathbf{x}_i \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, l . \qquad (50) $$

Introduce the Lagrange function

$$ L = \frac{1}{2} \|\mathbf{w}\|^2 - \rho + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \beta_i \left( \langle \mathbf{w}, \mathbf{x}_i \rangle - \rho + \xi_i \right) - \sum_{i=1}^{l} \gamma_i \xi_i , \qquad (51) $$

in which $\beta_i \geq 0$, $\gamma_i \geq 0$, $i = 1, \ldots, l$. The extreme value of the function $L$ should satisfy the
conditions

$$ \frac{\partial L}{\partial \mathbf{w}} = 0, \qquad \frac{\partial L}{\partial \rho} = 0, \qquad \frac{\partial L}{\partial \xi_i} = 0 . \qquad (52) $$

Thus

$$ \mathbf{w} = \sum_{i=1}^{l} \beta_i \mathbf{x}_i , \qquad (53) $$

$$ \sum_{i=1}^{l} \beta_i = 1 , \qquad (54) $$

$$ C - \beta_i - \gamma_i = 0, \quad i = 1, \ldots, l . \qquad (55) $$

Substituting (53)-(55) into the Lagrange function (51), and using a kernel function to replace
the inner product in the higher-dimensional space, we finally obtain the dual form of the
optimization problem:

$$ \min \; \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \beta_i \beta_j k(\mathbf{x}_i, \mathbf{x}_j) \qquad (56) $$

subject to

$$ 0 \leq \beta_i \leq C, \quad i = 1, \ldots, l , \qquad (57) $$

$$ \sum_{i=1}^{l} \beta_i = 1 . \qquad (58) $$

Taking the Gauss kernel function, we may discover that the optimization problem (56) is
equivalent to the other form of the one-class method, problem (19).
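Since (56)-(58) minimizes a quadratic over the simplex (for C >= 1 the box constraints are inactive), a simple Frank-Wolfe iteration suffices for a small sketch; this solver choice and the two-point data are our own, not from the text, and by symmetry the optimum here is beta = (1/2, 1/2):

```python
import numpy as np

def gauss_kernel(x, y, sigma=1.0):
    return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))

# Two symmetric training points; kernel matrix for problem (56).
X = np.array([[0.0], [1.0]])
K = np.array([[gauss_kernel(a, b) for b in X] for a in X])

# Frank-Wolfe on: min (1/2) beta^T K beta  subject to  beta in the simplex.
beta = np.array([1.0, 0.0])        # feasible starting point
for t in range(2000):
    grad = K @ beta                # gradient of the objective
    s = np.zeros_like(beta)
    s[np.argmin(grad)] = 1.0       # best vertex of the simplex
    gamma = 2.0 / (t + 2.0)        # standard diminishing step size
    beta = (1 - gamma) * beta + gamma * s
```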
Following the reference, we may obtain an equivalent linear optimization problem:

$$ \min \; -\rho + C \sum_{i=1}^{l} \xi_i \qquad (60) $$

subject to

$$ \langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, l , \qquad (61) $$

$$ \|\mathbf{w}\|_1 = 1 . \qquad (62) $$

Using the kernel expansion $\sum_{j=1}^{l} \beta_j k(\mathbf{x}_j, \mathbf{x}_i)$ to replace the inner product
$\langle \mathbf{w}, \phi(\mathbf{x}_i) \rangle$ in problem (60), the constraints become

$$ \sum_{j=1}^{l} \beta_j k(\mathbf{x}_j, \mathbf{x}_i) \geq \rho - \xi_i, \quad i = 1, \ldots, l , \qquad (64) $$

$$ \sum_{i=1}^{l} \beta_i = 1 , \qquad (65) $$

$$ \xi_i \geq 0, \quad \beta_i \geq 0, \quad i = 1, \ldots, l . \qquad (66) $$

Solving this linear programming problem gives the values of $\boldsymbol{\beta}$ and $\rho$, and therefore we
can obtain a decision function:

$$ f(\mathbf{x}) = \sum_{i=1}^{l} \beta_i k(\mathbf{x}_i, \mathbf{x}) . \qquad (67) $$

The boundary of the region is

$$ \sum_{i=1}^{l} \beta_i k(\mathbf{x}_i, \mathbf{x}) = \rho . \qquad (68) $$

After the decision hyperplane is mapped back to the original space, the training samples are
contained in a compact region. Any sample $\mathbf{x}$ in the region satisfies $f(\mathbf{x}) \geq \rho$, and any
sample $\mathbf{y}$ outside the region satisfies $f(\mathbf{y}) < \rho$.
In practical applications, the smaller the value of the parameter $\sigma^2$ in the kernel function,
the tighter the region in the original space containing the training samples; this shows that
the parameter $\sigma^2$ determines the classification precision.
where $(\mathbf{x}_i^{(s)}, y_i^{(s)})$, $i = 1, \ldots, l_s$, represents the $s$-th class of training samples,
$s = 1, \ldots, M$. The constraints are

$$ \sum_{j=1}^{l_s} \beta_j^{(s)} k(\mathbf{x}_j^{(s)}, \mathbf{x}_i^{(s)}) \geq \rho_s - \xi_i^{(s)}, \quad s = 1, \ldots, M, \quad i = 1, \ldots, l_s , \qquad (70) $$

$$ \sum_{j=1}^{l_s} \beta_j^{(s)} = 1, \quad s = 1, \ldots, M , \qquad (71) $$

$$ \xi_i^{(s)} \geq 0, \quad \beta_i^{(s)} \geq 0, \quad s = 1, \ldots, M, \quad i = 1, \ldots, l_s . \qquad (72) $$

Solving this linear programming problem, we obtain $M$ decision functions

$$ f_s(\mathbf{x}) = \sum_{j=1}^{l_s} \beta_j^{(s)} k(\mathbf{x}_j^{(s)}, \mathbf{x}), \quad s = 1, \ldots, M . \qquad (73) $$
Let $F(E)$ be the linear space comprising all complex-valued functions on an abstract set $E$.
Let $H$ be a Hilbert (possibly finite-dimensional) space equipped with inner product
$(\cdot,\cdot)_H$. Let $h : E \to H$ be an $H$-valued function on $E$. Then we shall consider the linear
mapping $L$ from $H$ into $F(E)$ defined by

$$ f(p) = (Lg)(p) = (g, h(p))_H . \qquad (80) $$

The fundamental problems concerning the linear mapping (80) are, firstly, the
characterization of the images $f(p)$ and, secondly, the relationship between $g$ and $f(p)$.
The key to solving these fundamental problems is to form the function $K(p, q)$ on $E \times E$
defined by

$$ K(p, q) = (h(q), h(p))_H . \qquad (81) $$
We let $R(L)$ denote the range of $L$ for $H$, and we introduce the inner product in $R(L)$
induced from the norm

$$ \|f\|_{R(L)} = \inf \{ \|g\|_H \; ; \; f = Lg \} . \qquad (82) $$
Then, we obtain
Lemma 2.3. For the function $K(p, q)$ defined by (81), the space $(R(L), (\cdot,\cdot)_{R(L)})$ is a Hilbert
(possibly finite-dimensional) space satisfying the properties that
(i) for any fixed $q \in E$, $K(p, q)$ belongs to $R(L)$ as a function in $p$;
(ii) for any $f \in R(L)$ and for any $q \in E$,

$$ f(q) = (f(\cdot), K(\cdot, q))_{R(L)} . $$

Further, the function $K(p, q)$ satisfying (i) and (ii) is uniquely determined by $R(L)$.
Furthermore, the mapping $L$ is an isometry from $H$ onto $R(L)$ if and only if
$\{h(p) \; ; \; p \in E\}$ is complete in $H$.
Consider the Sobolev Hilbert space $H^1(\mathbb{R}; a, b)$ on $\mathbb{R}$ comprising all complex-valued,
absolutely continuous functions $f(x)$ with finite norms

$$ \left( \int_{\mathbb{R}} \left( a^2 |f'(x)|^2 + b^2 |f(x)|^2 \right) dx \right)^{1/2} < \infty , \qquad (83) $$

where $a, b > 0$. The function

$$ G_{a,b}(x, y) = \frac{1}{2ab} \, e^{-\frac{b}{a} |x - y|} = \frac{1}{2\pi} \int_{\mathbb{R}} \frac{e^{i \xi (x - y)}}{a^2 \xi^2 + b^2} \, d\xi \qquad (84) $$

is the RK of $H^1(\mathbb{R}; a, b)$.
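The closed form of $G_{a,b}$ in (84) is easy to evaluate, and one can check numerically that it behaves like a reproducing kernel, i.e. that it is symmetric and yields a positive semi-definite Gram matrix. A sketch with arbitrary sample points (our own illustration):

```python
import numpy as np

def G(x, y, a=1.0, b=2.0):
    """The reproducing kernel of H^1(R; a, b), Eq. (84)."""
    return np.exp(-(b / a) * abs(x - y)) / (2 * a * b)

pts = np.array([-1.3, 0.0, 0.4, 2.1, 3.7])
gram = np.array([[G(x, y) for y in pts] for x in pts])
eigvals = np.linalg.eigvalsh(gram)  # all eigenvalues should be >= 0
```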
On this Hilbert space, we construct the horizontal floating (translation-invariant) kernel
function:

$$ k(\mathbf{x}, \mathbf{x}') = k(\mathbf{x} - \mathbf{x}') = \prod_{i=1}^{d} G_{a,b}(x_i - x'_i) . \qquad (85) $$

That is,

$$ \hat{k}(\boldsymbol{\omega}) = (2\pi)^{-d/2} \int_{\mathbb{R}^d} \exp(-i \langle \boldsymbol{\omega}, \mathbf{x} \rangle) \, k(\mathbf{x}) \, d\mathbf{x} = \prod_{i=1}^{d} \left( \frac{(2\pi)^{-1/2}}{2ab} \int_{\mathbb{R}} e^{-\frac{b}{a} |x_i|} e^{-i \omega_i x_i} \, dx_i \right) , $$
and we obtain

$$ \frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{l} \alpha_i \mathbf{x}_i , $$

$$ \frac{\partial L}{\partial q} = 0 \;\Rightarrow\; \sum_{i=1}^{l} \alpha_i = 0 , $$

$$ \frac{\partial L}{\partial e_i} = 0 \;\Rightarrow\; \alpha_i = \gamma e_i, \quad i = 1, \ldots, l , $$

$$ \frac{\partial L}{\partial \alpha_i} = 0 \;\Rightarrow\; \mathbf{w}^T \mathbf{x}_i + q - y_i + e_i = 0, \quad i = 1, \ldots, l , \qquad (93) $$

where $\mathbf{w} = \sum_{i=1}^{l} \alpha_i \mathbf{x}_i$ and $\alpha_i = \gamma e_i$.
i 1
Combining the LS-SVM with the RK kernel function, we obtain a new learning method
called the least squares RK support vector machine (LS-RKSVM). Since it uses the least
squares method, the computation speed of this algorithm is higher than that of other SVMs.
The simulation results show that the regression ability of the RK kernel function is much
better than that of the Gauss kernel function. This reveals that the RK kernel function has
rather strong regression ability and can be used for pattern recognition. We find that the
LS-SVM based on the RK kernel is a very promising method; the model has strong
regression ability.
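In the LS-SVM the KKT conditions reduce to a single linear system, which is why the method is fast. A regression sketch along those lines (a standard LS-SVM system with the kernel taken to be $G_{1,2}$ from Eq. (84); the data and parameter values are invented for illustration):

```python
import numpy as np

def ls_svm_fit(X, y, kernel, gamma=1000.0):
    """Solve the LS-SVM system [[0, 1^T], [1, K + I/gamma]] (b; alpha) = (0; y)."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate([[0.0], y])
    sol = np.linalg.solve(A, rhs)
    b, alpha = sol[0], sol[1:]
    # Predictor: f(x) = sum_i alpha_i k(x_i, x) + b
    return lambda x: float(sum(a * kernel(xi, x) for a, xi in zip(alpha, X)) + b)

rk = lambda s, t: np.exp(-2.0 * abs(s - t)) / 4.0  # G_{1,2}(s, t), Eq. (84)
X = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * X)
predict = ls_svm_fit(X, y, rk)
errors = np.array([predict(x) for x in X]) - y
```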
Fig. 1. The regression curve based on Gauss kernel (“.” is true value, “+” is predictive value)
Fig. 2. The regression curve based on RK kernel (“.” is true value, “*” is predictive value)
The SVM is a machine learning method proposed by Vapnik based on statistical learning
theory; it focuses on statistical learning rules under small samples. By using the structural
risk minimization principle to enhance generalization ability, the SVM solves many
practical problems well, such as small samples, nonlinearity, high dimensionality, and local
minima. The LS-SVM is an improved algorithm based on the SVM. This paper proposes a
new kernel function for the SVM, the RK kernel function. We can use this kind of kernel
function to map the low-dimensional input space to a high-dimensional space. The RK
kernel function enhances the generalization ability of the SVM. At the same time, adopting
the LS-SVM, we obtain a new regression analysis method called the least squares RK
support vector machine. Experiments show that the RK kernel function is better than the
Gauss kernel function in regression analysis. The RK and the LS-SVM are combined
effectively, and thereby the result of regression is more precise.
6. Prospect
Further study should be carried out in the following areas:
1. The kernel method provides an effective way to change a nonlinear problem into a linear
one; that is, the kernel function plays an important role in the support vector machine.
Therefore, for practical problems, the rational choice of the kernel function and its
parameters is a problem that should be researched.
2. For the massive data of practical problems, a serious problem that needs to be solved is to
propose efficient algorithms.
3. It is a valuable research direction to fuse Boosting and ensemble methods into better
support vector machine algorithms.
4. It is significant to put the support vector machine, planning networks, Gauss processes
and neural networks into the same framework.
5. It is a significant research subject to combine the idea of the support vector machine with
Bayes decision theory and to improve the maximum margin algorithm.
6. The research on the support vector machine still needs to be done extensively.
7. References
Bernhard, S. & Sung, K. K. (1997). Comparing support vector machines with Gaussian
kernels to radial basis function classifiers. IEEE Transactions on Signal Processing
Fatiha, M. & Tahar, M. (2007). Prediction of continuous time autoregressive processes via the
reproducing kernel spaces. Computational Mechanics, Vol. 41, No. 1, Dec.
Ames, K. A. & Hughes, R. J. (2005). Structural stability for ill-posed problems in Banach
space. Semigroup Forum, Vol. 70, No. 1, Jan.
Mercer, J. (1909). Functions of positive and negative type and their connection with the
theory of integral equations. Philosophical Transactions of the Royal Society of London,
Vol. 209, pp. 415-446
Mangasarian, O. L. (1999). Arbitrary-norm separating plane. Operations Research Letters,
1(24): 15-23
Saitoh, S. (1993). Inequalities in the most simple Sobolev space and convolutions of L²
functions with weights. Proc. Amer. Math. Soc., Vol. 118, pp. 515-520
Smola, A. J.; Scholkopf, B. & Muller, K. R. (1998). The connection between regularization
operators and support vector kernels. Neural Networks, Vol. 11, No. 4, pp. 637-649
Suykens, J. A. K. (2002). Least Squares Support Vector Machines, World Scientific, Singapore
Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York
Vapnik, V. N. (1998). Statistical Learning Theory. Wiley, New York
Golubev, Y. (2004). The principle of penalized empirical risk in severely ill-posed problems.
Probability Theory and Related Fields, Vol. 130, No. 1, Sep.
Classifiers Association for High Dimensional Problem: Application to Pedestrian Recognition
1. Introduction
Machine learning is widely used for object recognition in images (Viola & Jones, 2001). Indeed,
the goal is to recognize any object of the same class whatever the background, the illumination
conditions, etc. The key point of such a method is the ability to create a generic model
able to describe the huge variability of an object class. A large training set is then used so as
to cover all the variations taken by the object class. For each example, a simple description
provides a huge feature vector from which only a subset is relevant according to the object to
be recognised.
Kernel-based machines like the Support Vector Machine (SVM) (Vapnik, 1998) have shown
great performance for object recognition in images (Papageorgiou & Poggio, 2000). But
high-dimensional problems can be prohibitive for them: they imply expensive computation
time, the presence of irrelevant features can disturb the classifier, and overfitting often
occurs.
Various approaches have been proposed in order to decrease this number of variables (Guyon &
Elisseeff, 2003). They are of two types: filters and wrappers. Filter methods use only
the training set. They process the entire data before the learning step and keep only relevant
characteristics. The most widespread is the Relief algorithm, introduced by Kira and Rendell (Kira
& Rendell, 1992) and improved by Kononenko (Kononenko, 1994), which computes a criterion
of relevance for each characteristic of the training set. Another approach presented by Hall in
(Hall, 2000) uses a correlation score to reduce the training set. An extension of this method
was developed in (Yu & Liu, 2003) for high-dimensional sets.
Wrapper methods carry out the variable choice at the same time as the training process.
Moreover, they use the process itself to select relevant characteristics (Kohavi & John,
1997). Solutions have been brought for SVMs. Weston in (Weston et al., 2000) explores the
parameter space by a stepped gradient descent and fixes an exit threshold on the classification
error. Rakotomamonjy proposes in (Rakotomamonjy, 2003) a selection criterion based on the
variable influence on the decision rule of an SVM classifier. Generally, these methods, which
take account of the training set and the classifier at the same time, give good results but induce
expensive computing times.
For off-line learning, the computing time is not the main problem. However, studies like
(Campedel et al., 2005) showed the efficiency of variable selection to improve classifier
performance: the presence of useless data can disturb the classifier and memory is misused.
With the aim of time-saving, the ideal is to use a variable selection algorithm which can be
processed independently of the learning process. But not taking account of the classifier is
the major disadvantage of filtering methods: the drawback is to select attributes which are
not finally useful for it. To guarantee the relevance of the characteristics preserved for
the classifier, the best tool is the classifier itself.
AdaBoost algorithms can also be used for feature selection. In (Viola & Jones, 2001) an
AdaBoost cascade is used with a Haar-wavelet-based descriptor. At each stage of the classifier,
a Haar resolution is chosen and images are divided into several sub-windows. The classifier
rejects the non-informative ones. At the following step, the Haar resolution is increased only for
sub-windows which were selected at the previous iteration. This stage of the classifier also
rejects the non-informative sub-windows, and the process continues. After several stages of the
classifier, the number of sub-windows decreases quickly. Moreover, the decision threshold is
readjusted as the classifier progresses. An extension of this cascade method is developed in
(Le & Satoh, 2004) with a final SVM classifier. The first stages of the classifier use the AdaBoost
algorithm to reduce the feature space and select relevant features. The last stage is an SVM
classifier which builds a face model from the features selected previously. Both methods allow
the number of features to be reduced in the first stages of the cascade. In this approach, the
AdaBoost algorithm is only used to select relevant features from a huge set of possible ones.
We propose here an original association of a classic AdaBoost algorithm with a kernel-based
machine. AdaBoost is an algorithm which builds a strong classifier by selecting a large
number of weak ones. It can be used for feature selection too: each feature can be seen as a
weak classifier and AdaBoost selects a subset of them. Our approach consists in using the
resulting subset of weak relevant classifiers (and not relevant features) as binary vectors in a
kernel-based machine learning classifier (like an SVM).
We focus our proposal on pedestrian recognition: since pedestrians present a large
appearance variability (size, clothes, skin colour, etc.), the training set used for learning must
be very large. Numerous features are then used to describe each sample of the training set
correctly. The association of AdaBoost and a kernel machine allows this high-dimensional
problem to be handled.
This chapter is organized as follows: section 1 describes the main features used in
classification; then the classification methods are presented in section 2. Experiments and
results of the proposed method on a pedestrian recognition task are reported in section 3;
and finally section 4 gives the conclusion.
The overcomplete dictionary presented in (Papageorgiou & Poggio, 2000) allows a fast
computation of Haar wavelets in three directions: horizontal, vertical and diagonal (see
figure 1). The main difficulty is to find adapted sizes for a given image. Indeed, a scale that
is too fine only captures noise, whereas a scale that is too large does not capture an object's
characteristics.
3. Machine Learning
Today, the machine learning methods used in image recognition are either boosting
methods or kernel methods. This section describes these learning methods.
Here, we follow the standard notations, representing the output labels by a scalar y which
can take two possible discrete values corresponding to the object class: y = −1 for negative
examples (non-objects) and y = 1 for positive examples (objects). Vectors x ∈ IRQ represent
input features provided by image descriptors. Let S = {(xi , yi )}iN=1 denote a training set
composed of N samples of feature vectors associated with their corresponding labels.
Here, {φm (x)|m = 1...M} are basis functions and {wm |m = 1...M} are the associated weights.
We propose nonlinear basis functions:
φm (x) = k(x, xm ) (4)
where k(x, xm ) is a kernel function. The classification rule can be written in a more compact
form by the following equation:
y = sign(w T φ(x)) (5)
where w T = (w1 , w2 , ..., w M ) is a weight vector and φ(x) = (φ1 (x), φ2 (x), ..., φ M (x)) T . To train
the model (estimate w), we are given the training set Sh = {(xi , yi )}iN=1 . We use the Euclidean
norm to measure y-space prediction errors, so the estimation problem is of the form:
w := arg min {||w T Φ − y||2 } (6)
w
where Φ = (φ(x1 ), φ(x2 ), ..., φ(x N )) is the design matrix and y = (y1 , ...y N ) T is the training set
label vector. The estimation of the parameter vector w using the least-squares criterion defined
in equation (6) is given by:
wls = yΦ+ (7)
where Φ+ denotes the pseudo-inverse of Φ.
Alternative methods can be used to estimate w. A solution is to place priors over w in order to
set many weights to zero. The resulting model is then called sparse linear model. SVM (Sup-
port Vector Machine)(Vapnik, 1998) is a sparse linear model where the weights are estimated by
the minimization of a Lagrange multipliers based functional. Other sparse linear models, like
RVM (Relevant Vector Machines) (Tipping, 2001) may also be employed.
Vectors used for basis functions are usually composed by a subset of the training set Sh . It is
also possible to use the entire training set and in this case M = N. The matrix Φ is then sym-
metric and system resolution can be made more efficiently using Cholesky decomposition.
We make the common choice to use Gaussian data-centred basis functions:
φm (x) = exp(−||x − xm ||2 /σ2 ), (8)
which gives us a "radial basis function" (RBF) type model for which the parameter σ must
be adjusted. On the one hand, if σ is too small, the "design matrix" Φ is mostly composed of
zeros. On the other hand, if σ is too large, Φ is mostly composed of ones. We propose to
adjust σ using a nonlinear optimization maximizing an empirical criterion based on the sum
of the variances computed for each line of the design matrix Φ:
σ := arg max C (σ) (9)
σ
with
C (σ) = ∑nN=1 ∑m
M
=1 (φm (xn ) − φ(xn ))
2 (10)
and
φ(x) = (1/M ) ∑m
M
=1 φm (x). (11)
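The variance criterion (10)-(11) can be evaluated over a grid of candidate σ values; the sketch below (invented 1-D data, with M = N) shows that it penalizes both degenerate extremes described above:

```python
import numpy as np

def design_matrix(X, sigma):
    """Phi[n, m] = exp(-||x_n - x_m||^2 / sigma^2), Eq. (8) with M = N."""
    d2 = (X[:, None] - X[None, :]) ** 2
    return np.exp(-d2 / sigma**2)

def criterion(X, sigma):
    """C(sigma): sum over rows of the per-row variance terms, Eqs. (10)-(11)."""
    Phi = design_matrix(X, sigma)
    row_mean = Phi.mean(axis=1, keepdims=True)  # Eq. (11)
    return float(((Phi - row_mean) ** 2).sum()) # Eq. (10)

X = np.arange(8, dtype=float)   # 8 evenly spaced sample points
c_small = criterion(X, 0.01)    # Phi ~ identity: rows are almost all zeros
c_mid   = criterion(X, 2.0)     # Phi has a genuine spread of values
c_large = criterion(X, 1000.0)  # Phi ~ all ones: rows are constant
```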
The classifier thus obtained will be denoted by KHA (Kernel Approximation Hyperplane) in sec-
tion 4.
Weak learners return value 1 for examples classified as positive and −1 for examples
classified as negative. In the weak learner space, AdaBoost provides a linear separator and
some examples are misclassified. Our method provides a non-linear separator and classifies
all examples correctly. So
Fig. 3. Our method provides a non-linear separator while AdaBoost gives a linear one.
the first stage of the learning process is done by an AdaBoost algorithm which gives a new
binary training set for the kernel machines. We define Sh = {(h(xi ), yi )}iN=1 , a new training
set where h(xi ) ∈ IR T is a vector composed of the output of each selected classifier estimated
from parameters x, such that h(xi ) = (h1 (xi ), ..., hT (xi )) T . Next, the classification rule for
kernel machines becomes:
y = sign( ∑m
M
=1 wm φm (h(x)) ) (12)
and
φm (h(x)) = k(h(x), hm (x)) (13)
where k(h(x), hm (x)) is the kernel function. The estimation problem is given by equation (6)
and the resolution is done as presented in section 3.2.
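The whole pipeline, AdaBoost selecting T stump classifiers and a kernel machine then trained on the binary vectors h(x), can be sketched end to end. This is a toy version with invented 2-D data, and the kernel machine is solved here by regularized least squares rather than a full SVM:

```python
import numpy as np

def best_stump(X, y, w):
    """Weighted-error-minimizing decision stump (error, feature, threshold, polarity)."""
    best = (1.0, 0, 0.0, 1)
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (1, -1):
                pred = np.where(X[:, f] >= thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, f, thr, pol)
    return best

def adaboost_stumps(X, y, T=5):
    """Run AdaBoost for T rounds; return the T selected weak classifiers."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(T):
        err, f, thr, pol = best_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, f] >= thr, pol, -pol)
        w *= np.exp(-alpha * y * pred)   # reweight: emphasize mistakes
        w /= w.sum()
        stumps.append((f, thr, pol))
    return stumps

def h(X, stumps):
    """Binary vectors h(x) = (h_1(x), ..., h_T(x)): one +/-1 entry per weak classifier."""
    cols = [np.where(X[:, f] >= thr, pol, -pol) for f, thr, pol in stumps]
    return np.column_stack(cols).astype(float)

# Toy data: two Gaussian blobs.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(4, 1, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])

stumps = adaboost_stumps(X, y)
H = h(X, stumps)

# Kernel machine on the h-vectors: RBF kernel, regularized least squares.
K = np.exp(-((H[:, None, :] - H[None, :, :]) ** 2).sum(-1) / 4.0)
c = np.linalg.solve(K + 1e-3 * np.eye(len(y)), y)
pred = np.sign(K @ c)
accuracy = float((pred == y).mean())
```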
Fig. 4. Examples of pedestrian (first line) and non-pedestrian (second line) images.
We focus our work on pedestrian recognition in images coming from a low-cost camera (see
(Leyrit et al., 2008)). This work is challenging because a pedestrian is a hard pattern to
recognize due to the differences of clothes, size, etc., added to classic illumination and
background variations. Since pedestrian appearance presents a large variability, the training
set used in the learning stage must be very large and each sample can be described by a huge
feature vector. We used the image dataset provided by Gavrila and Munder in (Munder &
Gavrila, 2006). This base is subdivided into five parts; each one contains 4800 positive and
5000 negative images. Each picture has a size of 36x18 pixels, in grey levels. In the positive
images, the pedestrians are standing and entirely visible; they were taken in various
postures, and in various illumination and background conditions. Each pedestrian picture
was randomly reflected and shifted a few pixels in the horizontal and vertical directions.
The negative images describe the urban environment: trees, buildings, cars, road signs, etc.
This base (some examples of which are shown in figure 4) constitutes the data used for
training and validation of the proposed method. According to (Munder & Gavrila, 2006),
the three first parts are used for training, and the two last ones are used for validation. This
ensures that the validation is done independently of the training.
The results are shown in figure 5 for more precision. A ROC curve (Receiver Operating
Characteristic) presents the variations in sensitivity of a test for various values of the
discrimination threshold. The x-axis represents the false positive rate (non-pedestrians
classified as pedestrians) while the y-axis corresponds to the true positive rate (the
pedestrians which are correctly detected as such). Let us suppose a ROC curve through the
point (0.1; 0.9). That means that when 90% of the pedestrians are well classified, 10% of
non-pedestrians are badly classified. Most of the time, the classifiers association gives better
results than a standard AdaBoost. In this way, we can achieve a good recognition rate
despite the high dimensionality of the feature vectors.
For some thresholds, the SVM reaches an equivalent recognition rate. The decision threshold
should be selected carefully according to each application: it depends on the misclassification
rate an application can tolerate.
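To make the threshold trade-off concrete, here is a hedged sketch (synthetic scores, not the chapter's classifier) that sweeps the decision threshold to trace an ROC curve and then keeps the operating point with the best true-positive rate under a false-positive budget.

```python
import numpy as np

# Hedged sketch (synthetic scores, not the chapter's classifier): sweep
# the decision threshold over classifier scores to trace an ROC curve,
# then keep the operating point with the best true-positive rate whose
# false-positive rate stays under an application-chosen budget.

def roc_points(scores, labels):
    """(fpr, tpr) arrays, one point per threshold placed at each score."""
    order = np.argsort(-scores)              # descending score order
    labels = labels[order]
    tps = np.cumsum(labels == 1)             # true positives at each cut
    fps = np.cumsum(labels == 0)             # false positives at each cut
    tpr = tps / max((labels == 1).sum(), 1)
    fpr = fps / max((labels == 0).sum(), 1)
    return fpr, tpr

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 1, 0, 0, 0])
fpr, tpr = roc_points(scores, labels)

ok = fpr <= 0.25                             # tolerated false-positive rate
best = int(np.argmax(tpr[ok]))
best_fpr, best_tpr = float(fpr[ok][best]), float(tpr[ok][best])
print(best_fpr, best_tpr)                    # -> 0.25 1.0
```

Each distinct score acts as a candidate threshold, so the curve is traced without retraining the classifier.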
Fig. 6. The proposed method implemented with two different kernel machines.
The results presented in table 3 and in figure 7 show that these three descriptors work in
almost the same range of values. More precisely, the ROC curves show that histograms
of oriented gradients are better for this pedestrian recognition task. The binary descriptor,
despite its simplicity, achieves almost the same results. Haar wavelets do not reach the same
performance as the two other descriptors.
6. References
Campedel, M., Moulines, E., Maître, H. & Datcu, M. (2005). Feature Selection for Satellite
Image Indexing, ESA-EUSC 2005: Image Information Mining - Theory and Application to
Earth Observation.
Dalal, N. & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection, Proceed-
ings of the Conference on Computer Vision and Pattern Recognition (CVPR), San Diego,
California, USA, pp. 886–893.
Freeman, W. T. & Roth, M. (1995). Orientation histograms for hand gesture recognition,
Intl. IEEE Workshop on Automatic Face and Gesture Recognition, Zurich, Switzerland,
pp. 296–301.
Freund, Y. & Schapire, R. E. (1996). Experiments with a New Boosting Algorithm, Proceedings
of the 13th International Conference on Machine Learning (ICML), pp. 148–156.
Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection, The Journal
of Machine Learning Research 3: 1157–1182.
Hall, M. A. (2000). Correlation-based feature selection for discrete and numeric class machine
learning, Proceedings 17th International Conference on Machine Learning (ICML), Mor-
gan Kaufmann, pp. 359–366.
Kira, K. & Rendell, L. A. (1992). A practical approach to feature selection, Proceedings of the 9th
International Workshop on Machine Learning, Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, pp. 249–256.
Kohavi, R. & John, G. H. (1997). Wrappers for feature subset selection, Artificial Intelligence
97(1-2): 273–324.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of RELIEF, Proceedings of
the European Conference on Machine Learning (ECML).
Le, D. D. & Satoh, S. (2004). Feature Selection by AdaBoost for SVM-based Face Detection,
Forum on Information Technology pp. 183–186.
Lepetit, V. & Fua, P. (2006). Keypoint Recognition using Randomized Trees, IEEE Transactions
on Pattern Analysis and Machine Intelligence 28(9): 1465–1479.
Leyrit, L., Chateau, T., Tournayre, C. & Lapresté, J.-T. (2008). Association of AdaBoost and
Kernel Based Machine Learning Methods for Visual Pedestrian Recognition, IEEE
Intelligent Vehicles Symposium (IV 2008), Eindhoven, Netherlands.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints, International
Journal of Computer Vision 60(2): 91–110.
Moutarde, F., Stanciulescu, B. & Breheret, A. (2008). Real-time visual detection of vehicles and
pedestrians with new efficient adaBoost features, Workshop on Planning, Perception
and Navigation for Intelligent Vehicles (PPNIV) of International Conference on Intelligent
RObots and Systems (IROS), Nice, France.
Munder, S. & Gavrila, D. (2006). An Experimental Study on Pedestrian Classification, IEEE
Transactions on Pattern Analysis and Machine Intelligence (PAMI) 28(11).
Papageorgiou, C. & Poggio, T. (2000). A trainable system for object detection, International
Journal of Computer Vision 38(1): 15–33.
Rakotomamonjy, A. (2003). Variable selection using svm-based criteria, Journal of Machine
Learning Research 3: 1357–1370.
Shashua, A., Gdalyahu, Y. & Hayun, G. (2004). Pedestrian Detection for Driving Assistance:
Single-frame Classification and System Level Performance, Proceedings of the IEEE
Intelligent Vehicle Symposium (IV), Parma, Italy.
Suard, F., Rakotomamonjy, A., Bensrhair, A. & Broggi, A. (2006). Pedestrian detection using in-
frared images and histograms of oriented gradients, Proceedings of the IEEE Conference
of Intelligent Vehicles (IV), Tokyo, Japan, pp. 206–212.
Tipping, M. E. (2001). Sparse Bayesian Learning and the Relevance Vector Machine, Journal of
Machine Learning Research 1: 211–244.
Vapnik, V. (1998). Statistical Learning Theory, Wiley.
104 New Advances in Machine Learning
Viola, P. & Jones, M. (2001). Rapid object detection using a boosted cascade of simple fea-
tures, Proceedings of the IEEE International Conference on Computer Vision and Pattern
Recognition (CVPR), Vol. 1, pp. 511–518.
Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T. & Vapnik, V. (2000). Feature
Selection for SVMs, Neural Information Processing Systems, pp. 668–674.
Yu, L. & Liu, H. (2003). Feature selection for high-dimensional data: a fast correlation-based
filter solution, Proceedings of the International Conference on Machine Learning (ICML),
pp. 856–863.
From Feature Space to Primal Space: KPCA and Its Mixture Model
1. Introduction
Kernel principal component analysis (KPCA) (Schölkopf et al., 1998) has proven to be an
exceedingly popular technique in the fields of machine learning and pattern recognition, and
is discussed at length in the literature. KPCA performs linear PCA (Hotelling, 1933; Jolliffe,
2002) in a high- (and possibly infinite-) dimensional kernel-defined feature space that is
typically induced by a nonlinear mapping. In implementation, the so-called kernel trick is
employed: KPCA is expressed in terms of dot products between the mapped data points, and
the dot products are then evaluated by substituting an a priori kernel function. KPCA has
proven to be a useful tool for many application areas, including handwritten digit recognition
and de-noising (Schölkopf et al., 1998; Mika et al., 1999; Schölkopf et al., 1999), nonlinear
regression (Rosipal et al., 2001), face recognition (Kim et al., 2002a; Yang, 2002; Kong et al.,
2005), and complex image analysis (Kim et al., 2005; Li et al., 2008).
In practice, however, we are often confronted with situations that require processing a large
number of data points. This raises a problem for KPCA, since KPCA has to store and
diagonalize the kernel matrix (also known as the Gram matrix), whose size equals the square
of the number of training samples. So, for large-scale data sets, KPCA consumes considerable
storage and is computationally intensive (with time complexity O(n³), a cubic growth with n,
where n is the number of training samples). This makes KPCA impractical in some
circumstances. Another attendant problem is that directly eigendecomposing a large matrix
suffers from numerical accuracy issues. Some algorithms have been developed to address
these drawbacks of KPCA. By considering KPCA from a probabilistic point of view, Rosipal
and Girolami (2001) presented an expectation maximization (EM) (Dempster et al., 1977;
McLachlan & Krishnan, 1997) method for carrying out KPCA. Their algorithm has
computational complexity O(pn²) per iteration, where p is the number of extracted
components. Whereas the EM algorithm for KPCA does alleviate the computational demand,
it suffers from a rotational ambiguity. To remove this ambiguity, a constrained EM algorithm
for KPCA (and PCA) was formulated based on a coupled probability model (Ahn & Oh, 2003).
A further deficiency of these EM-type algorithms is that the kernel matrix still needs to be
stored. Kim et al. (2005) then derived the kernel Hebbian algorithm (KHA), the counterpart
of the generalized Hebbian algorithm (GHA) (Sanger, 1989), to perform KPCA iteratively,
where only linear-order memory complexity was
involved. However, the price to pay for this saving is that the time complexity is not under
control. Motivated by the idea of “divide and rule”, Zheng et al. (2005) proposed another
improved algorithm for KPCA as follows. First, the entire data set is divided into several
smaller data sets; then the sample covariance matrix of each smaller data set is approximately
computed; finally, kernel principal components are extracted by combining these approximate
covariance matrices. With their method, the computational demand and memory requirement
are effectively relieved. However, the advantages depend on many factors, such as the
required accuracy of the extracted components, the number of divided smaller data sets
(which is usually set empirically), and the data to be processed. As a generic methodology,
another thread for speeding up kernel machine learning is to seek a low-rank approximation
to the kernel matrix. Since, as noted by several researchers, the spectrum of the kernel matrix
tends to decay rapidly, a low-rank approximation often achieves the required precision.
Williams and Seeger (2001) used the Nyström method to compute an approximate eigenvalue
decomposition of the kernel matrix. Also, Smola and Schölkopf (2000) presented a sparse
greedy approximation technique. These two methods yield similar forms and performances.
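As a hedged illustration of the Nyström idea (synthetic data; the landmark count q is an arbitrary choice), the eigensystem of a q × q submatrix is extended to approximate the full kernel matrix:

```python
import numpy as np

# Nystrom sketch: approximate the eigendecomposition of an n x n kernel
# matrix from q randomly chosen columns, then reassemble a low-rank
# approximation K ~ Knq Kqq^{-1} Knq^T (Williams & Seeger, 2001).

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 8.0)                        # smooth kernel, fast decay

n, q = len(K), 40
idx = rng.choice(n, size=q, replace=False)
Kqq = K[np.ix_(idx, idx)]                    # q x q landmark block
Knq = K[:, idx]                              # n x q cross block

lam_q, U_q = np.linalg.eigh(Kqq)
keep = lam_q > 1e-10                         # guard against tiny pivots
lam_n = (n / q) * lam_q[keep]                # extended eigenvalues
U_n = np.sqrt(q / n) * (Knq @ (U_q[:, keep] / lam_q[keep]))

K_approx = U_n @ np.diag(lam_n) @ U_n.T      # equals Knq Kqq^+ Knq^T
rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(rel_err < 0.05)                        # small error on this kernel
```

Only the q landmark columns of K are ever needed, which is the source of the savings when q ≪ n.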
Another limitation of KPCA is that it defines only a global projection of the samples. When the
distribution of the data points is complex and non-convex, a global subspace based on KPCA
may fail to deliver good performance in terms of feature extraction and recognition. In input
space, Tipping and Bishop (1999) and Roweis and Ghahramani (1999) introduced mixtures of
PCA to remedy the same shortcoming of PCA. Kim et al. (2002b) used the mixture-of-eigenfaces
method for face recognition. There are many other papers on face recognition using mixture
methods, but as they do not focus on KPCA, references are omitted.
The contributions of this chapter are twofold: Firstly, viewing KPCA as a problem in primal
space with the “samples” created by using the incomplete Cholesky decomposition, we show
that KPCA is equivalent to performing linear PCA in the primal space using the created sam-
ples. So, the same kernel principal components as the standard KPCA are produced. Con-
sequently, all the improved methods dealing with linear PCA (such as the constrained EM
algorithm and the GHA method mentioned above), as well as directly diagonalizing the co-
variance matrix, could be applied to the created samples in the primal space to extract kernel
principal components. Theoretical analysis and experimental results on both artificial and real
data have shown the superiority of the proposed method for performing KPCA in terms of
computational efficiency and storage space, especially when the number of the data points is
large. Secondly, we extend KPCA to a mixture of local KPCA models by applying the mixture
model of the probabilistic PCA in the primal space. While KPCA uses one set of features to
model the data points, the mixture of KPCA uses more than one set of features. Therefore,
the mixture of KPCA is expected to represent data more effectively and has better recognition
performance than KPCA, which is also confirmed by the experiments.
The remainder of this chapter is organized as follows. The standard KPCA is briefly reviewed
in Section 2, and in Section 3, we formulate KPCA in the primal space using the incomplete
Cholesky decomposition. Next, we extend KPCA to its mixture model in Section 4. Experi-
mental results are presented in Section 5. In Section 6, we draw the conclusion.
2. Kernel principal component analysis

Given a set of training samples xi ∈ R^l (i = 1, . . . , n), KPCA first maps them into a feature
space F:

φ : R^l → F , xi → φ(xi), (i = 1, . . . , n) (1)
where φ is a typically nonlinear function. Then a standard linear PCA is performed in F using
the mapped samples. In evaluation, we don’t have to compute the mapping φ explicitly. The
mapped samples occur in the forms of dot products, say between φ(xi ) and φ(x j ), which are
computed by choosing a kernel function k:
k(xi , x j ) = (φ(xi ) · φ(x j )). (2)
The mapping φ into F such that (2) holds exists if k is a positive definite kernel, thanks to
Mercer's theorem of functional analysis. So, the mapping φ, and thus F, are fixed implicitly
via the function k. The dth-order polynomial kernel k(xi, xj) = (xi · xj)^d, the Gaussian kernel
with width σ > 0, k(xi, xj) = exp(−‖xi − xj‖²/(2σ²)), and the sigmoid kernel
k(xi, xj) = tanh(a(xi · xj) + b) are commonly used Mercer kernels.
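These three kernels are easy to state in code; the parameter values d, σ, a, b below are arbitrary illustrative choices.

```python
import numpy as np

# The three Mercer kernels listed above, as plain functions; the
# parameter values d, sigma, a, b are arbitrary illustrative choices.

def poly_kernel(xi, xj, d=2):
    """d-th order polynomial kernel (xi . xj)^d."""
    return float(np.dot(xi, xj)) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    """Gaussian kernel exp(-||xi - xj||^2 / (2 sigma^2))."""
    return float(np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2)))

def sigmoid_kernel(xi, xj, a=1.0, b=0.0):
    """Sigmoid kernel tanh(a (xi . xj) + b)."""
    return float(np.tanh(a * np.dot(xi, xj) + b))

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
p_val = poly_kernel(x, y)        # (1*2 + 2*1)^2 = 16.0
g_val = gaussian_kernel(x, x)    # distance 0 -> 1.0
s_val = sigmoid_kernel(x, y)
print(p_val, g_val)              # -> 16.0 1.0
```

Note that, strictly speaking, the sigmoid kernel is not positive definite for all choices of a and b, so Mercer's guarantee does not always apply to it.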
For notational simplicity, the mapped samples are assumed to be centered, i.e. ∑_{i=1}^{n} φ(xi) = 0.
We wish to find eigenvalues λ > 0 and associated eigenvectors v ∈ F \ {0} of the covariance
matrix of the mapped samples φ(xi ), given by
Cφ = (1/n) ∑_{i=1}^{n} φ(xi) φ(xi)^T , (3)
where T denotes the transpose of a vector or matrix. Since the mapping φ is implicit or Cφ
is very high dimensional, direct eigenvalue decomposition will be intractable. The difficulty
is circumvented by using the so-called kernel trick; that is, linear PCA in F is formulated
such that all the occurrences of φ are in the forms of dot products. And the dot products
are then replaced by the kernel function k. So, dealing with the φ-mapped data explicitly is
avoided. Specifically, since λv = Cφ v, all solutions v with λ = 0 fall in the subspace spanned
by {φ(x1 ), . . . , φ(xn )}. Therefore, v could be linearly represented by φ(xi ):
v = ∑_{i=1}^{n} αi φ(xi), (4)
where αi (i = 1, . . . , n) are coefficients. The eigenvalue problem is then reduced to the following
equivalent problem:
λ(φ(x j ) · v) = (φ(x j ) · Cφ v) for all j = 1, . . . , n. (5)
Substituting (3) and (4) into (5), we arrive at the eigenvalue equation

nλα = Kα, (6)

where K is the n × n kernel matrix with entries Kij = k(xi, xj) and α = (α1, . . . , αn)^T. Let
λ1 ≥ · · · ≥ λn denote the eigenvalues of K (that is, the solutions nλ of (6)) and α^1, . . . , α^n the
corresponding eigenvectors. The kth kernel principal component of a point t is then extracted
by projecting φ(t) onto the eigenvector v^k:

(v^k · φ(t)) = ∑_{i=1}^{n} α_i^k k(xi, t), (7)

where α^k has been normalized such that λk (α^k · α^k) = 1. Note that centering the vectors φ(xi)
and t in F is realized by centering the corresponding kernel matrices (Schölkopf et al., 1998).
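The whole procedure (centering the kernel matrix, solving the eigenproblem, normalizing α^k so that λk(α^k · α^k) = 1, and projecting) can be sketched as follows. This follows the standard recipe on toy data; the Gaussian kernel and its width are our own choices, not the chapter's code.

```python
import numpy as np

# Standard KPCA sketch: build K, center it, eigendecompose, normalize
# the alpha vectors, and extract kernel principal components of the
# training points. Toy data; sigma^2 = 1 chosen arbitrarily.

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                        # Gaussian kernel matrix

n = K.shape[0]
one = np.ones((n, n)) / n
Kc = K - one @ K - K @ one + one @ K @ one   # centered kernel matrix

mu, alpha = np.linalg.eigh(Kc)               # ascending eigenvalues of Kc
mu, alpha = mu[::-1], alpha[:, ::-1]         # sort descending
p = 2
alpha = alpha[:, :p] / np.sqrt(mu[:p])       # lambda_k (a^k . a^k) = 1
components = Kc @ alpha                      # kernel PCs of training data
print(components.shape)                      # (20, 2)
```

The normalization makes each feature-space eigenvector v^k unit length, which can be checked via (α^k)ᵀ Kc α^k = 1.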
3. KPCA in the primal space

Let φ(X) = [φ(x1), . . . , φ(xn)] denote the matrix of the mapped training samples. Applying
the partial Gram-Schmidt orthogonalization to φ(X) yields

φ(X) = QR, (9)

where Q has m orthonormal columns, R ∈ R^{m×n} is an upper triangular matrix with positive
diagonal elements, and m is the rank of φ(X). Note that the matrix R can be evaluated row
by row without computing φ explicitly. The partial Gram-Schmidt procedure pivots the
samples and selects the linearly independent samples in the feature space F. The orthogonalized
version of the selected independent samples, i.e. the columns of Q, is used as a set of
basis vectors. All φ-mapped data points can be linearly represented using this basis. Specifically,
the ith column of the matrix R contains the coefficients of the linear representation of φ(xi)
in the columns of Q. So, the columns of R are, in fact, the new coordinates in the
feature space of the corresponding data points of φ(X) in this basis. By (9), the following
decomposition of the kernel matrix is obtained:

K = φ(X)^T φ(X) = R^T Q^T Q R = R^T R, (10)

which is the incomplete Cholesky decomposition (Fine & Scheinberg, 2001; Bach & Jordan,
2002).
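A pivoted incomplete Cholesky can be sketched directly from K as below; the function name and stopping rule are ours, and the chapter's procedure builds R row by row via partial Gram-Schmidt without ever forming K in full.

```python
import numpy as np

# Pivoted incomplete Cholesky of a kernel matrix K: stop once the
# residual trace falls below tol, returning R (m x n) with R^T R ~ K.
# (Forming K explicitly here is for brevity only.)

def incomplete_cholesky(K, tol=1e-8):
    n = K.shape[0]
    d = np.diag(K).astype(float).copy()      # residual diagonal
    R = np.zeros((n, n))
    m = 0
    while d.sum() > tol and m < n:
        j = int(np.argmax(d))                # pivot: largest residual
        R[m, :] = (K[j, :] - R[:m, j] @ R[:m, :]) / np.sqrt(d[j])
        d = np.maximum(d - R[m, :] ** 2, 0.0)
        m += 1
    return R[:m, :]

# Rank-2 Gram matrix: the factorization should stop after m = 2 rows.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 2))
K = X @ X.T
R = incomplete_cholesky(K)
print(R.shape[0], np.allclose(R.T @ R, K, atol=1e-6))  # -> 2 True
```

When the spectrum of K decays rapidly, the loop terminates after m ≪ n rows, which is exactly the low-rank saving the chapter exploits.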
From (10), if defining a new mapping
φ̃ : F → R m , φ(xi ) → ri , (i = 1, . . . , n) (11)
where ri is the ith column of R, then the n vectors {r1, . . . , rn} give rise to the same Gram matrix
K (Shawe-Taylor & Cristianini, 2004); that is,

(ri · rj) = (φ(xi) · φ(xj)) = k(xi, xj), i, j = 1, . . . , n. (12)
The space R m is referred to as the primal space, and ri (i = 1, . . . , n) are viewed as “samples”.
From (9), we see that, if φ(xi ) are centered, then ri are also centered. In other words, ri could
be centered by centering the kernel matrix K. By using the samples ri created in the primal
space R m , we have the following
Theorem 1. Given observations xi (i = 1, . . . , n) and kernel function k, KPCA is equivalent to per-
forming linear PCA in the primal space R m using the created samples ri (i = 1, . . . , n), both of which
produce the same kernel principal components.
Proof. It suffices to note that the dot products between the φ-mapped samples in the feature
space F are the same as those between the corresponding φ̃-mapped samples in the primal
space R^m, and that linear PCA in both the feature space F (i.e., KPCA) and the primal space
R^m can be expressed entirely through dot products between samples. The equivalence
between KPCA and linear PCA in the primal space is thus established.
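The theorem is easy to check numerically. The sketch below, our own construction on toy data, takes R from a complete Cholesky factorization K = RᵀR for simplicity and compares the top kernel principal component obtained by both routes:

```python
import numpy as np

# Numerical check of Theorem 1: kernel PCs from the centered kernel
# matrix coincide, up to sign, with linear-PCA projections of the
# created samples r_i, the columns of R with K = R^T R.

rng = np.random.default_rng(2)
X = rng.standard_normal((15, 3))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)

n = len(K)
H = np.eye(n) - np.ones((n, n)) / n          # centering matrix

# Route 1: standard KPCA on the centered kernel matrix.
Kc = H @ K @ H
mu, alpha = np.linalg.eigh(Kc)
pcs_kernel = Kc @ (alpha[:, -1] / np.sqrt(mu[-1]))

# Route 2: linear PCA on the samples r_i (columns of R), K = R^T R.
R = np.linalg.cholesky(K).T                  # upper-triangular factor
Rc = R @ H                                   # center the samples r_i
C = Rc @ Rc.T / n                            # covariance in primal space
w, V = np.linalg.eigh(C)
pcs_primal = Rc.T @ V[:, -1]                 # project onto top eigenvector

err = min(np.linalg.norm(pcs_kernel - pcs_primal),
          np.linalg.norm(pcs_kernel + pcs_primal))
print(err < 1e-6)                            # True: same components
```

The sign ambiguity of eigenvectors is the only difference between the two routes, which is why the check compares both signs.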
Let

Cφ̃ = (1/n) ∑_{i=1}^{n} ri ri^T = (1/n) R R^T (13)

be the covariance matrix of the created samples (assumed centered), let λ̃1 ≥ λ̃2 ≥ · · · ≥ λ̃m
be its eigenvalues, and ṽ1, . . . , ṽm the corresponding eigenvectors. We proceed to compute the
kernel principal components of a testing point t. Firstly, we need to compute its φ̃-image r̄. We
carry out the projections of φ(t) onto the basis vectors, i.e., the columns of Q. This is achieved
by calculating an additional column of the matrix R in the partial Gram-Schmidt procedure
(Shawe-Taylor & Cristianini, 2004). The kernel principal components corresponding to φ(t)
are then computed as

(ṽk · r̄), k = 1, . . . , p. (14)

The leading eigenvectors of Cφ̃ can also be computed iteratively. The constrained EM
algorithm (Ahn & Oh, 2003) alternates an E-step (15) and an M-step (16) in which the
element-wise lower operator L is defined such that L(wst) = wst for s ≥ t and is
zero otherwise, the upper operator U is defined such that U(wst) = wst for s ≤ t and is zero
otherwise, and Z denotes the p × n matrix of latent variables. The matrix Γ at convergence is
equal to Γ = VΛ, where the columns of V = [ṽ1, . . . , ṽp] are the first p eigenvectors of Cφ̃, with
corresponding eigenvalues λ̃1, . . . , λ̃p forming the diagonal matrix Λ. Another extensively
used iterative method for PCA is the generalized Hebbian algorithm (GHA) (Sanger, 1989).
Based on GHA, the m × p eigenvector matrix V corresponding to the p largest eigenvalues is
updated according to the rule

V(t + 1) = V(t) + δ(t) [ r(t) y(t)^T − V(t) U(y(t) y(t)^T) ], (17)

where y = V^T r is the vector of principal components of r, and δ(t) is a learning-rate
parameter. Here, the argument t denotes a discrete time at which a sample r(t) is selected
randomly from all the samples ri. It has been shown by Sanger (1989) that, for proper settings
of the learning rate δ(t) and the initialization V(0), the columns of V converge to the
eigenvectors of Cφ̃ as t tends to infinity.
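A deterministic illustration of (17) replaces the random sample r(t) by its expectation, giving the averaged (batch) form of the update; the covariance matrix and step size below are our toy choices, not the chapter's setup.

```python
import numpy as np

# Averaged (batch) form of the GHA update (17): with r ~ N(0, C),
# E[r y^T] = C V and E[y y^T] = V^T C V, so
#   V <- V + delta * ( C V - V * triu(V^T C V) ).
# Columns of V converge to the leading eigenvectors of C.

rng = np.random.default_rng(4)
C = np.diag([4.0, 2.0, 1.0])                 # known eigenvectors e1, e2, e3
m, p = 3, 2
V = rng.standard_normal((m, p)) * 0.1        # small random initialization
delta = 0.05

for _ in range(2000):
    Y = V.T @ C @ V                          # E[y y^T]
    V += delta * (C @ V - V @ np.triu(Y))    # U(.) = upper-triangular part

Vn = np.abs(V / np.linalg.norm(V, axis=0))   # normalize, drop sign
print(np.round(Vn, 2))                       # columns ~ e1 and e2
```

The upper-triangular operator deflates each column against the preceding ones, which is what makes the columns converge to distinct eigenvectors rather than all to the leading one.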
In summary, the procedure of the proposed algorithm for performing KPCA is outlined as
follows:
1. Perform the incomplete Cholesky decomposition for the training points as well as test-
ing points to obtain R and r̄;
2. Compute the p leading eigenvectors corresponding to the first p largest eigenvalues of
the covariance matrix Cφ̃ (defined in (13)) by (a) directly diagonalizing Cφ̃ , (b) using the
constrained EM algorithm (according to (15) and (16)), or (c) using the GHA algorithm
(according to (17));
3. Extract kernel principal components by projecting each testing point onto the eigenvec-
tors (according to (14)).
Complexity analysis. The computational complexity of performing the incomplete Cholesky
decomposition is of order O(m²n), and the storage requirement is O(mn). Next, if one
explicitly evaluates the covariance matrix Cφ̃ and then diagonalizes it to obtain the eigenvectors,
the computational and storage complexities are O(m²n + m³) and O((p + m)m), respectively.
When p ≪ m, computational savings are possible by using the constrained EM algorithm,
whose time and storage complexities are respectively O(pmn) per iteration and O(p(m + n)).
If using the GHA method, the time complexity is O(p²m) per iteration and the storage
complexity O(pm). The potential efficiency gains, nevertheless, depend on the number of
iterations needed to reach the required precision and on the ratio of m to p. As one has seen,
the proposed method does not need to store the kernel matrix K, whose storage is O(n²), and
its computational complexity compares favorably with that of the traditional KPCA, which
scales as O(n³).
4. Mixture model of KPCA

In the mixture model, a separate mean, factor loading matrix and noise variance are
associated with each component of the g-component mixture model, therefore allowing each
component to model the data covariance structure in a different region of the primal space.
Since the true assignments of the ri to components are unknown, we use the marginal
probability density function (p.d.f.), which is a g-component Gaussian mixture p.d.f.,
f(ri; Θ) = ∑_{j=1}^{g} πj ϕ(ri; μj, Σj) (19)

for the observations, where ϕ(ri; μj, Σj) is the Gaussian p.d.f. with mean μj and covariance
Σj = Γj Γj^T + σj² Im. The model parameters are given by Θ =
(μ1, . . . , μg, Γ1, . . . , Γg, σ1², . . . , σg², π1, . . . , πg). For log-likelihood maximization, the model
parameters can be estimated via the EM algorithm as follows (Tipping & Bishop, 1999).
The E-step is

ẑij^(k) = π̂j^(k) ϕ(ri; μ̂j^(k), Σ̂j^(k)) / [ ∑_{l=1}^{g} π̂l^(k) ϕ(ri; μ̂l^(k), Σ̂l^(k)) ]. (20)
In the M-step, the parameters are updated according to

π̂j^(k+1) = (1/n) ∑_{i=1}^{n} ẑij^(k), (21)

μ̂j^(k+1) = ∑_{i=1}^{n} ẑij^(k) ri / ∑_{i=1}^{n} ẑij^(k), (22)

Γ̂j^(k+1) = Ŝj^(k+1/2) Γ̂j^(k) [ σ̂j^{2(k)} Ip + (σ̂j^{2(k)} Ip + Γ̂j^{(k)T} Γ̂j^{(k)})^{−1} Γ̂j^{(k)T} Ŝj^(k+1/2) Γ̂j^(k) ]^{−1}, (23)

σ̂j^{2(k+1)} = (1/m) tr[ Ŝj^(k+1/2) − Ŝj^(k+1/2) Γ̂j^(k) (σ̂j^{2(k)} Ip + Γ̂j^{(k)T} Γ̂j^{(k)})^{−1} Γ̂j^{(k+1)T} ], (24)

where

Ŝj^(k+1/2) = (1/(n π̂j^{(k+1)})) ∑_{i=1}^{n} ẑij^(k) (ri − μ̂j^{(k+1)})(ri − μ̂j^{(k+1)})^T. (25)
In the case σj² → 0, the converged Γ̂j contains the (scaled) eigenvectors of the local covariance
matrix Ŝj. Each component performs a local PCA weighted by its mixing proportion, and the
Γ̂j in this limit case are updated as
where

Ẑj^(k) = (L(Γ̂j^{(k)T} Γ̂j^{(k)}))^{−1} Γ̂j^{(k)T} R̂j^(k+1/2),

and

R̂j^(k+1/2) = [ (ẑ1j^(k)/(n π̂j^{(k+1)})) (r1 − μ̂j^{(k+1)}), . . . , (ẑnj^(k)/(n π̂j^{(k+1)})) (rn − μ̂j^{(k+1)}) ].
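One EM pass of (20)–(25) can be sketched as follows; for brevity the sketch carries full covariances Σj and stops at the weighted covariances Ŝj of (25), from which Γj and σj² would then be updated. The data and initial parameters are toy values of our own.

```python
import numpy as np

# One EM iteration for the g-component Gaussian mixture (19): E-step
# responsibilities (20), then M-step updates of the mixing proportions
# (21), means (22), and local weighted covariances (25). Recovering
# Gamma_j and sigma_j^2 from S_j is omitted here.

def em_step(Rm, pi, mu, Sigma):
    """Rm: n x m samples; pi: (g,); mu: g x m; Sigma: g x m x m."""
    n, m = Rm.shape
    g = len(pi)
    logp = np.zeros((n, g))
    for j in range(g):                       # E-step, Eq. (20)
        diff = Rm - mu[j]
        _, logdet = np.linalg.slogdet(Sigma[j])
        maha = np.sum(diff @ np.linalg.inv(Sigma[j]) * diff, axis=1)
        logp[:, j] = np.log(pi[j]) - 0.5 * (m * np.log(2 * np.pi)
                                            + logdet + maha)
    logp -= logp.max(axis=1, keepdims=True)  # for numerical stability
    z = np.exp(logp)
    z /= z.sum(axis=1, keepdims=True)
    Nj = z.sum(axis=0)                       # M-step, Eqs. (21), (22), (25)
    pi_new = Nj / n
    mu_new = (z.T @ Rm) / Nj[:, None]
    S = np.stack([(z[:, j, None] * (Rm - mu_new[j])).T @ (Rm - mu_new[j])
                  / Nj[j] for j in range(g)])
    return z, pi_new, mu_new, S

# Toy data: two well-separated clusters in the primal space R^2.
rng = np.random.default_rng(5)
Rm = np.vstack([rng.normal(-3.0, 0.5, (50, 2)),
                rng.normal(3.0, 0.5, (50, 2))])
pi0 = np.array([0.5, 0.5])
mu0 = np.array([[-1.0, 0.0], [1.0, 0.0]])
Sig0 = np.stack([np.eye(2), np.eye(2)])

z, pi1, mu1, S1 = em_step(Rm, pi0, mu0, Sig0)
print(np.round(pi1, 2))                      # ~[0.5 0.5]
```

Because the responsibilities are soft, every observation contributes to every component, weighted by its posterior probability, exactly as the text describes.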
Table 1. Comparison of training time and storage space on the toy example with 2000 data
points.
Table 2. Recognition rates of the 2007 testing points of the USPS handwritten digit database
using the standard KPCA, proposed KPCA and MKPCA methods with polynomial kernel of
degree two through four.
When the noise level becomes infinitesimal, the component p.d.f. ϕ(ri; μ̂j^(k), Σ̂j^(k)) in (20)
becomes singular: with probability 1, ri falls in the p-dimensional subspace span{Γ̂j^(k)}, i.e.,

ϕ(ri; μ̂j^(k), Σ̂j^(k)) = (2π)^{−p/2} det(Γ̂j^{(k)T} Γ̂j^{(k)})^{−1/2} exp( −â^{(k)T} â^(k) / 2 ), (27)

where â^(k) = (Γ̂j^{(k)T} Γ̂j^{(k)})^{−1} Γ̂j^{(k)T} (ri − μ̂j^(k)).
As can be seen, by applying the KPCA mixture model, all the observations are softly divided
into g clusters, each modelled by a local KPCA. We use the most appropriate local KPCA for
a given observation. Based on the probabilistic framework, a natural choice is to assign the
observation to the cluster for which its posterior probability is the largest.
5. Experiments
In this section, we use both artificial and real data sets to compare the performance of the
proposed method with that of the standard KPCA (Schölkopf et al., 1998). In the first example,
we use toy data to visually compare the results by projecting testing points onto the extracted
principal axes, and to show that the proposed method is superior to the standard KPCA in
terms of time and storage complexity. In the second example, we perform a handwritten digit
recognition experiment to further illustrate the effectiveness of the proposed method. All
experiments are run in Matlab on a machine with a 3.06 GHz CPU and 3.62 GB RAM.
Fig. 1. From left to right, the first three kernel principal components extracted by the standard
KPCA (top), the proposed KPCA (middle), and the MKPCA method with g = 2 (bottom),
using the Gaussian kernel k(xi, xj) = exp(−‖xi − xj‖²/0.1). The feature values are illustrated
by shading, and constant values are connected by contour lines.
Table 3. Comparison of training time and storage space on the USPS handwritten digit
database with 5000 training points.
From Fig. 1, we see that the proposed KPCA method obtains almost
the same results as the standard KPCA method (ignoring sign differences), and both methods
nicely separate the four clusters. For these data points, the average relative deviation between
the principal components found by the standard KPCA (by diagonalizing the kernel matrix K)
and by the proposed KPCA is less than 0.01. In this simple simulated experiment, the toy data
are compact in the feature space and are well modelled by the KPCA method. So, the MKPCA
method does not show its advantage; in fact, most of the toy data belong to one component of
the MKPCA model, since the estimated mixing proportions are π̂1 = 0.9645 and π̂2 = 0.0355.
As a result, the MKPCA method produces a similar result to that of the KPCA method.
For the USPS handwritten digit database, 5000 training points are chosen as training data, and
all 2007 testing points are used as testing data. A polynomial kernel with varying degree d is
used in each trial to compute the kernel function. We employ the nearest neighbour classifier
for classification. For MKPCA, we set g = 2. The recognition rates obtained by the three
approaches are reported in Table 2, while the training times and storage spaces consumed are
listed in Table 3. From Table 2, we see that the MKPCA method achieves the best recognition
rate among the three systems. The standard KPCA and the proposed approach to performing
KPCA have similar recognition rates. Nevertheless, the proposed KPCA reduces the time and
storage complexity significantly.
6. Conclusion
We have presented an improved algorithm for performing KPCA, especially when the number
of training samples is large. This is achieved by viewing KPCA as a primal-space problem
with the “samples” produced via the incomplete Cholesky decomposition. Since the spectrum
of the kernel matrix tends to decay rapidly, the incomplete Cholesky decomposition, as an
elegant low-rank approximation to the kernel matrix, achieves sufficient accuracy. Compared
with the standard KPCA method, the proposed KPCA method reduces the time and storage
requirements significantly for large-scale data sets.
In order to provide a locally linear model for the projection of the data onto a low-dimensional
subspace, we extend KPCA to a mixture of local KPCA models by applying the mixture model
of PCA in the primal space. MKPCA supplies an alternative choice for modelling data with
large variation. The mixture model outperforms the standard KPCA in terms of recognition
rate. The methodology introduced in this chapter can be applied to other kernel-based
algorithms, provided the algorithm can be expressed through dot products.
Acknowledgement
This work was supported by Specialized Research Fund for the Doctoral Program of Higher
Education of China under grant 20070286030, and National Natural Science Foundation of
China under grants 60803059 and 10871001.
7. References
Ahn, J. H. & Oh, J. H. (2003). A constrained EM algorithm for principal component analysis,
Neural Computation 15: 57–65.
Bach, F. R. & Jordan, M. I. (2002). Kernel independent component analysis, Journal of Machine
Learning Research 3: 1–48.
Cristianini, N., Lodhi, H. & Shawe-Taylor, J. (2002). Latent semantic kernels, Journal of Intelli-
gent Information System 18: 127–152.
Dempster, A. P., Laird, N. M. & Rubin, D. B. (1977). Maximum likelihood from incomplete
data using the EM algorithm (with discussion), J. R. Statist. Soc. Ser. B. 39: 1–38.
Fine, S. & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representation,
Technical Report RC 21911, IBM T.J. Watson Research Center.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components,
J. Educat. Psychology 24: 417–441.
Jolliffe, I. T. (2002). Principal Component Analysis, Second edition, Springer-Verlag, New York.
Kim, K. I., Franz, M. O. & Schölkopf, B. (2005). Iterative kernel principal component analysis
for image modelling, IEEE Transactions on Pattern Analysis and Machine Intelligence 27:
1351–1366.
Kim, K. I., Jung, K. & Kim, H. J. (2002a). Face recognition using kernel principal component
analysis, IEEE Signal Processing Letters 19: 40–42.
Kim, H. C., Kim, D. & Bang, S. Y. (2002b). Face recognition using the mixture-of-eigenfaces
method, Pattern Recognition Letters 23: 1549–1558.
Kong, H., Wang, L., Teoh, E. K., Li, X., Wang, J. G. & Venkateswarlu, R. (2005). Generalized 2D
principal component analysis for face image representation and recognition, Neural
Networks 18: 585–594.
Li, J., Li, M. L. & Tao, D. C. (2008). KPCA for semantic object extraction in images, Pattern
Recognition 41: 3244-3250.
McLachlan, G. J. & Krishnan, T. (1997). The EM Algorithm and Extensions, Wiley, New York.
Mika, S., Schölkopf, B., Smola, A. J., Müller, K. R., Scholz, M. & Rätsch, G. (1999). Kernel PCA
and de-Noising in feature spaces, in M. S. Kearns, S. A. Solla & D. A. Cohn (eds.),
Advances in Neural Information Processing Systems, Vol. 11, MIT Press, Cambridge,
Mass., pp. 536–542.
Rosipal, R. & Girolami, M. (2001). An expectation-maximization approach to nonlinear com-
ponent analysis, Neural Computation 13: 505–510.
Rosipal, R., Girolami, M., Trejo, L. J. & Cichocki, A. (2001). Kernel PCA for feature extraction
and de-noising in nonlinear regression, Neural Computing & Applications 10: 231–243.
Roweis, S. & Ghahramani, Z. (1999). A unifying review of linear gaussian models, Neural
Computation 11: 305–345.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neu-
ral network, Neural Networks 2: 459–473.
Schölkopf, B., Mika, S., Burges, C., Knirsch, P., Müller, K.-R., Rätsch, G. & Smola, A. J. (1999).
Input space vs. feature space in kernel-based methods, IEEE Transactions on Neural
Networks 10: 1000–1017.
Schölkopf, B., Smola, A. & Müller, K.-R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem, Neural Computation 10: 1299–1319.
Shawe-Taylor, J. & Cristianini, N. (2004). Kernel Methods for Pattern Analysis, Cambridge Uni-
versity Press, England.
Smola, A. J. & Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning,
Proceedings of the Seventeenth International Conference on Machine Learning, Morgan
Kaufmann.
Tipping, M. E. & Bishop, C. M. (1999). Mixtures of probabilistic principal component analy-
sers, Neural Computation 11: 443–482.
Williams, C. K. I. & Seeger, M. (2001). Using the Nyström method to speed up kernel machines,
Advances in Neural Information Processing Systems, Vol. 13, MIT Press.
Yang, M. H. (2002). Kernel eigenfaces vs. kernel fisherfaces: face recognition using kernel
methods, Proceedings of the Fifth IEEE International Conference on Automatic Face and
Gesture Recognition, Washington, DC, pp. 215–220.
Zheng, W., Zou, C. & Zhao, L. (2005). An improved algorithm for kernel principal component
analysis, Neural Processing Letters 22: 49–56.
Machine Learning for Multi-stage Selection of Numerical Methods
Erika Fuentes
Innovative Computing Laboratory, University of Tennessee
USA
Abstract
In various areas of numerical analysis, there are several possible algorithms for solving a
problem. In such cases, each method potentially solves the problem, but the runtimes can
differ widely, and breakdown is possible. Also, there is typically no governing theory for
finding the best method, or the theory is in essence uncomputable. Thus, the choice of the
optimal method is in practice determined by experimentation and ‘numerical folklore’.
However, a more systematic approach is needed, for instance because such choices may need
to be made in a dynamic context such as a time-evolving system.
Thus we formulate this as a classification problem: assign each numerical problem to a class
corresponding to the best method for solving that problem.
What makes this an interesting problem for Machine Learning is the large number of classes,
and their relationships. A method is a combination of (at least) a preconditioner and an itera-
tive scheme, making the total number of methods the product of these individual cardinalities.
Since this can be a very large number, we want to exploit this structure of the set of classes,
and find a way to classify the components of a method separately.
We have developed various techniques for such multi-stage recommendations, using automatic recognition of super-classes. These techniques are shown to pay off very well in our
application area of iterative linear system solvers.
We present the basic concepts of our recommendation strategy, and give an overview of
the software libraries that make up the Salsa (Self-Adapting Large-scale Solver Architecture)
project.
1. Introduction
In various areas of numerical analysis, there are several possible algorithms for solving a prob-
lem. Examples are the various direct and iterative solvers for sparse linear systems, or rou-
* This work was funded in part by the Los Alamos Computer Science Institute through the subcontract
# R71700J- 29200099 from Rice University, and by the National Science Foundation under grants 0203984
and 0406403.
The aim is then to find an algorithm that will reliably solve the problem, and do so in the minimum
amount of time.
Since several algorithm candidates exist, without an analytic way of deciding on the best algorithm in any given instance, we are looking at the classification problem of using non-numerical techniques for recommending the best algorithm for each particular problem instance.
In this chapter we will focus on linear system solving. The textbook algorithm, Gaussian
elimination, is reliable, but for large problems it may be too slow or require too much memory.
Instead, often one uses iterative solution methods, which are based on a principle of successive
approximation. Rather than computing a solution x directly, we compute a sequence of iterates xn
that we hope converges to the solution. (Typically, we have residuals rn that converge to zero,
rn ↓ 0, but this behaviour may not be monotone, so no decision method can be based on its
direct observation.)
or
Ã x̃ = f , where Ã = AB and x̃ = B−1 x, so x = B x̃.
Since there are many choices for the preconditioner, we now have a set of classes whose cardinality is the number of preconditioners times the number of iterative schemes.
Additionally, there are other transformations, such as permutations Pt AP, that can be applied for considerations such as load balancing in parallel calculations. In all, we are faced with a
large number of classes, but in a set that is the Cartesian product of sets of individual decisions
that together make up the definition of a numerical method. Clearly, given this structure,
treating the set of classes as just a flat collection will be suboptimal. In this chapter we consider
the problem of coming up with better strategies, and evaluating their effectiveness.
3. Formalization
In this section we formalize the notion of numerical problems and numerical methods for
solving problems. Our formalization will also immediately be reflected in the design of the
libraries that make up the Salsa project.
1 After this example we will tacitly omit the typedef line and only give the structure definition.
struct Problem_ {
LinearOperator A;
Vector RHS,KnownSolution,InitialGuess;
DesiredAccuracy constraint;
};
typedef struct Problem_* Problem;
struct PerformanceMeasurement_ {
int success;
double Tsetup,Tsolve;
double *ConvergenceHistory;
double BackwardError,ForwardError;
};
struct Result_ {
Vector ComputedSolution;
PerformanceMeasurement performance;
};
void ComputeQuantity(Problem problem,
char *feature,ReturnValue *result,TruthValue *success);
Here, the ReturnValue type is a union of all returnable types, and the success parameter
indicates whether the quantity was actually computed. Computation can fail for any number
of reasons: if sequential software is called in parallel, if properties such as matrix bandwidth
are asked of an implicitly given operator, et cetera. Failure is not considered catastrophic, and
the calling code should allow for this eventuality.
For a general and modular setup, we do not hardwire the existence of any module. Instead,
we use a function
4. Numerical methods
In a simple-minded view, method space can be considered as a finite, unordered, collection
of methods { M1 , . . . , Mk }, and in some applications this may even be the most appropriate
view. However, in the context of linear system solving a method is a more structured entity:
each method consists at least of the choice of a preconditioner and the choice of an iterative
scheme (QMR, GMRES, et cetera), both of which are independent of each other. Other possi-
ble components of a method are scaling of the system, and permutations for improved load
balancing. Thus we arrive at a picture of a number of preprocessing steps that transform the
original problem into another one with the same solution – or with a different solution that
can easily be transformed into that of the original problem – followed by a solver algorithm
that takes the transformed problem and yields its solution.
This section will formalize this further structure of the method space.
2 In the context of linear system solving, these will be Krylov methods, hence the choice of the letter ‘K’.
Other possibilities are permutations of the linear system, or approximations of the coefficient
matrix prior to forming the preconditioner.
Applying one preprocessor of each kind then gives us the definition of a method:
m ∈ M : m = k ◦ pn ◦ · · · ◦ p1 ,   k ∈ K, pi ∈ Pi    (2)
We leave open the possibility that certain preprocessors can be applied in any sequence (for
instance scaling and permuting a system commute), while for others different orderings are
allowed but not equivalent. Some preprocessors may need to be executed in a fixed location;
for instance, the computation of a preconditioner will usually come last in the sequence of
preprocessors.
Typically, a preprocessed problem has a different solution from the original problem, so each
preprocessor has a backtransformation operation, to be applied to the preprocessed solution.
4.2 Implementation
The set of system preprocessors, like that of the analysis modules above, has a two-level structure. First, there is the preprocessor type, for instance ‘scaling’. Then there is the specific choice within the type, for instance ‘left scaling’. Additionally, but not discussed here, there
can be parameters associated with either the type or the specific choice; for instance, we can
scale by a block diagonal, with the parameter indicating the size of the diagonal blocks.
We implement the sequence of preprocessors by a recursive routine:
void PreprocessedSolving
(char *method,Problem problem,Result *solution)
{
Problem preprocessed_problem; Result preprocessed_solution;
ApplyPreprocessor(problem,&preprocessed_problem);
if ( /* more preprocessors remain */ )
PreprocessedSolving(next_method,
preprocessed_problem,&preprocessed_solution);
else
Solve(final_method,
preprocessed_problem,&preprocessed_solution);
UnApplyPreprocessor(preprocessed_solution,solution);
}
The actual implementation is more complicated, but this pseudo-code conveys the essence.
We again adopt a modular approach where preprocessors are dynamically declared:
The SysPro (System Preprocessor) library provides a number of preprocessors, including the
forward and backward transformation of the systems. It also includes a framework for loop-
ing over the various choices of a preprocessor type, for instance for an exhaustive test.
5. Method selection
Our method selection problem can be formalized as that of constructing a function
Π : A → M: the problem classification function
that maps a given problem to the optimal method. Including feature extraction, we can also
define
Π : F → M: the classification function in terms of features
We start with a brief discussion of precisely what is meant by ‘optimal’. After that, we will
refine the definition of Π to reflect the preprocessor/solver structure, and we will address the
actual construction of Π.
Π( A) = M ≡ ∀ M′ ∈ M : T ( A, M) ≤ T ( A, M′ )    (3)
Several variant definitions are possible. Often, we are already satisfied if we can construct a function that picks a working method. For that criterion, we define Π non-uniquely as
Also, we usually do not insist on the absolutely fastest method: we can relax equation (3) to
Π( A) = M ≡ ∀ M′ ∈ M : (1 − ε) T ( A, M) ≤ T ( A, M′ )    (5)
which, for sufficient values of ε, also makes the definition non-unique. In both of the previous
cases we do not bother to define Π as a multi-valued function, but implicitly interpret Π( A) =
M to mean ‘M is one possible method satisfying the selection criterion’.
Formally, we define two classification types:
classification for reliability This is the problem of equation (4): finding any method that will solve the problem, that is, that will not break down, stall, or diverge.
classification for performance This is the problem of equation (3): finding the fastest method for a problem, possibly within a certain margin.
In a logical sense, the performance classification problem also solves the reliability problem.
In practice, however, classifiers are not infallible, so there is a danger that the performance
classifier will mispredict, not just by recommending a method that is slower than optimal, but
also by possibly recommending a diverging method. Therefore, in practice a combination of
these classifiers may be preferable.
We will now continue with discussing the practical construction of (the various guises of) the
selection function Π.
5.2 Examples
Let us consider some adaptive systems, and the shape that F, M, and Π take in them.
5.2.1 Atlas
Atlas Whaley et al. (2001) is a system that determines the optimal implementation of Blas
kernels such as matrix-matrix multiplication. One could say that the implementation chosen
by Atlas is independent of the inputs3 and only depends on the platform, which we will
consider a constant in this discussion. Essentially, this means that F is an empty space. The
number of dimensions of M is fairly low, consisting of algorithm parameters such as unrolling, blocking, and software pipelining.
In this case, Π is a constant function defined by
Π( f ) ≡ arg minM∈M T ( A, M)
where A is a representative problem. This minimum value can be found by a sequence of line
searches, as done in Atlas, or using other minimization techniques such as a modified simplex
method Yi et al. (2004).
5.2.2 Scalapack/LFC
The distributed dense linear algebra library Scalapack Choi et al. (1992) gives in its manual a
formula for execution time as a function of problem size N, the number of processors Np , and
the block size Nb . This is an example of a two-dimensional feature space (N, Np ), and a one-
dimensional method space: the choice of Nb . All dimensions range through positive integer
values. The function involves architectural parameters (speed, latency, bandwidth) that can
be fitted to observations.
Unfortunately, this story is too simple. The LFC software Roche & Dongarra (2002) takes
into account the fact that certain values of Np are disadvantageous, since they can only give
grids with bad aspect ratios. A prime number value of Np is a clear example, as this gives a
degenerate grid. In such a case it is often better to ignore one of the available processors and
use the remaining ones in a better shaped grid. This means that our method space becomes
two-dimensional with the addition of the actually used number of processors. This has a
complicated dependence on the number of available processors, and this dependence can
very well only be determined by exhaustive trials of all possibilities.
3 There are some very minor caveats for special cases, such as small or ‘skinny’ matrices.
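The grid-shape consideration can be sketched as follows; `best_grid` is a hypothetical helper (not the actual LFC code) that repeatedly ignores one processor whenever the best factorization of the processor count is a degenerate 1 × q grid:

```c
#include <assert.h>

/* Return in (r,c), r <= c, the most nearly square factorization of p. */
static void best_factor(int p, int *r, int *c) {
    for (int i = 1; i * i <= p; i++)
        if (p % i == 0) { *r = i; *c = p / i; }
}

/* Choose how many of the p available processors to actually use:
   drop one processor at a time while the best factorization is a
   degenerate 1 x q grid (e.g. p prime). Illustrative heuristic only. */
int best_grid(int p, int *r, int *c) {
    int q = p;
    best_factor(q, r, c);
    while (*r == 1 && q > 1) {
        q--;
        best_factor(q, r, c);
    }
    return q;    /* number of processors actually used */
}
```

For a prime processor count such as 13, this heuristic falls back to a 3 × 4 grid on 12 processors; as noted above, the true dependence on the available processor count is more complicated and may require exhaustive trials.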
can be done by bisection. If the topology depends on both m and p, we need to find the areas
in (m, p) space, which can again be done by some form of bisection.
5.3 Database
In our application we need to be explicit about the database on which the classification is
based. That issue is explored in this section.
4 There is actually no objection to having D( f , m) return 1 for more than one method m; this allows us to treat methods that are within a few percent of each other’s performance as equivalent. Formally we do this by extending and redefining the arg min function.
5 Please ignore the fact that the symbol S already had a meaning, higher up in this story.
This is to be interpreted as follows. For each numerical method there will be one func-
tion σ ∈ S, and σ ( f ) = 0 means that the method is entirely unsuitable for problems with
feature vector f , while σ( f ) = 1 means that the method is eminently suitable.
We formally associate suitability functions with numerical methods:
Since elements of S are defined on the whole space F, this is a well-posed definition.
The remaining question is how to construct the suitability functions. For this we need con-
structor functions
The mechanism of these can be any of a number of standard statistical techniques, such as
fitting a Gaussian distribution through the points in a subset F′ ∈ P(F ).
Clearly, now B = C ◦ B , and with the function B defined we can construct the selection
function Π as in equation (9).
FeatureSet symmetry;
NewFeatureSet(&symmetry);
AddToFeatureSet(symmetry,
"simple","norm-of-symm-part",&sidx);
AddToFeatureSet(symmetry,
"simple","norm-of-asymm-part",&aidx);
After problem features have been computed, a suitability function for a specific method can
then obtain the feature values and use them:
C p,k = { A ∈ A : T ( p, k, A) is minimal}
The main disadvantage to this approach is that, with a large number of methods to choose
from, some of the classes can be rather small, leading to insufficient data for an accurate clas-
sification.
In an alternative derivation of this approach, we consider the C p,k to be classes of preproces-
sors, but conditional upon the choice of a solver. We then recommend p and k, not as a pair
but sequentially: we first find the k̄ for which the best p can be found:
Πconditional ( f ) = 〈 arg max p σp,k̄ ( f ), k̄ 〉 ,   where k̄ := arg maxk max p σp,k ( f )
Πorthogonal ( f ) = 〈 ΠP ( f ), ΠK ( f ) 〉
and
C p = { A : mink T ( p, k, A) is minimal over all p},
giving functions σp , σk . (Instead of classifying by minimum over the other method component,
we could also use the average value.) The recommendation function is then
DK = ∪ p∈P DK,p ,   DK,p = { 〈 f , k, t〉 | k ∈ K, ∃ a ∈ A : f = φ( p( a)), t = T ( p, k, a) }
which gives us a single function ΠP and individual functions ΠK,p . This gives us
This approach to classification is potentially the most accurate, since both the preconditioner
and iterator recommendation are made based on the features of the actual problem they apply
to. This also means that this approach is the most expensive; both the combined and the or-
thogonal approach require only the features of the original problem. In practice, with a larger
number of preprocessors, one can combine these approaches. For instance, if a preprocessor
such as scaling can be classified based on some easy to compute features, it can be tackled
sequentially, while the preconditioner and iterator are then recommended with the combined
approach based on a full feature computation of the scaled problem.
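As a sketch of the orthogonal approach, with illustrative names and the suitability values σp, σk assumed to be precomputed from the problem features:

```c
#include <assert.h>

/* Orthogonal recommendation: preconditioner and iterator are chosen
   independently, each by maximizing its own suitability score.
   sp[i] and sk[j] are the precomputed values of sigma_p and sigma_k
   for a given feature vector. */
typedef struct { int p, k; } Method;

static int argmax(const double *s, int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (s[i] > s[best]) best = i;
    return best;
}

Method orthogonal_recommend(const double *sp, int np,
                            const double *sk, int nk) {
    Method m = { argmax(sp, np), argmax(sk, nk) };
    return m;
}
```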
For Bayesian classification, we can adopt the following strategy. For each M ∈ M, define the
set of problems on which it converges:
C M = { A : T ( M, A) < ∞}
C̄ M = { A : T ( M, A) = ∞}.
Now we construct functions σM , σ̄M based on both these sets. This gives a recommendation
function:
Π( f ) = { M : σM ( f ) > σ̄M ( f )}
This function is multi-valued, so we can either pick an arbitrary element from Π( f ), or the
element for which the excess σM ( f ) − σ̄M ( f ) is maximized.
The above strategies give only a fairly weak recommendation from the point of view of optimizing solve time. Rather than using reliability classification on its own, we can use it as a preliminary step before the performance classification.
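Such a two-stage combination might be sketched as follows, with per-method reliability scores sigma, sigbar and a performance score given as plain arrays; all names are illustrative:

```c
#include <assert.h>

/* Two-stage selection: first filter out methods that the reliability
   classifier predicts to diverge (sigma <= sigbar), then pick the
   method with the best performance score among the survivors.
   Returns -1 if no method is predicted to converge. */
int select_method(const double *sigma, const double *sigbar,
                  const double *perf, int nmethods) {
    int best = -1;
    for (int m = 0; m < nmethods; m++) {
        if (sigma[m] <= sigbar[m]) continue;   /* predicted unreliable */
        if (best < 0 || perf[m] > perf[best]) best = m;
    }
    return best;
}
```

This guards against the failure mode described above: a method that looks fastest to the performance classifier but is predicted to diverge is never recommended.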
7. Experiments
In this section we will report on the use of the techniques developed above, applied to the
problem of recommending a preconditioner and iterative method for solving a linear system.
The discussion on the experimental setup and results will be brief; results with much greater
detail can be found in Fuentes (2007).
We start by introducing some further concepts that facilitate the numerical tests.
Scaling Certain transformations on a test problem can affect the problem features, without
affecting the behaviour of methods, or being of relevance for the method choice. For
instance, scaling a linear system by a scalar factor does not influence the convergence
behaviour of iterative solvers. Also, features can differ by orders of magnitude. For this reason, we normalize features, for instance scaling them by the largest
diagonal element. We also mean-center features for classification methods that require
this.
Elimination Depending on the collection of test problems, a feature may be invariant, or
dependent on other features. We apply Principal Component Analysis Jackson (2003)
to the set of features, and use that to weed out irrelevant features.
7.2.3 Evaluation
In order to evaluate a classifier, we use the concept of accuracy. The accuracy α of a classifier
is defined as
α = ( #problems correctly classified ) / ( total #problems )
A further level of information can be obtained by looking at the details of misclassification:
a ‘confusion matrix’ is defined as A = (αij ) where αij is the ratio of problems belonging in
class i, classified in class j, to those belonging in class i. With this, αii is the accuracy of classi-
fier i, so, for an ideal classifier, A is a diagonal matrix with a diagonal ≡ 1; imperfect classifiers
have more weight off the diagonal.
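The accuracy and confusion-matrix definitions can be computed directly from predictions; this sketch assumes integer class labels 0 .. nclasses−1:

```c
#include <assert.h>

/* alpha = (#problems correctly classified) / (total #problems) */
double accuracy(const int *truth, const int *pred, int n) {
    int correct = 0;
    for (int k = 0; k < n; k++)
        if (truth[k] == pred[k]) correct++;
    return (double)correct / n;
}

/* A[i][j] = fraction of problems of true class i classified as j;
   A is a caller-provided nclasses*nclasses array in row-major order. */
void confusion(const int *truth, const int *pred, int n,
               int nclasses, double *A) {
    for (int i = 0; i < nclasses * nclasses; i++) A[i] = 0.0;
    for (int k = 0; k < n; k++)
        A[truth[k] * nclasses + pred[k]] += 1.0;
    for (int i = 0; i < nclasses; i++) {
        double row = 0.0;
        for (int j = 0; j < nclasses; j++) row += A[i * nclasses + j];
        if (row > 0.0)
            for (int j = 0; j < nclasses; j++) A[i * nclasses + j] /= row;
    }
}
```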
A further measure of experimental results is the confidence interval z, which indicates an interval within which the accuracy of a random trial will deviate from the presented average accuracy α by at most ±z Douglas & Montgomery (1999). We use z to delimit the confidence interval since we have used the Z-test Douglas & Montgomery (1999), commonly used in statistics. The confidence interval is a measure of how ‘stable’ the resulting accuracy
is for an experiment.
• B={ bcgs, bcgsl, bicg }, where bcgs is BiCGstab van der Vorst (1992), and bcgsl is
BiCGstab(ℓ) Sleijpen et al. (n.d.) with ℓ ≥ 2.
• G={ gmres, fgmres } where fgmres is the ‘flexible’ variant of GMRES Saad (1993).
• T={ tfqmr }
• C={ cgne }, conjugate gradients on the normal equations.
for iterative methods and
• A = { asm, rasm, bjacobi }, where asm is the Additive Schwarz method, and rasm is its
restricted variant Cai & Sarkis (1999); bjacobi is block-jacobi with a local ILU solve.
• BP = { boomeramg, parasails, pilut }; these are three preconditioners from the hypre pack-
age Falgout et al. (2006); Lawrence Livermore Lab, CASC group (n.d.)
• I = { ilu, silu }, where silu is an ILU preconditioner with shift Manteuffel (1980).
for preconditioners.
In table 1 we report the accuracy (as defined above) for a classification of all individual meth-
ods, while table 2 gives the result using superclasses. Clearly, classification using superclasses
is superior. All classifiers were based on decision trees Breiman et al. (1983); Dunham (2002).
Fig. 1. Confusion matrix for combined approach for classifying ( pc, ksp).
Finally, in figures 1, 2 we give confusion matrices for two different classification strategies for
the preconditioner / iterative method combination. The orthogonal approach gives superior
results, as evinced by the lesser weight off the diagonal. For this approach, there are fewer
classes to build classifiers for, so the modeling is more accurate. As a quantitative measure of
the confusion matrices, we report in table 3 the average and standard deviation of the fraction
of correctly classified matrices.
Fig. 2. Confusion matrix for orthogonal approach for classifying ( pc, ksp).
8. Conclusion
We have defined the relevant concepts for the use of machine learning for algorithm selection
in various areas of numerical analysis, in particular iterative linear system solving. An innova-
tive aspect of our approach is the multi-leveled approach to the set of objects (the algorithms)
to be classified. An important example of levels is the distinction between the iterative process
and the preconditioner in iterative linear system solvers. We have defined various strategies
for classifying subsequent levels. A numerical test testifies to the feasibility of using machine
learning to begin with, as well as the necessity for our multi-leveled approach.
9. References
Arnold, D., Blackford, S., Dongarra, J., Eijkhout, V. & Xu, T. (2000). Seamless access to adaptive
solver algorithms, in M. Bubak, J. Moscinski & M. Noga (eds), SGI Users’ Conference,
Academic Computer Center CYFRONET, pp. 23–30.
Balay, S., Gropp, W. D., McInnes, L. C. & Smith, B. F. (1999). PETSc home page.
http://www.mcs.anl.gov/petsc.
Bhowmick, S., Eijkhout, V., Freund, Y., Fuentes, E. & Keyes, D. (2006). Application of machine learning to the selection of sparse linear solvers, Int. J. High Perf. Comput. Appl. Submitted.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1983). CART: Classification and Regression
Trees.
Cai, X.-C. & Sarkis, M. (1999). A restricted additive Schwarz preconditioner for general sparse
linear systems, SIAM J. Sci. Comput. 21: 792–797.
Choi, Y., Dongarra, J. J., Pozo, R. & Walker, D. W. (1992). Scalapack: a scalable linear algebra
library for distributed memory concurrent computers, Proceedings of the fourth sym-
posium on the frontiers of massively parallel computation (Frontiers ’92), McLean, Virginia,
Oct 19–21, 1992, pp. 120–127.
Czyzyk, J., Mesnier, M. & Moré, J. (1996). The Network-Enabled Optimization System (NEOS) server, Technical Report MCS-P615-0996, Argonne National Laboratory, Argonne, IL.
Czyzyk, J., Mesnier, M. & Moré, J. (1998). The NEOS server, IEEE J. Comp. Sci. Engineering
5: 68–75.
Dongarra, J., Bosilca, G., Chen, Z., Eijkhout, V., Fagg, G. E., Fuentes, E., Langou, J., Luszczek,
P., Pjesivac-Grbovic, J., Seymour, K., You, H. & Vadiyar, S. S. (2006). Self adapt-
ing numerical software (SANS) effort, IBM J. of R.& D. 50: 223–238. http://www.
research.ibm.com/journal/rd50-23.html, also UT-CS-05-554 University of
Tennessee, Computer Science Department.
Douglas, C. & Montgomery, G. (1999). Applied Statistics and Probability for Engineers.
Dunham, M. (2002). Data Mining: Introductory and Advanced Topics, Prentice Hall PTR Upper
Saddle River, NJ, USA.
Eijkhout, V. & Fuentes, E. (2007). A standard and software for numerical metadata, Techni-
cal Report TR-07-01, Texas Advanced Computing Center, The University of Texas at
Austin. To appear in ACM TOMS.
Falgout, R., Jones, J. & Yang, U. (2006). The design and implementation of hypre, a library
of parallel high performance preconditioners, Numerical Solution of Partial Differential
Equations on Parallel Computers, A.M. Bruaset and A. Tveito, eds., Vol. 51, Springer-
Verlag, pp. 267–294. UCRL-JRNL-205459.
Fuentes, E. (2007). Statistical and Machine Learning Techniques Applied to Algorithm Selection
for Solving Sparse Linear Systems, PhD thesis, University of Tennessee, Knoxville TN,
USA.
Greenbaum, A. & Strakos, Z. (1996). Any nonincreasing convergence curve is possible for
GMRES, SIMAT 17: 465–469.
Gropp, W. D. & Smith, B. F. (n.d.). Scalable, extensible, and portable numerical libraries,
Proceedings of the Scalable Parallel Libraries Conference, IEEE 1994, pp. 87–93.
Houstis, E., Verykios, V., Catlin, A., Ramakrishnan, N. & Rice, J. (2000). PYTHIA II: A knowl-
edge/database system for testing and recommending scientific software.
Jackson, J. (2003). A User’s Guide to Principal Components, Wiley-IEEE.
Lawrence Livermore Lab, CASC group (n.d.). Scalable Linear Solvers. http://www.llnl.
gov/CASC/linear_solvers/.
Manteuffel, T. (1980). An incomplete factorization technique for positive definite linear sys-
tems, Math. Comp. 34: 473–497.
Otto, S., Dongarra, J., Huss-Lederman, S., Snir, M. & Walker, D. (1995). Message Passing Inter-
face: The Complete Reference, The MIT Press.
Ramakrishnan, N. & Ribbens, C. J. (2000). Mining and visualizing recommendation spaces for
elliptic PDEs with continuous attributes, ACM Trans. Math. Software 26: 254–273.
Roche, K. J. & Dongarra, J. J. (2002). Deploying parallel numerical library routines to cluster
computing in a self adapting fashion. Submitted.
Saad, Y. (1993). A flexible inner-outer preconditioned GMRES algorithm, SIAM J. Sci. Stat.
Comput. 14: 461–469.
Salsa Project (n.d.a). SALSA: Self-Adapting Large-scale Solver Architecture. http://icl.
cs.utk.edu/salsa/.
9
Hierarchical Reinforcement Learning Using a Modular Fuzzy Model for Multi-Agent Problem
1. Introduction
Reinforcement learning (Sutton & Barto, 1998; Watkins & Dayan, 1998; Grefenstette, 1988;
Miyazaki et al., 1999) is an indispensable machine learning approach for realizing intelligent
agents such as autonomous mobile robots, and its importance is discussed in several pieces of
literature. However, compared with other learning techniques such as neural networks, many
problems remain in applying reinforcement learning to real applications. One of the main
problems in applying reinforcement learning to problems of realistic size is the “curse of
dimensionality” in partitioning multi-input sensory states. A high input dimension leads to a
huge number of rules in a reinforcement learning application, which should be avoided in order
to maintain computational efficiency. Multi-agent problems such as the pursuit problem
(Benda et al., 1985; Ito & Kanabuchi, 2001) are typically difficult for reinforcement learning
computation in terms of their huge dimensionality. As a related problem, learning a complex
task is inherently hard because reinforcement learning is based only upon rewards derived
from the environment.
In order to deal with these problems, several effective approaches have been studied. To relax
task complexity, several types of hierarchical reinforcement learning have been proposed for
actual applications (Takahashi & Asada, 1999; Morimoto & Doya, 2000). To avoid the curse of
dimensionality, there exists modular hierarchical learning (Ono & Fukumoto, 1996; Fujita &
Matsuno, 2005), which constructs the learning model as a combination of subspaces. Adaptive
segmentation (Murano & Kitamura, 1997; Hamagami et al., 2003), which constructs a learning
model that validly corresponds to the environment, has also been studied. However, further
effective techniques of a different nature are still necessary in order to apply reinforcement
learning to problems of realistic size.
In this chapter, I focus on the well-known pursuit problem and propose a hierarchical
modular reinforcement learning in which the Profit Sharing learning algorithm is combined
hierarchically with the Q-Learning reinforcement learning algorithm in a multi-agent
environment. As the model structure for such a huge problem, I propose a modular fuzzy
model extending the SIRMs architecture (Seki et al., 2006; Yubazaki et al., 1997). Through
numerical experiments, I show the effectiveness of the proposed algorithm compared with
conventional algorithms.
(Figure: the pursuit problem; a prey agent and hunter agents on a grid with walls; successful capturing may utilize the wall.)
The hunter agents can utilize walls for surrounding the prey, as well as surrounding it by hunter agents alone. When the surrounding is successfully performed, the participating hunter agents receive
reward from the environment to carry out reinforcement learning. As for the behavior of the
prey agent, it runs away from the nearest hunter agent, playing a fugitive role.
For actual computer simulations or mobile robot applications, it is indispensable to avoid
huge memory consumption for the state space, i.e. the “curse of dimensionality”, and to
improve the slow learning speed caused by its sparsity (e.g. of the Q-values acquired through
reinforcement learning). In this study, I focus on the 4-agent pursuit problem to improve the
precision and efficiency of reinforcement learning in a multi-agent environment and to
demonstrate a resolution of the “curse of dimensionality”.
For the simulation study, I adopt a “soft-max” strategy for selecting the actions of the hunter
agents. The conditional probability based on the Boltzmann distribution for action selection,
together with the cooling schedule, is as follows:

p( a | s ) = exp( w( s, a) / Tt ) / Σd∈N exp( w( s, d) / Tt ) ,    Tt+1 = β Tt    (1)
where Tt is the temperature at the t-th iteration, s is the state vector, a is the action of the
agent, β is the temperature-cooling parameter (0<β<1), w denotes the evaluation value for a
state-and-action pair, and N denotes the set of all alternative actions at the state s. Owing to
this mechanism, the hunter agents act like a random walk (exploring) under the high
temperature values of the early simulation trials, and act deterministically, based on the
acquired evaluation values, in the later simulation trials as the temperature is lowered.
important to keep learning capability as well as task decomposition. According to the two-layered decomposition, rules in the lower layer can be adapted in correspondence with the agent behavior at every step as a Markov Decision Process, as shown in Fig. 4.
Fig. 3. Internal Hierarchical Structure of Hunter Agent (upper layer: Profit Sharing over rules of the form “IF (Monitored State) THEN (Target Position)”; lower layer: Q-Learning)
explanation simplicity. Higher dimensions can also be treated in the same manner. The original state space of each agent is expressed as a modular model by covering it with three subspaces of oneself-and-another pairs, as shown in Fig. 6.
(Figure: the state space ( g, s1 , s2 , s3 , s4 ) covered by oneself-and-another subspaces.)
The weights of rules in the upper layer are updated by the Profit Sharing learning algorithm (Miyazaki et al., 1999) when capturing succeeds, as in the following formulation:
p = arg maxq : q≠e u( e, g, μ, hq ) ,   v ≠ he    (3)
where he denotes the current position of the agent, v denotes a candidate for the target position, q denotes the other agent, and μ is a parameter. Due to these state selections, the target position is generated as a valid sub-goal and sent to the lower layer.
Q( se,t , ae,t , c) ← Q( se,t , ae,t , c) + α [ rt + γ maxa′ Q( se,t+1 , a′ , c) − Q( se,t , ae,t , c) ]    (4)
where Q is Q-value, se,t is the state vector of the agent e at t-th step, ae,t is action of the agent e
at t-th step, c denotes the state for updating, r denotes the reward, and α, γ are parameters. It
should be noted that the current state of the agent moved from the other position always
receive rewards considered as the virtual targeted state, internally.
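Eq. (4) is the standard Q-Learning update applied per target state c. A small sketch using a Python dict for the Q-table (the dict representation and function name are my own):

```python
def q_update(Q, state, action, reward, next_state, actions, alpha, gamma, c):
    """One Q-Learning update (Eq. 4) for the lower layer.

    Q: dict mapping (state, action, c) -> value, where c is the target
    position supplied by the upper layer. Unseen entries default to 0.
    """
    best_next = max(Q.get((next_state, a, c), 0.0) for a in actions)
    key = (state, action, c)
    old = Q.get(key, 0.0)
    Q[key] = old + alpha * (reward + gamma * best_next - old)
    return Q[key]
```

For example, with alpha = 0.5 and gamma = 0.9, an initial update from a zero table moves the value halfway toward the received reward.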
huge multi-dimensional problems. I extend the SIRMs method by relaxing the restriction on the input space of each rule from a single input to an arbitrary subspace.
I propose a "Modular Fuzzy Model" for constructing models of huge multi-dimensional spaces. The description of the model is as follows:
Rules-i : { if P_i(x) is A_j^i then y_i = f_j^i(P_i(x)) }_{j=1}^{m_i}
  ⋮
Rules-n : { if P_n(x) is A_j^n then y_n = f_j^n(P_n(x)) }_{j=1}^{m_n}    (5)
where "Rules-i" stands for the i-th fuzzy rule module, P_i(x) denotes the predetermined projection of the input vector x in the i-th module, y_i is the output variable, and n is the number of rule modules. The number of constituent rules in the i-th fuzzy rule module is m_i. f is the function of the consequent part of the rule, as in the TSK fuzzy model (Takagi & Sugeno, 1985). A_j^i denotes the fuzzy sets defined in the projected space.
The membership degree of the antecedent part of the j-th rule in the "Rules-i" module is calculated as:

h_j^i = A_j^i(P_i(x_0))    (6)

where h denotes the membership degree and x_0 is an input vector. The output of the fuzzy reasoning of each module is decided by the following equation.
y_i^0 = ( Σ_{k=1}^{m_i} h_k^i · f_k^i(P_i(x_0)) ) / ( Σ_{k=1}^{m_i} h_k^i )    (7)
The final output is the importance-weighted sum of the module outputs:

y^0 = Σ_{i=1}^{n} w_i · y_i^0    (8)

where w_i denotes the importance parameter of the i-th rule module. The parameter can also be formulated as the output of a rule-based system, as in a modular neural network structure (Auda & Kamel, 1999). Figure 7 shows the structure of the Modular Fuzzy Model.
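As a sketch of this inference, assuming the module outputs are combined by the importance-weighted sum that the text around Eq.(8) describes (the function names are mine):

```python
def module_output(memberships, consequents):
    """Weighted-average defuzzification for one rule module (Eq. 7):
    y_i^0 = sum_k h_k * f_k / sum_k h_k."""
    num = sum(h * f for h, f in zip(memberships, consequents))
    den = sum(memberships)
    return num / den

def modular_fuzzy_output(modules, weights):
    """Combine the module outputs with the importance parameters w_i:
    y^0 = sum_i w_i * y_i^0. `modules` is a list of
    (memberships, consequent values) pairs, one per module."""
    return sum(w * module_output(h, f)
               for w, (h, f) in zip(weights, modules))
```

With all importance parameters set to 1.0, as in the chapter, the output is simply the sum of the per-module weighted averages.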
Fig. 7. Structure of the Modular Fuzzy Model (rule modules of the form "IF (x1, x3) is A_j THEN y = b_j" are combined through weight parameters to form the output y)
Rules-i : { if P_i(x) is A_j^i then y_i = b_j^i }_{j=1}^{m_i}
  ⋮
Rules-n : { if P_n(x) is A_j^n then y_n = b_j^n }_{j=1}^{m_n}    (9)
The importance parameter in Eq.(8) is set to 1.0 in this study. Instead of the "crisp type" modular model described in section 3.3, I apply the modular fuzzy model to the upper-layer model in the hierarchical reinforcement learning for the pursuit problem. In addition to the usual crisp partition of the agent position shown in Fig.8, fuzzy sets of the position are defined as shown in Fig.9. The antecedent fuzzy sets are defined by Cartesian products of the fuzzy sets on each agent-position state.
Fig. 8. Usual Crisp Partition of Agent Position (5×5 = 25 partitions; horizontal positions H1–H5 and vertical positions V1–V5, each with crisp membership functions)
u in Eq.(2) is calculated by the modular fuzzy model and is learned by the profit sharing algorithm, taking the membership degrees of the rules into account. In this study, I assume that the number of fuzzy sets and the parameters of the premise part are decided in advance. The real-valued parameters of the consequent part are learned by the profit-sharing algorithm. The parameters are modified as:
Δb_j^i = ( h_j^i / Σ_{k=1}^{m_i} h_k^i ) · f    (10)

where f denotes the reinforcement function in Eq.(2). The denominator in Eq.(10) can be omitted in actual processing because its value is always 1.0 by the definition of the fuzzy sets described above.
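The consequent update above can be sketched as follows (a hypothetical helper of my own; the reinforcement value f is assumed to be supplied by the profit-sharing credit assignment):

```python
def profit_sharing_update(b, memberships, reinforcement):
    """Update the consequent parameters b_j of one module (Eq. 10):
    each b_j receives (h_j / sum_k h_k) * f when capturing succeeds.
    If the memberships always sum to 1.0 (as with the fuzzy partition
    of Fig. 9), the normalization is a no-op and can be dropped."""
    total = sum(memberships)
    for j, h in enumerate(memberships):
        b[j] += (h / total) * reinforcement
    return b
```

The reinforcement is thus distributed over the fired rules in proportion to their membership degrees.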
Fig. 9. Fuzzy Partition of Agent Position (3×3 = 9 partitions; horizontal fuzzy sets HL, HM, HH and vertical fuzzy sets VL, VM, VH, with membership degrees between 0.0 and 1.0)
5. Numerical experiments
5.1 Results compared with conventional learning methods
In the pursuit problem, the performance of the proposed hierarchical modular reinforcement learning method is compared with conventional methods through computer simulations. The size of the pursuit problem is 5×5. The absolute coordinates of the agent positions are used in the experiments. Relative coordinates are not used in order to evaluate the essential performance of the proposed algorithm in terms of precision of learning, learning speed, and memory consumption. As basic simulation conditions, the agents cannot communicate with each other but can monitor the positions of the other agents. The behavior rule of the prey agent is set to random behavior, because random behavior theoretically contains every action strategy. The initial placement of the prey agent and the hunter agents is shown in Fig.10.
The proposed methods are compared with the simple Q-Learning algorithm in order to evaluate their basic performance. In the experiments, it is assumed that the Q-Learning agent (not hierarchically structured) can only utilize the position of the prey agent in addition to its own position. The Q-Learning agent decides its action by calculating the Q-value defined as Q(g, s_e, a_e) from the sensed position of the prey agent and its own position, where s_e is the position of agent e, a_e is the corresponding action of agent e, and g is the position of the prey agent.
As for the hierarchical modular reinforcement learning agents, three methods are simulated. Their hierarchical structures and the lower layer driven by Q-Learning are the same, but the expressions of the upper layer differ. The first method has a completely expressed upper layer: the target position to move to is decided from all positions of the hunter agents and the prey agent. The number of rules in the upper layer is 25×25×25×25×25 = 9,765,625. The second method is the "crisp" modular model for the upper layer. The number of rules in the upper layer of each agent is (25×25×25×25)×3 = 1,171,875. The last method is the modular fuzzy model for the upper layer. Detailed constructions of the model are described in the next subsection. For example, the upper layer of the 1st agent of the modular fuzzy model is constructed as:
Rules : { if (g, h_e, h_{q1}, h_{q2}) is A_j then u = b_j }_{j=1}^{m}

Fig. 11. Simulation Results (episode length versus the number of trials, ×100)
where g is the position of the prey agent, h is the position of a hunter agent, and b is the parameter of the consequent part of the fuzzy rule. The fuzzy set A is constructed by combining the crisp sets of the agent's own position and the prey agent's position with the fuzzy sets of the other two hunter agents' positions, defined by partitioning the grid into 3×3 as shown in Fig.9. The number of rules in the upper layer, (25×25×9×9)×3 = 151,875, is much smaller than in the other methods.
I performed the simulation 20 times for each method. The number of trials in each simulation is 20,000. The results are shown in Fig.11. The depicted data are values averaged over the 20 series after averaging over each sequential 100 trials. The results of the modular fuzzy model (depicted as ModFuzzy) show the best performance compared with the other methods: both the learning speed and the precision of learning are desirable. Furthermore, the required memory is much smaller than for the other methods. The results of the "crisp" modular model (depicted as CrispMod) also show good performance. The complete expression model (depicted as NonMod) cannot acquire rules efficiently, and its performance deteriorates over time. This seems to be caused by the sparsity of the model expression. The simple Q-Learning agent (NonH-Q) is unexpectedly not so bad in the small 5×5 grid world. The strategy of simply approaching the prey agent, acquired by the non-hierarchical Q-Learning, might be reasonable in such a small world. However, as knowledge about the surrounding task cannot be learned at all in such a model expression, successful surrounding depends completely on accidental behavior of the prey agent.
changed as well as the dimension. The results are summarized in Table 1. In this table, the averaged value, standard deviation, and standard error of the episode-length average of the last 100 trials over 20 simulations are shown, together with the number of partitions and the number of
Model   Partitions of Agent Position               Rules per     Episode Length of Last 100 Trials (20 runs)
ID      Target  Own  Other1  Other2  Other3        One Agent     Average   Std. Dev.   Std. Err.
m333xx     9     9      9      -       -               2,187      225.77      310.71       69.48
m533xx    25     9      9      -       -               6,075      142.76       68.08       15.22
m335xx     9     9     25      -       -               6,075       98.27       44.75       10.01
m353xx     9    25      9      -       -               6,075        8.25        1.70        0.38
m535xx    25     9     25      -       -              16,875      121.99       85.53       19.12
m553xx    25    25      9      -       -              16,875        5.97        0.50        0.11
m355xx     9    25     25      -       -              16,875       10.94        1.06        0.24
c555xx    25    25     25      -       -              46,875       11.30       22.20        4.96
m3355x     9     9     25     25       -             151,875      115.76       33.90        6.92
m5533x    25    25      9      9       -             151,875        5.81        0.33        0.07
c5555x    25    25     25     25       -           1,171,875        9.07        0.67        0.14
u55555    25    25     25     25      25           9,765,625      271.49      283.88       63.48

Table 1. Episode length of the last 100 trials (over 20 simulations) for each model, with the partition numbers and the number of rules per agent
Fig. 12. Comparison of Modular Fuzzy Model (m5533x) and Crisp Modular Model (c5555x): episode length with standard deviation over trials 18,100–20,000 (averaged per 100 trials)
rules corresponding to the model. From the results of the first four models, the agent's own position should be partitioned by crisp sets, i.e., m353xx. From the results of the next four models, the agent's own position and the position of the target, i.e., the prey agent, should be partitioned by crisp sets, i.e., m553xx. From these observations, the model construction is performed heuristically, as shown in the last four results in the table. Among these, the m5533x model has the best performance. A comparison with another good model (c5555x) is shown in Fig.12. The significance of the m5533x model's performance relative to the other good models is also investigated by the t-test. Compared with the m553xx model, the null hypothesis, i.e., that the means do not differ, is rejected at the statistical significance level of 0.01. As the comparisons with the other models are obvious, their description is omitted.
The results of the proposed model indicate that the learned agents can perform the surrounding task within six moves against almost all behavior patterns of the prey agent. This level cannot be attained without collaborative behavior of the learned agents. In addition to the drastically improved learning speed, it can be said that the precision level of learning is sufficient compared with the conventional techniques.
6. Conclusion
In this chapter, I focused on the pursuit problem and proposed a hierarchical modular reinforcement learning method in which the Profit Sharing learning algorithm is hierarchically combined with the Q-Learning reinforcement learning algorithm in a multi-agent environment. As the model structure for such a huge problem, I proposed a modular fuzzy model extending the SIRMs architecture. Through numerical experiments, I showed the effectiveness of the proposed algorithm compared with the conventional algorithms. My future plans concerning the proposed methods include application to other multi-agent problems and complex task problems.
7. References
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction, MIT Press
Watkins, C. J. C. H. & Dayan, P. (1992). Technical Note: Q-Learning, Machine Learning, Vol.8, pp.279-292
Grefenstette, J. J. (1988). Credit Assignment in Rule Discovery Systems Based on Genetic Algorithms, Machine Learning, Vol.3, pp.225-245
Miyazaki, K.; Kimura, H. & Kobayashi, S. (1999). Theory and Application of Reinforcement Learning Based on Profit Sharing, Journal of JSAI, Vol.14, No.5, pp.800-807
Miyazaki, S.; Arai, S. & Kobayashi, S. (1999). A Theory of Profit Sharing in Multi-agent Reinforcement Learning, Journal of JSAI, Vol.14, No.6, pp.1156-1164
Benda, M.; Jagannathan, V. & Dodhiawalla, R. (1985). On Optimal Cooperation of Knowledge Sources, Technical Report BCS-G2010-28, Boeing AI Center
Ito, A. & Kanabuchi, M. (2001). Speeding up Multi-Agent Reinforcement Learning by Coarse-Graining of Perception - Hunter Game as an Example -, Transactions of IEICE, Vol.J84-D-1, No.3, pp.285-293
Takahashi, Y. & Asada, M. (1999). Behavior Acquisition by Multi-Layered Reinforcement Learning, Proceedings of the 1999 IEEE International Conference on Systems, Man, and Cybernetics, pp.716-721
Morimoto, J. & Doya, K. (2000). Acquisition of Stand-up Behavior by a Real Robot using Hierarchical Reinforcement Learning, Proceedings of the International Conference on Machine Learning, pp.623-630
10

Random Forest-LNS Architecture and Vision
1. Introduction
Combining multiple classifiers (e.g., decision trees) to build an ensemble is an advanced machine learning technique offering substantial improvement over single classifiers. Random forests (RFs) (1), a representative decision-tree-based ensemble, have emerged as a principal machine learning tool, combining the properties of an efficient classifier and a feature selection model while running on general-purpose-processor (GPP-based) custom hardware and optimized operating systems. Rather than minimizing training error, RF minimizes the generalization error, while being fast to train, proven not to overfit, and computationally effective (O(√V T log T), where V is the number of variables and T is the number of observations). These merits make RF a potential tool suited for adaptive classification problems. RF has been applied to vision problems such as object recognition (2–7). It has also been used for OCR (8) and for key point recognition (9). Despite the apparent success of RF, virtually no work has been done to map its ideal mathematical model to a compact and reliable hardware design.
In this chapter we present an object recognition system implemented on a field-programmable gate array (FPGA) that enables the learning algorithm to scale up. Fig.1 shows the general architecture of the proposed recognition system, composed of two main steps, each comprising several computational models. In the first step, objects are automatically represented as covariance matrices, followed by a tree-based RF detector that operates on-line. We have shown in (4) that utilizing a bag of covariance matrices as an object descriptor improves the accuracy of object recognition while speeding up the learning process, so we extend this technique and present its hardware architecture. The on-line RF detector is designed using the Logarithmic Number System (LNS) (10). This RF-LNS allows the required word length to be reduced to 16 bits, and consequently a general-purpose microprocessor of the same word length can be used. For a compact architecture, the RF-LNS comprises a few computation modules, referred to as 'Tree Units', a 'Majority Vote Unit', and 'Forest Units'. The main contribution of our approach (in addition to its impact on the tradeoff between algorithmic accuracy and hardware implementation cost) is three-fold: (1) its direction towards arithmetic complexity reduction using a modified RF based on LNS (RF-LNS), (2) it has been designed to be easily integrated in a system-on-chip (SoC) that can perform both automatic feature selection and recognition, and (3) it allows for fair comparison with floating-point (FP) and fixed-point (FX) implementations. We tested and verified the model functionality using numerical simulation, and present results obtained using examples from the GRAZ02 dataset (11). First, in Section 2 we present related work and highlight general constraints in implementing hardware-based recognition systems. Section 3 shows the object descriptor we used and gives an overview of the RF algorithmic settings. In Section 4 we present the full architecture and design of our recognition system. We follow with an experimental evaluation and an estimation of the required precision in Section 5. A brief conclusion appears in Section 6.
• GPP vs. FPGA: A general-purpose processor's (GPP) hardware contains all the basic blocks needed to build any logic or mathematical function imaginable, but it is limited by the parallelism available in the program, i.e., by performance and power consumption. An FPGA provides flexibility to cope with evolving applications, but at the cost of large performance, area, power, and reconfiguration-time penalties.
3. Algorithmic Considerations
The proposed object recognition approach consists of two basic models: a model for the object descriptor based on covariance matrices (4; 15), and a classifier based on an on-line variant of RF implemented on an FPGA using LNS. First we introduce the algorithmic settings of each model.
F(x, y) = φ(I, x, y)    (1)

where the function φ can be any feature map (such as intensity, color, etc.). For a given region R ⊂ F, let {z_k}_{k=1···n} be the d-dimensional feature points inside R. We represent region R with the d × d covariance matrix C_R of the feature points.
C_R = (1/(n − 1)) Σ_{k=1}^{n} (z_k − µ)(z_k − µ)^T    (2)
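Eq. (2) is the usual sample covariance of the feature points inside the region; a minimal pure-Python sketch (the function name is mine):

```python
def region_covariance(points):
    """d x d covariance matrix C_R of d-dimensional feature points (Eq. 2):
    C_R = 1/(n-1) * sum_k (z_k - mu)(z_k - mu)^T, where mu is the mean
    feature vector of the region."""
    n = len(points)
    d = len(points[0])
    mu = [sum(p[i] for p in points) / n for i in range(d)]
    acc = [[0.0] * d for _ in range(d)]
    for p in points:
        diff = [p[i] - mu[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                acc[i][j] += diff[i] * diff[j]
    return [[acc[i][j] / (n - 1) for j in range(d)] for i in range(d)]
```

In practice the points z_k would stack per-pixel features (intensity, color, LBP responses, etc.) for every pixel of the region.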
Fig. 2. (A) Rectangles are examples of possible regions for histogram features. The stable appearance in rectangles A, B, and C makes them good candidates for a car classifier, while region D is not. (C) Top: points sampled to calculate the LBP around a point (x, y). Bottom: the use of the scale invariant feature transform (SIFT). (D) Any region can be represented by a covariance matrix. The size of the bag is proportional to the number of features used, while the size of the covariance matrix depends on the dimension of the features.
noting that features in the covariance matrix may be used in multiple image locations.
Color. Color is described by taking Ohta-space histogram values of pixels (I1 = (R + G + B)/3, I2 = R − B, I3 = (2G − R − B)/2). This color space is chosen because it is less sensitive to variations in illumination. The Ohta values for each pixel in an image are clustered using k-means, i.e., each pixel in image I is assigned to the nearest cluster center, and then the histogram frequencies are normalized.
Appearance. We have used histograms of Local Binary Patterns (LBPs) to represent each feature's appearance in some appearance space. Fig.2C depicts the points that must be sampled around a particular point (x, y) in order to calculate the LBP. In our implementation, each sample point lies at a distance of 2 pixels from (x, y). Instead of the traditional 3 × 3 rectangular neighborhood, we sample the neighborhood circularly with two different radii (1 and 3). The resulting operators are denoted by LBP8,1 and LBP8,1+8,3, where the subscripts denote the number of samples and the neighborhood radii.
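A simplified sketch of the LBP operator for 8 samples (the diagonal samples are taken at integer offsets rather than interpolated on the circle, so this is only an approximation of LBP8,r; the function name is mine):

```python
def lbp8(image, x, y, radius=1):
    """LBP_{8,r}: threshold 8 circularly sampled neighbours against the
    centre pixel and pack the comparison results into one byte.
    `image` is a 2D list indexed as image[row][col]."""
    # clockwise starting east; integer approximation of the circle
    offsets = [(radius, 0), (radius, radius), (0, radius), (-radius, radius),
               (-radius, 0), (-radius, -radius), (0, -radius), (radius, -radius)]
    center = image[y][x]
    code = 0
    for bit, (dx, dy) in enumerate(offsets):
        if image[y + dy][x + dx] >= center:
            code |= 1 << bit  # set bit when neighbour >= centre
    return code
```

A histogram of these codes over a region then serves as the appearance descriptor.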
A bag of covariance matrices. A bag of covariance matrices, which is a concatenation of the Ohta color space histogram and an appearance model based on the LBP and the Scale Invariant Feature Transform (SIFT) of different features of an image region, is presented in Fig.1E. We then estimate the bag-of-covariance-matrix likelihoods P(I_i | C, I_i) and the likelihood that each bag of covariance matrices is homogeneously labeled. We use this representation to automatically detect any target in images. We then apply the on-line RF learner to select object descriptors and to learn an object classifier.
positive (contains the object relevant to the class) and negative (does not contain the object) images I, and covariance features C_k. The decision generated by a random tree corresponds to a covariance feature selected by the learning algorithm. Each tree casts a unit vote for a single matrix, resulting in a classifier h(I, C_k).
Forest. Given a set of M decision trees, a forest is computed as an ensemble of these tree-generated base classifiers h(I, C_k), k = 1, . . . , n, using a majority vote.
Majority vote. If there are M decision trees, the majority voting method gives a correct decision if at least floor(M/2) + 1 decision trees give correct outputs. If each tree has probability p of making a correct decision, then the forest has the following probability P of making a correct decision:

P = Σ_{i=floor(M/2)+1}^{M} (M choose i) p^i (1 − p)^{M−i}    (3)
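The binomial sum above can be evaluated directly; a short sketch (the function name is mine):

```python
from math import comb, floor

def forest_accuracy(M, p):
    """Probability (Eq. 3) that a majority vote of M independent trees,
    each correct with probability p, yields a correct decision."""
    return sum(comb(M, i) * p**i * (1 - p)**(M - i)
               for i in range(floor(M / 2) + 1, M + 1))
```

For example, with M = 3 trees each correct with probability p = 0.8, the forest is correct with probability 0.896, illustrating how voting boosts accuracy whenever p > 0.5.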
4. Hardware Architecture
4.1 FPGA Architecture
All FPGAs consist of three major components: 1) logic blocks (LBs); 2) I/O blocks; and 3) programmable routing, as shown in Fig.3(A). A logic block (LB) is a functionally complete logic circuit, partitioned to LB size, mapped and routed, and placed in an interconnect framework to perform a desired operation. Field programmability is achieved through switches (transistors controlled by memory elements, or fuses), and each I/O block is programmed to act as an input or output, as required; i.e., N-input LUTs can implement any N-input Boolean function. The programmable routing is also configured to make the necessary connections between logic blocks, and from logic blocks to I/O blocks. The processing power of an FPGA is highly dependent on the processing capabilities of its LBs and the total number of LBs available in the array. Generally, FPGAs use logic blocks that contain one or more LUTs, typically with at least four inputs. A four-input LUT can implement any Boolean function of four logic inputs. Fig.3(B) shows the architecture of a simple LB containing one four-input LUT and one flip-flop for storage.
Fig. 3. (A) Granularity and interconnection structure of generic Xilinx FPGA. (B) An architec-
ture of a logic block with one, four-input LUT use for implementation of memory and shift
registers.
for the sign of the exponent, i.e., with a total of (G + F + 1) bits. If the radix is taken to be 2, then the smallest number that can be represented in the scheme is 2^(−N), where N = (2^G − 1) + (1 − 2^(−F)) = 2^G − 2^(−F). The ratio between two consecutive numbers is equal to r^(2^(−F)), and the corresponding precision e is roughly (ln r)·2^(−F). Typically, if G = 5, F = 30, and r = 2, we have a precision of 30 bits in radix 2. However, for the purpose of comparison with the precision of the FP representation, e will be assumed to be 2^(−23) (≈ 10^(−7)). Numbers closer to zero are represented with better precision in LNS than in FP systems. Moreover, LNS departs from FP in that the relative error of LNS is constant, and LNS can often achieve an equivalent signal-to-noise ratio with fewer bits of precision than FP architectures.
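The key LNS property, multiplication becoming addition of logarithms, can be illustrated with a quantized base-2 logarithm (this models only the quantization effect, not the chapter's hardware; the function names are mine):

```python
import math

def to_lns(x, F):
    """Quantize log2(x) to F fractional bits (x > 0; the sign bit of a
    full LNS word is handled separately)."""
    scale = 1 << F
    return round(math.log2(x) * scale) / scale

def lns_mul(a, b, F):
    """In LNS, multiplication is addition of the (quantized) logarithms;
    the result is read back by exponentiation."""
    return 2.0 ** (to_lns(a, F) + to_lns(b, F))
```

With F = 30 fractional bits the product is essentially exact, while a coarse F = 4 still keeps the relative error within a few percent, matching the e ≈ (ln r)·2^(−F) estimate above.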
5. Evaluation
The functionality of the proposed system was simulated, and the hardware was programmed. We now demonstrate the usefulness of this framework in the area of recognizing generic objects such as bikes, cars, and persons.
5.1 Dataset
We have used data derived from the GRAZ02¹ dataset (11), a collection of 640 × 480 24-bit color images. As can be seen in Fig.5, this dataset has three object classes (bikes, cars, persons) in addition to the background class (270 images). This database contains variability with respect to scale and clutter. Objects of interest are often occluded, and they are not dominant in the image. According to (16), the average ratio of object size to image size, counted in number of pixels, is 0.22 for bikes, 0.17 for people, and 0.09 for cars. This dataset is thus more complex to learn detectors from, but of more interest because it better reflects real-world complexity. Table 1 reports the number of images and objects in each class; 380 images are available for the background class.
Table 1. Number of images and objects in each class in the GRAZ02 dataset.
6. Performances
GRAZ02 images contain only one object category per image, so the recognition task can be seen as a binary classification problem: bikes vs. background (i.e., non-bikes), people vs. background, and cars vs. background. Generalization performance in these object recognition experiments was estimated by a statistical measure, the Area Under the ROC Curve (AUC). The AUC is a measure of classifier performance that is independent of the threshold, meaning it summarizes how the true positive and false positive rates
1 available at http://www.emt.tugraz.at/˜pinz/data/
Random Forest-LNS Architecture and Vision 159
Fig. 5. Examples from GRAZ02 dataset (11) for four different categories: A) cars and ground
truth, B) bikes and ground truth, C) persons and ground truth, and D) background.
change as the threshold gradually increases from 0.0 to 1.0; i.e., it does not summarize accuracy. An ideal perfect classifier has an AUC of 1.0 and a random classifier has an AUC of 0.5.
For example, a figure of 85% means that 85% of object images were correctly classified, but 15% of the background images were incorrectly classified (i.e., thought to be foreground). For RF-LNS to maintain acceptable performance, 16 bits of precision are sufficient for all GRAZ02 categories, even when only 10% of the training examples are used. Such low precision makes RF-LNS competitive with FP arithmetic for our generic object recognition application.
Table 2. Mean AUC performance of RF-LNS on the Bikes vs. Background dataset, by amount
of training data. Performance of RF-LNS is reported for different Depths (D).
Table 3. Mean AUC performance of RF-LNS on the Cars vs. Background dataset, by amount
of training data. Performance of RF-LNS is reported for different Depths (D).
Table 4. Mean AUC performance of RF-LNS on the Persons vs. Background dataset, by
amount of training data. Performance of RF-LNS is reported for different Depths (D).
also achieved high clock-rate processing. For the 1-bit RF-LNS, the power dissipation is small, and the area usage on the FPGA is less than 2 percent.
Table 5. Slices used for different tree units for each dataset.
8. References
[1] L. Breiman, “Random Forests,” Machine Learning, 45(1):5-32, 2001.
[2] F. Moosmann, B. Triggs, and F. Jurie. “Fast discriminative visual codebooks using randomized clustering forests,” In Proc. NIPS, 2006.
[3] J. Winn and A. Criminisi. “Object class recognition at a glance,” In Proc. CVPR, 2006.
[4] H. Elgawi Osman, “A binary Classification and Online Vision,” In Proc. IJCNN, pp.1142-1148, 2009.
[5] A. Bosch, A. Zisserman, X. Munoz, “Image Classification Using Random Forests and Ferns,” In Proc. ICCV, pp.1-8, 2007.
[6] J. Shotton, M. Johnson, R. Cipolla, “Semantic Texton Forests for Image Categorization and Segmentation,” In Proc. CVPR, pp.1-8, 2008.
[7] F. Schroff, A. Criminisi, and A. Zisserman, “Object Class Segmentation using Random Forests,” In Proc. BMVC, 2008.
[8] Y. Amit and D. Geman. “Shape quantization and recognition with randomized trees,” Neural Computation, 9(7):1545-1588, 1997.
[9] V. Lepetit, P. Lagger, and P. Fua. “Randomized trees for real-time keypoint recognition,”
In Proc. CVPR, 2005.
[10] H. Elgawi Osman, “Hardware-based solutions utilizing Random Forests for Object
Recognition,” In Proc. ICONIP, Part II, LNCS 5507, pp. 760-767, 2008.
[11] A. Opelt, M. Fussenegger, A. Pinz and P. Auer. “Generic object recognition with boosting,” TPAMI, 28(3):416-431, 2006.
[12] R. Genov and G. Cauwenberghs. “Kerneltron: Support Vector Machine in Silicon,” IEEE
Transactions on Neural Networks, 14(5):1426-1434, 2003.
[13] R. Genov, S. Chakrabartty and G. Cauwenberghs. “Silicon Support Vector Machine with
On-Line Learning,” IJPRAI, 17(3):385-404, 2003.
[14] M. Muselli and D. Liberati. “Binary Rule Generation via Hamming Clustering,” IEEE
Transactions on Knowledge and Data Engineering, 14(6):1258-1268, 2002.
[15] O. Tuzel, F. Porikli, and P. Meer. “Region covariance: A fast descriptor for detection and
classification,” In Proc. ECCV, pp.589-600, 2006.
[16] A. Opelt and A. Pinz. “Object Localization with boosting and weak supervision for generic object recognition,” In H. Kalviainen et al. (Eds.), SCIA 2005, LNCS 3450, pp.862-871, 2005.
An Intelligent System for Container Image Recognition using
ART2-based Self-Organizing Supervised Learning Algorithm 163
11
Korea
1. Introduction
Recently, the quantity of goods transported by sea has increased steadily, since the cost of transportation by sea is lower than that of other transportation methods. Various automation methods are used for the speedy and accurate processing of transport containers in the harbor. The automation systems for transport-container flow processing are classified into two types: the barcode processing system and the automatic recognition system of container identifiers based on image processing. These days, the image-based identifier recognition system is the more widely used in harbors.
Identifiers of shipping containers are assigned in accordance with the ISO standard and consist of 4 code groups: shipping company codes, container serial codes, check digit codes, and container type codes (ISO-6346, 1995; Kim, 2003). Only the first 11 identifier characters are prescribed in the ISO standard, and shipping containers can be discriminated by automatically recognizing these first 11 characters. However, other features, such as the foreground and background colors, the font type, and the size of container identifiers, vary from one container to another, since the ISO standard doesn't prescribe features other than the code type (Kim, 2004; Nam et al., 2001). Since identifiers are printed on the surface of containers, the shapes of identifiers are often impaired by environmental factors during transportation by sea. Damage to a container surface may lead to distortion of the shapes of identifier characters in a container image. So the variations in the features of container identifiers, together with noise, make the extraction and recognition of identifiers using simple information like color values quite difficult (Kim, 2004).
Generally, container identifiers have the additional feature that the color of the characters is black or white. Considering this feature, all areas of a container image except those with black or white colors are regarded as noise, and areas of identifiers and noise are discriminated by using a fuzzy-based noise detection method. Noise areas are replaced with the mean pixel value of the whole image, and areas of identifiers are extracted and binarized by applying edge detection with the Sobel masking operation and then vertical and horizontal block extraction to the converted image. In the extracted areas, the color of the identifiers is converted to black and that of the background to white, and individual identifiers are extracted by using an 8-directional contour tracking algorithm. An ART2-based self-organizing supervised learning algorithm for identifier recognition is proposed in this chapter, which creates the nodes of the hidden layer by applying ART2 between the input layer and the hidden layer, and improves learning performance by applying generalized delta learning and the Delta-bar-Delta algorithm (Vogl et al., 1998). Experiments using many images of shipping containers show that the presented identifier extraction method and the ART2-based supervised learning algorithm are improved compared with previously proposed methods.
Fig. 1. Membership function u(G) for gray-level pixels: degree of membership (0.0–1.0) over intensity values 50, 110, and 170 (regions C, D, E)
if G ≤ 50 or G ≥ 170 then u(G) = 0
else if 50 < G ≤ 110 then u(G) = (G − 50) / (110 − 50)    (1)
else if 110 < G < 170 then u(G) = (170 − G) / (170 − 110)
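Eq. (1) is a triangular membership function and can be sketched directly (the function name is mine):

```python
def membership(G):
    """Triangular membership u(G) of a gray-level pixel (Eq. 1):
    0 outside [50, 170], rising linearly on (50, 110], peaking at 110,
    and falling linearly on (110, 170). Pixels with low membership are
    treated as black/white identifier candidates rather than noise."""
    if G <= 50 or G >= 170:
        return 0.0
    if G <= 110:
        return (G - 50) / (110 - 50)
    return (170 - G) / (170 - 110)
```

Intensities near the mid-gray value 110 thus get the highest "noise" membership, while near-black and near-white pixels get zero.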
To observe the effectiveness of the fuzzy-based noise detection, the results of edge detection by Sobel masking are compared between the original image and the noise-removed image produced by the proposed method. Fig. 2 is the original container image, and Fig. 3 is the output image generated by applying only Sobel masking to a grayscale version of Fig. 2. Fig. 4 shows the results of edge detection obtained by applying the fuzzy-based noise removal and Sobel masking to Fig. 2. First, the fuzzy-based noise detection method is applied to a grayscale version of the original image, and pixels detected as noise are replaced with the mean gray value. Next, the edges of container identifiers are detected by applying Sobel masking to the noise-removed image. As shown in Fig. 4, noise removal by the proposed fuzzy method produces more effective results for the extraction of identifier areas.
Fig. 5. Identifier area with a general color and successful results of edge extraction
Fig. 6. Identifier area with white color and failed results of edge extraction
Fig. 7. Reversed binarized area of Fig. 6 and successful result of edge detection
The procedure for extracting individual identifiers using the 8-directional contour tracking
method is as follows: P_i^r and P_i^c are the pixels of the horizontal and vertical directions
currently being scanned in the identifier area, respectively, and P_{i+1}^r and P_{i+1}^c are the
pixels of the two directions to be scanned next in the identifier area. P_s^r and P_s^c are the
pixels of the horizontal and vertical directions in the first mask of the 8-directional contour
tracking.
Step 1. Initialize with Eq. (2) in order to apply the 8-neighbourhood contour tracking
        algorithm to the identifier area, and find the pixel by applying the tracking mask as
        shown in Fig. 8.

            P_{i+1}^r = P_i^r ,  P_{i+1}^c = P_i^c                                  (2)

Step 2. When a black pixel is found after applying the tracking mask to the current pixel,
        calculate the values of P_i^r and P_i^c as shown in Eq. (3).

            P_i^r = Σ_{i=0}^{7} P_{i+1}^r ,  P_i^c = Σ_{i=0}^{7} P_{i+1}^c          (3)

Step 3. For the 8 tracking masks, apply Eq. (4) to decide the next tracking mask.
Step 4. Stop if P_i^r and P_i^c return to P_s^r and P_s^c, or go back to Step 1 and repeat:
        if |P_i^r − P_s^r| <= 1 and |P_i^c − P_s^c| <= 1 then stop, else go back to Step 1.
168 New Advances in Machine Learning
Fig. 5 and Fig. 7 show extraction results of individual identifiers obtained by using the
8-directional contour tracking method.
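As a rough illustration of the idea, the sketch below traces the boundary of a binary region by scanning the eight neighbours clockwise. It is only loosely based on the masks of Fig. 8: the neighbour ordering, the backtracking rule and the simplified stopping criterion (returning to the start pixel) are assumptions, not the chapter's exact algorithm.

```python
# 8 neighbour offsets in clockwise order (up, up-right, ..., up-left)
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def trace_contour(img):
    """Return the closed contour (list of (row, col)) of the first black
    region found in a binary image, using 8-neighbour tracing."""
    rows, cols = len(img), len(img[0])
    # find the starting black pixel (row-by-row scan); assumes one exists
    start = next((r, c) for r in range(rows) for c in range(cols) if img[r][c])
    contour, cur, d = [start], start, 0
    while True:
        # scan the 8 neighbours clockwise, starting from direction d
        for i in range(8):
            nd = (d + i) % 8
            nr, nc = cur[0] + DIRS[nd][0], cur[1] + DIRS[nd][1]
            if 0 <= nr < rows and 0 <= nc < cols and img[nr][nc]:
                cur, d = (nr, nc), (nd + 6) % 8  # restart scan from the backtrack side
                break
        else:
            break  # isolated pixel: no black neighbour
        if cur == start:
            break  # simplified stopping rule: back at the start pixel
        contour.append(cur)
    return contour

img = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 1, 1, 0],
       [0, 0, 0, 0]]
print(trace_contour(img))  # the four boundary pixels of the 2x2 block
```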
Fig. 8. The eight 3x3 tracking masks of the 8-directional contour tracking, with their
direction numbering: EE, SE, SS, SW, WW, NW, NN, NE
feature drastically differs from the cluster center but its impact is minimized due to
averaging all differences. When the traditional ART2 algorithm was applied to the
recognition of container identifiers, it was observed that the recognition rate declined due to
the classification of such different input patterns to the same cluster. Therefore, we propose
a novel ART2-based hybrid network architecture where the middle layer neurons have RBF
(Radial Basis Function) properties and the output layer neurons have a sigmoid function
property.
An ART2-based self-organizing supervised learning algorithm for the recognition of
container identifiers is proposed in this chapter. First, a new learning structure is applied
between the input and the middle layers: the ART2 algorithm is applied between the two
layers, the node with the maximum output value is selected as the winner node, and the
selected node is transmitted to the middle layer. Next, the generalized Delta learning
algorithm and the Delta-bar-Delta algorithm are applied to the learning between the middle
and the output layers, improving learning performance. The proposed learning algorithm is
summarized as follows:
1. The connection structure between the input and the middle layers is like ART2
algorithm and the output layer of ART2 becomes the middle layer of the proposed
learning algorithm.
2. Nodes of the middle layer represent individual classes. Therefore, while the proposed
   algorithm has a fully-connected structure on the whole, it uses the winner-node method,
   which compares target vectors and output vectors and back-propagates only the
   representative class and its connection weights.
3. The proposed algorithm performs the supervised learning by applying generalized
Delta learning as the learning structure between the middle and the output layers.
4. The proposed algorithm improves learning performance by applying the Delta-bar-Delta
   algorithm to generalized Delta learning for the dynamic adjustment of the learning rate.
   Defining the case in which the difference between the target vector and the output vector
   is less than 0.1 as an accuracy, and the opposite case as an inaccuracy, the Delta-bar-Delta
   algorithm is applied restrictively, only when the number of accuracies is greater than or
   equal to the number of inaccuracies over all patterns. This prevents learning from stalling
   or oscillating at an almost constant error level due to premature convergence incurred by
   competition in the learning process.
The detailed description of the ART2-based self-organizing supervised learning algorithm is
shown in Fig. 9.
3. Performance evaluation
The proposed algorithm is implemented using Microsoft Visual C++ 6.0 on an IBM-
compatible Pentium-IV PC for performance evaluation. 79 container images with a size of
640x480 are used in the experiments for the extraction and recognition of container identifiers.
In the extraction of identifier areas, the previously proposed method fails for images
containing vertical noise caused by external light and the rugged surface of containers. On
the other hand, the proposed extraction method detects and removes such noise using a
fuzzy method, improving the success rate of extraction compared with the previously
proposed method. The comparison of the success rate of identifier area extraction between
the method proposed in this chapter and the previously proposed method is shown in
Table 2.
For the experiment of identifier recognition, applying the 8-directional contour tracking
method to the 72 identifier areas extracted by the proposed extraction algorithm, 284
alphabetic characters and 500 numeric characters are extracted. The recognition
experiments are performed with the FCM-based RBF network and the proposed ART2-based
self-organizing supervised learning algorithm using the extracted identifier characters, and
the recognition performance is compared in Table 3.

    O_j = (1/N) Σ_{i=0}^{N−1} x_i w_ji ,    δ_k = (T_k − O_k) O_k (1 − O_k)
In the experiment of identifier recognition, the learning rate and the momentum are set to
0.4 and 0.3, respectively, for the two recognition algorithms. For the ART2 algorithm
generating the nodes of the middle layer in the proposed algorithm, the vigilance variables
of the two character types are set to 0.4.
Comparing the number of middle-layer nodes between the two algorithms, the proposed
algorithm creates more nodes than the FCM-based RBF network, but the comparison of the
number of epochs shows that the proposed algorithm requires fewer learning iterations than
the FCM-based RBF network. That is, the proposed algorithm improves learning
performance. Comparing the success rates of recognition also shows that the proposed
algorithm improves recognition performance compared with the FCM-based RBF network.
Recognition failures in the proposed algorithm are caused by damaged shapes of individual
identifiers in the original images and by information loss of identifiers in the binarization
process.
                              FCM-based RBF network        ART2-based self-organizing
                                                           supervised learning algorithm
                              # of Epoch  # of success     # of Epoch  # of success
                                          of recognition               of recognition
Alphabetic Characters (284)      236      240 (84.5%)         221      280 (98.5%)
Numeric Characters (500)         161      422 (84.4%)         151      487 (97.4%)
Table 3. Evaluation of recognition performance
4. Conclusion
This chapter proposes an automatic recognition system of shipping container identifiers
using fuzzy-based noise removal method and ART2-based self-organizing supervised
learning algorithm. In the proposed method, after detecting and removing noises from an
original image by using a fuzzy method, areas of identifiers are extracted. In detail, the
performance of identifier area extraction is improved by removing noises incurring errors
using a fuzzy method based on the feature that the color of container identifiers is white or
black on the whole. Individual identifiers are then extracted by applying the 8-directional
contour tracking method to the extracted identifier areas. Experiments using 79 container
images show that 72 identifier areas and 784 individual identifiers are extracted successfully
and that 767 of the extracted identifiers are recognized by the proposed recognition
algorithm. Recognition failures in the proposed algorithm are caused by damaged shapes of
individual identifiers in the original images and by information loss of identifiers in the
binarization process.
12

Data mining with skewed data

Jörn Mehnen
Decision Engineering Centre, Cranfield University, Cranfield, Bedfordshire MK43 0AL,
United Kingdom
In this chapter, we explore difficulties one often encounters when applying machine learning
techniques to real-world data, which frequently show skewness properties. A typical example
from industry where skewed data is an intrinsic problem is fraud detection in finance data. In
the following we provide examples, where appropriate, to facilitate the understanding of data
mining of skewed data. The topics explored include but are not limited to: data preparation,
data cleansing, missing values, characteristics construction, variable selection, data skewness,
objective functions, bottom line expected prediction, limited resource situation, parametric
optimisation, model robustness and model stability.
1. Introduction
In many contexts like in a new e-commerce website, fraud experts start investigation proce-
dures only after a user makes a claim. Rather than working reactively, it would be better for
the fraud expert to act proactively before a fraud takes place. In this e-commerce example, we
are interested in classifying sellers into legal customers or fraudsters. If a seller is involved in
a fraudulent transaction, his/her license to sell can be revoked by the e-business. Such a
decision requires a degree of certainty, which comes with experience. In general, it is only
after a fraud detection expert has dealt with enough complaints and enough data that he/she
acquires a global understanding of the fraud problem. Quite often, he/she is exposed to a huge
number of cases in a short period of time. This is when automatic, commonly computer-based,
procedures can step in to try to reproduce expert procedures, thus giving experts more time
to deal with harder cases. Hence, one can learn from fraud experts and build a model for
fraud. Such a model requires fraud indicators that are commonly present in fraudulent behav-
ior. One of the difficulties of fraud detection is that fraudsters try to conceal themselves under
a “normal” behavior. Moreover, fraudsters rapidly change their modus operandi once it is
discovered. Some fraud indicators are illegal in themselves and justify a more drastic measure
against the fraudster. However, a single observed indicator is often not strong enough to be
considered proof and needs to be evaluated as one variable among others. All variables taken
together can indicate a high probability of fraud. In the literature, these variables often appear
under the name of characteristics or features. The design of the characteristics to be used in a
model is called characteristics extraction or feature extraction.
2. Data preparation
2.1 Characteristics extraction
One of the most important tasks in data preparation is the conception of characteristics. Un-
fortunately, this depends very much on the application (see also the discussions in Sections 4
and 5). For fraud modelling, for instance, one starts from fraud expert experience, determines
significant characteristics as fraud indicators, and evaluates them. In this evaluation, one is
interested in measuring how well these characteristics:
• cover (are present in) the true fraud cases;
• discriminate fraud from non-fraud behavior.
In order to cover as many fraud cases as possible, one may verify how many of them are
covered by the characteristics set. The discrimination power of any of these characteristics
can be evaluated by their odds ratio. If the probabilities of the event (new characteristic) in
each of the two compared classes (fraud and non-fraud in our case) are p_f (first class) and p_n
(second class), then the odds ratio is:

    OR = [p_f / (1 − p_f)] / [p_n / (1 − p_n)] = [p_f (1 − p_n)] / [p_n (1 − p_f)].

An odds ratio equal to 1 describes the characteristic as equally probable in both classes
(fraud and non-fraud). The more this ratio is greater/less than 1, the more likely this charac-
teristic is in the first/second class than in the other one.
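In code, the odds ratio above is a one-liner; the probabilities in the example call are made-up values for illustration:

```python
def odds_ratio(p_f, p_n):
    """Odds ratio of a characteristic with probability p_f in the fraud
    class and p_n in the non-fraud class."""
    return (p_f * (1 - p_n)) / (p_n * (1 - p_f))

# a characteristic seen in 30% of frauds but only 5% of legitimate cases:
print(odds_ratio(0.30, 0.05))  # ≈ 8.14, strongly indicative of the fraud class
print(odds_ratio(0.10, 0.10))  # 1.0, no discrimination power
```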
and others. It can also be useful to check the univariate distribution of each variable including
the percentage of outliers, missing and miscellaneous values.
Having identified the characteristics that contain errors, the next step is to somehow fix the
inconsistencies or minimise their impact in the final model. Here we list, in the form of ques-
tions, some good practices in data cleansing used by the industry that can sometimes improve
model performance, increase generalisation power and finally, but no less important, make
models less vulnerable to fraud and faults.
1. Is it possible to fix the errors by running some codes on the dataset? Sometimes wrong
values have a one-to-one mapping to the correct values. Therefore, the best strategy is
to make the change in the development dataset and to carry on with the development.
It is important that these errors are fixed for the population the model will be applied to
as well. This is because both developing and applying populations must be consistent,
otherwise fixing the inconsistency would worsen the model performance rather than
improving it;
2. Is a small number of attributes1 (less than 5%) impacting only few rows (less than 5%)?
In this case, one can do a bivariate analysis to determine if it is possible to separate these
values into a default (or fault) group. Another option is to drop the rows. However, this
tactic might turn out to be risky (see section about missing values);
3. Is the information value of the problematic attribute(s) greater than for the other at-
tributes combined? Consider dropping this characteristic and demand fixing;
4. Is it possible to allow outliers? Simply dropping them might be valid if they are few or
   invalid. Changing their values to the appropriate boundary could also be valid. For
   example, if an acceptable range for yearly income is [1,000; 100,000] MU2 and an
   applicant has a yearly income of 200,000 MU, then it should be changed to 100,000 MU.
   This approach is often referred to as truncated or censored modelling Schneider (1986).
5. Finally, in an industry environment, when an MIS is available, one can check whether the
   acceptance rate or the number of rows is similar to the reports. It is very common for
   datasets to be corrupted after transferring them from a mainframe to Unix or Windows
   machines.
3. Data skewness
A dataset for modelling is perfectly balanced when the percentage of occurrence of each class
is 100/n, where n is the number of classes. If one or more classes differ significantly from the
others, this dataset is called skewed or unbalanced. Dealing with skewed data can be very
tricky. In the following sections we explore, based on our experiments and literature reviews,
some problems that can appear when dealing with skewed data. Among other things, the
following sections will explain the need for stratified sampling, how to handle missing values
carefully and how to define an objective function that takes the different costs for each class
into account.
Investigating even further, by analysing the fraud rates by ranges as shown in Table 2, one
can see that the characteristic being analysed really helps to predict fraud; on top of this,
missing values seem to be the most powerful attribute of this characteristic.
When developing models with balanced data, in most cases one can argue that it is good prac-
tice to avoid giving prediction to missing values (as a separate attribute or dummy), especially
if this attribute ends up dominating the model. However, when it comes to unbalanced data,
especially fraud data, some specific value may have been intentionally used by the fraudster
in order to bypass the system’s protection. In this case, one possible explanation could be a
system failure, where international transactions are not being correctly currency-converted
when passed to the fraud prevention system. This loophole may have been found by some
fraudster and exploited. Of course, this error would have passed unnoticed had one not paid
attention to missing or common values in the dataset.
4. Derived characteristics
New or derived characteristics construction is one of, if not the, most important part of mod-
elling. Some important phenomena mapped in nature are easily explained using derived
variables. For example, in elementary physics speed is a derived variable of space over time.
In data mining, it is common to transform date of birth into age or, e.g., year of study into
primary, secondary, degree, master, or doctorate. Myriad ways exist to generate derived char-
acteristics. In the following we give three typical examples:
Data mining with skewed data 177
5. Categorisation (grouping)
Categorisation (discretising, binning or grouping) is any process that can be applied to a char-
acteristic in order to turn it into categorical values Witten & Franku (2005). For example, let
us suppose that the variable age ranges from 0 to 99 and all values within this interval are
possible. A valid categorisation in this case could be:
1. category 1: if age is between 1 and 17;
2. category 2: if age is between 18 and 30;
3. category 3: if age is between 31 and 50;
4. category 4: if age is between 51 and 99.
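The four age bands above can be implemented, for example, with a binary search over the upper bound of each band (the band boundaries are the ones from the example):

```python
import bisect

def categorise(age, upper_bounds=(17, 30, 50, 99)):
    """Map an age to its category number (1-4), given the upper bound of
    each band; mirrors the four bands listed in the text."""
    return bisect.bisect_left(upper_bounds, age) + 1

print([categorise(a) for a in (5, 18, 30, 45, 75)])  # [1, 2, 2, 3, 4]
```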
Among others, there are three main reasons for categorising a characteristic: firstly, to increase
generalisation power; secondly, to be able to apply certain types of methods, such as, e.g.
a Generalised Linear Model4 (GLM) Witten & Franku (2005), or a logistic regression using
Weight of Evidence5 (WoE) formulations Agterberg et al. (1993); thirdly, to add stability to the
model by getting rid of small variations causing noise. Categorisation methods include:
1. Equal width: corresponds to breaking a characteristic into groups of equal width. In the
   age example we can easily break age into 5 groups of 20 years each: 0-19, 20-39, 40-59,
   60-79, 80-99.
2. Percentile: this method corresponds to breaking the characteristic into groups of equal
volume, or percentage, of occurrences. Note that in this case groups will have different
widths. In some cases breaking a characteristic into many groups may not be possible
because occurrences are concentrated. A possible algorithm in pseudo code to create
percentile groups is:
Nc <- Number of categories to be created
Nr <- Number of rows
Size <- Nr/Nc
Band_start [0] <- Minimum (Value (characteristic[0..Nr]))
//Dataset needs to be sorted by the characteristic to be grouped
For j = 1 .. Nc {
For i = 1 .. Size {
3 Arrears is a legal term for a type of debt which is overdue after missing an expected payment.
4 One example of this formulation is logistic regression using dummy input variables
5 This formulation replaces the grouped attribute of the original characteristic with its weight of evidence
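The pseudocode above is cut off at the page break; a minimal runnable sketch of the same idea, under the assumption that each band start is taken every `size` rows of the sorted data, is:

```python
def percentile_bands(values, n_categories):
    """Break a characteristic into n_categories groups of (roughly) equal
    volume; returns the start value of each band, as in the pseudocode."""
    ordered = sorted(values)  # dataset sorted by the characteristic to be grouped
    size = len(ordered) // n_categories
    # each band starts where the previous band's quota of rows ends
    return [ordered[j * size] for j in range(n_categories)]

incomes = [12, 15, 18, 22, 30, 35, 41, 57, 90, 120]
print(percentile_bands(incomes, 5))  # [12, 18, 30, 41, 90]
```

Note that, as the text warns, concentrated values can make some bands collapse; a production implementation would deduplicate the band starts.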
The result of the first step, eliminating intervals without monotonic odds, can be seen in
Table 4. Here bands 50-59 (odds of 3.00), 60-69 (odds of 4.00) and 70-79 (odds of 3.75) have
been merged, as shown in boldface. One may notice that merging only bands 50-59 and 60-69
would result in a group with odds of 3.28, hence the need to also merge with band 70-79 to
yield monotonic odds.
Using, for example, 0.20 as the minimum allowed odds difference, Table 5 presents the result
of step two, where bands 30-39 (odds of 5.30) and 40-49 (odds of 5.18) have been merged.
This is done to increase model stability. One may notice that odds retrieved from the devel-
opment become expected odds in a future application of the model. Therefore, these values
will vary around the expectation. By grouping these two close odds, one tries to avoid a
reversal in odds happening by pure random variation.
For the final step, we assume 2% to be the minimum allowed percentage of the population
in each group. This forces band 0-9 (1.83% of total) to be merged with one of its neighbours;
in this particular case, the only option is to merge with band 10-19. Table 6 shows the final
result of the bivariate grouping process after all steps are finished.
6. Sampling
As computers become more and more powerful, sampling, to reduce the sample size for
model development, seems to be losing attention and importance. However, when dealing
with skewed data, sampling methods remain extremely important Chawla et al. (2004); Elkan
(2001). Here we present two reasons to support this argument.
First, to help ensure that no over-fitting happens on the development data, a sampling
method can be used to break the original dataset into training and holdout samples. Fur-
thermore, stratified sampling can help guarantee that a desirable factor has a similar per-
centage in both training and holdout samples. In our work Gadi et al. (2008b), for example,
we executed a random sampling process to select multiple splits of 70% and 30% as training
and holdout samples. However, after evaluating the output datasets we decided to redo the
sampling process using stratified sampling by the fraud/legitimate flag.
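A stratified 70/30 split as described above can be sketched with the standard library alone; the `fraud` flag and the row layout are hypothetical:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, train_frac=0.7, seed=42):
    """Split rows into train/holdout, keeping the proportion of each value
    of `key` (e.g. the fraud/legitimate flag) similar in both samples."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)
    rng = random.Random(seed)
    train, holdout = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = int(round(len(members) * train_frac))
        train += members[:cut]
        holdout += members[cut:]
    return train, holdout

data = [{"amount": i, "fraud": i % 10 == 0} for i in range(1000)]  # 10% fraud
train, hold = stratified_split(data, "fraud")
print(len(train), sum(r["fraud"] for r in train))  # 700 70: 10% fraud preserved
```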
Second, to improve the model prediction, one may apply an over- or under-sampling process
to take the different costs between classes into account. The cost-sensitive procedure Elkan
(2001) replicates (oversamples) the minority (fraud) class according to its cost in order to bal-
ance the different costs for false positives and false negatives. In Gadi et al. (2008a) we achieved
interesting results by applying a cost-sensitive procedure.
Two advantages of a good implementation of a cost-sensitive procedure are: first, it can enable
moving the cut-off to the optimal cut-off; for example, in fraud detection, if the costs so
indicate, a cost-sensitive procedure will consider a transaction with as little as 8% probability
of fraud as a potential fraud to be investigated; second, if the cost-sensitive procedure considers
the cost per transaction, such an algorithm may be able to optimise decisions by considering
the product [probability of event] x [value at risk] and deciding to investigate those transactions
for which this product is larger.
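A minimal sketch of the oversampling variant of such a cost-sensitive procedure follows; the cost ratio of 19 is a made-up value chosen only because it exactly rebalances a 5% minority class:

```python
def cost_oversample(rows, is_minority, cost_ratio):
    """Replicate each minority-class row `cost_ratio` times, so that a
    cost-insensitive learner implicitly weights errors by their cost."""
    out = []
    for row in rows:
        out.extend([row] * (cost_ratio if is_minority(row) else 1))
    return out

rows = [("fraud", 1)] * 5 + [("legit", 0)] * 95  # 5% fraud
balanced = cost_oversample(rows, lambda r: r[0] == "fraud", 19)
frauds = sum(1 for r in balanced if r[0] == "fraud")
print(frauds, len(balanced))  # 95 190: both classes now carry equal weight
```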
7. Characteristics selection
Characteristics selection, also known as feature selection, variable selection, feature reduction,
attribute selection or variable subset selection, is commonly used in machine learning and sta-
tistical techniques to select a subset of relevant characteristics for the building of more robust
models Witten & Franku (2005).
Decision trees do characteristics selection as part of their training process when selecting only
the most powerful characteristics in each subpopulation, leaving out all weak or highly cor-
related characteristics. Bayesian nets link different characteristics by cause and effect rules,
leaving out non-correlated characteristics Charniak (1991). Logistic Regression does not use
any intrinsic strategy for removing weak characteristics; however, in most implementations
methods such as forward, backward and stepwise selection are always available. In our tests,
we have applied an approach common in the banking industry, which is to consider only
those characteristics with an information value greater than a given percentage threshold.
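The information value measure itself is not defined in the text; the sketch below uses the standard weight-of-evidence formulation common in the banking industry, with made-up counts:

```python
import math

def information_value(counts):
    """counts: list of (non_fraud_count, fraud_count) per attribute of a
    grouped characteristic. IV = sum over attributes of
    (%good - %bad) * ln(%good / %bad), where the WoE term is the log."""
    total_good = sum(g for g, b in counts)
    total_bad = sum(b for g, b in counts)
    iv = 0.0
    for g, b in counts:
        pct_good, pct_bad = g / total_good, b / total_bad
        iv += (pct_good - pct_bad) * math.log(pct_good / pct_bad)
    return iv

# three attributes of a grouped characteristic: (legitimate, fraud) counts
iv = information_value([(400, 10), (300, 30), (100, 60)])
print(round(iv, 3))  # well above common "strong predictor" thresholds
```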
8. Objective functions
When defining an objective function, in order to compare different models, we found in our
experiments that two facts are especially important:
1. We have noticed that academia and industry speak in different languages. In the aca-
demic world, measures such as Kolmogorov Smirnov (KS) Chakravarti et al. (1967)
or Receiver Operating Characteristic (ROC curve) Green & Swets (1966) are the most
common; in industry, on the other hand, rates are more commonly used. In the fraud
detection area, for example, it is common to find measures such as hit rate (confidence)
and detection rate (cover). Hit rate and detection rate are two different dimensions, and
they are not canonical. Optimising a problem with an objective having two outcomes
is not a simple task Trautmann & Mehnen (2009). In our work in fraud detection we
avoided this two-objective function by calculating one single outcome value: the total
cost of fraud;
2. In an unbalanced environment it is common to find that not only the proportion be-
tween classes differs, but also the cost between classes. For example, in the fraud de-
tection environment, the loss by fraud when a transaction is fraudulent is much bigger
than the cost to call a customer to confirm whether he/she did or did not do the trans-
action.
Many factors can be used to assess methods, including the chosen optimisation criteria,
scalability, time for classification and time spent in training, and sometimes more abstract
criteria such as the time needed to understand how the method works. Most of the time,
when a method is published, or when an implementation is done, the method depends on
parameter choices that may influence the final results significantly. Default parameters are,
in general, a good start. However, most of the time they are far from producing the best
model. This agrees with our experience with many methods in many different areas of
computer science, and it is particularly true for classification problems with skewed data.
Quite often we see comparisons against known methods where the comparison is done by
applying a special parameter variation strategy (sometimes a parameter optimisation) to the
chosen method while not fairly conducting the same procedure for the other methods. In
general, for the other methods, default parameters or a parameter set published in some
previous work is used. Therefore, it is no surprise that the newly proposed method wins. At
first glance, the usage of the default parameter set may seem fair, and this bias is often
reproduced in publications. However, default sets can be biased towards the original training
set and thus not be fair.
Parameter optimisation takes time and is rarely conducted. For a fair comparison, we argue
that one has to fine-tune the parameters for all compared methods. This can be done, for
instance, via an exhaustive search of the parameter space if this search is affordable, or by
some kind of sampling as in a Genetic Algorithm (GA)7 (see Figure 1). Notice that the final
parameter set cannot be claimed to be optimal in this case.
Unfortunately, this sampling procedure is not as easy as one may suppose. There is no single
best universal optimisation algorithm for all problems (No Free Lunch theorem, Wolpert &
Macready (1997)). Even the genetic algorithm scheme shown in Figure 1 might require
parameter adjustment. From our experience, a simple mistake in the probability distribution
computation may drive the search to completely different and/or misleading results. A good
genetic algorithm requires expertise, knowledge about the problem that is to be optimised by
the GA, an intelligent design, and resources; the more, the better. These considerations also
imply that comparisons involving methods with suboptimal parameter sets depend very
much on how well each parameter-space sampling was conducted.
7 One could also use some kind of Monte Carlo, Grid sampling or Multiresolution alternatives.
Fig. 1. Genetic Algorithm for parameter optimisation. We start with an initial pool of e.g.
50 random individuals having a certain fitness, followed by e.g. 20 Genetic Algorithm (GA)
generations. Each GA generation combines two randomly selected candidates from among
the best e.g. 15 of the previous generation. This combination performs crossover, mutation,
a random change or no action for each parameter independently. As the generations go by,
the chance of no action increases. In the end, one may perform a local search around the
optimum found by the GA optimisation. Retrieved from Gadi et al. (2008b).
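The scheme in the caption can be sketched as follows. The quadratic `fitness` is a hypothetical stand-in for the total cost of fraud on a development split, the branch probabilities are invented, and only the pool/children sizes follow the example numbers of the caption:

```python
import random

rng = random.Random(0)

def fitness(params):
    # hypothetical cost surface to be minimised (optimum at 3.0 per parameter)
    return sum((p - 3.0) ** 2 for p in params)

def evolve(n_params=4, pool=50, keep=15, generations=20, children=15):
    population = [[rng.uniform(-10, 10) for _ in range(n_params)]
                  for _ in range(pool)]
    for g in range(generations):
        population.sort(key=fitness)
        best = population[:keep]
        no_action = g / generations  # chance of no action grows over time
        for _ in range(children):
            a, b = rng.sample(best, 2)
            child = []
            for i in range(n_params):
                r = rng.random()
                if r < no_action:
                    child.append(a[i])                      # no action
                elif r < no_action + 0.4:
                    child.append((a[i] + b[i]) / 2)         # crossover
                elif r < no_action + 0.5:
                    child.append(a[i] + rng.gauss(0, 0.5))  # mutation
                else:
                    child.append(rng.uniform(-10, 10))      # random change
            population.append(child)
    return min(population, key=fitness)

best = evolve()
print([round(p, 1) for p in best])  # parameters near the optimum of 3.0
```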
optimisation and choose the set of parameters which is best on average over all splits at the
same time.
In our work, in order to define the optimisation function to be used in a GA algorithm, we
used a visualization procedure with computed costs for many equally spaced parameter sets
in the parameter space. After having defined a good optimisation function, due to time
constraints, we did not proceed with another GA optimisation but reused our initial runs
from the visualization, with the following kind of multiresolution optimisation Kim & Zeigler
(1996) (see Figure 2):
• we identified those parameters that had not changed, and we froze the values of these
parameters;
• for every other parameter, we screened the 20 best parameter sets for every split and
identified a reasonable range;
• for all non-robust parameters, we chose an integer step s so that the search space did not
explode;
Fig. 2. Multiresolution optimisation over the parameter space (both axes from 0 to 90),
showing the successive minima Min_1, Min_2 and Min_final found as the search grid is
refined.
• we evaluated the costs for all possible combinations in the search space defined above and
found the parameter set P that yields the minimum average cost over all the different splits
used;
• if the parameter set P was at the border of the search space, we shifted the search space by
one step in the direction of that border and repeated the last step until we found the minimum
P in the inner area of the search space;
• we zoomed the screening in on the neighborhood of P, refined the step s, and repeated the
process until no further refinement was possible.
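The zoom-and-refine steps above can be sketched, for a single hypothetical parameter, as follows (the border-shifting step and the averaging over splits are omitted for brevity):

```python
def multires_minimise(cost, lo=0.0, hi=90.0, points=10, rounds=4):
    """Coarse-to-fine grid search: evaluate `points` equally spaced values,
    zoom into the neighbourhood of the best one, refine the step, repeat."""
    best = None
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best = min(grid, key=cost)
        lo, hi = best - step, best + step  # zoom in around the current minimum
    return best

# hypothetical one-dimensional cost surface with its minimum at 42
best = multires_minimise(lambda p: (p - 42.0) ** 2)
print(round(best, 2))  # ≈ 42.0
```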
actions to expand the model’s life span. In this section we explore advantages of using out-of-
time samples, monitoring reports, stability by vintage, vintage selection and how to deal with
different scales over time.
13.1 Out-of-time
An out-of-time sample (OOT) is any sample of the same phenomenon used in the model de-
velopment that is not in the development window8, i.e. not among the historic vintages or
observation points selected for development. In most real cases a simple split of the develop-
ment sample into training and testing data cannot identify real over-fitting of the model
Sobehart et al. (2000). Therefore, the most appropriate approach to identify this change is to
select either a vintage or observation point posterior to the development window, or one
prior to the development window. The second approach gives the extra advantage of using
the most up-to-date information for the development.
8 Here we understand development window as being the period from where the training samples were
extracted. This can be hours, days, months, years, etc.
15. References
Agterberg, F. P., Bonham-Carter, G. F., Cheng, Q. & Wright, D. F. (1993). Weights of evidence
modeling and weighted logistic regression for mineral potential mapping, pp. 13–32.
Chakravarti, I. M., Laha, R. G. & Roy, J. (1967). Handbook of Methods of Applied Statistics, Vol. I,
John Wiley and Sons, USA.
Charniak, E. (1991). Bayesian networks without tears, AI Magazine pp. 50–63.
Chawla, N. V., Japkowicz, N. & Kotcz, A. (2004). Special issue on learning from imbalanced
data sets, SIGKDD Explorations 6(1): 1–6.
Chawla, N. V., Japkowicz, N. & Kotcz, A. (eds) (2003). Proceedings of the ICML’2003 Workshop
on Learning from Imbalanced Data Sets.
Delwiche, L. & Slaughter, S. (2008). The Little SAS Book: A Primer, SAS Publishing.
Dietterich, T., Margineantu, D., Provost, F. & Turney, P. (eds) (2000). Proceedings of the
ICML’2000 Workshop on Cost-Sensitive Learning.
Elkan, C. (2001). The foundations of cost-sensitive learning, IJCAI, pp. 973–978.
URL: citeseer.ist.psu.edu/elkan01foundations.html
Gadi, M. F. A., Wang, X. & Lago, A. P. d. (2008a). Comparison with parametric optimization
in credit card fraud detection, ICMLA ’08: Proceedings of the 2008 Seventh International
Data mining with skewed data 187
13
Scaling up instance selection algorithms by dividing-and-conquering
1. Introduction
The overwhelming amount of data that is available nowadays in any field of research poses
new problems for machine learning methods. This huge amount of data makes most of the
existing algorithms inapplicable to many real-world problems. Two approaches have been
used to deal with this problem: scaling up machine learning algorithms and data reduction.
Nevertheless, scaling up a certain algorithm is not always feasible. On the other hand, data
reduction consists of removing missing, redundant and/or erroneous items from the data to
get a tractable amount of data. The most common methods for data reduction are instance
selection and feature selection.
However, these algorithms for data reduction have the same scaling problem they are trying
to solve. For example, in the best case, most existing instance selection algorithms are of
complexity O(n²), n being the number of instances. For huge problems, with hundreds of thousands or
even millions of instances, these methods are not applicable. The same happens with feature
selection algorithms.
The alternative is scaling up the machine learning algorithm itself. In the best case, this is an
arduous task, and in the worst case an impossible one. In this chapter we present a new
paradigm for scaling up machine learning algorithms based on the philosophy of divide-
and-conquer. One natural way of scaling up a certain algorithm is dividing the original
problem into several simpler subproblems and applying the algorithm separately to each
subproblem. In this way we might scale up instance selection dividing the original dataset
into several disjoint subsets and performing the instance selection process separately on
each subset. However, this method alone does not work well, as the application of the algorithm
to a subset suffers from the partial knowledge it has of the dataset. Nevertheless, if we join this
divide-and-conquer approach with the basis of the construction of ensembles of classifiers,
the combination of weak learners into a strong one, we obtain a very powerful and fast
method, applicable to almost any machine learning algorithm. This method can be applied
in different ways. In this chapter we propose two algorithms, recursive divide-and-conquer
and democratization, that are able to achieve very good performance and a dramatic
reduction in the execution time of the instance selection algorithms.
190 New Advances in Machine Learning
We will describe these methods and we will show how they can achieve very good results
when applied to instance selection. Furthermore, the methodology is applicable to other
machine learning algorithms, such as feature selection and cluster analysis.
Instance selection (Liu & Motoda, 2002) consists of choosing a subset of the total available
data to achieve the original purpose of the data mining application as if the whole data were
used. Different variants of instance selection exist. Many of the approaches are based on
some form of sampling (Cochran, 1997; Kivinen & Mannila, 1994). There are other, more
modern methods that are based on different principles, such as Modified Selective Subset
(MSS) (Barandela et al., 2005), entropy-based instance selection (Son & Kim, 2006), the
Intelligent Multiobjective Evolutionary Algorithm (IMOEA) (Chen et al., 2005), and the
LVQPRU method (Li et al., 2005).
The problem of instance selection for instance based learning can be defined as (Brighton &
Mellish, 2002) “the isolation of the smallest set of instances that enable us to predict the class
of a query instance with the same (or higher) accuracy than the original set”. It has been
shown that different groups of learning algorithms need different instance selectors in order
to suit their learning/search bias (Brodley, 1995). This may render many instance selection
algorithms useless if their design philosophy is not suited to the problem at hand.
We can distinguish two main models of instance selection (Cano et al., 2003): instance
selection as a method for prototype selection for algorithms based on prototypes (such as k-
Nearest Neighbors) and instance selection for obtaining the training set for a learning
algorithm that uses this training set (such as decision trees or neural networks). This chapter
is devoted to the former methods.
Regarding complexity, in the best case, most existing instance selection algorithms are of
efficiency O(n²), n being the number of instances. For huge problems, with hundreds of
thousands or even millions of instances, these methods are not applicable. Trying to develop
algorithms with a lower efficiency order is likely to be a fruitless search. Obtaining the
nearest neighbor of a given instance is O(n). To test whether removing an instance affects
the accuracy of the nearest neighbor rule, we must measure the effect of the absence of the
removed one on the other instances. Measuring this effect involves recalculating, directly or
indirectly, the nearest neighbors of the instances. The result is a process of O(n²). In this
way, the attempt to develop algorithms of an efficiency order below this bound is not very
promising.
Thus, the alternative is reducing the size n of the set to which instance selection algorithms
are applied. In the construction of ensembles of classifiers the problem of learning from
huge datasets has been approached by means of learning many classifiers from small
disjoint subsets (Chawla et al., 2004). In that paper, the authors showed that it is also
possible to learn an ensemble of classifiers from random disjoint partitions of a dataset, and
combine predictions from all those classifiers to achieve high classification accuracies. They
applied their method to huge datasets with very good results. Furthermore, the usefulness
of applying instance selection to disjoint subsets has also been shown in (García-Pedrajas et
al., 2009). In that work, a cooperative evolutionary algorithm was used. The training set was
divided into several disjoint subsets and an evolutionary algorithm was performed on each
subset of instances. The fitness of the individuals was evaluated only taking into account the
instances in the subset. To account for the global view needed by the algorithm, a global population was used.
Scaling up instance selection algorithms by dividing-and-conquering 191
This method is scalable to medium/large problems but cannot be
applied to huge problems. Zhu & Wu (2006) also used disjoint subsets in a method for
ranking representative instances.
Following this idea, we will present in this chapter two approaches for scaling up instance
selection algorithms that are based on a divide-and-conquer approach. The presented
methods are able to achieve very good performance with a drastic reduction in the time
needed for the execution of the algorithms. The general idea underlying this work is
dividing the original dataset into subsets and performing the instance selection process in
each subset separately. Then, we must find a method for combining the separate
applications of the instance selection algorithm to a final global result.
The rest of this chapter is organized as follows: Section 2 reviews related work; Section 3
describes in depth our proposal; Section 4 shows the experiments performed with our
methods; and finally Section 5 states the conclusions of our work.
2. Related work
As stated in the previous section, scaling up instance selection algorithms is a very relevant
issue. The usefulness of applying instance selection to disjoint subsets has also been shown
in (García-Pedrajas et al., 2009). In this work a cooperative evolutionary algorithm is used.
Several evolutionary algorithms are performed on disjoint subsets of instances and a global
population is used to account for the global view. This method is scalable to medium/large
problems but cannot be applied to huge problems.
There are not many previous works that have dealt with instance selection for huge
problems. Cano et al. (2005) proposed an evolutionary stratified approach for large
problems. Although the algorithm shows very good performance, it is still too
computationally expensive for huge datasets. Kim & Oommen (2004) proposed a method
based on a recursive application of instance selection to smaller datasets.
In a recent paper, De Haro-García and García-Pedrajas (2009) showed that the application of
a recursive divide-and-conquer approach is able to achieve a good performance while
attaining a dramatic reduction in the execution time of the instance selection process.
This method is applicable to any instance selection algorithm, as the instance selection
algorithm is a parameter of the method. More formally, first, our method divides the
whole training set, T, into disjoint subsets, tᵢ, of size s, such that T = ⋃ᵢ tᵢ. s is the only
parameter of the algorithm. In this study the dataset is randomly partitioned, although
other methods may be devised. Then, the instance selection algorithm of our choice is
performed over every subset independently. The selected instances in each subset are
joined again. With this new training set constructed with the selected instances, the process
is repeated until a certain stop criterion is fulfilled. The process of combining the instances
selected by the execution of the instance selection algorithm over each dataset can be
performed in different ways. We can just repeat the partition process as in the original
dataset. However, once the first partition has been performed, we can take advantage of the
work already done: instead of repeating the partitioning process, we join together
the subsets of selected instances until new subsets of approximately size s are obtained.
The detailed process is shown in Fig. 2.

S = T
repeat
    divide instances into disjoint subsets tᵢ: T = ⋃ tᵢ, of size s
    for each subset tᵢ do
        apply instance selection algorithm to tᵢ to obtain sᵢ ⊆ tᵢ
        remove from S the instances removed from tᵢ
    end for
    fuse subsets sᵢ to obtain new subsets tⱼ of size s
end repeat
return S

Fig. 2. Recursive instance selection algorithm
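As a concrete illustration, the recursive procedure of Fig. 2 can be sketched in Python. This is our own sketch, not the authors' code: the toy base selector (a Hart-style condensation rule) merely stands in for ICF, RNN or CHC, and the fixed iteration cap stands in for the cross-validated stop criterion described below.

```python
import random

def toy_selector(subset):
    """Stand-in base selector (Hart-style condensation): keep an instance
    only if the prototypes kept so far would misclassify it with 1-NN."""
    kept = [subset[0]]
    for x, y in subset[1:]:
        nearest = min(kept, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        if nearest[1] != y:          # current prototypes fail on (x, y): keep it
            kept.append((x, y))
    return kept

def recursive_selection(T, select=toy_selector, s=100, max_steps=10):
    """Recursive divide-and-conquer instance selection (sketch of Fig. 2).
    Instances are (feature_tuple, label) pairs."""
    S = list(T)
    for _ in range(max_steps):       # stand-in for the cross-validated stop rule
        random.shuffle(S)
        subsets = [S[i:i + s] for i in range(0, len(S), s)]   # disjoint t_i of size s
        selected = [inst for t in subsets for inst in select(t)]
        if len(selected) == len(S):  # no further reduction: stop
            return selected
        S = selected                 # fuse the survivors and repeat
    return S
```

Each pass touches every instance exactly once through subsets of fixed size s, which is the source of the linear behavior discussed later in the chapter.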
The stop criterion may be obtained in different ways. We can have a goal in terms of testing
error or reduction of storage and stop the algorithm when that goal is achieved. However, to
avoid the necessity of setting any additional parameter, we obtain the stop criterion by
means of cross-validation. We apply the algorithm using a cross-validation setup and obtain
the number of steps before the testing error starts to grow. This number of steps gives the
stopping criterion.
In classification, several weak learners are combined into an ensemble which is able to
improve the performance of any of the weak learners in isolation (García-Pedrajas et al., 2007).
In our method, the instance selection algorithm applied to a partition into disjoint subsets of
the original dataset can be considered a weak instance selector, as it has a partial view of the
dataset. The combination of these weak selectors using a voting scheme is similar to the
combination of different learners in an ensemble using a voting scheme. Fig. 3 shows a
general outline of the method.
An important issue in our method is determining the number of votes needed to remove an
instance from the training set. Preliminary experiments showed that this number highly
depends on the specific dataset. Thus, it is not possible to set a general pre-established value
usable in any dataset. Instead, we need a way of selecting this value directly from
the dataset at run time.
A first natural choice would be the use of a cross-validation procedure. However, this
method is very time consuming. A second choice is estimating the best value for the number
of votes from the effect on the training set. This latter method is the one we have chosen. The
choice of the number of votes must take into account two different criteria: the training error,
εₜ, and the storage, or memory, requirements, m. Both values must be minimized as much as
possible. Our method of choosing the number of votes needed to remove an instance is
based on obtaining the threshold number of votes, v, that minimizes a fitness criterion,
f(v), which is a combination of these two values:
f(v) = α εₜ(v) + (1 − α) m(v),    (1)
where α is a value in the interval [0, 1] which measures the relative relevance of both
values. In general, the minimization of the error is more important than storage reduction,
as we prefer a lesser error even if the reduction is smaller. Thus, we have used a value
of α = 0.75. Different values can be used if the researcher is more interested in reduction
than in error. m is measured as the percentage of instances retained, and εₜ is the training
error. However, estimating the training error is time consuming if we have large datasets.
To avoid this problem the training error is estimated using only a small percentage of the
whole dataset, which is 1% for medium and large datasets, and 0.1% for huge datasets.
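The threshold selection of Eq. (1) can be read as a small search over v. The following Python sketch is ours: the 1-NN error estimator, the sampling fraction and all function names are assumptions for illustration, not the chapter's implementation.

```python
def nn_error(retained, sample):
    """Training error of the 1-NN rule over `retained`, estimated on `sample`."""
    errors = 0
    for x, y in sample:
        nearest = min(retained, key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))
        errors += nearest[1] != y
    return errors / len(sample)

def choose_threshold(T, votes, r, alpha=0.75, sample_frac=0.1):
    """Pick v in [1, r] minimising f(v) = alpha*eps_t(v) + (1 - alpha)*m(v)."""
    sample = T[:max(1, int(len(T) * sample_frac))]  # small sample keeps the estimate cheap
    best_v, best_f = r, float("inf")
    for v in range(1, r + 1):
        retained = [inst for inst, nv in zip(T, votes) if nv < v]  # votes >= v are removed
        if not retained:
            continue
        f = alpha * nn_error(retained, sample) + (1 - alpha) * len(retained) / len(T)
        if f < best_f:
            best_v, best_f = v, f
    return best_v
```

Because r is small (10 in the experiments below), this search costs only r error estimations on the bounded sample.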
Data: A training set T = {(x₁, y₁), …, (xₙ, yₙ)}, subset size s, and number of rounds r.
Result: The set of selected instances S ⊆ T.
for i = 1 to r do
    divide instances into disjoint subsets tᵢ: ⋃ tᵢ = T, of size s
    for each tᵢ do
        apply instance selection algorithm to tᵢ
        store votes of removed instances from tᵢ
    end for
end for
obtain the threshold of votes v that minimizes f(v)
remove from T the instances whose votes are greater than or equal to v, to obtain S
return S
Fig. 4. Democratic instance selection algorithm
More formally, the process is the following: We perform r rounds of the algorithm and store
the number of votes received by each instance. Then, we must obtain the threshold number
of votes, v, to remove an instance. This value must be v ∈ [1, r]. We calculate the criterion
f(v) (Eq. 1) for all the possible threshold values from 1 to r, and assign v to the value which
minimizes the criterion. After that, we perform the instance selection removing the instances
minimizes the criterion. After that, we perform the instance selection removing the instances
whose number of votes is above or equal to the obtained threshold v. Fig. 4 shows the steps
of this algorithm.
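The democratic scheme of Fig. 4 might be sketched as follows. Again this is our own illustrative code: the selector interface, the round structure, and the simple majority default used in place of the Eq. (1) threshold search are all assumptions.

```python
import random

def democratic_selection(T, select, r=10, s=100, threshold=None):
    """Democratic instance selection sketch. `select(subset)` receives a list
    of (global_index, instance) pairs and returns the global indices it would
    remove; each proposed removal is one vote, applied only past a threshold."""
    votes = [0] * len(T)
    order = list(range(len(T)))
    for _ in range(r):                         # r independent voting rounds
        random.shuffle(order)                  # fresh random disjoint partition
        for start in range(0, len(order), s):
            subset = [(j, T[j]) for j in order[start:start + s]]
            for j in select(subset):           # one vote per proposed removal
                votes[j] += 1
    v = threshold if threshold is not None else (r + 1) // 2  # stand-in for Eq. (1)
    return [T[j] for j in range(len(T)) if votes[j] < v]
```

Note that the base selector is only ever run on subsets of fixed size s, so each round costs a fixed amount of work per subset regardless of n.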
                                 Features
Dataset        Instances  Real  Binary  Nominal  Classes  1-NN error
abalone             4177     7       -        1       29      0.8034
adult              48842     6       1        7        2      0.2005
car                 1728     -       -        6        4      0.1581
gene                3175     -       -       60        3      0.2767
german              1000     6       3       11        2      0.3120
hypothyroid         3772     7      20        2        4      0.0692
isolet              7797   617       -        -       26      0.1443
krkopt             28056     6       -        -       18      0.4356
kr vs. kp           3196     -      34        2        2      0.0828
letter             20000    16       -        -       26      0.0454
magic04            19020    10       -        -        2      0.2084
mfeat-fac           2000   216       -        -       10      0.0350
mfeat-fou           2000    76       -        -       10      0.2080
mfeat-kar           2000    64       -        -       10      0.0435
mfeat-mor           2000     6       -        -       10      0.2925
mfeat-pix           2000   240       -        -       10      0.0270
mfeat-zer           2000    47       -        -       10      0.2140
nursery            12960     -       1        7        5      0.2502
optdigits           5620    64       -        -       10      0.0256
page-blocks         5473    10       -        -        5      0.0369
pendigits          10992    16       -        -       10      0.0066
phoneme             5404     5       -        -        2      0.0952
satimage            6435    36       -        -        6      0.0939
segment             2310    19       -        -        7      0.0398
shuttle            58000     9       -        -        7      0.0010
sick                3772     7      20        2        2      0.0430
texture             5500    40       -        -       11      0.0105
waveform            5000    40       -        -        3      0.2860
yeast               1484     8       -        -       10      0.4879
Table 1. Summary of datasets used in our experiments
When evaluating instance selection algorithms for instance-based learning, the most
usual way of evaluation is estimating the performance of the algorithms on a set of
benchmark problems. In those problems several criteria can be considered, such as (Wilson
& Martínez, 2000): storage reduction, generalization accuracy, noise tolerance, and learning
speed. Speed considerations are difficult to measure, as we are evaluating not only an
algorithm but also a certain implementation. However, as the main aim of our work is
scaling up instance selection algorithms, execution time is a basic issue. To allow a fair
comparison, we have performed all the experiments in the same machine, a bi-processor
computer with two Intel Xeon QuadCore at 1.60GHz.
One of the advantages of our approach is that it can be applied to any kind of instance
selection method. As the instance selection method to apply is just a parameter of the
algorithm, there is no restriction on the method selected. In the experiments we have used
several of the most widely used instance selection methods.
In order to obtain an accurate view of the usefulness of our method, we have chosen to test
our model using several of the most successful state-of-the-art instance selection algorithms.
Initially, we used the algorithm
ICF (Brighton & Mellish, 2002). ICF (Iterative Case Filtering) is based on the concepts of
coverage and reachability of an instance c, which are defined as follows:
The local-set of a case c is defined as “the set of cases contained in the largest hypersphere
centered on c such that only cases in the same class as c are contained in the hypersphere”
(Brighton & Mellish, 2002), so the hypersphere is bounded by the first instance of a different
class. The coverage set of an instance includes the instances that have it as one of their
neighbors, and the reachable set is formed by the instances that are neighbors of this
instance. The algorithm is based on repeatedly applying a deleting rule to the set of retained
instances until no more instances fulfill the deleting rule.
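The local-set, and from it the coverage and reachable sets, can be computed directly from these definitions. The following is our reading of Brighton and Mellish's definitions, with squared Euclidean distance assumed; the function names are ours.

```python
def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def local_set(c, data):
    """Cases inside the largest same-class hypersphere centred on c, i.e.
    every same-class case strictly closer to c than c's nearest enemy."""
    x, y = c
    enemy_d = min(dist2(x, xo) for xo, yo in data if yo != y)  # nearest different class
    return [o for o in data if o != c and o[1] == y and dist2(x, o[0]) < enemy_d]

def coverage(c, data, local_sets):
    """Cases whose local-set contains c (`local_sets` caches local_set per case)."""
    return [o for o in data if c in local_sets[o]]
```

With these two sets in hand, ICF's deleting rule removes a case whose reachable set is larger than its coverage set, and the rule is re-applied to the retained cases until no case fires it.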
In addition to this method, it is worth mentioning the Reduced Nearest Neighbor (RNN) rule
(Gates, 1972). This method is extremely simple, but it also shows an impressive performance
in terms of storage reduction. In fact, it is the best of the methods used in these experiments
in reducing storage requirements, as will be shown in the next section. However, it has a
serious drawback: its computational complexity. Among the standard methods used, this is
the one that shows the worst scalability, taking several hundreds of hours in the worst case.
Therefore, RNN is the perfect target for our methodology: an instance selection method that
is highly efficient but has a serious scalability problem. So we have also tested our approach
using RNN as the base instance selection method.
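A naive sketch of the RNN rule makes its cost visible. This is our own simplified reading of Gates's procedure (real implementations start from a condensed set and use incremental bookkeeping rather than full re-checks):

```python
def dist2(a, b):
    """Squared Euclidean distance between feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def rnn_select(T):
    """Reduced Nearest Neighbour sketch: drop an instance whenever the
    remaining set still 1-NN-classifies every instance of T correctly.
    The repeated full re-checks over T are what make RNN so expensive."""
    S = list(T)
    changed = True
    while changed:
        changed = False
        for p in list(S):
            if len(S) == 1:
                break
            candidate = [q for q in S if q != p]
            # accept the removal only if no instance of T is misclassified
            if all(min(candidate, key=lambda q: dist2(q[0], x))[1] == y
                   for x, y in T):
                S = candidate
                changed = True
    return S
```

Each tentative removal triggers a pass over all of T with a nearest-neighbor search, which is exactly the quadratic-or-worse behavior the divide-and-conquer wrapper is meant to tame.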
The same parameters were used for the standard version of every algorithm and its
application within our methodology. None of the standard methods has relevant
parameters; the only value we must set is k, the number of nearest neighbors. For both ICF
and RNN, we used k = 3 neighbors. This is a fairly standard value (Cano et al., 2003). Our
method has two parameters: subset size, s, for both methods, and number of rounds, r, for
the democratic approach. Regarding subset size we must use a value large enough to allow
for a meaningful application of the instance selection algorithm on the subset, and small
enough to allow a fast execution, as the time used by our method grows with s. As a
compromise value we have chosen s = 100. For the number of rounds we have chosen a
small value to allow for a fast execution, r = 10. The application of our recursive divide-and-
conquer method with a certain instance selection algorithm X will be named RECURIS.X, and
the democratic approach DEMOIS.X.
Fig. 5. Results of standard ICF method and its recursive counterpart for testing error and
storage requirements
Fig. 5 shows the results comparing standard ICF and its recursive counterpart. The figure
(as well as the following ones) shows for each dataset the difference between the standard
method and our approach, a negative value meaning a better result for our proposal. The
figure shows that in terms of storage reduction our method is better in general, achieving for
some datasets, namely car, gene, german, krkopt, krvskp and nursery, significant
improvements over the standard method. In terms of testing error RECURIS.ICF is slightly
worse than standard ICF.
Fig. 6 shows the results for RNN as base instance selection method. This is a very good test of
our approach, as RNN is able to achieve very good results in terms of storage reduction
while keeping testing error in moderate bounds. However, RNN has a big problem of
scalability. The results show that our method is able to mostly keep the good performance
of RNN in terms of storage requirements, although with a generally worse behavior.
However, this is compensated by a better testing error for most datasets.
Fig. 6. Results of standard RNN and its recursive counterpart in terms of testing error and
storage requirements
Fig. 7. Results of standard CHC and its recursive counterpart in terms of testing error and
storage requirements
In each iteration new solutions are obtained combining two or more individuals (crossover
operator) or randomly modifying one individual (mutation operator). After applying these
two operators a subset of individuals is selected to survive to the next generation, either by
sampling the current individuals with a probability proportional to their fitness, or by
selecting the best ones (elitism). The repeated processes of crossover, mutation and selection
are able to obtain increasingly better solutions for many problems of Artificial Intelligence.
Nevertheless, the major problem addressed when applying genetic algorithms to instance
selection is the scaling of the algorithm. As the number of instances grows, the time needed
for the genetic algorithm to reach a good solution increases exponentially, making it totally
useless for large problems. As we are concerned with this problem, we have used as a further
instance selection method a genetic algorithm using the CHC methodology. The execution time
of CHC is clearly longer than the time spent by ICF, so it gives us a good benchmark to test
our methodology on an algorithm that, as RNN, has a big scalability problem.
For CHC, see Fig. 7, the results show that the recursive approach is able to improve the
results of the standard algorithm in terms of storage requirements but the error is worse
than when using the whole dataset. However, the achieved storage reduction is relevant,
and our method is clearly worse than standard CHC only in the magic04 problem.
An interesting side result is the scalability problem of the CHC algorithm, which is more
marked for this algorithm than for the previous ones. In other works (Cano et al., 2003;
García-Pedrajas et al., 2009), the CHC algorithm was compared with standard methods on small
to medium problems. For those problems, the performance of CHC was better than the
performance of other methods. However, as the datasets grow larger, the scalability problem
of CHC manifests itself. In our set of problems, CHC clearly performs worse than ICF and
RNN in terms of storage reduction. We must take into account that for CHC we need a bit in
the chromosome for each instance in the dataset. This means that for large problems, such as
adult, krkopt, letter, magic or shuttle, the chromosome has more than 10000 bits, making the
convergence of the algorithm problematic. Thus, CHC is, together with RNN, an excellent
example of the applicability of our approach.
Fig. 9. Results of standard RNN method and its democratic counterpart for testing error
and storage requirements
The figure shows how DEMOIS.RNN is able to solve the scalability problem of RNN. In terms
of testing error, it is able to achieve a similar performance to standard RNN. In terms of
storage reduction our algorithm performs worse than RNN. However, the performance of
DEMOIS.RNN is still very good, in fact better than that of any of the previous algorithms. So,
our approach is able to scale RNN to complex problems, improving its results in terms of
testing error, but with a small worsening of the storage reduction. In terms of execution time
the results are remarkable: the reduction of the time consumed by the selection process is
large, with the extreme example of the two most time-consuming datasets, adult and krkopt,
where the speed-up is more than a hundred times (see next section).
Fig. 10. Results of standard CHC method and its democratic counterpart for testing error
and storage requirements
Fig. 10 plots the results of the CHC algorithm. For this method, the scaling up of CHC provided
by DEMOIS.CHC is evident not only in terms of running time, with a large reduction in all 30
datasets, but also in terms of storage reduction. DEMOIS.CHC is able to improve the
reduction of CHC in all 30 datasets, with an average improvement of more than 20%, from
an average storage of CHC of 31.83% to an average storage of 11.58%. The bad side effect is
a worse testing error, which is, however, not very marked and is compensated by the
improvement in running time and storage reduction. In summary, for CHC the results
show that the democratic approach is able to improve the results of the standard algorithm
in terms of storage requirements, but the error is worse than when using the whole dataset.
However, as was the case for the recursive approach, there is a clear gain in storage
reduction with a moderately worse testing error.
4.4 Time
As we have stated, our main aim is the scaling up of instance selection algorithms. In the
previous sections we have shown that our methodology is able to match the performance of
standard instance selection algorithms. In this section we show the results of execution time
spent by each algorithm, showing a dramatic advantage of our approach. Figs. 11, 12 and 13
show the execution time of the ICF, RNN and CHC methods, respectively. The figures show
execution time, in seconds, plotted against problem size.
Fig. 11. Execution time using ICF as base instance selection algorithm
All three figures show the excellent behavior of the two described methods. Both behave
almost linearly as the problem size grows. On the other hand, ICF shows its quadratic
complexity, and RNN and CHC behave far worse.
From a theoretical point of view, the two algorithms presented in this chapter are of linear
complexity. For the recursive approach we divide the dataset into n/s subsets of size s. Then,
we apply the instance selection algorithm to each subset. The time needed for performing
the selection in each subset is fixed, as the size of each subset is always s, regardless of the
number of instances of the dataset. More instances simply mean a larger n/s. Thus, the
complexity of each step of the recursive algorithm will be linear, as n/s depends linearly on
n, the size of the dataset. The algorithm performs a few of these steps before reaching the
stopping criterion, and thus the whole method is of linear complexity.
The democratic approach also divides the dataset into partitions of disjoint subsets of size s.
Thus, the chosen instance selection algorithm is always applied to a subset of fixed size, s,
which is independent from the actual size of the dataset. The complexity of this application
of the algorithm depends on the base instance selection algorithm we are using, but will
always be small, as the size s is always small. Let K be the number of operations needed by
the instance selection algorithm to perform its task in a dataset of size s. For a dataset of n
instances we must perform this instance selection process once for each subset, that is n/s
times, spending a time proportional to (n/s)K. The total time needed by the algorithm to
perform r rounds will be proportional to r(n/s)K, which is linear in the number of instances,
as K is a constant value.
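The argument above amounts to a simple cost model, sketched here for concreteness; the quadratic base cost is an illustrative assumption (any fixed base selector would do, since its cost on a subset of size s is a constant K).

```python
def democratic_cost(n, s, r, base_cost):
    """Modelled cost of the democratic approach: r rounds of n//s subset
    selections, each costing base_cost(s); linear in n for fixed s and r."""
    return r * (n // s) * base_cost(s)

quadratic = lambda m: m * m   # an O(s^2) base selector, in the spirit of RNN or ICF
```

Under this model, doubling n doubles the cost of the democratic approach, whereas running the quadratic selector directly on the whole dataset would quadruple it.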
Fig. 12. Execution time using RNN as base instance selection algorithm.
Thus, the gain in execution time grows as the size of the datasets gets larger. If the
complexity of the instance selection algorithm is greater, the reduction of the execution
time will be even better. The method has the additional advantage of allowing an easy parallel
implementation. As the application of the instance selection algorithm to each subset is
independent from all the remaining subsets, all the subsets can be processed at the same
time, even for different rounds of votes. Also, the communication between the nodes of the
parallel execution is small.
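The independence of the subsets makes one round embarrassingly parallel. A minimal sketch with a thread pool follows; this is our own illustration (the selector interface is assumed), and a process pool or MPI would be needed for real CPU-bound speed-ups in Python.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_round(T, select, s=100, workers=4):
    """Apply the base selector to every disjoint subset of size s concurrently
    and merge the survivors; only this final merge needs any coordination."""
    subsets = [T[i:i + s] for i in range(0, len(T), s)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(select, subsets))   # one task per subset, order kept
    return [inst for kept in results for inst in kept]
```

Because `map` preserves subset order and each task reads only its own subset, the merged result is identical to the sequential run.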
Fig. 13. Execution time using CHC as base instance selection algorithm
An additional process completes the method: the determination of the number of votes.
This determination can be made in different
ways. If we consider all the training instances, the cost of this step would be O(n²).
However, to keep the complexity linear we use a random subset of the training set for
determining the number of votes, with a limit on the maximum size of this subset that is
fixed for any dataset. In this way, for medium to large datasets we use 10% of the
training set, for huge problems 1%, and the percentage is further reduced as the size of
the dataset grows. In fact, we have experimentally verified that we can consider any
reasonable bound¹ on the number of instances without damaging the performance of the
algorithm. Using a small percentage does not harm the estimation of the threshold of votes.
With this method the complexity of this step is O(1), as the number of instances used is
bounded regardless of the size of the dataset.
Finally, we consider the partition of the dataset apart from the algorithm, as many different
partition methods can be devised. The random partition performed is of complexity O(n).
7. Conclusions
In this chapter we have shown two new methods for scaling up instance selection
algorithms. These methods are applicable to any instance selection method without any
modification. The methods consist of a recursive procedure, where the dataset is partitioned
into disjoint subsets, an instance selection algorithm is applied to each subset, and then the
selected instances are rejoined to repeat the process; and a democratic approach, where
several rounds of approximate instance selection are performed and the result is obtained by
a voting scheme.
Using three well-known instance selection algorithms, ICF, RNN and a CHC genetic
algorithm, we have shown that our method is able to match the performance of the original
algorithms with a considerable reduction in execution time. In terms of reduction of storage
requirements, our approach is even better than the use of the original instance selection
algorithm over the whole dataset. Additionally, our method is straightforwardly
parallelizable without modifications.
The proposed methods allow the application of instance selection algorithms to almost any
problem size. The behavior is linear in the number of instances as it has been shown both
theoretically and experimentally.
Furthermore, this philosophy can be extended to other learning algorithms such as feature
selection or clustering, which means it is a powerful tool for scaling up machine learning
algorithms.
Scaling up instance selection algorithms by dividing-and-conquering 207
Ant Colony Optimization 209
14
Ant Colony Optimization
1. Introduction
Swarm intelligence is a relatively novel approach to problem solving that takes inspiration
from the social behaviors of insects and of other animals. In particular, ants have inspired a
number of methods and techniques among which the most studied and the most successful
one is the ant colony optimization.
Ant colony optimization (ACO), a population-based meta-heuristic approach, was proposed by Dorigo et al. to solve several discrete optimization problems (Dorigo, 1996, 1997). The general ACO algorithm mimics the way real ants find
the shortest route between a food source and their nest. The ants communicate with one
another by means of pheromone trails and exchange information indirectly about which
path should be followed. Paths with higher pheromone levels will more likely be chosen
and thus reinforced later, while the pheromone intensity of paths that are not chosen is
decreased by evaporation. This form of indirect communication is known as stigmergy, and
provides the ant colony shortest-path finding capabilities. The first algorithm following the
principles of the ACO meta-heuristic is the Ant System (AS) (Dorigo,1996), where ants
iteratively construct solutions and add pheromone to the paths corresponding to these
solutions. Path selection is a stochastic procedure based on two parameters, the pheromone
and heuristic values, which will be detailed in the following section in this chapter. The
pheromone value gives an indication of the number of ants that chose the trail recently,
while the heuristic value is problem-dependent and it has different forms for different cases.
Because the general ACO framework can easily be extended to other optimization problems, several variants of it have been proposed, such as Ant Colony System (Dorigo, 1997), the rank-based Ant System (Bullnheimer, 1999), and the Elitist Ant System (Dorigo, 1996). These variants have been applied to a variety of problems, such as vehicle routing (Montemanni, 2005), scheduling (Blum, 2005), and the travelling salesman problem (Stützle, 2000). Recently, ants have also entered the data mining domain, addressing both the clustering (Kanade, 2007) and classification (Martens et al., 2007) tasks.
This chapter focuses on another application of ACO: track initiation in the target tracking field. To the best of our knowledge, there are few reports on track initiation using ACO. In the real world, however, it can be observed that almost all ants tend to gather around food sources arranged along a line or curve. Fig. 1 shows the evolution of ants searching for food. Initially, all ants are distributed randomly in the plane, as in Fig. 1(a); a few hours later, most of the ants have gathered around the food sources, as shown in Fig. 1(b). Taking inspiration from this phenomenon, we may regard these linear or curved food sources as tentative tracks to be initiated, and the corresponding ant model is established from an optimization viewpoint to solve the problem of multiple track initiation.
210 New Advances in Machine Learning
Fig. 1. The evolution of ants searching for food: (a) the initial distribution of the ants; (b) the distribution of the ants a few hours later, gathered around the food sources
The remainder of this chapter is structured as follows. Section 2 introduces the widely used ant system and its successors. Section 3 presents the new application of ACO to the track initiation problem and models a system of ants of different tasks to match the problem. Section 4 compares and analyzes the performance of ACO-based techniques for track initiation. Finally, some conclusions are drawn.
rule, called the random proportional rule, to decide which city to visit next. In particular, the probability with which ant $k$, located at city $i$, chooses to go to city $j$ is

$p_{ij}^{k} = \dfrac{[\tau_{ij}]^{\alpha}\,[\eta_{ij}]^{\beta}}{\sum_{l \in N_i^k} [\tau_{il}]^{\alpha}\,[\eta_{il}]^{\beta}}, \quad \text{if } j \in N_i^k,$  (1)

where $\eta_{ij} = 1/d_{ij}$ is a heuristic value that is computed in advance, $\alpha$ and $\beta$ are two parameters which determine the relative importance of the pheromone trail and the heuristic information, and $N_i^k$ is the set of cities that ant $k$ has not visited so far. Under this probabilistic rule, the probability of choosing the arc $(i,j)$ increases with the value of the associated pheromone trail $\tau_{ij}$ and of the heuristic information value $\eta_{ij}$. The roles of the parameters $\alpha$ and $\beta$ are as follows. If $\alpha = 0$, the closest cities are more likely to be selected: this corresponds to a classic stochastic greedy algorithm. If $\beta = 0$, the pheromone is used alone, without any heuristic bias. This generally leads to rather poor results and, in particular, for values of $\alpha > 1$ it leads to early stagnation, that is, a situation in which all the ants follow the same path and construct the same tour, which, in general, is strongly suboptimal.
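A minimal sketch of the random proportional rule of equation (1); the distance matrix, pheromone matrix, and parameter values are assumed inputs.

```python
import numpy as np

def choose_next_city(i, visited, tau, dist, alpha=1.0, beta=2.0, rng=None):
    """AS random proportional rule (eq. 1): city j is chosen with
    probability proportional to tau[i,j]^alpha * eta[i,j]^beta over
    the cities not visited so far."""
    if rng is None:
        rng = np.random.default_rng()
    feasible = [j for j in range(tau.shape[0]) if j not in visited]
    eta = 1.0 / dist[i, feasible]                 # heuristic value 1/d_ij
    weights = tau[i, feasible] ** alpha * eta ** beta
    probs = weights / weights.sum()
    return feasible[rng.choice(len(feasible), p=probs)]
```

Setting `alpha=0` with a large `beta` reproduces the greedy behaviour described above: the nearest unvisited city is chosen almost surely.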
Each ant maintains a memory which records the cities already visited. This memory is used to define the feasible neighbourhood $N_i^k$ in the construction rule given by equation (1). In addition, such a memory allows ant $k$ both to compute the length of the tour $T^k$ it generated and to retrace the path to deposit pheromone in the subsequent global pheromone update.
Concerning solution construction, there are two different ways of implementing it: parallel
and sequential solution construction. In the parallel implementation, at each construction
step all ants move from their current city to the next one, while in the sequential
implementation an ant builds a complete tour before the next one starts to build another one.
In the AS case, both choices for the implementation of the tour construction are equivalent
in the sense that they do not significantly influence the algorithm’s behaviour.
Update of Pheromone Trails
After all the ants have constructed their tours, the pheromone trails are updated. This is done by first lowering the pheromone value on all arcs by a constant factor, and then adding pheromone on the arcs the ants have crossed in their tours. Pheromone evaporation is implemented by the following law:

$\tau_{ij} \leftarrow (1-\rho)\,\tau_{ij}, \quad \forall (i,j) \in L,$  (2)

where $0 < \rho \le 1$ is the pheromone evaporation rate. The parameter $\rho$ is used to avoid unlimited accumulation of the pheromone trails and enables the algorithm to "forget" bad decisions previously taken. In fact, if an arc is not chosen by the ants, its associated pheromone value decreases exponentially with the number of iterations. After evaporation, all ants deposit pheromone on the arcs they have crossed in their tour:

$\tau_{ij} \leftarrow \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k}, \quad \forall (i,j) \in L,$  (3)

where $\Delta\tau_{ij}^{k}$ is the amount of pheromone ant $k$ deposits on the arcs it has visited: $\Delta\tau_{ij}^{k} = 1/C^{k}$ if arc $(i,j)$ belongs to the tour $T^{k}$ built by ant $k$, and $0$ otherwise, where $C^{k}$ is the length of that tour. In the Elitist Ant System, the update additionally reinforces the best-so-far tour $T^{bs}$:

$\tau_{ij} \leftarrow \tau_{ij} + \sum_{k=1}^{m} \Delta\tau_{ij}^{k} + e\,\Delta\tau_{ij}^{bs},$  (5)

where $e$ is a parameter weighting the best-so-far tour and $\Delta\tau_{ij}^{bs} = 1/C^{bs}$ if arc $(i,j)$ belongs to $T^{bs}$, and $0$ otherwise.
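Equations (2), (3), and (5) can be combined into one update routine; this is a sketch for a symmetric TSP, with the elitist weight `e` optional.

```python
import numpy as np

def update_pheromone(tau, tours, lengths, rho=0.5, best=None, e=0.0):
    """AS pheromone update: evaporation (eq. 2), then each ant deposits
    1/C^k on the arcs of its tour T^k (eq. 3).  If a best-so-far tour
    is given, the elitist term e/C^bs of eq. (5) is added as well."""
    tau *= (1.0 - rho)                        # eq. (2): evaporation
    for tour, length in zip(tours, lengths):  # eq. (3): deposit 1/C^k
        for a, b in zip(tour, tour[1:] + tour[:1]):
            tau[a, b] += 1.0 / length
            tau[b, a] += 1.0 / length         # symmetric TSP
    if best is not None and e > 0:            # eq. (5): elitist bonus
        bs_tour, bs_len = best
        for a, b in zip(bs_tour, bs_tour[1:] + bs_tour[:1]):
            tau[a, b] += e / bs_len
            tau[b, a] += e / bs_len
    return tau
```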
In the rank-based Ant System, the ants are ranked by tour length at each iteration; assume that $W$ best-ranked ants are considered, and that only the $(W-1)$ best-ranked ants of the iteration and the ant that produced the best-so-far tour are allowed to deposit pheromone. The best-so-far tour gives the strongest feedback, with weight $w$; the $r$th best ant of the current iteration contributes to the pheromone update with the value $1/C^{r}$ multiplied by a weight given by $\max\{0, W-r\}$. Thus, the AS$_{rank}$ pheromone update rule is

$\tau_{ij} \leftarrow \tau_{ij} + \sum_{r=1}^{W-1} (W-r)\,\Delta\tau_{ij}^{r} + w\,\Delta\tau_{ij}^{bs}.$  (7)
In the MAX-MIN Ant System (MMAS), only one ant is allowed to deposit pheromone after each iteration:

$\tau_{ij} \leftarrow \tau_{ij} + \Delta\tau_{ij}^{best},$  (8)

where $\Delta\tau_{ij}^{best} = 1/C^{best}$. The ant which is allowed to add pheromone may be either the best-so-far ant, in which case $\Delta\tau_{ij}^{best} = 1/C^{bs}$, or the iteration-best ant, in which case $\Delta\tau_{ij}^{best} = 1/C^{ib}$, where $C^{ib}$ is the length of the iteration-best tour. In general, in MMAS implementations both the iteration-best and the best-so-far update rules are used, in an alternating way. Obviously, the relative frequency with which the two pheromone update rules are applied influences how greedy the search is: when pheromone updates are always performed by the best-so-far ant, the search focuses very quickly around $T^{bs}$, whereas when it is the iteration-best ant that updates the pheromone, the number of arcs that receive pheromone is larger and the search is less directed.
Pheromone Trail Limits
In MMAS, lower and upper limits $\tau_{min}$ and $\tau_{max}$ on the possible pheromone values on any arc are imposed in order to avoid premature search stagnation. In particular, the imposed pheromone trail limits have the effect of limiting the probability $p_{ij}$ of selecting a city $j$ when an ant is in city $i$ to the interval $[p_{min}, p_{max}]$, with $0 < p_{min} \le p_{ij} \le p_{max} < 1$. Only when an ant $k$ has a single possible choice for the next city, that is $|N_i^k| = 1$, do we have $p_{min} = p_{max} = 1$.
It is easy to show that, in the long run, the pheromone trail on any arc is bounded above by $1/(\rho\,C^{*})$, where $C^{*}$ is the length of the optimal tour. Based on this result, MMAS uses an estimate of this value, $1/(\rho\,C^{bs})$, to define $\tau_{max}$: each time a new best-so-far tour is found, the value of $\tau_{max}$ is updated. The lower pheromone trail limit is set to $\tau_{min} = \tau_{max}/a$, where $a$ is a parameter (Stützle & Hoos, 2000).
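The trail limits can be maintained with a few lines of code; here `a` plays the role of the parameter dividing the upper limit to obtain the lower one.

```python
import numpy as np

def mmas_limits(c_bs, rho, a=2.0):
    """MMAS trail limits: tau_max is the estimate 1/(rho*C^bs), updated
    whenever a new best-so-far tour is found; tau_min = tau_max / a
    for a parameter a (Stützle & Hoos, 2000)."""
    tau_max = 1.0 / (rho * c_bs)
    tau_min = tau_max / a
    return tau_min, tau_max

def clamp_trails(tau, tau_min, tau_max):
    """Keep every trail inside [tau_min, tau_max] after each update."""
    return np.clip(tau, tau_min, tau_max)
```

Clamping after every pheromone update keeps the selection probabilities inside $[p_{min}, p_{max}]$, which is what prevents the search from freezing on a single tour.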
Pheromone Trail Initialization and Re-initialization
At the start of the algorithm, the initial pheromone trails are set to an estimate of the upper
pheromone trail limit. This way of initializing the pheromone trails, in combination with a
small pheromone evaporation parameter, causes a slow increase in the relative difference in
the pheromone trail levels, so that the initial search phase of MMAS is very explorative.
Note that, in MMAS, pheromone trails are occasionally re-initialized. Re-initialization is typically triggered when the algorithm approaches stagnation or when no improved tour has been found for a given number of iterations.
Fig. 2. The generated "ghosts" in the case of two-sensor, two-target bearings-only tracking (BOT); the lines of sight from sensors 1 and 2 intersect at the tracks of targets 1 and 2 and also at spurious positions
3.2 Motivation
In the image detection field, the Hough transform (H-T) has been recognized as a robust technique for line or curve detection and has been widely applied by the scientific community (Bhattacharya, 2002; Shapiro, 2005). The basic idea of the H-T is to transform a point $(x, y)$ in the Cartesian coordinate system into a curve in the $(\rho, \theta)$ parameter space, formulated as

$\rho = x\cos\theta + y\sin\theta.$  (9)

Curves corresponding to collinear points ideally intersect at a single point in the parameter space; in practice the intersection points are scattered because of measurement error. Even so, these points are still distributed in a small region, and thus such a small area can be taken as an objective function to be optimized.
For the case of two given tracks, the corresponding intersections in the parameter space are plotted in Fig. 3. For the upper-left expanded subfigure, which corresponds to target 1, the minimum and maximum values of $\rho$ are obtained and denoted by $\rho_{min}$ and $\rho_{max}$, respectively. Similarly, the minimum and maximum values of $\theta$ are found and denoted by $\theta_{min}$ and $\theta_{max}$, respectively.
Fig. 3. The intersection points of the transformed curves in the $(\theta, \rho)$ parameter space for the two given tracks, with expanded subfigures for target 1 and target 2 showing the extreme values $(\theta_{min}, \rho_{min})$ and $(\theta_{max}, \rho_{max})$ of each cluster of intersections
For the $i$th track, the area of the corresponding region in the parameter space is

$S_i = (\rho_{max}^{i} - \rho_{min}^{i})(\theta_{max}^{i} - \theta_{min}^{i}),$  (10)

and the objective function $J$ is defined as

$J = \min \sum_{r=1}^{M} S_r.$  (11)
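A sketch of this idea, assuming the standard normal-form Hough mapping $\rho = x\cos\theta + y\sin\theta$ of eq. (9) and computing the pairwise curve intersections in closed form; for noisy, roughly collinear points the returned area $S$ of eq. (10) is small.

```python
import numpy as np
from itertools import combinations

def hough_region(points):
    """For roughly collinear points, every pair of Hough curves
    rho = x*cos(theta) + y*sin(theta) crosses near the line's true
    (theta, rho); noise spreads the crossings over a small region
    whose area S (eq. 10) is the quantity to be minimised."""
    thetas, rhos = [], []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        theta = np.arctan2(x2 - x1, y1 - y2)   # where the two curves cross
        thetas.append(theta)
        rhos.append(x1 * np.cos(theta) + y1 * np.sin(theta))
    s = (max(rhos) - min(rhos)) * (max(thetas) - min(thetas))
    return s, (min(thetas), max(thetas)), (min(rhos), max(rhos))
```

Perfectly collinear points give an area of zero, so the objective $J$ of eq. (11) rewards candidate point sets that lie on a common line.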
3) Ants of the same task are dedicated to finding their best solution, and the set of all best solutions found by ants of the different tasks constitutes the solution to Eq. (11) described above.
4) In the system of ants of different tasks, the search space depends not only on the measurement returns at the next scan but also on prior knowledge of the target motion.
$j = \begin{cases} \arg\max_{l \in \Phi_i^s} \left\{ \dfrac{\tau_{i,l}^{s}\,\eta_{i,l}}{1 + \gamma\,(\tau_{i,l} - \tau_{i,l}^{s})} \right\}, & \text{if } q \le q_0 \\ J, & \text{otherwise,} \end{cases}$  (12)

and $J$ is a random variable selected according to the following probability distribution:

$P(j) = \begin{cases} \dfrac{\tau_{i,j}^{s}\,\eta_{i,j}\,\big/\,\bigl(1 + \gamma\,(\tau_{i,j} - \tau_{i,j}^{s})\bigr)}{\sum_{l \in \Phi_i^s} \tau_{i,l}^{s}\,\eta_{i,l}\,\big/\,\bigl(1 + \gamma\,(\tau_{i,l} - \tau_{i,l}^{s})\bigr)}, & \text{if } j \in \Phi_i^s \\ 0, & \text{otherwise,} \end{cases}$  (13)

where $\tau_{i,j}^{s}$ denotes the pheromone amount deposited by ants of task $s$ on trail $(i,j)$, $\tau_{i,j}$ is the total pheromone amount deposited by all ants of the different tasks on trail $(i,j)$, $\gamma$ expresses the repulsion exerted by the foreign pheromone left on trail $(i,j)$, $q$ is a random number uniformly distributed between 0 and 1, and $q_0$ is a parameter which determines the relative importance of the exploitation of good solutions versus the exploration of the search space.
According to the search spaces discussed above, Fig. 5 illustrates how the heuristic value is calculated from search space $\Phi_1$ to $\Phi_2$; namely, if an ant moves from position $i$ to position $j$, the corresponding heuristic value is defined as

$\eta_{i,j} = \exp\!\left( -\dfrac{(d_{i,j} - r_0)^2}{2\,(r_2 - r_1)^2} \right),$  (14)

where $d_{i,j}$ denotes the distance between positions $i$ and $j$, and $r_0 = (r_1 + r_2)/2$.
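A sketch of the heuristic of eq. (14) together with a pseudo-random proportional choice including repulsion of foreign pheromone in the spirit of eqs. (12)-(13); the exact way $\tau$, $\eta$, and $\gamma$ are combined is an assumption of this sketch.

```python
import numpy as np

def heuristic(d_ij, r1, r2):
    """Gaussian heuristic of eq. (14): peaks when the step length d_ij
    equals r0 = (r1 + r2)/2, the middle of the feasible annulus."""
    r0 = 0.5 * (r1 + r2)
    return np.exp(-(d_ij - r0) ** 2 / (2.0 * (r2 - r1) ** 2))

def choose_position(tau_s, tau_all, eta, gamma, q0, rng=None):
    """Pseudo-random proportional rule with repulsion: the attraction
    tau_s * eta of each candidate is damped by the foreign pheromone
    (tau_all - tau_s) through the factor 1 + gamma * (tau_all - tau_s)."""
    if rng is None:
        rng = np.random.default_rng()
    w = tau_s * eta / (1.0 + gamma * (tau_all - tau_s))
    if rng.random() <= q0:
        return int(np.argmax(w))                        # exploitation
    return int(rng.choice(len(w), p=w / w.sum()))       # exploration
```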
Note that if position $j$ falls outside $\Phi_2^i$, we set $\eta_{i,j} = 0$ and a search failure is declared.
Fig. 5. [figure: the geometry used to compute the heuristic value $\eta_{i,j}$, showing positions $i$ and $j$, the distance $d_{i,j}$, the radii $r_1$ and $r_2$, and the mid-radius $r_0$]
Update of Pheromone
The pheromone update is performed in two phases, namely a local update and a global update. While building a solution, if an ant of task $s$ carries out the transition from position $i$ to position $j$, the pheromone level of the corresponding trail is changed in the following way:

$\tau_{i,j}^{s} \leftarrow (1-\rho)\,\tau_{i,j}^{s} + \sum_{k=1}^{p} \Delta\tau_{i,j}^{s,k},$  (16)

where $\Delta\tau_{i,j}^{s,k}$ is the pheromone amount that ant $k$ of task $s$ deposits on the trail $(i,j)$ it has traveled at the current iteration, and $p$ is the number of ants. In the case of bearings-only multi-sensor multi-target tracking, $\Delta\tau_{i,j}^{s,k}$ is set to a constant.
Scenario   Target   x (km)   y (km)   vx (m/s)   vy (m/s)
1          1        60       30       50         -100
1          2        80       60       150        -150
2          1        60       30       50         -100
2          2        80       60       150        -150
2          3        60       50       80         -120
Table 1. The initial position and velocity of each target in the two considered scenarios
Fig. 6. The target position candidates in a "clutter-free" environment (left: Scenario 1, right: Scenario 2)
Fig. 7. The target position candidates in clutter environments (left: Scenario 1, right: Scenario 2)
Figs. 6 and 7 depict a subset of the position candidates obtained by intersecting the lines of sight (LOSs) at each scan; our objective is to discriminate the true positions of each target of interest. Here we use two ACO-based techniques, namely the Ant System (referred to as the traditional ACO) and the system of ants of different tasks (referred to as the proposed ACO).
The other parameters of the two ACO-based methods are given in Table 2; they include $q_0 = 0.7$, $|v_{min}| = 100\ \mathrm{m/s}$, $|v_{max}| = 400\ \mathrm{m/s}$, $|a_{max}| = 15\ \mathrm{m/s^2}$, and $N = 50$.
Table 2. The parameter settings for the ACO-related methods
The probability of false track initiation:

$F = \sum_{i=1}^{N} f_i \Big/ \sum_{i=1}^{N} n_i,$  (17)

where $f_i$ denotes the number of false initiated tracks at the $i$th Monte-Carlo run, and $n_i$ is the total number of initiated tracks.
The probability of correct initiation of at least $j$ tracks: if at least $j$ ($1 \le j \le M$) tracks are initiated correctly, the corresponding probability is

$C_j = \sum_{i=1}^{N} l_{ij} \Big/ N,$  (18)
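Both evaluation indices are straightforward to compute from per-run counts:

```python
def false_init_probability(f, n):
    """Eq. (17): F = sum_i f_i / sum_i n_i over N Monte-Carlo runs,
    where f_i is the number of false initiated tracks in run i and
    n_i the total number of initiated tracks in run i."""
    return sum(f) / sum(n)

def correct_init_probability(correct_counts, j):
    """Eq. (18): C_j = (1/N) * sum_i l_ij, where l_ij = 1 if at least
    j tracks were initiated correctly in run i and 0 otherwise."""
    return sum(1 for c in correct_counts if c >= j) / len(correct_counts)
```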
4.3 Results
All results in Tables 3 to 6 are averaged over 10,000 Monte-Carlo runs. According to the evaluation indices introduced above, the traditional ACO algorithm performs as well as the proposed one in clutter-free environments, as illustrated in Tables 3 and 4. However, in the presence of clutter, the proposed ACO algorithm shows a significant improvement over the traditional one with respect to the probability of false track initiation, as shown in Tables 5 and 6.
Probability of correct initiation of at least $j$ tracks ($C_j$), traditional ACO / proposed ACO:

Clutter-free (Tables 3 and 4):   $C_1$ = 1.0000 / 1.0000;  $C_2$ = 0.9998 / 0.9997;  $C_3$ = 0.9857 / 0.9861
Clutter (Tables 5 and 6):        $C_1$ = 1.0000 / 1.0000;  $C_2$ = 0.9787 / 0.9997;  $C_3$ = 0.9267 / 0.9861
Among the 10,000 Monte-Carlo runs, only the cases in which all tracks are initiated successfully are investigated; these are referred to as effective runs hereafter. For objectivity of comparison, we select from the effective runs the worst case, in which the maximum running time of each ACO algorithm is evaluated.
Fig. 8 depicts the evolution of the objective function value with the number of iterations in Scenario 2. Compared with the traditional ACO algorithm, the proposed one requires fewer iterations for convergence in both clutter-free and clutter environments. According to Tables 3 and 4, although the performance of the traditional ACO algorithm is comparable to that of the proposed one, the proposed algorithm is more practical because it requires less running time. Figs. 9 and 10 depict the pheromone curves on the true targets' tracks; it can be observed that the amount of pheromone on each "true" track increases steadily, which means that most ants prefer these tracks and regard them as optimal solutions.
Fig. 8. [figure: objective value (m·rad) versus iteration number in Scenario 2 for the two ACO algorithms]
Fig. 9. Pheromone curves in clutter-free environments (left: the proposed ACO; right: the traditional ACO)
Fig. 10. Pheromone curves in clutter environments (left: the proposed ACO; right: the traditional ACO)
5. Conclusion
This chapter has introduced several widely used ACO algorithms and their origins, such as AS, EAS, and MMAS. A common thread among them is the pheromone update strategy: some algorithms use the best-so-far ant or the iteration-best ant, independently or in alternation, to update the trails the ants have travelled, and the update law differs slightly between variants. Among the ACO algorithms discussed, two versions, AS and MMAS, have become especially popular in applications.
Another contribution of this chapter is the extension of the general ACO algorithm to a system of ants of different tasks, whose behaviour is modelled and implemented for the track initiation problem. Simulation results are presented to show the effectiveness of the novel ACO algorithm. Based on the example presented in this chapter, we believe that the general framework of AS can be modified to solve a variety of optimization problems.
6. References
B. Bullnheimer; R. F. Hartl & C. Strauss. (1999). A new rank based version of the ant system: A computational study, Central Eur. J. Oper. Res. Econ., Vol. 7, No. 1, 25-38, ISSN 1435-246X.
C. Blum. (2005). Beam-ACO—hybridizing ant colony optimization with beam search: An application to open shop scheduling, Comput. Oper. Res., Vol. 32, No. 6, 1565-1591, ISSN 0305-0548.
David Martens; Manu De Backer & Raf Haesen. (2007). Classification With Ant Colony Optimization, IEEE Trans. on Evolutionary Computation, Vol. 11, No. 5, October 2007, 651-665, ISSN 1089-778X.
Kutluyil Dogancay. (2004). On the bias of linear least squares algorithm for passive target localization, Signal Processing, Vol. 84, No. 3, 475-486, ISSN 0165-1684.
Kutluyil Dogancay. (2005). Bearings-only target localization using total least squares, Signal Processing, Vol. 85, No. 9, 1695-1710, ISSN 0165-1684.
M. Dorigo; V. Maniezzo & A. Colorni. (1991). Positive Feedback as a Search Strategy, Technical Report 91-016, Politecnico di Milano, Milano, Italy.
M. Dorigo; V. Maniezzo & A. Colorni. (1996). The ant system: optimization by a colony of cooperating agents, IEEE Trans. on Systems, Man, and Cybernetics-Part B, Vol. 26, No. 1, 29-42, ISSN 1083-4419.
M. Dorigo & L. M. Gambardella. (1997). Ant colony system: A cooperative learning approach to the traveling salesman problem, IEEE Trans. on Evolutionary Computation, Vol. 1, No. 1, 53-66, ISSN 1089-778X.
P. Bhattacharya; A. Rosenfeld & I. Weiss. (2002). Point-to-line mappings as Hough transforms, Pattern Recognition Letters, Vol. 23, No. 4, 1705-1710, ISSN 0167-8655.
Parag M. Kanade & Lawrence O. Hall. (2007). Fuzzy Ants and Clustering, IEEE Trans. on Systems, Man, and Cybernetics-Part A, Vol. 37, No. 5, September 2007, 758-769, ISSN 1083-4427.
R. Montemanni; L. M. Gambardella; A. E. Rizzoli & A. Donati. (2005). Ant colony system for a dynamic vehicle routing problem, J. Combinatorial Optim., Vol. 10, No. 4, 327-343, ISSN 1573-2886.
S. C. Nardone; A. G. Lindgren & K. F. Gong. (1984). Fundamental properties and performance of conventional bearings-only target motion analysis, IEEE Transactions on Aerospace and Electronic Systems, Vol. 29, No. 9, 775-787, ISSN 0018-9251.
T. Stützle & H. H. Hoos. (2000). MAX-MIN ant system, Future Generation Computer Systems, Vol. 16, 889-914, ISSN 0167-739X.
V. Shapiro. (2006). Accuracy of the straight line Hough transform: the non-voting approach, Computer Vision and Image Understanding, Vol. 103, No. 1, 1-21, ISSN 1077-3142.
Mahalanobis Support Vector Machines Made Fast and Robust 227
15
Mahalanobis Support Vector Machines Made Fast and Robust
CHINA
1. Introduction
As is well known, common Euclidean distance based SVMs are easily influenced by outliers in the given samples, which can subsequently cause large prediction errors at test time. Therefore, many researchers propose preprocessing methods, such as whitening or normalizing the data to a spherical shape, to remove the outliers before calling a routine SVM method to build a more reasonable machine. However, since the Euclidean distance is often sub-optimal, especially in high-dimensional learning problems, and an ill-conditioned Gram kernel can cause the learning machine to fail, it is necessary to find a more efficient and robust way to resolve the problem; this is the motivation of this chapter.
The Mahalanobis distance is superior to the Euclidean distance in handling outliers and is widely used in statistics and machine learning. Several methods currently combine SVMs with the Mahalanobis distance. Some use it in the kernel, replacing a common kernel with a Mahalanobis one; some use it in a preprocessing phase to remove outliers before building an SVM with common methods; others use it in a postprocessing phase to extract key support vectors for speedup and efficiency. Most of them achieve superior performance compared with their plain SVM counterparts. It should be pointed out, however, that the complexity of the combined algorithm is the most important factor in building such an algorithm.
To our knowledge, none of these approaches incorporates the Mahalanobis distance into the model itself in a way that trades off complexity against performance within a single algorithm while also making it more robust. The obvious feature of the new method is that it is no longer necessary to remove the outliers first, since they are already accounted for and are identified automatically within the model. It is also expected to improve and simplify the whole learning process.
One-Class Classification (OCC) (Scholkopf, 2001) has become an active topic in the machine learning domain. The One-Class Support Vector Machine (OCSVM) was first proposed via constructing a hyperplane in a kernel feature space that separates the mapped patterns from the origin with maximum margin. Support vector domain description (SVDD) (Tax, 1999) is another popular OCC method, which seeks the minimum hypersphere that encloses all the data of the target class in a feature space. In this way, it finds a descriptive area that covers the data and excludes the superfluous space that causes the false alarms observed with OCSVM.
However, although OCSVM provides a good representation of the class of interest, it overlooks the discrimination issue between classes. Moreover, the hypersphere model of SVDD is generally not flexible enough to give a tight description of the target class.
Therefore, in our previous works we proposed two Mahalanobis distance based learning machines, called QP-MELM and QP-MHLM respectively, obtained by solving their duals. However, as suggested in (Löfberg, 2004), if both the primal and the dual form of an optimization problem are solvable, the primal form is preferable for its approximation ability. Therefore, (Wei, 2007A) rewrote the MELM in a Second Order Cone Programming (SOCP) representable form and proposed a SOCP-MELM for class description. Applications to real-world UCI benchmark datasets show promising results.
Recently, Wei et al. proposed a novel learning concept called enclosing machine learning (Wei, 2007D), which imitates the human cognition process, i.e. cognizing things of the same kind (obtaining a minimum bounding boundary for class description) and recognizing unknown things via point detection. Wei illustrated the concept using a minimum volume enclosing ellipsoid learner for one-class description and extended it to imbalanced data set classification. In addition, (Wang, 2005) and (Liu, 2006) proposed two SVDD based pattern classification algorithms for imbalanced data sets (called SSPC and MEME respectively for simplicity), which can also be placed within the enclosing machine learning framework.
This chapter is organized as follows. First, the Mahalanobis distance, its properties, and related learning methods are briefly reviewed. Then, new optimization models based on linear programming for data description and classification incorporating the Mahalanobis distance are proposed. Third, benchmark dataset experiments for classification and regression are investigated in detail. Finally, conclusions and discussions are given.
Given the data matrix $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_N]$, the sample mean and covariance are

$\boldsymbol{\mu} = \frac{1}{N}\mathbf{X}\mathbf{1}, \qquad \boldsymbol{\Sigma} = \frac{1}{N}\mathbf{X}\mathbf{X}^{T} - \frac{1}{N^2}\mathbf{X}\mathbf{1}\mathbf{1}^{T}\mathbf{X}^{T}.$  (1)
eigenvalues. This gives the minimum squared-error approximation to the true solution. It should be noted that the pseudoinverse restricts inversion to the range of the operator, i.e. the subspace where it is not degenerate; this is often unavoidable in high-dimensional feature spaces. If the covariance is real symmetric and positive semidefinite, then it can be decomposed as $\boldsymbol{\Sigma} = \mathbf{P}^{T}\mathbf{G}\mathbf{P}$, and thus $\boldsymbol{\Sigma}^{-1} = \mathbf{P}^{T}\mathbf{G}^{-1}\mathbf{P}$. The Mahalanobis distance from a sample $\mathbf{x}$ to the population $\mathbf{X}$ is then

$d^{2}(\mathbf{x}, \mathbf{X}) = (\mathbf{x} - \boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}).$  (2)

The centered kernel matrix is

$\mathbf{K}_C = \mathbf{K} - \frac{1}{N}\mathbf{E}\mathbf{K} - \frac{1}{N}\mathbf{K}\mathbf{E} + \frac{1}{N^2}\mathbf{E}\mathbf{K}\mathbf{E},$  (3)

where $\mathbf{E} = \mathbf{1}\mathbf{1}^{T}$ is an $N \times N$ symmetric matrix. Using (1), we obtain

$\boldsymbol{\Sigma} = \mathbf{X}\mathbf{Z}^{2}\mathbf{X}^{T},$  (4)

where $\mathbf{Z} = \bigl(\frac{1}{N}(\mathbf{I} - \frac{1}{N}\mathbf{E})\bigr)^{1/2}$ is an $N \times N$ symmetric matrix and $\mathbf{I}$ is the $N \times N$ identity matrix.
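The centering of eq. (3) is a one-liner in NumPy ($\mathbf{E}$ is the all-ones matrix):

```python
import numpy as np

def center_kernel(K):
    """Centered kernel matrix of eq. (3):
    K_C = K - (1/N) E K - (1/N) K E + (1/N^2) E K E, with E = ones."""
    n = K.shape[0]
    E = np.ones((n, n))
    return K - E @ K / n - K @ E / n + E @ K @ E / n**2
```

Centering places the all-ones vector in the null space of $\mathbf{K}_C$, a fact used later in the proof of Theorem 1.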
The kernel Mahalanobis distance in the feature space can then be written as

$d^{2}(\phi(\mathbf{x}), \mathbf{X}) = \Bigl(\mathbf{k}(\mathbf{X},\mathbf{x}) - \frac{1}{N}\sum_{i=1}^{N}\mathbf{k}(\mathbf{X},\mathbf{x}_i)\Bigr)^{T} \mathbf{M}^{-1} \Bigl(\mathbf{k}(\mathbf{X},\mathbf{x}) - \frac{1}{N}\sum_{i=1}^{N}\mathbf{k}(\mathbf{X},\mathbf{x}_i)\Bigr),$  (5)

where $\mathbf{M}$ is a positive semidefinite matrix, so $\mathbf{M}^{-1/2}$ can be calculated via singular value decomposition, i.e. $\mathbf{M} = \mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{T}$, $\mathbf{M}^{-1/2} = \mathbf{U}\boldsymbol{\Lambda}^{-1/2}\mathbf{U}^{T}$.
Using the singular value decomposition, we can establish the following theorem.
Theorem 1: Let the eigenstructure of the centered kernel matrix $\mathbf{K}_C$ be $\mathbf{K}_C = \mathbf{Q}^{T}\boldsymbol{\Omega}\mathbf{Q}$. Then the covariance matrix $\boldsymbol{\Sigma}$ can be diagonalized as

$\boldsymbol{\Sigma} = (\boldsymbol{\Omega}^{-1/2}\mathbf{Q}\mathbf{X}^{T})^{T}\,\bigl(\tfrac{1}{N}\boldsymbol{\Omega}\bigr)\,(\boldsymbol{\Omega}^{-1/2}\mathbf{Q}\mathbf{X}^{T}).$  (6)
Proof. The covariance matrix is

$\boldsymbol{\Sigma} = \frac{1}{N}\sum_{i=1}^{N}(\phi(\mathbf{x}_i) - \boldsymbol{\mu})(\phi(\mathbf{x}_i) - \boldsymbol{\mu})^{T} = \frac{1}{N}\mathbf{X}\mathbf{X}^{T} - \frac{1}{N^{2}}\mathbf{X}\mathbf{1}\mathbf{1}^{T}\mathbf{X}^{T} = \mathbf{P}^{T}\mathbf{G}\mathbf{P}.$  (7)

Notice that the eigenvectors necessarily lie in the span of the centered data, so $\mathbf{P}$ can be written as the following linear combination:

$\mathbf{P} = \boldsymbol{\theta}\bigl(\mathbf{X}^{T} - \tfrac{1}{N}\mathbf{E}\mathbf{X}^{T}\bigr).$  (8)
Substituting (8) into (7) yields

$\frac{1}{N}(\mathbf{K}_C)^{2}\boldsymbol{\theta}^{T} = \mathbf{K}_C\boldsymbol{\theta}^{T}\mathbf{G},$  (9)

which simplifies to

$\frac{1}{N}\mathbf{K}_C\boldsymbol{\theta}^{T} = \boldsymbol{\theta}^{T}\mathbf{G}.$  (10)

Since the matrix $\mathbf{G}$ is diagonal and the centered kernel matrix can be decomposed as

$\mathbf{K}_C = \mathbf{Q}^{T}\boldsymbol{\Omega}\mathbf{Q},$  (11)

we can take

$\boldsymbol{\theta} = \mathbf{D}\mathbf{Q},$  (12)

$\mathbf{G} = \frac{1}{N}\boldsymbol{\Omega}.$  (13)
The orthonormality of the eigenvectors requires

$\mathbf{I} = \mathbf{P}\mathbf{P}^{T} = \boldsymbol{\theta}\bigl(\mathbf{X}^{T} - \tfrac{1}{N}\mathbf{E}\mathbf{X}^{T}\bigr)\bigl(\mathbf{X} - \tfrac{1}{N}\mathbf{X}\mathbf{E}\bigr)\boldsymbol{\theta}^{T} = \mathbf{D}\mathbf{Q}\mathbf{K}_C\mathbf{Q}^{T}\mathbf{D}^{T} = \mathbf{D}\boldsymbol{\Omega}\mathbf{D}^{T}.$  (14)

Therefore

$\mathbf{D} = \boldsymbol{\Omega}^{-1/2}.$  (15)
From $\mathbf{K}_C = \mathbf{Q}^{T}\boldsymbol{\Omega}\mathbf{Q}$ we have $\mathbf{Q}\mathbf{K}_C = \boldsymbol{\Omega}\mathbf{Q}$, and hence $\boldsymbol{\Omega}^{-1}\mathbf{Q}\mathbf{K}_C\mathbf{E} = \mathbf{Q}\mathbf{E}$. Expanding $\mathbf{K}_C$,

$\boldsymbol{\Omega}^{-1}\mathbf{Q}\bigl(\mathbf{K} - \tfrac{1}{N}\mathbf{E}\mathbf{K} - \tfrac{1}{N}\mathbf{K}\mathbf{E} + \tfrac{1}{N^{2}}\mathbf{E}\mathbf{K}\mathbf{E}\bigr)\mathbf{E} = \mathbf{Q}\mathbf{E},$  (16)

and, using $\mathbf{E}\mathbf{E} = N\mathbf{E}$,

$\boldsymbol{\Omega}^{-1}\mathbf{Q}\bigl(\mathbf{K}\mathbf{E} - \tfrac{1}{N}\mathbf{E}\mathbf{K}\mathbf{E} - \mathbf{K}\mathbf{E} + \tfrac{1}{N}\mathbf{E}\mathbf{K}\mathbf{E}\bigr) = \mathbf{Q}\mathbf{E},$

so $\mathbf{Q}\mathbf{E} = \boldsymbol{\Omega}^{-1}\mathbf{Q}\,\mathbf{0} = \mathbf{0}$. Consequently

$\mathbf{P} = \mathbf{D}\mathbf{Q}\bigl(\mathbf{X}^{T} - \tfrac{1}{N}\mathbf{E}\mathbf{X}^{T}\bigr) = \boldsymbol{\Omega}^{-1/2}\mathbf{Q}\mathbf{X}^{T},$  (17)

and therefore

$\boldsymbol{\Sigma} = (\boldsymbol{\Omega}^{-1/2}\mathbf{Q}\mathbf{X}^{T})^{T}\,\bigl(\tfrac{1}{N}\boldsymbol{\Omega}\bigr)\,(\boldsymbol{\Omega}^{-1/2}\mathbf{Q}\mathbf{X}^{T}).$  (18)

This completes the proof of Theorem 1.
According to Theorem 1, we can calculate the pseudoinverse $\boldsymbol{\Sigma}^{\dagger}$ to approximate $\boldsymbol{\Sigma}^{-1}$, which gives the kernel Mahalanobis distance

$d^{2}(\phi(\mathbf{x}), \mathbf{X}) = N\,\Bigl(\mathbf{k}(\mathbf{X},\mathbf{x}) - \frac{1}{N}\sum_{i=1}^{N}\mathbf{k}(\mathbf{X},\mathbf{x}_i)\Bigr)^{T} \mathbf{Q}^{T}\boldsymbol{\Omega}^{-2}\mathbf{Q}\, \Bigl(\mathbf{k}(\mathbf{X},\mathbf{x}) - \frac{1}{N}\sum_{i=1}^{N}\mathbf{k}(\mathbf{X},\mathbf{x}_i)\Bigr).$  (20)
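Equation (20) can be evaluated by eigendecomposing the centered kernel matrix; for a linear kernel this reproduces the ordinary Mahalanobis distance computed with the pseudoinverse covariance. The tolerance `tol` for dropping near-zero eigenvalues is an implementation choice.

```python
import numpy as np

def kernel_mahalanobis_sq(K, k_x, tol=1e-10):
    """Kernel Mahalanobis distance of eq. (20): eigendecompose the
    centered kernel matrix K_C = Q^T Omega Q, center the kernel vector
    k(X, x), and return N * k_c^T Q^T Omega^{-2} Q k_c, keeping only
    eigenvalues above `tol` (the pseudoinverse restriction)."""
    n = K.shape[0]
    E = np.ones((n, n))
    K_C = K - E @ K / n - K @ E / n + E @ K @ E / n**2
    w, V = np.linalg.eigh(K_C)            # K_C = V diag(w) V^T
    keep = w > tol
    k_c = k_x - K.mean(axis=1)            # k(X,x) - (1/N) sum_i k(X,x_i)
    proj = V[:, keep].T @ k_c             # project onto retained eigenvectors
    return n * np.sum(proj**2 / w[keep]**2)
```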
As mentioned before, zero eigenvalues often occur in high-dimensional feature spaces, so the above expression only approximates the true inverse of the sample covariance matrix. In Section 3 we will introduce another regularization method for avoiding the zero-eigenvalue condition.
The one-class hyperplane learning machine solves

$\min_{\mathbf{w},\,\xi_i \ge 0,\,\rho}\ \frac{1}{2}\mathbf{w}^{T}\mathbf{w} + \frac{1}{N}\sum_{i=1}^{N}\xi_i - \rho$
$\text{s.t.}\ \mathbf{w}^{T}\mathbf{x}_i \ge \rho - \xi_i,\quad \xi_i \ge 0,\ i = 1, 2, \ldots, N.$  (21)

Its dual

$\max_{\boldsymbol{\alpha}}\ -\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}$
$\text{s.t.}\ \mathbf{0} \le \boldsymbol{\alpha} \le \frac{1}{N}\mathbf{1},\quad \boldsymbol{\alpha}^{T}\mathbf{1} = 1$  (22)

is a quadratic convex optimization problem, where $\boldsymbol{\alpha}$ is the dual variable and $\mathbf{K} = \mathbf{X}^{T}\mathbf{X}$ is a kernel matrix.
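The dual (22) is a small QP; as a sketch, it can be handed to a generic solver (here SciPy's SLSQP). The upper bound `ub` generalizes the $1/N$ box constraint printed in (22), with which the box and the simplex constraint intersect in the single point $\alpha_i = 1/N$.

```python
import numpy as np
from scipy.optimize import minimize

def solve_oc_dual(K, ub=None):
    """Solve the dual QP (22): minimise (1/2) a^T K a subject to
    0 <= a_i <= ub and sum(a) = 1.  A dedicated QP solver would be
    used in practice; SLSQP is enough for a sketch."""
    n = K.shape[0]
    if ub is None:
        ub = 1.0 / n
    res = minimize(
        fun=lambda a: 0.5 * a @ K @ a,
        x0=np.full(n, 1.0 / n),           # feasible starting point
        jac=lambda a: K @ a,              # gradient of the objective
        bounds=[(0.0, ub)] * n,
        constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```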
By using the Mahalanobis distance metric instead of the Euclidean one, the primal becomes

$\min_{\mathbf{w},\,\xi_i \ge 0,\,\rho}\ \frac{1}{2}\mathbf{w}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{w} + \frac{1}{N}\sum_{i=1}^{N}\xi_i - \rho$
$\text{s.t.}\ \mathbf{w}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{x}_i \ge \rho - \xi_i,\quad \xi_i \ge 0,\ i = 1, 2, \ldots, N,$  (23)

with dual

$\max_{\boldsymbol{\alpha}}\ -\frac{1}{2}\boldsymbol{\alpha}^{T}\mathbf{X}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{X}\boldsymbol{\alpha}$
$\text{s.t.}\ \mathbf{0} \le \boldsymbol{\alpha} \le \frac{1}{N}\mathbf{1},\quad \boldsymbol{\alpha}^{T}\mathbf{1} = 1.$  (25)

With the kernel trick,

$\mathbf{X}^{T}\mathbf{X} = \mathbf{K},$  (26)

and, using Theorem 1, the kernel Mahalanobis hyperplane learning machine can be written in kernel form as

$\max_{\boldsymbol{\alpha}}\ -\frac{N}{2}\boldsymbol{\alpha}^{T}\mathbf{K}\mathbf{Q}^{T}\boldsymbol{\Omega}^{-2}\mathbf{Q}\mathbf{K}\boldsymbol{\alpha}$
$\text{s.t.}\ \mathbf{0} \le \boldsymbol{\alpha} \le \frac{1}{N}\mathbf{1},\quad \boldsymbol{\alpha}^{T}\mathbf{1} = 1.$  (27)
To robustify the model, the covariance is allowed to vary within the uncertainty set

$\{\boldsymbol{\Sigma} : \|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}^{0}\|_F \le r\},$  (28)

where $r > 0$ is fixed, $\|\cdot\|_F$ denotes the Frobenius norm, and $\boldsymbol{\Sigma}^{0}$ is estimated via (1). The primal can then be modified as

$\min_{\mathbf{u},\,\xi_i \ge 0,\,\rho}\ \max_{\boldsymbol{\Sigma}}\ \frac{1}{2}\mathbf{u}^{T}\boldsymbol{\Sigma}\mathbf{u} + \frac{1}{N}\sum_{i=1}^{N}\xi_i - \rho$
$\text{s.t.}\ \mathbf{u}^{T}\mathbf{x}_i \ge \rho - \xi_i,\quad \|\boldsymbol{\Sigma} - \boldsymbol{\Sigma}^{0}\|_F \le r,\quad \xi_i \ge 0,\ i = 1, 2, \ldots, N,$  (30)

which reduces to

$\min_{\mathbf{u},\,\xi_i \ge 0,\,\rho}\ \frac{1}{2}\mathbf{u}^{T}\boldsymbol{\Sigma}_r\mathbf{u} + \frac{1}{N}\sum_{i=1}^{N}\xi_i - \rho$
$\text{s.t.}\ \mathbf{u}^{T}\mathbf{x}_i \ge \rho - \xi_i,\quad \xi_i \ge 0,\ i = 1, 2, \ldots, N,$

with $\boldsymbol{\Sigma}_r = \boldsymbol{\Sigma}^{0} + r\mathbf{I}$. Its inverse can be computed through the matrix inversion lemma:

$\boldsymbol{\Sigma}_r^{-1} = (r\mathbf{I} + \mathbf{X}\mathbf{Z}\mathbf{Z}\mathbf{X}^{T})^{-1} = \frac{1}{r}\mathbf{I} - \frac{1}{r}\mathbf{X}\mathbf{Z}(r\mathbf{I} + \mathbf{Z}\mathbf{X}^{T}\mathbf{X}\mathbf{Z})^{-1}\mathbf{Z}\mathbf{X}^{T}.$  (33)
Using the kernel trick, (17) then becomes

max_α −(1/(2r)) α^T (K − K Z M_r^{−1} Z K) α
s.t. 0 ≤ α ≤ (1/N) 1,   (34)
     α^T 1 = 1.

where M_r = rI + ZKZ. Again, we get a standard QP problem, and the inverse of the real symmetric and positive definite matrix M_r can be exactly computed using stable and efficient eigenvalue decomposition methods.
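The Woodbury step in (33)-(34) trades the feature-space inverse for the inverse of the N × N matrix M_r = rI + ZKZ. A small numerical sketch of the identity follows; the concrete choice Z = (1/√N) I (so that X Z Z X^T is the scaled covariance term) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, r = 30, 4, 0.1
X = rng.normal(size=(d, N))
Z = np.eye(N) / np.sqrt(N)     # assumed form of Z: X Z Z X^T = (1/N) X X^T

# Left side: direct inverse of Sigma_r = rI + X Z Z X^T
lhs = np.linalg.inv(r * np.eye(d) + X @ Z @ Z @ X.T)

# Right side: Woodbury form (33) with M_r = rI + Z K Z, K = X^T X
Mr = r * np.eye(N) + Z @ (X.T @ X) @ Z
rhs = np.eye(d) / r - (X @ Z @ np.linalg.inv(Mr) @ Z @ X.T) / r

assert np.allclose(lhs, rhs)
```

In the kernelized setting only M_r needs to be inverted, which is why the robust dual (34) stays computable even when the feature space is infinite-dimensional.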
Mahalanobis Support Vector Machines Made Fast and Robust 237
where R is the radius, ξᵢ ≥ 0 are slack variables, c is the center, d(·) is the given distance metric (the default is the Euclidean norm), C is a tradeoff parameter that controls the size of the sphere and the errors, and N is the number of samples.
The corresponding dual is

max_α Σ_{i=1}^N αᵢ k(xᵢ, xᵢ) − Σ_{i=1}^N Σ_{j=1}^N αᵢ αⱼ k(xᵢ, xⱼ)
s.t. 0 ≤ αᵢ ≤ C, i = 1, 2, ..., N   (36)
     Σ_{i=1}^N αᵢ = 1.
By using the Mahalanobis distance metric instead of the Euclidean distance metric, the primal now becomes:

min_{R, ξ, c} R² + C Σ_{i=1}^N ξᵢ
s.t. (xᵢ − c)^T Σ^{−1} (xᵢ − c) ≤ R² + ξᵢ,   (37)
     ξᵢ ≥ 0, i = 1, 2, ..., N.

with corresponding dual

max_α Σ_{i=1}^N αᵢ xᵢ^T Σ^{−1} xᵢ − Σ_{i=1}^N Σ_{j=1}^N αᵢ αⱼ xᵢ^T Σ^{−1} xⱼ
s.t. 0 ≤ αᵢ ≤ C, i = 1, 2, ..., N   (38)
     Σ_{i=1}^N αᵢ = 1.

Using the kernel trick and the following equations,

φ(xᵢ)^T Φ(X) = k(xᵢ, X),   c = Σ_{i=1}^N αᵢ φ(xᵢ),   (39)

the kernel Mahalanobis Ellipsoidal learning machine can be written in kernel form as:
max_α Σ_{i=1}^N αᵢ k(xᵢ, X) Q^T Ω^{−2} Q k(X, xᵢ)^T − Σ_{i=1}^N Σ_{j=1}^N αᵢ αⱼ k(xᵢ, X) Q^T Ω^{−2} Q k(X, xⱼ)^T
s.t. 0 ≤ αᵢ ≤ C, i = 1, 2, ..., N   (40)
     Σ_{i=1}^N αᵢ = 1.

Accordingly we can obtain the following Mahalanobis distance of the sample x from the center c in the feature space:

d²(x, c) = (φ(x) − c)^T Σ⁺ (φ(x) − c)   (41)

The parameters R, ξᵢ can be determined by the following relations via the KKT conditions:

d²(xᵢ, c) < R²        ⇒ αᵢ = 0, ξᵢ = 0
d²(xᵢ, c) = R²        ⇒ 0 < αᵢ < C, ξᵢ = 0   (42)
d²(xᵢ, c) = R² + ξᵢ   ⇒ αᵢ = C, ξᵢ > 0
{Σ : ||Σ − Σ₀||_F ≤ r}   (43)

and the robust primal becomes

min_{R, ξ, c} max_{Σ: ||Σ − Σ₀||_F ≤ r} R² + C Σ_{i=1}^N ξᵢ

Suppose Σ = Σ₀ + r ∆Σ. Then for any given v, according to the Cauchy-Schwarz inequality, we get

v^T ∆Σ v ≤ ||v||₂ ||∆Σ v||₂ ≤ ||v||₂² ||∆Σ||_F ≤ v^T v

This holds because of the compatibility of the Frobenius matrix norm with the Euclidean vector norm, and because ||∆Σ||_F ≤ 1. For ∆Σ the unit matrix, this upper bound is attained. Hence the worst case in the uncertainty set is Σ_r = Σ₀ + rI.
The corresponding robust dual is

max_α Σ_{i=1}^N αᵢ xᵢ^T Σ_r^{−1} xᵢ − Σ_{i=1}^N Σ_{j=1}^N αᵢ αⱼ xᵢ^T Σ_r^{−1} xⱼ
s.t. 0 ≤ αᵢ ≤ C, i = 1, 2, ..., N   (47)
     Σ_{i=1}^N αᵢ = 1.

By using the Woodbury formula, (47) can be written in kernel form as

max_α Σ_{i=1}^N αᵢ (k(xᵢ, xᵢ) − k(xᵢ, X) Z M_r^{−1} Z k(X, xᵢ)) − Σ_{i=1}^N Σ_{j=1}^N αᵢ αⱼ (k(xᵢ, xⱼ) − k(xᵢ, X) Z M_r^{−1} Z k(X, xⱼ))
s.t. 0 ≤ αᵢ ≤ C, i = 1, 2, ..., N   (50)
     Σ_{i=1}^N αᵢ = 1.

where M_r = rI + ZKZ. Again, the inverse of the real symmetric and positive definite matrix M_r can be computed stably and efficiently.
In the feature space, the squared Mahalanobis distance of a sample xᵢ from the data mean then becomes

d²(φ(xᵢ)) = φ(xᵢ)^T Σ⁺ φ(xᵢ) = ||√N Ω^{−1} Q kᵢ||₂²   (51)

and the primal of the LP-representable Mahalanobis data description machine is

min_{R, ξ} R² + C Σ_{i=1}^N ξᵢ
s.t. ||√N Ω^{−1} Q kᵢ||₂² ≤ R² + ξᵢ,   (52)
     ξᵢ ≥ 0, i = 1, 2, ..., N.
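Identity (51) — the feature-space Mahalanobis distance reduced to a kernel expression — can be verified numerically for a linear kernel, reusing the eigensystem of the centered kernel matrix and taking the centered kernel column as kᵢ (data, dimensions and sample index are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 25, 4
X = rng.normal(size=(d, N))
Xc = X - X.mean(axis=1, keepdims=True)      # centered data, columns are samples

Kc = Xc.T @ Xc                              # centered (linear) kernel matrix
w, V = np.linalg.eigh(Kc)
keep = w > 1e-10
Om, Q = w[keep], V[:, keep].T

Sigma_pinv = N * Xc @ Q.T @ np.diag(Om**-2) @ Q @ Xc.T

i = 7
k_i = Kc[:, i]                              # i-th centered kernel column
lhs = Xc[:, i] @ Sigma_pinv @ Xc[:, i]      # phi(x_i)^T Sigma^+ phi(x_i)
rhs = np.sum((np.sqrt(N) * np.diag(1.0 / Om) @ Q @ k_i) ** 2)  # ||sqrt(N) Omega^{-1} Q k_i||^2

assert np.isclose(lhs, rhs)
```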
The Lagrangian of (52) is

L(R², ξᵢ, αᵢ, βᵢ) = R² + C Σ_{i=1}^N ξᵢ − Σ_{i=1}^N βᵢ ξᵢ − Σ_{i=1}^N αᵢ (R² + ξᵢ − ||√N Ω^{−1} Q kᵢ||₂²)   (53)

where αᵢ, βᵢ ≥ 0 are the Lagrange multipliers or dual variables. According to the KKT conditions, equating the partial derivatives of L with respect to R² and ξᵢ to zero yields:

∂L/∂R² = 1 − Σ_{i=1}^N αᵢ = 0   (54)
∂L/∂ξᵢ = C − αᵢ − βᵢ = 0   (55)

From (55), we get

βᵢ = C − αᵢ ≥ 0   (56)
0 ≤ αᵢ ≤ C   (57)

Using (57), and substituting (54) and (55) into (53), results in the dual problem:

max_α Σ_{i=1}^N αᵢ ||√N Ω^{−1} Q kᵢ||₂²
s.t. Σ_{i=1}^N αᵢ = 1,   (58)
     0 ≤ αᵢ ≤ C.

We can see that the optimization problem (58) is in linear programming form. This LP form is superior to a QP form in computational complexity.
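Because (58) has a linear objective, box bounds and a single equality constraint, its solution can even be written down greedily: mass C is assigned to the samples farthest from the centre until the budget Σαᵢ = 1 is spent (this assumes NC ≥ 1, so the problem is feasible). A sketch with synthetic distances standing in for ||√N Ω^{−1} Q kᵢ||₂²:

```python
import numpy as np

def solve_lp_58(d2, C):
    """Greedy solution of: max sum_i a_i * d2_i  s.t. sum_i a_i = 1, 0 <= a_i <= C.
    The optimum saturates a_i = C on the samples with the largest distances."""
    alpha = np.zeros_like(d2)
    budget = 1.0
    for i in np.argsort(-d2):               # farthest samples first
        alpha[i] = min(C, budget)
        budget -= alpha[i]
        if budget <= 0:
            break
    return alpha

# Synthetic squared distances for 20 training points
rng = np.random.default_rng(2)
d2 = rng.uniform(0.5, 2.0, size=20)
alpha = solve_lp_58(d2, C=0.2)
assert np.isclose(alpha.sum(), 1.0) and alpha.max() <= 0.2 + 1e-12
```

In practice any mature LP solver can be used instead; the greedy form just makes the structure of the solution visible.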
From the solution of (58), the samples with αᵢ = 0 will fall inside the ellipsoid. The radius R can be recovered from any support vector x_sv as

R = ||√N Ω^{−1} Q k_sv||₂   (59)

In the robust case, the feature-space distance becomes

d_r²(φ(xᵢ)) = φ(xᵢ)^T Σ_r^{−1} φ(xᵢ) = ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂²
and the robust primal is

min_{R, ξ} R² + C Σ_{i=1}^N ξᵢ
s.t. ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂² ≤ R² + ξᵢ,   (60)
     ξᵢ ≥ 0, i = 1, 2, ..., N.

where kᵢ := (k(x₁, xᵢ), k(x₂, xᵢ), ..., k(x_N, xᵢ))^T.
And the dual form is

max_α Σ_{i=1}^N αᵢ ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂²
s.t. Σ_{i=1}^N αᵢ = 1,   (61)
     0 ≤ αᵢ ≤ C.
Accordingly, we can get the ellipsoidal radius function in robust form for any sample x:

f(x) = ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q k_x||₂²   (62)

where d is the shortest distance from the hyper-ellipsoid to the closest target and outlier class samples, and ||z||_M := √(z^T Σ^{−1} z) for any vector z. Note that the distance is now under the Mahalanobis distance metric, and d acts just as the margin of the SVM.
Fig. 1. Geometric illustrations of separation between two classes via different algorithms. (a) SVM. (b) Sphere shell separation. (c) Ellipsoidal shell separation.
Obviously, there are many such hyper-ellipsoids which satisfy (63). An ideal criterion is to maximize the separation ratio (R + d)/(R − d). But this objective is nonlinear and cannot be dealt with directly. Yet, it is easy to show that maximization of the separation ratio is equivalent to minimization of R²/d². Using the Taylor series formula, R²/d² can be approximated as

R²/d² ≈ R₀²/d₀² + (1/d₀²)(R² − R₀²) − (2R₀²/d₀³)(d − d₀)   (64)

Now, the primal of ellipsoidal shell separation in the original space can be written as

min_{R, d} R² − ν d²
s.t. yᵢ (R² − ||xᵢ||²_M) ≥ d²   (65)

where ν is a constant which controls the ratio of the radius to the separation margin.
In kernel form, (65) becomes

min_{R, d} R² − ν d²
s.t. yᵢ (R² − ||√N Ω^{−1} Q kᵢ||₂²) ≥ d²   (66)

and, in robust form,

min_{R, d} R² − ν d²
s.t. yᵢ (R² − ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂²) ≥ d²   (67)

Introducing slack variables ξᵢ, the constraints in the original space are relaxed as

yᵢ (R² − ||xᵢ||²_M) ≥ d² − ξᵢ,
ξᵢ ≥ 0, i = 1, ..., N   (68)

and the soft-margin primal in kernel form becomes

min_{R, d, ξ} R² − ν d² + C Σ_{i=1}^N ξᵢ
s.t. yᵢ (R² − ||√N Ω^{−1} Q kᵢ||₂²) ≥ d² − ξᵢ,   (69)
     ξᵢ ≥ 0, i = 1, ..., N

Its robust counterpart is

min_{R, d, ξ} R² − ν d² + C Σ_{i=1}^N ξᵢ
s.t. yᵢ (R² − ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂²) ≥ d² − ξᵢ,   (70)
     ξᵢ ≥ 0, i = 1, ..., N
In order to obtain the dual form, the Lagrangian for the primal form (69) will be as follows:

L(R², d², ξᵢ, αᵢ, βᵢ) = R² − ν d² + C Σ_{i=1}^N ξᵢ − Σ_{i=1}^N αᵢ (yᵢ (R² − ||√N Ω^{−1} Q kᵢ||₂²) − d² + ξᵢ) − Σ_{i=1}^N βᵢ ξᵢ   (71)

∂L/∂R² = 1 − Σ_{i=1}^N αᵢ yᵢ = 0   (72)
∂L/∂d² = −ν + Σ_{i=1}^N αᵢ = 0   (73)
∂L/∂ξᵢ = C − αᵢ − βᵢ = 0   (74)

Using (72)-(74), the Lagrange function for the primal form (69) is simplified into the following form:
max_α Σ_{i=1}^N αᵢ yᵢ ||√N Ω^{−1} Q kᵢ||₂²
s.t. Σ_{i=1}^N αᵢ yᵢ = 1,
     Σ_{i=1}^N αᵢ = ν,   (75)
     0 ≤ αᵢ ≤ C, i = 1, ..., N

We can see that (75) is in linear programming form. Therefore, we can easily solve it using any mature and stable LP solver, and it is expected to extend easily to large-scale datasets.
Accordingly, we can also conclude the dual form for (70) as follows:

max_α Σ_{i=1}^N αᵢ yᵢ ||((1/N) Ω + rI)^{−1/2} Ω^{−1/2} Q kᵢ||₂²
s.t. Σ_{i=1}^N αᵢ yᵢ = 1,
     Σ_{i=1}^N αᵢ = ν,   (76)
     0 ≤ αᵢ ≤ C, i = 1, ..., N
5. Applications
5.1 Mahalanobis One Class SVMs
We investigate the initial performances of our proposed Mahalanobis Ellipsoidal Learning
Machine (MELM) using three real-world datasets (ionosphere, heart and sonar) from the
UCI machine learning repository. To see how well the MELM algorithm performs with
respect to other learning algorithms, we compared it with the OCSVM, SVDD and MOCSVM algorithms using Gaussian kernels k(x, y) = exp(−γ ||x − y||²). As for the MOCSVM, we only use a single RBF kernel for the performance comparison.
We treat each class as the "normal" data in separate experiments. We randomly choose 80% of the points as training data and the remaining 20% as testing data. We determined the optimal values of γ and C for the RBF kernels by 5-fold cross validation. The regularization constant was set to 0.01.
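For reference, the Gaussian kernel used in these experiments can be sketched as follows; gamma is the width parameter tuned by the 5-fold cross validation mentioned above:

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """Gaussian kernel matrix k(x, y) = exp(-gamma * ||x - y||^2), rows are samples."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * sq)

X = np.array([[0.0, 0.0],
              [1.0, 0.0]])
K = rbf_kernel(X, X, gamma=1.0)
assert np.allclose(np.diag(K), 1.0) and np.isclose(K[0, 1], np.exp(-1.0))
```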
The datasets used and the results obtained by the four algorithms are summarized in Table
1. We can notice that the performance of our proposed Mahalanobis Ellipsoidal Learning
Machine is competitive with or even better than the other approaches for the three datasets
studied.
From Table 1, we also see that, on all 3 datasets, the results obtained by the Mahalanobis
distance based learning algorithm are slightly better than the corresponding results of the
other two Euclidean distance based methods.
6. Conclusions
In this chapter, we extended support vector data description and one-class support vector machines by utilizing sample covariance matrix information and using the Mahalanobis distance metric instead of the Euclidean distance metric. The proposed Mahalanobis Ellipsoidal Learning Machine can easily be cast as a robust optimization problem by introducing an uncertainty model into the estimation of the sample covariance matrix. We proposed an LP-representable Mahalanobis Data Description Machine for one-class classification, together with its robust variant. The results of applications to three UCI real-world datasets show promising performance.
We also proposed an LP-based Minimum Mahalanobis Enclosing Ellipsoid (MMEE) pattern classification algorithm for general two-class dataset classification. The MMEE method can be solved in the kernel form of an LP, and it likewise admits a robust variant under the covariance uncertainty model. Initial applications to several UCI real-world datasets show promising performance. The initial results show that the proposed methods possess both good description and discrimination ability for supervised learning problems. Moreover, the data description with a non-hyperplane bounding decision boundary gives better discrimination performance than its hyperplane counterpart in the context of supervised learning.
7. References
Abe, Shigeo (2005). Training of support vector machines with Mahalanobis kernels. In
Proceeding of ICANN 2005 Conference, LNCS 3697, W. Duch et al. Eds. Springer-
Verlag, Berlin Heidelberg, pp. 571–576
Blake, C.; Keogh, E., Merz, C. J. (1998). UCI repository of machine learning databases, University of California, Irvine, CA. Available at http://www.ics.uci.edu/~mlearn/MLRepository.html
Chapelle, O. (2007). Training a Support Vector Machine in the Primal. Neural Computation,
Vol.19, 1155-1178
Lanckriet, G.; El Ghaoui, L., Jordan, M. (2003). Robust novelty detection with single-class
MPM. In Advances in Neural Information Processing Systems 15, S.Becker, S. Thrun,
and K. Obermayor, Eds. MIT Press, Cambridge
Li, Yinghong; Wei, Xunkai. (2004). Engineering applications of support vector machines, Weapon
Industry Press, Beijing, China
Liu, Y.; Zheng, Y.-F. (2006). Maximum Enclosing and Maximum Excluding Machine for
Pattern Description and Discrimination. In: Proceeding of ICPR 2006 Conference.
Vol. 3, 129-132
Löfberg, J. (2004). YALMIP: A Toolbox for Modeling and Optimization in MATLAB. In:
Proceedings of the CACSD Conference
Ruiz, A.; Lopez-de-Teruel, P. E. (2001). Nonlinear Kernel-based Statistical Pattern Analysis.
IEEE Transactions on Neural Networks, Vol.2, No.1, 16-32
Scholkopf, B.; Platt, J., Shawe-Taylor, J., Smola, A., Williamson, R. (2001). Estimating the
Support of a High Dimensional Distribution. Neural Computation, Vol.13, No.7,
1443-1471
Tax, D.; Duin, R. (1999). Support Vector Domain Description. Pattern Recognition Letters,
Vol.20, No.14, 1191-1199
Tsang, Ivor W.; Kwok, James T., and Li, Shutao. (2006). Learning the kernel in Mahalanobis
one-class support vector machines. In Proceeding of IJCNN 2006 Conference,
Vancouver, BC, Canada, 1169-1175
Wang, J.; Neskovic, P., Cooper, Leon N. (2005). Pattern Classification via Single Spheres. In:
A. Hoffmann, H. Motoda, and T. Scheffer (eds.): Proceeding of DS 2005 Conference.
LNAI, Vol. 3735, 241–252
Wei, X.-K; Huang, G.-B., Li, Y.-H. (2007A). Mahalanobis Ellipsoidal Learning Machine for
One Class Classification. In: 2007 International Conference on Machine Learning and
Cybernetics. Vol. 6, 3528-3533
Wei, X.-K.; Huang, G.-B., Li, Y.-H. (2007B). A New One Class Mahalanobis Hyperplane
Learning Machine based on QP and SVD. In LSMS2007
Wei, X.-K.; Li, Y.-H., Feng, Y., Huang, G.-B. (2007C) Solving Mahalanobis Ellipsoidal
Learning Machine via Second Order Cone Programming. In: De-Shuang Huang,
et al. (eds.): Proceeding of ICIC 2007 Conference. CCIS, Vol.2, 1186-1194
Wei, X.-K.; Li, Y.-H., Li, Y.-F. (2007D). Enclosing Machine Learning: Concepts and
Algorithms. International Journal of Neural Computing and Applications. Vol. 17, No.
3 , 237-243
Wei, X.-K.; Löfberg J., Feng, Y., Li, Y.-H, Li, Y.-F. (2007E). Enclosing Machine Learning for
Class Description. In: D. Liu et al. (eds.): Proceedings of ISNN2007 Conference.
LNCS, Vol. 4491, 428-437
On-line learning of fuzzy rule emulated networks for a class of unknown nonlinear discrete-time
controllers with estimated linearization 251
1. Introduction
Recently, the linearization of a class of unknown discrete-time dynamic systems has attracted considerable attention for controller design. The unknown functions obtained after system linearization have been estimated by several methods, including artificial intelligence techniques such as neural networks, fuzzy logic systems and neurofuzzy networks. In a number of published articles, system-theoretic issues have been introduced and addressed concerning stabilization, tracking performance and parameter boundedness. For all of these cases, the results are validated in the domain around the equilibrium point or state (9; 11). These methods of linearization, including local linearization, Taylor series expansion and feedback linearization, impose Lipschitz conditions (4; 6; 10; 14; 18). The closed-loop system stability and tracking error have been analyzed in the case of neural network adaptive control (5; 7), but during the learning phase the stability and convergence cannot be ensured because of the special conditions required. System stability and bounded-signal analysis have been verified in (1; 13) and the references therein. However, these nonlinear systems under control must be obtained in the form y(k + 1) = f (k) + g(k)u(k), where y(k) and u(k) are the system output and the control input at time index k, respectively, and f (k) and g(k) are unknown nonlinear functions. A small learning rate is often chosen to solve the stability problem, but then the convergence is very slow. Discrete-time projection has been introduced for adaptive control systems in (16). The node number of multi-layer neural networks can strongly affect closed-loop stability and tracking performance. In (15), the unknown nonlinear part has been compensated by neural networks and closed-loop system stability has also been guaranteed for a class of discrete-time systems. Nevertheless, this algorithm must be redesigned when the operating point is changed. For robustness, the dead-zone function has been applied to feedback linearization systems (8), but this control algorithm is limited to systems with slow trajectory tracking. In this chapter, we discuss a controller for a class of nonlinear discrete-time systems whose unknown nonlinear functions are estimated by Multi-input Fuzzy Rules Emulated Networks (MIFRENs). These nonlinear functions arise when the control law is constructed, and they are completely unknown a priori. All adjustable parameters inside the MIFRENs are automatically tuned by the proposed learning algorithm. By the theoretical analysis, these parameters are all bounded during system operation without any off-line learning phase. The closed-loop tracking error is also bounded by the universal function approximation property of MIFREN.
2. Preliminaries
2.1 Formulation of Nonlinear discrete-time systems
In this work, we focus on the discrete-time systems which can be described by

y(k + 1) = f (p(k), u(k)),   (1)

where f (·, ·) is an unknown nonlinear function, k is the time index, y(k) ∈ R denotes the measurable output, u(k) ∈ R is the control effort and p(k) = [y(k), y(k − 1), . . . , y(k − n + 1), u(k − 1), u(k − 2), . . . , u(k − m + 1)] with m ≤ n. For the system design in the next section, the following assumptions are needed.
Assumption (System derivative): Let us define two compact sets Ωy and Ωu for the system output y and the control effort u, respectively. The derivative of f (·, ·) in (1) with respect to the control effort u(k) always exists ∀k = 1, 2, · · · and 0 < |∂ f (·, u)/∂u| ≤ ȳu when y(·) ∈ Ωy and u(·) ∈ Ωu, where ȳu is a finite positive value.
Assumption (Existence of controller): For any desired trajectory r(k), let the ideal control effort u∗(k) of the system (1) exist, given by

u∗(k) = gu(p(k), r(k + 1)).   (2)

With the ideal control effort obtained by (2), the controlled system provides the desired trajectory as output:

r(k + 1) = f (p(k), u∗(k)). (3)

Let u∗(k) ∈ Ωu∗ and r(k) ∈ Ωr, for the output y(k) ∈ Ωy such that Ωr ⊂ Ωy. The function gu(·) is a one-to-one mapping of Ωr into Ωu∗, that is Ωu∗ ⊂ Ωu. With the last assumption, gu(·) being smooth and Ωr a compact set, Ωu∗ must be a compact set also. A clear illustration is given in Fig. 1.
where β_T is the target linear parameter vector of MIFREN, Fµ(·) is the rule vector at MIFREN's rule layer, n̂ and m̂ are designed delay-order integers for y and u, respectively, and ε(k) stands for the MIFREN function approximation error. Eventually, the function approximation result of MIFREN can be given as
where β̂(k) is the actual linear parameter vector of MIFREN. The vector β̂(k) can be automatically tuned via the proposed algorithm, as will be discussed in the next section. In this subsection, it is shown that MIFREN has the property of a universal function approximator via the Stone-Weierstrass theorem (1; 17).
Proof: The proof is omitted here; the interested reader can refer to (2) and (3).
3. Controller design
In this section, the controller for system given in (1) is constructed with the approximated
linearization and MIFRENs approximation.
where ūk = γu(k) + (1 − γ)u(k − 1) with 0 ≤ γ ≤ 1 and ∆u(k) = u(k) − u(k − 1). To simplify,
(6) can be rewritten as
y(k + 1) = f ( p(k), u(k − 1)) + f 1 ( p(k ), u(k − 1))∆u(k) + f 2 ( p(k), ūk )∆u2 (k), (7)
when f1(p(k), u(k − 1)) = ∂ f (p(k), u)/∂u |_{u=u(k−1)} and f2(p(k), ūk) = (1/2) ∂² f (p(k), u)/∂u² |_{u=ūk}. By using (2) and the second assumption mentioned in the previous section, f2(·, ·) can be given by
f2(p(k), ūk) ∆u²(k) = f2(p(k), γu(k) + (1 − γ)u(k − 1)) [u(k) − u(k − 1)]²,   (8)

f2(p(k), ūk) ∆u²(k) = f2(p(k), γ gu(p(k), y(k + 1)) + (1 − γ)u(k − 1))
                       × [gu(p(k), y(k + 1)) − u(k − 1)]²
                     = f̄2(p(k), y(k + 1)).   (10)
y ( k + 1) = f ( p(k), u(k − 1)) + f 1 ( p(k), u(k − 1))∆u(k) + f¯2 ( p(k), y(k + 1)),
= f 3 ( p(k), y(k + 1)) + f 1 ( p(k), u(k − 1))∆u(k), (11)
where f3(p(k), y(k + 1)) = f (p(k), u(k − 1)) + f̄2(p(k), y(k + 1)). In (11), clearly, we are faced with a causality problem. Fortunately, with the second assumption, the ideal control effort u∗(k) can provide r(k + 1) as described in (3); thus we have
To continue our design procedure, the ideal control effort u∗(k) can be obtained by

u∗(k) = u(k − 1) + [r(k + 1) − f3(p(k), r(k + 1))] / f1(p(k), u(k − 1))
      = u(k − 1) + [1 / f1(p(k), u(k − 1))] r(k + 1) − f3(p(k), r(k + 1)) / f1(p(k), u(k − 1)),   (13)

or

u∗(k) = u(k − 1) + f1∗(p(k)) r(k + 1) − f2∗(p(k), r(k + 1)),   (14)

when f1∗(p(k)) = 1 / f1(p(k), u(k − 1)) and f2∗(p(k), r(k + 1)) = f3(p(k), r(k + 1)) / f1(p(k), u(k − 1)). From the control law given by (14), the singularity problem of 1 / f1(p(k), u(k − 1)) can be avoided by the MIFREN approximation, which will be discussed later. Considering the ideal control effort in (14), the nonlinear functions f1∗(·) and f2∗(·, ·) are unknown. In this work, two MIFRENs are constructed to approximate f1∗(·) and f2∗(·, ·): MIFREN1 and MIFREN2 , respectively. We have
have
u∗ (k) = u(k − 1) + [ β∗1 T F1 ( p(k)) + ε 1 (k)]r (k + 1) − β∗2 T F2 ( p(k), r (k + 1)) − ε 2 (k), (15)
where F1 (·) and F2 (·) are rule-functions of MIFREN1 and MIFREN2 , respectively, β∗1 =
[ β∗1,1 β∗1,2 · · · β∗1,n1 ] T , β∗2 = [ β∗2,1 β∗2,2 · · · β∗2,n2 ] T are ideal weight vectors, n1 and n2
denote number of rules for each MIFREN and ε 1 (·) and ε 2 (·) are approximation errors. Let
us neglect these errors and use the actual weight vectors β 1 (k) and β 2 (k); thus the proposed control law can be given by
With this control equation, the causality problem has been solved by the MIFREN approximation of the unknown nonlinear functions. In the next subsection, the system performance will be analyzed with the designed parameters and the main theorem.
e ( k ) = r ( k ) − y ( k ), (17)
or
e ( k + 1) = r ( k + 1) − y ( k + 1), (18)
for time index k + 1. Substitute y(k + 1) from (11) into (18), we have
By using Taylor expansion and the mean value theorem, the control error in (19) can be obtained as
e(k + 1) = r(k + 1) − f3(p(k), r(k + 1)) + ∂ f3(p(k), y)/∂y |_{y=ȳk+1} (y(k + 1) − r(k + 1)) − f1(p(k)) ∆u(k),   (20)
where ȳk+1 is between r (k + 1) and y(k + 1). Let us consider the system in (11) with the control
effort given by (9), we have
y(k + 1) = f 3 ( p(k), y(k + 1)) + f 1 ( p(k))[ gu ( p(k), y(k + 1)) − u(k − 1)]. (21)
Case II: For this second case, we reconsider (21) again with y(k + 1) ≠ r(k + 1), take the derivative with respect to y(k + 1) on both sides of (21) and use Taylor expansion:
y(k + 1) − r(k + 1) = ∂ f3(p(k), y)/∂y |_{y=ȳ(k+1)} [y(k + 1) − r(k + 1)]
                      + f1(p(k)) ∂gu(p(k), y)/∂y |_{y=y̆(k+1)} [y(k + 1) − r(k + 1)].   (24)
With the previous assumption, we still have ȳ(k + 1) = y̆(k + 1) and y(k + 1) ≠ r(k + 1); thus (24) can be rewritten as
Taking the results from both cases, the following relation can be obtained
or

f1(p(k)) ∂gu(p(k), y)/∂y |_{y=ȳ(k+1)} e(k + 1) = r(k + 1) − f3(p(k), r(k + 1)) − f1(p(k)) ∆u(k).   (28)
On-line learning of fuzzy rule emulated networks for a class of unknown nonlinear discrete-time
controllers with estimated linearization 257
For the controllable system in (11), clearly, f1(p(k)) ≠ 0 and u∗(k) = gu(p(k), r(k + 1)) or u(k) = gu(p(k), y(k + 1)); thus the system sensitivity [∂u/∂y]^{−1} can be obtained as
when β̃ᵢ^T(k) = βᵢ∗^T − βᵢ^T(k) for i = 1, 2 and εt(k) = ε1(k) r(k + 1) − ε2(k).
where η is the selected learning rate which will be discussed next and D(·) is the dead-zone
function which can be defined by
e(k) − ε m if e(k) > ε m
D(e(k)) = 0 if |e(k)| ≤ ε m (35)
e(k) + ε m if e(k) < −ε m ,
258 New Advances in Machine Learning
when |γy (k)ε t (k)| ≤ ε m as a small positive number. In the case of |e(k − 1)| > ε m , with the
dead-zone function (35) and the next time-index error (33), we have
D(e(k + 1)) = α D γy (k) βT1 (k) βT2 (k) F (k), (36)
where 0 < α D ≤ 1.
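The dead-zone function (35) translates directly into code; eps_m plays the role of ε_m:

```python
def dead_zone(e, eps_m):
    """Dead-zone function D(e) from (35): zero inside [-eps_m, eps_m],
    shifted towards zero by eps_m outside the band."""
    if e > eps_m:
        return e - eps_m
    if e < -eps_m:
        return e + eps_m
    return 0.0

assert abs(dead_zone(0.5, 0.1) - 0.4) < 1e-12
assert abs(dead_zone(-0.5, 0.1) + 0.4) < 1e-12
assert dead_zone(0.05, 0.1) == 0.0
```

Inside the band the tuning law (34) leaves the weights untouched, which is exactly what prevents the small residual error εt(k) from driving parameter drift.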
Let us define β̃Σ(k + 1) = β̃₁^T(k + 1) β̃₁(k + 1) + β̃₂^T(k + 1) β̃₂(k + 1). From the tuning law given by (34) and β̃ᵢ^T(k) = βᵢ∗^T − βᵢ^T(k), we have

β̃Σ(k + 1) = β̃₁^T(k) β̃₁(k) + β̃₂^T(k) β̃₂(k) − (2η / (ȳu ||F(k)||²)) [β̃₁(k); β̃₂(k)]^T F(k) D(e(k + 1))
             + (η² / (ȳu² ||F(k)||⁴)) ||F(k)||² D²(e(k + 1)).   (41)
Substituting (41) into (40) and using (36), we obtain

∆Vβ̃(k) = −(2η / (ȳu ||F(k)||²)) [β̃₁(k); β̃₂(k)]^T F(k) D(e(k + 1)) + (η² / (ȳu² ||F(k)||⁴)) ||F(k)||² D²(e(k + 1))
        = −(2η / (αD γy(k) ȳu ||F(k)||²)) D²(e(k + 1)) + (η² / (ȳu² ||F(k)||²)) D²(e(k + 1))
        = (−2η / (αD γy(k) ȳu) + η² / ȳu²) D²(e(k + 1)) / ||F(k)||².   (42)
With the selected learning rate defined by (37) and (38) and γy(k) given in (29), the first difference of the Lyapunov function is negative; thus β̃ᵢ^T(k) for i = 1, 2 are bounded.
Remark: Normally, without loss of generality, ȳu is assumed to be positive, thus γy(k) < ȳu ∀k. The bounded tracking error for the closed-loop system is introduced by the following theorem.
Theorem 3.1 (Bounded tracking error). For the nonlinear discrete-time system given in (1) with the control law defined in (16), let us define a compact set Ωε = {e(k) : |e(k)| ≤ 4εm}; then the ultimate bound on the tracking error is lim_{k→∞} |e(k)| ≤ 4εm, i.e. e(k) ultimately lies in the compact set Ωε.
Proof: Let a Lyapunov candidate function be given by

Ve(k) = (η / (2 ȳu² Fo²)) e²(k) + Vβ̃(k),   (43)

when Fo is defined by 0 < ||F(k)|| ≤ Fo, ∀k. The first difference can be obtained by

∆Ve(k) = Ve(k + 1) − Ve(k) = (η / (2 ȳu² Fo²)) [e²(k + 1) − e²(k)] + ∆Vβ̃(k).   (44)
Substituting (42) into (44), we have

∆Ve(k) = (η / (2 ȳu² Fo²)) [e²(k + 1) − e²(k)] − (2η / (αD γy(k) ȳu ||F(k)||²)) D²(e(k + 1))
          + (η² / (ȳu² ||F(k)||²)) D²(e(k + 1)).   (45)

From the learning rate given by (37)-(38), we can rearrange (45) as

∆Ve(k) < (η / (2 ȳu² Fo²)) e²(k + 1) − (η / (αD γy(k) ȳu ||F(k)||²)) D²(e(k + 1))
       < (η / (2 ȳu² Fo²)) e²(k + 1) − (η / (ȳu² Fo²)) D²(e(k + 1))
       = (η / (2 ȳu² Fo²)) [e²(k + 1) − 2 D²(e(k + 1))].   (46)
In this proof, we need only consider the case |e(k + 1)| > εm. With |e(k + 1)| > εm, the dead-zone function in (35) can be written as

D(e(k + 1)) = e(k + 1) − εm sign{e(k + 1)}.   (47)

Substituting (47) into (46), we have

∆Ve(k) < (η / (2 ȳu² Fo²)) [e²(k + 1) − 2 [e(k + 1) − εm sign{e(k + 1)}]²]
       = (η / (2 ȳu² Fo²)) [−e²(k + 1) + 4 e(k + 1) εm sign{e(k + 1)} − 2 εm²]
       = (η / (2 ȳu² Fo²)) [−e²(k + 1) − 2 εm² + 4 |e(k + 1)| εm]
       < (η / (2 ȳu² Fo²)) [−e²(k + 1) + 4 |e(k + 1)| εm].   (48)
Considering the result in (48), clearly, ∆Ve(k) is always negative when |e(k + 1)| > 4εm; thus ∆Ve(k) < 0 when |e(k + 1)| is outside the compact set Ωε.
The system performance is demonstrated in two cases: the nominal system and robust control.
Rule 1 If y(k) is N and u(k − 1) is N Then f 1,1 (k) = β 1,1 (k) F1,1 (k),
Rule 2 If y(k) is N and u(k − 1) is Z Then f 1,2 (k) = β 1,2 (k) F1,2 (k),
Rule 3 If y(k) is N and u(k − 1) is P Then f 1,3 (k) = β 1,3 (k) F1,3 (k),
Rule 4 If y(k) is Z and u(k − 1) is N Then f 1,4 (k) = β 1,4 (k) F1,4 (k),
MIFREN1 Rule 5 If y(k) is Z and u(k − 1) is Z Then f 1,5 (k) = β 1,5 (k) F1,5 (k),
Rule 6 If y(k) is Z and u(k − 1) is P Then f 1,6 (k) = β 1,6 (k) F1,6 (k),
Rule 7 If y(k) is P and u(k − 1) is N Then f 1,7 (k) = β 1,7 (k) F1,7 (k),
Rule 8 If y(k) is P and u(k − 1) is Z Then f 1,8 (k) = β 1,8 (k) F1,8 (k),
Rule 9 If y(k) is P and u(k − 1) is P Then f 1,9 (k) = β 1,9 (k) F1,9 (k),
Rule 1 If y(k) is N and r (k + 1) is N Then f 2,1 (k) = β 2,1 (k) F2,1 (k),
Rule 2 If y(k) is N and r (k + 1) is Z Then f 2,2 (k) = β 2,2 (k) F2,2 (k),
Rule 3 If y(k) is N and r (k + 1) is P Then f 2,3 (k) = β 2,3 (k) F2,3 (k),
Rule 4 If y(k) is Z and r (k + 1) is N Then f 2,4 (k) = β 2,4 (k) F2,4 (k),
MIFREN2 Rule 5 If y(k) is Z and r (k + 1) is Z Then f 2,5 (k) = β 2,5 (k) F2,5 (k),
Rule 6 If y(k) is Z and r (k + 1) is P Then f 2,6 (k) = β 2,6 (k) F2,6 (k),
Rule 7 If y(k) is P and r (k + 1) is N Then f 2,7 (k) = β 2,7 (k) F2,7 (k),
Rule 8 If y(k) is P and r (k + 1) is Z Then f 2,8 (k) = β 2,8 (k) F2,8 (k),
Rule 9 If y(k) is P and r (k + 1) is P Then f 2,9 (k) = β 2,9 (k) F2,9 (k),
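The rule tables above can be sketched in code as follows. The triangular membership functions chosen for N, Z and P are assumptions for illustration only (the chapter defines them graphically in Figs. 2 and 3); beta stands for the tunable linear weight vector β̂(k):

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership with support [a, c] and peak at b (assumed shape)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

mf = {"N": lambda x: tri(x, -2.0, -1.0, 0.0),
      "Z": lambda x: tri(x, -1.0, 0.0, 1.0),
      "P": lambda x: tri(x, 0.0, 1.0, 2.0)}
rules = [(a, b) for a in "NZP" for b in "NZP"]     # the 9 antecedent pairs, Rule 1..9

def mifren(y, u_prev, beta):
    # F_j is the firing strength of rule j; the output is beta^T F
    F = np.array([mf[a](y) * mf[b](u_prev) for a, b in rules])
    return float(beta @ F)

# With unit weights, only the (Z, Z) rule fires fully at the origin
assert abs(mifren(0.0, 0.0, np.ones(9)) - 1.0) < 1e-12
```

MIFREN2 has the same structure with r(k + 1) in place of u(k − 1) as the second input.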
when N, Z and P denote the negative, zero and positive linguistic levels, respectively. The membership functions for these rules are illustrated in Figs. 2 and 3. In this work, we use the same membership functions for y(k) and r(k + 1) because these variables share the same linguistic levels in the human sense.
In Fig. 4, the tracking performance is quite satisfactory without off-line learning. The control effort is illustrated in Fig. 5. The convergence of βi(k) is shown by ||βi(k)|| in Fig. 6 for both MIFRENs.
when

∆ f1(k) = 1      if 0 < k < 125
          0.75   if 125 ≤ k < 325
          −1.25  if 325 ≤ k < 425   (51)
          1.25   if 425 ≤ k < 500,

and

∆ f2(k) = −0.5   if 0 < k < 125
          1      if 125 ≤ k < 225
          −0.75  if 225 ≤ k < 425   (52)
          −0.5   if 425 ≤ k < 500.
We use the same initial IF-THEN rules, membership functions, εm, η, ȳu and parameter vectors βi as before. Without any off-line learning for the MIFRENs, the tracking performance is presented in Fig. 8. The control effort u(k) is shown in Fig. 9. The time variation of ||βi(k)|| is illustrated in Fig. 10. The uncertainty terms ∆ f1(k) and ∆ f2(k) vary with time, but the tuning vectors are all bounded.
troller for moving this Robotino to reach the desired position in (x, y) coordinates, given as xd(i, k) and yd(i, k), respectively. During the movement, the desired angle of Robotino, denoted as φd(i, k), should be maintained at 0◦ for all ith desired positions and time indices k. The system configuration is illustrated by the block diagram in Fig. 12.
The commercial Robotino is controlled through velocities: vx(i, k) in the x direction, vy(i, k) in the y direction and vφ(i, k) for rotation. In this work, we consider these signals as the control efforts, which are generated by the pair of MIFRENs. The experiment has been demonstrated with 4 desired points and 4 routes as follows: route 1 [(0.0, 0.5)→(0.5, 0)], route 2 [(0.5, 0)→(0.0, 0.0)], route 3 [(0.0, 0.0)→(0.5, 0.5)] and route 4 [(0.5, 0.5)→(0.0, 0.5)]. In Fig. 13, the movement of Robotino is illustrated in the x–y coordinate plane, with the errors ex and ey along the x and y axes shown in Fig. 14. Because of the fixed angle φd = 0, we need to consider only the two control efforts vx and vy, presented in Fig. 15. At the beginning, on routes 1 and 2, the movement of the robot is not a straight line because the MIFRENs need to tune their internal parameters. After that, better results are obtained on routes 3 and 4. In case of losing the wireless signal, we still obtain a satisfactory result, as shown in Fig. 16.
Fig. 16. Experimental result: position x − y (in case of lost signal transmission).
6. Conclusion
In this chapter, an adaptive control for a class of nonlinear discrete-time systems based on multi-input fuzzy rules emulated networks (MIFREN) is introduced via approximation with Taylor expansion and the mean value theorem. Without the need for a mathematical system model, the approximation can be carried out directly to construct the control law. Two MIFRENs are implemented to estimate the unknown functions obtained by the nonaffine linearization. With the main theorem, the learning algorithm for the parameters inside the MIFRENs guarantees the convergence of these parameters and satisfactory tracking performance. The computer simulation demonstrates the validity of our mathematical proof. We considered both operating cases: the nominal plant and the plant with uncertainties. The bounded parameters ||β1|| and ||β2|| and satisfactory tracking performance are obtained in both cases with the same initial setting. An experimental setup with the commercial mobile robot system Robotino demonstrates the controller performance.
7. References
[1] C.T. Lin, Neural fuzzy systems, Prentice-Hall, 1996.
[2] C. Treesatayapun and S. Uatrongjit, "Adaptive controller with Fuzzy rules emulated structure and its applications," Engineering Applications of Artificial Intelligence, Elsevier, vol. 18, pp. 603-615, 2005
[3] C. Treesatayapun “Nonlinear Systems Identification Using Multi Input Fuzzy Rules Emu-
lated Network,” ITC-CSCC2006 International Technical Conference on Circuits/Systems, Com-
puters and Communications, Chiangmai, Thailand, 10-13 July 2006
[4] E. Armanda-Bricaire, U. Kotta, and C. H. Moog, “Linearization of discrete-time systems,”
SIAM J. Contr. Optim., vol. 34, pp. 1999-2023, 1996.
[5] G. L. Plett, “Adaptive Inverse control of linear and nonlinear system using dynamic neural
networks,” IEEE Trans. Neural Netw., vol. 14, no. 2, pp. 360-376, Mar. 2003.
[6] H. X. Li and H. Deng, “An approximation internal model-based neural control for un-
known nonlinear discrete processes,” IEEE Trans. Neural Networks, vol. 17, no. 3, pp. 659-
670, May. 2006.
[7] H. Deng and H. X. Li, “A novel neural network approximate inverse control for unknown
nonlinear discrete dynamical systems,” IEEE Trans. Syst. Man Cybern. B, Cybern., vol. 35,
no. 1, pp. 115-123, Feb. 2005.
[8] H. Deng and H. X. Li, “On the new method for the control of discrete nonlinear dynamic
systems using neural networks,” IEEE Trans. Neural Netw., vol. 17, no. 2, pp. 526-529, Mar.
2006.
[9] J. B. D. Cabrera and K. S. Narendra, “Issues in the application of neural networks for
tracking based on inverse control,” IEEE Trans. Automat. Contr., vol. 44, pp. 2007-2027,
Nov. 1999.
[10] J. Slotine and W. Li, Applied Nonlinear Control. Englewood Cliffs, NJ: Prentice-Hall, 1991.
[11] L. Chen and K. Narendra, “Identification and Control of a Nonlinear Discrete-Time Sys-
tem Based on its Linearization: A Unified Framework,” IEEE Trans. Neural Networks, vol.
15, no. 3, pp. 663-673, May. 2004.
[12] M. Chen and D.A. Linkens,“A hybrid neuro-fuzzy PID controllers,” Fuzzy Sets and Sys-
tem, pp. 27-36, 99, 1998.
[13] P. He and S. Jagannathan, “Reinforcement Learning Neural-Network-Based Controller
for Nonlinear Discrete-Time Systems With Input Constraints,” IEEE Trans. Syst., Man.,
Cybern., vol. 37, no. 2, pp. 425-436 Apr. 2007.
[14] Q. Gan and C. J. Harris, “Fuzzy local linearization and local basis function expansion in
nonlinear system modeling,” IEEE Trans. Syst., Man, Cybern. B, vol. 29, pp. 559-565, Aug.
1999.
[15] Q. M. Zhu and L. Guo, “Stable adaptive neurocontrol for nonlinear discrete-time sys-
tems,” IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 653-662, May 2004.
[16] S. S. Ge, J. Zhang, and T. H. Lee, “Adaptive MNN control for a class of non-affine NAR-
MAX systems with disturbances,” Syst. Control Lett., vol. 53, pp. 1-12, 2004.
[17] W. Rudin, Principles of Mathematical Analysis, New York: McGraw Hill, 1976.
[18] Z. Q.Wu and C. J. Harris, “A neurofuzzy network structure for modeling and state esti-
mation of unknown nonlinear systems,” Int. J. Syst. Sci., vol. 28, pp. 335-345, 1997.
17

Knowledge Structures for Visualising Advanced Research and Trends

Abstract
Due to the enormous number of research publications available, perceiving the growth of scientific knowledge even in one's own specialised field is a challenging task. It would be helpful if we could provide scientists with knowledge visualisation tools to discover the existence of a scientific paradigm and the movements of such paradigms. This chapter introduces the state of the art in visualising knowledge structures. The aim of visualising knowledge structures is to capture the intellectual structure of a particular knowledge domain. Approaches to the visualisation of knowledge structures, with emphasis on the role of citation-based methods, are described. The principal components of factor analysis and the Pathfinder network are utilized to reveal new and significant developments in the intellectual structure of the ubiquitous computing research area. Literature published in the online citation databases CiteSeer and Web of Science (WoS) is exploited to derive the main research themes and their inter-relationships in ubiquitous computing. The results could benefit someone new to a specific research domain, ubiquitous computing in this case: the outcome uncovers popular topics and important research in the particular domain. Potential developments can be re-used in other disciplines and shared across different research domains.
1. Introduction
Computing technology is a paradigm shift where technology becomes virtually invisible in
our lives and is a rapidly advancing and expanding research and development field in this
decade. Due to the enormous amount of available scientific research publications, keeping
up the growth of scientific knowledge even in one’s own specialise field is a challenge task.
It would be helpful if we could provide scientists with knowledge visualisation tools to
detect the existence of a scientific paradigm and movements of such paradigms. The main
scientific research themes are also very difficult to analyze and grasp by using the
traditional methodologies. For example, visualising intrinsic structures among documents in
scientific literatures could only capture some aspects of scientific knowledge.
This chapter introduces the state of the art of visualising knowledge structures. The aim of
visualising knowledge structures is to capture and reveal insightful patterns of intellectual
structures shared by scientists in a subject field. This chapter describes approaches to the
visualisation of knowledge structure with emphasis on the role of citation-based methods.
Instead of depending upon occurrence patterns of content-bearing words, we aim to capture
the intellectual structures of a particular knowledge domain.
We focus on the study of ubiquitous computing, also called pervasive computing.
Numerous journals and conferences are now dedicated to the study of ubiquitous
computing and related topics. Ubiquitous computing was first described by Weiser (1991). Since then, a rich body of related literature has been published. It would be useful if the content of those publications could be summarised and presented in an easily digestible way to capture structures and facilitate the understanding of the research themes and trends in ubiquitous computing.
The goal of the chapter is to show the scope and main themes of ubiquitous computing
research. We begin by examining the survey studies of visualising knowledge structures.
Next, the data collection method and intellectual structure techniques, factor analysis and
Pathfinder Network, are introduced. The results of the analysis are presented and discussed.
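As an illustration of the Pathfinder pruning step, the following Python sketch (the function name and the toy weights are ours, not from the chapter) computes PFNET(r = ∞, q = n − 1), the most common parameter choice, by comparing each direct link against the minimax path weight:

```python
import numpy as np

def pathfinder_network(w):
    """Prune a weighted link matrix to its Pathfinder network PFNET(r=inf, q=n-1).

    w: symmetric (n, n) array of link weights (smaller = stronger association),
       with np.inf where no direct link exists.
    Returns a boolean adjacency matrix of the links that survive pruning.
    """
    n = w.shape[0]
    d = w.copy()
    # Floyd-Warshall variant: with the Minkowski parameter r = infinity, the
    # cost of a path is the maximum of its link weights, so the "distance"
    # between two nodes is the minimax path weight over all paths.
    for k in range(n):
        d = np.minimum(d, np.maximum(d[:, k:k + 1], d[k:k + 1, :]))
    # A direct link is kept only if no indirect path beats it (triangle
    # inequality test with q = n - 1, i.e. paths of any length).
    keep = np.isfinite(w) & (w <= d)
    np.fill_diagonal(keep, False)
    return keep
```

With r = ∞ the weight of a path is the maximum link weight along it, so a direct link survives only if it is no heavier than every alternative route; for example, a direct link of weight 3 between two themes is pruned when a two-step route with maximum weight 1 connects them.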
Knowledge Domain Visualisation (KDV) depicts the structure and evolution of scientific
fields (Borner, Chen and Boyak 2002). Some recent work in knowledge discovery and data mining systems comprises analyses of the engineering domain (Mothe and Dousset 2004; Mothe et al. 2006).
W(P) = (∑_{i=1}^{k} w_i^r)^{1/r}
Fig. 5. A probabilistic directed graphical model for visual motion estimation. Here, t′ = t + 1 and t″ = t + 2 denote future timesteps and k′ = k + 1 and k″ = k + 2 denote finer scales. Observable nodes are shaded gray, hidden nodes are white.
Elad & Feuer (1998); Singh (1991) to recursively estimate the optical flow over time including
a prediction model that defines some temporal relation between pixel movements (see Fig. 4
third row). For prediction, a model for the underlying dynamics is needed to predict image
motion.
Some authors work in a probabilistic framework assuming that velocity distributions are
Gaussian parameterized by a mean and covariance. Kalman filtering can then be used to
properly combine the information from scale to scale or time to time taking into account un-
certainties of the measurements Simoncelli (1999); Singh (1991). The presumption of Gaussian
distributed velocity measurements is sometimes incomplete because velocity distributions are
often multimodal or ambiguous Simoncelli et al. (1991); Weiss & Fleet (2002), especially at mo-
tion boundaries. To circumvent this problem, particle filtering methods for non-Gaussian ve-
locity distributions have recently been used to improve motion estimation for tracking single
or multiple objects in a scene Isard & Blake (1998); Rosenberg & Werman (1997).
Markov chains along time are defined for each scale k via Markov chains along scale defined at each time step t. Note that the DBN enforces an independence structure: the probability that a node is in one of its states depends directly only on the states of its parents Yedidia et al. (2003).
We assume a generative model for the observables Y^{tk} of an image sequence I^{1:T,1:K} with T images at equidistant points in time t ∈ T at K spatial resolution scales k ∈ K, with t′ = t + 1 and k′ = k + 1 being the next time step and the next finer scale, respectively. Without loss of generality we define the time intervals ∆t = 1 and the scale intervals ∆k = 1 to be unity. Here, the observable Y^{tk} comprises image data of several frames within a time interval around t at the same scale k. For example, Y^{tk} = (I^{tk}, I^{t′k}) has to be at least a pair of images, with both images being defined over the same image range X^k at the same scale k but at consecutive points in time t and t′. Each image I^{tk} consists of image intensities I_x^{tk} at each image position x ∈ X^k. Similarly, the hidden state V^{tk} is a flow field at time slice t and scale k defined over the image range X^k, with velocity vectors v_x^{tk} at each image position x.
P(V^{11}) = ∏_x P(v_x^{11}) , (11)
defining some preference for the speed and direction of the velocities in the flow field. Often
this is chosen to be a product of zero mean Gaussian distributions to prefer slow and smooth
velocities Weiss & Fleet (2002). Second, the specification of the observation likelihood for the
images Ytk given the flow Vtk for all times t ∈ T and scales k ∈ K
This factorisation assumption is somewhat unusual because we do not assume the image observation to factorise into pixel observations but assume the observation likelihood to factorise in the velocities only. And third, the specification of the transition probabilities for the flow fields V^{t′k′} at the new timestep t′ at the finer scale k′, given the flow field V^{t′k} at the same time t′ but coarser scale k, and the flow field V^{tk′} from the last time t but at the same scale k′. For the first time slice t′ = 1 and the coarsest scale k′ = 1 the transitions are conditioned only on one flow field, V^{1k} or V^{t1}.
These equations explicitly express that the probability distribution for each flow field factorises into independent distributions for each velocity vector. Nevertheless, although each velocity vector does not depend on the velocity vectors from the flow field at the same time and scale, it heavily depends on all the velocity vectors from the flow fields at the coarser scale and
Dynamic Visual Motion Estimation 291
past time. Further on, the conditional dependence P(v_x^{t′k′} | V^{t′k}, V^{tk′}) can be split into two pairwise potentials φ_k, φ_t. This will allow us to maintain only factored beliefs during inference, which makes the approach computationally practicable.
The first potential φ_t(v_x^{t′k′}, V^{tk′}) expresses that the flow field at every spatial scale k′ transforms from t → t′ according to itself. The second potential φ_k(v_x^{t′k′}, V^{t′k}) realizes a refinement from the coarser to the finer scale k → k′ at every time t′. The speed of a flow vector v_{x′}^{t′k′} at position x′ at time t′ is functionally related to a previous flow vector v_x^{tk′} at some corresponding position x at time t, including some free parameters θ_t that allow for adaptation of the temporal relation. Now, asking what the corresponding position x in the previous image was, we assume that we can infer it from the flow field itself as follows

x ∼ f_x^t(x′, x′ − v_{x′}^{t′k′}; θ_x^t) . (15)
In principle f_x^t can be any arbitrary function that defines the relation between neighboring positions. Again the free parameters θ_x^t allow for adaptation of the spatial relation. Note that here we use v_{x′}^{t′k′} to retrieve the previous corresponding point x. This is a suitable approximation.
Since it is uncertain how strongly a position x at the coarser scale k influences the velocity estimate at position x′ at the finer scale k′, we assume that we can infer it from the neighborhood similar to (15)

x ∼ f_x^k(x′, x; θ_x^k) . (18)
The considerations for the scale transition are analogous to the ones for the temporal transition. Again, combining both factors (17) and (18) and integrating over x we get the second pairwise potential

φ_k(v_x^{t′k′}, V^{t′k}) = ∑_x f_x^k(x′, x; θ_x^k) f_k(v_{x′}^{t′k′}, v_x^{t′k}; θ_k) , (19)
which imposes a spatial smoothness constraint on the flow field via spatial weighting of motion estimations from the coarser scale. The combination of both potentials (16) and (19) results in the complete conditional flow field transition probability as given in (13). The transition factors (16) and (19) allow us to unroll two different kinds of spatial constraints along the temporal and the scale axes while adapting the free parameters for the scale and time transitions differently. This is done by splitting not only the transition into two pairwise potentials, one for the temporal and one for the scale transition, but also every potential itself into two factors, one for the transition noise and the other for an additional spatial constraint. In this way, the coupling of the potentials (16) and (19) realizes a combination of (A) scale-time prediction and (B) an integration of motion information neighboring in time, in space, and in scale.
4.4 Inference
The overall data likelihood P(Y^{1:T,1:K}, V^{1:T,1:K}) is assumed to factorize as defined by the directed graph in figure 5

P(Y^{1:T,1:K}, V^{1:T,1:K}) = ∏_{t=1}^{T} ∏_{k=1}^{K} P(Y^{tk} | V^{tk}) × P(V^{11}) × ∏_{t=1}^{T−1} P(V^{t′1} | V^{t1}) ∏_{k=1}^{K−1} P(V^{1k′} | V^{1k}) P(V^{t′k′} | V^{t′k}, V^{tk′}) . (20)
What we are usually interested in is the probability of some flow field given all the data acquired so far. For the offline case, where all the data of a sequence is accessible, this would be the probability P(V^{tk} | Y^{1:T,1:K}). For the online case, where only the past data is accessible, the probability P(V^{tk} | Y^{1:t,1:K}) would be of interest. To infer these probabilities, Bayes' rule and marginalization have to be applied. For the offline case this reads

P(V^{tk} | Y^{1:T,1:K}) = (1 / P(Y^{1:T,1:K})) ∑_{V^{1:T,1:K} \ V^{tk}} P(Y^{1:T,1:K}, V^{1:T,1:K}) . (21)
For the online case it reads

P(V^{tk} | Y^{1:t,1:K}) = (1 / P(Y^{1:t,1:K})) ∑_{V^{1:t,1:K} \ V^{tk}} P(Y^{1:t,1:K}, V^{1:t,1:K}) . (22)
For either the online or the offline case, the direct computation of the marginals using equation (21) or (22) would take exponential time Yedidia et al. (2003). The most prominent solution to this problem is Belief Propagation (BP), a very efficient approximate inference algorithm that is especially applicable if the graph has a lot of loops and many hidden nodes, as is the case for our graphical model for dynamic motion estimation (see figure 5).
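To make the message-passing idea concrete before the derivation, here is a toy forward (alpha) recursion on a single temporal Markov chain in Python; the two-state transition matrix and likelihood values are invented for illustration, and the full scale-time model of Fig. 5 couples many such chains:

```python
import numpy as np

# Forward (alpha) recursion on one temporal Markov chain:
# alpha_t(v) = P(v_t | y_1:t) is proportional to
# P(y_t | v_t) * sum_u P(v_t | u) * alpha_{t-1}(u).
# Two discrete velocity states; all numbers are invented.

P_trans = np.array([[0.8, 0.2],
                    [0.3, 0.7]])          # P(v_t = col | v_{t-1} = row)
prior = np.array([0.5, 0.5])              # P(v_1)
liks = np.array([[0.9, 0.1],              # P(y_t | v_t) for t = 1, 2, 3
                 [0.7, 0.3],
                 [0.2, 0.8]])

alpha = prior * liks[0]
alpha /= alpha.sum()                      # normalisation replaces 1 / P(y_1)
for lik in liks[1:]:
    alpha = lik * (alpha @ P_trans)       # predict with the transition, then correct
    alpha /= alpha.sum()                  # cost grows linearly in T, not exponentially
```

The recursion touches each time step once, which is exactly why message passing avoids the exponential cost of the direct marginalisation in (21) and (22).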
Let us start with the inference of the flow field at the first time slice t = 1 and the coarsest scale k = 1, just having access to the observable Y^{11}. Applying Bayes' rule we get

α(v_x^{11}) = P(v_x^{11} | Y^{11}) = ℒ(Y^{11}, v_x^{11}) P(v_x^{11}) / P(Y^{11}) . (24)
This is the initial belief that has to be propagated along time and scale. To derive an approximate forward filter suitable for online applications we propose the following message passing scheme that realizes a recurrent update of the beliefs. Let us assume we isolate one time slice at time t and neglect all past and future beliefs; then we would have to propagate the messages m_{k→k′} from coarse to fine and the messages m_{k′→k} from fine to coarse to compute a belief over the scale Markov chain. Similarly, if we isolate one scale k for all time slices and neglect all coarser and finer beliefs, then we would have to propagate the messages m_{t→t′} from the past to the future and the messages m_{t′→t} from the future to the past to compute a belief over the temporal Markov chain. For the realization of a forward scale-time filter, we combine the forward passing of the temporal messages m_{t→t′} and the computation of the likelihood messages m_{Y→v} = ℒ(Y^{t′k}, v_x^{t′k}) at all scales k. As a simplification we restrict ourselves to propagating messages only in one direction k → k′ and neglect passing back the messages m_{k′→k}. The consequence of this is that not all the V-nodes at time t have seen all the data Y^{1:t,1:K} but only all past data up to the current scale, Y^{1:t,1:k}. This reduces computational costs, but the flow field on the finest scale V^{t,K} is now the only node that sees all the data Y^{1:t,1:K}. Nevertheless, we also tested passing back the messages m_{k′→k}, which only slightly improved the accuracy but increased computational costs.
The factored observation likelihood and the transition probability we introduced in (12) and (13) ensure that the forward propagated joint belief will remain factored. Similar to BP in a Markov random field, we assume independency for all neighboring nodes in the Markov blanket. This means the belief over V^{t′k} and V^{tk′} at time t′ is assumed to be factored, which implies that also the belief over V^{t′k} and V^{tk′} factorizes:

P(V^{t′k}, V^{tk′} | Y^{1:t′,1:k′} \ Y^{t′k′}) = P(V^{t′k} | Y^{1:t′,1:k}) P(V^{tk′} | Y^{1:t,1:k′}) = ∏_x α(v_x^{t′k}) α(v_x^{tk′}) , (25)
where we used \ as the notation for excluding Y^{t′k′} from the set of measurements Y^{1:t′,1:k′}.
The two-dimensional forward filter propagates the belief over V^{t′k} and V^{tk′} from (25) by multiplying with the scale-time transition (13) and marginalizing over V^{t′k} and V^{tk′}. The result is multiplied with the new observation likelihood (12) and normalized by P(Y^{t′k′}):

P(v_x^{t′k′} | Y^{1:t′,1:k′}) = (1 / P(Y^{t′k′})) ℒ(Y^{t′k′}, v_x^{t′k′}) ∑_{V^{t′k}} ∑_{V^{tk′}} P(v_x^{t′k′} | V^{t′k}, V^{tk′}) P(V^{t′k}, V^{tk′} | Y^{1:t′,1:k′} \ Y^{t′k′}) ,

α(v_x^{t′k′}) ∝ m_{Y→v}(v_x^{t′k′}) ∑_{V^{t′k}} ∑_{V^{tk′}} φ_k(v_x^{t′k′}, V^{t′k}) φ_t(v_x^{t′k′}, V^{tk′}) ∏_x α(v_x^{t′k}) α(v_x^{tk′}) , (26)
As can be seen, the complete scale-time forward filter can now be defined by the computation of updated beliefs α as the product of incoming messages,

α(v_x^{t′k′}) ∝ m_{Y→v}(v_x^{t′k′}) m_{k→k′}(v_x^{t′k′}) m_{t→t′}(v_x^{t′k′}) . (27)
Inserting the proposed class of temporal transitions (16) into (26) leads to the following temporal message

m_{t→t′}(v_x^{t′k′}) = ∑_{V^{tk′}} φ_t(v_x^{t′k′}, V^{tk′}) ∏_x α(v_x^{tk′}) , (28)
Note that the summation ∑_{V^{tk}} sums over all possible flow fields, i.e. ∑_{V^{tk}} represents X^k summations ∑_{v_{1,1}^{tk}} ∑_{v_{1,2}^{tk}} ∑_{v_{2,1}^{tk}} · · · over each local flow field vector. We separate these into a summation ∑_{v_x^{tk}} over the flow field vector at x and a summation ∑_{V^{tk} \ v_x^{tk}} over all other flow field vectors at z ≠ x. Then, we use the equivalence ∑_{V^{tk} \ v_x^{tk}} ∏_{z≠x} α(v_z^{tk}) = 1, which holds because the beliefs are normalized. Similarly, inserting the scale transitions (19) into (26) leads to the scale message

m_{k→k′}(v_x^{t′k′}) = ∑_{V^{t′k}} φ_k(v_x^{t′k′}, V^{t′k}) ∏_x α(v_x^{t′k}) . (29)

Algorithm 1. Scale-time forward filter
for t = 1 to T do
  for k = 1 to K do
    for x = 1 to X^k do
      Compute the messages m_{Y→v}(v_x^{tk}), m_{t→t′}(v_x^{tk}) and m_{k→k′}(v_x^{tk})
      Update the beliefs α(v_x^{tk}) via (27)
    end for
  end for
end for
Finally, the three equations (27), (28), and (29) define a very efficient, tightly coupled scale-time forward filter for visual motion estimation. It realizes a complete probabilistic recurrent estimation of a set of flow fields V^{t,1:K} with different resolutions k, swept along the time dimension t. It follows the principle that the longer a scene is observed and the finer the resolution of the data, the more accurately the flow can be estimated.
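For a discrete set of candidate velocities, the belief update (27) is just a normalised elementwise product of the three incoming messages; a minimal Python sketch with made-up numbers:

```python
import numpy as np

def update_belief(m_obs, m_scale, m_time):
    """Eq.-(27)-style belief update for one pixel: normalised elementwise
    product of the likelihood message and the two transition messages."""
    b = m_obs * m_scale * m_time
    return b / b.sum()

m_obs = np.array([0.1, 0.6, 0.3])     # m_{Y->v}: likelihood over 3 candidate velocities
m_scale = np.array([0.2, 0.5, 0.3])   # m_{k->k'}: prediction from the coarser scale
m_time = np.array([0.3, 0.4, 0.3])    # m_{t->t'}: prediction from the previous frame
belief = update_belief(m_obs, m_scale, m_time)
```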
5. Filter realisations
The pseudo-code, Algorithm 1, shows the very compact form of the derived scale-time filter
suitable for an algorithmic implementation. What remains to be done, is the specification of
the observation likelihood (12) and the potentials of the transition probability (16) and (19).
Without loss of generality the derivation for the filter assumes discrete state variables, which is reflected in using summations ∑ for marginalization. If continuous state variables are given, the summations ∑ simply have to be replaced by integrals ∫. Everything else stays the same. To show the applicability of the framework, we derive two realizations: one
for continuous Gaussian and one for discrete grid-based observation likelihoods as well as
Mixture of Gaussians and Mixture of Student’s-t-distributions transitions. Both realizations
have already been published at the International Conference on Machine Learning and Ap-
plications Willert et al. (2007; 2008). Here, we summarize the essentials of the modelling in
relation to the general filter framework. For optical flow estimation results, discussions on the
parameters, and benchmark tests we refer to the published material.
ℒ(v_x^{tk}) = N(−I_{t,x}^{tk} | (∇I_x^{tk})^T v_x^{tk}, Σ_{ℒ,x}^{tk}) , (30)

Σ_{ℒ,x}^{tk} = diag(σ_{ℒ,x̄x}^{tk}) , (31)

σ_{ℒ,x̄x}^{tk} = ((∇I_x̄^{tk})^T S_v ∇I_x̄^{tk} + s_t) / f(x̄, x, t, k) . (32)
In notation (30), the patches can be regarded as vectors, and the covariance matrix Σ_{ℒ,x}^{tk} is diagonal with entries σ_{ℒ,x̄x}^{tk} that depend on the position x̄ relative to the center x, the time t, the scale k, the flow field covariance S_v and the variance s_t on the temporal derivatives. Here, f takes into account the spatial uncertainty of the velocity measurement and can implement any kind of spatial weighting, such as the binomial blurring filter proposed in Simoncelli (1999) or an
anisotropic and inhomogeneous Gaussian weighting f = N(x̄ | x, Σ_{I,x}^{tk}), which is investigated in Willert et al. (2008).
In contrast to Simoncelli (1999), we introduce time t as an additional dimension and derive a more compact notation by putting the spatially weighted averaging directly into the likelihood formulation, defining multivariate Gaussian distributions for vectors that describe image patches centered around image locations. Allowing for uncertainties Σ_{ℒ,x}^{tk} that are adaptive in location x, scale k and time t, we are able to tune the local motion measurements dynamically, e.g. dependent on the underlying structure of the intensity patterns.
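A minimal Python sketch of such a gradient-based Gaussian velocity likelihood (our own simplification: the spatial weighting f is omitted, so this reduces to a Lucas-Kanade-style weighted least-squares estimate with per-pixel noise variance as in (32)):

```python
import numpy as np

def local_flow_gaussian(gx, gy, gt, Sv=np.diag([0.1, 0.1]), st=0.01):
    """Gaussian velocity estimate for one image patch.

    gx, gy, gt: flattened spatial and temporal derivatives over the patch.
    Implements the likelihood N(-I_t | grad(I)^T v, sigma) with per-pixel
    variance sigma = grad^T Sv grad + st; the spatial weight f of eq. (32)
    is omitted in this simplified sketch.
    """
    G = np.stack([gx, gy], axis=1)                   # (npix, 2) gradient matrix
    sigma = np.einsum('ni,ij,nj->n', G, Sv, G) + st  # per-pixel noise variance
    Gw = G / sigma[:, None]                          # noise-weighted gradients
    precision = Gw.T @ G                             # information matrix
    cov = np.linalg.inv(precision)
    mean = cov @ (Gw.T @ (-gt))                      # weighted least squares
    return mean, cov
```

On synthetic derivatives generated by a pure translation the estimate recovers the true velocity exactly, while the covariance shrinks as the patch contains more independent gradient directions.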
which says that the change in time of the flow field is white, with undirected transition noise between V^{tk} and V^{t′k}. For the spatial interaction (15) an inhomogeneous anisotropic Gaussian is assumed,

x ∼ N(x | x′ − v_{x′}^{t′k′}, Σ_{t,x}^{tk}) , (34)

to be able to steer the orientation and to adapt the strength of the uncertainty in the spatial identification Σ_{t,x}^{tk} between corresponding positions in time. Combining both factors (33) and (34)
and integrating over x we get a Mixture of Gaussians (MoG) as the first pairwise potential (16), with the Gaussian spatial coherence constraint being the mixing coefficients. Equivalent to (33), for the scale transition factor (19) we choose a Gaussian, assuming white transition noise σ_k. The influence of neighboring velocity states from the coarser scale is also modelled as an adaptive Gaussian kernel similar to (34). Again, combining both factors (36) and (37) and integrating over x we get a MoG as the second pairwise potential, which imposes a spatial smoothness constraint on the flow field via adaptive spatial weighting of motion estimations from the coarser scale. The combination of both potentials (16) and (19) results in the complete conditional flow field transition probability as given in (13).
α(v_x^{tk}) ∝ m_{Y→v}(v_x^{tk}) m_{t→t′}(v_x^{tk}) m_{k→k′}(v_x^{tk}) :≈ N(v_x^{tk} | μ_x^{tk}, Σ_x^{tk}) . (39)
We fulfill this constraint by making all single messages Gaussian distributed. This already holds for the observation likelihood m_{Y→v}(v_x^{tk}). A more accurate technique (following assumed density filtering) would be to first compute the new belief α exactly as a MoG and then collapse it to a single Gaussian. However, this would incur extra costs. Here, we do not investigate the tradeoff between computational cost and accuracy for different collapsing methods.
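The closed-form collapse referred to here is moment matching: the single Gaussian that minimises KL(MoG ‖ N) matches the mixture's overall mean and covariance. A small Python sketch (the function name is ours):

```python
import numpy as np

def collapse_mog(weights, means, covs):
    """Collapse a Gaussian mixture to the single Gaussian minimising
    KL(MoG || N): moment matching of mean and covariance."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    mu = w @ means                     # overall mean of the mixture
    diff = means - mu
    # total covariance = average within-component covariance
    #                  + between-component spread of the means
    cov = (np.einsum('i,ijk->jk', w, covs)
           + np.einsum('i,ij,ik->jk', w, diff, diff))
    return mu, cov
```

For two well-separated components the between-component term dominates, which is exactly the information a single-Gaussian approximation can still retain.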
Inserting Gaussian distributed beliefs α into the propagation equations (28, 29) leads to two different MoGs for the resulting messages,

m_{t→t′}(v_x^{t′k′}) = ∑_x̄ p̂_x̄^{tk′} N(v_x^{t′k′} | μ̂_x̄^{tk′}, Σ̂_x̄^{tk′}) ≈ N(v_x^{t′k′} | ω_x^{t′k′}, Ω_x^{t′k′}) , (40)

with

p̂_x̄^{tk′} = N(x − x̄ | μ_x̄^{tk′}, Σ̌_x̄^{tk′}) , (41)
μ̂_x̄^{tk′} = (σ_t + Σ_x̄^{tk′}) Λ̌_x̄^{tk′} (x − x̄) + Σ_{t,x}^{tk′} Λ̌_x̄^{tk′} μ_x̄^{tk′} , (42)
Σ̂_x̄^{tk′} = Σ_{t,x}^{tk′} Λ̌_x̄^{tk′} (σ_t + Σ_x̄^{tk′}) , (43)
Σ̌_x̄^{tk′} = (Λ̌_x̄^{tk′})^{−1} = σ_t + Σ_{t,x}^{tk′} + Σ_x̄^{tk′} ,

and

m_{k→k′}(v_x^{t′k′}) = ∑_x̄ p_x̄^{t′k} N(v_x^{t′k′} | μ_x̄^{t′k}, Σ̄_x̄^{t′k}) ≈ N(v_x^{t′k′} | π_x^{t′k′}, Π_x^{t′k′}) , (44)

with

p_x̄^{t′k} = N(x̄ | x, Σ_{k,x}^{t′k}) ,   Σ̄_x̄^{t′k} = σ_k + Σ_x̄^{t′k} . (45)
In order to satisfy the Gaussian constraint formulated in (39), the MoGs are collapsed into single Gaussians (40, 44) again. This is derived by minimizing the Kullback-Leibler divergence between the given MoGs and the assumed Gaussians for the means ω_x^{tk}, π_x^{tk} and the covariances Ω_x^{tk}, Π_x^{tk}, which results in closed-form solutions for these parameters. The final belief α(v_x^{tk}) follows from the product of these Gaussians with the observation likelihood,

α(v_x^{tk}) = ℒ(v_x^{tk}) N(v_x^{tk} | μ̃_x^{tk}, Σ̃_x^{tk}) , (46)

Σ̃_x^{tk} = Π_x^{tk} (Π_x^{tk} + Ω_x^{tk})^{−1} Ω_x^{tk} , (47)

μ̃_x^{tk} = Ω_x^{tk} (Π_x^{tk} + Ω_x^{tk})^{−1} π_x^{tk} + Π_x^{tk} (Π_x^{tk} + Ω_x^{tk})^{−1} ω_x^{tk} . (48)
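Equations (47) and (48) are the standard mean and covariance of the (renormalised) product of two Gaussian densities; a short Python check (names ours) against the equivalent precision-addition form:

```python
import numpy as np

def fuse_gaussians(pi, Pi, om, Om):
    """Mean and covariance of the renormalised product of two Gaussians
    N(v | pi, Pi) and N(v | om, Om), following eqs. (47) and (48)."""
    S = np.linalg.inv(Pi + Om)
    cov = Pi @ S @ Om                    # eq. (47)
    mean = Om @ S @ pi + Pi @ S @ om     # eq. (48)
    return mean, cov
```

The result is algebraically identical to the precision form (Π⁻¹ + Ω⁻¹)⁻¹ with the corresponding precision-weighted mean, but the form of (47)/(48) needs only one matrix inversion.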
By applying the approximation steps (39, 40) and (44) we guarantee the posterior (27) to be Gaussian, which allows for Kalman-filter-like update equations, since the observation is defined to factorize into Gaussian factors (30). The final recurrent motion estimation is given by

α(v_x^{tk}) = N(v_x^{tk} | μ_x^{tk}, Σ_x^{tk}) (49)
= N(−Ĩ_{t,x}^{tk} | (∇I_x^{tk})^T v_x^{tk}, Σ_{ℒ,x}^{tk}) N(v_x^{tk} | μ̃_x^{tk}, Σ̃_x^{tk}) , (50)

Σ_x^{tk} = (Λ̃_x^{tk} + ∇I_x^{tk} Λ_{ℒ,x}^{tk} (∇I_x^{tk})^T)^{−1} , (51)

μ_x^{tk} = μ̃_x^{tk} − Σ_x^{tk} ∇I_x^{tk} Λ_{ℒ,x}^{tk} Ĩ_{t,x}^{tk} . (52)
For reasons explained in Simoncelli (1999), the innovations process is approximated as follows:

Ĩ_{t,x}^{tk} ≈ ∂/∂t T[I_x^{tk}, μ̃_x^{tk}] , (53)

with T applying a backward warp plus bilinear interpolation on the image I_x^{tk} using the predicted velocities μ̃_x^{tk} from (48). We end up with a Gaussian scale-time filter which is, in comparison to existing filtering approaches Elad & Feuer (1998); Simoncelli (1999); Singh (1991), not a Kalman filter realization but related to an extended Kalman filter, since the result of the nonlinear transitions is linearized after each message pass with the collapse of each MoG to a single Gaussian.
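One possible realisation of the warping operator T, a backward warp with bilinear interpolation, sketched in Python (the function name and the flow layout with channels (dx, dy) are our own conventions):

```python
import numpy as np

def backward_warp(img, flow):
    """Sample img at x + v with bilinear interpolation, i.e. warp the image
    backward along the flow field. flow[..., 0] is the x-displacement,
    flow[..., 1] the y-displacement; coordinates are clamped at the border."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    xq = np.clip(xs + flow[..., 0], 0, W - 1)   # query positions
    yq = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(xq).astype(int)
    y0 = np.floor(yq).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)              # neighbouring grid corners
    y1 = np.minimum(y0 + 1, H - 1)
    wx = xq - x0                                # bilinear weights
    wy = yq - y0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])
```

The temporal derivative in (53) can then be approximated as the difference between the current frame and the previous frame warped with the predicted velocities.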
If we have access to a batch of data (or a recent window of data) we can compute smoothed posteriors as a basis for an EM-algorithm and train the free parameters. In our two-filter approach we derive the backward filter as a mirrored version of the forward filter, using a mirrored transition instead of (55). This equation is motivated in exactly the same way as we motivated (55): we assume that v_x^t ∼ S(v_x^{t+1}, σ_V, ν_V) for a corresponding position x̄ in the subsequent image, and that x̄ ∼ N(x − v_x^t, Σ_V) is itself defined by v_x^t. However, note that using this symmetry of argumentation is actually an approximation to our model, because applying Bayes' rule on (55) would lead to a different, non-factored P(V^t | V^{t+1}). What we gain by the approximation P(V^t | V^{t+1}) ≈ ∏_x P(v_x^t | V^{t+1}) are factored β's which are feasible to maintain computationally. The backward filter equations read
To derive the smoothed posterior we need to combine the forward and backward filters. In the two-filter approach this reads as the normalized product of α and β, with P(Y^{t+1:T}) and P(Y^{1:T}) being constant. If both the forward and backward filters are initialized with α(v_x^0) = β(v_x^T) = P(v_x), we can identify the unconditioned distribution P(v_x^t) with the prior P(v_x). For details on the standard forward-backward algorithm we refer to Bishop (2006).
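A toy two-filter smoother for a discrete two-state chain, in Python with invented numbers, showing how the forward belief α and the backward message β combine into the smoothed marginal:

```python
import numpy as np

# Two-filter smoothing on a two-state temporal chain. The smoothed marginal
# P(v_1 | y_1, y_2) is the normalised product of the forward belief
# alpha_1 = P(v_1 | y_1) and the backward message beta_1 ~ P(y_2 | v_1).
# All numbers are illustrative.

P_trans = np.array([[0.8, 0.2],
                    [0.3, 0.7]])          # P(v_2 = col | v_1 = row)
liks = np.array([[0.9, 0.1],
                 [0.2, 0.8]])             # P(y_t | v_t) for t = 1, 2
prior = np.array([0.5, 0.5])

alpha = prior * liks[0]
alpha /= alpha.sum()                      # forward belief P(v_1 | y_1)
beta = P_trans @ liks[1]                  # beta_1(v) = sum_u P(u | v) P(y_2 | u)
gamma = alpha * beta
gamma /= gamma.sum()                      # smoothed marginal P(v_1 | y_1, y_2)
```

Here the second observation favours the other state, so the smoothed belief is pulled away from the purely filtered one, which is exactly the effect exploited when training the free parameters on batches of data.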
6. Summary
A reliable and robust motion estimate is an important low-level processing unit that has the
potential to bootstrap a number of visual perception tasks to be solved by a cognitive vision
system. Since the estimation of motion information has to rely on highly uncertain visual information, a probabilistic treatment of the problem is proposed. Based on three basic approaches to solving motion ambiguities, the derivation of a probabilistic filter is given that combines all three approaches into one recurrent framework. The derivation comprises an
efficient approximate inference algorithm based on belief propagation applied on a directed
graphical model with a graph topology suitable for intertwining belief propagation along two
dimensions, scale and time, simultaneously. Introducing some factorisation assumptions and
a special class of transition probabilities results in a very compact and computationally effi-
cient algorithm. For this algorithm two implementations are presented. The first one realizes a
purely factored Gaussian belief propagation and the second one the propagation of a factored
non-parametric discrete distribution. The presented framework provides a flexible basis for
the realization of user specific motion estimation algorithms with the focus on online appli-
cations. It also serves as an exploration platform to investigate adaptation mechanisms and online learning strategies, for example to improve the optical flow estimation accuracy or to increase the robustness for highly dynamic scenes.
7. References
Anandan, P. (1989). A computational framework and an algorithm for the measurement of
visual motion, IJCV: International Journal of Computer Vision 2(3): 283–310.
Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M. & Szeliski, R. (2007). A database and
evaluation methodology for optical flow, Proceedings of the 11th IEEE International
Conference on Computer Vision (ICCV).
Barron, J., Fleet, D. & Beauchemin, S. (1994). Performance of optical flow techniques, IJCV:
International Journal of Computer Vision 12(1): 43–77.
Beauchemin, S. & Barron, J. (1995). The computation of optical flow, ACM Computing Surveys
27(3): 433–467.
Bergen, J. R., Anandan, P., Hanna, K. J. & Hingorani, R. (1992). Hierarchical model-based motion estimation, Proceedings of the Second European Conference on Computer Vision (ECCV), Vol. 588 of Lecture Notes in Computer Science, Springer-Verlag, London, UK, pp. 237–252.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Springer Science+Business
Media.
Black, M. (1994). Recursive non-linear estimation of discontinuous flow fields, ECCV, pp. 138–
145.
Brox, T., Bruhn, A., Papenberg, N. & Weickert, J. (2004). High accuracy optical flow estimation
based on a theory for warping, in T. Pajdla & J. Matas (eds), Proceedings of the 8th
European Conference on Computer Vision (ECCV), Vol. 3024 of Lecture Notes in Computer
Science, Springer-Verlag, Prague, Czech Republic, pp. 25–36.
Burgi, P.-Y., Yuille, A. L. & Grzywacz, N. (2000). Probabilistic motion estimation based on temporal coherence, Neural Computation 12: 1839–1867.
Elad, M. & Feuer, A. (1998). Recursive optical flow estimation - adaptive filtering approach, Journal of Visual Communication and Image Representation 9: 119–138.
Horn, B. K. P. & Schunck, B. G. (1981). Determining optical flow, Artificial Intelligence 17: 185–204.
Isard, M. & Blake, A. (1998). Condensation - conditional density propagation for visual track-
ing, IJCV: International Journal of Computer Vision 29: 5–28.
Jähne, B. (1997). Digitale Bildverarbeitung, Springer-Verlag Berlin Heidelberg.
Gibson, J. J. (1950). The perception of the visual world, Houghton Mifflin Company, Boston, MA.
Kay, S. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice Hall,
Englewood Cliffs, NJ.
Kitagawa, G. (1994). The two-filter formula for smoothing and an implementation of the
gaussian-sum smoother, Annals Institute of Statistical Mathematics 46(4): 605–623.
Lucas, B. D. & Kanade, T. (1981). An iterative image-registration technique with an application to stereo vision, Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), Vancouver, Canada, pp. 674–679.
Memin, E. & Perez, P. (1998). A multigrid approach for hierarchical motion estimation, Pro-
ceedings of the Sixth International Conference on Computer Vision (ICCV), Bombay, India,
pp. 933–938.
Rosenberg, Y. & Werman, M. (1997). A general filter for measurements with any probability
distribution, Proceedings of the IEEE International Conference on Computer Vision and
Pattern Recognition (CVPR), IEEE Computer Society, San Juan, Puerto Rico, pp. 106–
111.
Roth, S. & Black, M. (2005). On the spatial statistics of optical flow, ICCV, pp. 42–49.
Simoncelli, E. (1993). Distributed Representation and Analysis of Visual Motion, PhD thesis, MIT
Department of Electrical Engineering and Computer Science.
Simoncelli, E. (1999). Handbook of Computer Vision and Applications, Academic Press, chapter
Bayesian Multi-Scale Differential Optical Flow, pp. 397–421.
Simoncelli, E. (2003). Local analysis of visual motion, The Visual Neuroscience, MIT Press.
Simoncelli, E., Adelson, E. & Heeger, D. (1991). Probability distributions of optical flow, CVPR,
pp. 310–315.
Singh, A. (1990). An estimation-theoretic framework for image-flow computation, Proceedings
of the Third International Conference on Computer Vision, Osaka, Japan, pp. 168–177.
Singh, A. (1991). Incremental estimation of image flow using a kalman filter, IEEE Workshop
on Visual Motion, pp. 36–43.
Dynamic Visual Motion Estimation 303
Stocker, A. A. & Simoncelli, E. P. (2006). Noise characteristics and prior expectations in human
visual speed perception, Nature Neuroscience 9(4): 578–585.
Weber, J. & Malik, J. (1995). Robust computation of optical flow in a multi-scale differential
framework, IJCV: International Journal of Computer Vision 14(1): 5–19.
Weiss, Y. (1993). Bayesian Motion Estimation and Segmentation, PhD thesis, MIT Department of
Brain and Cognitive Science.
Weiss, Y. & Fleet, D. (2002). Velocity likelihoods in biological and machine vision, Probabilistic
Models of the Brain: Perception and Neural Function, MIT Press, pp. 77–96.
Willert, V., Toussaint, M., Eggert, J. & Körner, E. (2007). Uncertainty optimization for robust
dynamic optical flow estimation, ICMLA, pp. 450–457.
Willert, V., Toussaint, M., Eggert, J. & Körner, E. (2008). Probabilistic exploitation of the lucas
and kanade smoothness constraint, ICMLA, pp. 259–266.
Wu, Q. X. (1995). A correlation-relaxation-labeling framework for computing optical flow -
template matching from a new perspective, IEEE Trans. PAMI 17(3): 843–853.
Yedidia, J., Freeman, W. & Weiss, Y. (2003). Exploring Artificial Intelligence in the New Millen-
nium, Morgan Kaufmann, chapter Understanding Belief Propagation and Its Gener-
alizations, pp. 239–236.
Zelek, J. (2002). Bayesian real-time optical flow, Proceedings of the 15th International Conference
on Vision Interface (VI), Calgary, Canada, pp. 310–315.
Zetzsche, C. & Krieger, G. (2001). Nonlinear mechanisms and higher-order statistics in bio-
logical vision and electronic image processing: Review and perspectives, Journal of
Electronic Imaging 10(1): 56–99.
304 New Advances in Machine Learning
19

Concept Mining and Inner Relationship Discovery from Text
1. Introduction
From the cognitive point of view, knowing concepts is a fundamental ability by which human beings understand the world. Most concepts can be lexicalized via words in a natural language and are called Lexical Concepts. Currently, there is much interest in acquiring knowledge from text automatically, in which concept extraction, verification, and relationship discovery are the crucial parts (Cao et al., 2002). A wide range of other applications can also benefit from concept acquisition, including information retrieval, text classification, and Web search (Ramirez & Mattmann, 2004; Zhang et al., 2004; Acquemin & Bourigault, 2000).
Most related efforts in concept mining have centered on term recognition. The commonly used approaches are mainly based on linguistic rules (Chen et al., 2003), statistics (Zheng & Lu, 2005; Agirre et al., 2004), or a combination of both (Du et al., 2005; Velardi et al., 2001). In our research, we recognize that concepts are not just terms. Terms are domain-specific while concepts are general-purpose. Furthermore, terms are restricted to only a few kinds of concepts, such as named entities. So even though we can benefit a lot from term recognition, we cannot use it directly to learn concepts.
Other relevant work in concept mining focuses on concept extraction from documents. Gelfand developed a method based on the Semantic Relation Graph to extract concepts from a whole document (Gelfand et al., 1998). Nakata described a method to index important concepts described in a collection of documents belonging to a group, in order to share them (Nakata et al., 1998). A major difference between their work and ours is that we want to learn a huge number of concepts from a large-scale raw corpus efficiently, rather than from one or several documents; document-level analysis would lead to a much higher time complexity and does not serve our purpose.
There are many types of relationships between lexical concepts, such as antonymy, meronymy and hyponymy, among which the hyponymy relationship has attracted much research effort because of its wide use. There are three mainstream approaches to discover general hyponymy relations automatically or semi-automatically: the Symbolic approach, the Statistical approach and the Hierarchical approach (Du & Li, 2006). The Symbolic approach, which depends on lexico-syntactic patterns, is currently the most popular technique (Hearst, 1992; Liu et al., 2005; Liu et al., 2006; Ando et al., 2003). Hearst (Hearst, 1992) was one of the early researchers to extract hyponymy relations from Grolier's Encyclopedia by matching 4 given lexico-syntactic patterns, and more importantly, she discussed extracting lexico-syntactic patterns from existing hyponymy relations. Liu (Liu et al., 2005; Liu et al., 2006) used the "isa" pattern to extract Chinese hyponymy relations from an unstructured Web corpus, which proved to have promising performance. Zhang (Zhang et al., 2007) proposed a method to automatically extract hyponymy from Chinese domain-specific free text by three symbolic learning methods. The Statistical approach usually adopts clustering and association rules. Zelenko et al. (Zelenko et al., 2003) introduced an application of kernel methods to extract two certain kinds of hyponymy relations with promising results, combining Support Vector Machine and Voted Perceptron learning algorithms. The Hierarchical approach tries to build a hierarchical structure of hyponymy relations. Caraballo (Caraballo, 1999) built a hypernymy hierarchy of nouns via a bottom-up hierarchical clustering technique, which was akin to the manually constructed hierarchy in WordNet.
In this paper, we use both linguistic rules and statistical features to learn lexical concepts from raw texts. First, we extract a mass of concept candidates from text using lexico-patterns and confirm a part of them to be concepts according to their matched patterns. For the remaining candidates, we induce a Concept Inner-Constructive Model (CICM) of words, which reveals the rules by which several words combine to construct concepts, through four aspects: (1) parts of speech, (2) syllables, (3) senses, and (4) attributes. Once the large-scale concept set is built based on the CICM model, we develop a framework to discover inner relationships within the concepts. A lexical hyponymy acquisition method is proposed based on this framework.
ID  Lexical Pattern
1   <?C1><是><一|><个|种|><?C2>
2   <?C1><、><?C2><或者|或是|以及|或|等|及|和|与><其他|其它|其余>
3   <?C1><、><?C2><等等|等><?C3>
4   <?C1><如|象|像><?C2><或者|或是|或|及|和|与|、><?C3>
5   <?C1><、><?C2><是|为><?C3>
6   <?C1><、><?C2><各|每|之|这><种|类|些|样|流><?C3>
7   <?C1><或者|或是|或|等|及|和|与><其他|其它|其余><?C2>
8   <?C1><或者|或是|或|及|和|与><?C2><等等|等><?C3>
9   <?C1><中|里|内|><含|含有|包含|包括><?C2>
10  <?C1>由<?C2><组成|构成>
Table 1. The Lexico-Patterns for Extracting Concepts from Text
Here is an example showing how to extract concepts from text using lexico-patterns:
Example 1. Lexico-Pattern_No_1 {
Pattern: <?C1><是><一|><个|种><?C2>
Restrict Rules:
not_contain(<?C2>, <!标点>) ^ length_greater_than(<?C1>, 1) ^ length_greater_than(<?C2>, 1) ^
length_less_than(<?C1>, 80) ^ length_less_than(<?C2>, 70) ^
not_end_with(<?C1>, <这|那>) ^ not_end_with(<?C2>, <的|而已|例子|罢了>) ^
not_begin_with(<?C2>, <这|的|它|他|我|那|你|但>) ^
not_contain(<?C2>, <这些|那些|他们|她们|你们|我们|他|它|她|你|谁>) }
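As an illustration, this pattern-plus-restrictions machinery can be sketched in Python. The regex simplification of pattern 1 and all function names below are ours, not the chapter's:

```python
import re

# Hypothetical predicates mirroring some of the Restrict Rules of Example 1;
# the chapter's actual rule engine is not specified, so this is a sketch.
def not_contain(chunk, forbidden):
    return not any(tok in chunk for tok in forbidden)

def length_between(chunk, lo, hi):
    # length_greater_than(c, lo) ^ length_less_than(c, hi)
    return lo < len(chunk) < hi

def accept_candidate(c1, c2):
    """Apply the conjunction (^) of restrict rules to a (<?C1>, <?C2>) pair."""
    return (length_between(c1, 1, 80)
            and length_between(c2, 1, 70)
            and not c1.endswith(("这", "那"))
            and not c2.endswith(("的", "而已", "例子", "罢了"))
            and not c2.startswith(("这", "的", "它", "他", "我", "那", "你", "但"))
            and not_contain(c2, ["这些", "那些", "他们", "她们", "你们",
                                 "我们", "他", "它", "她", "你", "谁"]))

# Simplified rendering of lexico-pattern 1: <?C1><是><一|><个|种><?C2>
PATTERN_1 = re.compile(r"(.+?)是一?[个种](.+)")

def extract_candidates(sentence):
    m = PATTERN_1.match(sentence)
    if m and accept_candidate(m.group(1), m.group(2)):
        return m.group(1), m.group(2)
    return None
```

For example, `extract_candidates("北京是一个城市")` yields the pair `("北京", "城市")`, while chunks violating the length or boundary rules are rejected.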
How do we devise good patterns to get as many concepts as possible? We summarize the following criteria from our experiments:
(1) High accuracy criterion. Concepts are distributed in sentences according to linguistic rules, so each pattern should properly reflect at least one of these rules. We believe we must first know the linguistics well if we want to create good patterns.
(2) High coverage criterion. We want to get as many concepts as possible. Classifying all concepts into groups by their characteristics (e.g. concepts that describe physical objects and concepts that describe time) is a good methodology for designing patterns that capture more concepts.
Hypothesis 1. A chunk ck extracted using the lexico-patterns in Section 2.1 is a concept if (1) ck has been matched by sufficiently many lexico-patterns, or (2) ck has been matched sufficiently many times.
To test our hypothesis, we randomly draw 10,000 concept candidates from all the chunks and verify them manually. The association between the possibility of a chunk being a concept and its matched patterns is shown in Fig. 1:
Fig. 1. Association between the number of lexico-patterns / the times matched by all the patterns of chunks and their possibility of being concepts
The left chart supports our hypothesis that chunks which match more patterns are more likely to be concepts, and the right chart shows that the frequency of a chunk also works well to tell concepts from candidate chunks. In our experiments, we set the threshold on the number of matched patterns to 5 and the threshold on matching frequency to 14, and single out about 1.22% of all candidate chunks as concepts with a precision of 98.5%. While we are satisfied with the accuracy, the recall is rather low. So in the next step, we develop CICMs to recognize more concepts from the chunks.
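The confirmation step of Hypothesis 1 is then a simple threshold filter. A sketch using the two thresholds reported in this section (the data layout is an assumption of ours):

```python
# Sketch of the candidate-confirmation step of Hypothesis 1.
# `matches` maps each candidate chunk to (set of matching pattern IDs,
# total match frequency); both names are illustrative.
PATTERN_THRESHOLD = 5    # distinct lexico-patterns matched
FREQ_THRESHOLD = 14      # total times matched by all patterns

def confirm_concepts(matches):
    """Single out chunks that are confirmed concepts under Hypothesis 1."""
    confirmed = set()
    for chunk, (pattern_ids, freq) in matches.items():
        # condition (1): enough distinct patterns, OR (2): enough matches
        if len(pattern_ids) >= PATTERN_THRESHOLD or freq >= FREQ_THRESHOLD:
            confirmed.add(chunk)
    return confirmed
```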
linguistics (Lu et al., 1992) and the cognitive process by which human beings create lexical concepts (Laurance & Margolis, 1999). Some examples will be given to illuminate Hypothesis 2 after we present the definition of CICM.
Definition 1. The word model W = <PS, SY, SE, AT> of a word w is a 4-tuple where (1) PS is all the parts of speech of w; (2) SY is the number of syllables of w; (3) SE is the senses of w in HowNet; and (4) AT is the attributes of w.
Word models are integrated information entities for modeling words. The reason for choosing the four elements listed above will be clarified when we construct CICMs.
Definition 2. Given a concept cpt = w1 … wi-1 wi wi+1 … wn with n words, the C-Vector of the word wi towards cpt is an n-tuple:
C-Vector(wi) = <i, W1, …, Wi-1, Wi+1, …, Wn>
where Wk is the word model of wk. The C-Vector of a word stands for one constructive rule when the word forms concepts by linking to other words, and i is its position in the concept. A word can have the same C-Vector towards many different concepts. The C-Vector is the basis of the CICM.
Definition 3. The Concept Inner-Constructive Model (CICM) of a word w is a bag of C-Vectors, in which each C-Vector is produced by a set of concepts containing w.
Essentially, the CICMs of words represent the constructive rules by which they form concepts. Among the four elements of word models, PS and SY embody syntactical information, which plays a significant role when forming concepts in Chinese (Lu et al., 1992) and is universal for all types of words. SE and AT reveal the semantic information of words and are also indispensable. HowNet is an elaborate semantic lexicon that has attracted much attention in related works (Dong & Dong, 2006). But some words are still missing from it, so we introduce attributes as a supplement. Attributes can tell the semantic differences between concepts at the quantitative or qualitative level. Tian has developed a practicable approach to acquiring attributes from large-scale corpora (Tian, 2007).
Note that we omit the details of each word vector for simplicity. Taking "国民 生产 总值" as an example, the full C-Vector is:
<2,
 <{n}, 2, {属性值, 归属, 国, 人, 国家}, {有组成, 有数量}>,
 <{n}, 2, {数量, 多少, 实体}, {有值域, 是抽象概念}>>
Algorithm: CICM Instance Learning
(1) Initialize the resources, including (1.1) a word dictionary in which each word has its full parts of speech; (1.2) the HowNet dictionary; and (1.3) an attribute base of words (Tian, 2007).
(2) Construct a model set MSet to accommodate all the words' models, which is empty initially.
(3) For each concept cpt in the training set, segment it and create each word's C-Vector(wi). Subsequently, if C-Vector(wi) ∈ MSet(wi), then just accumulate its frequency; otherwise add C-Vector(wi) to MSet(wi).
(4) Remove the C-Vectors which have low frequency from each word's MSet.
Based on experiments, we choose 10% of the number of concepts containing the word in the training set as the frequency threshold. We exclude the vectors with low frequency; that is, if a C-Vector of a word is supported by just a few concepts, we regard it as an exception.
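A minimal sketch of this learning loop, treating C-Vectors as hashable tuples. The segmenter and the three resources are passed in as parameters, since the chapter does not specify their formats:

```python
from collections import defaultdict

def word_model(w, pos_dict, hownet, attr_base):
    """Build the 4-tuple word model <PS, SY, SE, AT> from the resources."""
    return (frozenset(pos_dict.get(w, ())), len(w),
            frozenset(hownet.get(w, ())), frozenset(attr_base.get(w, ())))

def learn_cicms(concepts, segment, pos_dict, hownet, attr_base, ratio=0.10):
    mset = defaultdict(lambda: defaultdict(int))  # word -> C-Vector -> freq
    support = defaultdict(int)                    # word -> #concepts containing it
    for cpt in concepts:
        words = segment(cpt)
        models = [word_model(w, pos_dict, hownet, attr_base) for w in words]
        for i, w in enumerate(words):
            # C-Vector(wi) = <i, W1..Wi-1, Wi+1..Wn>: position plus the
            # word models of the other words in the concept
            cvec = (i + 1, tuple(models[:i] + models[i + 1:]))
            mset[w][cvec] += 1
            support[w] += 1
    # Step (4): drop C-Vectors supported by fewer than `ratio` of the
    # concepts containing the word (10% in the chapter's experiments).
    return {w: {v: f for v, f in vecs.items() if f >= ratio * support[w]}
            for w, vecs in mset.items()}
```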
Concept Mining and Inner Relationship Discovery from Text 311
But unfortunately, even though we know "药品 生产 (pharmaceutical production)" is a concept, our system still cannot tell whether "药品 制造 (pharmaceutical manufacture)" is also a concept, because there is no CICM for the word "制造". The reason is that the system cannot yet make use of word similarity. Therefore, we need to cluster words based on the similarity of their CICMs and then learn more new concepts.
The commonly used similarity measures for two sets include minimum distance, maximum distance, and average distance. Considering that there is still some noise in our training set, which results in some wrong C-Vectors in the CICMs, we choose the average distance because it is more stable on noisy data, that is:

sim(w1, w2) = (1 / |CICM(w1)|) · Σ_{veci ∈ CICM(w1)} sim(veci, CICM(w2))
            = (1 / (|CICM(w1)| · |CICM(w2)|)) · Σ_{veci ∈ CICM(w1)} Σ_{vecj ∈ CICM(w2)} sim(veci, vecj)    (3)
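Equation (3) reduces to the mean over all pairwise C-Vector similarities. A sketch, with the pairwise measure `sim_vec` passed in (its cosine-based form is defined later in the chapter, so it is stubbed here):

```python
def cicm_similarity(cicm1, cicm2, sim_vec):
    """Average-distance similarity between two CICMs, as in equation (3).

    cicm1, cicm2: collections of C-Vectors; sim_vec(v1, v2): a pairwise
    similarity in [0, 1] whose exact form is assumed, not prescribed here.
    """
    if not cicm1 or not cicm2:
        return 0.0
    total = sum(sim_vec(v1, v2) for v1 in cicm1 for v2 in cicm2)
    return total / (len(cicm1) * len(cicm2))
```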
Now the problem is how to calculate the similarity between two C-Vectors of two words. For two C-Vectors:
and Wk = if there is no word model in position k for both of them. We adopt the cosine
f_B^w(w0) = f_B(w0, w)    (7)

which is a function proportional to the similarity of w0 and w, and reveals the degree of influence of w0 over w. Commonly used influence functions include the Square Wave Function and the Gauss Function. The former is suitable for data that are distinctly dissimilar, while the latter is more suitable for reflecting the smooth influence of w0. Because a word is related to many other words to different degrees, not simply 1 or 0, in a corpus, it is more reasonable to choose the Gauss Influence Function:

f_Gauss^w(w0) = e^(−(1 − sim(w0, w))² / (2σ²))    (8)
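Numerically, the Gauss Influence Function can be sketched as follows (the bandwidth σ is an assumed parameter; the chapter does not state its value):

```python
import math

def gauss_influence(sim_w0_w, sigma=0.5):
    """Gauss Mutual Influence of equation (8): equals 1 when w0 and w are
    identical in CICM similarity, and decays smoothly as similarity drops."""
    return math.exp(-((1.0 - sim_w0_w) ** 2) / (2.0 * sigma ** 2))
```

Because the expression depends only on sim(w0, w), which is symmetric, the influence of w0 over w equals that of w over w0, matching the "mutual" property claimed below.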
We call Equation (8) the Gauss Mutual Influence of w and w0, since f_Gauss^w(w0) = f_Gauss^{w0}(w). It links each word with many other words to some extent, and according to it we can cluster words into groups. Before giving the definition of a word group, we first develop some definitions for further discussion:
(1)Taking = * and for all the words w perform the following operation:
cur= *;cwcur={w};
while( cur < 1){
cwpre=cwcur;
Definition 6. For a chunk ck = w0 … wn, the Local C-Vector of a word wi in it is:
L_Vector(wi, ck) = <i, W0, …, Wi-1, Wi+1, …, Wn>.
Theorem 1. For a chunk ck = w0 … wn, if for each word wi in it L_Vector(wi, ck) ∈ CICM(gwi), then ck is a concept, where gwi is the similar-word group of wi.
The <suf> here denotes a common suffix of all the concepts. We may think of <suf> as a hypernymy concept, and the following relations may exist:
HISA(<cpt1>, <suf>), HISA(<cpt2>, <suf>), …, HISA(<cptn>, <suf>)
For instance, given S={炭疽活菌苗, 冻干鼠疫活菌苗, 结核活菌苗, 自身菌苗, 外毒素菌苗}, we
can segment the concepts as follows,
<自身菌苗>=<自身><菌苗>,
<外毒素菌苗>=<外毒素><菌苗>,
<结核活菌苗>=<结核活><菌苗>,
<炭疽活菌苗>=<炭疽活><菌苗>,
<冻干鼠疫活菌苗>=<冻干鼠疫活><菌苗>,
where the corresponding hypernymy concept suffix <suf> is <菌苗> and all the HISA relations hold. However, if we consider the suffix chunk <苗> to be <suf> instead of <菌苗> (i.e. we segment the concept <外毒素菌苗> := <外毒素菌><苗>), none of the HISA relations holds. Moreover, the suffix <苗> cannot even be considered a concept. We also notice that a subset S' = {结核活菌苗, 炭疽活菌苗, 冻干鼠疫活菌苗} of S shares a longer common hypernymy concept <活菌苗>, yielding the lexical hyponymy relations HISA(结核活菌苗, 活菌苗), HISA(炭疽活菌苗, 活菌苗) and HISA(冻干鼠疫活菌苗, 活菌苗).
We will investigate such common suffixes in a concept set and mine lexical hyponymy by taking advantage of common-suffix features. There is a limitation in this approach: the concept set should be very large in order to find such common chunks. In an extreme case, we can extract nothing if there is only one concept in the set, even if that concept contains rich lexical hyponymy relations. However, there is no definition of how large is large enough, and we will analyze this factor in the experiment section.
statistical acquisition model [16] to extract concepts from a web corpus, which results in a large-scale concept set, and then cluster the concepts into a common suffix tree according to their suffixes. The suffix analysis module uses a set of statistics-based rules to analyze suffix nodes. Class concept candidates that are identified as concepts by our Google-based verification module are used to enlarge the original concept set. A class concept verification process is taken to verify the class concept candidates. Human-judgment-based relation verification is performed after a prefix clustering process dedicated to reducing the verification cost. Finally, we obtain the extracted hyponymy relations from the common suffix tree in a hierarchical structure.
Definition 7. A common suffix tree (CST) containing m concepts is a tree with exactly m leaves. Each inner node, other than a leaf, has at least two children and contains a single Chinese gram. Each leaf indicates a concept whose longest shared suffix equals the string leading from the leaf to the root. Along the path, the string from each inner node to the root is a shared suffix of the concepts indicated by the leaves it can reach.
With a CST, not only are we able to find the longest shared suffix, we can also find which concepts share a certain common suffix. The following CST clustering algorithm helps us construct a CST in linear time complexity:
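One way to obtain this linear-time behaviour is a gram-level suffix trie, inserting each concept from its last gram backwards. This is a simplified stand-in of ours, since the chapter's own algorithm is not reproduced here:

```python
class CSTNode:
    def __init__(self, gram=""):
        self.gram = gram          # single Chinese gram stored at this node
        self.children = {}        # gram -> CSTNode
        self.concepts = []        # concepts whose full reversed path ends here

def build_cst(concepts):
    """Cluster concepts into a common suffix tree by inserting each
    concept gram-by-gram from its last character toward its first.
    Total work is linear in the total number of grams."""
    root = CSTNode()
    for cpt in concepts:
        node = root
        for gram in reversed(cpt):
            node = node.children.setdefault(gram, CSTNode(gram))
        node.concepts.append(cpt)
    return root

def shared_suffix_concepts(root, suffix):
    """Return all concepts in the tree that end with `suffix`."""
    node = root
    for gram in reversed(suffix):
        if gram not in node.children:
            return []
        node = node.children[gram]
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        out.extend(n.concepts)
        stack.extend(n.children.values())
    return out
```

For example, building the tree over {理工大学, 科技大学, 中学} lets us query every concept sharing the suffix <大学> or <学> directly.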
7. Suffix Analysis
Given the "学" cluster in the example above, the members of the suffix collection S = {第十六中学, 十六中学, 六中学, 中学} may all be hypernymy concepts we are interested in, without any other supporting information (1-gram suffixes cause great ambiguity, so we leave them alone in our system). Some suffix concepts could be extracted by Chinese word segmentation systems [20]; however, no segmentation system is adopted in our system, because such systems perform poorly on a large-scale general-purpose concept set, where many suffixes cannot be segmented correctly, which would lower the performance of the entire system.
However, some useful statistical features can be obtained within a concept set to identify class concepts. For a suffix chunk <ck> in the concept set, we may have patterns such as CNT[<ck><*>], CNT[<*><ck>], CNT[<*><ck><*>], etc., where CNT[<pattern>] means the frequency of <pattern> in the concept set. Examples of such patterns are listed in Table 3.
Pattern                  Example
(1) ISCpt[<ck>]          <大学> ∈ S
(2) CNT[<ck><*>]         <大学>学生服务部, <大学>校区, …
(3) CNT[<*><ck>]         理工<大学>, 科技<大学>, …
(4) CNT[<*><ck><*>]      北京<大学>学生会, 中国<大学>评估组, …
Table 3. Statistical patterns and examples
Pattern (1) is not a real statistic: once it appears in the given concept set, it proves that the indicated suffix <ck> is a class concept candidate. If the concept set were large enough (i.e. for any <cpt>, <cpt> ∈ S always holds), this single rule could be used to identify all class concept candidates. In practice, our concept set can never be that large.
The emergence of patterns (2) and (3) is a strong indication of a class concept, which usually can be a component of other words. The class concept <大学 (university)> can be used as a limitation of other concepts, such as <大学校区 (university campus)>, which indicates a special kind of <校区 (campus)>. Experiments in the following content show that once pattern (2) or (3) appears, the empirical probability of <ck> being a concept is very high.
The information embedded in pattern (4) is richer. We rewrite CNT[<*><ck>] as fsuf(<ck>), called the suffix frequency. The ith suffix of a concept <cpt> = <xm>…<x1> (i.e. <xi>…<x1>), where <xi> is a single gram, is denoted as Suf(<cpt>, i), and m is the length of <cpt>. The suffix probability Ssuf(<cpt>, n) is defined as:

Ssuf(<cpt>, n) = psuf(<xm>…<x1>, <xn+1><xn>…<x1>)
              = fsuf(Suf(<cpt>, n+1)) / fsuf(Suf(<cpt>, n))    (9)

where n ≤ Length(<cpt>) − 1 and psuf(<ck1>, <ck2>) is the joint probability of chunks <ck1> and <ck2>. We define that single-gram concepts have no suffix probability.
Fig. 4. Case study: suffix probability of 4, 5, 6, 8, 9-gram concepts respectively.
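Suffix frequency, suffix probability, and the inflexion rule of Definition 8 below can be sketched directly over a concept set. All function names are ours, and `max_len` is an assumed cap on suffix length:

```python
from collections import Counter

def suffix_frequencies(concepts, max_len=10):
    """f_suf(<ck>): how many concepts in the set end with chunk <ck>."""
    freq = Counter()
    for cpt in concepts:
        for n in range(1, min(len(cpt), max_len) + 1):
            freq[cpt[-n:]] += 1
    return freq

def suffix_probability(cpt, n, freq):
    """S_suf(<cpt>, n) = f_suf(Suf(cpt, n+1)) / f_suf(Suf(cpt, n)), eq. (9)."""
    return freq[cpt[-(n + 1):]] / freq[cpt[-n:]]

def inflexion_candidates(cpt, freq):
    """Suffix Probability Inflexion Rule: whenever S_suf(<cpt>, n) hits a
    local maximum over n, propose Suf(<cpt>, n+1) as a class concept
    candidate (reading 'inflexion' as a local peak, per the Fig. 4 cases)."""
    probs = [suffix_probability(cpt, n, freq) for n in range(1, len(cpt))]
    out = []
    for i in range(1, len(probs) - 1):
        if probs[i] > probs[i - 1] and probs[i] > probs[i + 1]:
            out.append(cpt[-(i + 2):])   # probs[i] is S_suf(cpt, i+1)
    return out
```

On a toy set where <大学> and <理工大学> are frequent suffixes, the rule proposes <理工大学> for the concept <北京理工大学>, mirroring the 投资基金 case discussed next.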
Figure 4 shows some cases of suffix probability for concepts composed of 4 to 9 grams. These cases illustrate how suffix probability changes with a varying number of grams.
Figure 4 (V) shows the change of suffix frequency of the concept <cpt> := "混合型证券投资基金" in a concept set with a size of 800,000, and Figure 4 (U) shows the situation when <cpt> := "流行性感冒病毒感染". For instance, from case (S) we observe that S(<cpt>, 3) = 0.99667 is the maximum among S(<cpt>, 2), S(<cpt>, 3) and S(<cpt>, 4), and at the same time Suf(<cpt>, 4) (i.e. 投资基金) is a class concept. The same situation can be found at the maximum points S(<cpt>, 5) and S(<cpt>, 8), where Suf(<cpt>, 6) and Suf(<cpt>, 8) are both class concepts. In the other case, when <cpt> = "流行性感冒病毒感染", we find the same phenomenon: class concepts happen to appear at inflexions, which makes us believe this to be a useful rule. The rule proved to be very effective in later experiments and is defined as follows:
Definition 8. (Suffix Probability Inflexion Rule) In a large-scale concept set, whenever the suffix probability S(<cpt>, n) encounters an inflexion, the suffix Suf(<cpt>, n+1) = <wn+1><wn>…<w1> is considered to be a class concept candidate. This is called the Inflexion Rule.
The suffix probability inflexion rule is derived from empirical study, and its hidden theoretical support is mutual information: the higher S(<cpt>, n) is, the higher the mutual information between the suffixes <wn>…<w1> and <wn+1><wn>…<w1>, which implies a close correlation; a sudden reduction in mutual information indicates a differentiation in linguistic usage.
Based on the discussions above, we summarize three Suffix Concept Identification (SCI) rules:
1. If pattern ISCpt[<wx>] appears, then <wx> must be a concept.
2. If pattern CNT[<wx><*>] or CNT[<*><wx><*>] appears, then <wx> can be a concept.
3. The Suffix Probability Inflexion Rule.
The experimental baseline comparisons among the three rules are listed in Table 4. We apply the SCI rules to an 800,000-concept set with 300 test cases, manually extracting all the class concept candidates in the test cases, denoted by cm. Then we use the SCI rules to extract class concepts, denoted by ca. We adopt the following evaluation measurements in the baseline experiment:
Precision = |ca ∩ cm| / |ca|
Recall = |ca ∩ cm| / |cm|
The average value and standard deviation of precision and recall are computed over 5 baseline schemes. Rules based on (1), (2) or their combination have a low recall, although with a high precision, as a result of data sparsity. However, rule (3) holds a high precision and at the same time has a promising recall once combined with the other two rules.
             Precision              Recall
             Average    Std. Dev   Average    Std. Dev
Rule(1)      100%       0          -          n/a
Rule(2)      95.753%    0.4603     -          n/a
Rule(3)      98.641%    0.1960     65.125%    2.393
Rule(1,2)    96.561%    0.5133     -          n/a
Rule(1,2,3)  98.145%    0.5029     66.469%    2.792
Table 4. SCI Rules Baseline Comparison (- means the value is lower than 5%).
Definition 10. For a concept <cpt>, the pattern frequency is defined as f(Pattern(<cpt>)), where Pattern(<cpt>) is the result of applying the concept to a certain pattern. The pattern association, denoted p(Pattern(<cpt>)), is defined as the pattern frequency of the concept divided by its frequency.
To verify class concepts, pattern associations can be used as attributes to train a classifier with machine learning algorithms. However, according to the linguistic properties of the three classes, a given concept is likely to associate well with only one pattern in each class. Therefore we only use the pattern with the maximum pattern association in each class. We use a linear combination to sum the pattern associations of all three classes into a scoring function, which proved more effective than adopting three separate attributes.
The three classes of patterns are assigned different class weights wI, wII, wIII, which can be used to adjust the score according to linear analysis methods. Besides, we take the frequency of the concept as a coefficient of the score, which reflects that a concept with a higher frequency is more likely to be a class concept. Summing all the effects above, the expression for scoring a concept <cpt> is:
To obtain a score threshold for identifying class concepts, we first annotate a training set of 3000 concepts, including 1500 class concepts and 1500 non-class concepts. We then use Google to retrieve the pattern associations of the training set, and the pattern associations are combined into a score. We use a linear analysis method to adjust the class weights so as to maximize the scoring function, and finally we obtain a score threshold. Concepts that exceed the given threshold are classified as class concepts and vice versa. In our experiments, the class concept classifier we built achieves a remarkably high accuracy of 95.52%.
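Under our reading of the description above (concept frequency times the weighted sum of per-class maximum pattern associations; the published expression itself is not reproduced here), the scorer can be sketched as:

```python
def score_concept(freq, assoc_by_class, weights):
    """Sketch of the scoring function described in the text.

    freq: frequency of the concept (used as a coefficient);
    assoc_by_class: for each of the three pattern classes, the list of
    pattern associations of the concept; weights: (wI, wII, wIII).
    Only the maximum association in each class contributes.
    """
    return freq * sum(w * max(assocs)
                      for w, assocs in zip(weights, assoc_by_class))

def is_class_concept(freq, assoc_by_class, weights, threshold):
    """Concepts scoring above the learned threshold are class concepts."""
    return score_concept(freq, assoc_by_class, weights) >= threshold
```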
9. Prefix Clustering
Due to the nature of lexical hyponymy relations, they hardly appear in other sources such as text corpora and web corpora, which makes human judgment a compulsory step in the relation verification process. In a large-scale concept set, the number of lexical hyponymy relations is huge, and it becomes a misery if we need to verify each relation manually.
Consider a concept sub-set S = {<京津塘高速公路>, <长株潭高速公路>, <京石高速公路>, <京承高速公路>, <信息高速公路>} with the suffixes Suf={<*>, 4} and Suf={<*>, 2}, where <*> denotes the wildcard of concepts. The hyponymy relation within the term <信息高速公路 (Information Highway)> is different from the others: since this concept is a kind of metaphor, there is no real lexical hyponymy relation. If we can cluster the relations into meaningful groups, such as a metaphor group and a non-metaphor group, it becomes possible to verify part of each relation group instead of all relations.
We notice that a prefix <pref> of a concept <cpt> = <pref><suf> is typically a term that forms parts of other concepts in our concept set. Given a <pref>, H(<pref>) denotes all chunks that appear before <pref> in other concepts and T(<pref>) denotes all chunks that appear after <pref> in other concepts. These two pieces of statistical information, provided by the concept-set context, can be used to define the similarity of two prefixes.
Definition 11. Prefix Similarity is a quantity measuring the similarity of two prefixes within a concept-set context. It is the average of the crossover coefficients of Head Similarity and Tail Similarity:

Sim(x, y) = (1/2) · ( |H(x) ∩ H(y)| / min(|H(x)|, |H(y)|) + |T(x) ∩ T(y)| / min(|T(x)|, |T(y)|) )    (12)
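Equation (12) translates directly into set operations over the head and tail context sets (treating an empty context as contributing zero, an assumption the text leaves open):

```python
def crossover(a, b):
    """Crossover coefficient: overlap normalized by the smaller set."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def prefix_similarity(heads_x, tails_x, heads_y, tails_y):
    """Prefix Similarity of equation (12): mean of the head-context and
    tail-context crossover coefficients of the two prefixes."""
    return 0.5 * (crossover(heads_x, heads_y) + crossover(tails_x, tails_y))
```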
Fig. 5. Judging cases and accuracy in prefix clustering (K-values 1, 2, 4, 8, 16)
This step is optional compared to the other modules in our framework, and it may sometimes lower the precision of the system. Figure 5 describes our judging cases and accuracy on a 1000-node sub-tree of a CST built from an 800,000 concept set. Setting the K-value to 8, we obtain an accuracy of 90.4% while judging 62% of the relation cases. Remarkably, the percentage of judged cases depends not only on the K-value but also on the structure of the target CST. Nevertheless, prefix clustering significantly improves the efficiency of human judgment during the verification phase.
SCI rules to extract the class concept candidates T', and add them to C, enlarging our original concept set. We verify the unverified candidates in T' with the Google-based verification described in Sect. 7 and obtain a class concept set T. In the lexical hyponymy relation candidate set H', we remove all relations whose hypernymy concepts are in T'−T.
Lexical hyponymy relations are generated as follows: for a given concept node <cpt>, the set {<s-cpt1>, …, <s-cptn>} denotes all the verified class concept nodes it passes through in the CST, and we have HISA(<s-cpti>, <s-cptj>) (i < j). We put all generated relations into H'. As the original concept set has changed, we update the statistical information of each node and keep performing the steps above until the status of each node remains unchanged. Finally, we cluster the prefixes according to Sect. 7 and judge one relation candidate in each cluster in H', resulting in our final hierarchical lexical hyponymy relation set H. The pseudo-code of the acquisition process is given in Figure 6.
To better illustrate this acquisition process, an example is given in Figure 7. Nodes {a, b, c, d, e, f} are suffix chunk nodes in a Common Suffix Tree; a suffix chunk node represents the lexical chunk of the string leading from the corresponding CST node to the root. In (I), we already know that b, d, e are class concept nodes and the rest are unknown nodes. Through suffix analysis, a is proved to be a non-concept and b, c are identified as class concept candidates, as shown in (II). The candidates are then verified by the class concept classifier. In (III), c is classified as a class concept and d is classified as a non-class concept. The hyponymy relation candidates are HISA(d, c), HISA(e, c), HISA(d, b), HISA(e, b) and HISA(f, b), where HISA(d, b) and HISA(e, b) are derived from the transitivity of the hyponymy relation. HISA(e, c) is judged to be a non-hyponymy relation, leading HISA(e, b) to be removed as well, as shown in (IV).
11. Experiment
11.1 Concept Extraction
The concept extraction part of our system is called Concept Extractor (CptEx), and we use the following formulae to evaluate its performance:

p = ||ma ∩ mm|| / ||ma||,  r = ||ma ∩ mm|| / ||mm||,  F-Measure = 2pr / (p + r)    (8)

where ma are the concepts CptEx extracts and mm are the ones built manually. To calculate the performance, we selected 1000 chunks from the raw corpus and labeled the concepts in them manually. We compare the results based on CICMs with those based on the Syntax Models and the POS Models, as shown in Table 6:
Having adopted CICMs to distinguish concepts among the chunks extracted by lexico-patterns, the precision rate drops to 89.1% while the recall rate rises to 84.2%. The precision rate is reduced because there are still some improper CICMs which confirm fake concepts. Compared with the POS Models, CICMs have a higher accuracy because we consider more factors to clarify the inner constructive rules rather than using parts of speech only. On the other hand, our stricter models result in a lower recall rate.
Fig. 8. Precision, Recall and F-Measure of lexical hyponymy acquisition for concept-set sizes of 10,000, 50,000, 100,000, 400,000 and 800,000
From the acquisition results shown in Fig. 8, we can see that the F-measure increases with concept-set size, from 24.93 on the 10,000-sized concept set to 78.34 on the 800,000-sized one. Precision drops slightly and recall increases significantly with a larger concept set. As the concept set grows, more statistical information emerges, and more suffix concepts are extracted as class concepts; some of these form lexical hyponymy relations, raising recall, while some other relations are invalid, lowering precision. With the 800,000-sized concept set, the precision is 93.8% and the recall reaches 67.24%. The recall could be even higher given a larger concept set.
In our concept set, we discover noise due to exocentric compounds, in which the suffix concepts are not hypernymy concepts. So far, no effort has been made to verify Chinese exocentric structures, and the difficulty of their linguistic usage makes it hard to analyze semantic relations within Chinese lexical concepts, which inevitably lowers the precision of our framework.
Single-gram hypernym concepts, such as ‘计’, are likely to cause ambiguity. In our concept set, we find a large number of concepts ending with suffixes like {“硬度计”, “光度计”, “温度计”, “速度计”, “长度计”, “高度计”}. The mutual information between “度” and “计” is very high, leading the algorithm adopting the SPI rule to wrongly mark the chunk “度计”, rather than “计”, as a class concept candidate. This problem might be solved if we could avoid the information sparsity by further enlarging the concept set.
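The suffix-boundary confusion above comes from a high pointwise mutual information between adjacent characters. A hypothetical sketch of the score involved (all counts below are invented for illustration, not corpus statistics from the chapter):

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a character bigram from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))

# A pair like "度"+"计" co-occurs in most contexts where either appears,
# so its PMI is far higher than that of a loosely associated pair.
tight = pmi(count_xy=50, count_x=60, count_y=55, total=10_000)
loose = pmi(count_xy=5, count_x=600, count_y=550, total=10_000)
```

A boundary detector driven only by such a score will prefer to keep the tightly bound bigram together, which is exactly the failure mode described above.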
The precision of the class concept verification module is an important factor in the performance of the whole system. We can further obtain a larger feature space and enhance the performance by employing advanced learning techniques such as SVM and the naïve Bayes network. The final precision of the framework is affected by our prefix clustering judgment; however, when the concept set becomes larger and thus more relations are extracted, it is inevitable for us to adopt that judgment.
Concept Mining and Inner Relationship Discovery from Text 325
12. Conclusion
We have described a new approach for the automatic acquisition of concepts from text based on Syntax Models and CICMs of concepts. This method first extracts a large number of candidate concepts using lexico-patterns, and then learns CICMs to identify more concepts accordingly. Experiments have shown that our approach is efficient and effective. We tested the method on a 160 GB free-text corpus, and the outcome indicates the utility of our method.
To discover the inner relationships of the concept set, we propose a novel approach that discovers lexical hyponymy relations in a large-scale concept set and makes the acquisition of lexical hyponymy relations possible. In this method we first cluster a concept set into a common suffix tree, and then use the proposed statistical suffix identification rules to extract class concept candidates from the inner nodes of the common suffix tree. We then design a Google-based symbolic class concept verifier. Finally, we extract lexical hyponymy relations and judge them after the prefix clustering process. Experimental results have shown that our approach is efficient and can correctly acquire most lexical hyponymy relations in a large-scale concept set.
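The first step recapped above, clustering a concept set by shared suffixes, can be sketched as a flat stand-in for one level of the common suffix tree (the function name and sample concepts are our illustration, not the chapter's implementation):

```python
from collections import defaultdict

def cluster_by_suffix(concepts, n=1):
    """Group concepts by their last n characters; each bucket corresponds to
    one suffix node whose key is a class concept candidate (e.g. '计')."""
    clusters = defaultdict(list)
    for c in concepts:
        if len(c) > n:                  # a concept cannot be its own suffix class
            clusters[c[-n:]].append(c)
    return dict(clusters)

clusters = cluster_by_suffix(["硬度计", "光度计", "温度计", "速度表"])
```

In the full method, the statistical suffix identification rules then decide which bucket keys are genuine class concepts before the Google-based verifier is applied.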
In the concept extraction part, there is still more work to be done to achieve better performance, since some improper CICMs remain. We plan to validate concepts in an open corpus, such as the World Wide Web, in the future. In relation discovery, future work will concentrate on the extraction of single-gram suffixes, which cover a large part of lexical hyponymy relations. On the other hand, an approach that automatically verifies hyponymy relations through inner cross-verification within a concept set is forthcoming.
13. References
Jacquemin, C. & Bourigault, D. (2000). Term Extraction and Automatic Indexing. Oxford
University Press, Oxford(2000)
Agirre, E.; Ansa, O.; Hovy, E. & Martinez, D. (2004). Enriching very large ontologies using
the WWW. In: Proc. of the ECAI 2004 Workshop on Ontology Learning.
Ando, M.; Sekine, S. & Ishizaki, S.(2003). “Automatic Extraction of Hyponyms from
Newspaper Using Lexicon-syntactic Patterns”, In IPSJ SIG Technical Report
2003-NL-157, pp.77-83, 2003
Cao C., Feng, Q. and et al. (2002). Progress in the Development of National Knowledge
Infrastructure. Journal of Computer Science & Technology, Vol.17, No.5, 1–16,
May, 2002.
Caraballo, S. (1999). “Automatic Construction of a Hypernym-labeled Noun Hierarchy from
Text”, In proceedings of 37th Annual Meeting of the Association for Computational
Linguistics, Maryland, pp120-126, Jun 1999.
Chen, W.; Zhu, J. & Yao, T. (2003). Automatic learning field words by bootstrapping. In: Proc.
of the JSCL. Beijing: Tsinghua University Press, 2003. pp. 67--72
Cilibrasi, R.L.& Vitanyi, P.(2007). The Google Similarity Distance. Knowledge and Data
Engineering. IEEE Transactions 19(3), pp370–383, 2007
Dong, Z. & Dong, Q. (2006). HowNet and the computation of meaning. World Scientific
Publishing Co., Inc. 2006.
Du, B.; Tian, H.; Wang, L. & Lu, R. (2005), Design of domain-specific term extractor based on
multi-strategy. Computer Engineering, 31(14):159–160, 2005
326 New Advances in Machine Learning
Du, X. & Li, M. (2006). “A Survey of Ontology Learning Research”, Journal of Software, Vol.
17 No.9 pp. 1837-1847, Sep. 2006.
Gelfand, B.; Wulfekuler, M. & Punch. W. F. (1998). Automated concept extraction from plain
text. In: AAAI 1998 Workshop on Text Categorization, pp. 13--17, Madison, WI,
1998.
Gusfield, D. (1997). Algorithms on Strings, Trees and Sequences: Computer Science and
Computational Biology. Cambridge University Press, 1997.
Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora.
Proceedings of the 14th International Conference on Computational Linguistics
(COLING), pp. 539--545, 1992.
Hinneburg, A. & Keim, D. (1998). An efficient approach to clustering in large multimedia
databases with noise. In: Proceedings of the 4th International Conference on
Knowledge Discovery and Data Mining, 1998.
Laurence, S. & Margolis, E. (1999). Concepts: Core Readings, Cambridge, Mass. MIT Press.
1999.
Liu L. & et al. (2005). Acquiring Hyponymy Relations from Large Chinese Corpus, WSEAS
Transactions on Business and Economics, Vol.2, No.4, pp.211-218, 2005
Liu, L.; Cao, C.; Wang, H.& Chen, W. (2006). A Method of Hyponym Acquisition Based on
“isa” Pattern, J. of Computer Science. pp146-151.2006
Lu, C.; Liang, Z.& Guo, A. (1992). The semantic networks: a knowledge representation of
Chinese information process. In: ICCIP’92. pp. 50--57
MacQueen,J.(1967). Some Methods for classification and Analysis of Multivariate
Observations. In proceedings of 5-th Berkeley Symposium on Mathematical
Statistics and Probability, Berkeley, University of California Press, pp281-297, 1967.
Nakata, K.; Voss, A.; Juhnke, M. & Kreifelts, T. (1998). Collaborative Concept Extraction from
Documents. In U. Reimer, editor, Proc. Second International Conference on Practical
Aspects of Knowledge Management (PAKM 98), Basel, 1998.
Ramirez, P.M. & Mattmann, C.A. (2004). ACE: improving search engines via Automatic
Concept Extraction. In: Proceedings of the 2004 IEEE International Conference(2004)
pp. 229--234.
Rydin,S. (2002). Building a hyponymy lexicon with hierarchical structures, In Proceedings of
the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX),
Philadelphia, July 2002, pp. 26-33, 2002
Tian, G. (2007). Research on Self-Supervised Knowledge Acquisition from Text based on
Constrained Chinese Corpora. A dissertation submitted to Graduate University of
the Chinese Academy of Sciences for the degree of Doctor of Philosophy. Beijing
China, May 2007.
Velardi, P.; Fabriani, P. & Missikoff, M. (2001). Using text processing techniques to
automatically enrich a domain ontology. In: Proc. Of the FOIS. New York: ACM
Press, 2001. pp. 270--284.
Yu. L. (2006). A Research on Acquisition and Verification of Concepts from Large-Scale
Chinese Corpora. A dissertation Submitted to Graduate School of the Chinese
academy of Sciences for the degree of master. Beijing China, May 2006
Zelenko, D.; Aone, C. & Richardella, A.(2003). “Kernel Methods for Relation Extraction”,
Journal of Machine Learning Research, No.3, pp .1083-1106, 2003
Zhang, C. & Hao, T. (2005). The State of the Art and Difficulties in Automatic Chinese Word
Segmentation. Journal of Chinese System Simulation. Vol.17 No.1 138–147. 2005.
Zhang, C. et al. (2007). Extracting Hyponymy Relations from domain-specific free texts, In
proceedings of the 9th Int. Conf. on Machine Learning and Cybernetics, Hong Kong,
19-22 August 2007.
Zhang, Y.; Gong, L.; Wang, Y. & Yin, Z. (2003). An Effective Concept Extraction Method for
Improving Text Classification Performance. Geo-Spatial Information Science. Vol. 6,
No.4, 2003
Zheng J. & Lu J. (2005). Study of an improved keywords distillation method. Computer
Engineering, 31:194–196, 2005
Zhou, J.; Wang, S. & Cao,C. (2007), A Google-Based Statistical Acquisition Model of Chinese
Lexical Concepts. In proceedings of KSEM07, pp243-254. 2007
Cognitive Learning for Sentence Understanding 329
20

Cognitive Learning for Sentence Understanding
1. Introduction
In the research field of natural language understanding, the sentence occupies a very prominent position in text processing. The process of sentence understanding involves computing the meaning of a sentence based on an analysis of the meanings of its individual words. Research procedures in sentence understanding examine the representations and processes that connect the identification of individual words in text reading (Cutler, 1995; Balota, 1994) with the mapping of sentence meanings to relevant mental models (Johnson-Laird, 1983) or discourse representations (Kintsch, 1988; van Eijck & Kamp, 1997).
The task of sentence understanding includes two stages: sentence parsing and semantic processing. Sentence parsing resides at the fundamental level, while semantic understanding involves lexical and higher discourse analysis. Sentence understanding has compact connections with human cognition; thus this chapter will introduce how cognitive models are integrated, together with machine learning algorithms (or models), into the procedures of sentence parsing and semantic processing.
Misyak & Christiansen (2007) revealed that statistical learning ability was a stronger
predictor of relative clause comprehension than the reading span measure, and suggested
that statistical learning may play a strong role in the accumulation of linguistic experience
relevant for sentence processing.
Moreover, within natural language comprehension and production studies, there is clear
evidence that prior experience of a given syntactic structure affects (1) comprehension of
similar structures and (2) the probability that a speaker will utter a sentence with the same
or similar structure, even when there is no meaning overlap between sentences (Ferreira &
Bock, 2006).
Syntactic priming has been described as stemming from statistical learning at the syntactic
level (Bock & Griffin, 2000; Chang et al., 2006) or at the syntactic–semantic interface (Chang
et al., 2003), which can be viewed as examples of statistical learning of information relevant
to sentence processing.
The above research works have testified to the significance of statistical learning for natural language processing, including sentence comprehension, and have also explicitly pointed out the performance bottlenecks (Monaghan et al., 2005; Dell & Bock, 2006; Misyak & Christiansen, 2007) of statistical processing technologies. Since humans, rather than computer software and hardware, are the core subjects that process and understand natural language, it is essential to survey pivotal research regarding human cognition.
With respect to syntactic and semantic processing in sentence comprehension, two main
classes of cognitive models have been proposed to account for the behavioral data: Syntax-
First and Interactive models.
Syntax-First models (Fodor, 1983; Frazier & Fodor, 1978; Kako & Wagner, 2001) claim that (1) syntax plays the main part whereas semantics is only a supporting role, (2) the parser initially builds a syntactic structure based on word category information, which is independent of lexical or semantic information, and (3) thematic role assignment takes place during a second stage. If the initial syntactic structure cannot be mapped onto the thematic structure, the final stage will require a re-analysis.
Interactive models (Bates & Mac-Whinney, 1987; MacDonald et al., 1994; Marslen-Wislon &
Tyler, 1980; Taraban & McClelland, 1988) state that syntactic and semantic processes
actually interact with each other at an early stage, and both syntax and semantics work
together to determine the meaning of a sentence. Despite the agreement that syntactic and
semantic information has to be integrated within a short period of time, the two model
classes differ in their views on the temporal structure of the integration processes.
Syntax and semantics are two indispensable properties of sentences. Eye-tracking studies (Tanenhaus & Trueswell, 1995) have supported the conclusion that syntax and semantics interact during parsing, which indicates that meaning affects early processing. These behavioral experiments have confirmed that the interactionist approach (Trueswell et al., 1994) is a rational and effective way to simulate the human parsing and semantic understanding mechanism.
Although semantic interpretations are constructed upon syntactic frames (MacDonald et al.,
1994), semantic information can influence the activation of syntactic frames. As a
consequence, syntactic and semantic analysis may influence each other.
The SRN architecture (as illustrated in Fig. 1.) includes the activations from the recurrent
layer (RL, the hidden layer) as the context layer (CL) in the input layer (IL), aiming at
processing inputs that consist of sequences of patterns of variable length. This architecture
allows the network to include information connected with all the previous steps in a
sequence in its processing of the current stage. The architecture will remember what has
gone before, forgetting gradually as it progresses through the sequence.
Fig. 1. The SRN architecture: the input layer (IL) and the context layer (CL) feed the recurrent layer (RL) through the weight matrices WRI and WRC; RL feeds the output layer (OL) through WOR; the RL activations are copied back to CL through a time delay.
Symbol   Definition
IU       A unit of the input layer
RU       A unit of the recurrent layer
CU       A unit of the context layer
OU       A unit of the output layer
|I|      The number of units in IL
|R|      The number of units in RL
|C|      The number of units in CL
|O|      The number of units in OL
WRI      The weight matrix from IL to RL
WRC      The weight matrix from CL to RL
WOR      The weight matrix from RL to OL
Table 1. Definition of SRN Symbols
The symbols in Fig. 1 are defined in Table 1: the first-order weight matrices WRI and WOR fully connect the units of the input layer (IL), the recurrent layer (RL) and the output layer (OL), respectively, as in a feed-forward multilayer perceptron (MLP). The current activities of the recurrent units RU(t) are fed back through time-delay connections to the context layer, i.e., CU(t+1) = RU(t).
Therefore, each unit in the recurrent layer is fed by the activities of all recurrent units from the previous time step through the recurrent weight matrix WRC. The context layer, which is composed of the activities of the recurrent units from the previous time step, can be viewed as an extension of the input layer to the recurrent layer. This working procedure implements the memory of the network by holding contextual information from previous time steps.
The weight matrices $W^{RI}$, $W^{RC}$ and $W^{OR}$ are presented in equations (1) to (3):

$$W^{RI} = \big( (w_1^{RI})^T, (w_2^{RI})^T, \ldots, (w_{|R|}^{RI})^T \big)^T \qquad (1)$$

$$W^{RC} = \big( (w_1^{RC})^T, (w_2^{RC})^T, \ldots, (w_{|R|}^{RC})^T \big)^T \qquad (2)$$

$$W^{OR} = \big( (w_1^{OR})^T, (w_2^{OR})^T, \ldots, (w_{|O|}^{OR})^T \big)^T \qquad (3)$$

In the above formulations, $(w_k^{RI})^T$ is the transpose of $w_k^{RI}$: for the instance of $W^{RI}$, since $w_k^{RI}$ is a row vector, $(w_k^{RI})^T$ is the column vector of the same elements. The vector $w_k^{RI} = (w_{1k}^{ri}, w_{2k}^{ri}, \ldots, w_{|I|,k}^{ri})$ represents the weights from all the input layer units to the recurrent (hidden) layer unit $RU_k$. The same convention applies to $W^{RC}$ and $W^{OR}$.
Given an input pattern at time $t$, $IU^{(t)} = (IU_1^{(t)}, IU_2^{(t)}, \ldots, IU_{|I|}^{(t)})$, and recurrent activities $RU^{(t)} = (RU_1^{(t)}, RU_2^{(t)}, \ldots, RU_{|R|}^{(t)})$, the net input $net_i^{(t)}$ and the output activity $RU_i^{(t)}$ of the $i$th recurrent unit are calculated as in equations (4) and (5):

$$net_i^{(t)} = IU^{(t)} \cdot (w_i^{RI})^T + RU^{(t-1)} \cdot (w_i^{RC})^T = \sum_{j=1}^{|I|} IU_j^{(t)} w_{ji}^{ri} + \sum_{j=1}^{|R|} RU_j^{(t-1)} w_{ji}^{rc} \qquad (4)$$

$$RU_i^{(t)} = f\big(net_i^{(t)}\big) \qquad (5)$$

For the $k$th output unit, the net input $net_k^{(t)}$ and output activity $OU_k^{(t)}$ are calculated as in equations (6) and (7):

$$net_k^{(t)} = RU^{(t)} \cdot (w_k^{OR})^T = \sum_{j=1}^{|R|} RU_j^{(t)} w_{jk}^{or} \qquad (6)$$

$$OU_k^{(t)} = f\big(net_k^{(t)}\big) \qquad (7)$$

Here, the activation function $f$ is the logistic sigmoid function (Eq. 8):

$$f(x) = \frac{1}{1 + e^{-x}} \qquad (8)$$
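Equations (4) to (8) amount to one forward step of an Elman-style SRN. The following is a minimal NumPy sketch (the class name, initialization scale and vector shapes are our assumptions, not the chapter's implementation):

```python
import numpy as np

def logistic(x):
    """Eq. (8): the logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-x))

class SRN:
    def __init__(self, n_in, n_rec, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_RI = rng.normal(scale=0.1, size=(n_rec, n_in))   # IL -> RL
        self.W_RC = rng.normal(scale=0.1, size=(n_rec, n_rec))  # CL -> RL
        self.W_OR = rng.normal(scale=0.1, size=(n_out, n_rec))  # RL -> OL
        self.context = np.zeros(n_rec)   # CU(t) = RU(t-1), initially zero

    def step(self, iu):
        net_r = self.W_RI @ iu + self.W_RC @ self.context  # Eq. (4)
        ru = logistic(net_r)                               # Eq. (5)
        net_o = self.W_OR @ ru                             # Eq. (6)
        ou = logistic(net_o)                               # Eq. (7)
        self.context = ru           # time-delay copy back to the context layer
        return ou
```

Feeding a sequence of input patterns through step() lets each output depend, through the context layer, on all earlier inputs in the sequence, which is exactly the memory mechanism described above.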
In recent years, research in natural language processing (NLP) has strived toward the elaboration of huge linguistic dictionaries and ontologies (Knight et al., 1995; Miller et al., 1990; Sugumaran & Storey, 2002), even including relations between concepts and common sense. The exploitation and implementation of such dictionaries and ontologies has fulfilled some understanding requirements.
Kapetanios et al. (2005) proposed to implement the parsing of natural language queries with an ontology, which preserved extensional semantics such as domain terms, operators and operations. Since the context of terms circumscribed by real-world semantics can be expressed by the ontology, it also alleviates the semantic parsing. The context of terms is defined by the interrelationships expressed within an ontology as well as by the intentional meaning expressed with annotations.
Considering the impacts of linguistic dictionaries and ontologies in NLP, our solution for interactionist parsing, CIParser, takes WordNet (Miller et al., 1990) as the linguistic dictionary and designs a corresponding ontology, WNOnto (as defined in Guo & Shao (2008)), referring to a W3C working draft (van Assem et al., 2006). Since nouns and verbs are more dominant in parsing sentences into phrases, they are the word types deliberately chosen for semantic analysis with WordNet. Therefore, the design of WNOnto is grounded on nouns and verbs, which also benefits time efficiency in machine learning and parsing.
Fig. 2. The architecture of CIParser: two structurally identical SRN wings, each with its own time-delay context loop.
Based on the architecture of the SRN (Fig. 1), our CIParser is designed as illustrated in Fig. 2. The left wing is a classical SRN as described in section 4: all the input units in IL are single words from the original sentences; the activations from RL of the previous time step produce the CL for the current stage; the units of IL and CL, multiplied by the matrices $W^{RI}$ and $W^{RC}$ respectively, compose the input of RL; the activations of RL multiplied by $W^{OR}$ produce the input units of OL at the current stage.
All the grammatical information is implicitly preserved in its pattern of link weights.
Moreover, there are fewer independence assumptions. The SRN itself decides what to pay
attention to and what to ignore. Statistical issues, such as combining multiple estimators or
smoothing for sparse data, are handled in the training procedure. “One-size-fits-all” is a
common feature of machine learning techniques.
The right wing is structurally identical to the left wing, except that the input units in IL include not only single words from sentences but also individual ontologies, WordOntos, produced according to WNOnto with querying results from WordNet. In other words, each input unit of IL is composed of (1) a single word and (2) a corresponding ontology (only for a noun or verb). Here, every noun or verb has been appended with its semantic information from WordNet in the ontology manner.
The syntactic structure of a natural language sentence is hierarchical: it represents how the words connect together to form constituents, such as phrases and even clauses. This structure is normally specified with a constituent tree, in which the constituents are nodes or leaves and the hierarchical structure is denoted by parent-child relationships.
In the final processing phase, “Verification and Adjustment of Parsing Results”, the parsing results of the left and right wings are verified against each other in case either wing takes too long to deliver a parsing result. When both wings produce parsing results, we follow a selection rule that the tree containing more constituents wins, which has been strictly applied in the later experiments. The use of phrases to identify structural constituents in our CIParser also offers the competence to generalize machine-learned information across structural constituents.
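The selection rule can be made concrete with a toy constituent-tree encoding (the nested-tuple encoding, function names and example trees are ours, not the chapter's):

```python
def count_constituents(tree):
    """Count internal nodes (constituents) in a nested-tuple tree such as
    ('S', ('NP', 'the', 'dog'), ('VP', 'barks')); bare strings are leaf words."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + sum(count_constituents(child) for child in tree[1:])

def select_parse(left, right):
    """Verification-phase rule: the parse with more constituents wins
    (ties here go to the left wing, an assumption of this sketch)."""
    return left if count_constituents(left) >= count_constituents(right) else right

rich = ('S', ('NP', 'the', 'dog'), ('VP', 'barks'))
flat = ('S', 'the', 'dog', 'barks')
winner = select_parse(flat, rich)
```

The richer tree wins because it identifies more phrase-level structure in the same sentence.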
Since (1) people have language processing constraints in constructions, such as center embedding (Chomsky, 1959), and (2) people can only activate a limited number of information units in memory at any one time (Miller, 1956), we introduced working memory (Baddeley & Susan, 2006) into our CIParser. Baddeley et al. (2006) defined working memory as a limited-capacity system for the temporary storage and manipulation of information for complex tasks such as comprehension, learning and reasoning. In this chapter, we add the storage task of working memory to our CIParser to simulate human processing features.
The nature of the SRN dictates that each new input of the network, a word or/and its ontology, also puts the network into a new state, which means that information is computed through all of these states in every subsequent time period. However, the constraints on the depth of center embedding (Chomsky, 1959) imply that only a limited number of these states will be referred to by the following parts of the constituent tree in any given time period.
In CIParser, we construct a queue of limited length to simulate the active units in human memory. When the SRNs arrive at a new state, this state enters the queue at the head. When a new state arrives at a queue already filled with previous network states, the oldest state leaves the queue at the tail and the new one enters at the head. This queue mechanism ensures that, when the number of states exceeds the queue length, the oldest state is forgotten. It also helps CIParser to focus on active states and to achieve precise computing results efficiently.
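The bounded queue just described maps directly onto a fixed-capacity deque; a minimal sketch (the class name is our choice):

```python
from collections import deque

class StateMemory:
    """Fixed-capacity store of recent network states: when full, pushing a
    new state silently forgets the oldest one, mimicking limited working memory."""
    def __init__(self, capacity):
        self.states = deque(maxlen=capacity)

    def push(self, state):
        self.states.append(state)   # a deque with maxlen drops the oldest itself

    def active(self):
        return list(self.states)    # the newest state is last

mem = StateMemory(capacity=3)
for s in ["s1", "s2", "s3", "s4", "s5"]:
    mem.push(s)
```

After five pushes into a capacity-3 memory, only the three most recent states remain active.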
Guo & Shao (2008) designed and conducted experiments for training and examining CIParser in sentence parsing. The experiments demonstrate that the SRN-based CIParser can be used for connectionist language learning with structured output representations.
Fig. 3. The computing model of Semantic Processing for Sentence Understanding: the input layer receives single words (Word1, Word2, …, Wordi, …) together with their ontologies (WordOnto1, WordOnto2, …, WordOntoi, …), with time-delay loops over the recurrent units RU1 … RU|R|.
The above model is able to store and retrieve different sentence-meaning constructions appropriate for different sentences. The requirement is that each individual sentence should yield a unique construction index. The construction indices are used in a working memory or an associative memory to store and retrieve the correct sentence-meaning construction.
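One way to read the requirement of a unique construction index is as a content-addressed store; a hypothetical sketch (the hashing choice and class names are ours, not the chapter's model):

```python
import hashlib

def construction_index(sentence):
    """A stable, practically unique index derived from the sentence itself."""
    return hashlib.sha256(sentence.encode("utf-8")).hexdigest()[:16]

class ConstructionMemory:
    """Associative store of sentence-meaning constructions, keyed by index."""
    def __init__(self):
        self._store = {}

    def store(self, sentence, construction):
        self._store[construction_index(sentence)] = construction

    def retrieve(self, sentence):
        return self._store.get(construction_index(sentence))

mem = ConstructionMemory()
mem.store("the dog barks", ("S", ("NP", "the dog"), ("VP", "barks")))
```

Distinct sentences yield distinct indices, so each stored construction is retrieved only by the sentence that produced it.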
6. Conclusion
This chapter started with a review of classical and traditional statistical learning approaches. As sentence understanding has latent compact connections with human cognition, this chapter also highlighted relevant cognitive concepts and models in the sentence understanding domain. Afterwards, this chapter described the completion of the sentence understanding task from two aspects, sentence parsing and semantic processing, and how cognitive models are integrated, together with machine learning algorithms (or models), into the procedures of sentence parsing and semantic processing.
The CIParser has been evaluated and proven comparable with state-of-the-art parsing techniques based on statistical language learning. Another computing model, Semantic Processing for Sentence Understanding (Fig. 3), has also been constructed to deliver a Sentence-Meaning Construction Index (SMCI) for each sentence. With the SMCI, a sentence can be understood in four dimensions: the lexical, syntactic, grammatical and semantic (or conceptual) dimensions.
Cognitive learning with machines for sentence understanding has only just begun, with modest results so far; our work took SRNs as an initial model of artificial neural networks.
In an artificial language learning task (next-word prediction), van der Velde et al. (2004)
evaluated a simple recurrent network (SRN) and claimed that the SRN failed to process
novel sentences appropriately, for example, by correctly distinguishing between nouns and
verbs. However, Frank (2006) extended the above simulations and showed that, although limitations had arisen from overfitting in large networks (van der Velde et al., 2004), an identical SRN can still display some generalization performance provided that the lexicon size is increased properly. Moreover, Frank (2006) demonstrated that
generalization could be further improved by employing the echo-state network (ESN)
(Jaeger, 2003), an alternative network that requires less training (due to fixed input and
recurrent weights) and is less prone to overfitting.
Recurrent Self-Organizing Networks (RSONs) (Farkaš & Crocker, 2006), coupled with two types of single-layer prediction modules, have demonstrated salient benefits in learning temporal context representations. In the task of next-word prediction, the RSON achieved the best performance, turning out to be more robust and faster to train than the SRN, with higher prediction accuracy than the ESN. In conclusion, further investigation will take ESNs and RSONs as neural network models, and we believe that comparison and evaluation work among SRNs, ESNs and RSONs is also an adventurous and promising direction.
7. Acknowledgement
This research work has been partly funded by the Returned Scholar Research Funding of the
Chinese Ministry of Education (MoE).
8. References
Baddeley, A. & Susan, J. P. (2006). Working Memory: An Overview. Working Memory and
Education, Academic Press, ISBN:0125544650, Burlington, pp.1-31.
Balota, D. A. (1994). Visual word recognition: the journey from features to meaning, In:
Handbook of Psycholinguistics, Gernsbacher, M. A., (Ed.), pp.303-358, Academic
Press, ISBN-10: 0122808908, New York
Bates, E. & MacWhinney, B. (1987). Competition, variation and language learning, In:
MacWhinney, B. (Ed.), Mechanisms of Language Acquisition, Erlbaum, Hillsdale, NJ,
pp. 157–194.
Berwick, R. & Weinberg, A. (1984). The Grammatical Basis of Linguistic Performance, MIT Press,
Cambridge, MA.
Bock, K. & Griffin, Z. M. (2000). The persistence of structural priming: Transient activation
or implicit learning? Journal of Experimental Psychology: General, Vol.129, pp.177–192.
Caplan, D. & Waters, G. S. (1999). Verbal working memory and sentence comprehension,
Behavioral and Brain Sciences, Vol.22, pp.77–126.
Chang, F.; Bock, K. & Goldberg, A. (2003). Do thematic roles leave traces of their places?
Cognition, No. 90, pp.29–49.
Chang, F.; Dell, G. S. & Bock, K. (2006). Becoming syntactic, Psychological Review, No.113,
pp.234–272.
Charniak, E. (1997). Statistical Techniques for Natural Language Parsing, AI Magazine, No.18,
pp. 33-43.
Miller, G. A.; Beckwith, R.; Fellbaum, C.; Gross, D., & Miller, K. J. (1990). Introduction to
WordNet: An On-Line Lexical Database, International Journal of Lexicography, Vol. 3,
No. 4, pp. 235-312.
Misyak, J. B. & Christiansen, M. H. (2007). Extending statistical learning farther and further:
Long-distance dependencies, and individual differences in statistical learning and
language, In: D. S. McNamara & J. G. Trafton (Eds.), Proceedings of the 29th annual
cognitive science society conference, pp. 1307-1312, Austin, TX: Cognitive Science
Society.
Monaghan, P.; Chater, N. & Christiansen, M. H. (2005). The differential role of phonological
and distributional cues in grammatical categorisation, Cognition, Vol. 96, pp. 143–
182.
Morgan, J. L.; Meier, R. P. & Newport, E. L. (1987). Structural packaging in the input to
language learning: Contributions of prosodic and morphological marking of
phrases to the acquisition of language, Cognitive Psychology, Vol. 19, pp. 498–550.
Onnis, L.; Christiansen, M.; Chater, N. & Gómez, R. (2003). Reduction of uncertainty in
human sequential learning: Evidence from artificial language learning, In:
Proceedings of the 25th annual conference of the cognitive science society, Mahwah, NJ:
Lawrence Erlbaum, pp. 886–891.
Rayner, K.; Carlson, M., & Frazier, L. (1983). The interaction of syntax and semantics during
sentence processing, Journal of Verbal Learning and Verbal Behavior, No. 22, pp. 358–
374.
Saffran, J. R. (2001). The use of predictive dependencies in language learning, Journal of
Memory and Language, Vol. 44, No. 4, pp. 493–515.
Saffran, J. R. (2003). Statistical language learning: Mechanisms and constraints, Current
Directions in Psychological Science, Vol. 12, No. 4, pp. 110–114.
Sedivy, J. C.; Tanenhaus, M. K.; Chambers, C. G. & Carlson, G. N. (1999). Achieving
incremental semantic interpretation through contextual representation, Cognition,
No. 71, pp. 109–148.
Speer, S. R. & Clifton, C. (1998). Plausibility and argument structure in sentence
comprehension, Memory and Cognition, Vol. 26, pp. 965–978.
Sugumaran, V. & Storey, V. C. (2002). An ontology based framework for generating and
improving DB design, In: Proceedings of 7th International Workshop on Applications of
Natural Language to Information Systems, pp. 1-12, Springer Verlag, Stockholm,
Sweden.
Tanenhaus, M. K.; Spivey-Knowlton, M. J.; Eberhard, K. M. & Sedivy, J. C. (1995).
Integration of visual and linguistic information in spoken language comprehension,
Science, No. 268, pp. 1632–1634.
Tanenhaus, M. K. & Trueswell, J. C. (1995). Sentence Comprehension. Speech, Language, and
Communication (2nd), Academic Press, San Diego, pp. 217-262.
Taraban, R. & McClelland, J. R. (1988). Constituent attachment and thematic role assignment
in sentence processing: Influence of content-based expectations, Journal of Memory
and Language, Vol. 27, pp. 597–632.
Townsend, D. J. & Bever, T.G. (2001). Sentence Comprehension: The Integration of Habits and
Rules, The MIT Press, Cambridge, MA.
Trueswell, J. & Tanenhaus, M. K. (1994). Toward a lexicalist framework for constraint-based
syntactic ambiguity resolution, In: C. Clifton, L. Frazier, & K. Rayner (Eds.),
21
1. Introduction
The principal significance of an artificial neural network is that it learns and improves through that learning. The definition of the learning process in neural networks is therefore of great importance. The neural network is stimulated, and in response to these stimulations the free parameters of the network change in its internal structure. As a result, the neural network responds in a new way. Based on a basic learning algorithm, namely Hebbian learning, a solution to the problem of resolving uncertainty areas in diffusion tensor magnetic resonance image (DTMRI) analysis is presented. Diffusion tensor imaging (DTI) is a developing and promising medical imaging modality that allows the noninvasive determination of in-vivo tissue properties based on the random movement of water molecules. The method's unique noninvasiveness offers a great opportunity to explore various white matter pathologies and to map the healthy brain for neuroanatomy research. In neuroscience applications, DTI is mostly used for the brain's fiber tractography, reconstructing the connectivity map. Clinical evaluation of fiber tracking results is a major problem in the field. Noise, partial volume effects, and the inefficiency of numerical implementations in reconstructing intersecting tracts are some of the reasons for the need for a standardized fiber tract atlas. Misregistration caused by eddy currents, ghosting due to motion artifacts, and signal loss due to susceptibility variations may also affect the calculated tractography results.
The proposed method, based on Hebbian learning, provides an instance of unsupervised and competitive learning in a neurobiological sense as a solution to the tracking problem of intersecting axonal structures. The main contribution of the study is to describe a tracking approach with improved reliability via a special class of artificial neural networks, namely Hebbian learning.
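The plain Hebbian rule underlying this approach strengthens a weight in proportion to the joint activity of the units it connects; a minimal sketch (the learning rate, shapes and sample activities are our assumptions, not the study's implementation):

```python
import numpy as np

def hebbian_update(w, x, y, eta=0.01):
    """Δw_ij = η · y_i · x_j : weights grow where pre- and postsynaptic
    activity coincide ("cells that fire together wire together")."""
    return w + eta * np.outer(y, x)

w = np.zeros((2, 3))                          # 2 outputs, 3 inputs
w = hebbian_update(w,
                   x=np.array([1.0, 0.0, 1.0]),   # presynaptic activity
                   y=np.array([1.0, 2.0]))        # postsynaptic activity
```

Only the weights between co-active input and output units are reinforced; inactive inputs leave their weights untouched.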
the rate of diffusion is greatest (Basser et al., 2000). This developing imaging modality has
become an almost routine MR technique for analyzing the tissue anisotropy characteristics,
connectivity, and alterations of human brain neural tracts.
The discrete diffusion tensor and the diffusivity trajectory estimated between neighboring
image pixels are used to trace out the fiber pathways, namely the tracts. The process of
determining the neural tracts, especially white matter structures, by diffusion tensor analysis
is commonly known as tractography. Fiber tractography provides both quantitative and
qualitative information, aiming to clarify the anatomical architecture of the brain's fibers
and to advance our knowledge of fiber connectivity maps (Ding et al., 2003). There are some
limiting cases in DTI analysis and fiber tracking. One of the critical problems in estimating
these brain maps is the existence of intersecting tracts within the tissue. As a consequence,
axonal structures in image voxels with more than one diffusivity direction cannot be clearly
defined; in general the diffusion tensor model becomes too inaccurate to resolve these
uncertainties (Bammer, 2003; Ciccarelli et al., 2003).
Current research involves multi-tensor mixture models (Tuch et al., 2002) and higher-order
tensor models (Basser et al., 1994; Basser, 2002). Techniques such as q-space imaging
(Callaghan et al., 1988; Basser, 2002) and high angular resolution diffusion imaging
(Frank, 2002; Tuch et al., 2002) have been developed to resolve such multiple diffusivities
within a voxel. Jones employed the so-called "cone of uncertainty", a construction in which
the tensor's principal eigenvector has a confidence interval; this helps to define the
uncertainty regions as a cone with a probability distribution instead of a discrete diffusivity
determination (Jones, 2003). In spite of several proposed methods for determining the
intersecting diffusivities (Westin et al., 1999; Pajevic & Pierpaoli, 2000; Poupon et al.,
2000; Tuch et al., 2002), the connections are still not precisely defined, and there is no gold
standard yet (Westin et al., 1999; LeBihan et al., 2006). With the proposed Hebbian
learning rule approach, we therefore aim to clarify the tracts in the intersections in order to
eliminate the uncertainty.
Getting the six unique elements of the diffusion tensor D requires at least six diffusion-
weighted measurements in non-collinear measurement directions g, along with a non-
diffusion-weighted measurement S0, based on the three-dimensional Gaussian Stejskal-
Tanner relation:
A Hebbian Learning Approach for Diffusion Tensor Analysis & Tractography 347
\[
\begin{bmatrix}
x_1^2 & y_1^2 & z_1^2 & 2x_1 y_1 & 2y_1 z_1 & 2x_1 z_1 \\
x_2^2 & y_2^2 & z_2^2 & 2x_2 y_2 & 2y_2 z_2 & 2x_2 z_2 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
x_n^2 & y_n^2 & z_n^2 & 2x_n y_n & 2y_n z_n & 2x_n z_n
\end{bmatrix}
\begin{bmatrix}
D_{xx} \\ D_{yy} \\ D_{zz} \\ D_{xy} \\ D_{yz} \\ D_{xz}
\end{bmatrix}
=
\begin{bmatrix}
-\tfrac{1}{b}\ln\tfrac{S_1}{S_0} \\
-\tfrac{1}{b}\ln\tfrac{S_2}{S_0} \\
\vdots \\
-\tfrac{1}{b}\ln\tfrac{S_n}{S_0}
\end{bmatrix}
\tag{2}
\]
In the linear system of equations Ad = s of Eq. (2), A is the encoding matrix containing
the n ≥ 6 unit-normalized gradient measurement directions, d is the vector of the six
unique elements of the diffusion tensor D (Eq. 3), and s is the vector containing the
natural-logarithm-scaled RF signal loss resulting from the Brownian motion of spins
(Berg, 1983).
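To make the estimation concrete, the following sketch solves the overdetermined system Ad = s by ordinary least squares with NumPy. The function name, argument layout, and column ordering of A are illustrative assumptions matching Eq. (2), not the chapter's actual implementation.

```python
import numpy as np

def fit_diffusion_tensor(gradients, signals, S0, b):
    """Least-squares fit of the six unique tensor elements d from Ad = s.

    gradients: (n, 3) array of n >= 6 unit-normalized directions (x, y, z).
    signals:   (n,) diffusion-weighted signal S_i for each direction.
    S0:        non-diffusion-weighted signal.
    b:         diffusion weighting factor.
    Returns the symmetric 3x3 diffusion tensor D.
    """
    g = np.asarray(gradients, dtype=float)
    x, y, z = g[:, 0], g[:, 1], g[:, 2]
    # Encoding matrix A: one row per measurement direction, as in Eq. (2).
    A = np.column_stack([x**2, y**2, z**2, 2*x*y, 2*y*z, 2*x*z])
    # s: natural-logarithm-scaled signal attenuation, -(1/b) ln(S_i / S0).
    s = -np.log(np.asarray(signals, dtype=float) / S0) / b
    d, *_ = np.linalg.lstsq(A, s, rcond=None)
    Dxx, Dyy, Dzz, Dxy, Dyz, Dxz = d
    return np.array([[Dxx, Dxy, Dxz],
                     [Dxy, Dyy, Dyz],
                     [Dxz, Dyz, Dzz]])
```

With exactly six directions the system is square and the least-squares solution is exact; with more directions the extra measurements reduce the influence of noise.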
|D − λI| = 0 (5)
The calculated eigenvectors are sorted in descending order of their eigenvalues, producing
an ordered orthogonal basis whose first eigenvector has the direction of largest variance of
the data (Goksel & Ozkan, 2006). In our sample, this leads to the principal diffusivity, and
so the most appropriate diffusivity directions can be determined (Borisenko & Tarapov,
1979). The first principal component ε1 has maximum variance, and thus its weighting
coefficients give the direction of the maximum diffusion-weighted signal, i.e., the largest
principal diffusivity (Basser et al., 2000). The weighting coefficients of the second and
third principal components ε2 and ε3 give the directions of the intermediate and smallest
principal diffusivities, respectively.
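A minimal sketch of this eigendecomposition and descending ordering follows (the function name is an assumption; NumPy's `eigh` returns the eigenvalues of a symmetric matrix in ascending order, so they are re-ordered here):

```python
import numpy as np

def principal_diffusivities(D):
    """Eigendecompose the symmetric diffusion tensor D and return the
    eigensystem in descending eigenvalue order, so that the first
    eigenvector points along the largest principal diffusivity."""
    vals, vecs = np.linalg.eigh(D)      # eigh: ascending order for symmetric D
    order = np.argsort(vals)[::-1]      # re-order descending
    return vals[order], vecs[:, order]  # columns are e1, e2, e3
```

Note that eigenvectors are defined only up to sign, which matches the physical situation: a fiber direction has no preferred orientation.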
Estimation of the fiber tract maps follows the implementation of the selected post-
processing method, in this study the Hebbian learning rule, to resolve the related
eigensystem. To begin the tracking process, a starting pixel, also called a seed point, is
generally selected to focus on the desired region of interest (ROI) and to avoid
computational overload when working on the whole-brain DTMR data. Starting at the seed
point coordinates, similar fiber orientation vectors are traced out under a predefined
similarity constraint until the selected ROI is fully covered. Specific tracts related to the
investigated ROIs can be visualized by choosing the regions/seed points according to
anatomical structures picked on the brain atlas, where the selection can be made either on a
segmented DT brain map or on the unsegmented full brain volume.
In this study, the input pattern is the eigensystem defining the principal diffusivity of the
fibers in DTMRI. As previously mentioned, Hebb's learning theory relies on the increase of
the weights between neighboring nodes through simultaneous activation. In other words,
the weights between the nodes of the input pattern in Hebbian learning represent the
relationship between these nodes. The modification of the weights and the implementation
are explained in the following subsections.
Hebb's formula has many forms; the simplest form, expressed as a weight modification, is
given as (Haykin, 1999):

Δwkj(n) = η yk(n) xj(n)

The synaptic adjustment Δwkj is applied to the synaptic weight wkj at time step n with a
learning rate parameter η, proceeding from one step of the learning algorithm to the next.
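The update above can be sketched as a one-line NumPy routine; the function name and the matrix layout (rows indexed by postsynaptic neuron k, columns by presynaptic input j) are assumptions for illustration:

```python
import numpy as np

def hebb_update(w, x, y, eta=0.01):
    """One Hebbian step: delta_w_kj = eta * y_k * x_j (Haykin, 1999).

    w:   (k, j) synaptic weight matrix at time step n.
    x:   (j,) presynaptic input vector x(n).
    y:   (k,) postsynaptic output vector y(n).
    eta: learning rate parameter.
    Returns the weight matrix at time step n + 1.
    """
    return w + eta * np.outer(y, x)  # outer product couples co-active pairs
```

A weight grows only when its presynaptic input and postsynaptic output are active at the same time, which is exactly the simultaneous-activation principle described above.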
[Figure: a single computational node with inputs F(x1), F(x2), …, F(xi), synaptic weights
w1j, w2j, …, wij, and output yj = φ(Fw), where yj = 1 if yj ≥ θ and yj = 0 if yj < θ.]
Fig. 1. The principle of Hebbian learning: the activation function with a threshold defines
the output, in other words the activated (allowed) vectors of the examined pattern.
[Fig. 2: synthetic tract pattern with labeled voxels (2,2), (2,3), (3,2), and (4,3).]
The seed point in Fig. 2 is the voxel with the coordinates (4,2). The 8 neighboring nodes of
the starting point are investigated first, and at each step the weights are updated. For the
same sample, Gaussian noise is added to the pattern, and the weights are again updated
starting from the same seed point. As a result, the Hebb rule defines the green path as the
winning path (Fig. 3). Depending on the threshold function, varying branches can be
determined for the same ROI (Fig. 4). For the simulation studies, various threshold
functions were selected to verify the algorithm and to define the simulated paths more
precisely. In living tissue, the selection of the threshold constraint actually depends on the
human brain atlas and knowledge of the anatomical structures.
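The step just described can be sketched as follows, on a 2-D field of principal eigenvectors. This is only a schematic reading of the procedure: the |cosine| similarity measure, the greedy winner selection among the 8 neighbors, and all names are illustrative assumptions, not the chapter's exact algorithm.

```python
import numpy as np

def track_from_seed(pvec, seed, theta=0.9, eta=0.1, max_steps=50):
    """Threshold-gated tracking sketch on a field of principal eigenvectors.

    pvec:  (H, W, 2) array; pvec[r, c] is the unit principal eigenvector of
           voxel (r, c).
    seed:  (row, col) starting voxel.
    theta: activation threshold; only sufficiently similar neighbors may fire.
    eta:   Hebbian learning rate for the weight reinforcement.
    Returns the list of visited voxels (the winning path).
    """
    H, W, _ = pvec.shape
    weights = {}                     # synaptic weights, reinforced by Hebb's rule
    path, current = [seed], seed
    visited = {seed}
    for _ in range(max_steps):
        r, c = current
        v = pvec[r, c]
        candidates = []
        for dr in (-1, 0, 1):        # examine the 8 neighbors of the current voxel
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= nr < H and 0 <= nc < W):
                    continue
                if (nr, nc) in visited:
                    continue
                # Postsynaptic activation: |cosine| similarity of directions
                # (fibers are unsigned, so the absolute value is used).
                y = abs(float(np.dot(v, pvec[nr, nc])))
                # Hebbian reinforcement of the synapse to this neighbor.
                weights[(nr, nc)] = weights.get((nr, nc), 0.0) + eta * y
                if y >= theta:       # threshold gate, as in Fig. 1
                    candidates.append(((nr, nc), weights[(nr, nc)]))
        if not candidates:
            break                    # no neighbor passes the threshold: stop
        current = max(candidates, key=lambda t: t[1])[0]  # winner takes the step
        path.append(current)
        visited.add(current)
    return path
```

With a straight synthetic "fiber" (one row of horizontal vectors in a field of vertical ones), the sketch follows that row from the seed and stops at the field boundary.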
[Figs. 3 and 4: tracking results on the synthetic pattern of Fig. 2, over the labeled voxels
(2,2), (2,3), (3,2), and (4,3).]
Real data sets of brain diffusion MR images are used for the validation of the algorithm.
The starting point and the activation function's threshold are selected based on knowledge
of the white matter fiber atlas and the most common pathology regions in the literature.
Fig. 5. Eigenvectors of the whole slice represented on the registered anatomic MR image.
The manually selected seed point is shown in the figure.
[Figure: tracking result (red) on the principal vector (blue) representation; axes in pixels,
20–120.]
Fig. 6. Implementation of the algorithm at the starting point shown in Fig. 5, with a
threshold function allowing a wide range of neighbors as winning nodes; the result is
represented on the axial slice registered with the anatomic MR image.
[Fig. 7: tracking result (red) on the principal vector (blue) representation of the full axial
slice (pixels 20–120), together with a zoomed view (pixels 34–52 × 50–65) of the tracked
region.]
4. Discussion
A special class of artificial neural networks, namely Hebbian learning, is proposed for the
analysis of the raw data of DTMRI, a relatively new 3D imaging technique. One of the
major problems in the DTI literature is the absence of a gold standard in fiber tractography.
Intersections of two or more fiber tracts lead to erroneous estimates of the diffusivity and
the fiber orientation. The anatomical fiber maps determined by diffusion tensor analysis
still have unclear accuracy because of the inefficiency of the tensor model in defining
uncertainty regions such as crossing diffusivities in a single voxel. The practical accuracy
of DT analysis and tractography varies with the limitations of data quality and
signal-to-noise ratio (SNR) (Mangin et al., 2002). In the proposed study, the critical
uncertainty problem is addressed with an adequate analysis tool. The method is first
implemented on a synthetic tract pattern (Fig. 2). The weight modifications determine the
weighted connections between the neighboring pixels (Fig. 3). The application results give
promising tract estimations (Fig. 4) based on the threshold determination of the activation
functions. Real data analyses are presented in the implementation section 3.3, in Fig. 6
and Fig. 7. The proposed rule still has to be implemented on 3D brain volumes for
validation studies.
Post-processing reconstruction can reduce the sensitivity of tractography; in the Hebb
application, automated mapping and tracking after seed point selection is achieved, and
the method relies on a basic learning algorithm, which is a well-accepted procedure for
defining anatomical brain mappings. The applicability of the Hebbian rule to the
uncertainty problem is verified by examining the updated weight changes that define a
fiber path.
The assumptions made in the diffusion tensor analysis are of great importance, because the
error tolerance and the general limitations of all subsequent applications, including
tractography, depend highly on them.