Convolutional Neural Networks in Computer Vision: Jochen Lang


Convolutional Neural Networks in Computer Vision
Jochen Lang

[email protected]

Faculté de génie | Faculty of Engineering


Hands-on machine learning

• Machine learning landscape


• data handling
• visualizing data
• organizing the data
• training, validation and testing

What is machine learning?

• A definition that I like by Tom Mitchell [1997]:


– A computer program is said to learn from experience
E with respect to some task T and some performance
measure P, if its performance on T, as measured by
P, improves with experience E.
• Wikipedia knows
– Machine learning is a field of computer science that
uses statistical techniques to give computer systems
the ability to "learn" (e.g., progressively improve
performance on a specific task) with data, without
being explicitly programmed. (Attributed to A.
Samuel [1959])

Computer Vision and AI?

• Artificial Intelligence has regarded computer vision as a primary task.
– MIT Vision Project in the summer of 1966
– Connect cameras to computers
– AI to extract and interpret information in the image
• Rumors:
– “how to extract objects and features from video
camera data was originally tossed to a part-time
undergraduate student researcher to figure out in a
few short months.” H. Knight, T. Green and others
(CSAIL) [2006]

Computer Vision and AI?

• Classical AI in computer vision


– E.g. Session 21 “Computer Vision” of the 3rd
International Joint Conference on Artificial
Intelligence, IJCAI 1973, Alan Mackworth,
“Interpreting Pictures of Polyhedral Scenes.”

• But “handcrafting the visual model of an object is neither easy nor practical.”

Classic AI vs Machine Learning
• Classic knowledge-based, hand-crafted approach
(simplified view)

Study the problem → Write rules → Evaluate → Success!
(if evaluation fails: analyze errors and return to studying the problem)

Adapted from A. Géron, Hands-On ML

Classic AI vs Machine Learning
• Machine learning based approach (simplified view)

Study the problem → Fit model (“training”, on the Data) → Evaluate → Success!
(if evaluation fails: analyze errors and return to studying the problem)

Adapted from A. Géron, Hands-On ML
Classification of Machine Learning Systems
• Mitchell’s definition is based on a task T, an experience
E, and a performance measure P. This can help to
classify machine learning approaches.
• Classify by task T in broad terms
– Classification
• Group the input into a set of different categories,
e.g., image classification: decide if the main
image object is a cat, car, house, person etc.
• We try to learn a function $f: \mathbb{R}^n \rightarrow \{1, \dots, k\}$
– Our model $y = f(\mathbf{x})$ maps the input into $k$ categories, with $\mathbf{x}$ the input vector and $y$ a probability distribution over categories

Classification of Machine Learning Systems by Task
• Regression
– Predict a numerical value from the input, e.g.,
(stereo) depth estimation: given an image decide for
each pixel how far away the object imaged at this
pixel is from the camera
– We try to learn a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$
• Our model $\mathbf{y} = f(\mathbf{x})$ maps the input to a real output value, with $\mathbf{x}$ the input vector and $\mathbf{y}$ being a real vector or number
• Note: often we will want to predict many outputs, so $\mathbf{y}$ is a vector, and often the output needs to have a particular structure (structured output).

Classification of Machine Learning Systems by Task
• Density estimation (or probability mass function estimation in the discrete case)
– Background (and foreground) modelling, e.g., based on color for surveillance or as a preprocessing step for other tasks
– Given data, learn the distribution of the data such that the probability of new data belonging to a cluster can be estimated

Classification of Machine Learning Systems based on Experience E
• Experience (or training) may include the ground truth or
label
• Supervised Learning
– A so-called training data set is used to fit a machine learning model. Training involves comparing the prediction by the model to the known solution, often called ground truth or label, and improving the model based on the comparison between labels and model output.

Classification of Machine Learning Systems based on Experience E
• Experience (or training) may be just data. As the ground truth or label is missing, the “solution” is unknown and hence the learning is unsupervised
• Unsupervised Learning
– The model must capture the distribution of the training data set, maybe even discovering its properties or interesting features.

Classification of Machine Learning Systems based on Experience E
• Experience (or training) may be on-line and dependent on the actions of the model governed by its policy
• Reinforcement learning
– The experience is in the form of on-going positive and negative rewards. But the rewards also depend on the actions taken as a result of the learned policy (feedback loop)

Classification of Machine Learning Systems?
• Warning: These classifications are soft
– Supervised and unsupervised tasks can often be reformulated as the other
• E.g., based on the chain rule of probability, we can split the unsupervised data set vectors into dependent and independent variables and solve the problem with supervised techniques
• E.g., if we learned the distribution of a data set, then we can use it to solve a supervised task
– Regression and classification can often also be turned into the other (possibly with a normalization step)

Instance-based vs. Model-based Learning
• Another classification
– Instance-based means that the prediction for new
data is simply the most similar output seen during
training. This can be just the closest (nearest-
neighbor) or some kind of combination of k-nearest
neighbors.
– Model-based means a model is fit to training data
and for new input, the model is used to
calculate/predict the output.

Generative vs. Discriminative Models

• A distinction, especially common in computer vision.


• A formal definition based on Bayes’ rule
– Discriminative methods model the posterior $P(\text{person} \mid \mathbf{x})$
– Generative methods model the likelihood $P(\mathbf{x} \mid \text{person})$ and the prior $P(\text{person})$
• Recall Bayes’ rule, and say we are interested in finding images of people, i.e., we have two types of images: with and without people.

$$\underbrace{P(\text{person} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{\frac{P(\mathbf{x} \mid \text{person})}{P(\mathbf{x})}}_{\text{Likelihood ratio}} \; \underbrace{P(\text{person})}_{\text{Prior}}$$

Use of Machine Learning in Computer Vision
This is only a short non-comprehensive list!
All the cited work pre-dates and/or does not use deep
learning!
• (Linear) Regression (least squares), in tasks such as
edge detection, stereo matching, camera calibration
etc., e.g., Y.I. Abdel-Aziz and H.M. Karara, Direct-linear
transform into object-space coordinates in close-range
photogrammetry [1971]
• Principal Components (PCA), e.g., Turk and Pentland,
Eigenfaces for recognition [1991].

Use of Machine Learning in Computer Vision
• Decision trees (forests), e.g., A. Criminisi et al.,
Decision forests: A unified framework for classification,
regression, density estimation, manifold learning and
semi-supervised learning [2012].
• Boosting, e.g., AdaBoost in P. Viola and M. Jones, Rapid
object detection using a boosted cascade of simple
features [2001].
• Neural networks, e.g., T.D. Sanger, Optimal
unsupervised learning in a single-layer linear
feedforward neural network [1989] or N.M. Nasrabadi
and C.Y. Choo, Hopfield network for stereo vision
correspondence [1992].

Use of Machine Learning in Computer Vision
• Hidden Markov Models, e.g., N.M. Oliver et al., A
Bayesian computer vision system for modeling human
interactions [2000].
• Random fields (MRF, CRF) lead to energy minimization
often solved by graph cuts, see, Y. Boykov, O. Veksler,
R. Zabih, Fast approximate energy minimization via
graph cuts [2001].
• Graphical models, e.g., K.P. Murphy et al., Using the
forest to see the trees: A graphical model relating
features, objects, and scenes [2000].
• Non-parametric kernels, e.g., J. Tighe and S. Lazebnik, Superparsing: Scalable Nonparametric Image Parsing with Superpixels [2010].

Traditional Machine Learning vs. Deep Learning in Computer Vision
• Deep learning (partially) replaced the step of feature
extraction with “end-to-end” learning.

Source: Zbigniew Zdziarski, https://zbigatron.com/has-deep-learning-superseded-traditional-computer-vision-techniques/, accessed Sept. 8, 2018

Is Computer Vision still different from AI?
• Not all of computer vision is classification and recognition; it also includes
– computational cameras, geometrical computer vision, 3D reconstruction, motion, light transport etc.
• Even for classical applications of AI in computer vision
there is still a need to solve classical computer vision tasks
– Deep learning needs big data
• Need knowledge in CV to collect and organize data
– Understanding CV helps in making deep learning
effective
• Need to know how to train for the task, e.g.,
augmentation, regularization etc.
– Some tasks can be solved much more efficiently by
classical techniques

Data

• One of the major reasons for the success of deep learning in computer vision is the availability of large datasets
• Example: ImageNet
– http://www.image-net.org/
– According to the above webpage
• Total number of non-empty synsets: 21,841
• Total number of images: 14,197,122
• Number of images with bounding box
annotations: 1,034,908
• Number of synsets with SIFT features: 1000
• Number of images with SIFT features: 1.2 million

Data Example - COCO

• Microsoft Common Objects in Context (COCO)


• http://cocodataset.org/
• Usable for object segmentation, recognition in context
and Superpixel “stuff” segmentation
– 330K images (>200K labeled)
– 1.5 million object instances
– 80 object categories
– 91 stuff categories
– 5 captions per image
– 250,000 people with keypoints

Data Example – FlyingThings3D

• FlyingThings3D
• https://lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html
• Useful for various 3D Vision tasks: optical flow, stereo
matching, segmentation
– CG Images of various scenes
– Synthetic dataset with optical flow, stereo and
segmentation ground truth
– ~39000 stereo frames @ 960x540 px

Mayer et al, A Large Dataset to Train Convolutional Networks for Disparity, Optical
Flow, and Scene Flow Estimation [2016]

Data Repositories
• There are many (open) data repository for ML including
CV datasets, e.g.,
– Kaggle https://www.kaggle.com/datasets
– Amazon AWS https://registry.opendata.aws
• Computer vision specific, e.g.,
– Keith Price vision bibliography
http://datasets.visionbib.com/
– CVonline: Image Databases http://homepages.inf.ed.ac.uk/rbf/CVonline/Imagedbase.htm
– Yet Another Computer Vision Index To Datasets
(YACVID) http://yacvid.hayko.at/

“Hello World” or MNIST Data set

• Classification of handwritten numerals


• See Yann LeCun’s page with scores for different ML
algorithms http://yann.lecun.com/exdb/mnist/

(Figure: sample of MNIST handwritten digits. Image: Josef Steppan, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=64810040)

Data Challenges
• Many challenges for ML (in computer vision) are related to
(training) data
– Insufficient quantity
• Additionally, data gathering can be costly or impossible
– Poor quality
• “Ground truth” may have errors
• Mismatch between image content and labels
• Ambiguity of images
– Non-representative
• Training (and test) images may not match the images seen during application
– Different distribution, appearance changes,
deformation, unexpected lighting etc.

Course Software Prerequisites

• Setup:
– Python with jupyter, matplotlib, numpy, pandas,
scipy, scikit-learn, tensorflow
• scikit-learn, according to http://scikit-learn.org/stable/
– Simple and efficient tools for data mining and data
analysis
– Accessible to everybody, and reusable in various
contexts
– Built on NumPy, SciPy, and matplotlib
– Open source, commercially usable - BSD license

Course Software Prerequisites
• Tensorflow
– started at the Google Brain team
– open source library for general high performance
numerical computation (HPC)
– flexible architecture that runs on CPUs, GPUs (and
TPUs)
– support for ML including deep learning
– APIs in different languages: Python, JavaScript, C++,
Java, Go, Swift plus community support for C#, Haskell,
Julia, Ruby, Rust, Scala, …
• Different high-level APIs, more consistent in 2.0
– tf.keras
– eager execution and AutoGraph
– Estimators

TensorFlow API Levels

(Figure: diagram of the TensorFlow API levels; image source: https://www.tensorflow.org)

Inspecting the Data

• Golden rule: Set aside test data and do not look at it until test time.
• Inspection easy with MNIST as it is available through
fetch_openml from sklearn.datasets
– Data source: https://www.openml.org
• Can use matplotlib.pyplot to look at the data arrays as images (see the sketch below)
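A minimal sketch along these lines (assuming a recent scikit-learn; 'mnist_784' is the dataset name on openml.org):

```python
from sklearn.datasets import fetch_openml
import matplotlib.pyplot as plt

# Fetch MNIST from openml.org; 'mnist_784' is the 70,000-image version
# with each 28x28 image flattened into a 784-vector.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Reshape one flattened row back into a 28x28 image and display it.
plt.imshow(X[0].reshape(28, 28), cmap='binary')
plt.title(f"label: {y[0]}")
plt.show()
```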

Binary Classifier

• Pick one digit, say 7, in MNIST and train a model to predict if an image shows a 7 or not.
– Classification into “7” and “not 7”.
• Stochastic gradient descent is one way to fit (linear) models to training data, as sketched below
– Use SGDClassifier from sklearn.linear_model
– SGDClassifier can fit different models, e.g., (soft-margin) linear Support Vector Machine and logistic regression.
– More, later.
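A minimal sketch of such a binary classifier (the train/test split and variable names are illustrative, continuing from the loading sketch above):

```python
from sklearn.linear_model import SGDClassifier

# MNIST is commonly split into the first 60,000 images for training
# and the last 10,000 for testing (X, y as loaded above).
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]

# Binary targets: True for images of a 7, False for everything else.
y_train_7 = (y_train == '7')     # labels from fetch_openml are strings
y_test_7 = (y_test == '7')

# With the default loss='hinge', SGDClassifier fits a linear SVM
# by stochastic gradient descent.
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_7)
sgd_clf.predict(X_test[:3])      # array of True/False predictions
```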

Performance Measure

• Besides the data, the model and the fitting, we also need to measure the performance of the model on the task
• Commonly, we use
– Accuracy
• Simply count the correct predictions and express them as a percentage of all predictions.
• In our example: how many times is the classifier correct, i.e., the prediction and the ground truth match (see the sketch below).
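A sketch of computing accuracy for the hypothetical 7-detector above:

```python
from sklearn.metrics import accuracy_score

y_pred = sgd_clf.predict(X_test)

# Fraction of predictions that match the ground truth ...
print(accuracy_score(y_test_7, y_pred))
# ... equivalently, computed by hand:
print((y_pred == y_test_7).mean())
```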

Positives and Negatives

• Predictions can be characterized as true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN)
(Figure: handwritten digits split by the classifier’s decision into a “predict ‘not 7’” region, containing true negatives (non-7s correctly rejected) and false negatives (7s that were missed), and a “predict ‘7’” region, containing true positives (7s correctly detected) and false positives (other digits mistaken for a 7).)
Precision vs. Recall

• Precision
– Precision is high if the positive predictions are correct; positives that are missed (e.g., overlooked detections or classifications of “7” in our example) do not influence the precision.
• Recall (sensitivity)
– Recall is high if true positives are not missed; negatives that are predicted incorrectly (e.g., “not 7” images classified as “7” in our example) do not influence the recall.
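Written in terms of the counts from the previous slide, the two measures are:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$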

F1 Score: Precision and Recall

– Precision and recall can be combined into an F1 score:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

– Precision vs. recall trade-off
– Often we can increase precision by trading off recall and vice versa
• Example:
– Classifier that always predicts “7”
• Recall is 100% but precision is only about 10% (the fraction of 7s in the data)
– Classifier that always predicts “not 7”
• Accuracy is about 90% in our example but recall is 0%
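A sketch with scikit-learn's metrics (same hypothetical y_test_7 and y_pred as in the earlier sketches):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

print(precision_score(y_test_7, y_pred))  # TP / (TP + FP)
print(recall_score(y_test_7, y_pred))     # TP / (TP + FN)
print(f1_score(y_test_7, y_pred))         # harmonic mean of the two
```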
Receiver Operating Characteristic (ROC) Curves
• Definition
– true positive rate (recall): $TPR = \frac{TP}{TP + FN}$
– false positive rate: $FPR = \frac{FP}{FP + TN}$
• The ROC curve plots the true positive rate against the false positive rate as the decision threshold varies
– the diagonal corresponds to selection by chance; a perfect classifier reaches the top-left corner, and curves closer to that corner are better
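A sketch of plotting the ROC curve for the hypothetical 7-detector, using continuous decision scores rather than hard predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Signed distances to the decision boundary; the curve is traced
# out by sweeping a threshold over these scores.
y_scores = sgd_clf.decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test_7, y_scores)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], 'k--')   # selection by chance
plt.xlabel('false positive rate')
plt.ylabel('true positive rate (recall)')
plt.show()

print(roc_auc_score(y_test_7, y_scores))  # area under the ROC curve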

Classification of all characters

• Multiclass Classification
– Different possible approaches
– Can use multiple binary classifiers (see the sketch below)
• One vs. all (OvA), or one vs. the rest, e.g., 10 binary classifiers which each make a binary decision whether the given image shows a specific number. The binary classifier with the highest score or probability wins.
• One vs. one (OvO), i.e., all possible pairs are formed, e.g., $\binom{10}{2} = 45$ binary classifiers for the 10 digits. The class which wins the most binary classifications is chosen.
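A sketch of both strategies with scikit-learn's explicit wrappers (variable names continue from the earlier sketches; scikit-learn can also pick a strategy automatically):

```python
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# One vs. all: one binary classifier per digit; the highest
# decision score determines the predicted class.
ova_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
ova_clf.fit(X_train, y_train)            # full 10-class labels
print(len(ova_clf.estimators_))          # 10

# One vs. one: one binary classifier per pair of digits; the class
# winning the most pairwise duels is predicted.
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
print(len(ovo_clf.estimators_))          # 45 = 10 choose 2
```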

Multinomial Classification

• Direct multi-class classification, i.e., one classifier which directly produces scores or probabilities for all categories.
– Example: Decision tree or random forest classifier
– A list of classifiers implemented in scikit-learn and their behaviour for multi-class problems can be found at http://scikit-learn.org/stable/modules/multiclass.html
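A sketch with a random forest as a direct multi-class classifier (continuing the earlier MNIST variable names):

```python
from sklearn.ensemble import RandomForestClassifier

# A random forest handles all 10 digit classes directly,
# no one-vs-all or one-vs-one wrapper needed.
forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)

# One probability per class for each input image, shape (1, 10).
print(forest_clf.predict_proba(X_test[:1]))
```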

Multilabel and Multioutput Classification
• Each sample can receive multiple (non-exclusive) labels.
– An image may show multiple objects
– An object may have multiple properties
– Output of the classifier is a vector
• Multioutput classification: each sample can receive multiple non-binary labels, e.g., an image may show multiple objects (cars and people), but the estimator must also estimate properties of these objects (see the sketch below).
– Output of the classifier is a matrix
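A multilabel sketch in the style of Géron's book, with two invented boolean labels per digit image:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two invented, non-exclusive labels per image:
# is the digit large (>= 7)? is the digit odd?
y_large = (y_train.astype(int) >= 7)
y_odd = (y_train.astype(int) % 2 == 1)
y_multilabel = np.c_[y_large, y_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

# One boolean per label for each input, e.g. [[False, True]] for a 5.
print(knn_clf.predict(X_test[:1]))
```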
Cross-Validation

• Cross-validation is based on splitting the training data into different data sets or folds.
– The ML model is then trained on all folds but one and validated on the fold not used for training (see the sketch below)
– E.g., 4-fold cross-validation would train 4 different models and evaluate each model on a different one of the 4 folds
– The folds have to be chosen randomly but should also be representative of the whole data set
• Stratified sampling can be used
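A sketch with scikit-learn (for a classifier, an integer cv performs stratified splitting automatically):

```python
from sklearn.model_selection import cross_val_score

# 4-fold cross-validation: trains 4 models, each evaluated on a
# different held-out fold; returns one accuracy score per fold.
scores = cross_val_score(sgd_clf, X_train, y_train_7,
                         cv=4, scoring='accuracy')
print(scores, scores.mean(), scores.std())
```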

Cross Validation Example

• Each fold will produce a score, e.g., accuracy, from which we can then calculate the mean and standard deviation.

(Figure: cross-validation folds; the data labelled “test” in each fold is more precisely called validation data.)

Confusion Matrix

• Detailed analysis, especially of multi-class problems
– E.g., consider a 3-class problem with a data size of 90 and 30 samples each of classes I, II, III.
– We build a matrix with the actual sample classes along the rows and the result of the classifier along the columns.
– We enter counts for each class predicted by the classifier, e.g. (illustrative counts):

$$C = \begin{pmatrix} 25 & 3 & 2 \\ 4 & 24 & 2 \\ 1 & 2 & 27 \end{pmatrix}$$

– A perfect classifier would have a diagonal confusion matrix, e.g.:

$$C = \begin{pmatrix} 30 & 0 & 0 \\ 0 & 30 & 0 \\ 0 & 0 & 30 \end{pmatrix}$$
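A sketch that computes a confusion matrix from out-of-fold predictions for the hypothetical binary 7-detector (the same call works for the full 10-class labels):

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every training image, so each
# prediction comes from a model that never saw that image.
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_7, cv=4)

# Rows: actual classes; columns: predicted classes.
print(confusion_matrix(y_train_7, y_train_pred))
```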

Summary

• Computer vision and AI, and in particular machine learning, have a long joint history
• Any machine learning project has to deal with data, and this may well be the trickiest part, because of the need for
– data handling
– visualizing data
– organizing the data
– training, testing and validation
