
Feature Extraction Using Deep Learning for Food Type Recognition
Muhammad Farooq, Edward Sazonov*, Senior Member, IEEE

Abstract— With the widespread use of smartphones, people are taking more and more images of their foods. These images can be used for automatic recognition of the foods present and can potentially provide an indication of eating habits. This work proposes the use of convolutional neural networks (CNN) for feature extraction from food images. A linear support vector machine classifier was trained using a 3-fold cross-validation scheme on the publicly available Pittsburgh fast-food image dataset. Features from 3 different fully connected layers of the CNN were used for classification. Two classification tasks were defined. The first task was to classify images into 61 categories and the second task was to classify images into 7 categories. Best results were obtained using 4096 features, with accuracies of 70.13% and 94.01% for the 61-class and 7-class tasks, respectively. This shows an improvement over previously reported results on the same dataset.

Keywords— Deep Learning, transfer learning, image recognition, food recognition, classification.

Muhammad Farooq ([email protected]) and Edward Sazonov ([email protected]) are with the Department of Electrical and Computer Engineering, University of Alabama, Tuscaloosa, AL 35487 USA. *Corresponding author (phone: 205-348-1981). Research reported in this publication was supported by the National Institute of Diabetes and Digestive and Kidney Diseases (grant number R01DK100796). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

I. INTRODUCTION

In the last few years, recognition of food items from images has become a popular research topic due to the availability of a large number of images on the internet and because of the interest of people in social networks. One of the challenging tasks in image-based food recognition is to determine which food items are present in the pictures. This paper focuses on the task of food item recognition, assuming that the given images are known to contain food and the algorithm is used to determine the food type.

Food type recognition is a hard problem because the shape of different food items is not well defined, and a food can have a variety of ingredients with varying textures. The color, shape, and texture of a given food type are defined by the ingredients and the way the food is prepared [1]. Even for a given food type, high intra-class variations in both shape and texture can be observed, for example in chicken burgers [1].

Researchers have proposed a number of algorithms for recognition of food items from images. The features computed from images and the choice of classifier play an important role in food type recognition systems. Yang et al. proposed a support vector machine (SVM) based approach with pairwise statistics of local features, such as distance and orientation, to differentiate between eight basic food materials [2] on the Pittsburgh fast-food image dataset [1]. On the same dataset, they further classified a given food into one of the 61 food categories with a classification rate of 28.2% [1]. Another work on the same dataset proposed the use of local textural patterns and their global structure, using a SIFT detector and Local Binary Patterns (LBP), to classify images. Joutou et al. proposed a visual recognition system to classify images of Japanese food into one of 51 categories [3]. They proposed a feature fusion approach where SIFT-based bag-of-features, Gabor, and color histogram features were used with multiple kernel learning [4]. Authors in [5] used three image descriptors, Bag of Textons, SIFT, and PRICoLBP, to classify food images. Random Forests have also been proposed for determining distinctive visual components in food images and using them for classification of food type [6]. Other researchers have proposed systems which are able to recognize and segment different food items from images taken by people in real-world scenarios using smartphone cameras [7], [8].

One of the most critical tasks for any machine learning problem is to extract useful and descriptive features. Feature engineering can be domain-specific and often requires domain knowledge. In recent years, Deep Learning algorithms have been successfully applied to a number of image recognition problems [9]. An added advantage of using Deep Learning algorithms is their ability to automatically extract useful representative features during the training phase [10]. A special class of deep learning algorithms called convolutional neural networks (CNN) has shown excellent performance on recognition tasks such as the Large Scale Visual Recognition Challenge and is considered the state of the art [11]. Training a CNN requires large datasets and is computationally expensive. Therefore, an alternative is to use a pre-trained CNN model for feature extraction, an approach called transfer learning [12], and then use another, simpler classifier such as an SVM to perform the final classification.

The goal of this paper was to explore the use of a pre-trained CNN model for feature extraction for classification of food images into different food categories. A secondary goal was to explore the classification ability of features extracted from different fully-connected layers of the CNN. In this work, an SVM classifier was used for classifying food intake using features extracted from the pre-trained CNN model, to perform multi-class classification.
II. METHODS

A. Data

The algorithm designed in this work was tested on the Pittsburgh Fast-Food Image Dataset (PFID) [1]. The dataset comprises images of 61 different fast foods captured in the laboratory. According to the authors, each food item was bought from a fast food chain on 3 different days, and on each day 6 images were taken from different angles under different lighting conditions. The background was kept constant in each image, and the focus was on the food item. The dataset consists of a total of 1098 images in 61 categories. Details of the dataset are given in [1]. As suggested in [1], in this work the data was divided into 3 folds for each food type, and 3-fold cross-validation was performed, where 12 images from two days were used for training and the remaining 6 images were used for testing. Fig. 1 shows an example of two different food items (burger and salad): the first three rows present images of a chicken burger taken on 3 different days, and the last 3 rows show images of a salad taken on 3 different days.

Further, the authors in [1] proposed dividing the foods into seven different categories, since different food types might have similar ingredients and similar physical appearance, and the training and validation images were captured on separate days with different view angles. These categories were “(1) sandwiches including subs, wraps; (2) salads, typically consisting of greens topped with some meat; (3) meat preparations such as fried chicken; (4) bagels; (5) donuts; (6) bread/pastries; and (7) miscellaneous category that included variety of other food items such as soup and Mexican-inspired fast food” [1].

This approach resulted in two separate datasets, one with 61 categories and the second with 7 categories of food items. Two separate classifiers were trained for the two problems. For both problems, similar feature computation and classification approaches were used. Details are given below.

Fig. 1. An example of image categories present in the PFID food database.

B. Feature Extraction: Convolutional Neural Network

Convolutional neural networks (CNN) are the state of the art for many image recognition problems. CNNs are essentially multi-layer neural networks with multiple convolution and pooling layers. A convolution layer consists of small rectangular patches (filters), smaller than the original image, whose weights are learned during the training phase. These filters, or kernels, are used to extract low-level details from input images; for example, the filters of the first CNN layer can extract basic information such as edges and blobs. The second type of layer used by a CNN is the pooling layer, which reduces the spatial size of the image by applying some form of aggregation, such as the maximum or average, over a rectangular window. This reduces the number of parameters that need to be computed and hence results in reduced computation at subsequent layers. In addition, a CNN architecture can have multiple fully-connected layers, which are similar to regular neural networks in that each layer has full connections to all activations in the previous layer. Fully-connected layers are denoted by FC.

Fig. 2. Example filters used by the first convolution layer in AlexNet [11]. Each of the 96 filters shown is of size 11x11x3. These filters are used to extract basic information such as edges, blobs, etc.

In this work, rather than training a CNN from scratch, a pre-trained convolutional neural network was used. Pre-trained networks can be used for feature extraction from a wide range of images. In this work, a network pre-trained on the ImageNet dataset, called AlexNet, was used [11]. AlexNet consists of a total of 23 layers, where the input size is 227-by-227-by-3 (RGB images). Images in the PFID are of size 600-by-800-by-3, and therefore they were re-sampled to 227-by-227-by-3 so that they could be used as input to the network. Fig. 2 shows the filters used in the first convolution layer of AlexNet. AlexNet has 3 fully-connected layers, represented as FC6, FC7, and FC8. Fully-connected layers learn higher-level image features and are better suited for image recognition tasks [13]. In AlexNet, FC6, FC7, and FC8 consist of 4096, 4096, and 1000 features, respectively.

C. Classification: Support Vector Machine

To perform multiclass classification, linear SVM models were used. Training and validation were performed using 3-fold cross-validation, where for each food type the images taken on two days were used for training and the images taken on the third day were used for validation. This process was repeated three times. Classification accuracies (F-scores) were reported for each food type using confusion matrices. Features from all three fully-connected layers of AlexNet were used for training three separate linear SVM models. These features were used for both the 61-class and the 7-class multiclass classification problems.

III. RESULTS

Using features extracted from the three fully-connected layers of AlexNet to train linear SVM models resulted in different accuracies for classification of images into 61 categories. Average classification accuracies were 70.13%, 66.39%, and 57.2% for features extracted from the FC6, FC7, and FC8 layers of AlexNet, respectively.

For 7 classes, the accuracies obtained for features extracted from the FC6, FC7, and FC8 layers were 94.01%, 93.06%, and 89.73%, respectively. Fig. 3, Fig. 4, and Fig. 5 show the confusion matrices for seven-class classification for features extracted from the FC6, FC7, and FC8 fully-connected layers of AlexNet. Confusion matrices for 61 classes/categories are harder to visualize and therefore are not presented.

Fig. 3. Confusion matrix; classification into seven food categories based on features extracted from the FC6 layer of AlexNet.

IV. DISCUSSION AND CONCLUSIONS

This work presented an approach based on a convolutional neural network and linear SVM models to differentiate between categories of fast foods from the Pittsburgh dataset. Instead of computing user-defined features, AlexNet was used to automatically extract features from food images. Results suggest that the features extracted from the FC6 fully-connected layer along with a linear SVM classifier provided the best classification results on both the 61-class and the 7-class classification problems.

The approach presented in this work improves on previously reported results on the same dataset under similar testing conditions. For example, for the 61-class problem, the previous best results were reported using a combination of Pairwise Rotation Invariant Co-occurrence Local Binary Pattern (PRI-CoLBPg) features with an SVM classifier, resulting in a classification accuracy of 43.1% [14], whereas the approach proposed in this work resulted in a best accuracy of 70.13%, an improvement of about 27%. On average, the proposed approach consistently performs better than previous approaches, even when features from the other two layers are used (accuracies of 66.39% and 57.2%). A possible reason is the ability of the CNN to extract local and global features which are more relevant to the classification task.

PFID is a challenging dataset where, for each food category, images were taken on 3 different days. On each day, images were taken from 6 different viewpoints. Since there are intra-class variations in the classes, the food types were split into seven major categories, i.e., sandwiches, salads/sides, chicken, bread/pastries, donuts, bagels, and tacos. The previous best result for 7-category classification was obtained with a combination of PRI-CoLBPg features and an SVM classifier, with a classification accuracy of 87.3% [14], whereas in this work features extracted from the FC6 fully-connected layer with a linear SVM obtained a classification accuracy of 94.01%, an overall improvement of about 7%. The performance of the classifiers trained with features from the FC7 and FC8 layers is also better than previous results.

In this work, the image dataset was based on fast-food images taken in the laboratory. This work is also important because of the wide use of smartphones for taking images of foods. The approach presented here can be used to automatically recognize food images and categorize similar foods. One limitation of the approach presented here is that the images contain only single food items. Future work will focus on images containing multiple food items. Another relevant problem is the use of learning algorithms to differentiate between images of food versus non-food. This will be considered in future work.

In the last decade or so, several wearable sensor systems have been proposed for automatic detection of food intake by monitoring of chewing and swallowing, such as [15]–[17]. One future direction is to use these systems to automatically detect eating episodes and then trigger a camera to capture images of the food being consumed. As a final step, the approach proposed here can be used to recognize the food type, and relevant caloric information and food volume can be extracted from the captured images, as proposed in [18].
Fig. 4. Confusion matrix; classification into seven food categories based on features extracted from the FC7 layer of AlexNet.

Fig. 5. Confusion matrix; classification into seven food categories based on features extracted from the FC8 layer of AlexNet.

REFERENCES
[1] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “PFID: Pittsburgh fast-food image dataset,” in 2009 16th IEEE International Conference on Image Processing (ICIP), 2009, pp. 289–292.
[2] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food recognition using statistics of pairwise local features,” in 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2249–2256.
[3] T. Joutou and K. Yanai, “A food image recognition system with Multiple Kernel Learning,” in 2009 16th IEEE International Conference on Image Processing (ICIP), 2009, pp. 285–288.
[4] S. Sonnenburg, G. Rätsch, C. Schäfer, and B. Schölkopf, “Large Scale Multiple Kernel Learning,” J. Mach. Learn. Res., vol. 7, pp. 1531–1565, 2006.
[5] G. M. Farinella, D. Allegra, and F. Stanco, “A Benchmark Dataset to Study the Representation of Food Images,” in Computer Vision – ECCV 2014 Workshops, L. Agapito, M. M. Bronstein, and C. Rother, Eds. Springer International Publishing, 2014, pp. 584–599.
[6] “Food-101 – Mining Discriminative Components with Random Forests.” [Online]. Available: https://www.vision.ee.ethz.ch/datasets_extra/food-101/. [Accessed: 12-Nov-2015].
[7] Y. He, C. Xu, N. Khanna, C. J. Boushey, and E. J. Delp, “Analysis of food images: Features and classification,” in 2014 IEEE International Conference on Image Processing (ICIP), 2014, pp. 2744–2748.
[8] Z. Ahmad, N. Khanna, D. A. Kerr, C. J. Boushey, and E. J. Delp, “A mobile phone user interface for image-based dietary assessment,” Proc. SPIE, vol. 9030, p. 903007, 2014.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[10] Q. V. Le, “Building high-level features using large scale unsupervised learning,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 8595–8598.
[11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[12] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[13] J. Donahue et al., “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” arXiv:1310.1531 [cs], Oct. 2013.
[14] X. Qi, R. Xiao, J. Guo, and L. Zhang, “Pairwise Rotation Invariant Co-occurrence Local Binary Pattern,” in Computer Vision – ECCV 2012, A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, Eds. Springer Berlin Heidelberg, 2012, pp. 158–171.
[15] J. M. Fontana, M. Farooq, and E. Sazonov, “Automatic Ingestion Monitor: A Novel Wearable Device for Monitoring of Ingestive Behavior,” IEEE Trans. Biomed. Eng., vol. 61, no. 6, pp. 1772–1779, Jun. 2014.
[16] M. Farooq and E. Sazonov, “A Novel Wearable Device for Food Intake and Physical Activity Recognition,” Sensors, vol. 16, no. 7, p. 1067, Jul. 2016.
[17] M. Farooq, J. M. Fontana, and E. Sazonov, “A novel approach for food intake detection using electroglottography,” Physiol. Meas., vol. 35, no. 5, p. 739, May 2014.
[18] J. Chae et al., “Volume Estimation Using Food Specific Shape Templates in Mobile Image-Based Dietary Assessment,” Proc. SPIE, vol. 7873, p. 78730K, Feb. 2011.
