
Food Image Classification and Calorie Estimation

MINI PROJECT REPORT

Submitted in partial fulfilment of the requirements for the award of the degree

of

BACHELOR OF TECHNOLOGY
in

ELECTRONICS & COMMUNICATION ENGINEERING

by

Aditya Saxena (Enrollment No: 43011502818)
Parth Arora (Enrollment No: 43311502818)
Devansh Malhotra (Enrollment No: 35311502818)
Shridhar Sharma (Enrollment No: 50111502818)

Guided by
Prof. Kirti Gupta

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING


BHARATI VIDYAPEETH’S COLLEGE OF ENGINEERING
(AFFILIATED TO GURU GOBIND SINGH INDRAPRASTHA UNIVERSITY, DELHI)
NEW DELHI – 110063
March 2021
CANDIDATE’S DECLARATION

It is hereby certified that the work which is being presented in the B. Tech Mini Project Report
entitled "Food Image Classification and Calorie Estimation", in partial fulfilment of the
requirements for the award of the degree of Bachelor of Technology, submitted in the
Department of Electronics & Communication Engineering of BHARATI VIDYAPEETH’S
COLLEGE OF ENGINEERING, New Delhi (Affiliated to Guru Gobind Singh Indraprastha
University, Delhi), is an authentic record of our own work carried out under the guidance of Prof.
Kirti Gupta.
The matter presented in the B. Tech Mini Project Report has not been submitted by us for the award
of any other degree of this or any other Institute.

Aditya Saxena (Enrollment No: 43011502818)
Parth Arora (Enrollment No: 43311502818)
Devansh Malhotra (Enrollment No: 35311502818)
Shridhar Sharma (Enrollment No: 50111502818)

This is to certify that the above statement made by the candidates is correct to the best of my
knowledge. They are permitted to appear in the In-house Training Examination.

Dr. Kirti Gupta                Prof. Kirti Gupta                Ms. Shifaly Sharma
Project Mentor                 Head of Department, ECE          Class Advisor

The B. Tech Industrial/In-house training Viva-Voce Examination of Aditya Saxena (Enrollment
No: 43011502818), Parth Arora (Enrollment No: 43311502818), Devansh Malhotra (Enrollment
No: 35311502818) & Shridhar Sharma (Enrollment No: 50111502818) has been held on …………………………

Project Mentor (Signature of Examiner)

i
ABSTRACT

Images of food dominate social media platforms. These images can be put to beneficial use,
improving people's food experiences and helping them make the right dietary choices. In this
project, we explore food image classification on the Food-101 dataset by using and optimising
pre-trained Convolutional Neural Network models - InceptionV3, ResNet-50, VGG-16, and
EfficientNet. The classification then proceeds to calorie estimation of the recognised food item.

ii
ACKNOWLEDGEMENT

We express our deep gratitude to Prof. Kirti Gupta, Project Mentor, for her valuable guidance
and suggestions throughout our project work. We are thankful to Ms. Shifaly Sharma, Class
Advisor, for her valuable guidance.
We would like to extend our sincere thanks to the Head of the Department (ECE),
Prof. Kirti Gupta, for her time-to-time suggestions to complete our project work. We are also
thankful to Dr. Dharmender Saini, Principal, for providing us the facilities to carry out our
project work.

Thanks for all your encouragement!

Aditya Saxena (Enrollment No: 43011502818)
Parth Arora (Enrollment No: 43311502818)
Devansh Malhotra (Enrollment No: 35311502818)
Shridhar Sharma (Enrollment No: 50111502818)

iii
LIST OF FIGURES

Figure 3.1 Data Augmentation………………………………………………………………………..3
Figure 3.2 Training & Validation Generators………………………………………………………..4
Figure 3.3 Loading Base Model…………………………………………………………………….…5
Figure 3.4 Compile & Fit……………………………………………………………………………...6
Figure 4.1(a) ResNet-50 loss vs accuracy……………………………………………………………...7
Figure 4.1(b) EfficientNet loss vs accuracy…………………………………………………………....7
Figure 4.1(c) InceptionV3 loss vs accuracy…………………………………………………………….8
Figure 4.1(d) VGG-16 loss vs accuracy…………………………………………………………………8
Figure 4.1(e) Test Image…………………………………………………………………………………8

iv
LIST OF TABLES

Table 4.1 Performance evaluation of obtained results………………………………................9

v
TABLE OF CONTENTS

CANDIDATE’S DECLARATION i
ABSTRACT ii
ACKNOWLEDGEMENT iii
LIST OF FIGURES iv
LIST OF TABLES v

Chapter 1: Introduction………………………………………………………….........................1

Chapter 2: Literature Survey....………………………................................................................2

Chapter 3: Proposed Methodology & Implementation………………………………………...3


3.1 Data Augmentation.………...………..............………………………………….......................3
3.2 Training & Validation Generators....………………………………………...............................4
3.3 Loading Base Model………….…………..................................................................................4
3.4 Compile & Fit…………………………………………………………………………………...6

Chapter 4: Results.............................……………………………………….................................7

Chapter 5: Conclusion.......................………………………………………...............................10

Chapter 6: References................................................................................................................11

vi
CHAPTER 1: INTRODUCTION

Calorie-tracker applications are popular with the on-diet population. With such applications, people
can enter their meals and estimate their daily calorie consumption. However, the meal-entry task can
be inconvenient: sometimes it may not be possible to judge the size of a dish. A good way to make
these applications more user-friendly is therefore to ask users to take a photo of their meal and to
estimate the calories from that photo.

This task, then, seeks to find the calorie content of a meal from a food picture. The project involves
several sub-tasks: identifying the food contained in an image, estimating the quantity of food
contained, and converting the food classification into a calorie estimate. For the scope of this project,
we ignore the scale and quantity of food in an image, treat the problem as pure image classification,
and then convert the predicted labels to calorie estimates using an online food database.
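To illustrate this final step, a minimal sketch of the label-to-calorie lookup follows. The dictionary and its values are hypothetical stand-ins for the online food database, which this report does not name:

    # Hypothetical kcal-per-serving table; real values would come from
    # the online food database mentioned above.
    CALORIES_PER_SERVING = {
        "apple_pie": 296,
        "pizza": 285,
        "ramen": 436,
    }

    def estimate_calories(predicted_label):
        # Returns None when the label is missing from the table.
        return CALORIES_PER_SERVING.get(predicted_label)

    print(estimate_calories("apple_pie"))  # -> 296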

As for the dataset, Food-101 contains 101,000 images spread over 101 categories, with 1,000 images
in each category. Each category is divided into 750 training images and 250 test images.

The above-mentioned task is an image classification problem. To tackle it, we used a CNN
(Convolutional Neural Network) [5][6]. CNNs are the models most commonly used for image
classification, and their results have been quite promising.

However, training a Convolutional Neural Network from scratch is a computationally heavy task.
Moreover, with the advancements in deep learning, the availability of state-of-the-art models with
high accuracy is increasing, especially for the problem of image classification. Hence, we made use
of transfer learning for our task [3]. We explore food image classification on the Food-101 dataset
by using and optimizing pre-trained Convolutional Neural Networks - InceptionV3, ResNet-50,
VGG-16, and EfficientNet [1][3][4][7]. The classification then proceeds to calorie estimation of the
food item.

1
CHAPTER 2: LITERATURE SURVEY

Food image classification and calorie estimation is a promising application of image classification,
which is why many works have been published on it. Below is the literature survey, with citations,
that was consulted during the development of this project:

● It is fascinating how intensively food understanding has been examined in the last few years,
yet it has been a subject of study for decades. Since around 1977, four major areas have been
defined in which machine learning techniques are used, among them: i) food detection and
recognition for automatic harvesting, where automatic detection and recognition of vegetables
are important to enhance the vision systems of robots and so improve the harvesting process in
terms of quality and speed; and ii) food quality assessment for industry [6].

● Categorical image classification needs many images to train on, and the system needs
considerable time for feature extraction as well as classification. Mahajan and Chaudhary discuss
how deep learning can be used to identify objects in images. Their system uses a deeper
convolutional neural network to categorize thousands of high-resolution images into eight
different classes. Image features are extracted from a pre-trained representational deep network
(ResNet), and those features are used to train Support Vector Machine (SVM) classifiers.
Representational deep networks make feature extraction easier and faster than conventional
network methods. [1]

● Hussain M., Bird J.J., and Faria D.R. (2019) used Inception V3 to establish whether it would
work well in terms of accuracy and efficiency on new image datasets via transfer learning. The
results gave evidence that Inception V3 generates effective results when image datasets are used
for classification and segmentation tasks. The results were superior to those of Inception V2 on
the smaller image dataset for classification tasks; however, the segmentation task is limited on
low-resolution images, and the advantage of Inception V3 may not be as pronounced for
classification tasks as it is for segmentation. [4]

● Taranjit Kaur and T. K. Gandhi (2019) explored the capability of a pre-trained CNN VGG-16
model with transfer learning for pathological brain image categorization. Validation on the test
set revealed that the pre-trained VGG-16 model with transfer learning exhibited the best
performance in contrast to the other existing state-of-the-art works. The approach provides
categorization with an end-to-end structure on raw images, without any hand-crafted attribute
extraction, and achieves high performance while keeping the network's weights in sparse form.
After analysing the results of this research paper, we decided to use VGG-16. [7]

● Haikel Alhichri et al. proposed a novel deep learning model for the classification of remote
sensing (RS) scenes based on the EfficientNet CNN combined with an attention mechanism.
They investigated two versions, EfficientNet-B3-Attn-1 and EfficientNet-B3-Attn-2. In the
EfficientNet-B3-Attn-1 model, the attention mechanism is added to the last feature map, whereas
in EfficientNet-B3-Attn-2 it is added at the end of layer 262. Thus EfficientNet-B3-Attn-2 has
two branches: the main branch without attention, and a secondary branch attached to the end of
layer 262 that uses attention. [2]

2
CHAPTER 3: PROPOSED METHODOLOGY & IMPLEMENTATION

3.1 DATA AUGMENTATION

A total of 101,000 images from 101 classes of food were used from the Food-101 dataset, with 1,000
images for each class. Of the 1,000 images per class, 250 were manually reviewed test images and
750 were intentionally noisy training images, for a total of 75,750 training images and 25,250 test
images. Compared to the 10-class food image dataset from ImageNet, Food-101 presents some
additional challenges. For one, the ImageNet food image dataset contains few, relatively distinct
food categories (apple, banana, broccoli, burger, egg, french fries, hot dog, pizza, rice, and
strawberry), while Food-101 contains some food items that are similar in both content and
presentation (e.g. pho vs. ramen).
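One convenient way to obtain exactly this split (an assumption on our part; the report does not state how the dataset was loaded) is TensorFlow Datasets, which ships Food-101 with the same 750/250 partition:

    import tensorflow_datasets as tfds

    # Food-101 in TFDS: 'train' holds the 75,750 noisy training images and
    # 'validation' the 25,250 manually reviewed test images (750/250 per class).
    (train_ds, test_ds), info = tfds.load(
        "food101",
        split=["train", "validation"],
        as_supervised=True,   # yield (image, label) pairs
        with_info=True,
    )
    print(info.features["label"].num_classes)  # 101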

Fig 3.1: Data Augmentation

Additionally, the training dataset images were very dissimilar in lighting, coloring, and size, and also
contained mislabeled images, which were left in the training dataset to encourage models to be robust
to labeling anomalies. We also utilized ImageNet weights during transfer learning to boost model
accuracy, though not the ImageNet dataset directly.

Images were normalized and resized appropriately, either to 128x128 or 256x256 in the initial model
implementations, or to the model's specification when using transfer learning. Image data was
augmented through rotation, shifting, and horizontal flipping to avoid overfitting. During transfer
learning, images were also preprocessed using each model's custom preprocessing function, which
implements the image preprocessing of the original model paper.
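Fig 3.1 is a code screenshot; a minimal Keras sketch consistent with the transformations described above (the exact ranges are assumptions, not the report's actual settings) is:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Augmentation matching the description above: rotation, shifting,
    # horizontal flipping, plus pixel normalization.
    train_datagen = ImageDataGenerator(
        rescale=1.0 / 255,       # normalize pixel values to [0, 1]
        rotation_range=30,       # random rotation
        width_shift_range=0.2,   # random horizontal shift
        height_shift_range=0.2,  # random vertical shift
        horizontal_flip=True,    # random horizontal flip
    )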

3
3.2 TRAINING & VALIDATION GENERATORS

Data augmentation helped us create new temporary images from those already present, expanding
the data for better training. It was used to augment the images in real time while the model was
training: random transformations are applied to each training image as it is passed to the model.
Training data: Neural networks and other artificial intelligence programs require an initial set of
data, called a training dataset, to act as a baseline for further application and utilization. This dataset
is the foundation of the program's growing library of information and must be accurately labeled
before the model can process and learn from it.
Validation data: Evaluating a model's skill on the training dataset would result in a biased score, so
the model is evaluated on a held-out sample to give an unbiased estimate of its skill. This is typically
called the train-test split approach to algorithm evaluation. The term "validation set" is often used
interchangeably with "test set" and refers to a sample of the dataset held back from training the
model.

Fig 3.2: Training & Validation Generators
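A sketch of the corresponding generators follows; the directory layout, image size, and batch size are assumptions, and Fig 3.2 shows the code actually used:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    train_datagen = ImageDataGenerator(rescale=1.0 / 255, rotation_range=30,
                                       width_shift_range=0.2, height_shift_range=0.2,
                                       horizontal_flip=True)
    val_datagen = ImageDataGenerator(rescale=1.0 / 255)  # no augmentation for validation

    # flow_from_directory expects one sub-folder per class, which is how
    # Food-101 is organised on disk ("food-101/..." is a hypothetical path).
    train_generator = train_datagen.flow_from_directory(
        "food-101/train", target_size=(256, 256),
        batch_size=32, class_mode="categorical")
    validation_generator = val_datagen.flow_from_directory(
        "food-101/test", target_size=(256, 256),
        batch_size=32, class_mode="categorical")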

3.3 LOADING BASE MODEL

We decided to implement transfer learning from different models trained on the ImageNet dataset to
take advantage of the features learned by those models using deeper architectures and with more
training time, specifically VGG16, ResNet50, and InceptionV3. Transfer learning was implemented by
loading the ImageNet weights into each model and freezing the base layers of each model while
removing the top layers that were trained specifically on the ImageNet classes.

4
Fig 3.3: Loading Base Model
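A minimal sketch of this step, assuming the Keras applications API that the report's figures suggest (the size of the new head is our assumption), is:

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import InceptionV3

    # Load ImageNet weights without the ImageNet-specific top, freeze the
    # base, and add a new trainable head for the 101 Food-101 classes.
    base_model = InceptionV3(weights="imagenet", include_top=False,
                             input_shape=(299, 299, 3))
    base_model.trainable = False  # freeze the pre-trained layers

    model = models.Sequential([
        base_model,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.2),
        layers.Dense(101, activation="softmax"),  # Food-101 classes
    ])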

These final layers were then replaced with trainable layers meant to learn classification on the Food-
101 classes. VGG16 was a callback to the baseline model in its use of fixed-size 3x3 filters, composed
into a deeper architecture, and was surprisingly slow to train. For faster training, we began looking
more into ResNet50 and InceptionV3, training both models without the top layer. In ResNet50, the
model architecture remedies the common issues of deeper neural networks through residual blocks,
which let the model take advantage of skip connections between earlier and later layers.

This essentially allows models to skip layers that do not improve the overall accuracy and to choose
the optimal number of layers during training, boosting accuracy. For InceptionV3, we experimented
with unfreezing some of the base model layers, and found that training with the top few layers
unfrozen improved performance over training only the newly added top layer.
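Continuing the previous sketch, partial unfreezing might look like this (the number of layers kept frozen is an assumption; the report unfroze the top layers in stages):

    # Unfreeze only the top of the base model; "last 30 layers" is an
    # illustrative choice, not the report's actual setting.
    base_model.trainable = True
    for layer in base_model.layers[:-30]:
        layer.trainable = False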

In InceptionV3, the inception modules allow the model to train with different filter sizes at each layer
without risking overfitting or excessive computational cost. We also attempted training all layers, but
found it extremely slow and computationally expensive.

Finally, in addition to the image preprocessing applied to all images before training, we also applied
InceptionV3's custom image preprocessing function to all images during training, which increased
accuracy to 96.9%.
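In Keras terms (a sketch under the same assumptions as above), this per-model preprocessing can be plugged into the generator, replacing the plain rescale:

    from tensorflow.keras.applications.inception_v3 import preprocess_input
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # InceptionV3's own preprocessing scales pixels to [-1, 1]; augmentation
    # settings as in the earlier sketch.
    train_datagen = ImageDataGenerator(
        preprocessing_function=preprocess_input,
        rotation_range=30,
        horizontal_flip=True,
    )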

5
3.4 COMPILE & FIT

Fig 3.4: Compile & Fit
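Fig 3.4 is a code screenshot; a minimal compile-and-fit sketch consistent with it, reusing the model and generators from the earlier sketches (the optimizer choice and epoch count are assumptions), would be:

    # Compile with a categorical loss to match the one-hot labels produced
    # by the generators, then train with held-out validation data.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])

    history = model.fit(train_generator,
                        validation_data=validation_generator,
                        epochs=10)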

6
CHAPTER 4: RESULTS

This section presents the performance evaluation of the proposed approach and the obtained results.
We tested different model architectures on the same Food-101 dataset and classification problem,
covering both models trained from scratch and transfer learning with EfficientNet, VGG16,
ResNet50, and InceptionV3 models pre-loaded with ImageNet weights. The highest-performing
model was a pre-trained InceptionV3 model with the top layers unfrozen in stages, reaching a total
accuracy of 96.9%, which outperforms the model in the original Food-101 paper.

Fig 4.1(a): ResNet-50 loss vs accuracy    Fig 4.1(b): EfficientNet loss vs accuracy

Future work would involve more optimization on hyperparameters and model aspects such as which
layers to freeze versus make trainable during transfer learning. Due to computing resources and time
constraints, most model implementation decisions were made by examining the convergence of the
model and relative metrics from training versus validation, but an exhaustive hyperparameter search
would have been a more empirical approach.

The validation and training accuracies reflect the algorithm's ability to correctly identify even food
images that are outliers on the scale of familiarity and distinctiveness, in other words images that are
very hard to distinguish.

7
Fig 4.1(c): InceptionV3 loss vs accuracy    Fig 4.1(d): VGG-16 loss vs accuracy

Fig 4.1(e): Test Image

After training, we tested the model by passing an apple-pie image through it; the result can be seen
in Fig 4.1(e). Not only did it predict the class of the image correctly, it also estimated the calories of
the food. Thus, our model is working as intended.

8
S.No   Model          Validation Accuracy   Train Accuracy
1      InceptionV3    95.7%                 96.9%
2      ResNet-50      91.4%                 89.6%
3      VGG-16         94.2%                 96.6%
4      EfficientNet   88.3%                 86.9%

Table 4.1: Performance evaluation of obtained results.

The performance analysis of the algorithms used revealed that VGG16 and InceptionV3 gave similar
training and validation accuracies, while ResNet-50 and EfficientNet were lower on a relative scale.

9
CHAPTER 5: CONCLUSION

We tested different model architectures on the same Food-101 dataset and classification problem,
covering both models trained from scratch and transfer learning with EfficientNet, VGG16,
ResNet50, and InceptionV3 models pre-loaded with ImageNet weights. The highest-performing
model was a pre-trained InceptionV3 model with the top layers unfrozen in stages, reaching a total
accuracy of 96.9% and outperforming the model in the original Food-101 paper.

Transfer learning was the most successful approach because the earlier pre-trained layers had already
learned many of the general features needed to identify food images. Future work would involve
further optimization of hyperparameters and of model choices such as which layers to freeze versus
make trainable during transfer learning.

Due to computing-resource and time constraints, most model implementation decisions were made
by examining the convergence of the model and the relative metrics from training versus validation,
but an exhaustive hyperparameter search would have been a more empirical approach. Model
performance could be further improved by adding bounding boxes to the images. Some of the images
in the Food-101 dataset are not cropped tightly around the food and contain other noisy elements;
this could be addressed by training another model to tightly bound the food itself, before passing that
output as an input to the food image classification model trained in this project.

Another possibility is to train models to recognize images within a subset of food (e.g. fruits vs.
noodles vs. pastries), since many of the model's errors result from confusing similar food items with
each other (e.g. tiramisu vs. chocolate cake). Finally, given the relatively high top-5 accuracy, other
non-image features could be used to improve top-1 accuracy. For example, a food location's menu
or cuisine definition could be used to classify food images from that place more confidently (e.g. by
restricting the classifier's candidates to dishes that actually appear on the menu).

10
CHAPTER 6: REFERENCES

1. A. Mahajan and S. Chaudhary, "Categorical Image Classification Based On Representational
Deep Network (RESNET)," 2019 3rd International Conference on Electronics, Communication
and Aerospace Technology (ICECA), 2019, pp. 327-330, doi: 10.1109/ICECA.2019.8822133.
2. H. Alhichri, A. S. Alswayed, Y. Bazi, N. Ammour and N. A. Alajlan, "Classification of Remote
Sensing Images Using EfficientNet-B3 CNN Model With Attention," IEEE Access, vol. 9, pp.
14078-14094, 2021, doi: 10.1109/ACCESS.2021.3051085.
3. M. Shaha and M. Pawar, "Transfer Learning for Image Classification," 2018 Second
International Conference on Electronics, Communication and Aerospace Technology (ICECA),
2018, pp. 656-660, doi: 10.1109/ICECA.2018.8474802.
4. M. Hussain, J. J. Bird and D. R. Faria, "A Study on CNN Transfer Learning for Image
Classification," in A. Lotfi, H. Bouchachia, A. Gegov, C. Langensiepen and M. McGinnity (eds),
Advances in Computational Intelligence Systems (UKCI 2018), Advances in Intelligent Systems
and Computing, vol. 840, Springer, Cham, 2019, doi: 10.1007/978-3-319-97982-3_16.
5. H. Kagaya and K. Aizawa, "Highly Accurate Food/Non-Food Image Classification Based on
a Deep Convolutional Neural Network," in V. Murino, E. Puppo, D. Sona, M. Cristani and
C. Sansone (eds), New Trends in Image Analysis and Processing - ICIAP 2015 Workshops,
Lecture Notes in Computer Science, vol. 9281, Springer, Cham, 2015, doi: 10.1007/978-3-319-
23222-5_43.
6. K. Yanai and Y. Kawano, "Food image recognition using deep convolutional network with pre-
training and fine-tuning," 2015 IEEE International Conference on Multimedia & Expo
Workshops (ICMEW), 2015, pp. 1-6, doi: 10.1109/ICMEW.2015.7169816.
7. T. Kaur and T. K. Gandhi, "Automated Brain Image Classification Based on VGG-16 and
Transfer Learning," 2019 International Conference on Information Technology (ICIT), 2019, pp.
94-98, doi: 10.1109/ICIT48102.2019.00023.

11
