Younis 2020
To overcome these limitations, one of the simplest techniques in this approach is the SSD that we use with the deep learning (DL) pre-trained model MobileNet through CNN development. In this way, objects are easily recognized in complex video scenes and intricate background conditions.

First, the background image is reconstructed from the input frames. From this background image, motion information can be estimated, together with a color indicator based on the current color information in the frame. The motion and color indicators are then combined in a Markov Random Field (MRF) framework to separate object data from the background. The idea of background subtraction is to subtract the current image from a still backdrop, which is assembled before the objects of interest enter the scene. After subtraction, only non-stationary or new items are left. This technique is particularly suitable for video conferencing and surveillance applications, where the background remains fixed throughout the entire conference or monitoring period. Nevertheless, there is still considerable difficulty when the foreground and background colors look the same, when the quality of the light changes, and when noise is present, so this is not a simple way to isolate a video object, and simple change-detection and thresholding techniques are of limited use.

On the basis of a literature survey, it was found that there has been a marked increase in the need to build compact and efficient neural networks [12, 13, 14, 15, 16]. The many diverse methods can generally be categorized as either compressing already-trained networks or directly training small networks. MobileNet offers two simple global hyperparameters that trade off properly between latency and accuracy, allowing the model builder to choose the right size of MobileNet for their application [17]. However, the majority of papers on small networks emphasize only size and do not address speed [18].

In the detection of leaf diseases, a smart mobile application was designed and built on a deep CNN to detect tomato leaf diseases. The application is based on the MobileNet CNN model, which is able to identify ten essential tomato leaf disease types. To assemble the tomato-leaf dataset for the development of the tomato disease diagnosis application, 7176 images of tomatoes were utilized. MobileNet is primarily formed from the depthwise separable convolutions introduced in [19] and later used in the Inception model [20] to reduce the computation in the first few layers.

2. METHODOLOGY
SSDs have dual mechanisms: a backbone model and an SSD head. The backbone model is generally a pre-trained image-classification network that acts as a feature extractor. Here we use a network called MobileNet, trained on over a million images, from which the final classification layer has been removed. The SSD head consists of one or more convolutional layers added to this backbone, and its outputs are interpreted as the classes and bounding boxes of objects at the spatial locations of the final layer's activations. In Figure 2, the first few layers of white cells form the backbone, and the layers of blue cells signify the head of the SSD.

Figure 2. The architecture of CNN with an SSD detector [22]

2.1 Some Parameters in SSD
2.1.1 Grid Cell
Detecting objects means estimating the class and location of an element directly within a region of the image. For example (see Figure 3), we use a 4x4 grid; each grid cell is responsible for outputting the location and shape of the object that falls within it. If a grid cell contains several elements, or objects of unlike shapes must be detected, the additional mechanism called the anchor box completes this part.

Figure 3. Example of a 4x4 grid and three anchor boxes

2.1.2 Anchor Box
Each grid cell in SSD is allocated several anchor boxes, or priors. Within a grid cell, these anchor boxes can cover different shapes and sizes. The cats (see Figure 3) correspond to different anchor boxes: one anchor box is taller, while the other is wider, hence the different sizes of the anchor boxes. The anchor box with the greatest overlap (intersection) with an object determines the class and the location of that object. This property is used for training the network and for predicting the detected object and its location after the network has been trained.

2.1.3 Zoom Level
Anchor boxes do not have to be the same size as the grid cells. The zoom level specifies, for each grid cell, to what degree the individual anchor box is scaled upward or downward.

2.1.4 Aspect Ratio
Some objects are wider in shape whereas some are taller (see Figure 3), to different degrees. The SSD framework permits predefined aspect ratios for its anchor boxes; an aspect-ratio parameter is used to describe the different aspect ratios of the anchor boxes associated with each grid cell.

2.1.5 Receptive Field
The receptive field is defined as the region of the input viewed by a particular CNN feature. Zeiler and Fergus [21] used deconvolution to project distinctive features and their activations back to their relative locations in the input. Because of the convolution operation, features at different layers correspond to regions of different sizes in the image. In Figure 4, a convolution is placed on the lowest layer (5x5), which results in a central layer (3x3) in which a single green pixel represents a 3x3 region of the input layer (bottom layer). The convolution is then applied to the middle (green) layer, giving the upper red layer (2x2), where each individual feature corresponds to a 7x7 area of the input image. The green and red 2D arrays are referred to as feature maps; each, in the form of an indicator window, points to a set of features created by applying the same feature extractor at different points of the input map. Features of the same map have receptive fields of the same size and in turn try to identify similar patterns at different positions. Hence the local character of a Convolutional Network is created.
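The receptive-field growth described above can be checked with a short calculation. The sketch below is a minimal illustration, not taken from the paper; it assumes the two layers of the Figure 4 example are 3x3 convolutions with stride 2, which reproduces the stated 3x3 and 7x7 receptive fields via the standard recurrence r_out = r_in + (k - 1) * j, where j is the cumulative stride (jump).

```python
# Receptive-field growth through a stack of convolution layers.
# Each layer is (kernel_size, stride). The jump j is the distance,
# measured in input pixels, between two adjacent features of the
# current feature map.

def receptive_fields(layers):
    r, j = 1, 1                   # a raw input pixel sees exactly itself
    out = []
    for k, s in layers:
        r = r + (k - 1) * j       # kernel reaches (k-1)*j extra input pixels
        j = j * s                 # features move s times farther apart
        out.append(r)
    return out

# Two 3x3 convolutions with stride 2, as assumed for the Figure 4
# example: the middle (green) features see 3x3 of the input, and the
# top (red) features see 7x7 of the input image.
print(receptive_fields([(3, 2), (3, 2)]))  # -> [3, 7]
```

The same helper shows why stacking small kernels is cheap but still grows the receptive field quickly: each extra layer adds (k - 1) * j input pixels, and j grows multiplicatively with the strides.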
Figure 4. Visualizing CNN feature maps and the receptive field

2.2 Development of Algorithm
We perform object detection using the OpenCV library and deep-learning pre-trained models; the setup is almost similar to real-time face recognition, in which the system is first trained using familiar or reference faces so that, in case one of those faces appears in an image or in the video fed into the system, it will recognize that face. In this paper, which deals with object detection, we cannot predict the number of objects or the kinds of objects, such as cars, people, and cats, in advance. If we have enough images of a car to train a system, then the system can predict those objects in an image or video, but this is not practically possible for every class because there are plenty of objects around us. We therefore rely on pre-trained models, which have been trained by third parties and in which most common objects are already covered. Finally, the system is ready to detect objects using pre-trained models with the SSD method. We use the pre-trained MobileNet model together with the SSD method in Python code. This model can assign class labels on the basis of its training data, along with a set of bounding-box colors for the individual classes. The input video is loaded frame by frame, and each frame is resized to a fixed size of 300x300 pixels to make the input for a single frame.

The MobileNet method is utilized to speed up the SSD algorithm while keeping its rated accuracy on a real-time basis. This approach requires taking a single shot to detect multiple objects: the SSD is a neural-network architecture designed for detection in which localization and classification occur at the same time, whereas other methods, such as the R-CNN series, require two shots. The SSD technique discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales, and its predictions are made over this discretized set of default boxes. The network rapidly scores the presence of the individual object classes in each default box and adjusts the box to fit what is inside it. The network also combines predictions from feature maps of different resolutions to naturally handle objects of different sizes. If no item is present, the box is considered background and the location is ignored.

2.3.1 Model files
These are the files of our pre-trained model: one is the configuration and the second is the weights. Together they define the model, that is, how the neurons are arranged in the neural network.
1- Configuration
2- Weights

3. EXPERIMENTAL RESULTS
The object-detection algorithm runs at up to 14 fps, so low-quality cameras of any frame rate can produce good results; in this case, we consider a 6 fps webcam. In our experiments, the SSD algorithm demonstrates detection on indoor and outdoor video frames fed through the webcam, where the positions of the objects differ between two consecutive frames. The video is captured by the webcam, and the algorithm converts each single frame to the size of 300 × 300. SSD detects objects frame by frame, giving the confidence of a class label and creating a bounding box around the detected object. The results obtained from this procedure on frames of a homemade video are shown in Figure 5.

In the input frame of a video sequence, a TV monitor is detected (see Figure 6) with a confidence level of 76.46% and a person with a level of 97.86% (i.e., probability); although the full face of the person is not shown, the CNN detects human characteristics with high accuracy.

The SSD can produce multiple bounding boxes for different classes with different confidence levels (see Figure 7); using a higher proportion of default boxes can have a better effect, since different boxes are used for each location. The proposed method of single-shot multi-box detection is based on frame differences (see Figure 8), and the analyzed frames show the effectiveness of the proposed method. The detection results in foggy weather conditions (see Figure 8) verify the accuracy and sturdiness of the proposed method.
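The frame-by-frame procedure above can be sketched in code. This is a minimal illustration, not the authors' implementation: the Caffe file names in the comment, the 0.5 confidence threshold, and the exact 20-class PASCAL VOC label list are assumptions about the particular MobileNet-SSD variant used.

```python
# Sketch of one detection step. The network itself would be loaded with
# OpenCV's dnn module (file names here are assumptions):
#
#   net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",   # 1- configuration
#                                  "MobileNetSSD_deploy.caffemodel") # 2- weights
#   blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
#                                0.007843, (300, 300), 127.5)
#   net.setInput(blob)
#   rows = net.forward()[0, 0]  # rows: [image_id, class_id, conf, x1, y1, x2, y2]
#
# Decoding the raw rows into labelled pixel boxes needs no OpenCV:

# Assumed label list (PASCAL VOC classes of this MobileNet-SSD variant).
CLASSES = ["background", "aeroplane", "bicycle", "bird", "boat",
           "bottle", "bus", "car", "cat", "chair", "cow", "diningtable",
           "dog", "horse", "motorbike", "person", "pottedplant",
           "sheep", "sofa", "train", "tvmonitor"]

def decode_detections(rows, frame_w, frame_h, conf_threshold=0.5):
    """Keep confident detections; scale normalized corners to pixels."""
    results = []
    for _, class_id, conf, x1, y1, x2, y2 in rows:
        if conf < conf_threshold:
            continue                      # treated as background, location ignored
        box = (int(x1 * frame_w), int(y1 * frame_h),
               int(x2 * frame_w), int(y2 * frame_h))
        results.append((CLASSES[int(class_id)], conf, box))
    return results

# Two raw rows for a 640x480 frame: a confident person and a weak cat
# detection that falls below the threshold.
rows = [(0, 15, 0.9786, 0.25, 0.125, 0.75, 0.875),
        (0, 8, 0.20, 0.0, 0.0, 0.5, 0.5)]
print(decode_detections(rows, 640, 480))
# -> [('person', 0.9786, (160, 60, 480, 420))]
```

Drawing the surviving boxes and labels on the frame (e.g. with cv2.rectangle and cv2.putText, one color per class) then reproduces the kind of output shown in Figures 6 and 7.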
4. CONCLUSIONS
A high-accuracy object-detection procedure has been achieved by using MobileNet with the SSD detector, which can push the processing speed to 14 fps and thus remains efficient for cameras that can often process at only 6 fps. This system can detect the items within its dataset, such as a car, bicycle, bottle, chair, etc. The dataset can be expanded by adding an unlimited number of items through the use of images, relying on deep-learning technology. We used Ubuntu 18.04.2, OpenCV 3.4.2, and the Python programming language for the experiments with the SSD algorithm. The goal of this research is to develop an autonomous system in which the recognition of objects and scenes helps the community by making the system interactive and attractive. As future work, this system will be primarily deployed to identify items with better features in the external environment.
Figure 6. System detecting a person and a TV monitor
Figure 7. SSD multiple bounding boxes for different classes with different confidence levels
Figure 8. Car detection from a video frame sequence

5. ACKNOWLEDGMENTS
The authors give special thanks to the Editor-in-Chief and the anonymous referees for their valuable comments and suggestions. This work was partially supported by Tianjin science and technology (19JCTPJC54800) and Tianjin graduate research (2019YJSS194).

6. REFERENCES
[1] Hong, Y. C., Chung, Y. S. An enhanced hybrid MobileNet. 2018 9th International Conference on Awareness Science and Technology (iCAST), 2018.
[2] Shraddha, M., Supriya, M. Moving object detection and tracking using convolutional neural networks. IEEE Xplore, ISBN 978-1-5386-2842-3.
[3] Ojala, T., Pietikäinen, M., Mäenpää, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 7, pp. 971–987, 2002.
[4] Haralick, R. M., Shanmugam, K., Dinstein, I. Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, no. 6, pp. 610–621, 1973.
[5] Hu, M. K. Visual pattern recognition by moment invariants. IRE Transactions on Information Theory, vol. 8, no. 2, pp. 179–187, 1962.
[6] Khotanzad, A., Hong, Y. H. Invariant image recognition by Zernike moments. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 5, pp. 489–497, 1990.
[7] Huang, J., Kumar, S. R., Mitra, M., Zhu, W. J., Zabih, R. Image indexing using color correlograms. In CVPR, IEEE, 1997, p. 762.
[8] Prajakta, A. P., Prachi, A. D. Moving object extraction based on background reconstruction. International Journal of Innovative Research in Computer and Communication Engineering, vol. 3, no. 4, April 2015.
[9] Dalal, N., Triggs, B. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), vol. 1, IEEE Computer Society, 2005, pp. 886–893.
[10] Rosebrock, A. Deep Learning for Computer Vision with Python: Starter Bundle. PyImageSearch, 2017.
[11] Han, F., Shan, Y., Cekander, R., Sawhney, H. S., Kumar, R. A two-stage approach to people and vehicle detection with HOG-based SVM. In Performance Metrics for Intelligent Systems 2006 Workshop, 2006, pp. 133–140.
[12] Jin, J., Dundar, A., Culurciello, E. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
[13] Wang, M., Liu, B., Foroosh, H. Factorized convolutional neural networks. arXiv preprint arXiv:1608.04337, 2016.
[14] Iandola, F. N., Moskewicz, M. W., Ashraf, K., Han, S., Dally, W. J., Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1 MB model size. arXiv preprint arXiv:1602.07360, 2016.
[15] Wu, J., Leng, C., Wang, Y., Hu, Q., Cheng, J. Quantized convolutional neural networks for mobile devices. arXiv preprint arXiv:1512.06473, 2015.
[16] Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. arXiv preprint arXiv:1603.05279, 2016.
[17] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[18] Elhassouny, A., Smarandache, F. Smart mobile application to recognize tomato leaf diseases using convolutional neural networks. IEEE/ICCSRE 2019, 22–24 July 2019, Agadir, Morocco.
[19] Sifre, L. Rigid-motion scattering for image classification. Ph.D. thesis, 2014.
[20] Ioffe, S., Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[21] Zeiler, M. D., Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833, Springer, Cham, 2014.
[22] Liu, W., et al. SSD: Single shot multibox detector. In European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 9905, 2016. arXiv:1512.02325.