
Activity Recognition on Kinect-3D Videos using Transfer Learning
Deep Learning Final Project Report

Jianhang Chen
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN
[email protected]

Abstract—This project develops an algorithm to recognize daily actions in 3D Kinect videos. The final result is produced by a combined CNN and LSTM network. After first deciding hyper parameters such as sequence length and max frames by experiment, we trained the network on the UCF101 2D video dataset, then fixed the CNN part and retrained the LSTM on the 3D dataset.

Keywords—video classification; 3D Kinect video; LSTM; transfer learning

I. INTRODUCTION

In this project, we develop an algorithm that analyzes videos to detect daily activities such as falling, grasping, running, sitting, etc. At first, we trained a CNN+LSTM model directly on a 3D video dataset named TST FALL DETECTION DATASET V2 [1], shown in Figure 1. Unfortunately, the result was unsatisfactory for classifying daily activities. The likely reason for the failure is the limited number of videos available for training (approx. 88x3 videos). We therefore pretrained a CNN+LSTM network on the UCF101 2D dataset [2], shown in Figure 2, and transferred it to the 3D dataset. The validation accuracy on the 3D videos is 52% and the test accuracy on the 3D videos is 46%.

The rest of this report consists of four sections: Section II is a brief introduction to other work; Section III describes the three sessions of our final work in detail; Section IV presents the results of our experiments; and Section V contains the discussion and conclusion.

Figure 2 UCF101 activity dataset

II. OTHER WORK

There are several methods for classifying videos. Andrej Karpathy et al. [3] extended the connectivity of a CNN into the time domain to train the network to understand the activities in videos, and successfully classified videos of many kinds of outdoor sports. Figure 3 shows the multiresolution CNN architecture they developed for video classification. Intuitively, however, it uses only the current image to identify activities and does not explain how to build a CNN-based model that spans several images.
Joe Yue-Hei Ng et al. proposed a method that explicitly models the video as an ordered sequence of frames [4]. The method employs a recurrent neural network with Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. They show that the LSTM-based RNN (84.6%) can outperform the pure CNN model (72.6%). Figure 4 gives an overview of their CNN+LSTM approach. A CNN alone cannot extract much of the information in a video when it sees only a single frame and ignores the time sequence. We therefore chose a similar method for this project, omitting the optical flow and feature pooling parts. Our work extends this method to the depth images of 3D videos from a Kinect.

Figure 1 TST FALL DETECTION DATASET V2


Figure 3 Overview of the pure CNN approach

Figure 5 Overview of the 3 sessions of our work:
• Select Hyper Parameters: extract features using pretrained Inception V3 on the UCF101 2D video dataset; train a simple LSTM model with these features as input; try different hyper parameters to get the best test result.
• Train our own model: using the hyper parameters selected above, retrain our own smaller CNN+LSTM network on the UCF101 2D video dataset; save the weights of the CNN model part.
• Model Training for 3D videos: extract features using the pretrained CNN model from the last session; retrain the LSTM on the 3D video dataset.

Figure 4 Overview of the CNN+LSTM approach

III. OUR CONTRIBUTION

As mentioned above, we first trained our custom CNN feature extraction + LSTM model directly on the 3D video dataset TST FALL DETECTION DATASET V2.

Since we did not get a satisfactory result, we implemented 3 sessions to achieve our final goal of 3D video classification, as illustrated in Figure 5. First, we selected hyper parameters, including Sequence Length, Max Frames, Image Dimension, and the number of Epochs for training. Sequence Length is the number of frames used to represent a video; Max Frames is the maximum number of frames a qualifying training video may contain. We selected these hyper parameters using a pre-trained Inception v3 model [5] on the UCF101 2D dataset. Then a small CNN network was trained to extract features, together with a simple LSTM trained on the sequence of extracted features representing each video. Finally, the trained CNN model was frozen (its weights retained) and the features it extracts from the 3D videos were fed into a simple LSTM for training.

A. Hyper Parameter Selection
1. Load the 2D UCF video dataset, partition each video into images, and save them.
2. Split the extracted video dataset into training and testing sets according to the split files provided with the dataset.
3. Select a sequence length that will represent a single video as a collection of sequenced images.
4. Clean the dataset to be loaded: if the number of images for a video is less than the sequence length, drop it.
5. Load the images of each video in order without exceeding the sequence length (skipping intermediate images).
6. Extract features for each image in the sequence from the penultimate layer of the pre-trained Inception v3 model and save them (see the sketch after this list).
7. Load the 2048-dimensional feature maps of the training dataset and the validation dataset (a split from the test dataset).
8. Fit a simple LSTM model with 2048 inputs and as many output nodes as there are classes.
9. Repeat the above process for various sequence lengths and max frames to find the parameter setting with the highest accuracy on the test dataset.
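As an illustration of steps 6 and 8, the following is a minimal Keras sketch of extracting the 2048-dimensional penultimate-layer features with pre-trained Inception v3 and fitting a simple LSTM on them. The even frame sampling, the 256-unit LSTM, and the dropout rate are assumptions for illustration, not the exact configuration used in the project.

```python
# Sketch of steps 6 and 8: Inception v3 penultimate-layer features feeding a
# simple LSTM classifier. Unstated hyper parameters here are assumptions.
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

# pooling='avg' makes the model output the 2048-d penultimate feature vector.
extractor = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def extract_features(frame_paths, seq_length):
    """Sample seq_length frames evenly and return a (seq_length, 2048) array."""
    idx = np.linspace(0, len(frame_paths) - 1, seq_length).astype(int)
    feats = []
    for i in idx:
        img = image.load_img(frame_paths[i], target_size=(299, 299))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        feats.append(extractor.predict(x)[0])   # shape (2048,)
    return np.stack(feats)

def build_feature_lstm(seq_length, num_classes):
    """Simple LSTM over 2048-d feature sequences (step 8)."""
    model = Sequential([
        LSTM(256, input_shape=(seq_length, 2048)),  # width is an assumption
        Dropout(0.5),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

Step 9 then amounts to looping this over candidate sequence lengths and max-frame values and keeping the setting with the best test accuracy.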
B. Model Selection
1. Load the 2D video dataset using the optimal hyper parameters found in the previous step.
2. Use generator-based data loading (loading all the images at once causes memory overflow); images are resized using cubic interpolation (see the sketch after this list).
3. Use the simple CNN model shown in Figure 6 to extract the features (the input is the normalized grayscale version of the RGB data).
4. Feed the output of the CNN into a simple LSTM with 512 inputs and as many output nodes as there are classes.
5. Train the combined CNN and LSTM model on the 2D video dataset.
6. Save the weights of the CNN model only.
7. Test this model on the test dataset.
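Step 2's generator-based loading might look like the sketch below. `frame_paths_for` and `label_for` are hypothetical helpers standing in for the project's dataset index, and the batch size is an arbitrary assumption; videos shorter than the sequence length are assumed to have been dropped already (step 4 of the previous list).

```python
# Sketch of generator-based loading (step 2): batches are built on the fly so
# the whole image set never has to fit in memory. frame_paths_for() and
# label_for() are hypothetical helpers for the dataset index.
import numpy as np
import cv2

def sequence_generator(video_list, seq_length, num_classes, batch_size=8):
    while True:
        batch_x, batch_y = [], []
        for vid in np.random.choice(video_list, batch_size):
            frames = []
            for path in frame_paths_for(vid)[:seq_length]:
                img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
                # Cubic interpolation resize, normalized grayscale input.
                img = cv2.resize(img, (100, 100),
                                 interpolation=cv2.INTER_CUBIC)
                frames.append(img.astype(np.float32) / 255.0)
            batch_x.append(np.stack(frames)[..., np.newaxis])  # add channel dim
            batch_y.append(np.eye(num_classes)[label_for(vid)])  # one-hot label
        yield np.stack(batch_x), np.stack(batch_y)
```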
Input Layer (100x100)
Conv Layer 1 – 32 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 2 – 64 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 3 – 128 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 4 – 256 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 5 – 512 filters of kernel size (3x3) with Max pool (2x2)
Fully Connected Layer 1 – 2048 hidden units
Fully Connected Layer 2 – 512 hidden units
LSTM

Figure 6 Overview of the CNN architecture

Figure 8 Sample results of 2D dataset

Figure 9 Sample results of 3D dataset
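A minimal Keras sketch of the architecture summarized in Figure 6 above follows. The layer sizes come from the figure; the ReLU activations, 'same' padding, and the 256-unit LSTM are assumptions where the figure is silent.

```python
# Sketch of the Figure 6 CNN wrapped in TimeDistributed so the same CNN runs
# on every frame of a sequence, with an LSTM on top. Details the figure does
# not state (activations, padding, LSTM width) are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten, Dense,
                                     TimeDistributed, LSTM, Input)

def build_cnn_lstm(seq_length, num_classes):
    cnn = Sequential(name='cnn_feature_extractor')
    cnn.add(Input(shape=(100, 100, 1)))          # 100x100 grayscale input
    for filters in (32, 64, 128, 256, 512):      # Conv layers 1-5 of Figure 6
        cnn.add(Conv2D(filters, (3, 3), activation='relu', padding='same'))
        cnn.add(MaxPooling2D((2, 2)))
    cnn.add(Flatten())
    cnn.add(Dense(2048, activation='relu'))      # Fully Connected Layer 1
    cnn.add(Dense(512, activation='relu'))       # Fully Connected Layer 2

    model = Sequential([
        TimeDistributed(cnn, input_shape=(seq_length, 100, 100, 1)),
        LSTM(256),                               # width is an assumption
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

Because the CNN is a named sub-model, saving only the CNN weights (step 6 of the Model Selection list) reduces to calling `save_weights` on `cnn` alone.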
C. Model Training for 3D videos
1. Load the 3D video files, which are a sequence of binary files.
2. Normalize the depth files to fit the previous data input (see the sketch after this list).
3. Extract features using the CNN model finalized in the previous method.
4. Train the LSTM to fit the 3D videos on the training and validation splits.
5. Test its performance on the test split of the 3D dataset.
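Steps 1 and 2 might be implemented roughly as below. The 16-bit 512x424 frame layout is an assumption about how the Kinect depth binaries are stored, so the exact `np.fromfile` parameters may differ for the TST dataset.

```python
# Sketch of steps 1-2: read one raw Kinect depth frame from a binary file and
# normalize it to the CNN's 100x100 input range. The uint16 512x424 layout is
# an assumption about the dataset's file format.
import numpy as np
import cv2

def load_depth_frame(path, width=512, height=424):
    depth = np.fromfile(path, dtype=np.uint16).reshape(height, width)
    depth = depth.astype(np.float32)
    depth /= max(depth.max(), 1.0)              # scale depth values to [0, 1]
    return cv2.resize(depth, (100, 100), interpolation=cv2.INTER_CUBIC)
```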
IV. RESULTS

With our custom CNN+LSTM model, the validation accuracy on the 3D videos is 52% and the test accuracy on the 3D videos is 46%.

Figure 7 shows some of the best hyper parameters selected in the experiments. We also tried sequence lengths such as 1 or 100, but the results were not as good.

Figures 8 and 9 show sample results on the 2D and 3D datasets.

Figure 7 Hyper parameters selected (test accuracy vs. sequence length and max frames)

V. DISCUSSION

Concepts of transfer learning and RNNs were understood through this project. We failed at first when training directly on the 3D dataset. We also failed several times because we chose other hyper parameters. Through the experiments, we learned the significance of hyper parameters such as sequence length, number of epochs, and max frames. Also, the Kinect can generate joint information of the human body besides 3D depth data. Including joint data when training the LSTM could improve the accuracy of classifying 3D videos to predict actions in daily activities.

We use a CNN+LSTM method which explicitly models the video as an ordered sequence of frames. This method can integrate information over time and achieve better performance than a pure CNN method.

REFERENCES
[1] S. Gasparrini, E. Cippitelli, E. Gambi, S. Spinsante, J. Wahslen, I. Orhan and T. Lindh, "Proposal and Experimental Evaluation of Fall Detection Solution Based on Wearable and Depth Data Fusion," ICT Innovations 2015, Springer International Publishing, 2016.
[2] K. Soomro, A. R. Zamir and M. Shah, "UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild," CRCV-TR-12-01, November 2012.
[3] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, "Large-scale Video Classification with Convolutional Neural Networks," 2014.
[4] J. Yue-Hei Ng et al., "Beyond Short Snippets: Deep Networks for Video Classification," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[5] C. Szegedy et al., "Rethinking the Inception Architecture for Computer Vision," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
