Activity Recognition on Kinect-3D Videos Using Transfer Learning
Deep Learning Final Project Report
Jianhang Chen
School of Electrical and Computer Engineering, Purdue University
West Lafayette, IN
[email protected]
I. INTRODUCTION
In this project, we develop an algorithm that analyzes videos to
detect daily activities such as falling, grasping, running, and
sitting. At first, we trained a CNN+LSTM model directly on a 3D
video dataset, the TST FALL DETECTION DATASET V2 [1],
shown in Figure 1. Unfortunately, the resulting classification of
daily activities was unsatisfactory, likely because of the limited
number of training videos (approx. 88x3 videos). We therefore
pretrained a CNN+LSTM network on the 2D UCF101 dataset [2],
shown in Figure 2, and transferred it to the 3D dataset, as
sketched below. The validation accuracy on the 3D videos is 52%
and the test accuracy is 46%.
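As a rough illustration of this transfer step, the sketch below assumes a Keras CNN+LSTM model that was pretrained on UCF101 and saved to disk; the file name, the layer indexing, and the class count are hypothetical placeholders rather than the actual values used in this project.

    from tensorflow import keras

    NUM_TST_CLASSES = 8  # assumption: number of activity classes in the 3D dataset

    # Load the CNN+LSTM network pretrained on UCF101 (file name is a placeholder).
    pretrained = keras.models.load_model("ucf101_cnn_lstm.h5")

    # Keep everything up to the last feature layer and replace the UCF101
    # softmax head with a new head for the 3D depth-video classes.
    features = pretrained.layers[-2].output
    outputs = keras.layers.Dense(NUM_TST_CLASSES, activation="softmax")(features)
    model = keras.Model(pretrained.input, outputs)

    # Freeze the transferred layers so only the new head is fitted at first;
    # the small 3D dataset would otherwise overfit quickly.
    for layer in model.layers[:-1]:
        layer.trainable = False

    model.compile(optimizer=keras.optimizers.Adam(1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])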
The remainder of this report consists of four sections: Section II
is a brief review of other work; Section III introduces the three
sessions of our final work in detail; Section IV presents the
results of our experiments; and Section V gives the discussion
and conclusion.

Figure 2 UCF101 activity dataset

II. OTHER WORK

There are several methods to classify videos. Andrej Karpathy,
et al. [3] extended the connectivity of a CNN in the time domain
to train the network to recognize activities in videos, and
successfully classified videos of many kinds of outdoor sports.
Figure 3 shows the Multiresolution CNN architecture developed
by Andrej Karpathy for video classification. Intuitively, however,
it uses only the current image to identify activities and does not
explain how to build a CNN-based model that spans several or
more frames.
Joe Yue-Hei Ng, et al., proposed a method that explicitly models
a video as an ordered sequence of frames [4]. The method
employs a recurrent neural network with Long Short-Term
Memory (LSTM) cells connected to the output of the underlying
CNN, and shows that the LSTM-based RNN (84.6%) can
outperform a pure CNN model (72.6%). Figure 4 gives an
overview of the CNN+LSTM approach of Joe Yue-Hei Ng, et al.
A CNN that sees only a single frame of a video, without the
temporal sequence, cannot extract much of the information the
video contains. We therefore chose a similar method for this
project, omitting the optical flow and feature pooling parts, and
extended it to the depth images of 3D videos captured with a
Kinect.
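To make the adopted CNN+LSTM design concrete, below is a minimal Keras sketch of a frame-level CNN feeding an LSTM, in the spirit of [4] but without the optical flow and feature pooling parts; the clip length, frame size, and layer widths are illustrative assumptions, not the exact values used in this project.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Illustrative shapes (assumptions): 16 depth frames of 64x64, 1 channel.
    FRAMES, HEIGHT, WIDTH, CHANNELS = 16, 64, 64, 1
    NUM_CLASSES = 8

    # Per-frame CNN feature extractor; TimeDistributed applies it to each frame.
    frame_cnn = keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
    ])

    model = keras.Sequential([
        layers.Input(shape=(FRAMES, HEIGHT, WIDTH, CHANNELS)),
        layers.TimeDistributed(frame_cnn),   # CNN features for every frame
        layers.LSTM(256),                    # temporal model over frame features
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])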
Figure 3 Overview of the pure CNN approach

Figure 5 Overview of the 3 sessions of our work
Figure 8 Sample results of 2D dataset

Conv Layer 1 – 32 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 2 – 64 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 3 – 128 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 4 – 256 filters of kernel size (3x3) with Max pool (2x2)
Conv Layer 5 – 512 filters of kernel size (3x3) with Max pool (2x2)
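Read literally, the layer list above corresponds to the following Keras stack; the input resolution, padding mode, and ReLU activations are assumptions, since the list does not specify them.

    from tensorflow import keras
    from tensorflow.keras import layers

    # The five conv blocks listed above (32 -> 512 filters, 3x3 kernels,
    # each followed by 2x2 max pooling). Input size is an assumption.
    cnn = keras.Sequential([layers.Input(shape=(64, 64, 1))])
    for filters in (32, 64, 128, 256, 512):
        cnn.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        cnn.add(layers.MaxPooling2D((2, 2)))
    cnn.summary()  # a 64x64 input shrinks to 2x2x512 after five pooling stages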