Learning Representations from Spatio-Temporal Distance Maps for 3D Action Recognition with Convolutional Neural Networks

  • M. Naveenkumar
    National Institute of Technology Tiruchirappalli, Tamilnadu, India
    mnaveenmtech[at]gmail.com
  • S. Domnic
    National Institute of Technology Tiruchirappalli, Tamilnadu, India

Abstract

This paper addresses the action recognition problem using skeleton data. A novel method is proposed that employs five Distance Maps (DMs), named Spatio-Temporal Distance Maps (ST-DMs), to capture spatio-temporal information from skeleton data for 3D action recognition. Among the five DMs, four capture the pose dynamics within a frame in the spatial domain, and one captures the variations between consecutive frames along the action sequence in the temporal domain. All DMs are encoded into texture images, and a Convolutional Neural Network is employed to learn informative features from these texture images for the action classification task. In addition, a statistics-based normalization method is introduced to deal with the variable heights of subjects. The efficacy of the proposed method is evaluated on two datasets, UTD MHAD and NTU RGB+D, achieving recognition accuracies of 91.63% and 80.36%, respectively.
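The abstract outlines the pipeline at a high level: per-frame joint distances yield the spatial maps, frame-to-frame joint displacements yield the temporal map, and each map is scaled and resized into a texture image that a CNN can consume. The Python sketch below illustrates only that general idea; the paper's exact partition into four spatial DMs, its statistics-based height normalization, and its CNN input size are not reproduced here, and the array shapes, helper names, 224x224 target size, and bicubic resizing are assumptions made for illustration.

import numpy as np
from scipy.spatial.distance import cdist
from PIL import Image

def spatial_distance_map(skeleton_seq):
    # skeleton_seq: (T, J, 3) array of 3D joint positions per frame.
    # Returns one row per frame holding the upper-triangular pairwise
    # joint distances (a single spatial DM; the paper uses four).
    T, J, _ = skeleton_seq.shape
    iu = np.triu_indices(J, k=1)
    rows = [cdist(skeleton_seq[t], skeleton_seq[t])[iu] for t in range(T)]
    return np.stack(rows)                      # shape (T, J*(J-1)/2)

def temporal_distance_map(skeleton_seq):
    # Euclidean displacement of each joint between consecutive frames.
    diff = skeleton_seq[1:] - skeleton_seq[:-1]
    return np.linalg.norm(diff, axis=-1)       # shape (T-1, J)

def to_texture_image(dm, size=(224, 224)):
    # Min-max scale a distance map to [0, 255] and resize it to a fixed
    # CNN input size (224x224 is an assumption, not the paper's value).
    dm = (dm - dm.min()) / (dm.max() - dm.min() + 1e-8)
    img = Image.fromarray((dm * 255.0).astype(np.uint8))
    return img.resize(size, Image.BICUBIC)

# Example: a random 40-frame, 20-joint skeleton sequence.
seq = np.random.rand(40, 20, 3).astype(np.float32)
spatial_img = to_texture_image(spatial_distance_map(seq))
temporal_img = to_texture_image(temporal_distance_map(seq))

The resulting grayscale texture images would then be fed to a standard CNN classifier; the height normalization described in the abstract would be applied to the joint coordinates before computing the distances.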
How to cite

Naveenkumar, M., & Domnic, S. (2020). Learning Representations from Spatio-Temporal Distance Maps for 3D Action Recognition with Convolutional Neural Networks. ADCAIJ: Advances in Distributed Computing and Artificial Intelligence Journal, 8(2), 5–18. https://doi.org/10.14201/ADCAIJ201982518


Author Biographies

M. Naveenkumar
Department of Computer Applications, National Institute of Technology Tiruchirappalli, Tamilnadu, India

S. Domnic
Department of Computer Applications, National Institute of Technology Tiruchirappalli, Tamilnadu, India