Spatial-Aware Multi-Level Parsing Network for Human-Object Interaction

Zhan Su; Ruiyun Yu; Shihao Zou; Bingyang Guo; Li Cheng

doi:10.9781/ijimai.2023.06.004

Authors

Zhan Su, Northeastern University
Ruiyun Yu, Northeastern University
Shihao Zou, University of Alberta
Bingyang Guo, Northeastern University
Li Cheng, University of Alberta

Keywords

Computer vision, Deep Learning, Gated Graph Neural Network, HOI, Image Classification

Supporting Agencies

This work is supported by the National Natural Science Foundation of China (62072094) and the LiaoNing Revitalization Talents Program (XLYC2005001).

Abstract

Human-Object Interaction (HOI) detection focuses on human-centered visual relationship detection, a challenging task due to the complexity and diversity of image content. Unlike most recent HOI detection works, which rely only on paired instance-level information in the union range, our proposed Spatial-Aware Multi-Level Parsing Network (SMPNet) uses a multi-level information detection strategy that combines instance-level visual features of the detected human-object pair, part-level features of the human body, and scene-level features extracted by a graph neural network. The HOI relationship is predicted after the three levels of features are fused. We validate our method on two public datasets, V-COCO and HICO-DET. Compared with prior works, our method achieves state-of-the-art results on both datasets in terms of mAP_role, which demonstrates the effectiveness of the proposed multi-level information detection strategy.
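
The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the SMPNet implementation: the abstract does not say how the three feature levels are fused or how the prediction head is built, so concatenation followed by a linear layer with per-class sigmoid scores is assumed here purely for illustration. The function name `fuse_and_score`, the feature sizes, and the random weights are all hypothetical.

```python
import math
import random


def fuse_and_score(instance_feat, part_feat, scene_feat, weights, bias):
    """Fuse three feature vectors and score each interaction class.

    Assumption: fusion is plain concatenation and each class is scored
    independently with a sigmoid. The abstract does not specify either.
    """
    fused = instance_feat + part_feat + scene_feat  # list concatenation
    scores = []
    for w_row, b in zip(weights, bias):
        logit = sum(w * x for w, x in zip(w_row, fused)) + b
        scores.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    return scores


# Hypothetical toy features standing in for the three levels.
random.seed(0)
instance_feat = [random.uniform(-1, 1) for _ in range(4)]  # human-object pair
part_feat = [random.uniform(-1, 1) for _ in range(3)]      # human body parts
scene_feat = [random.uniform(-1, 1) for _ in range(2)]     # scene graph output

num_classes = 5
fused_dim = len(instance_feat) + len(part_feat) + len(scene_feat)
weights = [
    [random.uniform(-0.5, 0.5) for _ in range(fused_dim)]
    for _ in range(num_classes)
]
bias = [0.0] * num_classes

scores = fuse_and_score(instance_feat, part_feat, scene_feat, weights, bias)
print([round(s, 3) for s in scores])
```

Independent per-class scores are used in the sketch because one human-object pair can be involved in several interactions at once; whether SMPNet scores interactions this way is not stated in the abstract.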

