Abstract
Recently, benefiting from the development of detection models, multi-object tracking methods based on the tracking-by-detection paradigm have greatly improved in performance. However, most methods still rely on traditional motion models for position prediction, such as the constant-velocity model and the Kalman filter. Only a few methods adopt deep networks for prediction, and those that do exploit only the simplest recurrent neural networks (RNNs) to predict position, while the position offset caused by camera movement is not considered. Therefore, inspired by the outstanding performance of the Transformer on temporal tasks, this paper proposes a Transformer-based motion model for multi-object tracking. By taking a target's historical position differences and the offset vectors between consecutive frames as input, the model accounts for the motion of both the target and the camera, which improves the prediction accuracy of the motion model and thereby the overall tracking performance. Comparative experiments and tracking results on the MOTChallenge benchmarks demonstrate the effectiveness of the proposed method.
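To illustrate the idea of a two-source input, the following is a minimal sketch of how a target's own frame-to-frame displacement might be concatenated with per-frame camera offset vectors before being fed to a sequence model. The function name, feature layout, and dimensions are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def build_motion_input(centers, cam_offsets):
    """Assemble a two-source input sequence for a motion model.

    centers:     (T, 2) array of a target's box centres over T frames.
    cam_offsets: (T-1, 2) array of per-frame camera offset vectors
                 (e.g. estimated by image alignment between frames).
    Returns a (T-1, 4) sequence [dx, dy, ox, oy]: the target's own
    displacement concatenated with the camera offset for each step.
    """
    deltas = np.diff(centers, axis=0)        # target-motion source
    return np.hstack([deltas, cam_offsets])  # camera-motion source

# A target drifting right/down while the camera also pans slightly:
centers = np.array([[10.0, 20.0], [12.0, 21.0], [15.0, 23.0]])
offsets = np.array([[0.5, 0.0], [0.4, -0.1]])
seq = build_motion_input(centers, offsets)
print(seq.shape)  # (2, 4)
```

A Transformer encoder (or any sequence model) could then consume `seq` and regress the next displacement; the key point the abstract makes is that the camera-offset channel lets the model separate apparent motion from the target's own motion.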
Acknowledgements
This work is supported by the National Natural Science Foundation of China under Grant 61806006, China Postdoctoral Science Foundation under Grant No. 2019M660149, Graduate Innovation Foundation of Jiangsu Province under Grant No. KYLX16_0781, the 111 Project under Grants No. B12018, and PAPD of Jiangsu Higher Education Institutions.
Cite this article
Yang, J., Ge, H., Su, S. et al. Transformer-based two-source motion model for multi-object tracking. Appl Intell 52, 9967–9979 (2022). https://doi.org/10.1007/s10489-021-03012-y