Siamese network based trackers formulate tracking as convolutional feature cross-correlation between a target template and a search region. However, Siamese trackers still have an accuracy gap compared with state-of-the-art algorithms and they cannot take advantage of features from deep networks, such as ResNet-50 or deeper. In this work we prove the core reason comes from the lack of strict translation invariance. By comprehensive theoretical analysis and experimental validations, we break this restriction through a simple yet effective spatial aware sampling strategy and successfully train a ResNet-driven Siamese tracker with significant performance gain. Moreover, we propose a new model architecture to perform layer-wise and depth-wise aggregations, which not only further improves the accuracy but also reduces the model size. We conduct extensive ablation studies to demonstrate the effectiveness of the proposed tracker, which obtains currently the best results on five large tracking benchmarks, including OTB2015, VOT2018, UAV123, LaSOT, and TrackingNet.
@inproceedings{li2019siamrpn++,
title={Siamrpn++: Evolution of siamese visual tracking with very deep networks},
author={Li, Bo and Wu, Wei and Wang, Qiang and Zhang, Fangyi and Xing, Junliang and Yan, Junjie},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
pages={4282--4291},
year={2019}
}
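As context for the abstract, the correlation step can be written compactly. Below is a minimal sketch of the depth-wise cross-correlation used in SiamRPN++-style trackers, assuming PyTorch feature tensors; the shapes are toy values:

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    """Correlate each template channel with the matching search channel."""
    batch, channels = search.shape[:2]
    # Fold the batch into the channel axis so one grouped convolution
    # correlates every (search, template) pair independently.
    search = search.reshape(1, batch * channels, *search.shape[2:])
    kernel = template.reshape(batch * channels, 1, *template.shape[2:])
    out = F.conv2d(search, kernel, groups=batch * channels)
    return out.reshape(batch, channels, *out.shape[2:])

z = torch.randn(2, 256, 7, 7)    # template features (toy shapes)
x = torch.randn(2, 256, 31, 31)  # search-region features
print(depthwise_xcorr(x, z).shape)  # torch.Size([2, 256, 25, 25])
```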
Note that the checkpoints from the 10th to the 20th epoch will be evaluated during training. You can find the best checkpoint in the log file.
We provide the best model with its configuration and training log.
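Picking the best checkpoint from the log can be scripted. A rough sketch, assuming an MMEngine-style JSON-lines log; the file name and metric key are illustrative and may differ in your run:

```python
import json

# Scan a JSON-lines training log and keep the validation record with the
# highest Success; the checkpoint of that epoch is the one to use.
best = None
with open("siamese_rpn_train.log.json") as f:  # hypothetical log file name
    for line in f:
        record = json.loads(line)
        success = record.get("success")  # metric key may differ per codebase
        if success is not None and (best is None or success > best["success"]):
            best = record
print(best)
```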
Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm Precision | Precision | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
SiamRPN++ | R-50 | - | 20e | 7.54 | 50.0 | 50.4 | 59.6 | 49.7 | config | model | log |
SiamRPN++ (FP16) | R-50 | - | 20e | - | - | 50.4 | 59.6 | 49.2 | config | model | log |
Note: `FP16` means Mixed Precision (FP16) is adopted in training.
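For context, a minimal, self-contained sketch of what mixed-precision training means in PyTorch; the actual codebase enables it through its config rather than hand-written code like this:

```python
import torch
from torch import nn

# Toy mixed-precision (FP16) training loop with torch.cuda.amp (needs CUDA).
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(4, 16, device="cuda")
    y = torch.randn(4, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in FP16 where safe
        loss = nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()        # scale the loss to avoid FP16 underflow
    scaler.step(optimizer)               # unscales the grads, then steps
    scaler.update()
```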
The checkpoints from the 10th to the 20th epoch will be evaluated during training. You can find the best checkpoint in the log file.
If you want to get better results, you can use the best checkpoint to search the hyperparameters on UAV123 following here. Experimentally, the hyperparameter search on UAV123 brings around a 1.0-point Success gain.
The UAV123 results below are achieved without hyperparameter search; a rough sketch of what such a search does is shown after the table.
Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm Precision | Precision | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
SiamRPN++ | R-50 | - | 20e | 7.54 | - | 60.0 | 77.3 | 80.3 | config | model | log |
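For intuition, here is a minimal sketch of the test-time hyperparameter search mentioned above, assuming pysot-style knobs (`penalty_k`, `window_influence`, and `lr`) and a hypothetical `evaluate_success` helper that runs the tracker on UAV123 and returns its Success score:

```python
import itertools

# Hedged sketch of a grid search over SiamRPN++'s test-time hyperparameters.
# `evaluate_success` is a hypothetical helper: it runs the tracker on UAV123
# with the given knobs and returns the Success score.
def grid_search(evaluate_success):
    grid = itertools.product(
        [0.02, 0.04, 0.06],  # penalty_k: penalizes large scale/ratio changes
        [0.40, 0.44, 0.48],  # window_influence: weight of the cosine window
        [0.30, 0.35, 0.40],  # lr: smoothing rate for bbox size updates
    )
    best_score, best_params = -1.0, None
    for penalty_k, window_influence, lr in grid:
        score = evaluate_success(penalty_k, window_influence, lr)
        if score > best_score:
            best_score, best_params = score, (penalty_k, window_influence, lr)
    return best_score, best_params
```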
The results of SiamRPN++ on TrackingNet are reimplemented by ourselves. The best model trained on LaSOT is submitted to the evaluation server of the TrackingNet Challenge. We provide the best model with its configuration and training log.
Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm Precision | Precision | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
SiamRPN++ | R-50 | - | 20e | 7.54 | - | 68.8 | 75.9 | 63.2 | config | model | log |
The checkpoints from the 10th to the 20th epoch will be evaluated during training. You can find the best checkpoint in the log file.
If you want to get better results, you can use the best checkpoint to search the hyperparameters on OTB100 following here. Experimentally, the hyperparameter search on OTB100 brings around a 1.0-point Success gain.
Note: The results reported in the paper are 69.6 Success and 91.4 Precision. We trained SiamRPN++ in the official pysot codebase and could not reproduce the same results. We only get 66.1 Success and 86.7 Precision by following the training and hyperparameter search instructions of pysot, which are lower than those of the paper by 3.5 Success and 4.7 Precision respectively. Without hyperparameter search, we get 65.3 Success and 85.8 Precision. In our codebase, the results below are also achieved without hyperparameter search, and are close to the results reproduced in pysot under the same setting.
Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | Success | Norm Precision | Precision | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
SiamRPN++ | R-50 | - | 20e | - | - | 64.9 | 82.4 | 86.3 | config | model | log |
The checkpoints from the 10th to the 20th epoch will be evaluated during training. You can find the best checkpoint in the log file.
If you want to get better results, you can use the best checkpoint to search the hyperparameters on VOT2018 following here.
Note: The result reported in the paper is 0.414 EAO. We trained SiamRPN++ in the official pysot codebase and could not reproduce the same result. We only get 0.364 EAO by following the training and hyperparameter search instructions of pysot, which is lower than that of the paper by 0.05 EAO. Without hyperparameter search, we get 0.346 EAO. In our codebase, the results below are also achieved without hyperparameter search, and are close to the results reproduced in pysot under the same setting.
Method | Backbone | Style | Lr schd | Mem (GB) | Inf time (fps) | EAO | Accuracy | Robustness | Config | Download |
---|---|---|---|---|---|---|---|---|---|---|
SiamRPN++ | R-50 | - | 20e | - | - | 0.348 | 0.588 | 0.295 | config | model | log |
Because parameters such as the learning rate in the default configuration file are tuned for 8 GPUs, we recommend using 8 GPUs for training in order to reproduce the reported accuracy. You can use the following command to start the training.
# Train SiamRPN++ on the ImageNet VID, ImageNet DET, and COCO datasets with the following command
# The number after the config file represents the number of GPUs used. Here we use 8 GPUs
./tools/dist_train.sh \
configs/sot/siamese_rpn/siamese-rpn_r50_8xb28-20e_imagenetvid-imagenetdet-coco_test-lasot.py 8
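If you train with a different number of GPUs, the learning rate usually has to be adjusted. A rough sketch of the linear scaling rule, assuming the default schedule is tuned for the 8 GPUs × 28 samples encoded in the config name (`8xb28`); the base learning rate value here is hypothetical:

```python
# Linear LR scaling rule: scale the base LR by the ratio of total batch sizes.
base_lr = 0.005        # hypothetical base LR from the default config
base_batch = 8 * 28    # default: 8 GPUs x 28 samples per GPU ("8xb28")
new_batch = 4 * 28     # e.g. training on 4 GPUs instead
scaled_lr = base_lr * new_batch / base_batch
print(scaled_lr)       # 0.0025
```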
The models tested on LaSOT, TrackingNet, UAV123 and VOT2018 have the same training settings. For OTB100, there are some unique training settings.
If you want to know more about the detailed usage of train.py/dist_train.sh/slurm_train.sh, please refer to this document.
2.1 Example on LaSOT, UAV123, OTB100 and VOT2018 datasets
# Example 1: Test on LaSOT testset
# The number after the config file represents the number of GPUs used. Here we use 8 GPUs.
./tools/dist_test.sh \
configs/sot/siamese_rpn/siamese-rpn_r50_8xb28-20e_imagenetvid-imagenetdet-coco_test-lasot.py 8 \
--checkpoint ./checkpoints/siamese_rpn_r50_20e_lasot_20220420_181845-dd0f151e.pth
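For reference, the Success numbers in the tables above follow the OTB-style protocol: the area under the curve of per-frame overlap rates over IoU thresholds. A minimal, illustrative sketch of that computation (not the evaluator the codebase actually uses):

```python
import numpy as np

def iou(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def success_auc(pred_boxes, gt_boxes):
    """Area under the success curve over IoU thresholds 0, 0.05, ..., 1."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0, 1, 21)
    curve = [(ious > t).mean() for t in thresholds]
    return np.mean(curve)  # reported as a percentage, e.g. 50.4
```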
2.2 Example on TrackingNet dataset
If you want to get the results of the TrackingNet test set, please use the following command to generate result files that can be used for submission. The results will be stored in ./results/siamese_rpn_trackingnet.zip; you can modify the save path in the test_evaluator of the config.
# Example 2: Test on TrackingNet testset
# We use the best checkpoint trained on LaSOT to test on TrackingNet.
# The number after the config file represents the number of GPUs used. Here we use 8 GPUs.
./tools/dist_test.sh \
configs/sot/siamese_rpn/siamese-rpn_r50_8xb28-20e_imagenetvid-imagenetdet-coco_test-trackingnet.py 8 \
--checkpoint ./checkpoints/siamese_rpn_r50_20e_lasot_20220420_181845-dd0f151e.pth
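For intuition about what the generated zip contains, here is a minimal sketch of packaging per-sequence result files for submission, assuming one `<sequence>.txt` of per-frame "x,y,w,h" lines per video; the test script above already produces this zip for you, and the paths here are illustrative:

```python
import os
import zipfile

# Package per-sequence tracking results into a zip for the TrackingNet server.
def pack_results(result_dir, zip_path):
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in sorted(os.listdir(result_dir)):
            if name.endswith(".txt"):     # one text file per video sequence
                zf.write(os.path.join(result_dir, name), arcname=name)

result_dir = "./results/siamese_rpn_trackingnet"  # illustrative path
if os.path.isdir(result_dir):
    pack_results(result_dir, result_dir + ".zip")
```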
Use a single GPU to run inference on a video and save the result as a video.
python demo/demo_sot.py \
configs/sot/siamese_rpn/siamese-rpn_r50_8xb28-20e_imagenetvid-imagenetdet-coco_test-lasot.py \
--checkpoint ./checkpoints/siamese_rpn_r50_20e_lasot_20220420_181845-dd0f151e.pth \
--input demo/demo.mp4 \
--output sot.mp4
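Conceptually, the demo initializes the tracker on the first frame and then tracks frame by frame. A rough sketch of that loop, assuming a hypothetical `tracker` object with pysot/mmtracking-style `init` and `track` methods; the real demo_sot.py builds the model from the config and checkpoint given on the command line:

```python
import cv2

def run_demo(tracker, video_in, video_out, init_bbox):
    """Track one target through a video and save the visualization."""
    cap = cv2.VideoCapture(video_in)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    ok, frame = cap.read()
    height, width = frame.shape[:2]
    writer = cv2.VideoWriter(video_out, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    tracker.init(frame, init_bbox)  # first frame: set the target template
    writer.write(frame)
    ok, frame = cap.read()
    while ok:
        x, y, w, h = tracker.track(frame)  # locate the target in this frame
        cv2.rectangle(frame, (int(x), int(y)), (int(x + w), int(y + h)),
                      (0, 255, 0), 2)
        writer.write(frame)
        ok, frame = cap.read()
    cap.release()
    writer.release()
```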
If you want to know more about the detailed usage of demo_sot.py, please refer to this document.