Abstract. Motion-guided attention is built on track queries, which serve as both motion priors and temporal offset predictors, without any extra embedding dedicated to motion prediction. The motion prior explicitly tells each query where to attend, so a track query only needs to interact with a constant number of keys around its predicted location, which saves computation. The motion-guided attention is fully integrated into the transformer, enabling an end-to-end architecture. Experiments on MOT17 show state-of-the-art performance.
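As an illustration of the idea, the sketch below implements a single-head, single-level variant of such attention in PyTorch: each track query predicts a few sampling offsets around its motion prior (the box center carried over from the previous frame) and attends only to features sampled at those locations. The module name, tensor shapes, and offset scale are illustrative assumptions, not the actual MOMOT implementation, which builds on multi-scale deformable attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionGuidedAttention(nn.Module):
    """Minimal sketch: each track query predicts K sampling offsets around its
    motion prior and attends only to the K sampled locations (assumed design)."""

    def __init__(self, d_model=256, n_points=4):
        super().__init__()
        self.n_points = n_points
        # Track queries themselves act as offset predictors: no extra motion embedding.
        self.offset_head = nn.Linear(d_model, n_points * 2)
        self.weight_head = nn.Linear(d_model, n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, track_query, motion_prior, feat_map):
        """
        track_query:  (B, N, C)  track query embeddings
        motion_prior: (B, N, 2)  normalized (x, y) centers from the previous frame, in [0, 1]
        feat_map:     (B, C, H, W) current-frame feature map
        """
        B, N, C = track_query.shape
        # Temporal offsets are predicted directly from the track queries.
        offsets = self.offset_head(track_query).view(B, N, self.n_points, 2).tanh() * 0.1
        weights = self.weight_head(track_query).softmax(-1)               # (B, N, K)
        # Sampling locations = motion prior + predicted offsets, kept in [0, 1].
        locs = (motion_prior.unsqueeze(2) + offsets).clamp(0, 1)          # (B, N, K, 2)
        grid = locs * 2 - 1                                               # [-1, 1] for grid_sample
        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))      # (B, HW, C)
        value = value.transpose(1, 2).view(B, C, *feat_map.shape[-2:])    # (B, C, H, W)
        sampled = F.grid_sample(value, grid, align_corners=False)         # (B, C, N, K)
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)    # (B, N, C)
        return self.out_proj(out)
```

In the full model these offsets and weights would be predicted per attention head and per feature level, as in multi-scale deformable attention; the sketch only shows why a constant number of sampled keys per query suffices.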
| Method | Dataset | Train Data | IDF1 | MT | ML | MOTA | IDF1 | IDS | URL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MOMOT | MOT17 | MOT17+CrowdHuman | 65.7 | 40.3 | 19.9 | 72.8 | 65.7 | 2586 | model |
Note:
- MOMOT on MOT17 is trained on 4 NVIDIA A100 GPUs.
- Training on MOT17 takes about 1 day on A100.
- Inference runs at about 7.2 FPS at a resolution of 1536x800.
- All MOMOT models use a ResNet-50 backbone with weights pre-trained on the COCO dataset.
The codebase is built on top of Deformable DETR.
- Linux, CUDA>=11.1, GCC>=5.4
- PyTorch>=1.10.1, torchvision>=0.11.2 (following instructions here)
- Other requirements:

  ```bash
  pip install -r requirements.txt
  ```

- Build MultiScaleDeformableAttention:

  ```bash
  cd ./models/ops
  sh ./make.sh
  ```
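As a quick sanity check that the CUDA extension compiled correctly, you can run a small forward pass through the op. The snippet below assumes the `models/ops` layout of Deformable DETR (`MSDeformAttn` exposed from `models.ops.modules`); adjust the import if MOMOT organizes it differently.

```python
import torch
from models.ops.modules import MSDeformAttn  # assumes the Deformable DETR op layout

# Two feature levels of sizes 64x64 and 32x32, 100 queries, batch size 2.
m = MSDeformAttn(d_model=256, n_levels=2, n_heads=8, n_points=4).cuda()
B, Lq = 2, 100
shapes = torch.as_tensor([[64, 64], [32, 32]], device="cuda")
start_index = torch.cat((shapes.new_zeros(1), shapes.prod(1).cumsum(0)[:-1]))
Lin = int(shapes.prod(1).sum())

query = torch.rand(B, Lq, 256, device="cuda")
ref = torch.rand(B, Lq, 2, 2, device="cuda")      # normalized reference points per level
feats = torch.rand(B, Lin, 256, device="cuda")    # flattened multi-level features

out = m(query, ref, feats, shapes, start_index)
print(out.shape)  # expected: torch.Size([2, 100, 256])
```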
- Please download the MOT17 and CrowdHuman datasets and organize them like FairMOT, as follows:
```
.
├── crowdhuman
│   ├── images
│   └── labels_with_ids
├── MOT15
│   ├── images
│   ├── labels_with_ids
│   ├── test
│   └── train
├── MOT17
│   ├── images
│   ├── labels_with_ids
├── DanceTrack
│   ├── train
│   ├── test
├── bdd100k
│   ├── images
│   │   ├── track
│   │   │   ├── train
│   │   │   ├── val
│   ├── labels
│   │   ├── track
│   │   │   ├── train
│   │   │   ├── val
```
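If you want to verify the layout before training, a small helper like the one below can catch missing folders early. It is a hypothetical snippet, not part of the repo; the `EXPECTED` paths cover only the datasets used for the MOT17+CrowdHuman setting and can be extended as needed.

```python
from pathlib import Path

# Hypothetical helper: checks that the dataset folders from the layout above
# exist under a given data root before launching training.
EXPECTED = [
    "crowdhuman/images", "crowdhuman/labels_with_ids",
    "MOT17/images", "MOT17/labels_with_ids",
]

def check_layout(root: str) -> None:
    root_path = Path(root)
    missing = [p for p in EXPECTED if not (root_path / p).is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing dataset folders: {missing}")
    print("Dataset layout looks OK.")

if __name__ == "__main__":
    check_layout(".")  # run from the directory that contains crowdhuman/, MOT17/, ...
```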
You can download the COCO-pretrained weights from Deformable DETR. Then train MOMOT with the commands below (a small checkpoint sanity check is sketched after them):
```bash
sh config/momot_train.sh
```

For evaluation on MOT17:

```bash
sh config/momot_eval.sh
```

To generate the submission file for the MOT17 test set:

```bash
sh config/momot_submit.sh
```
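Before launching training, you may want to confirm that the downloaded COCO-pretrained checkpoint can be read. The snippet below is a hypothetical check; the filename is an example from the Deformable DETR release, so use whichever checkpoint you actually downloaded.

```python
import torch

# Hypothetical check, not part of the repo: inspect the COCO-pretrained
# Deformable DETR checkpoint before passing it to the training script.
ckpt = torch.load("r50_deformable_detr-checkpoint.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # Deformable DETR stores weights under "model"
print(f"{len(state_dict)} parameter tensors loaded, e.g.:")
for name in list(state_dict)[:5]:
    print(" ", name, tuple(state_dict[name].shape))
```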