Visuomotor intelligent agents use visual signals as input to directly predict decisions and actions. Because they require a large corpus of labeled data or environment interactions to reach satisfactory performance, supervised pre-training is often used to jointly train the perception and control modules in an end-to-end fashion and then transfer them to downstream tasks such as visual navigation and trajectory prediction.
Taking "intelligent-agent self-driving in extraterrestrial planetary environments" as a case study, the supervised pre-training paradigm suffers from a lack of labeled data, high annotation cost, and inefficient transfer. Dominant self-supervised approaches in computer vision are not directly applicable, because vision-driven driving tasks lack translation and view invariance and the visual input contains much information that is irrelevant to driving. The research goal is therefore to design a self-supervised pre-training method suited to self-driving in open extraterrestrial planetary environments.
Inspired by multimodal learning, we introduce temporal signals such as IMU and odometry to help train the visual encoder. The visual modality captures the objective conditions for driving decisions, while the temporal-signal modality reflects the driving state and the quality of those decisions. The two are synergistic and complementary: the strong correlation between modalities makes it theoretically possible to predict the semantic information of one modality from the other, while their inherent differences make cross-modal prediction a more challenging and valuable pretext task than within-modality learning.
We propose a cross-modal prediction pre-training method: features of both modalities are extracted by a visual encoder and a temporal-signal encoder, pseudo-labels for the other modality are constructed by clustering these features with the scalable K-Means algorithm, and the model is optimized by alternating the clustering and classification steps, as sketched below.
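A minimal sketch of one clustering-then-classification round, assuming in-order (non-shuffled) dataloaders over temporally aligned image/temporal-signal pairs; the encoder, head, and loader names are illustrative, not the repository's exact API:

```python
import torch
import torch.nn as nn
from sklearn.cluster import MiniBatchKMeans  # the scalable K-Means mentioned above

K = 16  # number of clusters, matching the --k flag used below


@torch.no_grad()
def extract_features(encoder, loader, device="cuda"):
    """Encode the whole dataset (in loader order) into one feature matrix."""
    encoder.eval()
    feats = [encoder(x.to(device)).cpu() for x in loader]  # loaders yield plain tensors
    return torch.cat(feats).numpy()


def pretrain_round(vision_enc, sensor_enc, vision_head, sensor_head,
                   vision_loader, sensor_loader, optimizer, device="cuda"):
    # 1) Clustering: build K pseudo-classes per modality with scalable K-Means.
    vision_labels = MiniBatchKMeans(n_clusters=K).fit_predict(
        extract_features(vision_enc, vision_loader, device))
    sensor_labels = MiniBatchKMeans(n_clusters=K).fit_predict(
        extract_features(sensor_enc, sensor_loader, device))

    # 2) Classification: each branch is trained to predict the *other*
    #    modality's cluster assignments (cross-modal pseudo-labels).
    criterion = nn.CrossEntropyLoss()
    vision_enc.train(); sensor_enc.train()
    offset = 0
    for img, sig in zip(vision_loader, sensor_loader):
        n = img.size(0)
        y_from_sensor = torch.as_tensor(sensor_labels[offset:offset + n],
                                        dtype=torch.long, device=device)
        y_from_vision = torch.as_tensor(vision_labels[offset:offset + n],
                                        dtype=torch.long, device=device)
        offset += n

        loss = (criterion(vision_head(vision_enc(img.to(device))), y_from_sensor)
                + criterion(sensor_head(sensor_enc(sig.to(device))), y_from_vision))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Repeating `pretrain_round` re-clusters with the improved encoders, so the pseudo-labels and the two branches refine each other over successive rounds.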
The main dependencies are as follows:
- Python == 3.8.16
- pytorch >= 1.12.1
- torchvision >= 0.13.1
- scikit-learn == 1.2.2
- pillow == 9.5.0
- prefetch-generator == 1.0.3
- tensorboard == 2.12.1
- seaborn == 0.12.2
This implementation primarily targets multi-GPU DataParallel training, which is faster and simpler; single-GPU training also works but is not recommended.
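For reference, a minimal sketch of the multi-GPU wrapping implied above, using plain `nn.DataParallel` over the GPUs selected by `--gpunum` (variable names are illustrative):

```python
import torch
import torch.nn as nn
import torchvision.models as models

gpunum = 4                             # matches the --gpunum flag below
model = models.resnet50(weights=None)  # visual encoder trained from scratch
if torch.cuda.device_count() >= gpunum and gpunum > 1:
    # Replicate the model on each GPU; batches are scattered across devices.
    model = nn.DataParallel(model, device_ids=list(range(gpunum)))
model = model.cuda()
```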
To run self-supervised pre-training of a ResNet-50 visual encoder on a 4-GPU machine, run:
python main_pretrain.py \
--gpunum 4 \
--k 16 \
--model resnet50 \
--epoch 4 \
--subepoch 15 \
--lr 0.03 \
--batchsize 32
To test the pretrained models of both modalities, run:
python main_test.py \
--gpunum 4 \
--model resnet50 \
--dir1 ./v1_k16_epoch15_4_pre/vision_encoder_14 \
--dir2 ./v1_k16_epoch15_4_pre/sensor_encoder_14
CNNs carry inductive biases suited to the vision modality, but Transformers do not. We therefore initialize ViTs with ImageNet-pretrained weights so that they can recognize image texture features.
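One way to obtain such an ImageNet-initialized ViT is through the torchvision weights API (a sketch; the repository may instead load its own checkpoint):

```python
import torch
import torchvision.models as models

# Load a ViT-B/16 with ImageNet-1k supervised weights, then drop the
# classification head so only the pretrained encoder is kept.
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()
```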
[Figure: comparison of a scratch ResNet, a scratch ViT, and an ImageNet-pretrained ViT]
| Vision Backbone | 5-shot/% | 10-shot/% | Mean/% | Params/M | Download |
|---|---|---|---|---|---|
| ResNet-50 | 56.9 | 62.2 | 59.2 | 25.6 | model |
| ResNet-101 | 56.1 | 63.5 | 59.8 | 44.5 | model |
| ResNet-152 | 62.7 | 67.5 | 65.1 | 60.2 | model |
| ViT-B | 50.1 | 55.6 | 52.9 | 86.6 | model |
| Swin-T | 64.9 | 70.3 | 67.6 | 28.3 | model |
| Swin-S | 58.5 | 64.1 | 61.3 | 49.6 | model |
| Swin-B | 62.5 | 69.3 | 65.9 | 87.8 | model |