
3D Occupancy Challenge


Introduction

Understanding the 3D surroundings, including background stuff and foreground objects, is important for autonomous driving. In the traditional 3D object detection task, a foreground object is represented by a 3D bounding box. However, an object's geometry can be complex and is not captured well by a simple box, and the background stuff is not perceived at all. The goal of this task is to predict the 3D occupancy of the scene. The task provides a large-scale occupancy benchmark based on the nuScenes dataset: the 3D space is voxelized, and the occupancy state and semantics of each voxel are estimated jointly. The difficulty lies in producing dense 3D predictions from surround-view images alone.

Description of methods implemented

BEVFormer

BEVFormer is a camera-only, transformer-based model for multi-task learning in autonomous driving, such as 3D object detection and occupancy prediction, using multi-view images. It has 6 transformer blocks, as shown in the figure, and operates on a discretized 2D bird's-eye-view (BEV) grid. Its main innovations are the following:

  1. Learnable queries for each grid cell in BEV space
  2. Temporal self-attention with the BEV features from the previous timestamp
  3. Spatial cross-attention with multi-scale, multi-view image features

The transformer encoder takes as input the learnable queries for each BEV grid cell, the BEV features from the previous timestamp, and the multi-view image features at timestamp t. It outputs a feature for each BEV grid cell at timestamp t, which is then fed to a segmentation head that outputs class probabilities for each cell.
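
A minimal PyTorch sketch of one such encoder block is shown below. It follows the query flow described above, but the module names are illustrative, and plain multi-head attention stands in for the deformable attention used by the actual BEVFormer implementation.

```python
import torch.nn as nn

class BEVEncoderBlock(nn.Module):
    """Illustrative BEVFormer-style block: temporal self-attention against the
    previous BEV features, then spatial cross-attention into image features.
    Real BEVFormer uses deformable attention; plain MHA is used here for brevity."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, bev_queries, prev_bev, img_feats):
        # bev_queries: (B, H*W, C) learnable queries, one per BEV grid cell
        # prev_bev:    (B, H*W, C) BEV features from timestamp t-1
        # img_feats:   (B, L, C)   flattened multi-scale, multi-view image features
        q = self.norm1(bev_queries + self.temporal_attn(bev_queries, prev_bev, prev_bev)[0])
        q = self.norm2(q + self.spatial_attn(q, img_feats, img_feats)[0])
        return self.norm3(q + self.ffn(q))
```

Stacking six such blocks and attaching a segmentation head over the per-cell features reproduces the overall structure described above.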

Cross-entropy loss is used for training. ResNet-50 with an FPN is used as the image backbone.

SparseOcc

[Figures: Sparse Voxel Decoder; Full Architecture]

SparseOcc is a recent camera-only, transformer-based model designed specifically for 3D occupancy prediction. It achieves faster inference because it is a fully sparse occupancy network, built on the observation that roughly 90% of the voxels in a typical scene are free, so dense occupancy prediction is unnecessary. It consists of two main blocks:

  1. Sparse voxel decoder
  2. Mask transformer

The job of the sparse voxel decoder is to predict the locations of occupied voxels in the scene; in other words, it constructs a class-agnostic sparse occupancy space. It starts from a set of coarse voxel queries distributed uniformly in 3D space. In each layer, self-attention among the voxels and cross-attention with multi-scale, multi-view image features are performed. The voxels are then upsampled by 2x, the occupancy of each voxel is estimated, and free voxels are pruned; see the sketch after this paragraph. This is repeated for several layers, and at the end the voxel embeddings are stored and fed to the mask transformer. Binary cross-entropy loss is used for supervision.
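
The upsample-and-prune step can be sketched as follows. The keep ratio, shapes, and helper names are illustrative assumptions, not taken from the SparseOcc codebase.

```python
import torch

def upsample_2x(voxel_coords, voxel_feats):
    """Split each voxel into its 8 children at double resolution."""
    offsets = torch.stack(torch.meshgrid(
        torch.arange(2), torch.arange(2), torch.arange(2), indexing="ij"),
        dim=-1).reshape(8, 3)
    child_coords = (voxel_coords[:, None, :] * 2 + offsets).reshape(-1, 3)
    child_feats = voxel_feats.repeat_interleave(8, dim=0)
    return child_coords, child_feats

def prune_free(voxel_coords, voxel_feats, occ_logits, keep_ratio=0.5):
    """Keep the top-k voxels by predicted occupancy score; discard free space.
    keep_ratio is illustrative, not the schedule used in the paper."""
    k = max(1, int(keep_ratio * occ_logits.numel()))
    keep = occ_logits.topk(k).indices
    return voxel_coords[keep], voxel_feats[keep]

# Toy usage: 1,000 coarse voxels -> upsample to 8,000 children -> keep 4,000.
coords, feats = torch.randint(0, 100, (1000, 3)), torch.randn(1000, 256)
coords, feats = upsample_2x(coords, feats)
occ_logits = torch.randn(coords.shape[0])  # stand-in for the per-voxel occupancy head
coords, feats = prune_free(coords, feats, occ_logits)
```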

The mask transformer is inspired by Mask2Former, which predicts a binary mask for each class label. It starts with N semantic query context vectors, which pass through several transformer layers, each containing self-attention and spatial cross-attention with the multi-scale, multi-view image features. At the end of each layer, the voxel embeddings from the sparse voxel decoder and the context vectors are combined to predict binary masks, which are passed to the next layer. At the end of the model, class labels are predicted from the context vectors, and binary masks from the voxel embeddings and context vectors; both are supervised with ground-truth occupancy labels. Dice loss and binary cross-entropy loss are used for mask prediction, and cross-entropy loss for class-label prediction.
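
A minimal sketch of the mask-prediction step, assuming the Mask2Former-style head described above: each query context vector produces one class distribution and, via a dot product with the voxel embeddings, one binary mask over the sparse voxels. All names are illustrative.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Illustrative Mask2Former-style head over sparse voxels."""

    def __init__(self, dim=256, num_classes=17):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes + 1)  # +1 for "no object"
        self.mask_proj = nn.Linear(dim, dim)

    def forward(self, queries, voxel_embeds):
        # queries:      (N, C) semantic query context vectors
        # voxel_embeds: (V, C) embeddings from the sparse voxel decoder
        class_logits = self.cls_head(queries)                   # (N, num_classes + 1)
        mask_logits = self.mask_proj(queries) @ voxel_embeds.T  # (N, V) binary-mask logits
        return class_logits, mask_logits

# Toy usage: 100 queries against 5,000 sparse voxels.
queries, voxel_embeds = torch.randn(100, 256), torch.randn(5000, 256)
cls_logits, mask_logits = MaskHead()(queries, voxel_embeds)
```

Training would pair Dice and binary cross-entropy losses on `mask_logits` with cross-entropy on `class_logits`, as described above.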

Model Zoo

The following table summarizes the performance of the models on the nuScenes validation set for 3D occupancy prediction.

| Method | Backbone | Config | RayIoU | Weights | Memory | FPS on A100 |
| --- | --- | --- | --- | --- | --- | --- |
| BEVFormer | ResNet-50 | config | 0.285 | | 15.8 GB | ~3 |
| SparseBEV | ResNet-50 | config | 0.3312 | weights | 20.7 GB | ~17 |
| SparseBEV | FlashInternImage-T | In progress | | | | |

Testing

The installation instructions in the original repositories have dependency issues. Instead, pull the Docker image `saiprakashc/mmdet3d-image:v0.17.0-2` for testing BEVFormer and `saiprakashc/mmdet3d-image:v1.0.0rc6` for testing SparseOcc. Both images are on Docker Hub and work out of the box.

For setting up the data, see the OpenOcc Dataset section below.

Task Definition

Given images from multiple cameras, the goal is to predict the semantics and flow of each voxel in the scene.

Evaluation Metrics

The implementation is in `projects/mmdet3d_plugin/datasets/ray_metrics.py`.

Ray-based mIoU

We use the well-known mean intersection-over-union (mIoU) metric. However, the elements of the set are now query rays, not voxels.

Specifically, we emulate LiDAR by projecting query rays into the predicted 3D occupancy volume. For each query ray, we compute the distance it travels before it intersects any surface. We then retrieve the corresponding class label and flow prediction.

We apply the same procedure to the ground-truth occupancy to obtain the ground-truth depth, class label and flow.

A query ray is classified as a true positive (TP) if the class labels coincide and the L1 error between the ground-truth depth and the predicted depth is below a certain threshold (e.g. 2 m).

Let $C$ be the number of classes.

$$ \mathrm{mIoU}=\frac{1}{C}\sum_{c=1}^{C}\frac{TP_c}{TP_c+FP_c+FN_c}, $$

where $TP_c$, $FP_c$, and $FN_c$ correspond to the number of true positive, false positive, and false negative predictions for class $c$.

We finally average over distance thresholds of {1, 2, 4} meters and compute the mean across classes.

For more details about this metric, please refer to the technical report.
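
As a complement, here is a simplified NumPy sketch of the metric under the definitions above, taking flat per-ray arrays of predicted and ground-truth depths and labels. The argument names and the handling of empty classes are illustrative; the official implementation in `projects/mmdet3d_plugin/datasets/ray_metrics.py` remains the reference.

```python
import numpy as np

def ray_miou(pred_depth, pred_label, gt_depth, gt_label,
             num_classes=17, thresholds=(1.0, 2.0, 4.0)):
    """Simplified ray-based mIoU: a ray is a TP for class c when the labels
    match and the depth error is under the threshold; the result is averaged
    over the {1, 2, 4} m thresholds. Inputs are flat arrays, one per query ray."""
    scores = []
    for thr in thresholds:
        depth_ok = np.abs(pred_depth - gt_depth) < thr
        ious = []
        for c in range(num_classes):
            tp = np.sum((pred_label == c) & (gt_label == c) & depth_ok)
            fp = np.sum(pred_label == c) - tp  # predicted c but not a TP
            fn = np.sum((gt_label == c) & ~((pred_label == c) & depth_ok))
            if tp + fp + fn > 0:  # skip classes absent from both pred and GT
                ious.append(tp / (tp + fp + fn))
        scores.append(np.mean(ious))
    return float(np.mean(scores))
```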

AVE for Occupancy Flow

Here we measure velocity errors for the set of true positives (TP), using a matching threshold of 2 m.

The absolute velocity error (AVE) is defined for 8 classes ('car', 'truck', 'trailer', 'bus', 'construction_vehicle', 'bicycle', 'motorcycle', 'pedestrian') in m/s.

Occupancy Score

The final occupancy score is defined as a weighted sum of the mIoU and the velocity score, where velocity errors are converted to velocity scores as max(1 - mAVE, 0.0). That is,

OccScore = mIoU * 0.9 + max(1 - mAVE, 0.0) * 0.1
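
In code form, the scoring rule is simply:

```python
def occ_score(miou: float, mave: float) -> float:
    """Weighted sum of mIoU and the velocity score max(1 - mAVE, 0)."""
    return 0.9 * miou + 0.1 * max(1.0 - mave, 0.0)

# e.g. miou=0.33, mave=0.6 -> 0.9 * 0.33 + 0.1 * 0.4 = 0.337
```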

(back to top)

OpenOcc Dataset

Basic Information

  • The nuScenes OpenOcc dataset contains 17 classes. Voxel semantics for each sample frame are given under the `semantics` key in labels.npz, and occupancy flow under the `flow` key.
| Type | Info |
| --- | --- |
| train | 28,130 |
| val | 6,019 |
| test | 6,008 |
| cameras | 6 |
| voxel size | 0.4 m |
| range | [-40m, -40m, -1m, 40m, 40m, 5.4m] |
| volume size | [200, 200, 16] |
| #classes | 0 - 16 |
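
A minimal sketch of reading one frame's ground truth, assuming the keys listed above; the exact path layout under openocc_v2 is illustrative.

```python
import numpy as np

# Path is illustrative; each sample frame has its own labels.npz under openocc_v2.
labels = np.load("data/nuscenes/openocc_v2/scene-0001/sample_token/labels.npz")

semantics = labels["semantics"]  # (200, 200, 16) class index per voxel, 0-16
flow = labels["flow"]            # per-voxel occupancy flow vectors

# Each voxel is a 0.4 m cube; the grid spans x, y in [-40 m, 40 m] and z in [-1 m, 5.4 m].
print(semantics.shape, flow.shape)
```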

Download

  1. Download the nuScenes dataset and put it into data/nuscenes

  2. Download our openocc_v2.1.zip and infos.zip from OpenDataLab or Google Drive

  3. Unzip them in data/nuscenes

Hierarchy

The hierarchy of folder data/nuscenes is described below:

nuscenes
├── maps
├── nuscenes_infos_train_occ.pkl
├── nuscenes_infos_val_occ.pkl
├── nuscenes_infos_test_occ.pkl
├── openocc_v2
├── samples
├── v1.0-test
└── v1.0-trainval
  • openocc_v2 contains the occupancy ground truth.
  • nuscenes_infos_{train/val/test}_occ.pkl contain the metadata of the dataset.
  • The other folders are borrowed from the official nuScenes dataset.

Known Issues

  • nuScenes (issue #721) lacks translation along the z-axis, which makes it hard to recover accurate 6-DoF localization and leads to misaligned point clouds when accumulating them over whole scenes. Ground stratification occurs in several sequences.

(back to top)

ToDo

  • Train SparseBEV with FlashInternImage-T
  • Evaluate FB-Occ
  • Implement a flow predictor for the best model

References

@misc{sima2023_occnet,
    title={Scene as Occupancy},
    author={Chonghao Sima and Wenwen Tong and Tai Wang and Li Chen and Silei Wu and Hanming Deng and Yi Gu and Lewei Lu and Ping Luo and Dahua Lin and Hongyang Li},
    year={2023},
    eprint={2306.02851},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
@misc{liu2024fully,
    title={Fully Sparse 3D Occupancy Prediction},
    author={Haisong Liu and Yang Chen and Haiguang Wang and Zetong Yang and Tianyu Li and Jia Zeng and Li Chen and Hongyang Li and Limin Wang},
    year={2024},
    eprint={2312.17118},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
@article{li2022bevformer,
    title={BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers},
    author={Li, Zhiqi and Wang, Wenhai and Li, Hongyang and Xie, Enze and Sima, Chonghao and Lu, Tong and Qiao, Yu and Dai, Jifeng},
    journal={arXiv preprint arXiv:2203.17270},
    year={2022}
}

This dataset is under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Before using the dataset, you should register on the [nuScenes](https://www.nuscenes.org/nuscenes) website and agree to its terms of use.

(back to top)
