This project builds upon InternLM, an open-source training framework for Large Language Models. Please follow its installation instructions for setting up the environment and dependencies. For convenience, we provide the key setup commands below:
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd InternLM
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
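As an optional sanity check (this generic command is not part of the official InternLM instructions), you can verify that PyTorch was installed with CUDA support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"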
FlashAttention is required for optimized training. Install it using the following steps:
cd ./third_party/flash-attention
python setup.py install
cd ./csrc
cd fused_dense_lib && pip install -v .
cd ../xentropy && pip install -v .
cd ../rotary && pip install -v .
cd ../layer_norm && pip install -v .
cd ../../../../
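To confirm the FlashAttention build, a quick optional import check (assuming the package exposes the usual flash_attn module name) is:
python -c "import flash_attn; print(flash_attn.__version__)"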
In addition to the InternLM setup, this project has extra dependencies. Some of them are only required for the evaluation experiments (e.g., in the TriFinger simulator) and are not needed if you only plan to train the model. You can check and select the packages you need from our full environment list in requirements/full_env.txt and install them with:
pip install -r requirements/full_env.txt
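If you only plan to train, one option is to filter the list before installing; the pattern below is only an illustration (it assumes the simulator-related packages contain "trifinger" in their names), so adjust it to your needs:
# keep everything except simulator-related packages (illustrative filter)
grep -iv "trifinger" requirements/full_env.txt > requirements/train_only.txt
pip install -r requirements/train_only.txt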
We use two egocentric datasets, WalkingTours and Ego-Exo4D, to train EgoAgent.
For the WalkingTours dataset, please follow the instructions from DoRA (ICLR 2024) and visit the WalkingTours Dataset page to download the videos and blur the faces in them. Please put the downloaded mp4 videos in data/WTour.
For the Ego-Exo4D dataset, please follow the instructions from Ego-Exo4D Data and the Ego-Exo4D Document to sign the Ego-Exo4D license and download the dataset (remember to complete the AWS configuration).
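For reference, a minimal sketch of preparing the downloader, assuming the egoexo CLI ships with the ego4d pip package and that you already hold Ego-Exo4D access credentials:
pip install awscli ego4d    # provides the aws and egoexo command-line tools
aws configure               # enter the access key and secret issued with your Ego-Exo4D license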
egoexo -o data/egoexo4d --parts annotations metadata
egoexo -o data/egoexo4d/egoexo4d_v1 --parts take_vrs --release v1
egoexo -o data/egoexo4d/egoexo4d_v2 --parts take_vrs --release v2
Note: (1) The "annotations" part contains 3D motion labels estimated from exocentric views; the "takes" part contains frame-aligned mp4 videos of the egocentric and exocentric views, of which we only need the egocentric ones; the "take_vrs" part is also needed because it contains the camera parameters used for video undistortion and motion translation. (2) Ego-Exo4D V2 includes more video hours and exocentric labels than V1; please see the Ego-Exo4D Change Log for more details.
Install ffmpeg and run the following example command to extract video frames and save them as images. You can modify the video path to process different videos.
cd data/WTour
mkdir Walking_Tour_Wildlife
ffmpeg -i "Walking_Tour_Wildlife.mp4" "Walking_Tour_Wildlife/frame_%07d.jpg"
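To process every downloaded WalkingTours video at once, a simple loop like the following can be used (a sketch that assumes all mp4 files sit directly under data/WTour, where the previous command already placed you):
for f in *.mp4; do                        # e.g., Walking_Tour_Wildlife.mp4
  mkdir -p "${f%.mp4}"                    # one folder per video
  ffmpeg -i "$f" "${f%.mp4}/frame_%07d.jpg"
done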
As Ego-Exo4D employs fisheye cameras, we undistort the images to a pinhole camera model using the official Project Aria Tools to align them with the WalkingTours videos. You may run the following script to perform the undistortion:
python dataset_preprocess/undistort_egovideos.py
We translate the original body poses in Ego-Exo4D using the official Ego-Exo4D Body Pose dataset. You may run the following script to process the body pose data:
python dataset_preprocess/preprocess_egoexo_body_pose.py
Run the following script to pretrain the egocentric visual representation. The example below uses the 300M model configuration, but a 1B configuration is also available.
cd InternLM
${sbatch/srun command} scripts_train/source_slurm_train_stage1_contrastive_w_tokenizer.sh ${number_of_nodes} configs/dino_pretrain/lr5e-4_tmp004_300m.py ${number_of_workers}
After pretraining, use the following command to perform joint training. The example below uses the 300M model configuration:
cd InternLM
${sbatch/srun command} scripts_train/slurm_train_stage2_contrastive_joint.sh configs/joint_train/lr6e-04_tmp004_300m.py ${number_of_gpus} ${seed}
You need to supply your own sbatch/srun command and adjust the related parameters in the configuration file (configs/.../*.py) to match your cluster and training plan. The pretrained checkpoint to load is also specified in the configuration file.
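As an illustration only, a Slurm submission of the joint-training script might look like the command below; the partition name, node and GPU counts, and seed are placeholders, and the exact flags depend on your cluster:
sbatch -p your_partition -N 2 --gres=gpu:8 \
  scripts_train/slurm_train_stage2_contrastive_joint.sh configs/joint_train/lr6e-04_tmp004_300m.py 16 42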
EgoAgent is evaluated on three tasks: image classification for visual representation learning, future world-state prediction for world modeling, and 3D human motion prediction for action prediction.
Image classification serves as a standard benchmark for visual representation learning. To evaluate EgoAgent's visual representation ability on ImageNet kNN classification, run:
cd InternLM/tools
python convert2hf_dino_conv_stem_cls.py --src_folder ../save_ckpt/${model_name} --remove_vq_embed
cd ..
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 eval_knn_conv_stem_cls.py --pretrained_weights ../hf_models/${model_name}/ --data_path ${imagenet_path} --imgnet 1000
To evaluate future world state prediction, use:
cd InternLM
${sbatch/srun command} python -u eval_feature_retrieval.py --config ${CONFIG} --launcher ${LAUNCHER} --port ${PORT} --eval_iters ${eval_iters}
To evaluate 3D human motion prediction, run:
cd InternLM
${sbatch/srun command} python -u eval_3dposes.py --config ${CONFIG} --launcher ${LAUNCHER} --port ${PORT} --eval_iters ${eval_iters}
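For example, with the placeholders filled in on a Slurm cluster (the partition, config path, port, and iteration count below are illustrative, and InternLM-style launchers typically accept slurm or torch):
srun -p your_partition -N 1 --gres=gpu:8 --ntasks-per-node=8 \
  python -u eval_3dposes.py --config configs/joint_train/lr6e-04_tmp004_300m.py --launcher slurm --port 29500 --eval_iters 100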
If you find this project useful for your research, please cite it using the following BibTeX entry:
@article{chen2025egoagent,
  title={EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds},
  author={Lu Chen and Yizhou Wang and Shixiang Tang and Qianhong Ma and Tong He and Wanli Ouyang and Xiaowei Zhou and Hujun Bao and Sida Peng},
  year={2025},
  eprint={2502.05857},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2502.05857},
}
We sincerely thank the authors of InternLM, DINO, Project Aria Tools, and Ego-Exo4D for their great work, without which this project would not have been possible.