This project builds upon InternLM, an open-source training framework for Large Language Models. Please follow its installation instructions for setting up the environment and dependencies. For convenience, we provide the key setup commands below:
git clone git@github.com:InternLM/InternLM.git --recurse-submodules
conda create --name internlm-env python=3.10 -y
conda activate internlm-env
cd InternLM
pip install -r requirements/torch.txt
pip install -r requirements/runtime.txt
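As an optional sanity check (this generic command is not part of the official InternLM instructions), you can verify that PyTorch was installed with CUDA support:
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"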
FlashAttention is required for optimized training. Install it using the following steps:
cd ./third_party/flash-attention
python setup.py install
cd ./csrc
cd fused_dense_lib && pip install -v .
cd ../xentropy && pip install -v .
cd ../rotary && pip install -v .
cd ../layer_norm && pip install -v .
cd ../../../../
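To confirm the FlashAttention build, a quick optional import check (assuming the package exposes the usual flash_attn module name) is:
python -c "import flash_attn; print(flash_attn.__version__)"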
In addition to the InternLM setup, this project has extra dependencies. Some of them are only required for the evaluation experiments (e.g., in the TriFinger simulator) and are not needed if you only plan to train the model. You can check and select the packages you need from our full environment list in requirements/full_env.txt and install them with:
pip install -r requirements/full_env.txt
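If you only plan to train, one option is to filter the list before installing; the pattern below is only an illustration (it assumes the simulator-related packages contain "trifinger" in their names), so adjust it to your needs:
# keep everything except simulator-related packages (illustrative filter)
grep -iv "trifinger" requirements/full_env.txt > requirements/train_only.txt
pip install -r requirements/train_only.txt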
We use two egocentric datasets, WalkingTours and Ego-Exo4D, to train EgoAgent.
For the WalkingTours dataset, please follow the instructions from DoRA (ICLR 2024) and visit the WalkingTours Dataset page to download the videos and blur the faces in them. Please put the downloaded mp4 videos in data/WTour.
For the Ego-Exo4D dataset, please follow the instructions from Ego-Exo4D Data and the Ego-Exo4D Document to sign the Ego-Exo4D license and download the dataset (remember to complete the AWS configuration).
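For reference, a minimal sketch of preparing the downloader, assuming the egoexo CLI ships with the ego4d pip package and that you already hold Ego-Exo4D access credentials:
pip install awscli ego4d    # provides the aws and egoexo command-line tools
aws configure               # enter the access key and secret issued with your Ego-Exo4D license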
egoexo -o data/egoexo4d --parts annotations metadata
egoexo -o data/egoexo4d/egoexo4d_v1 --parts take_vrs --release v1
egoexo -o data/egoexo4d/egoexo4d_v2 --parts take_vrs --release v2
Note: (1) The "annotations" part contains 3D motion labels estimated from exocentric views; the "takes" part contains frame-aligned mp4 videos of the egocentric and exocentric views, of which we only need the egocentric ones; the "take_vrs" part is also needed because it contains the camera parameters used for video undistortion and motion translation. (2) Ego-Exo4D V2 includes more video hours and exocentric labels than V1; please see the Ego-Exo4D Change Log for more details.
Install ffmpeg and run the following example command to extract video frames and save them as images. You can modify the video path to process different videos.
cd data/WTour
mkdir Walking_Tour_Wildlife
ffmpeg -i "Walking_Tour_Wildlife.mp4" "Walking_Tour_Wildlife/frame_%07d.jpg"
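To process every downloaded WalkingTours video at once, a simple loop like the following can be used (a sketch that assumes all mp4 files sit directly under data/WTour, where the previous command already placed you):
for f in *.mp4; do                        # e.g., Walking_Tour_Wildlife.mp4
  mkdir -p "${f%.mp4}"                    # one folder per video
  ffmpeg -i "$f" "${f%.mp4}/frame_%07d.jpg"
done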
As Ego-Exo4D employs fisheye cameras, we undistort the images to a pinhole camera model using the official Project Aria Tools to align them with the WalkingTours videos. You may run the following script to perform the undistortion:
python dataset_preprocess/undistort_egovideos.py
We translate the original body poses in Ego-Exo4D using the official Ego-Exo4D Body Pose dataset. You may run the following script to process the body pose data:
python dataset_preprocess/preprocess_egoexo_body_pose.py
Run the following script to pretrain the egocentric visual representation. The example below uses the 300M model configuration, but a 1B configuration is also available.
cd InternLM
${sbatch/srun command} scripts_train/source_slurm_train_stage1_contrastive_w_tokenizer.sh ${number_of_nodes} configs/dino_pretrain/lr5e-4_tmp004_300m.py ${number_of_workers}
After pretraining, use the following command to perform joint training. The example below uses the 300M model configuration:
cd InternLM
${sbatch/srun command} scripts_train/slurm_train_stage2_contrastive_joint.sh configs/joint_train/lr6e-04_tmp004_300m.py ${number_of_gpus} ${seed}
You need to supply your own sbatch/srun command and adjust the related parameters in the configuration file (configs/.../*.py) to match your cluster and training plan. The pretrained checkpoint to load is also specified in the configuration file.
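As an illustration only, a Slurm submission of the joint-training script might look like the command below; the partition name, node and GPU counts, and seed are placeholders, and the exact flags depend on your cluster:
sbatch -p your_partition -N 2 --gres=gpu:8 \
  scripts_train/slurm_train_stage2_contrastive_joint.sh configs/joint_train/lr6e-04_tmp004_300m.py 16 42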
EgoAgent is evaluated on three tasks: image classification for visual representation learning, future world-state prediction for world modeling, and 3D human motion prediction for action prediction.
Image classification serves as a standard benchmark for visual representation learning. To evaluate EgoAgent's visual representation ability on ImageNet kNN classification, run:
cd InternLM/tools
python convert2hf_dino_conv_stem_cls.py --src_folder ../save_ckpt/${model_name} --remove_vq_embed
cd ..
torchrun --nproc_per_node=4 --nnodes=1 --node_rank=0 eval_knn_conv_stem_cls.py --pretrained_weights ../hf_models/${model_name}/ --data_path ${imagenet_path} --imgnet 1000
To evaluate future world state prediction, use:
cd InternLM
${sbatch/srun command} python -u eval_feature_retrieval.py --config ${CONFIG} --launcher ${LAUNCHER} --port ${PORT} --eval_iters ${eval_iters}
To evaluate 3D human motion prediction, run:
cd InternLM
${sbatch/srun command} python -u eval_3dposes.py --config ${CONFIG} --launcher ${LAUNCHER} --port ${PORT} --eval_iters ${eval_iters}
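For example, with the placeholders filled in on a Slurm cluster (the partition, config path, port, and iteration count below are illustrative, and InternLM-style launchers typically accept slurm or torch):
srun -p your_partition -N 1 --gres=gpu:8 --ntasks-per-node=8 \
  python -u eval_3dposes.py --config configs/joint_train/lr6e-04_tmp004_300m.py --launcher slurm --port 29500 --eval_iters 100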
If you find this project useful for your research, please cite it using the following BibTeX entry:
@article{chen2025egoagent,
  title={EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds},
  author={Lu Chen and Yizhou Wang and Shixiang Tang and Qianhong Ma and Tong He and Wanli Ouyang and Xiaowei Zhou and Hujun Bao and Sida Peng},
  year={2025},
  eprint={2502.05857},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2502.05857},
}
We sincerely thank the authors of InternLM, DINO, Project Aria Tools, and Ego-Exo4D for their great work, without which this project would not have been possible.