
Reproducing Vila-U Training #6

Open
Pulyong opened this issue Nov 14, 2024 · 0 comments

Pulyong commented Nov 14, 2024

Thanks for the great work!

I am currently attempting to reproduce the Vila-U model. As I understand it, Vision Tower training (image and video quantization) should be conducted first, followed by LLM training.
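
To make my understanding concrete, here is a toy sketch of the stage ordering I have in mind. All module and variable names are placeholders, not the actual Vila-U code; it uses plain VQ instead of the paper's residual quantization, dummy data, and omits the contrastive text-alignment term the paper describes:

```python
# Toy sketch only: placeholder modules, plain VQ, dummy data.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVisionTower(nn.Module):
    """Stand-in for the quantizing vision tower: encode images to discrete
    tokens, decode them back for a reconstruction loss."""
    def __init__(self, dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.codebook = nn.Embedding(codebook_size, dim)
        self.decoder = nn.ConvTranspose2d(dim, 3, kernel_size=16, stride=16)

    def forward(self, images):
        z = self.encoder(images)                      # (B, D, H, W)
        b, d, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, d)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(b, h, w, d).permute(0, 3, 1, 2)
        q_st = z + (q - z).detach()                   # straight-through estimator
        recon = self.decoder(q_st)
        return recon, q, z, idx.view(b, h * w)

# --- Stage 1: train the vision tower (reconstruction + VQ losses) ---
tower = ToyVisionTower()
opt = torch.optim.AdamW(tower.parameters(), lr=1e-4)
images = torch.randn(4, 3, 64, 64)                    # dummy image batch
recon, q, z, _ = tower(images)
loss = (F.mse_loss(recon, images)                     # reconstruction
        + F.mse_loss(q, z.detach())                   # codebook update
        + 0.25 * F.mse_loss(z, q.detach()))           # commitment
loss.backward()
opt.step()

# --- Stage 2: freeze the tower, train the LLM on its discrete tokens ---
tower.requires_grad_(False)
llm = nn.Sequential(nn.Embedding(256, 64), nn.Linear(64, 256))  # toy LM head
opt2 = torch.optim.AdamW(llm.parameters(), lr=1e-4)
with torch.no_grad():
    _, _, _, vis_tokens = tower(images)               # (B, T) token ids
logits = llm(vis_tokens[:, :-1])                      # next-token prediction
nll = F.cross_entropy(logits.reshape(-1, 256), vis_tokens[:, 1:].reshape(-1))
nll.backward()
opt2.step()
```

If this ordering or the loss composition is wrong, corrections would be appreciated.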

From what I can find, code is available for LLM pretraining and supervised fine-tuning (SFT), but there seems to be none for Vision Tower training. (If I missed something, please let me know.)

Therefore, could you provide code or a recipe for Vision Tower training?

Additionally, as mentioned in the paper, COYO-700M, ShareGPT4V, MMC4, an internal dataset, and OpenVid were used for training. I am curious how sampling was conducted across these datasets, and whether the internal dataset is shareable.

I would also greatly appreciate any other details about the training process (epochs per stage, GPU type, number of GPUs, GPU hours, etc.).
