Co-Training with VLM Data

This guide outlines the process for integrating VLM (Vision-Language Model) data to co-train the StarVLA (Vision-Language-Action) framework.

Why co-train? Training a VLA purely on robot manipulation data can degrade the VLM backbone’s vision and language understanding — this is known as “catastrophic forgetting”: after being trained only on robot data, the model may forget how to interpret images, answer questions, or understand complex instructions. Mixing in VLM data (image QA, captioning, etc.) preserves the model’s general understanding while learning robot control.
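Conceptually, co-training just interleaves samples from the two sources during training. A minimal sketch of ratio-based mixing, assuming a fixed sampling probability for VLM data (the `mixed_batches` helper and the ratio are illustrative, not StarVLA's actual mixing logic):

```python
import random

def mixed_batches(vla_samples, vlm_samples, vlm_ratio=0.25, seed=0):
    """Yield (source, sample) pairs, drawing a VLM sample with
    probability `vlm_ratio` and a VLA sample otherwise.
    Stops when either source is exhausted."""
    rng = random.Random(seed)
    vla_it, vlm_it = iter(vla_samples), iter(vlm_samples)
    while True:
        try:
            if rng.random() < vlm_ratio:
                yield ("vlm", next(vlm_it))
            else:
                yield ("vla", next(vla_it))
        except StopIteration:
            return

# Roughly one VLM sample for every three VLA samples.
stream = list(mixed_batches(range(100), range(100), vlm_ratio=0.25))
```

In practice the ratio is a tuning knob: too little VLM data and forgetting creeps back in; too much and robot-control accuracy suffers.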


The VLM data must adhere to the QwenVL Conversations JSON Data Structure.

Each data instance is a JSON object that links an image file path to a list of human-GPT conversational turns.

{
  "image": "path/to/images/001.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nWhat's the main object in this picture?"
      // <image> is a placeholder telling the model where to insert the image;
      // the actual image path comes from the outer "image" field
    },
    {
      "from": "gpt",
      "value": "A red apple on a wooden table"
    }
  ]
}
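Before training on a new JSON file, it can help to sanity-check that each instance matches this structure. A small validator sketch (illustrative, not part of StarVLA):

```python
import json

def validate_instance(obj):
    """Check one conversation instance against the expected structure."""
    assert isinstance(obj.get("image"), str), "missing image path"
    convs = obj["conversations"]
    assert convs and convs[0]["from"] == "human", "first turn must be human"
    for turn in convs:
        assert turn["from"] in ("human", "gpt"), "unknown speaker"
        assert isinstance(turn["value"], str), "turn value must be a string"
    # the <image> placeholder should appear in at least one human turn
    assert any("<image>" in t["value"] for t in convs if t["from"] == "human")

sample = json.loads('''{
  "image": "path/to/images/001.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\\nWhat's the main object in this picture?"},
    {"from": "gpt", "value": "A red apple on a wooden table"}
  ]
}''')
validate_instance(sample)  # raises AssertionError on a malformed instance
```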

You can download our example dataset, LLaVA-OneVision-COCO.

Unzip sharegpt4v_coco.zip and place it in playground/Datasets/LLaVA-OneVision-COCO.

The resulting file structure will look like this:

.../LLaVA-OneVision-COCO
├── images
└── sharegpt4v_coco
└── llava_jsons
└── sharegpt4v_coco.json

To add a custom VLM dataset, follow these steps:

Register your dataset by adding it to the data_dict in starVLA/dataloader/qwenvl_llavajson/qwen_data_config.py:

# Example Registration
# json_root and image_root are defined at the top of this file,
# defaulting to subdirectories under playground/Datasets/LLaVA-OneVision-COCO/:
# json_root = "playground/Datasets/LLaVA-OneVision-COCO/llava_jsons"
# image_root = "playground/Datasets/LLaVA-OneVision-COCO/images"
SHAREGPT4V_COCO = {
    "annotation_path": f"{json_root}/sharegpt4v_coco.json",
    "data_path": f"{image_root}/",
}
data_dict = {
    "sharegpt4v_coco": SHAREGPT4V_COCO,  # Use this name in the YAML config
}
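To see how a registration like this is consumed, here is a self-contained sketch of the name-to-paths lookup a dataloader might perform (the `resolve` helper is hypothetical; the registration lines mirror the snippet above, and the `json_root`/`image_root` values are assumptions):

```python
# Assumed root paths, matching the example layout above.
json_root = "playground/Datasets/LLaVA-OneVision-COCO/llava_jsons"
image_root = "playground/Datasets/LLaVA-OneVision-COCO/images"

SHAREGPT4V_COCO = {
    "annotation_path": f"{json_root}/sharegpt4v_coco.json",
    "data_path": f"{image_root}/",
}
data_dict = {"sharegpt4v_coco": SHAREGPT4V_COCO}

def resolve(dataset_use):
    """Map a registered dataset name to (annotation_path, image_root)."""
    cfg = data_dict[dataset_use]  # KeyError here means the name was never registered
    return cfg["annotation_path"], cfg["data_path"]

ann_path, img_root = resolve("sharegpt4v_coco")
```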

Include the VLM dataset configuration in your training YAML file (your_train_config.yaml):

datasets:
  vlm_data:
    dataset_py: vlm_datasets
    dataformat: llava_json
    dataset_use: sharegpt4v_coco # Must match the name registered in qwen_data_config.py
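The YAML fragment above parses into a nested mapping; a plain-Python equivalent, with a consistency check between `dataset_use` and the registered names (the check itself is illustrative):

```python
# Plain-Python equivalent of the YAML fragment above.
train_config = {
    "datasets": {
        "vlm_data": {
            "dataset_py": "vlm_datasets",      # which dataloader module to use
            "dataformat": "llava_json",        # conversation-JSON format
            "dataset_use": "sharegpt4v_coco",  # name registered in qwen_data_config.py
        }
    }
}

registered = {"sharegpt4v_coco"}  # keys of data_dict from the previous step
name = train_config["datasets"]["vlm_data"]["dataset_use"]
if name not in registered:
    raise KeyError(f"dataset_use '{name}' is not registered in data_dict")
```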

Tip: You can verify the VLM dataloader by running:

python starVLA/dataloader/vlm_datasets.py --config_yaml your_train_config.yaml

Choose the appropriate script based on whether you want to train only on VLM data or co-train with VLA data.

VLM-Only Training

Use this for VLM-specific pre-training or fine-tuning.

Script: starVLA/training/train_starvla_vlm.py

bash examples/CoTrainVLM/train_files/run_train_starvlm.sh

VLA + VLM Co-Training

This trains the model on robotics (VLA) and multi-modal (VLM) data simultaneously.

Script: starVLA/training/train_starvla_cotrain.py

bash examples/CoTrainVLM/train_files/run_libero_cotrain.sh
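At a high level, one co-training step combines the loss from a VLA batch with the loss from a VLM batch before a single optimizer update. A minimal numeric sketch, assuming a simple weighted sum (the `cotrain_loss` name, the weighting scheme, and the default weight are assumptions, not StarVLA's actual implementation):

```python
def cotrain_loss(vla_loss: float, vlm_loss: float, vlm_weight: float = 0.5) -> float:
    """Weighted sum of the two objectives; vlm_weight trades robot-control
    accuracy against preserving general vision-language ability."""
    return vla_loss + vlm_weight * vlm_loss

# One hypothetical step: per-batch losses in, one combined scalar out
# for the shared backward pass.
loss = cotrain_loss(vla_loss=1.2, vlm_loss=0.8, vlm_weight=0.5)
```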