Use Your Own LeRobot Dataset
This guide walks you through the complete process of training StarVLA on your own robotics data, from data conversion to model training.
Overview
Section titled “Overview”The workflow consists of five main steps:
- Convert Data to LeRobot Format - Transform your raw data into the standardized LeRobot format
- Create Robot Type Config - Define how your robot’s data modalities are structured
- Create Data Mix - Register your dataset in the mixture registry
- Create Training Config - Configure the training parameters
- Run Training - Launch the training script
Step 1: Convert Data to LeRobot Format
Section titled “Step 1: Convert Data to LeRobot Format”StarVLA uses the LeRobot dataset format for VLA training. You need to convert your robotics data to this format first.
LeRobot Data Structure
Section titled “LeRobot Data Structure”A LeRobot dataset requires the following features:
observation.state: Robot state (joint positions, end-effector pose, etc.)action: Robot actions (joint commands, delta positions, etc.)observation.images.*: Camera images (stored as video)language_instructionortask: Task description text
Conversion Example
Section titled “Conversion Example”from lerobot.common.datasets.lerobot_dataset import LeRobotDatasetimport numpy as np
# Define your dataset featuresFEATURES = { "observation.state": { "dtype": "float32", "shape": (7,), # e.g., 6 joints + 1 gripper "names": ["state"], }, "action": { "dtype": "float32", "shape": (7,), "names": ["action"], }, "observation.images.image": { "dtype": "video", "shape": (480, 640, 3), # height, width, channels "names": ["height", "width", "channels"], }, "language_instruction": { "dtype": "string", "shape": (1,), "names": ["instruction"], },}
# Create datasetdataset = LeRobotDataset.create( repo_id="my_robot_dataset", fps=15, robot_type="my_robot", features=FEATURES,)
# Add frames from your data# Assume your raw data is organized by episode (one complete demonstration),# each containing multiple frames.# e.g.: episodes = [load_hdf5("demo_0.hdf5"), load_hdf5("demo_1.hdf5"), ...]for episode in your_episodes: for frame in episode: dataset.add_frame({ "observation.state": np.array(frame["state"], dtype=np.float32), "action": np.array(frame["action"], dtype=np.float32), "observation.images.image": frame["image"], "language_instruction": frame["instruction"], # `task` is a required field used internally by LeRobot for grouping # episodes by task; its content is usually the same as language_instruction "task": frame["instruction"], }) dataset.save_episode()
# Finalize the datasetdataset.finalize()Dataset Directory Structure
Section titled “Dataset Directory Structure”After conversion, your dataset should have this structure:
your_dataset_name/├── meta/│ ├── info.json│ ├── episodes.jsonl│ ├── stats.json│ └── tasks.json├── data/│ └── chunk-000/│ └── episode_000000.parquet└── videos/ └── chunk-000/ └── observation.images.image/ └── episode_000000.mp4Modality JSON File
Section titled “Modality JSON File”Create a modality.json file in your training directory to define the mapping between LeRobot keys and StarVLA keys. Think of it as a “translation table” — it translates the raw column names in your dataset into StarVLA’s unified internal names, so different datasets can be processed by the same code simply by providing their own modality.json:
{ "state": { "arm_joint": {"start": 0, "end": 6}, "gripper_joint": {"start": 6, "end": 7} }, "action": { "arm_joint": {"start": 0, "end": 6}, "gripper_joint": {"start": 6, "end": 7} }, "video": { "camera_1": {"original_key": "observation.images.camera_1"}, "camera_2": {"original_key": "observation.images.camera_2"} }, "annotation": { "human.action.task_description": {"original_key": "language_instruction"} }}StarVLA provides modality.json files for all built-in benchmarks. You can find them in each benchmark’s example directory (e.g., examples/LIBERO/train_files/modality.json, examples/SimplerEnv/train_files/modality.json).
Step 2: Create Robot Type Config
Section titled “Step 2: Create Robot Type Config”The Robot Type Config defines how StarVLA reads and processes your data. Create a new config class in starVLA/dataloader/gr00t_lerobot/data_config.py.
Config Structure
Section titled “Config Structure”class MyRobotDataConfig: # Define the keys for each modality video_keys = [ "video.camera_1", # Maps to observation.images.camera_1 "video.camera_2", # Maps to observation.images.camera_2 ] state_keys = [ "state.arm_joint", "state.gripper_joint", ] action_keys = [ "action.arm_joint", "action.gripper_joint", ] language_keys = ["annotation.human.action.task_description"]
# Index configuration observation_indices = [0] # Which timesteps to use for observation action_indices = list(range(8)) # Action horizon (predict 8 future steps)
def modality_config(self): """Define modality configurations for data loading.""" from starVLA.dataloader.gr00t_lerobot.datasets import ModalityConfig
return { "video": ModalityConfig( delta_indices=self.observation_indices, modality_keys=self.video_keys, ), "state": ModalityConfig( delta_indices=self.observation_indices, modality_keys=self.state_keys, ), "action": ModalityConfig( delta_indices=self.action_indices, modality_keys=self.action_keys, ), "language": ModalityConfig( delta_indices=self.observation_indices, modality_keys=self.language_keys, ), }
def transform(self): """Define data transformations.""" from starVLA.dataloader.gr00t_lerobot.transform.base import ComposedModalityTransform from starVLA.dataloader.gr00t_lerobot.transform.state_action import ( StateActionToTensor, StateActionTransform, )
transforms = [ # State transforms StateActionToTensor(apply_to=self.state_keys), StateActionTransform( apply_to=self.state_keys, normalization_modes={key: "min_max" for key in self.state_keys}, ), # Action transforms StateActionToTensor(apply_to=self.action_keys), StateActionTransform( apply_to=self.action_keys, normalization_modes={key: "min_max" for key in self.action_keys}, ), ] return ComposedModalityTransform(transforms=transforms)Note the mapping relationship implemented by Modality in the DataConfig. For example, if a dataset contains state and action with all degrees of freedom including arm, gripper, body, and wheel, Modality can slice out the meaning of each index range (via the start and end keys), and then reassemble and organize them in the DataConfig.
Concrete example: Suppose your robot has a 7-DOF arm + 1 gripper, and the raw state is an 8-dimensional vector [j0, j1, j2, j3, j4, j5, j6, gripper]. In modality.json, you split it as: "arm_joint": {"start": 0, "end": 7} for the first 7 dims (joint angles) and "gripper_joint": {"start": 7, "end": 8} for the 8th dim (gripper state). This lets StarVLA know which dimensions are arm joints and which are gripper, enabling different normalization strategies for each.
Register the Config
Section titled “Register the Config”Add your config to the ROBOT_TYPE_CONFIG_MAP at the bottom of data_config.py:
ROBOT_TYPE_CONFIG_MAP = { # ... existing configs ... "my_robot": MyRobotDataConfig(),}Normalization Modes
Section titled “Normalization Modes”Available normalization modes for StateActionTransform:
| Mode | Description |
|---|---|
min_max | Normalize to [-1, 1] using min/max statistics |
q99 | Normalize using 1st and 99th percentiles (robust to outliers) |
binary | Map to {-1, 1} for binary actions (e.g., gripper open/close) |
rotation_6d | Convert rotation to 6D representation |
axis_angle | Convert rotation to axis-angle representation |
Step 3: Create Data Mix
Section titled “Step 3: Create Data Mix”Register your dataset in starVLA/dataloader/gr00t_lerobot/mixtures.py:
DATASET_NAMED_MIXTURES = { # ... existing mixtures ...
# Single dataset "my_dataset": [ ("my_dataset_name", 1.0, "my_robot"), # (dataset_folder_name, sampling_weight, robot_type_config) ],
# Multiple datasets with different weights "my_mixed_dataset": [ ("my_dataset_task1", 1.0, "my_robot"), ("my_dataset_task2", 0.5, "my_robot"), # Half sampling weight ("my_dataset_task3", 2.0, "my_robot"), # Double sampling weight ],}Data Directory Structure
Section titled “Data Directory Structure”Your data should be organized as:
playground/Datasets/MY_DATA_ROOT/├── my_dataset_task1/│ ├── meta/│ ├── data/│ └── videos/├── my_dataset_task2/│ ├── meta/│ ├── data/│ └── videos/└── my_dataset_task3/ ├── meta/ ├── data/ └── videos/Step 4: Create Training Config
Section titled “Step 4: Create Training Config”Create a YAML config file (e.g., examples/MyRobot/train_files/starvla_my_robot.yaml):
# ===== Run Configuration =====run_id: my_robot_training # Experiment name; checkpoints saved under run_root_dir/run_id/run_root_dir: results/Checkpoints # Root directory for checkpoint outputseed: 42trackers: [jsonl, wandb] # Logging: jsonl (local) + wandb (online)wandb_entity: your_wandb_entity # Your wandb username or teamwandb_project: my_robot_projectis_debug: false # Set true to use minimal data for quick debugging
# ===== Model Framework Configuration =====framework: name: QwenOFT # Choose: QwenOFT / QwenGR00T / QwenFast / QwenPI qwenvl: base_vlm: ./playground/Pretrained_models/Qwen3-VL-4B-Instruct # VLM base model path attn_implementation: flash_attention_2 vl_hidden_dim: 2048 # VLM hidden dimension (2048 for Qwen3-VL-4B) dino: dino_backbone: dinov2_vits14 # Optional extra vision encoder for spatial features
action_model: action_model_type: DiT-B # Action model type (DiT-B only for GR00T/PI frameworks) hidden_size: 1024 max_seq_len: 1024 action_dim: 14 # Action dimension = your robot's joint count (e.g., 7 joints × 2 arms = 14) state_dim: 14 # State dimension, usually same as action_dim future_action_window_size: 15 # How many future steps the model predicts (action chunk length - 1) action_horizon: 16 # Total action sequence length = future + 1 (current step) past_action_window_size: 0 # Historical action window (0 = no history) repeated_diffusion_steps: 8 # Diffusion sampling repeats during training (GR00T/PI only) num_inference_timesteps: 4 # Diffusion steps at inference (fewer = faster, less precise) num_target_vision_tokens: 32 # Number of compressed vision tokens from VLM # DiT Transformer internals (usually no need to modify): diffusion_model_cfg: cross_attention_dim: 2048 # Must match VLM's hidden_dim dropout: 0.2 num_layers: 16 output_dim: 2560
# ===== Dataset Configuration =====datasets: # VLM data (optional, only needed for co-training) vlm_data: dataset_py: vlm_datasets dataformat: llava_json dataset_use: sharegpt4v_coco # Dataset name registered in qwen_data_config.py per_device_batch_size: 4
# VLA data (robot manipulation data, required) vla_data: dataset_py: lerobot_datasets data_root_dir: playground/Datasets/MY_DATA_ROOT # Dataset root directory data_mix: my_dataset # Mixture name registered in mixtures.py action_type: abs_qpos # Action type: abs_qpos = absolute joint position (target angle values) default_image_resolution: [3, 224, 224] # [channels, height, width] per_device_batch_size: 16 load_all_data_for_training: true # Load all training data into memory at startup (faster training, but uses more RAM) obs: ["image_0"] # Which cameras to use (image_0 = first camera in DataConfig's video_keys list) image_size: [224,224] video_backend: torchvision_av # Video decode backend (torchvision_av or decord)
# ===== Trainer Configuration =====trainer: epochs: 100 max_train_steps: 100000 # Max training steps (stops here regardless of epochs) num_warmup_steps: 5000 # Learning rate warmup steps save_interval: 5000 # Save checkpoint every N steps eval_interval: 100 # Evaluate on validation set every N steps
# Per-module learning rates: different components can use different rates learning_rate: base: 1e-05 # Default LR (used for modules not specified below) qwen_vl_interface: 1.0e-05 # VLM backbone LR action_model: 1.0e-04 # Action head LR (higher since training from scratch)
lr_scheduler_type: cosine_with_min_lr scheduler_specific_kwargs: min_lr: 5.0e-07 # Minimum LR for cosine decay
freeze_modules: '' # Module paths to freeze (empty = all trainable) loss_scale: vla: 1.0 # VLA loss weight vlm: 0.1 # VLM loss weight (for co-training) repeated_diffusion_steps: 4 # Diffusion sampling repeats at train time (overrides action_model value) max_grad_norm: 1.0 # Gradient clipping threshold gradient_accumulation_steps: 1 # Increase if running out of GPU memory
optimizer: name: AdamW betas: [0.9, 0.95] eps: 1.0e-08 weight_decay: 1.0e-08| Framework | Action Head | Best For |
|---|---|---|
QwenOFT | MLP | Fast inference, simple tasks |
QwenGR00T | Flow-matching DiT | Complex manipulation, high precision |
QwenFast | Discrete tokens | Token-based action prediction |
QwenPI | Diffusion | Multimodal action distributions |
You can also choose community-supported models, which share the BaseFramework and can be adapted simply by modifying the config.
Step 5: Run Training
Section titled “Step 5: Run Training”Create a training script (e.g., examples/MyRobot/train_files/run_train.sh):
#!/bin/bash
# ========== Required parameter ==========config_yaml=./examples/MyRobot/train_files/starvla_my_robot.yaml # Training config file (required)
# ========== Optional overrides (CLI takes priority over YAML values) ==========Framework_name=QwenOFTbase_vlm=playground/Pretrained_models/Qwen2.5-VL-3B-Instructdata_root=playground/Datasets/MY_DATA_ROOTdata_mix=my_datasetrun_root_dir=./results/Checkpointsrun_id=my_robot_experiment
# Create output directoryoutput_dir=${run_root_dir}/${run_id}mkdir -p ${output_dir}cp $0 ${output_dir}/
# Launch training# --config_yaml is the only required argument; all other --xxx flags are optional CLI overrides.# If you've already configured everything in your YAML file, you can omit the override flags below.accelerate launch \ --config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \ --num_processes 8 \ starVLA/training/train_starvla.py \ --config_yaml ${config_yaml} \ --framework.name ${Framework_name} \ --framework.qwenvl.base_vlm ${base_vlm} \ --datasets.vla_data.data_root_dir ${data_root} \ --datasets.vla_data.data_mix ${data_mix} \ --datasets.vla_data.per_device_batch_size 4 \ --trainer.max_train_steps 100000 \ --trainer.save_interval 10000 \ --run_root_dir ${run_root_dir} \ --run_id ${run_id}Multi-Node Training
Section titled “Multi-Node Training”For multi-node distributed training:
accelerate launch \ --config_file starVLA/config/deepseeds/deepspeed_zero2.yaml \ --main_process_ip $MASTER_ADDR \ --main_process_port $MASTER_PORT \ --machine_rank $SLURM_PROCID \ --num_machines $SLURM_NNODES \ --num_processes ${TOTAL_GPUS} \ starVLA/training/train_starvla.py \ --config_yaml ${config_yaml} \ # ... other arguments