Project Overview

StarVLA is a Lego-like, modular codebase for developing Vision-Language Models (VLMs) into Vision-Language-Action (VLA) models.

In short: VLMs understand images and text; VLAs additionally output robot actions. StarVLA handles this transformation end-to-end — from data preparation and model training to simulation evaluation — with components that are independently debuggable and plug-and-play.

StarVLA officially provides the Qwen-VL-based StarVLA Model Family with four different action-output strategies:

  • Qwen-FAST: encodes actions as discrete tokens predicted by the language model (reference: pi0-FAST).
  • Qwen-OFT: MLP head after the VLM output, directly regressing continuous action values (reference: OpenVLA-OFT).
  • Qwen-PI: flow-matching (diffusion-based) method for generating continuous actions (reference: pi0).
  • Qwen-GR00T: dual-system design, with the VLM for high-level reasoning plus a DiT for fast action generation (reference: GR00T-N1).

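To make the first strategy concrete, here is a minimal sketch of FAST-style action tokenization (hypothetical code, not StarVLA's actual implementation): each continuous action dimension is binned into one of a fixed number of discrete ids, so the language model can predict actions as ordinary tokens, and the ids are decoded back to bin-center values.

```python
# Hypothetical sketch of discrete action tokenization (not StarVLA's real API):
# map each continuous action dimension to one of N_BINS ids so a language model
# can predict actions as tokens, then decode ids back to continuous values.

N_BINS = 256           # number of token ids reserved for actions (assumption)
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def encode_action(action):
    """Continuous values in [LOW, HIGH] -> token ids in [0, N_BINS - 1]."""
    tokens = []
    for v in action:
        v = min(max(v, LOW), HIGH)        # clamp out-of-range values
        frac = (v - LOW) / (HIGH - LOW)   # normalize to [0, 1]
        tokens.append(min(int(frac * N_BINS), N_BINS - 1))
    return tokens

def decode_tokens(tokens):
    """Token ids -> bin-center continuous values (lossy inverse)."""
    return [LOW + (t + 0.5) / N_BINS * (HIGH - LOW) for t in tokens]

action = [0.25, -0.7, 0.0]        # e.g. a 3-DoF end-effector delta
tokens = encode_action(action)
recovered = decode_tokens(tokens)  # within one bin width of the original
```

The round-trip error is bounded by half the bin width (here (HIGH - LOW) / N_BINS / 2), which is the usual trade-off of token-based action heads: a larger action vocabulary gives finer control at the cost of a longer output distribution.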
Modularity means you only need to define your model structure in a Framework; the shared Trainer, Dataloader, and evaluation/deployment pipeline are reused as-is, with no need to rewrite training loops or evaluation code.
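The plug-and-play idea can be illustrated with a small sketch (all names here are hypothetical, not StarVLA's actual interfaces): each framework only defines how to turn a batch into a loss, and one shared trainer loop drives any of them unchanged.

```python
# Illustrative plug-and-play pattern (hypothetical names, not StarVLA's real API):
# a "framework" only implements compute_loss; the shared trainer loop is
# identical no matter which framework is plugged in.

from typing import Protocol

class Framework(Protocol):
    def compute_loss(self, batch: dict) -> float: ...

class TokenActionFramework:
    """FAST-style framework: would score discrete action tokens; toy stand-in."""
    def compute_loss(self, batch: dict) -> float:
        return sum(batch["targets"]) * 0.1       # placeholder for cross-entropy

class RegressionActionFramework:
    """OFT-style framework: would regress continuous actions; toy stand-in."""
    def compute_loss(self, batch: dict) -> float:
        return sum(t * t for t in batch["targets"])  # placeholder for MSE

def train(framework: Framework, dataloader, steps: int) -> list[float]:
    """Shared trainer: one loop for every framework."""
    losses = []
    for _, batch in zip(range(steps), dataloader):
        losses.append(framework.compute_loss(batch))
    return losses

batches = [{"targets": [0.1, 0.2]}, {"targets": [0.3, 0.4]}]
losses_a = train(TokenActionFramework(), iter(batches), steps=2)
losses_b = train(RegressionActionFramework(), iter(batches), steps=2)
```

Swapping action heads then means swapping one class; the trainer, data pipeline, and evaluation entry points never see the difference.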

Supported training paradigms:

  • Single-task imitation learning (learning from human demonstrations; no reward function needed).
  • Multimodal multi-task co-training (training on multiple data sources simultaneously to prevent the model from forgetting previously learned capabilities).
  • [Planned] Reinforcement learning adaptation.

Supported or planned benchmarks:

  • Supported: SimplerEnv, LIBERO, RoboCasa, RoboTwin, CALVIN, BEHAVIOR.
  • Planned: SO101, RLBench.

[Figures: StarVLA results on SimplerEnv, LIBERO, and RoboCasa.]

Results are tracked in a continuously updated Overleaf report (a PDF with the latest benchmark data and analysis): https://www.overleaf.com/read/qqtwrnprctkf#d5bdce


Projects Based on StarVLA


Latest Updates

  • 2025/12/25: Pipelines established for BEHAVIOR-1K, RoboTwin 2.0, and CALVIN; baselines to be shared with the community.
  • 2025/12/25: RoboCasa evaluation support released, achieving SOTA without pretraining. See the RoboCasa documentation.
  • 2025/12/15: Release regression check completed; ongoing updates in the Daily Development Log.
  • 2025/12/09: Open-source training for VLM, VLA, and VLA+VLM co-training. See the VLM co-training documentation.
  • 2025/11/12: Florence-2 support added for resource-constrained VLM training (single A100). See Lego-like Design for workflow notes.
  • 2025/10/30: LIBERO training and evaluation guides released.
  • 2025/10/25: Script links and packaging polished based on community feedback.