
Quick Start

If you’re new to these concepts, here’s a quick primer:

  • VLM (Vision-Language Model): An AI model that understands both images and text, such as Qwen-VL or GPT-4V.
  • VLA (Vision-Language-Action Model): Extends a VLM with action output, so the model can not only “see” and “speak” but also “act” — it takes images and natural language instructions as input and outputs robot actions (e.g., joint angles — the target angle values for each joint of a robot arm). Besides building from VLMs, VLAs can also be built from WMs (World Models) — video generation models that predict future states.
  • What StarVLA does: Think of StarVLA as “PyTorch for VLA development” — it provides the full infrastructure for transforming VLMs into VLAs: data loading, training loops, evaluation, and deployment pipelines are all reusable so you can focus on the model itself. Whether you start from a VLM or a WM, you use the same toolkit for training and evaluation.
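To make the VLA input/output contract concrete, here is a minimal sketch; the function name, the 7-joint arm, and the random outputs are all hypothetical, not StarVLA's actual API:

```python
import random

# Hypothetical VLA policy interface: images + a language instruction in,
# one target joint angle per arm joint out. The values here are random
# placeholders, not real model predictions.
def predict_action(image, instruction, num_joints=7):
    """Return one target angle (radians) per joint of the robot arm."""
    return [random.uniform(-3.14, 3.14) for _ in range(num_joints)]

fake_image = [[0] * 224 for _ in range(224)]  # stand-in for a camera frame
action = predict_action(fake_image, "pick up the red cube")
print(len(action))  # one target angle per joint
```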
System requirements:

| Item | Minimum | Recommended |
| --- | --- | --- |
| GPU | 1× NVIDIA GPU (≥16 GB VRAM) | 8× A800 or more (A100 / H200, etc.) |
| CUDA | 12.0+ | 12.4 |
| Python | 3.10 | 3.10 |
| Disk | ~20 GB (code + base model) | 100 GB+ (with datasets) |
| OS | Linux (Ubuntu 20.04+) | Ubuntu 22.04 |
1. Clone the repository

   ```shell
   git clone https://github.com/starVLA/starVLA
   cd starVLA
   ```

2. Create a conda environment

   ```shell
   conda create -n starVLA python=3.10 -y
   conda activate starVLA
   ```

3. Install dependencies

   ```shell
   # Install base dependencies
   pip install -r requirements.txt

   # Install FlashAttention2 (required for fast Transformer inference).
   # Note: flash-attn compiles from source; the first installation may take
   # 10-20 minutes, which is normal.
   pip install flash-attn --no-build-isolation

   # Install starVLA in editable mode (-e): code changes take effect
   # immediately without reinstalling.
   pip install -e .
   ```
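After these steps, a quick sketch to confirm the key packages are importable; treating `starVLA` as the package name is an assumption based on the editable install above:

```python
import importlib.util

# find_spec looks a package up without importing it, so this check
# won't trigger heavy CUDA initialization.
results = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ["torch", "flash_attn", "starVLA"]
}
for pkg, found in results.items():
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```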

flash-attn is sensitive to CUDA and PyTorch versions. If installation fails, check version compatibility:

```shell
# Check CUDA version
nvcc -V
# Check installed package versions
pip list | grep -E 'torch|transformers|flash-attn'
```

Verified combinations:

  • flash-attn==2.7.4.post1 + CUDA 12.0 / 12.4 + PyTorch 2.6.0

If your nvcc version doesn’t match your PyTorch CUDA version (e.g., nvcc 11.8 but PyTorch cu121), you need to align them. The simplest way is to reinstall PyTorch for your nvcc version:

```shell
# Example: for nvcc 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
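To check the alignment programmatically, a minimal sketch comparing the version printed by `nvcc -V` with `torch.version.cuda`; the strict major.minor matching rule here is an assumption (a matching major version is often enough in practice):

```python
# Compare the CUDA version reported by nvcc with the one PyTorch was
# built against (torch.version.cuda), on major.minor only.
def cuda_versions_match(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Return True if the two version strings agree on major.minor."""
    def major_minor(v: str) -> tuple:
        parts = v.strip().split(".")
        return (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)
    return major_minor(nvcc_version) == major_minor(torch_cuda_version)

print(cuda_versions_match("12.4", "12.4"))  # aligned toolchain
print(cuda_versions_match("11.8", "12.1"))  # mismatch: reinstall PyTorch
```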

Two quick steps to confirm everything works:

1. Download a base model

   StarVLA is built on the Qwen-VL model family, so you need to download a base model first.

   ```shell
   # Install the Hugging Face CLI (if not already installed)
   pip install "huggingface_hub[cli]"
   # Download Qwen3-VL-4B (~8GB)
   huggingface-cli download Qwen/Qwen3-VL-4B-Instruct --local-dir ./playground/Pretrained_models/Qwen3-VL-4B-Instruct
   ```
2. Run a framework smoke test

   Run a forward pass with fake data to verify the model loads and predicts correctly:

   ```shell
   python starVLA/model/framework/QwenGR00T.py
   ```

   You should see:

   • The full model structure printed
   • model.predict_action(fake_data) returns an action array (shape: [batch, action_horizon, action_dim])
   • No errors

   If you get CUDA out of memory, try a smaller model (e.g., Qwen2.5-VL-3B).
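As a follow-up to the model download in step 1, a small sketch that checks the snapshot looks complete; the expected file names are an assumption based on the usual Hugging Face layout of `config.json` plus `*.safetensors` (or legacy `*.bin`) weight shards:

```python
from pathlib import Path

def model_snapshot_complete(model_dir: Path) -> bool:
    """Check that a downloaded model snapshot has a config and weights."""
    has_config = (model_dir / "config.json").is_file()
    has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))
    return has_config and has_weights

# Path used in the download command from step 1.
print(model_snapshot_complete(Path("./playground/Pretrained_models/Qwen3-VL-4B-Instruct")))
```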

After installation, here’s the project layout you’ll work with:

```
starVLA/                      # Project root (git repository)
├── starVLA/                  # Core package (Python convention: outer dir is the project, inner same-name dir is the actual package code)
│   ├── model/framework/      # Model definitions (QwenOFT.py, QwenGR00T.py, etc.)
│   ├── dataloader/           # Data loading pipelines
│   ├── training/             # Training scripts
│   └── config/               # DeepSpeed and training config templates
├── deployment/               # Deployment (policy server)
├── examples/                 # Per-benchmark evaluation and training examples
│   ├── LIBERO/
│   ├── SimplerEnv/
│   ├── Robocasa_tabletop/
│   ├── Robotwin/
│   └── Behavior/
├── playground/               # Convention directory for models and data
│   ├── Pretrained_models/    # Base models (e.g., Qwen3-VL-4B-Instruct)
│   └── Datasets/             # Training datasets
└── results/                  # Training outputs (checkpoints, logs)
    └── Checkpoints/
```
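If you are starting from a fresh clone, the convention directories can be pre-created; the names are taken from the layout above (results/Checkpoints is normally created by training runs, but pre-creating it is harmless):

```python
from pathlib import Path

# Directory names from the project layout above.
for d in ["playground/Pretrained_models", "playground/Datasets", "results/Checkpoints"]:
    Path(d).mkdir(parents=True, exist_ok=True)
    print("ok:", d)
```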

Once your installation is verified, choose your path:

| Your Goal | Recommended Reading |
| --- | --- |
| Understand StarVLA's design | Lego-like Design |
| Run evaluation with existing checkpoints | Check Model Zoo for checkpoints, then follow a benchmark guide (LIBERO, SimplerEnv) |
| Train with your own data | Use Your Own LeRobot Dataset |
| Co-train with VLM data | Co-Training with VLM Data |
| Common questions | FAQ |