# Quick Start
## Background

If you’re new to these concepts, here’s a quick primer:
- VLM (Vision-Language Model): An AI model that understands both images and text, such as Qwen-VL or GPT-4V.
- VLA (Vision-Language-Action Model): Extends a VLM with action output, so the model can not only “see” and “speak” but also “act”: it takes images and natural-language instructions as input and outputs robot actions, e.g., joint angles (the target angle values for each joint of a robot arm); see the sketch after this list. Besides building from VLMs, VLAs can also be built from WMs (World Models), video-generation models that predict future states.
- What StarVLA does: Think of StarVLA as “PyTorch for VLA development” — it provides the full infrastructure for transforming VLMs into VLAs: data loading, training loops, evaluation, and deployment pipelines are all reusable so you can focus on the model itself. Whether you start from a VLM or a WM, you use the same toolkit for training and evaluation.
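To make the input/output contract concrete, here is a minimal sketch of a VLA’s prediction step. It is illustrative only: `ToyVLA` and its shapes are our assumptions, not StarVLA’s actual API, though the output convention matches the `predict_action` smoke test later in this guide.

```python
import numpy as np

# Illustrative only: a VLA maps (images + instruction) -> actions,
# where actions have shape [batch, action_horizon, action_dim].
class ToyVLA:
    def __init__(self, action_horizon: int = 16, action_dim: int = 7):
        self.action_horizon = action_horizon
        self.action_dim = action_dim  # e.g., 6 joint angles + 1 gripper

    def predict_action(self, images: np.ndarray, instruction: str) -> np.ndarray:
        # A real VLA runs the VLM backbone plus an action head here;
        # zeros stand in for the predicted action chunk.
        batch = images.shape[0]
        return np.zeros((batch, self.action_horizon, self.action_dim))

fake_images = np.zeros((1, 224, 224, 3), dtype=np.uint8)  # one RGB frame
actions = ToyVLA().predict_action(fake_images, "pick up the red block")
print(actions.shape)  # (1, 16, 7)
```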
## Requirements

| Item | Minimum | Recommended |
|---|---|---|
| GPU | 1× NVIDIA GPU (≥16GB VRAM) | 8× A800 or more (A100 / H200, etc.) |
| CUDA | 12.0+ | 12.4 |
| Python | 3.10 | 3.10 |
| Disk | ~20GB (code + base model) | 100GB+ (with datasets) |
| OS | Linux (Ubuntu 20.04+) | Ubuntu 22.04 |
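If you already have PyTorch available, a quick check (our addition, not a repo script) confirms your GPU clears the 16GB VRAM minimum:

```python
import torch

# Sanity check against the 16GB VRAM minimum in the table above.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU visible to PyTorch.")

props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"{props.name}: {vram_gb:.1f} GB VRAM")
if vram_gb < 16:
    print("Warning: below the 16GB minimum; expect OOM on larger models.")
```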
## Step 1: Installation

1. Clone the repository

   ```bash
   git clone https://github.com/starVLA/starVLA
   cd starVLA
   ```

2. Create a conda environment

   ```bash
   conda create -n starVLA python=3.10 -y
   conda activate starVLA
   ```

3. Install dependencies

   ```bash
   # Install base dependencies
   pip install -r requirements.txt

   # Install FlashAttention2 (required for fast Transformer inference)
   # Note: flash-attn compiles from source; first-time installation
   # may take 10-20 minutes; this is normal
   pip install flash-attn --no-build-isolation

   # Install starVLA in development mode (-e = editable mode:
   # code changes take effect immediately without reinstalling)
   pip install -e .
   ```
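To confirm the editable install is wired up, you can check where Python resolves the package from. This check is our suggestion; it assumes the importable package is the inner `starVLA/` directory shown in the layout below:

```python
import starVLA  # assumes the inner starVLA/ package from the repo layout

# With an editable (-e) install, __file__ points into your git checkout,
# not site-packages, so local edits take effect without reinstalling.
print(starVLA.__file__)
```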
## Troubleshooting: flash-attn Installation

flash-attn is sensitive to CUDA and PyTorch versions. If installation fails, check version compatibility:

```bash
# Check CUDA version
nvcc -V

# Check installed package versions
pip list | grep -E 'torch|transformers|flash-attn'
```

Verified combinations:
- `flash-attn==2.7.4.post1` + CUDA 12.0 / 12.4 + PyTorch 2.6.0
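You can also read the versions from inside Python; `torch.version.cuda` reports the CUDA version PyTorch was built against, which is what flash-attn must match:

```python
import torch

print("torch:", torch.__version__)
print("built against CUDA:", torch.version.cuda)  # compare with `nvcc -V`

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (or its build failed)")
```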
If your nvcc version doesn’t match your PyTorch CUDA version (e.g., nvcc 11.8 but PyTorch cu121), you need to align them. The simplest way is to reinstall PyTorch for your nvcc version:
```bash
# Example: for nvcc 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
```

## Step 2: Verify Installation
Two quick steps to confirm everything works:
1. Download a base model

   StarVLA is built on the Qwen-VL model family. You need to download a base model first.

   ```bash
   # Install the Hugging Face CLI (if not already installed)
   pip install huggingface_hub[cli]

   # Download Qwen3-VL-4B (~8GB)
   huggingface-cli download Qwen/Qwen3-VL-4B-Instruct --local-dir ./playground/Pretrained_models/Qwen3-VL-4B-Instruct
   ```
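To verify the download is complete and readable before moving on, you can load just the config and processor. This is our suggestion, not a repo script, and it needs a transformers version recent enough to recognize the Qwen3-VL architecture:

```python
from transformers import AutoConfig, AutoProcessor

path = "./playground/Pretrained_models/Qwen3-VL-4B-Instruct"

# Reads config.json and the tokenizer/processor files without loading
# the ~8GB of weights, so it is a fast integrity check.
config = AutoConfig.from_pretrained(path)
processor = AutoProcessor.from_pretrained(path)
print(config.model_type)  # should report a Qwen VL architecture
```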
2. Run a framework smoke test

   Run a forward pass with fake data to verify the model loads and predicts correctly:

   ```bash
   python starVLA/model/framework/QwenGR00T.py
   ```

   You should see:

   - The full model structure printed
   - `model.predict_action(fake_data)` returns an action array (shape: `[batch, action_horizon, action_dim]`)
   - No errors
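Conceptually, the smoke test amounts to the following check. This is a sketch only; the exact keys of the fake batch and the `predict_action` signature are assumptions based on the expected output above:

```python
import torch

# Hypothetical fake batch: one sample, one RGB frame, one instruction.
fake_data = {
    "images": torch.zeros(1, 3, 224, 224),
    "instruction": ["pick up the red block"],
}

# model = ...  # the QwenGR00T framework, loaded with the base VLM weights
# actions = model.predict_action(fake_data)
# assert actions.ndim == 3  # [batch, action_horizon, action_dim]
```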
If you get CUDA out of memory, try a smaller model (e.g., Qwen2.5-VL-3B).
## Directory Structure

After installation, here’s the project layout you’ll work with:

```text
starVLA/                     # Project root (git repository)
├── starVLA/                 # Core package (Python convention: outer dir is the project, inner same-name dir is the actual package code)
│   ├── model/framework/     # Model definitions (QwenOFT.py, QwenGR00T.py, etc.)
│   ├── dataloader/          # Data loading pipelines
│   ├── training/            # Training scripts
│   └── config/              # DeepSpeed and training config templates
├── deployment/              # Deployment (policy server)
├── examples/                # Per-benchmark evaluation and training examples
│   ├── LIBERO/
│   ├── SimplerEnv/
│   ├── Robocasa_tabletop/
│   ├── Robotwin/
│   └── Behavior/
├── playground/              # Convention directory for models and data
│   ├── Pretrained_models/   # Base models (e.g., Qwen3-VL-4B-Instruct)
│   └── Datasets/            # Training datasets
└── results/                 # Training outputs (checkpoints, logs)
    └── Checkpoints/
```
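Later guides assume the playground/ and results/ conventions, so it is worth making sure those directories exist. A small helper (our addition) that creates them if missing, run from the project root:

```python
from pathlib import Path

# These paths follow the playground/results conventions in the layout above.
for d in [
    Path("playground/Pretrained_models"),
    Path("playground/Datasets"),
    Path("results/Checkpoints"),
]:
    d.mkdir(parents=True, exist_ok=True)
    print(f"{d}: present")
```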
## What’s Next?

Once your installation is verified, choose your path:
| Your Goal | Recommended Reading |
|---|---|
| Understand StarVLA’s design | Lego-like Design |
| Run evaluation with existing checkpoints | Check Model Zoo for checkpoints, then follow a benchmark guide (LIBERO, SimplerEnv) |
| Train with your own data | Use Your Own LeRobot Dataset |
| Co-train with VLM data | Co-Training with VLM Data |
| Common questions | FAQ |