
Quick Start

If you’re new to these concepts, here’s a quick primer:

  • VLM (Vision-Language Model): An AI model that understands both images and text, such as Qwen-VL or GPT-4V.
  • VLA (Vision-Language-Action Model): Extends a VLM with action output, so the model can not only “see” and “speak” but also “act” — it takes images and natural language instructions as input and outputs robot actions (e.g., joint angles — the target angle values for each joint of a robot arm). Besides building from VLMs, VLAs can also be built from WMs (World Models) — video generation models that predict future states.
  • What StarVLA does: Think of StarVLA as “PyTorch for VLA development” — it provides the full infrastructure for transforming VLMs into VLAs: data loading, training loops, evaluation, and deployment pipelines are all reusable so you can focus on the model itself. Whether you start from a VLM or a WM, you use the same toolkit for training and evaluation.
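To make the VLA input/output contract concrete, here is a minimal sketch; the function name, the 7-joint arm, and the random outputs are all hypothetical, not StarVLA's actual API:

```python
import random

# Hypothetical VLA policy interface: images + a language instruction in,
# one target joint angle per arm joint out. The values here are random
# placeholders, not real model predictions.
def predict_action(image, instruction, num_joints=7):
    """Return one target angle (radians) per joint of the robot arm."""
    return [random.uniform(-3.14, 3.14) for _ in range(num_joints)]

fake_image = [[0] * 224 for _ in range(224)]  # stand-in for a camera frame
action = predict_action(fake_image, "pick up the red cube")
print(len(action))  # one target angle per joint
```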
System requirements:

| Item | Minimum | Recommended |
| --- | --- | --- |
| GPU | 1× NVIDIA GPU (≥16 GB VRAM) | 8× A800 or more (A100 / H200, etc.) |
| CUDA | 12.0+ | 12.4 |
| Python | 3.10 | 3.10 |
| Disk | ~20 GB (code + base model) | 100 GB+ (with datasets) |
| OS | Linux (Ubuntu 20.04+) | Ubuntu 22.04 |
1. Clone the repository

   ```shell
   git clone https://github.com/starVLA/starVLA
   cd starVLA
   ```

2. Create a conda environment

   ```shell
   conda create -n starVLA python=3.10 -y
   conda activate starVLA
   ```

3. Install dependencies

   ```shell
   # Install base dependencies
   pip install -r requirements.txt

   # Install FlashAttention2 (required for fast Transformer inference).
   # Note: flash-attn compiles from source; the first installation may take
   # 10-20 minutes, which is normal.
   pip install flash-attn --no-build-isolation

   # Install starVLA in editable mode (-e): code changes take effect
   # immediately without reinstalling.
   pip install -e .
   ```
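After these steps, a quick sketch to confirm the key packages are importable; treating `starVLA` as the package name is an assumption based on the editable install above:

```python
import importlib.util

# find_spec looks a package up without importing it, so this check
# won't trigger heavy CUDA initialization.
results = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ["torch", "flash_attn", "starVLA"]
}
for pkg, found in results.items():
    print(f"{pkg}: {'found' if found else 'MISSING'}")
```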

flash-attn is sensitive to CUDA and PyTorch versions. If installation fails, check version compatibility:

```shell
# Check CUDA version
nvcc -V
# Check installed package versions
pip list | grep -E 'torch|transformers|flash-attn'
```

Verified combinations:

  • flash-attn==2.7.4.post1 + CUDA 12.0 / 12.4 + PyTorch 2.6.0

If your nvcc version doesn’t match your PyTorch CUDA version (e.g., nvcc 11.8 but PyTorch cu121), you need to align them. The simplest way is to reinstall PyTorch for your nvcc version:

```shell
# Example: for nvcc 12.4
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
```
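To check the alignment programmatically, a minimal sketch comparing the version printed by `nvcc -V` with `torch.version.cuda`; the strict major.minor matching rule here is an assumption (a matching major version is often enough in practice):

```python
# Compare the CUDA version reported by nvcc with the one PyTorch was
# built against (torch.version.cuda), on major.minor only.
def cuda_versions_match(nvcc_version: str, torch_cuda_version: str) -> bool:
    """Return True if the two version strings agree on major.minor."""
    def major_minor(v: str) -> tuple:
        parts = v.strip().split(".")
        return (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)
    return major_minor(nvcc_version) == major_minor(torch_cuda_version)

print(cuda_versions_match("12.4", "12.4"))  # aligned toolchain
print(cuda_versions_match("11.8", "12.1"))  # mismatch: reinstall PyTorch
```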

Two quick steps to confirm everything works:

1. Download a base model

   StarVLA is built on the Qwen-VL model family, so you need to download a base model first.

   ```shell
   # Install the Hugging Face CLI (if not already installed)
   pip install "huggingface_hub[cli]"
   # Download Qwen3-VL-4B (~8GB)
   huggingface-cli download Qwen/Qwen3-VL-4B-Instruct --local-dir ./playground/Pretrained_models/Qwen3-VL-4B-Instruct
   ```
2. Run a framework smoke test

   Run a forward pass with fake data to verify the model loads and predicts correctly:

   ```shell
   python starVLA/model/framework/QwenGR00T.py
   ```

   You should see:

   • The full model structure printed
   • model.predict_action(fake_data) returns an action array (shape: [batch, action_horizon, action_dim])
   • No errors

   If you get CUDA out of memory, try a smaller model (e.g., Qwen2.5-VL-3B).
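As a follow-up to the model download in step 1, a small sketch that checks the snapshot looks complete; the expected file names are an assumption based on the usual Hugging Face layout of `config.json` plus `*.safetensors` (or legacy `*.bin`) weight shards:

```python
from pathlib import Path

def model_snapshot_complete(model_dir: Path) -> bool:
    """Check that a downloaded model snapshot has a config and weights."""
    has_config = (model_dir / "config.json").is_file()
    has_weights = any(model_dir.glob("*.safetensors")) or any(model_dir.glob("*.bin"))
    return has_config and has_weights

# Path used in the download command from step 1.
print(model_snapshot_complete(Path("./playground/Pretrained_models/Qwen3-VL-4B-Instruct")))
```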

After installation, here’s the project layout you’ll work with:

```
starVLA/                      # Project root (git repository)
├── starVLA/                  # Core package (Python convention: outer dir is the project, inner same-name dir is the actual package code)
│   ├── model/framework/      # Model definitions (QwenOFT.py, QwenGR00T.py, etc.)
│   ├── dataloader/           # Data loading pipelines
│   ├── training/             # Training scripts
│   └── config/               # DeepSpeed and training config templates
├── deployment/               # Deployment (policy server)
├── examples/                 # Per-benchmark evaluation and training examples
│   ├── LIBERO/
│   ├── SimplerEnv/
│   ├── Robocasa_tabletop/
│   ├── Robotwin/
│   └── Behavior/
├── playground/               # Convention directory for models and data
│   ├── Pretrained_models/    # Base models (e.g., Qwen3-VL-4B-Instruct)
│   └── Datasets/             # Training datasets
└── results/                  # Training outputs (checkpoints, logs)
    └── Checkpoints/
```
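If you are starting from a fresh clone, the convention directories can be pre-created; the names are taken from the layout above (results/Checkpoints is normally created by training runs, but pre-creating it is harmless):

```python
from pathlib import Path

# Directory names from the project layout above.
for d in ["playground/Pretrained_models", "playground/Datasets", "results/Checkpoints"]:
    Path(d).mkdir(parents=True, exist_ok=True)
    print("ok:", d)
```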

Once your installation is verified, choose your path:

| Your Goal | Recommended Reading |
| --- | --- |
| Understand StarVLA's design | Lego-like Design |
| Run evaluation with existing checkpoints | Check Model Zoo for checkpoints, then follow a benchmark guide (LIBERO, SimplerEnv) |
| Train with your own data | Use Your Own LeRobot Dataset |
| Co-train with VLM data | Co-Training with VLM Data |
| Common questions | FAQ |