Evaluation Framework

Overview

StarVLA standardizes the inference pipeline for real-robot and simulation evaluations by tunneling data through WebSocket (a network protocol for bidirectional, real-time communication between client and server), so new models can be integrated into existing evaluation environments with minimal changes.

Architecture

The StarVLA framework uses a client-server architecture to separate the evaluation/deployment environment (client) from the policy server (model inference).

Policy Server: Loads the model, receives observations, and outputs normalized actions.
Client: Interfaces with the simulator or real robot, and post-processes model outputs:
- Unnormalize: Converts the model’s [-1, 1] normalized actions back to physical quantities (e.g., joint angles).
- Delta-to-Absolute: If the model outputs incremental actions relative to the current position, adds them to the current state to get absolute target positions.
- Action Ensemble: The model may predict multiple future steps at once; overlapping predictions from consecutive calls are weighted-averaged for smoother execution.
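
The three client-side post-processing steps above can be sketched as follows. This is a minimal illustration, not StarVLA's implementation: the function names are hypothetical, unnormalization assumes min-max statistics (`low`, `high`) per action dimension, and the exponential `decay` weighting in `ensemble` is just one possible choice. Plain Python lists are used so the sketch runs without dependencies; the same arithmetic applies to `np.ndarray`.

```python
from typing import List, Sequence


def unnormalize(action: Sequence[float], low: Sequence[float], high: Sequence[float]) -> List[float]:
    """Map each component from [-1, 1] back to [low, high] (min-max assumption)."""
    return [lo + (a + 1.0) * 0.5 * (hi - lo) for a, lo, hi in zip(action, low, high)]


def delta_to_absolute(delta: Sequence[float], current: Sequence[float]) -> List[float]:
    """Add an incremental action to the current state to get an absolute target."""
    return [c + d for c, d in zip(current, delta)]


def ensemble(predictions: Sequence[Sequence[float]], decay: float = 0.5) -> List[float]:
    """Weighted average of several predictions made for the same timestep.

    `predictions` is ordered oldest to newest. The newest prediction gets
    weight 1 and each older one is down-weighted by `decay` (hypothetical scheme).
    """
    n = len(predictions)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[j] for w, p in zip(weights, predictions)) / total for j in range(dim)]


# Example: a 2-DoF arm with assumed joint limits in radians.
low, high = [-3.0, -1.0], [3.0, 1.0]
current = [0.5, 0.0]

physical = unnormalize([0.0, 1.0], low, high)             # normalized -> physical units
target = delta_to_absolute([0.1, -0.2], current)          # increment -> absolute target
smoothed = ensemble([[0.0, 0.0], [1.0, 2.0]], decay=0.5)  # older and newer prediction
```

With `decay=0.5` the newer prediction counts twice as much as the older one, so `smoothed` lands two thirds of the way toward it.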

Policy Server Architecture

Component Description

Component	Description
Sim / Real Controller	External to StarVLA: Contains the core loop of the evaluation environment or robot controller, handling observation collection (`get_obs()`) and action execution (`apply_action()`).
PolicyClient.py & WebSocket & PolicyServer	Standard Communication Flow: Client-side wrapper responsible for data transmission (tunneling) and interfacing the environment with the server.
Framework.py	Model Inference Core: Contains the user-defined model inference function (`Framework.predict_action`), which holds the main logic for generating actions.
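
To make the server-side contract concrete, here is a hedged sketch of a stand-in for the `Framework.py` row. `DummyFramework` is hypothetical and only mirrors what this page states: `predict_action` receives an `example` dict with `image` and `lang`, and the result exposes a batch under `normalized_actions`. Whether each batch item is a single action or a chunk of future steps depends on the framework; the sketch assumes a chunk.

```python
from typing import Any, Dict, List


class DummyFramework:
    """Hypothetical stand-in for a model class exposing `predict_action`."""

    def __init__(self, action_dim: int = 7, chunk_len: int = 4) -> None:
        self.action_dim = action_dim   # assumed number of action dimensions
        self.chunk_len = chunk_len     # assumed number of predicted future steps

    def predict_action(self, example: Dict[str, Any]) -> Dict[str, Any]:
        # A real framework would run the model on example["image"] and example["lang"].
        if "image" not in example or "lang" not in example:
            raise KeyError("example must contain 'image' and 'lang'")
        # One batch item holding `chunk_len` steps of zeros (valid values in [-1, 1]).
        chunk: List[List[float]] = [[0.0] * self.action_dim for _ in range(self.chunk_len)]
        return {"normalized_actions": [chunk]}


framework = DummyFramework()
result = framework.predict_action({"image": [], "lang": "pick up the cup"})
first = result["normalized_actions"][0]   # first item in the batch
```

A stub like this can stand in for the real model while the client and environment wiring is being debugged.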

Data Protocol

Minimal pseudo-code example (evaluation-side client):

# WebsocketClientPolicy lives in deployment/policy_client/policy_client.py
from deployment.policy_client.policy_client import WebsocketClientPolicy

client = WebsocketClientPolicy(
    host="127.0.0.1",
    port=10092,  # must match the --port the policy server was launched with
)

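# capture_multiview(), get_instruction(), and apply_action() are placeholders
# for functions provided by your own environment or robot controller.
while True: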
    images = capture_multiview()          # returns List[np.ndarray]
    lang = get_instruction()              # may come from task scripts
    example = {
        "image": images,
        "lang": lang,
    }

    result = client.predict_action(example)  # --> forwarded to framework.predict_action
    action = result["normalized_actions"][0] # take the first item in the batch
    apply_action(action)

To start the policy server, launch it with the script below. The client must connect to the same port that is set here.

#!/bin/bash
export PYTHONPATH=$(pwd):${PYTHONPATH}

# Point to your StarVLA conda Python
# $(which python) automatically picks up the Python from your currently activated conda env
# Make sure you've run `conda activate starVLA` before executing this script
export star_vla_python=$(which python)
your_ckpt=results/Checkpoints/xxx.pt   # Replace with your checkpoint path
gpu_id=0
port=5694

# export DEBUG=true
CUDA_VISIBLE_DEVICES=$gpu_id ${star_vla_python} deployment/model_server/server_policy.py \
    --ckpt_path ${your_ckpt} \
    --port ${port} \
    --use_bf16
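
Because the client and server ports must agree, a quick reachability check before starting the evaluation loop can save debugging time. The sketch below uses only the Python standard library. `server_reachable` is a hypothetical helper, not part of StarVLA, and it only confirms that something accepts TCP connections on the port, not that it is the policy server.

```python
import socket


def server_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demo: open a throwaway listener on a free local port and probe it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))        # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

reachable_while_open = server_reachable("127.0.0.1", demo_port)
listener.close()
```

In an evaluation script, the same call would take the host and port passed to `WebsocketClientPolicy`, for example `server_reachable("127.0.0.1", 5694)` for the launch script above.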

Notes

Ensure every field in `example` is JSON-serializable or convertible to a serializable type (lists, floats, ints, strings); convert custom objects explicitly.
Images must be sent as `np.ndarray`. Convert `PIL.Image` to `np.ndarray` before transmission, and convert back on the server with `to_pil_preserve` (`from starVLA.model.utils import to_pil_preserve`) if required.
Keep auxiliary metadata (episode IDs, timestamps, etc.) in dedicated keys so the framework can forward or log them without collisions.
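
One way to catch payload problems early is to scan `example` before sending it. The sketch below is a hypothetical pre-flight check, not the serializer StarVLA uses: it accepts plain scalars, lists, tuples, string-keyed dicts, and anything array-like (detected by a `tolist` method, which `np.ndarray` provides), and reports the path of every other field.

```python
from typing import Any, List

_SCALARS = (str, int, float, bool, type(None))


def find_unsendable(value: Any, path: str = "example") -> List[str]:
    """Return the paths of fields that are neither plain data nor array-like."""
    bad: List[str] = []
    if isinstance(value, _SCALARS) or hasattr(value, "tolist"):
        return bad                                  # scalar or array-like: fine
    if isinstance(value, dict):
        for key, item in value.items():
            if not isinstance(key, str):
                bad.append(f"{path}[{key!r}] (non-string key)")
            bad.extend(find_unsendable(item, f"{path}[{key!r}]"))
        return bad
    if isinstance(value, (list, tuple)):
        for i, item in enumerate(value):
            bad.extend(find_unsendable(item, f"{path}[{i}]"))
        return bad
    return [path]                                   # custom object: convert explicitly


class CustomPose:
    """Stands in for any custom object that was not converted."""


example = {
    "image": [[[0, 0, 0]]],                         # placeholder for a list of image arrays
    "lang": "open the drawer",
    "meta": {"episode_id": 3, "timestamp": 12.5},   # auxiliary data in its own key
    "pose": CustomPose(),                           # flagged until converted
}
problems = find_unsendable(example)
```

Here only the unconverted `pose` field is reported; replacing it with a list of floats clears the check.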

PolicyClient Interface Design

Policy Interface

The *2model_interface.py interface wraps and abstracts variations that originate in the simulation or real-world environment. It also supports user-defined controllers, for example converting delta actions to absolute joint positions. Refer to the implementations for different benchmarks in examples when building your own deployment.
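
As a rough illustration of this wrapping idea, the sketch below maps one benchmark's observation keys onto the `image` / `lang` protocol and applies an optional user-defined controller. `BenchAdapter`, `FakePolicy`, `add_to_joints`, and all observation keys are hypothetical names, not taken from the StarVLA examples. The sketch also assumes each batch item is a single action vector.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

Controller = Callable[[List[float], Dict[str, Any]], List[float]]


class BenchAdapter:
    """Hypothetical adapter: benchmark observation -> policy call -> executable action."""

    def __init__(self, policy: Any, camera_keys: Sequence[str],
                 instruction_key: str, controller: Optional[Controller] = None) -> None:
        self.policy = policy                  # anything exposing predict_action(example)
        self.camera_keys = list(camera_keys)  # observation keys holding camera frames
        self.instruction_key = instruction_key
        self.controller = controller          # optional user-defined post-processing

    def step(self, obs: Dict[str, Any]) -> List[float]:
        example = {
            "image": [obs[key] for key in self.camera_keys],
            "lang": obs[self.instruction_key],
        }
        result = self.policy.predict_action(example)
        action = result["normalized_actions"][0]   # first item in the batch
        if self.controller is not None:
            action = self.controller(action, obs)
        return action


class FakePolicy:
    """Test double that records the request and returns a fixed action."""

    def __init__(self) -> None:
        self.last_example: Optional[Dict[str, Any]] = None

    def predict_action(self, example: Dict[str, Any]) -> Dict[str, Any]:
        self.last_example = example
        return {"normalized_actions": [[0.5, -0.5]]}


def add_to_joints(action: List[float], obs: Dict[str, Any]) -> List[float]:
    """Example controller: treat the action as a delta on the current joint positions."""
    return [q + d for q, d in zip(obs["joint_pos"], action)]


policy = FakePolicy()
adapter = BenchAdapter(policy, ["front_rgb", "wrist_rgb"], "task", controller=add_to_joints)
obs = {"front_rgb": "frame-a", "wrist_rgb": "frame-b",
       "task": "stack the blocks", "joint_pos": [1.0, 2.0]}
action = adapter.step(obs)
```

Keeping the key mapping and the controller in the adapter means the policy object only ever sees the generic `image` / `lang` payload.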

FAQ

Q: Why do examples contain files such as model2{bench}_client.py?

A: They encapsulate benchmark-specific alignment, e.g., action ensembling, converting delta actions to absolute actions, or bridging simulator quirks, so the model server can stay generic.

Q: Why does the model expect PIL images while the transport uses ndarray?

A: WebSocket payloads do not serialize PIL objects directly. Convert to np.ndarray on the client side and restore to PIL inside the framework if the model requires it.

Feedback on environment-specific needs is welcome via issues.