RoboCasa Evaluation

RoboCasa is a large-scale household simulation benchmark. Here we use the GR1 Tabletop Tasks subset, featuring 24 tabletop Pick-and-Place tasks performed by a Fourier GR1 humanoid robot (upper body, dual arms).

This document provides instructions for reproducing our experimental results.

The evaluation process consists of two main parts:

1. Setting up the robocasa environment and dependencies.
2. Running the evaluation by launching services in both the starVLA and robocasa environments.

We have verified that this workflow runs successfully on NVIDIA A100 GPUs.

Experimental Results

Task	GR00T-N1.6	Qwen3GR00T	Qwen3PI	Qwen3OFT	Qwen3FAST
PnP Bottle To Cabinet Close	51.5	46.0	26.0	30.0	38.0
PnP Can To Drawer Close	13.0	80.0	62.0	76.0	44.0
PnP Cup To Drawer Close	8.5	54.0	42.0	44.0	56.0
PnP Milk To Microwave Close	14.0	48.0	50.0	44.0	44.0
PnP Potato To Microwave Close	41.5	28.0	42.0	32.0	14.0
PnP Wine To Cabinet Close	16.5	46.0	32.0	36.0	14.0
PnP Novel From Cuttingboard To Basket	58.0	48.0	40.0	50.0	54.0
PnP Novel From Cuttingboard To Cardboardbox	46.5	40.0	46.0	40.0	42.0
PnP Novel From Cuttingboard To Pan	68.5	68.0	60.0	70.0	58.0
PnP Novel From Cuttingboard To Pot	65.0	52.0	40.0	54.0	58.0
PnP Novel From Cuttingboard To Tieredbasket	46.5	56.0	44.0	38.0	40.0
PnP Novel From Placemat To Basket	58.5	42.0	44.0	32.0	36.0
PnP Novel From Placemat To Bowl	57.5	44.0	52.0	58.0	38.0
PnP Novel From Placemat To Plate	63.0	48.0	50.0	52.0	42.0
PnP Novel From Placemat To Tieredshelf	28.5	18.0	28.0	24.0	18.0
PnP Novel From Plate To Bowl	57.0	60.0	52.0	60.0	52.0
PnP Novel From Plate To Cardboardbox	43.5	50.0	40.0	50.0	30.0
PnP Novel From Plate To Pan	51.0	54.0	36.0	66.0	48.0
PnP Novel From Plate To Plate	78.7	70.0	48.0	68.0	50.0
PnP Novel From Tray To Cardboardbox	51.5	38.0	34.0	44.0	28.0
PnP Novel From Tray To Plate	71.0	56.0	64.0	56.0	34.0
PnP Novel From Tray To Pot	64.5	50.0	44.0	62.0	46.0
PnP Novel From Tray To Tieredbasket	57.0	36.0	50.0	54.0	36.0
PnP Novel From Tray To Tieredshelf	31.5	16.0	28.0	30.0	16.0
Average	47.6	47.8	43.9	48.8	39.0

Note: All values are success rates in percentage (%). A single model was trained for all 24 tasks. Results are reported over 50 rollouts per task.
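The arithmetic behind the table can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the helper names are hypothetical, and it assumes the Average row is the unweighted mean of the 24 per-task success rates, which is consistent with the Qwen3FAST column used here.

```python
def success_rate(num_successes: int, num_rollouts: int = 50) -> float:
    """Success rate in percent over a fixed number of rollouts."""
    return 100.0 * num_successes / num_rollouts


def average_success(rates: list[float]) -> float:
    """Unweighted mean over tasks (assumed to be how the Average row is computed)."""
    return sum(rates) / len(rates)


# Qwen3FAST column from the table above, in row order.
qwen3_fast = [
    38.0, 44.0, 56.0, 44.0, 14.0, 14.0, 54.0, 42.0, 58.0, 58.0, 40.0, 36.0,
    38.0, 42.0, 18.0, 52.0, 30.0, 48.0, 50.0, 28.0, 34.0, 46.0, 36.0, 16.0,
]

print(success_rate(19))             # 19 of 50 successful rollouts -> 38.0
print(average_success(qwen3_fast))  # -> 39.0
```

With 50 rollouts per task, each additional success changes a task's rate by 2 percentage points.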

RoboCasa Evaluation

0. Download Checkpoints

First, download the checkpoints from:

1. Environment Setup

To set up the environment, please first follow the official RoboCasa installation guide to install the base robocasa-gr1-tabletop-tasks environment.

Then install tyro, a command-line argument parsing library required by the evaluation scripts:

pip install tyro

2. Evaluation Workflow

Step 1. Start the server (starVLA environment)

In the first terminal, activate the starVLA conda environment and run:

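# Checkpoint downloaded in step 0 (example path)
your_ckpt=StarVLA/Qwen3-VL-OFT-Robocasa/checkpoints/steps_90000_pytorch_model.pt

python deployment/model_server/server_policy.py \
    --ckpt_path ${your_ckpt} \
    --port 5678 \
    --use_bf16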
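Loading the checkpoint can take a while, so it may help to confirm the server is reachable before starting the simulation. The sketch below is not part of starVLA. It assumes the policy server accepts plain TCP connections on localhost at the port passed via `--port`, and the helper name `wait_for_port` is hypothetical.

```python
import socket
import time


def wait_for_port(host: str = "127.0.0.1", port: int = 5678,
                  timeout_s: float = 300.0, poll_interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect only shows that something is listening,
            # not that the model has finished loading.
            with socket.create_connection((host, port), timeout=2.0):
                return True
        except OSError:
            time.sleep(poll_interval_s)  # not accepting connections yet; retry
    return False
```

Calling `wait_for_port(port=5678)` in the second terminal before launching the simulation avoids starting the client against a server that is still initializing.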

Step 2. Start the simulation (robocasa environment)

In the second terminal, activate the robocasa conda environment and run:

export PYTHONPATH=$(pwd):${PYTHONPATH}
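your_ckpt=StarVLA/Qwen3-VL-OFT-Robocasa/checkpoints/steps_90000_pytorch_model.pt
# Set env_name (the task to evaluate) and video_out_path (output location for rollout videos) before running.

python examples/Robocasa_tabletop/eval_files/simulation_env.py \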
   --args.env_name ${env_name} \
   --args.port 5678 \
   --args.n_episodes 50 \
   --args.n_envs 1 \
   --args.max_episode_steps 720 \
   --args.n_action_steps 12 \
   --args.video_out_path ${video_out_path} \
   --args.pretrained_path ${your_ckpt}
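To evaluate several tasks one after another against the same server, the command above can be assembled programmatically. The following is a hedged sketch rather than a script shipped with the repository: `build_eval_command` is a hypothetical helper, the task names are placeholders to be replaced with real RoboCasa environment names, and the `results/videos` output directory is an assumption.

```python
import shlex


def build_eval_command(env_name: str, ckpt_path: str, video_out_path: str,
                       port: int = 5678, n_episodes: int = 50) -> list[str]:
    """Assemble the simulation_env.py invocation shown above as an argv list."""
    return [
        "python", "examples/Robocasa_tabletop/eval_files/simulation_env.py",
        "--args.env_name", env_name,
        "--args.port", str(port),
        "--args.n_episodes", str(n_episodes),
        "--args.n_envs", "1",
        "--args.max_episode_steps", "720",
        "--args.n_action_steps", "12",
        "--args.video_out_path", video_out_path,
        "--args.pretrained_path", ckpt_path,
    ]


# Placeholder task names: replace with the real RoboCasa environment names.
tasks = ["<env_name_1>", "<env_name_2>"]
ckpt = "StarVLA/Qwen3-VL-OFT-Robocasa/checkpoints/steps_90000_pytorch_model.pt"

for task in tasks:
    cmd = build_eval_command(task, ckpt, video_out_path=f"results/videos/{task}")
    # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building an argv list instead of a shell string sidesteps quoting problems when environment names or paths contain special characters.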

Batch Evaluation (Optional)

If multiple GPUs are available, you can use the batch evaluation script:

bash examples/Robocasa_tabletop/batch_eval_args.sh

Note: Make sure to specify the correct checkpoint path in batch_eval_args.sh.
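One common way to spread the 24 tasks over several GPUs is round-robin sharding, with one policy server per GPU on its own port. The sketch below only illustrates that idea. It is not the contents of batch_eval_args.sh, and the function name, the port scheme (`base_port + gpu index`), and the task names are all assumptions.

```python
def shard_tasks(tasks: list[str], num_gpus: int, base_port: int = 5678) -> list[dict]:
    """Distribute tasks round-robin over GPUs, giving each GPU a distinct port."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    shards = [{"gpu": g, "port": base_port + g, "tasks": []} for g in range(num_gpus)]
    for i, task in enumerate(tasks):
        shards[i % num_gpus]["tasks"].append(task)
    return shards


# Placeholder names standing in for the 24 tabletop tasks.
tasks = [f"task_{i:02d}" for i in range(24)]

for shard in shard_tasks(tasks, num_gpus=4):
    # Each shard would run with CUDA_VISIBLE_DEVICES set to shard["gpu"].
    print(shard["gpu"], shard["port"], len(shard["tasks"]))
```

With 4 GPUs, each shard receives 6 tasks. Distinct ports keep each simulation client paired with the server on its own GPU.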

Reproduce Training Results

Step 0: Download the training dataset

Download the PhysicalAI-Robotics-GR00T-X-Embodiment-Sim dataset from HuggingFace into the playground/Datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim directory.

To download only the relevant fine-tuning folders, refer to the instructions in the GR00T-N1.5 repository.

Alternatively, use the following script to download the *_1000 folders:

python examples/Robocasa_tabletop/download_gr00t_ft_data.py
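For readers who want to see what a filtered download might look like, here is a minimal sketch using `snapshot_download` from `huggingface_hub`. It is not the contents of download_gr00t_ft_data.py. The repository id is inferred from the target directory above, and the `*_1000/*` pattern assumes the fine-tuning folders sit at the top level of the dataset repository. Both are assumptions to verify before use.

```python
from fnmatch import fnmatch

# Assumption: repo id inferred from the local directory layout.
REPO_ID = "nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim"
LOCAL_DIR = "playground/Datasets/nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim"
# Assumption: fine-tuning folders end in _1000 and sit at the repo top level.
ALLOW_PATTERNS = ["*_1000/*"]


def is_selected(repo_path: str) -> bool:
    """Check whether a file path inside the repo matches the download filter."""
    return any(fnmatch(repo_path, pattern) for pattern in ALLOW_PATTERNS)


def download_finetune_data() -> str:
    """Download only the matching folders and return the local directory path."""
    # Imported lazily so is_selected() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",
        allow_patterns=ALLOW_PATTERNS,
        local_dir=LOCAL_DIR,
    )
```

`is_selected` can be used to preview which files the pattern would match before calling `download_finetune_data()`.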

Step 1: Start Training

Select different datasets by modifying the data_mix parameter. The following script fine-tunes on the *_1000 datasets:

bash examples/Robocasa_tabletop/train_files/run_robocasa.sh