# BEHAVIOR-1K Evaluation
BEHAVIOR-1K is a household-task simulation benchmark from Stanford featuring 1,000 everyday activities (cooking, cleaning, organizing, etc.). We follow the 2025 BEHAVIOR Challenge structure to train and evaluate on 50 full-length household tasks. The benchmark uses the R1Pro humanoid robot (dual arms + base + torso, 23-dimensional action space).
The evaluation process consists of two main parts:
- Setting up the `behavior` environment and dependencies.
- Running the evaluation by launching services in both the `starVLA` and `behavior` environments.
## BEHAVIOR Evaluation

### 1. Environment Setup

To set up the conda environment for `behavior`:
```shell
git clone https://github.com/StanfordVL/BEHAVIOR-1K.git
conda create -n behavior python=3.10 -y
conda activate behavior
cd BEHAVIOR-1K
pip install "setuptools<=79"

# --omnigibson: Install OmniGibson simulator (BEHAVIOR's physics engine)
# --bddl: Install BDDL (Behavior Domain Definition Language for task definitions)
# --joylo: Install JoyLo (teleoperation control interface)
# --dataset: Download BEHAVIOR dataset assets (scenes, object models, etc.)
./setup.sh --omnigibson --bddl --joylo --dataset

conda install -c conda-forge libglu
pip install rich omegaconf hydra-core msgpack websockets av pandas google-auth
```

Also, in the `starVLA` environment:
```shell
pip install websockets
```
### 2. Evaluation Workflow

Steps:
- Download the checkpoint
- Choose one of the scripts below according to your needs
#### (A) Parallel Evaluation Script
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash examples/Behavior/start_parallel_eval.sh
```

Before running `start_parallel_eval.sh`, set the following paths:
- `star_vla_python`: Python interpreter for the StarVLA environment
- `sim_python`: Python interpreter for the Behavior environment
- `TASKS_JSONL_PATH`: Task description file downloaded from the training dataset (included at `examples/Behavior/tasks.jsonl`)
- `BEHAVIOR_ASSET_PATH`: Local path to the BEHAVIOR assets (defaults to `BEHAVIOR-1K/datasets` after installing with `./setup.sh`)
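Before launching, it can help to sanity-check the configured paths. The sketch below is illustrative only: the path values are placeholders, and the assumption that `tasks.jsonl` contains one JSON object per line follows from the `.jsonl` extension rather than from the repository.

```python
import json
import os

# Placeholder values; substitute your actual configured paths.
tasks_jsonl_path = "examples/Behavior/tasks.jsonl"
behavior_asset_path = "BEHAVIOR-1K/datasets"

def load_tasks(path):
    """Parse a .jsonl file: one JSON task description per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

if os.path.isfile(tasks_jsonl_path) and os.path.isdir(behavior_asset_path):
    print(f"Found {len(load_tasks(tasks_jsonl_path))} tasks")
else:
    print("Check TASKS_JSONL_PATH / BEHAVIOR_ASSET_PATH before launching")
```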
#### (B) Debugging with Separate Terminals
For ease of debugging, you can also start the client (evaluation environment) and the server (policy) in two separate terminals:
```shell
bash examples/Behavior/start_server.sh
bash examples/Behavior/start_client.sh
```

These debugging scripts run the evaluation on the train set.
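The client and server communicate over a websocket connection (which is why `websockets` is installed in both environments). The sketch below illustrates one plausible message framing; the `encode_action`/`decode_action` helpers and the JSON message shape are assumptions for illustration, not the repository's actual protocol (the installed `msgpack` would be a natural binary alternative to JSON here).

```python
import json

def encode_action(action):
    """Serialize a flat action vector into a wire message (illustrative framing)."""
    return json.dumps({"type": "action", "data": list(action)})

def decode_action(message):
    """Recover the action vector from a wire message."""
    payload = json.loads(message)
    if payload.get("type") != "action":
        raise ValueError(f"unexpected message type: {payload.get('type')}")
    return payload["data"]

# Round-trip a zero action for the 23-dim R1Pro action space
wire = encode_action([0.0] * 23)
assert decode_action(wire) == [0.0] * 23
```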
#### (C) Per-Task Evaluation (Memory-Safe)
To prevent memory overflow, we provide another script, `start_parallel_eval_per_task.sh`:
```shell
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash examples/Behavior/start_parallel_eval_per_task.sh
```

- The script runs the evaluation for each task in `INSTANCE_NAMES` iteratively.
- For each task, it allocates all instances from `TEST_EVAL_INSTANCE_IDS` across the available GPUs.
- It waits for the previous task to finish before proceeding to the next task.
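The per-task scheduling described above can be sketched as follows. `assign_instances` is a hypothetical helper, not code from the repository, showing one way to spread a task's instance IDs round-robin across the visible GPUs; joining all of a task's processes before starting the next task is what bounds simulator memory use.

```python
def assign_instances(instance_ids, num_gpus):
    """Round-robin instance IDs onto GPU indices (illustrative only)."""
    buckets = {gpu: [] for gpu in range(num_gpus)}
    for i, inst in enumerate(instance_ids):
        buckets[i % num_gpus].append(inst)
    return buckets

# e.g. 10 instances spread over 4 GPUs
plan = assign_instances(list(range(10)), 4)
print(plan[0])  # GPU 0 gets instances 0, 4, 8
```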
## Wrapper Types

- `RGBLowResWrapper`: Uses only RGB as the visual observation, at a camera resolution of 224×224. Using low-res RGB alone speeds up the simulator and reduces evaluation time. This wrapper is allowed in the standard track.
- `DefaultWrapper`: Uses the default observation config from data collection (RGB + depth + segmentation; 720p for the head camera and 480p for the wrist camera). This wrapper is also allowed in the standard track, but evaluation is considerably slower than with `RGBLowResWrapper`.
- `RichObservationWrapper`: Loads additional observation modalities, such as normals and flow, as well as privileged task information. This wrapper may only be used in the privileged-information track.
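To see why the low-res wrapper is faster, compare per-frame pixel counts. This is a back-of-the-envelope calculation assuming 720p means 1280×720 (the usual convention), before even counting the extra depth and segmentation channels:

```python
# Pixels rendered per frame for the head camera alone
default_head = 1280 * 720  # DefaultWrapper, 720p head camera
low_res = 224 * 224        # RGBLowResWrapper

print(default_head // low_res)  # the 720p head camera renders ~18x more pixels
```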
## Action Dimensions
BEHAVIOR has action dim = 23:
```python
"R1Pro": {
    "base": np.s_[0:3],            # Indices 0-2
    "torso": np.s_[3:7],           # Indices 3-6
    "left_arm": np.s_[7:14],       # Indices 7-13
    "left_gripper": np.s_[14:15],  # Index 14
    "right_arm": np.s_[15:22],     # Indices 15-21
    "right_gripper": np.s_[22:23], # Index 22
}
```
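These slices can be applied directly to a flat 23-dim action vector. Below is a minimal sketch using plain Python `slice` objects, which are equivalent to the `np.s_` expressions above without requiring NumPy; the `R1PRO_SLICES` name is illustrative, not from the repository.

```python
# Plain-Python equivalents of the np.s_ slices above
R1PRO_SLICES = {
    "base": slice(0, 3),
    "torso": slice(3, 7),
    "left_arm": slice(7, 14),
    "left_gripper": slice(14, 15),
    "right_arm": slice(15, 22),
    "right_gripper": slice(22, 23),
}

action = [0.0] * 23  # one flat action vector
parts = {name: action[s] for name, s in R1PRO_SLICES.items()}
print({name: len(v) for name, v in parts.items()})
# {'base': 3, 'torso': 4, 'left_arm': 7, 'left_gripper': 1, 'right_arm': 7, 'right_gripper': 1}
```

The per-part lengths sum to 23, matching the action dimension.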
## Video Saving

Videos are saved in the format `{task_name}_{idx}_{epi}.mp4`, where `idx` is the instance number and `epi` is the episode number.
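A small sketch of composing and parsing these filenames; the helper names are illustrative. Note that splitting from the right handles task names that themselves contain underscores.

```python
def video_name(task_name, idx, epi):
    """Compose the {task_name}_{idx}_{epi}.mp4 filename."""
    return f"{task_name}_{idx}_{epi}.mp4"

def parse_video_name(filename):
    """Split from the right so underscores inside task_name survive."""
    stem = filename.removesuffix(".mp4")
    task_name, idx, epi = stem.rsplit("_", 2)
    return task_name, int(idx), int(epi)

print(parse_video_name(video_name("clean_the_kitchen", 2, 0)))
# ('clean_the_kitchen', 2, 0)
```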
## Common Issues
**Segmentation fault (core dumped)**: A likely cause is that Vulkan was not installed successfully. Check this link.
**ImportError: libGL.so.1: cannot open shared object file**:

```shell
apt-get install ffmpeg libsm6 libxext6 -y
```