The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?
Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities.
- Preprint version released
- ScanReQA dataset released
- Evaluation code and cached results released
conda create -n scanreqa python=3.9 -y
conda activate scanreqa
Then install the required packages:
pip install -r requirements.txt
Notes:
- evaluate/em_acc_recall.py uses the bundled meteor-1.5.jar for the METEOR metric. Java must be installed on the machine if you want to compute METEOR.
Download dataset from here. The dataset is formatted as follows:
Each item in Rel_SpatialQA.json corresponds to one relative spatial reasoning question and shares several fields with Abs_SpatialQA.json, while adding relation-specific annotations:
- question_id: Unique sample identifier, for example val-scene0011-2-0.
- scene_id: ScanNet scene identifier, for example scene0011_00.
- question: The relative spatial reasoning question.
- answers: A list of valid textual answers.
- referred_obj_utterance: The natural-language sentence that states the original relation.
- referred_obj_name: The name of the referred target object.
- referred_obj_id: The integer object id of the referred target object.
- object_ids: A list of scene object ids involved in the question context.
- object_names: A list of object category names aligned with object_ids.
- object_captions: A list of textual descriptions aligned with object_ids and object_names.
- spatial_triplet: A three-element list in the form [target, relation, anchor].
- spatial_triplet_reversed: The reversed relational form corresponding to spatial_triplet.
- reversible: An integer flag indicating whether the relation can be meaningfully reversed.
Example:
{
  "answers": ["the brown wooden chair"],
  "object_ids": [2, 9],
  "object_names": ["table", "chair"],
  "question": "What is at the head of the long rectangular table?",
  "question_id": "val-scene0011-2-0",
  "scene_id": "scene0011_00",
  "object_captions": [
    "Caption for object 2",
    "Caption for object 9"
  ],
  "referred_obj_utterance": "The brown wooden chair is placed at the head of the long rectangular table.",
  "referred_obj_name": "the brown wooden chair.",
  "referred_obj_id": 9,
  "spatial_triplet": ["the brown wooden chair", "at the head of", "the long rectangular table"],
  "spatial_triplet_reversed": ["the long rectangular table", "at the foot of", "the brown wooden chair"],
  "reversible": 1
}
Each item in Abs_SpatialQA.json corresponds to one absolute spatial reasoning question and contains the following fields:
- question_id: Unique sample identifier, for example val-scene0011-2.
- scene_id: ScanNet scene identifier, for example scene0011_00.
- question: The full multiple-choice absolute spatial reasoning question.
- answers: A list of valid answers. In the current release this is typically a single option such as ["B"].
- referred_obj_utterance: The natural-language reference sentence that describes the target object.
- referred_obj_name: The textual name of the target object mentioned in the question context.
- referred_obj_id: The integer object id of the referred target object in the scene.
- referred_obj_ans: The correct multiple-choice option for the referred object, such as B.
- object_ids: A list of scene object ids that are explicitly involved in the question context.
- object_names: A list of object category names aligned with object_ids.
- object_captions: A list of textual descriptions aligned with object_ids and object_names.
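The field descriptions for both files imply a few invariants that can be checked before the data is used. The following is a minimal sketch and not part of the repository: `check_item` is a hypothetical helper, and the rules it enforces (aligned list lengths, three-element triplets) are read off the field descriptions and the Rel_SpatialQA example rather than taken from the released code.

```python
def check_item(item):
    """Return a list of problems found in one dataset item (hypothetical helper)."""
    problems = []
    # Fields that both files share according to the field descriptions.
    for key in ("question_id", "scene_id", "question", "answers",
                "object_ids", "object_names", "object_captions"):
        if key not in item:
            problems.append(f"missing field: {key}")
    # object_names and object_captions are described as aligned with object_ids.
    n = len(item.get("object_ids", []))
    for key in ("object_names", "object_captions"):
        if key in item and len(item[key]) != n:
            problems.append(f"{key} is not aligned with object_ids")
    # Relation-specific fields only appear in Rel_SpatialQA.json.
    for key in ("spatial_triplet", "spatial_triplet_reversed"):
        if key in item and len(item[key]) != 3:
            problems.append(f"{key} should have three elements")
    return problems


item = {
    "question_id": "val-scene0011-2-0",
    "scene_id": "scene0011_00",
    "question": "What is at the head of the long rectangular table?",
    "answers": ["the brown wooden chair"],
    "object_ids": [2, 9],
    "object_names": ["table", "chair"],
    "object_captions": ["Caption for object 2", "Caption for object 9"],
    "spatial_triplet": ["the brown wooden chair", "at the head of",
                        "the long rectangular table"],
}
print(check_item(item))  # prints []
```

Running the helper over every item of a loaded file and printing only the non-empty results gives a quick view of malformed entries.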
Example:
{
  "answers": ["B"],
  "object_ids": [2, 9],
  "object_names": ["table", "chair"],
  "question": "The brown wooden chair is placed at the head of the long rectangular table. What's the central coordinate of the the brown wooden chair.Please select the option closest to the brown wooden chair from the following coordinate options. Your answer can only be one of A, B, C, and D.",
  "question_id": "val-scene0011-2",
  "scene_id": "scene0011_00",
  "object_captions": [
    "Caption for object 2",
    "Caption for object 9"
  ],
  "referred_obj_utterance": "The brown wooden chair is placed at the head of the long rectangular table.",
  "referred_obj_name": "the brown wooden chair.",
  "referred_obj_id": 9,
  "referred_obj_ans": "B"
}
We provide the evaluation code and example outputs for 3D LLMs. To evaluate a custom model, please ensure that its output format is consistent with the provided examples.
Each output file is a json list. Each item in the list corresponds to one evaluation example and has the following fields:
- source: The unique sample identifier. Example: val-scene0011-0.
- scene_id: The ScanNet scene identifier for the sample. Example: scene0011_00.
- instruction: The full input prompt given to the model, usually including the USER question and the ASSISTANT prefix.
- response_gt: A list of ground-truth answers. Multiple equivalent reference answers may be provided for the same question.
- response_pred: The model prediction for the sample.
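A custom model's output file could be checked against these fields before it is passed to the evaluation scripts. The sketch below is illustrative only: `validate_outputs` is a hypothetical helper, not a function from this repository, and treating `response_pred` as a string is an assumption based on the provided example.

```python
import json

REQUIRED = ("source", "scene_id", "instruction", "response_gt", "response_pred")


def validate_outputs(items):
    """Return (index, message) pairs for items that deviate from the described format."""
    errors = []
    for i, item in enumerate(items):
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            errors.append((i, "missing: " + ", ".join(missing)))
            continue
        if not isinstance(item["response_gt"], list):
            errors.append((i, "response_gt must be a list"))
        # Assumption: predictions are stored as plain strings.
        if not isinstance(item["response_pred"], str):
            errors.append((i, "response_pred must be a string"))
    return errors


# An output file is a JSON list; a literal string stands in for the file here.
outputs = json.loads("""
[
  {"source": "val-scene0011-0", "scene_id": "scene0011_00",
   "instruction": "USER: What color is the chair in the kitchen? ASSISTANT:",
   "response_gt": ["dark brown", "brown"], "response_pred": "brown"},
  {"source": "val-scene0011-1", "scene_id": "scene0011_00",
   "instruction": "USER: Where is the chair? ASSISTANT:",
   "response_gt": "near the table", "response_pred": "near the table"}
]
""")
print(validate_outputs(outputs))  # prints [(1, 'response_gt must be a list')]
```

For a real file, the literal can be replaced by `json.load` on the opened output file.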
Example:
{
  "source": "val-scene0011-0",
  "scene_id": "scene0011_00",
  "instruction": "USER: What color is the chair in the kitchen? ASSISTANT:",
  "response_gt": ["dark brown", "brown"],
  "response_pred": "brown"
}
To run the evaluation code on cached output, download output_cache to the current directory and run the following code.
Evaluate different 3D LLMs with EM and refined-EM on spatial QA benchmarks. Report CIDEr, BLEU-4, METEOR, and ROUGE at the same time.
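For orientation, a plain exact-match (EM) score over an output file could be computed roughly as follows. This is a hedged sketch, not the logic of evaluate/em_acc_recall.py: the helper names and the normalization (lowercasing, punctuation removal, whitespace collapsing) are assumptions, and it does not reproduce refined-EM or the other reported metrics.

```python
import re
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred, gts):
    """True if the normalized prediction equals any normalized reference answer."""
    return normalize(pred) in {normalize(g) for g in gts}


def em_score(items):
    """Percentage of items whose response_pred matches one entry of response_gt."""
    if not items:
        return 0.0
    hits = sum(exact_match(it["response_pred"], it["response_gt"]) for it in items)
    return 100.0 * hits / len(items)


items = [
    {"response_gt": ["dark brown", "brown"], "response_pred": "Brown."},
    {"response_gt": ["on the table"], "response_pred": "under the table"},
]
print(em_score(items))  # prints 50.0
```

Because `response_gt` may hold several equivalent answers, a prediction counts as correct when it matches any one of them.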
ScanQA:
python evaluate/em_acc_recall.py \
    evaluate-main-results \
    --repo-root . \
    --target-split Eval_result_ScanQA \
    --modal PC \
    --suffix ''
SQA3D:
python evaluate/em_acc_recall.py \
    evaluate-main-results \
    --repo-root . \
    --target-split Eval_result_SQA3D \
    --modal PC \
    --suffix ''
RelSpatialQA:
python evaluate/em_acc_recall.py \
    evaluate-main-results \
    --repo-root . \
    --target-split Eval_result_RespatialQA \
    --modal PC \
    --suffix ''
Measure accuracy and recall on RelSpatialQA.
- Prepare Eval_result_RespatialQA/PC and Eval_result_ScanQA/PC.
- Run the RelSpatialQA command below.
python evaluate/em_acc_recall.py \
    evaluate-main-results \
    --repo-root . \
    --target-split Eval_result_RespatialQA \
    --scanqa-split Eval_result_ScanQA \
    --modal PC \
    --suffix ''
Measure accuracy on AbsSpatialQA.
- Prepare Eval_result_AbsSpatialQA/PC.
- Run evaluate-main-results.
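Since AbsSpatialQA answers are single option letters, accuracy reduces to comparing the predicted option with the reference option. The sketch below illustrates that idea under stated assumptions: `option_accuracy` is a hypothetical helper, and the strict matching rule (the whole response must be one letter, optionally wrapped in parentheses or followed by punctuation) is not necessarily what the evaluation script does.

```python
def option_accuracy(items):
    """Percentage of items whose predicted option letter is among the references."""
    if not items:
        return 0.0
    correct = 0
    for it in items:
        # Strip whitespace and simple wrappers such as "(b)." before comparing.
        pred = it["response_pred"].strip().strip("().:").upper()
        refs = {g.strip().upper() for g in it["response_gt"]}
        correct += pred in refs
    return 100.0 * correct / len(items)


items = [
    {"response_gt": ["B"], "response_pred": "B"},
    {"response_gt": ["B"], "response_pred": "(b)."},
    {"response_gt": ["B"], "response_pred": "C"},
    # Strict matching counts free-form sentences as wrong.
    {"response_gt": ["B"], "response_pred": "The answer is B"},
]
print(option_accuracy(items))  # prints 50.0
```

A more lenient parser would credit the last item, so the choice of matching rule should be stated next to any reported number.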
python evaluate/em_acc_recall.py \
    evaluate-main-results \
    --repo-root . \
    --target-split Eval_result_AbsSpatialQA \
    --modal PC \
    --suffix ''
Inspect the sliding-window attention distribution from response tokens to point-cloud tokens. Compare token-level attention patterns between successful and failed cases.
- Use output_cache/Attention_cache/{dataset}_shuffled_token.
- Use output_cache/Attention_eval_result/{dataset}.json.
- Run attention-on-tokens.
ScanQA:
python evaluate/vis_attention.py \
    attention-on-tokens \
    --dataset scanqa
RelSpatialQA:
python evaluate/vis_attention.py \
    attention-on-tokens \
    --dataset relspatialqa
Optional arguments:
- --max-cases: the maximum number of cases to evaluate
- --plot: plot the results
Measure whether the model places attention on the true target-object tokens.
- Choose the dataset.
- Run attention-on-target-token.
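One way to quantify attention on the target object is the share of attention mass that lands on its tokens. The following is a minimal sketch of that quantity, not the computation in evaluate/vis_attention.py: the function name, the array layout (response tokens by point-cloud tokens), and the availability of target token indices are all assumptions.

```python
import numpy as np


def target_attention_share(attn, target_idx):
    """Fraction of total attention mass that falls on the target-object tokens.

    attn: non-negative array of shape (num_response_tokens, num_point_tokens).
    target_idx: indices of the point-cloud tokens belonging to the target object.
    """
    attn = np.asarray(attn, dtype=float)
    total = attn.sum()
    if total == 0:
        return 0.0
    return float(attn[:, target_idx].sum() / total)


attn = np.array([[0.1, 0.6, 0.1, 0.2],
                 [0.2, 0.4, 0.2, 0.2]])
share = target_attention_share(attn, [1])
uniform = 1 / attn.shape[1]  # share expected if attention were spread evenly
print(round(share, 3), uniform)  # prints 0.5 0.25
```

Comparing the share with the uniform baseline indicates whether the target tokens receive more attention than chance.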
ScanQA:
python evaluate/vis_attention.py \
    attention-on-target-token \
    --dataset scanqa
RelSpatialQA:
python evaluate/vis_attention.py \
    attention-on-target-token \
    --dataset relspatialqa
Measure how much attention is assigned to sink tokens after shuffling. Compare sink-token attention between successful and failed cases.
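As a rough illustration of sink-token analysis, the sketch below flags tokens that receive far more attention than a uniform spread would give them and reports the share they absorb. The criterion is an assumption made for this example (a token is a sink if its mean received attention is at least `ratio` times the uniform level), as are the function name and array layout; the repository's own definition may differ.

```python
import numpy as np


def sink_attention_share(attn, ratio=3.0):
    """Return (sink token indices, share of attention mass they receive).

    attn: non-negative array of shape (num_response_tokens, num_point_tokens).
    ratio: assumed multiple of the uniform level 1 / num_point_tokens.
    """
    attn = np.asarray(attn, dtype=float)
    received = attn.mean(axis=0)            # mean attention received per token
    total = received.sum()
    if total == 0:
        return [], 0.0
    threshold = ratio / attn.shape[1]
    sink_idx = np.flatnonzero(received >= threshold)
    share = float(received[sink_idx].sum() / total)
    return sink_idx.tolist(), share


attn = np.array([[0.80, 0.05, 0.05, 0.05, 0.05],
                 [0.70, 0.10, 0.10, 0.05, 0.05]])
idx, share = sink_attention_share(attn)
print(idx, round(share, 3))  # prints [0] 0.75
```

Computing this share separately for successful and failed cases gives the kind of comparison described above.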
ScanQA:
python evaluate/vis_attention.py \
    attention-on-sink-token \
    --dataset scanqa
RelSpatialQA:
python evaluate/vis_attention.py \
    attention-on-sink-token \
    --dataset relspatialqa
Inspect the top decoded tokens from point-cloud tokens and response tokens across transformer layers. Analyze when semantic information becomes identifiable during spatial reasoning.
- Prepare a locally available Vicuna/LLaMA base model.
- Prepare a full cache directory that contains all_hidden_states.
- Run evaluate/logit_lens.py.
- Read the output from --save-dir/results.json.
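The general logit-lens idea is to project intermediate hidden states through the model's output embedding and read off the most likely tokens at each layer. The toy sketch below shows only that projection step with NumPy; it is not the code in evaluate/logit_lens.py, the function name and shapes are assumptions, and the final normalization layer that a real model applies before the output projection is omitted for brevity.

```python
import numpy as np


def logit_lens(hidden_states, unembed, top_k=1):
    """Decode each layer's hidden state with the output projection.

    hidden_states: (num_layers, hidden_dim) states for one token position.
    unembed: (vocab_size, hidden_dim) output embedding matrix.
    Returns, for every layer, the top_k vocabulary ids ranked by logit.
    """
    logits = hidden_states @ unembed.T          # (num_layers, vocab_size)
    order = np.argsort(-logits, axis=-1)        # descending by logit
    return order[:, :top_k].tolist()


# Toy setup: 3-token vocabulary whose embeddings are the unit vectors.
unembed = np.eye(3)
hidden_states = np.array([[0.5, 0.4, 0.1],    # early layer leans to token 0
                          [0.1, 0.2, 0.9]])   # later layer leans to token 2
print(logit_lens(hidden_states, unembed))     # prints [[0], [2]]
```

With a real model, the ids would be mapped back to strings with the tokenizer, and the layer at which the expected token first appears indicates when that information becomes identifiable.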
python evaluate/logit_lens.py \
    --base-model /path/to/base_model \
    --save-dir output_cache/logit_lens_scanqa \
    --eval-count 1
python evaluate/logit_lens.py \
    --base-model /path/to/base_model \
    --att-dir output_cache/Attention_cache/relspatialqa_shuffled_token_full \
    --save-dir output_cache/logit_lens_relspatialqa \
    --eval-count 1
This repository builds on several excellent open-source projects. We thank llava-interp for the attention and logit-lens analysis references, and embodied-generalist and 3D-LLM for the original 3D evaluation and model analysis codebases.
@article{zhang2025point,
title={The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?},
author={Zhang, Weichen and Peng, Ruiying and Gao, Chen and Fang, Jianjie and Zeng, Xin and Li, Kaiyuan and Wang, Ziyou and Cui, Jinqiang and Wang, Xin and Chen, Xinlei and others},
journal={arXiv preprint arXiv:2504.04540},
year={2025}
}