The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?

Despite some promising results, the advantages of point clouds over other modalities remain unclear. Moreover, existing 3D benchmarks are insufficient for fairly evaluating the ability of multimodal LLMs to comprehend spatial concepts. To address these challenges, we introduce ScanReQA, a 3D spatial reasoning benchmark encompassing text, vision, and point cloud modalities.

📢 News

Preprint version released
ScanReQA dataset released
Evaluation code and cached results released

1. Environment Setup

conda create -n scanreqa python=3.9 -y
conda activate scanreqa

Then install the required packages:

pip install -r requirements.txt

Notes:

evaluate/em_acc_recall.py uses the bundled meteor-1.5.jar for the METEOR metric. Java must be installed on the machine if you want to compute METEOR.

2. Dataset

Download the dataset from here. The dataset is formatted as follows:

2.1 `Rel_SpatialQA.json`

Each item in Rel_SpatialQA.json corresponds to one relative spatial reasoning question and shares several fields with Abs_SpatialQA.json, while adding relation-specific annotations:

question_id: Unique sample identifier, for example val-scene0011-2-0.
scene_id: ScanNet scene identifier, for example scene0011_00.
question: The relative spatial reasoning question.
answers: A list of valid textual answers.
referred_obj_utterance: The natural-language sentence that states the original relation.
referred_obj_name: The name of the referred target object.
referred_obj_id: The integer object id of the referred target object.
object_ids: A list of scene object ids involved in the question context.
object_names: A list of object category names aligned with object_ids.
object_captions: A list of textual descriptions aligned with object_ids and object_names.
spatial_triplet: A three-element list in the form [target, relation, anchor].
spatial_triplet_reversed: The reversed relational form corresponding to spatial_triplet.
reversible: An integer flag indicating whether the relation can be meaningfully reversed.

Example:

{
  "answers": ["the brown wooden chair"],
  "object_ids": [2, 9],
  "object_names": ["table", "chair"],
  "question": "What is at the head of the long rectangular table?",
  "question_id": "val-scene0011-2-0",
  "scene_id": "scene0011_00",
  "object_captions": [
    "Caption for object 2",
    "Caption for object 9"
  ],
  "referred_obj_utterance": "The brown wooden chair is placed at the head of the long rectangular table.",
  "referred_obj_name": "the brown wooden chair.",
  "referred_obj_id": 9,
  "spatial_triplet": ["the brown wooden chair", "at the head of", "the long rectangular table"],
  "spatial_triplet_reversed": ["the long rectangular table", "at the foot of", "the brown wooden chair"],
  "reversible": 1
}
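The alignment between object_ids, object_names, and object_captions can be checked with a short script. This is a minimal sketch and not part of the repository: the helper names `check_rel_item` and `name_of_referred_object` are hypothetical, the inline record is abridged from the example above, and the lookup assumes that referred_obj_id appears in object_ids, as it does in that example.

```python
import json

# Abridged copy of the example record, inlined so the sketch runs without the dataset.
SAMPLE = json.loads("""
{
  "question_id": "val-scene0011-2-0",
  "scene_id": "scene0011_00",
  "object_ids": [2, 9],
  "object_names": ["table", "chair"],
  "object_captions": ["Caption for object 2", "Caption for object 9"],
  "referred_obj_id": 9,
  "spatial_triplet": ["the brown wooden chair", "at the head of", "the long rectangular table"],
  "reversible": 1
}
""")


def check_rel_item(item):
    """Return a list of problems found in one Rel_SpatialQA item (hypothetical helper)."""
    problems = []
    expected = len(item["object_ids"])
    # object_names and object_captions are documented as aligned with object_ids.
    for key in ("object_names", "object_captions"):
        if len(item[key]) != expected:
            problems.append(f"{key} has {len(item[key])} entries, expected {expected}")
    # spatial_triplet is documented as [target, relation, anchor].
    if len(item["spatial_triplet"]) != 3:
        problems.append("spatial_triplet is not a three-element list")
    return problems


def name_of_referred_object(item):
    """Look up the category name of the referred object through the aligned lists.

    Assumption: referred_obj_id is present in object_ids.
    """
    index = item["object_ids"].index(item["referred_obj_id"])
    return item["object_names"][index]


print(check_rel_item(SAMPLE))
print(name_of_referred_object(SAMPLE))
```

To check the real file, load it with `json.load` and pass each item to `check_rel_item`.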

2.2 `Abs_SpatialQA.json`

Each item in Abs_SpatialQA.json corresponds to one absolute spatial reasoning question and contains the following fields:

question_id: Unique sample identifier, for example val-scene0011-2.
scene_id: ScanNet scene identifier, for example scene0011_00.
question: The full multiple-choice absolute spatial reasoning question.
answers: A list of valid answers. In the current release this is typically a single option such as ["B"].
referred_obj_utterance: The natural-language reference sentence that describes the target object.
referred_obj_name: The textual name of the target object mentioned in the question context.
referred_obj_id: The integer object id of the referred target object in the scene.
referred_obj_ans: The correct multiple-choice option for the referred object, such as B.
object_ids: A list of scene object ids that are explicitly involved in the question context.
object_names: A list of object category names aligned with object_ids.
object_captions: A list of textual descriptions aligned with object_ids and object_names.

Example:

{
  "answers": ["B"],
  "object_ids": [2, 9],
  "object_names": ["table", "chair"],
  "question": "The brown wooden chair is placed at the head of the long rectangular table. What's the central coordinate of the the brown wooden chair.Please select the option closest to the brown wooden chair from the following coordinate options. Your answer can only be one of A, B, C, and D.",
  "question_id": "val-scene0011-2",
  "scene_id": "scene0011_00",
  "object_captions": [
    "Caption for object 2",
    "Caption for object 9"
  ],
  "referred_obj_utterance": "The brown wooden chair is placed at the head of the long rectangular table.",
  "referred_obj_name": "the brown wooden chair.",
  "referred_obj_id": 9,
  "referred_obj_ans": "B"
}
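Because answers holds option letters, a free-form model response has to be mapped to one of A, B, C, or D before it can be scored. The sketch below shows one simple way to do that. It is an illustration only and may differ from what evaluate/em_acc_recall.py does: the helper names are hypothetical, the second question id is made up for the demo, and the regular expression is naive (a response that starts with the article "A" would be read as option A).

```python
import re


def extract_option(prediction):
    """Return the first standalone A/B/C/D in the prediction, or None if there is none."""
    match = re.search(r"\b([ABCD])\b", prediction)
    return match.group(1) if match else None


def multiple_choice_accuracy(items, predictions):
    """Fraction of items whose extracted option is listed in the item's answers.

    items: list of dicts with "question_id" and "answers" (as in Abs_SpatialQA.json).
    predictions: dict mapping question_id to the raw model response (assumed layout).
    """
    if not items:
        return 0.0
    correct = 0
    for item in items:
        option = extract_option(predictions.get(item["question_id"], ""))
        if option is not None and option in item["answers"]:
            correct += 1
    return correct / len(items)


items = [
    {"question_id": "val-scene0011-2", "answers": ["B"]},
    {"question_id": "demo-question", "answers": ["C"]},  # hypothetical id for the demo
]
predictions = {
    "val-scene0011-2": "The answer is B.",
    "demo-question": "D",
}
print(multiple_choice_accuracy(items, predictions))
```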

3. Evaluation

We provide the evaluation code and example outputs for 3D LLMs. To evaluate a custom model, please ensure that its output format is consistent with the provided examples.

Each output file is a JSON list. Each item in the list corresponds to one evaluation example and has the following fields:

source: The unique sample identifier. Example: val-scene0011-0.
scene_id: The ScanNet scene identifier for the sample. Example: scene0011_00.
instruction: The full input prompt given to the model, usually including the USER question and the ASSISTANT prefix.
response_gt: A list of ground-truth answers. Multiple equivalent reference answers may be provided for the same question.
response_pred: The model prediction for the sample.

Example:

{
  "source": "val-scene0011-0",
  "scene_id": "scene0011_00",
  "instruction": "USER: What color is the chair in the kitchen? ASSISTANT:",
  "response_gt": ["dark brown", "brown"],
  "response_pred": "brown"
}
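When exporting results from a custom model, it can help to check the records against the field list above before running the evaluation. The following is a minimal sketch, not part of the repository: `validate_output_items` is a hypothetical helper, and the expected types (in particular that response_pred is a string) are inferred from the example rather than confirmed by the evaluation code.

```python
import json

# Field names come from the list above; the types are inferred from the example.
REQUIRED_FIELDS = {
    "source": str,
    "scene_id": str,
    "instruction": str,
    "response_gt": list,
    "response_pred": str,
}


def validate_output_items(items):
    """Return human-readable errors for records that do not match the expected layout."""
    errors = []
    for i, item in enumerate(items):
        for field, expected_type in REQUIRED_FIELDS.items():
            if field not in item:
                errors.append(f"item {i}: missing field '{field}'")
            elif not isinstance(item[field], expected_type):
                errors.append(f"item {i}: '{field}' should be {expected_type.__name__}")
    return errors


items = [
    {
        "source": "val-scene0011-0",
        "scene_id": "scene0011_00",
        "instruction": "USER: What color is the chair in the kitchen? ASSISTANT:",
        "response_gt": ["dark brown", "brown"],
        "response_pred": "brown",
    }
]

print(validate_output_items(items))
# The output file is the JSON serialization of the whole list.
serialized = json.dumps(items, indent=2)
```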

To run the evaluation code on cached output, download output_cache to the current directory and run the commands in the subsections below.

3.1 EM of Different 3D LLMs on ScanQA, SQA3D, and RelSpatialQA

Evaluate different 3D LLMs with EM and refined-EM on spatial QA benchmarks. Report CIDEr, BLEU-4, METEOR, and ROUGE at the same time.

ScanQA:

python evaluate/em_acc_recall.py \
  evaluate-main-results \
  --repo-root . \
  --target-split Eval_result_ScanQA \
  --modal PC \
  --suffix ''

SQA3D:

python evaluate/em_acc_recall.py \
  evaluate-main-results \
  --repo-root . \
  --target-split Eval_result_SQA3D \
  --modal PC \
  --suffix ''

RelSpatialQA:

python evaluate/em_acc_recall.py \
  evaluate-main-results \
  --repo-root . \
  --target-split Eval_result_RespatialQA \
  --modal PC \
  --suffix ''
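For intuition about the EM metric, the sketch below computes a plain exact-match score over records in the output format from Section 3. It is a simplified illustration under stated assumptions, not the scoring in evaluate/em_acc_recall.py: the normalization (lowercasing, stripping punctuation, collapsing whitespace) is one common choice, the function names are hypothetical, and refined-EM is not reproduced here.

```python
import string

# Translation table that deletes ASCII punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference answer."""
    normalized = normalize(prediction)
    return any(normalized == normalize(reference) for reference in references)


def em_score(items):
    """Fraction of items whose response_pred exactly matches one entry of response_gt."""
    if not items:
        return 0.0
    hits = sum(exact_match(item["response_pred"], item["response_gt"]) for item in items)
    return hits / len(items)


demo_items = [
    {"response_gt": ["dark brown", "brown"], "response_pred": "Brown."},
    {"response_gt": ["table"], "response_pred": "a chair"},
]
print(em_score(demo_items))
```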

3.2 Accuracy and Recall of 3D LLMs on RelSpatialQA

Measure accuracy and recall on RelSpatialQA.

Prepare Eval_result_RespatialQA/PC and Eval_result_ScanQA/PC.
Run the RelSpatialQA command below.

python evaluate/em_acc_recall.py \
  evaluate-main-results \
  --repo-root . \
  --target-split Eval_result_RespatialQA \
  --scanqa-split Eval_result_ScanQA \
  --modal PC \
  --suffix ''

3.3 Accuracy of 3D LLMs on AbsSpatialQA

Measure accuracy on AbsSpatialQA.

Prepare Eval_result_AbsSpatialQA/PC.
Run evaluate-main-results.

python evaluate/em_acc_recall.py \
  evaluate-main-results \
  --repo-root . \
  --target-split Eval_result_AbsSpatialQA \
  --modal PC \
  --suffix ''

4. Attention Visualization

4.1 Attention on Tokens

Inspect the sliding-window attention distribution from response tokens to point-cloud tokens. Compare token-level attention patterns between successful and failed cases.

Use output_cache/Attention_cache/{dataset}_shuffled_token.
Use output_cache/Attention_eval_result/{dataset}.json.
Run attention-on-tokens.

ScanQA:

python evaluate/vis_attention.py \
  attention-on-tokens \
  --dataset scanqa

RelSpatialQA:

python evaluate/vis_attention.py \
  attention-on-tokens \
  --dataset relspatialqa

Optional arguments:

--max-cases: the maximum number of cases to evaluate
--plot: plot the results
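As a rough illustration of a sliding-window attention distribution, the sketch below sums attention weights over windows of consecutive point-cloud tokens and averages the result over response tokens. It assumes the attention is available as a nested list with one row per response token and one column per point-cloud token; that layout, the function names, and the window and stride parameters are assumptions for this example and do not describe the cache format or the logic of evaluate/vis_attention.py.

```python
def sliding_window_mass(attention_row, window, stride=1):
    """Sum the attention weights inside each window of consecutive tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    last_start = len(attention_row) - window
    return [
        sum(attention_row[start:start + window])
        for start in range(0, last_start + 1, stride)
    ]


def mean_window_mass(attention, window, stride=1):
    """Average the per-window attention mass over all response tokens.

    attention: list of rows, one per response token, each row holding the
    attention weights over the point-cloud tokens (assumed layout).
    """
    per_token = [sliding_window_mass(row, window, stride) for row in attention]
    if not per_token:
        return []
    # zip(*per_token) groups the values that belong to the same window position.
    return [sum(column) / len(per_token) for column in zip(*per_token)]


# Toy input: two response tokens attending over four point-cloud tokens.
toy_attention = [
    [0.1, 0.2, 0.3, 0.4],
    [0.4, 0.3, 0.2, 0.1],
]
print(mean_window_mass(toy_attention, window=2))
```

Comparing these window profiles between successful and failed cases is one way to see whether attention concentrates on different regions of the point-cloud token sequence.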

4.2 Attention on Target Token

Measure whether the model places attention on the true target-object tokens.

Choose the dataset.
Run attention-on-target-token.

ScanQA:

python evaluate/vis_attention.py \
  attention-on-target-token \
  --dataset scanqa

RelSpatialQA:

python evaluate/vis_attention.py \
  attention-on-target-token \
  --dataset relspatialqa
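One simple way to quantify attention on the target object is the share of attention mass that falls on the target's token positions, compared with the share expected under uniform attention. The sketch below illustrates that idea only. The function names and the input layout (one attention row plus a list of target token indices) are assumptions and may not match how evaluate/vis_attention.py defines the measurement.

```python
def target_attention_ratio(attention_row, target_indices):
    """Share of the row's attention mass that lands on the target token positions."""
    total = sum(attention_row)
    if total == 0:
        return 0.0
    # set() guards against counting a repeated index twice.
    on_target = sum(attention_row[i] for i in set(target_indices))
    return on_target / total


def uniform_baseline(num_tokens, target_indices):
    """Share the target tokens would receive if attention were spread uniformly."""
    if num_tokens == 0:
        return 0.0
    return len(set(target_indices)) / num_tokens


row = [0.1, 0.2, 0.3, 0.4]   # toy attention over four point-cloud tokens
targets = [2, 3]             # hypothetical positions of the target-object tokens
print(target_attention_ratio(row, targets), uniform_baseline(len(row), targets))
```

A ratio above the uniform baseline suggests the model attends to the target tokens more than chance would predict.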

4.3 Attention on Sink Tokens

Measure how much attention is assigned to sink tokens after shuffling. Compare sink-token attention between successful and failed cases.

ScanQA:

python evaluate/vis_attention.py \
  attention-on-sink-token \
  --dataset scanqa

RelSpatialQA:

python evaluate/vis_attention.py \
  attention-on-sink-token \
  --dataset relspatialqa

5. Logit Lens Evaluation

Inspect the top decoded tokens from point-cloud tokens and response tokens across transformer layers. Analyze when semantic information becomes identifiable during spatial reasoning.

Prepare a locally available Vicuna/LLaMA base model.
Prepare a full cache directory that contains all_hidden_states.
Run evaluate/logit_lens.py.
Read the output from --save-dir/results.json.
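The core idea of a logit lens can be sketched in a few lines: project an intermediate hidden state through the output embedding (unembedding) matrix and read off the highest-scoring tokens at each layer. The toy example below uses a made-up three-word vocabulary and two-dimensional hidden states so that it runs without a model. It is an illustration under those assumptions, not the logic of evaluate/logit_lens.py, and it omits the final normalization layer that implementations commonly apply before the projection.

```python
def project_to_vocab(hidden_state, unembedding):
    """Compute one logit per vocabulary entry as a dot product with the hidden state."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in unembedding]


def logit_lens_top_k(hidden_state, unembedding, vocab, k=1):
    """Return the k vocabulary entries with the highest projected logits."""
    logits = project_to_vocab(hidden_state, unembedding)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy setup (all values are invented for the demo).
vocab = ["chair", "table", "lamp"]
unembedding = [      # one row per vocabulary entry
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
]
hidden_by_layer = [  # hidden state of one token position at three layers
    [0.2, 0.9],
    [0.6, 0.5],
    [1.0, 0.1],
]

for layer, hidden in enumerate(hidden_by_layer):
    print(layer, logit_lens_top_k(hidden, unembedding, vocab, k=2))
```

Tracking how the top decoded token changes from layer to layer is what makes it possible to ask when semantic information becomes identifiable.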

5.1 ScanQA

python evaluate/logit_lens.py \
  --base-model /path/to/base_model \
  --save-dir output_cache/logit_lens_scanqa \
  --eval-count 1

5.2 RelSpatialQA

python evaluate/logit_lens.py \
  --base-model /path/to/base_model \
  --att-dir output_cache/Attention_cache/relspatialqa_shuffled_token_full \
  --save-dir output_cache/logit_lens_relspatialqa \
  --eval-count 1

Acknowledgements

This repository builds on several excellent open-source projects. We thank llava-interp for the attention and logit-lens analysis references, and embodied-generalist and 3D-LLM for the original 3D evaluation and model analysis codebases.

Citation

@article{zhang2025point,
  title={The Point, the Vision and the Text: Does Point Cloud Boost Spatial Reasoning of Large Language Models?},
  author={Zhang, Weichen and Peng, Ruiying and Gao, Chen and Fang, Jianjie and Zeng, Xin and Li, Kaiyuan and Wang, Ziyou and Cui, Jinqiang and Wang, Xin and Chen, Xinlei and others},
  journal={arXiv preprint arXiv:2504.04540},
  year={2025}
}
