Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling
Authors: Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo
First: 2026-02-11T18:57:29+00:00 · Latest: 2026-02-11T18:57:29+00:00
Comments: Code: https://github.com/HKUST-C4G/diffusion-rm
Abstract
Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
中文标题/摘要
标题:超越基于VLM的奖励:扩散原生潜空间奖励建模
扩散和流匹配模型的偏好优化依赖于既具有判别鲁棒性又具有计算效率的奖励函数。视觉语言模型(VLMs)已成为主要的奖励提供者,利用其丰富的跨模态先验来引导对齐。然而,它们的计算和内存成本可能相当大,通过像素空间奖励优化潜扩散生成器会引入领域不匹配,这会复杂化对齐。在本文中,我们提出了一种扩散原生潜空间奖励模型DiNa-LRM,该模型直接在噪声扩散状态上进行偏好学习。我们的方法引入了一种噪声校准的Thurstone似然性,具有扩散噪声依赖的不确定性。DiNa-LRM 利用预训练的潜扩散主干和时间步条件奖励头,并支持推理时的噪声集成,提供了一种扩散原生的测试时扩展机制和稳健奖励机制。在图像对齐基准测试中,DiNa-LRM 显著优于现有的基于扩散的奖励基线,并以较低的计算成本实现了与最先进的VLMs相当的性能。在偏好优化中,我们证明DiNa-LRM 改进了偏好优化动力学,使模型对齐更快且更节省资源。
Summary / 总结
This paper addresses the challenge of preference optimization for diffusion and flow-matching models by proposing DiNa-LRM, a diffusion-native latent reward model. It formulates preference learning directly on noisy diffusion states, using a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a lower computational cost across image alignment benchmarks. Additionally, it improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
本文提出了一种扩散本征的潜在奖励模型DiNa-LRM,以解决扩散和流匹配模型的偏好优化问题。该模型直接在噪声扩散状态上进行偏好学习,使用噪声校准的Thurstone似然和时间步长条件的奖励头。实验结果表明,DiNa-LRM在较低的计算成本下超过了现有的基于扩散的奖励基线,并且在偏好优化中提高了优化动态,实现了更快和更高效的模型对齐。
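The noise-calibrated Thurstone likelihood named in the DiNa-LRM abstract above can be illustrated with a small sketch. It assumes each scalar reward is Gaussian with a standard deviation that grows with the diffusion noise level. The linear schedule in `sigma_at`, its constants, and every function name here are hypothetical choices for illustration; the abstract does not give the actual parameterization.

```python
import math


def normal_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def sigma_at(t, base=0.5, slope=2.0):
    """Hypothetical uncertainty schedule: noisier states (larger t) get a larger std."""
    return base + slope * t


def thurstone_prob(r_win, r_lose, t_win, t_lose):
    """P(winner preferred) when each reward is Gaussian with noise-dependent std."""
    scale = math.sqrt(sigma_at(t_win) ** 2 + sigma_at(t_lose) ** 2)
    return normal_cdf((r_win - r_lose) / scale)


def preference_nll(r_win, r_lose, t_win, t_lose):
    """Negative log-likelihood of one labeled preference pair."""
    return -math.log(thurstone_prob(r_win, r_lose, t_win, t_lose))


def ensemble_reward(reward_fn, latent, timesteps):
    """Average a timestep-conditioned reward over several noise levels."""
    return sum(reward_fn(latent, t) for t in timesteps) / len(timesteps)
```

Under this assumed schedule, the same reward gap yields a probability closer to 0.5 at high noise, so heavily noised pairs contribute a softer training signal. `ensemble_reward` is only a stand-in for inference-time noise ensembling: it averages whatever `reward_fn` returns over the supplied timesteps.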
GENIUS: Generative Fluid Intelligence Evaluation Suite
Authors: Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, Wentao Zhang
First: 2026-02-11T18:55:54+00:00 · Latest: 2026-02-11T18:55:54+00:00
Abstract
Unified Multimodal Models (UMMs) have shown remarkable progress in visual generation. Yet, existing benchmarks predominantly assess Crystallized Intelligence, which relies on recalling accumulated knowledge and learned schemas. This focus overlooks Generative Fluid Intelligence (GFI): the capacity to induce patterns, reason through constraints, and adapt to novel scenarios on the fly. To rigorously assess this capability, we introduce GENIUS (GEN Fluid Intelligence EvalUation Suite). We formalize GFI as a synthesis of three primitives. These include Inducing Implicit Patterns (e.g., inferring personalized visual preferences), Executing Ad-hoc Constraints (e.g., visualizing abstract metaphors), and Adapting to Contextual Knowledge (e.g., simulating counter-intuitive physics). Collectively, these primitives challenge models to solve problems grounded entirely in the immediate context. Our systematic evaluation of 12 representative models reveals significant performance deficits in these tasks. Crucially, our diagnostic analysis disentangles these failure modes. It demonstrates that deficits stem from limited context comprehension rather than insufficient intrinsic generative capability. To bridge this gap, we propose a training-free attention intervention strategy. Ultimately, GENIUS establishes a rigorous standard for GFI, guiding the field beyond knowledge utilization toward dynamic, general-purpose reasoning. Our dataset and code will be released at: https://github.com/arctanxarc/GENIUS.
中文标题/摘要
标题:GENIUS:生成流体智能评估套件
统一多模态模型(UMMs)在视觉生成方面取得了显著进展。然而,现有的基准测试主要评估的是晶体智力,这依赖于回忆积累的知识和学习的模式。这种关注忽视了生成流体智能(GFI):即在不断变化的情境中诱导模式、通过约束进行推理和适应新场景的能力。为了严格评估这种能力,我们引入了GENIUS(GEN Fluid Intelligence EvalUation Suite)。我们将GFI形式化为三个基本要素的综合。这些包括诱导隐含模式(例如,推断个性化的视觉偏好),执行即兴约束(例如,可视化抽象的隐喻),以及适应背景知识(例如,模拟反直觉的物理现象)。这些基本要素共同挑战模型解决完全基于当前背景的问题。我们对12个代表性模型的系统评估揭示了这些任务中的显著性能缺陷。至关重要的是,我们的诊断分析将这些失败模式分离出来。它表明缺陷源于对背景理解的有限性,而不是生成能力的不足。为了弥合这一差距,我们提出了一种无需训练的注意力干预策略。最终,GENIUS为GFI确立了一个严格的标准,引导该领域从知识利用转向动态、通用的推理。我们的数据集和代码将在以下链接发布:https://github.com/arctanxarc/GENIUS。
Just on Time: Token-Level Early Stopping for Diffusion Language Models
Authors: Zahar Kohut, Severyn Shykula, Dmytro Khamula, Mykola Vysotskyi, Taras Rumezhak, Volodymyr Karpiv
First: 2026-02-11T18:44:04+00:00 · Latest: 2026-02-11T18:44:04+00:00
Comments: Under review
Abstract
Diffusion language models generate text through iterative refinement, a process that is often computationally inefficient because many tokens reach stability long before the final denoising step. We introduce a training-free, token-level early stopping approach that identifies convergence independently at each position. Our method leverages lightweight signals derived from the model's predictions and local context to dynamically determine when individual tokens can be finalized. This yields adaptive per-token freezing without task-specific fine-tuning, substantially reducing the total number of diffusion steps required. Across diverse benchmarks, spanning mathematical reasoning, general question answering, and scientific understanding, our approach achieves state-of-the-art efficiency gains while preserving generation quality.
中文标题/摘要
标题:刚刚好:扩散语言模型的标记级早期停止
扩散语言模型通过迭代细化生成文本,这一过程通常计算效率低下,因为许多标记在最终去噪步骤之前就已经稳定。我们提出了一种无需训练的标记级早期停止方法,能够在每个位置独立识别收敛。该方法利用模型预测和局部上下文中的轻量级信号,动态确定何时可以最终确定个别标记。这实现了适应性的标记级冻结,无需针对特定任务的微调,大幅减少了所需的扩散步骤总数。在涵盖数学推理、通用问答和科学理解等多种基准测试中,我们的方法在保持生成质量的同时实现了最先进的效率提升。
Summary / 总结
The research aims to improve the efficiency of diffusion language models by introducing a token-level early stopping method that identifies when individual tokens have reached stability. This method uses lightweight signals from the model's predictions and local context to dynamically freeze tokens, reducing the number of diffusion steps needed. Experiments across various benchmarks show that this approach achieves significant efficiency gains without compromising generation quality.
研究旨在通过引入基于令牌的早期停止方法来提高扩散语言模型的效率。该方法能够识别何时单个令牌已达到稳定状态并可以被最终确定,无需特定任务的微调。因此,总的扩散步骤数量减少,导致在各种基准测试中取得了显著的效率提升,同时保持生成质量。
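The per-token freezing idea in the early-stopping abstract above can be sketched as follows. The abstract does not say which signals are used, so this example assumes a simple rule: a position is frozen once its top-1 prediction has stayed the same, with confidence above a threshold, for `patience` consecutive denoising steps. `TokenState`, `update_freezing`, and both thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenState:
    token: Optional[int] = None  # last top-1 prediction seen at this position
    streak: int = 0              # consecutive confident steps with the same token
    frozen: bool = False         # True once the position is finalized


def update_freezing(states, predictions, confidences, threshold=0.9, patience=2):
    """Apply one denoising step and return how many positions remain active."""
    active = 0
    for state, tok, conf in zip(states, predictions, confidences):
        if state.frozen:
            continue  # frozen positions keep their token and are skipped
        if tok == state.token and conf >= threshold:
            state.streak += 1
        else:
            # A changed or low-confidence prediction restarts the streak.
            state.streak = 1 if conf >= threshold else 0
        state.token = tok
        if state.streak >= patience:
            state.frozen = True
        else:
            active += 1
    return active
```

A decoding loop could call `update_freezing` after every step and stop as soon as it returns 0, which is where the saving in denoising steps would come from under these assumptions.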
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tienie Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-11T17:42:37+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:基于图像的对话促进内省视觉思考
当前的大规模视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的文本推理,这往往会导致细粒度视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了有效的跨模态对齐——特别是在需要在远距离区域或多个图像之间推理视觉语义或几何关系时。为了解决这些挑战,我们提出了一种新的框架“基于图像的对话”,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而增强了语言推理与视觉状态更新之间的耦合。我们通过ViLaVT这一新型LVLM实例化了这一范式,ViLaVT配备了一个明确设计用于此类交互式视觉推理的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程进行训练,以促进有效的推理行为。在八个基准测试中的广泛实验表明,ViLaVT实现了显著且一致的改进,特别是在复杂的多图像和基于视频的空间推理任务上表现尤为突出。
Summary / 总结
This paper addresses the limitations of current large vision-language models (LVLMs) that rely on text-only reasoning over a single-pass visual encoding, and proposes a new framework called "chatting with images" that reframes visual manipulation as language-guided feature modulation. Under the guidance of language prompts, the model dynamically re-encodes multiple image regions jointly. The framework is instantiated in ViLaVT, a novel LVLM with a dynamic vision encoder, trained using a two-stage curriculum of supervised fine-tuning and reinforcement learning. Experiments across eight benchmarks show that ViLaVT improves performance, especially on complex multi-image and video-based spatial reasoning tasks.
研究旨在通过解决细粒度视觉信息丢失和跨模态对齐困难的问题,改进大型视觉-语言模型(LVLM)的视觉推理能力。提出的‘与图像对话’框架将视觉操作重新定义为语言引导的特征调制,使模型在语言提示的指导下动态重新编码多个图像区域。ViLaVT 是一种新型的 LVLM,配备了动态视觉编码器,通过结合监督微调和强化学习的两阶段课程进行训练。实验表明,ViLaVT 在八个基准测试中表现出色,特别是在复杂的多图像和视频空间推理任务中取得了显著的提升。
Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models
Authors: Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao
First: 2025-10-01T17:58:05+00:00 · Latest: 2026-02-11T16:48:31+00:00
Abstract
Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2 × 2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.
中文标题/摘要
标题:代理拼图互动学习以增强视觉感知和语言模型中的推理能力
尽管当前大型视觉-语言模型(VLMs)在多模态理解和推理方面取得了进展,但其基本的感知和推理能力仍然有限。具体而言,即使在简单的拼图任务上,现有的VLMs的表现也接近随机,揭示了核心感知和推理能力的不足。虽然高质量的视觉-语言数据可以增强这些能力,但由于其稀缺性和有限的可扩展性,这带来了显著的限制。为了解决这个问题,我们提出了AGILE,一种代理拼图互动学习方法,以增强视觉感知和推理能力在VLMs中的表现。AGILE将拼图解决过程公式化为一个互动过程,使模型能够逐步与环境互动。在每一步中,模型根据当前状态生成可执行代码以执行动作,而环境则提供精细的视觉反馈以指导任务完成。通过这种观察和互动的迭代循环,模型通过探索和反馈逐步提高其感知和推理能力。实验结果表明,AGILE不仅在不同复杂度的拼图任务(例如,在2×2设置下将准确率从9.5%提高到82.8%)上显著提升了性能,还在9个通用视觉任务中展示了强大的泛化能力,平均提高了3.1%。这些结果表明,在感知和推理能力方面都有显著提升。这项工作为推进多模态模型中的推理和泛化开辟了一条新途径,并提供了一种高效、可扩展的解决方案来解决多模态强化学习数据的稀缺性。代码和数据集可在https://github.com/yuzeng0-0/AGILE 获取。
Summary / 总结
AGILE is proposed to enhance the perceptual and reasoning abilities of Vision-Language Models (VLMs) by formulating jigsaw solving as an interactive process. The model generates executable code to perform actions based on visual feedback, leading to significant improvements in jigsaw task performance and generalization across various vision tasks. AGILE boosts accuracy from 9.5% to 82.8% on 2x2 jigsaw tasks and achieves an average improvement of 3.1% across 9 general vision tasks.
论文通过提出AGILE方法,解决当前视觉-语言模型在感知和推理能力上的局限性。AGILE将拼图解决过程视为迭代过程,模型根据视觉反馈生成动作并改进其能力。该方法在拼图任务上的性能显著提升,并且在其他视觉任务上表现出良好的泛化能力,展示了感知和推理能力的增强。
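The observe-act-feedback cycle described in the AGILE abstract above can be pictured with a toy environment. This is only a sketch under stated assumptions: the `JigsawEnv` class, its `swap` action, and the count-based feedback are hypothetical stand-ins, since the abstract says only that the model emits executable code and the environment returns fine-grained visual feedback.

```python
class JigsawEnv:
    """Toy jigsaw: state[slot] holds the id of the piece currently in that slot."""

    def __init__(self, permutation):
        self.state = list(permutation)

    def feedback(self):
        """Report how many pieces sit in their home slot and whether all do."""
        correct = sum(1 for slot, piece in enumerate(self.state) if slot == piece)
        return {"correct": correct, "solved": correct == len(self.state)}

    def swap(self, i, j):
        """Action: exchange the pieces in slots i and j, then return feedback."""
        self.state[i], self.state[j] = self.state[j], self.state[i]
        return self.feedback()


def run_episode(env, actions):
    """Execute a sequence of (i, j) swaps, stopping early once solved."""
    result = env.feedback()
    for i, j in actions:
        if result["solved"]:
            break
        result = env.swap(i, j)
    return result
```

In an interactive training loop, each `(i, j)` pair would come from code the model generates after seeing the previous feedback, rather than from a fixed list as in `run_episode`.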
ROCKET: Rapid Optimization via Calibration-guided Knapsack Enhanced Truncation for Efficient Model Compression
Authors: Ammar Ali, Baher Mohammad, Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Stamatios Lefkimmiatis
First: 2026-02-11T16:34:52+00:00 · Latest: 2026-02-11T16:34:52+00:00
Abstract
We present ROCKET, a training-free model compression method that achieves state-of-the-art performance in comparison with factorization, structured-sparsification and dynamic compression baselines. Operating under a global compression budget, ROCKET comprises two key innovations. First, it formulates layer-wise compression allocation as a multi-choice knapsack problem, selecting the optimal compression level for each layer to minimize total reconstruction error while adhering to a target model size. Second, it introduces a single-step sparse matrix factorization inspired by dictionary learning: using only a small calibration set, it sparsifies weight coefficients based on activation-weight sensitivity and then updates the dictionary in closed form via least squares, bypassing iterative optimization, sparse coding, or backpropagation entirely. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50% compression rates. Notably, it retains over 90% of the original model's performance at 30% compression without any fine-tuning. Moreover, when applying a light fine-tuning phase, recovery is substantially enhanced: for instance, compressing Qwen3-14B to an 8B-parameter model and healing it with just 30 million tokens yields performance nearly on par with the original Qwen3-8B. The code for ROCKET is at github.com/mts-ai/ROCKET/tree/main.
中文标题/摘要
标题:ROCKET:通过校准引导的背包增强截断实现快速优化的高效模型压缩方法
我们提出了ROCKET,一种无需训练的模型压缩方法,在与因子分解、结构稀疏化和动态压缩基线的比较中实现了最先进的性能。ROCKET 在全局压缩预算下包含两项关键创新:首先,它将层间压缩分配形式化为多选择背包问题,为每个层选择最优压缩级别,以最小化总重构误差并遵守目标模型大小。其次,它引入了一步稀疏矩阵分解,灵感来源于字典学习:仅使用一个小的校准集,根据激活-权重敏感性稀疏化权重系数,然后通过最小二乘法以闭式形式更新字典,完全绕过了迭代优化、稀疏编码或反向传播。ROCKET 在不同模型架构下的压缩率在20-50%时始终优于现有压缩方法。值得注意的是,在30%压缩率下,它保留了原始模型性能的90%以上,无需任何微调。此外,当应用轻量级微调阶段时,恢复效果显著增强:例如,将Qwen3-14B压缩到8B参数模型,并用3000万词进行修复,性能几乎与原始Qwen3-8B持平。ROCKET的代码在github.com/mts-ai/ROCKET/tree/main/。
Summary / 总结
ROCKET is a training-free model compression method that formulates layer-wise compression allocation as a multi-choice knapsack problem and introduces a single-step sparse matrix factorization. It selects the optimal compression level for each layer to minimize reconstruction error while adhering to a target model size. ROCKET consistently outperforms existing compression approaches across different model architectures at 20-50% compression rates and retains over 90% of the original model's performance at 30% compression without any fine-tuning. With light fine-tuning on 30 million tokens, a Qwen3-14B model compressed to 8B parameters performs nearly on par with the original Qwen3-8B.
ROCKET 是一种无需训练的模型压缩方法,将层间压缩分配问题表述为多选择背包问题,并引入了一步完成的稀疏矩阵分解。它在不同模型架构上以20-50%的压缩率优于现有压缩方法,并在30%压缩率下保留了原模型超过90%的性能,无需任何微调。通过轻量级微调,ROCKET 进一步提高了恢复效果,在压缩后仅使用3000万令牌即可达到与原模型相近的性能。
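The multi-choice knapsack formulation in the ROCKET abstract above picks exactly one compression level per layer so that total size fits a budget and total reconstruction error is minimal. A minimal dynamic-programming sketch of that problem class follows. It assumes integer layer sizes and precomputed per-level errors; `allocate_compression` and the sample numbers are hypothetical and do not reproduce the paper's solver.

```python
def allocate_compression(options, budget):
    """Solve a small multi-choice knapsack.

    options[l] is a list of (size, error) choices for layer l, with integer sizes.
    Returns (total_error, chosen_indices) with total size <= budget,
    or None when no combination fits.
    """
    # best maps an exact total size to the lowest-error (error, picks) reaching it.
    best = {0: (0.0, [])}
    for layer in options:
        nxt = {}
        for used, (err, picks) in best.items():
            for idx, (size, layer_err) in enumerate(layer):
                total = used + size
                if total > budget:
                    continue  # this choice would exceed the global budget
                cand = (err + layer_err, picks + [idx])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt  # every surviving entry has one choice for each layer so far
    if not best:
        return None
    return min(best.values(), key=lambda entry: entry[0])
```

For example, with two layers that each offer sizes 4, 2, and 1 at increasing error, a budget of 6 forces one layer to be compressed, and the sketch compresses the layer whose error penalty is smaller.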
Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility
Authors: Mengxuan Wang, Yuxin Chen, Gang Xu, Tao He, Hongjie Jiang, Ming Li
First: 2026-02-03T11:26:05+00:00 · Latest: 2026-02-11T16:06:31+00:00
Abstract
Vision language models (VLMs) extend the reasoning capabilities of large language models (LLMs) to cross-modal settings, yet remain highly vulnerable to multimodal jailbreak attacks. Existing defenses predominantly rely on safety fine-tuning or aggressive token manipulations, incurring substantial training costs or significantly degrading utility. Recent research shows that LLMs inherently recognize unsafe content in text, and the incorporation of visual inputs in VLMs frequently dilutes risk-related signals. Motivated by this, we propose Risk Awareness Injection (RAI), a lightweight and training-free framework for safety calibration that restores LLM-like risk recognition by amplifying unsafe signals in VLMs. Specifically, RAI constructs an Unsafe Prototype Subspace from language embeddings and performs targeted modulation on selected high-risk visual tokens, explicitly activating safety-critical signals within the cross-modal feature space. This modulation restores the model's LLM-like ability to detect unsafe content from visual inputs, while preserving the semantic integrity of original tokens for cross-modal reasoning. Extensive experiments across multiple jailbreak and utility benchmarks demonstrate that RAI substantially reduces attack success rate without compromising task performance.
中文标题/摘要
标题:风险意识注入:在不牺牲实用性的情况下为视觉语言模型校准安全性
视觉语言模型(VLMs)将大型语言模型(LLMs)的推理能力扩展到跨模态设置中,但仍然高度易受多模态脱缰攻击的影响。现有防御措施主要依赖于安全微调或激进的标记操作,这会带来巨大的训练成本或显著降低实用性。最近的研究表明,LLMs 本身能够识别文本中的不安全内容,而将视觉输入纳入 VLMs 中通常会稀释与风险相关的信号。受此启发,我们提出了一种轻量级且无需训练的框架——风险意识注入(RAI),通过放大 VLMs 中的不安全信号来恢复 LLM 类似的风险识别能力。具体而言,RAI 从语言嵌入中构建一个不安全原型子空间,并对选定的高风险视觉标记进行有针对性的调节,明确激活跨模态特征空间中的安全关键信号。这种调节恢复了模型从视觉输入中检测不安全内容的 LLM 类似能力,同时保持原始标记的语义完整性以进行跨模态推理。在多个脱缰和实用性基准上的广泛实验表明,RAI 在不牺牲任务性能的情况下显著降低了攻击成功率。
Summary / 总结
The paper addresses the vulnerability of vision-language models (VLMs) to multimodal jailbreak attacks by proposing Risk Awareness Injection (RAI), a lightweight and training-free framework. RAI amplifies unsafe signals in VLMs by constructing an Unsafe Prototype Subspace from language embeddings and modulating selected high-risk visual tokens, thereby restoring the model's ability to detect unsafe content from visual inputs. Experiments show that RAI significantly reduces attack success rates without degrading task performance.
论文提出了一种轻量级且无需训练的框架Risk Awareness Injection (RAI),以解决视觉语言模型(VLMs)对多模态脱缰攻击的脆弱性问题。RAI通过从语言嵌入中构建一个不安全原型子空间,并对高风险视觉标记进行目标化调制,来放大不安全信号。实验结果表明,RAI能够显著降低攻击成功率,同时保持任务性能。
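The targeted modulation of high-risk visual tokens described in the RAI abstract above might look roughly like the sketch below. It assumes that risk is scored by cosine similarity to a set of unsafe prototype vectors and that modulation is an additive nudge along the closest prototype. The function names, the threshold, and the additive update are all hypothetical; the abstract does not specify how the subspace is built or applied.

```python
import math


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def inject_risk_awareness(visual_tokens, unsafe_prototypes, threshold=0.5, strength=0.3):
    """Amplify the unsafe direction only in tokens that already align with it."""
    out = []
    for tok in visual_tokens:
        closest = max(unsafe_prototypes, key=lambda p: cosine(tok, p))
        if cosine(tok, closest) >= threshold:
            norm = math.sqrt(sum(x * x for x in closest))
            # Add a scaled unit vector along the closest unsafe prototype.
            tok = [x + strength * p / norm for x, p in zip(tok, closest)]
        out.append(list(tok))  # low-risk tokens pass through unchanged
    return out
```

Leaving low-similarity tokens untouched mirrors the abstract's stated goal of preserving the semantic integrity of the original tokens.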
GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation
Authors: Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen
Venue: ICLR 2026
First: 2025-10-02T16:37:56+00:00 · Latest: 2026-02-11T15:41:21+00:00
Comments: Accepted at ICLR 2026. Code available at: https://github.com/tj12323/GeoPurify
Abstract
Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy, view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data.
中文标题/摘要
标题:GeoPurify:一种开放词汇3D分割的高效几何蒸馏框架
最近尝试将2D视觉语言模型(VLMs)的特征转移到3D语义分割中,一直存在一个持续的权衡。直接将2D特征投影到3D中会产生嘈杂和碎片化的预测,而强制几何一致性则需要昂贵的训练管道和大规模标注的3D数据。我们认为这一限制源于占主导地位的分割和匹配范式,该范式无法调和2D语义与3D几何结构。在2D到3D的转移过程中,几何线索并未被消除,而是潜藏在嘈杂和视角聚合的特征中。为了利用这一特性,我们提出GeoPurify,它使用3D自监督教师模型中提取的几何先验,通过一个小学生亲和网络来净化2D VLM生成的3D点特征。在推理过程中,我们设计了一个几何引导聚合模块,进一步去噪点云并确保语义和结构的一致性。得益于潜藏的几何信息和学习到的亲和网络,GeoPurify有效地缓解了这一权衡,实现了更高的数据效率。在主要的3D基准测试中,GeoPurify在使用大约1.5%的训练数据的情况下,达到了或超越了最先进的性能。
Summary / 总结
GeoPurify is a data-efficient framework for open-vocabulary 3D semantic segmentation. It uses a small Student Affinity Network to purify 2D VLM-generated 3D point features with geometric priors distilled from a 3D self-supervised teacher model. During inference, a Geometry-Guided Pooling module further denoises the point cloud. Experiments show that GeoPurify matches or surpasses state-of-the-art methods while using only about 1.5% of the training data.
GeoPurify 是一个高效的数据框架,通过 3D 自监督教师模型提取的几何先验净化 2D VLM 生成的 3D 点特征,包含学生亲和网络和几何引导聚合模块以增强语义和结构一致性。实验表明,GeoPurify 使用仅 1.5% 的训练数据即可达到或超越现有最佳方法的性能。
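One way to picture the Geometry-Guided Pooling step in the GeoPurify abstract above is as an affinity-weighted average of point features, so that geometrically related points end up with similar semantics. The sketch below makes that assumption explicit. The function name and the dense affinity matrix are hypothetical, and the real module may work quite differently.

```python
def geometry_guided_pool(features, affinity):
    """Smooth per-point features with a non-negative affinity matrix.

    features: list of N feature vectors (lists of floats).
    affinity: N x N list of weights; affinity[i][j] says how strongly
              point j should influence point i.
    """
    pooled = []
    for i, row in enumerate(affinity):
        total = sum(row)
        if total == 0:
            pooled.append(list(features[i]))  # isolated point: keep its feature
            continue
        dim = len(features[i])
        pooled.append([
            sum(w * features[j][d] for j, w in enumerate(row)) / total
            for d in range(dim)
        ])
    return pooled
```

In this toy form, two points with mutual affinity and conflicting noisy features are pulled to a shared average, while a point with no affinity to the others is left as-is.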
Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Authors: Mingyu Cao, Alvaro Correia, Christos Louizos, Shiwei Liu, Lu Yin
First: 2026-02-11T15:41:09+00:00 · Latest: 2026-02-11T15:41:09+00:00
Comments: 11 pages, 8 figures
Abstract
Diffusion Language Models (DLMs) generate text by iteratively denoising a masked sequence, repeatedly deciding which positions to commit at each step. Standard decoding follows a greedy rule: unmask the most confident positions, yet this local choice can lock the model into a suboptimal unmasking order, especially on reasoning-heavy prompts. We present SOAR, a training-free decoding algorithm that adapts its behavior to the model's uncertainty. When confidence is low, SOAR briefly widens the search over alternative unmasking decisions to avoid premature commitments; when confidence is high, it collapses the search and decodes many positions in parallel to reduce the number of denoising iterations. Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and efficiency in DLM decoding.
中文标题/摘要
标题:搜索或加速:基于置信度切换的位置光束搜索算法在扩散语言模型中的应用
扩散语言模型(DLMs)通过迭代去噪掩码序列并反复决定在每一步中提交哪些位置来生成文本。标准解码遵循贪婪规则:去掩码最自信的位置,但这种局部选择可能会使模型陷入次优的去掩码顺序,尤其是在处理推理密集型提示时。我们提出了SOAR,一种无需训练的解码算法,可以根据模型的不确定性调整其行为。当置信度较低时,SOAR 会暂时扩大搜索范围,避免过早做出承诺;当置信度较高时,它会缩小搜索范围并并行解码多个位置,以减少去噪迭代次数。在数学推理和代码生成基准测试(GSM8K、MBPP、HumanEval)上,SOAR 在 Dream-7B 和 LLaDA-8B 上提高了生成质量,同时保持了竞争力的推理速度,提供了一种在 DLM 解码中平衡质量和效率的实用方法。
Summary / 总结
The paper introduces SOAR, a decoding algorithm for Diffusion Language Models that adapts based on model uncertainty. It searches more widely when confidence is low to avoid premature commitments and narrows the search when confidence is high to speed up decoding. Experiments on GSM8K, MBPP, and HumanEval show that SOAR improves generation quality while maintaining competitive inference speed, offering a practical balance between quality and efficiency in DLM decoding.
论文针对扩散语言模型(DLMs)在生成文本时,在推理密集型提示上可能出现的次优解码顺序问题。提出了SOAR,一种无需训练的解码算法,根据模型的不确定性进行调整。当信心较低时,SOAR广泛搜索以避免过早承诺;当信心较高时,它会缩小搜索范围并并行解码多个位置以减少去噪迭代次数。在GSM8K、MBPP和HumanEval基准测试上进行的实验表明,SOAR在保持竞争力的推理速度的同时提高了生成质量,提供了一种平衡质量和效率的实用解决方案。
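The confidence switch at the heart of SOAR, as described in the abstract above, can be sketched as a single decision function. It assumes one confidence score per masked position, a fixed threshold `tau`, and a small beam of single-position branches in the low-confidence case. `choose_unmask_action` and its defaults are hypothetical, not the paper's exact rule.

```python
def choose_unmask_action(confidences, tau=0.8, beam_width=2):
    """Decide between parallel decoding and a wider search.

    confidences: dict mapping each masked position to its top-1 probability.
    Returns (mode, branches), where each branch lists positions to unmask.
    """
    confident = sorted(p for p, c in confidences.items() if c >= tau)
    if confident:
        # High confidence: a single branch commits every confident position at once.
        return ("accelerate", [confident])
    # Low confidence: explore the best few positions as separate branches.
    ranked = sorted(confidences, key=lambda p: confidences[p], reverse=True)
    return ("search", [[p] for p in ranked[:beam_width]])
```

A full decoder would score each search branch after a further denoising step and keep the best ones; that scoring is omitted here because the abstract does not describe it.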
Unveiling the "Fairness Seesaw": Discovering and Mitigating Gender and Race Bias in Vision-Language Models
Authors: Jian Lan, Udo Schlegel, Tanveer Hannan, Gengyuan Zhang, Haokun Chen, Thomas Seidl
First: 2025-05-26T15:14:16+00:00 · Latest: 2026-02-11T15:06:11+00:00
Abstract
Although Vision-Language Models (VLMs) have achieved remarkable success, the knowledge mechanisms underlying their social biases remain a black box, where fairness- and ethics-related problems harm certain groups of people in society. It is unknown to what extent VLMs yield gender and race bias in generative responses. In this paper, we conduct a systematic discovery of gender and race bias in state-of-the-art VLMs, focusing not only on surface-level responses but also on the internal probability distributions and hidden state dynamics. Our empirical analysis reveals three critical findings: 1) The Fairness Paradox: Models often generate fair text labels while maintaining highly skewed confidence scores (mis-calibration) toward specific social groups. 2) Layer-wise Fluctuation: Fairness knowledge is not uniformly distributed; it peaks in intermediate layers and undergoes substantial knowledge erosion in the final layers. 3) Residual Discrepancy: Within a single hidden layer, different residual streams carry conflicting social knowledge - some reinforcing fairness while others amplifying bias. Leveraging these insights, we propose RES-FAIR (RESidual Flow Adjustment for Inference Recalibration), a post-hoc framework that mitigates bias by localizing and projecting hidden states away from biased residual directions while amplifying fair components. Evaluations on PAIRS and SocialCounterfactuals datasets demonstrate that our discovery-based approach significantly improves response fairness and confidence calibration without compromising general reasoning abilities. Our work provides a new lens for understanding how multi-modal models store and process sensitive social information.
中文标题/摘要
标题:揭开“公平天平”之谜:发现和缓解视觉语言模型中的性别和种族偏见
尽管视觉语言模型(VLMs)取得了显著的成功,但其社会偏见背后的知识机制仍然是一个黑箱,其中公平性和伦理相关的问题损害了社会上的某些群体。目前尚不清楚VLMs在生成响应时在性别和种族方面是否存在偏见。在本文中,我们系统地发现了最先进的VLMs中的性别和种族偏见,不仅关注表面级的响应,还关注内部概率分布和隐藏状态动力学。我们的实证分析揭示了三个关键发现:1)公平悖论:模型在生成公平的文本标签的同时,对特定社会群体保持高度偏斜的信心分数(误校准)。2)逐层波动:公平知识并非均匀分布;它在中间层达到峰值,并在最终层经历大量知识侵蚀。3)残差差异:在单个隐藏层内,不同的残差流携带冲突的社会知识——一些强化公平性,而另一些则放大偏见。利用这些见解,我们提出了RES-FAIR(残差流调整以校准推理),这是一种后处理框架,通过定位并投影隐藏状态远离有偏的残差方向,同时放大公平成分来缓解偏见。在PAIRS和SocialCounterfactuals数据集上的评估表明,我们的发现方法在不牺牲一般推理能力的情况下显著提高了响应的公平性和信心校准。我们的工作为理解多模态模型如何存储和处理敏感的社会信息提供了一个新的视角。
Summary / 总结
This paper aims to uncover and mitigate gender and race bias in Vision-Language Models (VLMs) by analyzing their internal mechanisms. The authors find that VLMs often generate fair text labels but have skewed confidence scores, and that fairness knowledge is unevenly distributed, peaking in intermediate layers and eroding in the final layers. They also find that different residual streams within a single hidden layer carry conflicting social knowledge. To address these issues, they propose RES-FAIR, a post-hoc framework that projects hidden states away from biased residual directions while amplifying fair components. Evaluations on PAIRS and SocialCounterfactuals show that RES-FAIR improves response fairness and confidence calibration without compromising general reasoning abilities.
本文通过分析表面级响应和内部机制,研究了最先进的视觉-语言模型(VLMs)中的性别和种族偏见。发现了三个关键发现:公平悖论、公平知识在各层中的波动以及隐藏层内的残差差异。基于这些见解,作者提出了RES-FAIR框架,通过重新校准隐藏状态来减轻偏见。评估结果显示,RES-FAIR在提高响应公平性和置信度校准的同时,不会影响一般推理能力。
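The hidden-state adjustment described for RES-FAIR above, projecting away from biased residual directions while amplifying fair components, corresponds to a standard vector projection. The sketch below shows that operation for one biased and one fair direction. How the directions are localized is not covered, and `adjust_hidden` with its `alpha` gain is a hypothetical illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Return v scaled to unit length (v must be non-zero)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def adjust_hidden(h, biased_dir, fair_dir, alpha=0.5):
    """Remove the biased component of h, then boost its fair component."""
    b = normalize(biased_dir)
    f = normalize(fair_dir)
    h_b = dot(h, b)
    debiased = [x - h_b * bi for x, bi in zip(h, b)]  # orthogonal to b
    h_f = dot(debiased, f)
    return [x + alpha * h_f * fi for x, fi in zip(debiased, f)]
```

When the fair direction is orthogonal to the biased one, the output has zero component along the biased direction and a fair component scaled by `1 + alpha`.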
Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation
Authors: Minggui He, Mingchen Dai, Jian Zhang, Yilun Liu, Shimin Tao, Pufan Zeng, Osamu Yoshie, Yuya Ieiri
First: 2026-02-11T14:08:06+00:00 · Latest: 2026-02-11T14:08:06+00:00
Comments: Under review
Abstract
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images, yet achieving structural fidelity remains challenging. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation rather than faithful modeling of underlying chart structure, which often leads to hallucinated or semantically inconsistent outputs. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision. Chart Specification filters syntactic noise to construct a structurally balanced training set and supports a Spec-Align Reward that provides fine-grained, verifiable feedback on structural correctness, enabling reinforcement learning to enforce consistent plotting logic. Experiments on three public benchmarks show that our method consistently outperforms prior approaches. With only 3K training samples, we achieve strong data efficiency, surpassing leading baselines by up to 61.7% on complex benchmarks, and scaling to 4K samples establishes new state-of-the-art results across all evaluated metrics. Overall, our results demonstrate that precise structural supervision offers an efficient pathway to high-fidelity chart-to-code generation. Code and dataset are available at: https://github.com/Mighten/chart-specification-paper
中文标题/摘要
标题:图表规范:激励VLM进行图表到代码生成的结构表示
视觉-语言模型(VLMs)在从图表图像生成绘图代码方面显示出潜力,但实现结构保真度仍然具有挑战性。现有方法主要依赖于监督微调,鼓励表面级别的标记模仿而不是对底层图表结构的忠实建模,这通常会导致幻觉或语义不一致的输出。我们提出了图表规范,这是一种结构化的中间表示,将训练从文本模仿转移到语义上可验证的监督。图表规范过滤掉语法噪声,构建一个结构平衡的训练集,并支持Spec-Align奖励,提供结构正确性的细粒度、可验证反馈,使强化学习能够强制执行一致的绘图逻辑。在三个公开基准上的实验表明,我们的方法在所有评估指标上都优于先前的方法。仅使用3000个训练样本,我们实现了强大的数据效率,在复杂基准上超越领先基线高达61.7%,扩展到4000个样本在所有评估指标上建立了新的最佳结果。总体而言,我们的结果表明,精确的结构监督为高保真度的图表到代码生成提供了一条高效途径。代码和数据集可在:https://github.com/Mighten/chart-specification-paper 获取。
Summary / 总结
The research aims to improve the structural fidelity of chart-to-code generation using Vision-Language Models (VLMs) by proposing Chart Specification, a structured intermediate representation. This method shifts training focus from text imitation to semantically grounded supervision, filtering out syntactic noise and providing fine-grained feedback through a Spec-Align Reward. Experiments on three public benchmarks show that the proposed method outperforms existing approaches, achieving strong data efficiency and surpassing leading baselines by up to 61.7% on complex benchmarks. Overall, the results indicate that precise structural supervision is an effective approach for high-fidelity chart-to-code generation.
研究旨在通过提出结构化中间表示Chart Specification,提高Vision-Language Models (VLMs)在生成图表代码时的结构准确性。该方法过滤掉语法噪声,并通过Spec-Align Reward提供语义上一致的监督,增强模型生成一致绘图逻辑的能力。实验表明,该方法在三个公开基准上优于现有方法,实现了强大的数据效率,并在复杂基准上取得了61.7%的改进,达到新的最佳结果。
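A verifiable structural reward of the kind the Chart Specification abstract above attributes to its Spec-Align Reward could compare a predicted specification with a reference field by field. The sketch below assumes specifications are nested dictionaries and scores them with an F1 over flattened (path, value) pairs. The schema, the function names, and the F1 choice are hypothetical; the abstract does not define the actual reward.

```python
def flatten(spec, prefix=""):
    """Turn a nested dict into a set of (dotted_path, repr(value)) pairs."""
    items = set()
    for key, value in spec.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            items |= flatten(value, path)
        else:
            items.add((path, repr(value)))
    return items


def spec_align_reward(pred, ref):
    """F1 overlap between predicted and reference specification fields."""
    p, r = flatten(pred), flatten(ref)
    hits = len(p & r)
    if hits == 0:
        return 0.0
    precision = hits / len(p)
    recall = hits / len(r)
    return 2 * precision * recall / (precision + recall)
```

Because every field either matches or does not, the score is fully checkable, which is the property that makes such a reward usable for reinforcement learning.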
SimuScene: Training and Benchmarking Code Generation to Simulate Physical Scenarios
Authors: Yanan Wang, Renxi Wang, Yongxin Wang, Xuezhi Liang, Fajri Koto, Timothy Baldwin, Xiaodan Liang, Haonan Li
First: 2026-02-11T13:26:02+00:00 · Latest: 2026-02-11T13:26:02+00:00
Abstract
Large language models (LLMs) have been extensively studied for tasks like math competitions, complex coding, and scientific reasoning, yet their ability to accurately represent and simulate physical scenarios via code remains underexplored. We propose SimuScene, the first systematic study that trains and evaluates LLMs on simulating physical scenarios across five physics domains and 52 physical concepts. We build an automatic pipeline to collect data, with human verification to ensure quality. The final dataset contains 7,659 physical scenarios with 334 human-verified examples as the test set. We evaluated 10 contemporary LLMs and found that even the strongest model achieves only a 21.5% pass rate, demonstrating the difficulty of the task. Finally, we introduce a reinforcement learning pipeline with visual rewards that uses a vision-language model as a judge to train textual models. Experiments show that training with our data improves physical simulation via code while substantially enhancing general code generation performance.
中文标题/摘要
标题:SimuScene: 训练和基准测试代码生成以模拟物理场景
大型语言模型(LLMs)已在数学竞赛、复杂编程和科学推理等领域进行了广泛研究,但它们通过代码准确表示和模拟物理场景的能力仍未得到充分探索。我们提出了SimuScene,这是首个系统性研究,训练和评估LLMs在五个物理学领域和52个物理概念上模拟物理场景。我们构建了一个自动数据收集管道,并通过人工验证确保数据质量。最终数据集包含7,659个物理场景,其中334个人工验证的示例作为测试集。我们评估了10个当代LLM,发现最强的模型通过率仅为21.5%,表明任务的难度。最后,我们引入了一种基于视觉奖励的强化学习管道,使用视觉语言模型作为裁判来训练文本模型。实验表明,使用我们的数据进行训练可以提升通过代码进行物理模拟的能力,同时显著增强通用代码生成性能。
Summary / 总结
The research aims to explore the capability of large language models (LLMs) in simulating physical scenarios through code, which has been underexplored. The study introduces SimuScene, a dataset containing 7,659 physical scenarios across five physics domains, and evaluates 10 contemporary LLMs, finding that even the strongest model only achieves a 21.5% pass rate. The research also proposes a reinforcement learning pipeline using visual rewards to train textual models, showing improvements in both physical simulation and general code generation performance.
研究旨在探索大型语言模型(LLMs)在通过代码模拟物理场景方面的能力,这方面的研究尚不充分。研究引入了SimuScene数据集,包含7,659个物理场景,覆盖五个物理领域,并有334个人工验证的示例用于测试。评估10个当前的LLM模型后,研究发现最强的模型仅达到21.5%的通过率。研究还提出了一种结合视觉奖励的强化学习训练管道,用于训练文本模型,结果显示这不仅提高了物理模拟的性能,还显著增强了通用代码生成的能力。
Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
Authors: Aojun Lu, Tao Feng, Hangjie Yuan, Wei Li, Yanan Sun
First: 2026-02-11T12:55:15+00:00 · Latest: 2026-02-11T12:55:15+00:00
Abstract
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
中文标题/摘要
标题:为什么RL比SFT泛化更好?基于数据视角的VLM后训练解释
通过后训练对大规模视觉-语言模型(VLMs)进行适应揭示了一个明显的泛化差距:使用强化学习(RL)微调的模型在分布外(OOD)性能上始终优于使用监督微调(SFT)训练的模型。本文从数据视角解释了这一现象,认为RL的泛化优势源于一种隐含的数据过滤机制,该机制本质上优先处理中等难度的训练样本。为了验证这一假设,我们系统地评估了不同难度级别的训练数据集上SFT模型的分布外泛化能力。结果证实了数据难度是一个关键因素,表明在困难样本上进行训练会显著降低分布外性能。受此发现的启发,我们提出了难度筛选监督微调(DC-SFT),这是一种简单的方法,可以显式地根据样本难度筛选训练集。实验表明,DC-SFT不仅在分布外泛化上显著优于标准SFT,而且在性能上也超过了基于RL的训练,同时提供了更高的稳定性和计算效率。本文为VLMs的分布外泛化差距提供了数据视角的解释,并建立了更高效的稳健泛化途径。代码可在https://github.com/byyx666/DC-SFT获取。
Summary / 总结
This paper investigates why Reinforcement Learning (RL) outperforms Supervised Fine-Tuning (SFT) in out-of-distribution (OOD) generalization for Vision-Language Models (VLMs). It proposes that RL's advantage stems from an implicit data filtering mechanism that prioritizes medium-difficulty samples. Experiments confirm that training on hard samples degrades OOD performance, leading to the introduction of Difficulty-Curated SFT (DC-SFT), which filters training data based on difficulty. DC-SFT improves OOD generalization and surpasses RL-based training while being more stable and computationally efficient.
本文探讨了为什么强化学习(RL)在视觉语言模型(VLM)的分布外(OOD)泛化中优于监督微调(SFT)。研究提出,RL的优势源于一种隐含的数据过滤机制,优先选择中等难度的样本。实验表明,使用困难样本进行训练会降低OOD泛化性能,因此提出了基于难度筛选的监督微调(DC-SFT)方法,该方法根据样本难度筛选训练集。DC-SFT不仅显著提高了OOD泛化性能,还超过了基于RL的训练方法,同时具有更高的稳定性和计算效率。
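The difficulty-based filtering step described for DC-SFT can be illustrated with a minimal sketch. The abstract does not say how difficulty is measured or where the cut-off lies, so everything concrete here is an assumption made for illustration: the `Sample` type, the `pass_rate` field (imagined as the fraction of correct attempts by some reference model), and the `max_difficulty` threshold are all hypothetical and are not taken from the DC-SFT code.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    answer: str
    # Hypothetical difficulty proxy: fraction of reference-model attempts
    # that answered this sample correctly (1.0 = very easy, 0.0 = very hard).
    pass_rate: float

def curate_by_difficulty(samples, max_difficulty=0.8):
    """Keep samples whose estimated difficulty (1 - pass_rate) does not
    exceed max_difficulty; the threshold is an assumed value."""
    kept = []
    for s in samples:
        difficulty = 1.0 - s.pass_rate
        if difficulty <= max_difficulty:
            kept.append(s)
    return kept

samples = [
    Sample("q1", "a1", pass_rate=0.9),   # easy
    Sample("q2", "a2", pass_rate=0.5),   # medium
    Sample("q3", "a3", pass_rate=0.05),  # hard, filtered out
]
curated = curate_by_difficulty(samples)
print([s.prompt for s in curated])  # prints ['q1', 'q2']
```

The curated list would then be passed to an ordinary SFT loop unchanged, which is what would make such a filter cheap compared with RL-based training.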
DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
Authors: Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
First: 2026-02-11T12:51:10+00:00 · Latest: 2026-02-11T12:51:10+00:00
Comments: 17 pages, 5 figures
Abstract
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
中文标题/摘要
标题:DeepImageSearch:针对视觉历史中的上下文感知图像检索的多模态代理基准测试
现有的多模态检索系统在语义匹配方面表现出色,但隐含地假设查询图像的相关性可以在孤立状态下进行测量。这种范式忽视了现实视觉流中固有的丰富依赖关系,其中信息分布在时间序列中,而不是局限于单个快照。为了弥合这一差距,我们引入了DeepImageSearch,这是一种新颖的代理范式,将图像检索重新定义为自主探索任务。模型必须计划并执行多步推理,基于隐含的上下文线索在原始视觉历史中定位目标。我们构建了DISBench,这是一个基于互联视觉数据的具有挑战性的基准测试。为了应对创建上下文相关查询的可扩展性挑战,我们提出了一种人机协作管道,利用视觉语言模型挖掘潜在的时空关联,有效地在人工验证之前卸载了密集的上下文发现。此外,我们构建了一个稳健的基线,使用模块化代理框架,配备精细的工具和双记忆系统进行长时导航。广泛的实验表明,DISBench对最先进的模型提出了重大挑战,突显了将代理推理纳入下一代检索系统中的必要性。
Summary / 总结
DeepImageSearch aims to address the limitations of existing multimodal retrieval systems by introducing an agentic paradigm that considers the context within visual histories. It reformulates image retrieval as an autonomous exploration task, requiring models to reason over raw visual data. The DISBench benchmark, built on interconnected visual data, poses significant challenges to state-of-the-art models, emphasizing the need for agentic reasoning in next-generation retrieval systems.
DeepImageSearch旨在通过引入一种考虑时间序列上下文的代理式范式来解决现有跨模态检索系统的局限性。该系统将图像检索重新定义为一个自主探索任务,要求模型在视觉历史中进行多步推理。DISBench是一个具有挑战性的基准,用于评估这些模型,提出了一种人-模型协作管道来挖掘时空关联。实验表明,DISBench提出了重大挑战,强调未来检索系统中需要包含代理式推理的重要性。
DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples
Authors: Zi Wang, Katsuya Hotta, Koichiro Kamide, Yawen Zou, Jianjian Qin, Chao Zhang, Jun Yu
First: 2026-02-11T12:47:38+00:00 · Latest: 2026-02-11T12:47:38+00:00
Abstract
Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, by converting point clouds into a fixed set of realistic depth images, our method leverages a frozen CLIP visual encoder to extract multi-view representations and performs anomaly detection via weighted feature similarity, which does not require any fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance under the few-shot setting. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
Summary / 总结
The research aims to develop a method for cross-category 3D anomaly detection using a few normal samples without category-specific training. DMP-3DAD uses multi-view realistic depth map projection and a frozen CLIP visual encoder to extract multi-view representations for anomaly detection. Experiments on the ShapeNetPart dataset show that DMP-3DAD outperforms existing methods in few-shot scenarios, providing a simple and effective solution for practical cross-category 3D anomaly detection.
研究旨在开发一种使用少量正常样本进行跨类别3D异常检测的灵活方法。DMP-3DAD是一种无需训练的框架,将3D点云投影到现实深度图,并使用冻结的CLIP视觉编码器提取多视图表示进行异常检测。实验表明,DMP-3DAD在少量样本设置下优于现有方法,提供了一种简单而有效的跨类别3D异常检测解决方案。
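The "weighted feature similarity" scoring that DMP-3DAD applies to multi-view features can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the short lists of numbers stand in for frozen CLIP embeddings of depth-map views, and the function names, the max-over-references rule, and the view weights are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def anomaly_score(query_views, normal_refs, view_weights):
    """Score a test object against a few normal reference objects.

    query_views:  one feature vector per rendered view of the test object.
    normal_refs:  list of reference objects, each a list of per-view features.
    view_weights: non-negative weight per view (assumed, e.g. uniform).
    Returns 1 - weighted similarity, so higher means more anomalous.
    """
    total = sum(view_weights)
    weighted_sim = 0.0
    for v, (q, w) in enumerate(zip(query_views, view_weights)):
        # Compare each view with the same view of the closest normal reference.
        best = max(cosine(q, ref[v]) for ref in normal_refs)
        weighted_sim += (w / total) * best
    return 1.0 - weighted_sim

# Toy 2-view, 2-dimensional features standing in for CLIP embeddings.
normal_refs = [[[1.0, 0.0], [0.0, 1.0]]]
similar_obj = [[2.0, 0.0], [0.0, 3.0]]
different_obj = [[0.0, 1.0], [1.0, 0.0]]
score_similar = anomaly_score(similar_obj, normal_refs, [1.0, 1.0])
score_different = anomaly_score(different_obj, normal_refs, [1.0, 1.0])
```

Because nothing is trained, adding another normal sample in this sketch only means appending its per-view features to `normal_refs`.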
Non-Contrastive Vision-Language Learning with Predictive Embedding Alignment
Authors: Lukas Kuhn, Giuseppe Serra, Florian Buettner
First: 2026-01-31T10:57:46+00:00 · Latest: 2026-02-11T12:42:24+00:00
Abstract
Vision-language models have transformed multimodal representation learning, yet dominant contrastive approaches like CLIP require large batch sizes, careful negative sampling, and extensive hyperparameter tuning. We introduce NOVA, a NOn-contrastive Vision-language Alignment framework based on joint embedding prediction with distributional regularization. NOVA aligns visual representations to a frozen, domain-specific text encoder by predicting text embeddings from augmented image views, while enforcing an isotropic Gaussian structure via Sketched Isotropic Gaussian Regularization (SIGReg). This eliminates the need for negative sampling, momentum encoders, or stop-gradients, reducing the training objective to a single hyperparameter. We evaluate NOVA on zero-shot chest X-ray classification using ClinicalBERT as the text encoder and Vision Transformers trained from scratch on MIMIC-CXR. On zero-shot classification across three benchmark datasets, NOVA outperforms multiple standard baselines while exhibiting substantially more consistent training runs. Our results demonstrate that non-contrastive vision-language pretraining offers a simpler, more stable, and more effective alternative to contrastive methods.
中文标题/摘要
标题:非对比视觉-语言学习与预测嵌入对齐
视觉-语言模型已经改变了多模态表示学习,但主导的对比学习方法如CLIP需要大批次大小、精细的负样本采样和广泛的超参数调整。我们引入了NOVA,这是一种基于联合嵌入预测和分布正则化的非对比视觉-语言对齐框架。NOVA通过从增强的图像视图预测文本嵌入来将视觉表示与冻结的领域特定文本编码器对齐,同时通过草图化各向同性高斯正则化(SIGReg)强制各向同性高斯结构。这消除了负样本采样、动量编码器或停止梯度的需求,将训练目标简化为一个超参数。我们使用ClinicalBERT作为文本编码器,并在MIMIC-CXR上从零开始训练Vision Transformers,对零样本胸部X光分类进行评估。在三个基准数据集上的零样本分类中,NOVA优于多个标准基线,同时训练运行的一致性明显更高。我们的结果表明,非对比视觉-语言预训练是对比方法的一种更简单、更稳定且更有效的替代方案。
Summary / 总结
The paper introduces NOVA, a non-contrastive vision-language alignment framework that uses joint embedding prediction with distributional regularization to align visual representations with a frozen text encoder. NOVA eliminates the need for negative sampling and momentum encoders, reducing the training objective to a single hyperparameter. Evaluations on zero-shot chest X-ray classification show that NOVA outperforms standard baselines and demonstrates more consistent training runs.
论文提出了NOVA框架,该框架通过联合嵌入预测和分布正则化来对齐视觉表示与冻结的文本编码器。NOVA消除了负样本采样、动量编码器和停止梯度的需求,将训练目标简化为一个超参数。在零样本胸部X光分类中,NOVA优于多个标准基线,并表现出更一致的训练运行。
RSHallu: Dual-Mode Hallucination Evaluation for Remote-Sensing Multimodal Large Language Models with Domain-Tailored Mitigation
Authors: Zihui Zhou, Yong Feng, Yanying Chen, Guofan Duan, Zhenxi Song, Mingliang Zhou, Weijia Jia
First: 2026-02-11T12:41:33+00:00 · Latest: 2026-02-11T12:41:33+00:00
Abstract
Multimodal large language models (MLLMs) are increasingly adopted in remote sensing (RS) and have shown strong performance on tasks such as RS visual grounding (RSVG), RS visual question answering (RSVQA), and multimodal dialogue. However, hallucinations, which are responses inconsistent with the input RS images, severely hinder their deployment in high-stakes scenarios (e.g., emergency management and agricultural monitoring) and remain under-explored in RS. In this work, we present RSHallu, a systematic study with three deliverables: (1) we formalize RS hallucinations with an RS-oriented taxonomy and introduce image-level hallucination to capture RS-specific inconsistencies beyond object-centric errors (e.g., modality, resolution, and scene-level semantics); (2) we build a hallucination benchmark RSHalluEval (2,023 QA pairs) and enable dual-mode checking, supporting high-precision cloud auditing and low-cost reproducible local checking via a compact checker fine-tuned on RSHalluCheck dataset (15,396 QA pairs); and (3) we introduce a domain-tailored dataset RSHalluShield (30k QA pairs) for training-friendly mitigation and further propose training-free plug-and-play strategies, including decoding-time logit correction and RS-aware prompting. Across representative RS-MLLMs, our mitigation improves the hallucination-free rate by up to 21.63 percentage points under a unified protocol, while maintaining competitive performance on downstream RS tasks (RSVQA/RSVG). Code and datasets will be released.
中文标题/摘要
标题:RSHallu:针对遥感多模态大型语言模型的双模式幻觉评估及其领域定制化缓解
多模态大型语言模型(MLLMs)在遥感(RS)领域中被越来越多地采用,并在遥感视觉定位(RSVG)、遥感视觉问答(RSVQA)和多模态对话等任务中表现出强大的性能。然而,不一致的幻觉响应严重阻碍了它们在高风险场景(如紧急管理和农业监测)中的部署,并且在遥感领域中对此类幻觉的研究仍然不足。在本文中,我们提出了RSHallu,一个系统性研究,包含三个成果:(1)我们以遥感为导向的分类法形式化了RS幻觉,并引入了图像级幻觉以捕捉超出对象中心错误(如模态、分辨率和场景级语义)的RS特定不一致性;(2)我们构建了幻觉基准RSHalluEval(2,023个问答对),并支持双模式检查,通过一个在RSHalluCheck数据集(15,396个问答对)上微调的紧凑型检查器,实现高精度云审计和低成本可重复的本地检查;(3)我们引入了领域定制化数据集RSHalluShield(30,000个问答对)用于训练友好的缓解,并进一步提出了无需训练的即插即用策略,包括解码时logit校正和遥感感知提示。在代表性遥感MLLMs中,我们的缓解措施在统一协议下将无幻觉率提高了最多21.63个百分点,同时在下游遥感任务(RSVQA/RSVG)上保持了竞争力。代码和数据集将被发布。
Summary / 总结
This work introduces RSHallu, a systematic study on hallucinations in remote sensing (RS) multimodal large language models (MLLMs). It formalizes RS hallucinations with a taxonomy, introduces image-level hallucinations, and builds a benchmark RSHalluEval. The study also proposes a domain-tailored dataset RSHalluShield and mitigation strategies, improving the hallucination-free rate by up to 21.63 percentage points while maintaining performance on downstream RS tasks. Dual-mode checking is enabled for both cloud auditing and local checking. Code and datasets are to be released.
这项研究介绍了RSHallu,对遥感(RS)多模态大型语言模型(MLLMs)中的幻觉进行了系统研究。它提出了RS幻觉的分类法,引入了图像级幻觉,并构建了基准RSHalluEval。该研究还提出了一个领域定制的数据集RSHalluShield和缓解策略,将无幻觉率提高了最高21.63个百分点,同时在下游RS任务(如RSVQA/RSVG)上保持了竞争力。还实现了支持云审计和本地检查的双模式检查。代码和数据集将要发布。
Defect-aware Hybrid Prompt Optimization via Progressive Tuning for Zero-Shot Multi-type Anomaly Detection and Segmentation
Authors: Nadeem Nazer, Hongkuan Zhou, Lavdim Halilaj, Ylli Sadikaj, Steffen Staab
First: 2025-12-10T09:19:17+00:00 · Latest: 2026-02-11T12:13:18+00:00
Abstract
Recent vision language models (VLMs) like CLIP have demonstrated impressive anomaly detection performance under significant distribution shift by utilizing high-level semantic information through text prompts. However, these models often neglect fine-grained details, such as the specific type of anomaly (e.g., "hole", "cut", or "scratch"), which could provide more specific insight into the nature of anomalies. We argue that recognizing fine-grained anomaly types 1) enriches the representation of "abnormal" with structured semantics, narrowing the gap between coarse anomaly signals and fine-grained defect categories; 2) enables manufacturers to understand the root causes of the anomaly and implement more targeted and appropriate corrective measures quickly. While incorporating such detailed semantic information is crucial, designing handcrafted prompts for each defect type is both time-consuming and susceptible to human bias. For this reason, we introduce DAPO, a novel approach for Defect-aware Prompt Optimization based on progressive tuning for zero-shot multi-type and binary anomaly detection and segmentation under distribution shifts. Our approach aligns anomaly-relevant image features with their corresponding text semantics by learning hybrid defect-aware prompts with both fixed textual anchors and learnable token embeddings. We conducted experiments on public benchmarks (MPDD, VisA, MVTec-AD, MAD, and Real-IAD) and an internal dataset. The results suggest that compared to the baseline models, DAPO achieves a 3.7% average improvement in AUROC and average precision metrics at the image level under distribution shift, and a 6.5% average improvement in localizing novel anomaly types under zero-shot settings.
中文标题/摘要
标题:基于渐进调优的缺陷感知混合提示优化,用于零样本多类型异常检测与分割
近期的视觉语言模型(VLMs)如CLIP通过利用文本提示中的高层语义信息,在显著分布偏移下展示了令人印象深刻的异常检测性能。然而,这些模型往往忽略了细微的细节,例如“孔洞”、“切割”、“划痕”等具体的异常类型,这些信息能提供更具体的异常本质洞察。我们认为,识别细微的异常类型1)通过结构化的语义丰富了“异常”的表示,缩小了粗略异常信号与细微缺陷类别的差距;2)使制造商能够快速理解异常的根本原因并实施更针对性和适当的纠正措施。虽然整合这种详细的语义信息至关重要,但为每种缺陷类型设计手工制作的提示既耗时又容易受到人为偏见的影响。因此,我们提出了DAPO,一种基于渐进调优的缺陷感知提示优化方法,用于零样本多类型和二元异常检测与分割下的分布偏移。我们的方法通过学习混合的缺陷感知提示,结合固定的文本锚点和可学习的标记嵌入,使异常相关的图像特征与相应的文本语义对齐。我们在公共基准(MPDD、VisA、MVTec-AD、MAD和Real-IAD)和内部数据集上进行了实验。结果表明,与基线模型相比,DAPO在分布偏移下实现了图像级别3.7%的平均AUROC和平均精度改进,在零样本设置下定位新型异常类型时实现了6.5%的平均改进。
Summary / 总结
The research aims to improve zero-shot multi-type anomaly detection and segmentation by incorporating fine-grained defect information through a novel approach called DAPO, which uses progressive tuning to align image features with text semantics. The method introduces hybrid defect-aware prompts combining fixed textual anchors and learnable token embeddings. Experiments on various benchmarks show that DAPO outperforms baseline models, achieving a 3.7% average improvement in AUROC and average precision metrics under distribution shift and a 6.5% average improvement in localizing novel anomaly types in zero-shot settings.
该论文提出了DAPO,一种基于渐进调优的缺陷感知提示优化方法,用于零样本多类型异常检测和分割。它通过引入细粒度的异常类型来弥补现有视觉语言模型如CLIP的不足,提供更具体的洞察。实验结果显示,与基线模型相比,DAPO在分布偏移下的图像级AUROC和平均精度平均提升3.7%,在零样本设置下对新型异常类型的定位平均提升6.5%。
Kelix Technique Report
Authors: Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang
First: 2026-02-10T14:48:26+00:00 · Latest: 2026-02-11T12:04:36+00:00
Comments: Work in progress
Abstract
Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.
中文标题/摘要
标题:Kelix 技术报告
自回归大型语言模型(LLMs)通过将多样化的任务表示为离散自然语言标记序列,并通过下一个标记预测进行训练,从而能够很好地扩展,这将理解与生成统一在自我监督之下。将这一范式扩展到多模态数据需要在不同模态之间共享离散表示。然而,大多数视觉语言模型(VLMs)仍然依赖于混合界面:离散文本标记配以连续的视觉变换器(ViT)特征。由于监督主要来自文本,这些模型往往偏向于理解,而无法充分利用大规模的非文本数据的自我监督学习。最近的工作探索了离散视觉标记化,以实现完全自回归的多模态建模,显示出统一理解和生成的有希望的进展。然而,现有的离散视觉标记经常由于编码容量有限而丢失信息,导致理解能力明显弱于连续特征的VLMs。我们提出了Kelix,一种完全离散的自回归统一模型,以弥合离散和连续视觉表示之间的理解差距。
From Steering to Pedalling: Do Autonomous Driving VLMs Generalize to Cyclist-Assistive Spatial Perception and Planning?
Authors: Krishna Kanth Nakka, Vedasri Nakka
First: 2026-02-11T12:01:37+00:00 · Latest: 2026-02-11T12:01:37+00:00
Comments: Preprint
Abstract
Cyclists often encounter safety-critical situations in urban traffic, highlighting the need for assistive systems that support safe and informed decision-making. Recently, vision-language models (VLMs) have demonstrated strong performance on autonomous driving benchmarks, suggesting their potential for general traffic understanding and navigation-related reasoning. However, existing evaluations are predominantly vehicle-centric and fail to assess perception and reasoning from a cyclist-centric viewpoint. To address this gap, we introduce CyclingVQA, a diagnostic benchmark designed to probe perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective. Evaluating 31+ recent VLMs spanning general-purpose, spatially enhanced, and autonomous-driving-specialized models, we find that current models demonstrate encouraging capabilities, while also revealing clear areas for improvement in cyclist-centric perception and reasoning, particularly in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, indicating limited transfer from vehicle-centric training to cyclist-assistive scenarios. Finally, through systematic error analysis, we identify recurring failure modes to guide the development of more effective cyclist-assistive intelligent systems.
中文标题/摘要
标题:从操控转向踏蹬:自动驾驶VLMs能否在骑行辅助的空间感知与规划中泛化?
城市交通中,骑车人经常遇到安全关键情况,突显了需要支持安全和明智决策的辅助系统的必要性。最近,视觉-语言模型(VLMs)在自动驾驶基准测试中表现出色,表明其在通用交通理解和导航相关推理方面的潜力。然而,现有的评估主要以车辆为中心,未能从骑车人的视角评估感知和推理能力。为解决这一差距,我们引入了CyclingVQA,这是一个诊断基准,旨在从骑车人的视角探究感知、时空理解和交通规则到车道推理。评估了31+种最近的VLMs,涵盖通用、空间增强和自动驾驶专用模型,我们发现当前模型展示了令人鼓舞的能力,但也揭示了在骑车人中心的感知和推理方面存在明显改进空间,特别是在解释骑车人特定的交通提示和将标志与正确的导航车道关联方面。值得注意的是,一些驾驶专用模型的表现不如强大的通用VLMs,表明从车辆中心训练到骑行辅助场景的迁移有限。最后,通过系统性错误分析,我们确定了反复出现的失败模式,以指导更有效的骑行辅助智能系统的开发。
Summary / 总结
The study evaluates how well vision-language models (VLMs) generalize to cyclist-centric scenarios, addressing a gap left by existing vehicle-centric evaluations. It introduces CyclingVQA, a diagnostic benchmark that probes perception, spatio-temporal understanding, and traffic-rule-to-lane reasoning from a cyclist's perspective, and uses it to assess 31+ VLMs. The findings indicate that current models show promising capabilities but still need improvement in interpreting cyclist-specific traffic cues and associating signs with the correct navigational lanes. Notably, several driving-specialized models underperform strong generalist VLMs, which points to limited transfer from vehicle-centric training to cyclist-assistive scenarios.
研究探讨了是否可以利用为自动驾驶训练的视觉语言模型(VLMs)来支持城市交通中的骑行者。为了解决缺乏骑行者为中心的评估问题,研究人员开发了CyclingVQA这一诊断基准。评估了31+个VLMs后,他们发现虽然当前模型表现出一定的潜力,但在解读骑行者特定的交通提示和将标志与正确的导航车道关联方面仍存在困难,这表明在骑行者辅助场景中的感知和推理方面仍需改进。
Beyond Gemini-3-Pro: Revisiting LLM Routing and Aggregation at Scale
Authors: Shengji Tang, Weihao Lin, Peng Ye, Jingqi Ye, Hao Li, Yiqun Zhang, Xiaosong Wang, Bo Zhang, Shuyue Hu, Tao Chen, Lei Bai, Wanli Ouyang
First: 2026-01-04T02:05:52+00:00 · Latest: 2026-02-11T11:11:48+00:00
Comments: 21 pages
Abstract
Large Language Models (LLMs) have rapidly advanced, with Gemini-3-Pro setting a new performance milestone. In this work, we explore collective intelligence as an alternative to monolithic scaling, and demonstrate that open-source LLMs' collaboration can surpass Gemini-3-Pro. We first revisit LLM routing and aggregation at scale and identify three key bottlenecks: (1) current train-free routers are limited by a query-based paradigm focusing solely on textual similarity; (2) recent aggregation methods remain largely static, failing to select appropriate aggregators for different tasks; (3) the complementarity of routing and aggregation remains underutilized. To address these problems, we introduce JiSi, a novel framework designed to release the full potential of LLMs' collaboration through three innovations: (1) Query-Response Mixed Routing capturing both semantic information and problem difficulty; (2) Support-Set-based Aggregator Selection jointly evaluating the aggregation and domain capacity of aggregators; (3) Adaptive Routing-Aggregation Switch dynamically leveraging the advantages of routing and aggregation. Comprehensive experiments on nine benchmarks demonstrate that JiSi can surpass Gemini-3-Pro at only 47% of the cost by orchestrating ten open-source LLMs, while outperforming mainstream baselines. It suggests that collective intelligence represents a novel path towards Artificial General Intelligence (AGI).
中文标题/摘要
标题:超越Gemini-3-Pro:重新审视大规模LLM路由与聚合
大型语言模型(LLMs)迅速发展,Gemini-3-Pro设立了新的性能里程碑。本文探讨了集体智能作为替代单一扩展的方法,并证明开源LLMs的合作可以超越Gemini-3-Pro。我们首先重新审视大规模LLM路由与聚合,并识别三个关键瓶颈:(1)当前的无训练路由器受限于基于查询的模式,仅关注文本相似性;(2)最近的聚合方法仍然相对静态,无法为不同任务选择合适的聚合器;(3)路由与聚合的互补性尚未充分利用。为了解决这些问题,我们引入了JiSi,这是一种通过三项创新来释放LLMs合作潜力的新框架:(1)查询-响应混合路由,同时捕捉语义信息和问题难度;(2)基于支持集的聚合器选择,联合评估聚合器的聚合能力和领域容量;(3)自适应路由-聚合切换,动态利用路由和聚合的优势。在九个基准上的全面实验表明,JiSi仅通过协调十个开源LLMs,就能以47%的成本超越Gemini-3-Pro,同时优于主流基线。这表明集体智能代表了通向通用人工智能(AGI)的新途径。
Summary / 总结
This work revisits LLM routing and aggregation at scale as an alternative to monolithic scaling and identifies three bottlenecks: query-only train-free routers, largely static aggregation, and underused complementarity between routing and aggregation. To address them it proposes JiSi, a framework that introduces Query-Response Mixed Routing, Support-Set-based Aggregator Selection, and an Adaptive Routing-Aggregation Switch. Across nine benchmarks, JiSi surpasses Gemini-3-Pro at only 47% of the cost by orchestrating ten open-source LLMs, suggesting a new path towards AGI through collective intelligence.
该研究重新审视了大规模语言模型(LLM)路由和聚合中的挑战,指出现有基于查询的路由和静态聚合方法的局限性。为了解决这些问题,作者提出了JiSi框架,引入了查询-响应混合路由、基于支持集的聚合器选择以及自适应路由-聚合切换。JiSi通过协调十个开源LLM,在九个基准测试中仅以47%的成本超越了Gemini-3-Pro,表明通过集体智能可能开辟一条通往AGI的新路径。
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
First: 2025-09-21T06:54:04+00:00 · Latest: 2026-02-11T09:45:24+00:00
Comments: 20 pages, 6 figures
Abstract
Multimodal Large Language Models (MLLMs) require high-resolution visual information to perform fine-grained perception, yet processing entire high-resolution images is computationally prohibitive. While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient and less accurate, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process. In this paper, we propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. The SD-RPN is built around a pipeline that transforms the noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels by explicitly denoising the signal and resolving ambiguity. We use these labels to train a lightweight Region Proposal Network (RPN) that learns a more precise localization. This RPN is also highly efficient, predicting the RoI in a single forward pass using features from the MLLM's middle layers, decoupling RoI identification from the auto-regressive generation and avoiding costly multi-pass operations. To validate our approach, we integrate the framework into multiple MLLM families. Despite being trained on only a few (e.g. 10K) question-answer pairs, our method demonstrates exceptional data efficiency and generalization, achieving over a 10% absolute accuracy improvement on unseen benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a practical and scalable solution for enhancing the fine-grained perception of MLLMs without requiring costly supervision or full model fine-tuning. Code is available at https://github.com/YuHengsss/SD-RPN.
中文标题/摘要
标题:捕捉细节:自蒸馏RoI预测器实现细粒度MLLM感知
多模态大型语言模型(MLLMs)需要高分辨率的视觉信息来执行细粒度感知,但处理整个高分辨率图像在计算上是不可行的。虽然最近的方法利用区域兴趣(RoI)机制专注于显著区域,但它们通常面临一个困难的权衡:基于训练的方法依赖于大规模标注数据集,而无需训练的方法利用模型内部注意力则计算效率低下且准确性较低,需要多轮预填充阶段或依赖于缓慢的自回归解码过程。在本文中,我们提出了一种高效的、无需标注的自蒸馏区域建议网络(SD-RPN),解决了这一权衡问题。SD-RPN围绕一个管道构建,该管道将MLLM中间层的嘈杂注意力图转换为高质量的伪RoI标签,通过明确去噪和解决歧义。我们使用这些标签训练一个轻量级的区域建议网络(RPN),使其学习更精确的定位。该RPN也非常高效,在单次前向传播中使用MLLM中间层的特征预测RoI,将RoI识别与自回归生成解耦,避免了昂贵的多轮操作。为了验证我们的方法,我们将框架集成到多个MLLM家族中。尽管仅在少量(例如10K)问答对上进行训练,我们的方法仍表现出色,数据效率和泛化能力极佳,在TextVQA、DocVQA和V-Star等未见基准上实现了超过10%的绝对准确率提升。我们的工作提供了一种实用且可扩展的解决方案,无需昂贵的监督或全面模型微调即可增强MLLM的细粒度感知。代码可在https://github.com/YuHengsss/SD-RPN/ 获取。
Summary / 总结
This paper addresses the challenge of fine-grained perception in Multimodal Large Language Models (MLLMs) by proposing an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN). The SD-RPN transforms noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels, which are then used to train a lightweight RPN for precise localization. This method avoids the need for large-scale annotated datasets or computationally expensive multi-pass operations, achieving over a 10% absolute accuracy improvement on various benchmarks without full model fine-tuning. The SD-RPN predicts RoI in a single forward pass using features from the MLLM's middle layers, enhancing the model's fine-grained perception capabilities efficiently.
本文提出了一种高效的无注释Self-Distilled Region Proposal Network (SD-RPN),以解决多模态大型语言模型(MLLMs)的细粒度感知问题。SD-RPN 将 MLLM 中间层的嘈杂注意力图转换为高质量的伪RoI标签,然后使用这些标签训练一个轻量级的RPN进行精确定位。该方法避免了需要大规模标注数据集或计算密集型多遍操作,无需全模型微调即可在各种基准上实现超过10%的绝对准确率提升。SD-RPN 使用MLLM中间层的特征在单次前向传递中预测RoI,高效地增强了模型的细粒度感知能力。
Dynamic Frequency Modulation for Controllable Text-driven Image Generation
Authors: Tiandong Shi, Ling Zhao, Ji Qi, Jiayi Ma, Chengli Peng
First: 2026-02-11T09:06:44+00:00 · Latest: 2026-02-11T09:06:44+00:00
Abstract
The success of text-guided diffusion models has established a new image generation paradigm driven by the iterative refinement of text prompts. However, modifying the original text prompt to achieve the expected semantic adjustments often results in unintended global structure changes that disrupt user intent. Existing methods rely on empirical feature map selection for intervention, whose performance heavily depends on appropriate selection, leading to suboptimal stability. This paper tries to solve the aforementioned problem from a frequency perspective and analyzes the impact of the frequency spectrum of noisy latent variables on the hierarchical emergence of the structure framework and fine-grained textures during the generation process. We find that lower-frequency components are primarily responsible for establishing the structure framework in the early generation stage. Their influence diminishes over time, giving way to higher-frequency components that synthesize fine-grained textures. In light of this, we propose a training-free frequency modulation method utilizing a frequency-dependent weighting function with dynamic decay. This method maintains the structure framework consistency while permitting targeted semantic modifications. By directly manipulating the noisy latent variable, the proposed method avoids the empirical selection of internal feature maps. Extensive experiments demonstrate that the proposed method significantly outperforms current state-of-the-art methods, achieving an effective balance between preserving structure and enabling semantic updates.
中文标题/摘要
标题:可控文本驱动图像生成的动态频率调制
文本引导扩散模型的成功确立了新的基于文本提示迭代细化的图像生成范式。然而,修改原始文本提示以实现预期的语义调整往往会导致意外的整体结构变化,从而破坏用户意图。现有方法依赖于经验特征图的选择进行干预,其性能高度依赖于适当的选择,导致稳定性不足。本文从频率的角度出发,分析了噪声潜在变量的频率谱对生成过程中层次结构框架和细粒度纹理的级联出现的影响。我们发现,在早期生成阶段,低频分量主要负责建立结构框架。随着时间的推移,它们的影响减弱,让位于高频分量,这些高频分量合成细粒度纹理。鉴于此,我们提出了一种无需训练的频率调制方法,利用频率依赖的加权函数和动态衰减。该方法在保持结构框架一致性的同时,允许进行有针对性的语义修改。通过直接操作噪声潜在变量,所提出的方法避免了内部特征图的经验选择。大量实验表明,所提出的方法显著优于当前最先进的方法,实现了结构保持和语义更新的有效平衡。
Summary / 总结
This paper addresses the issue of unintended global structure changes in text-driven image generation by proposing a frequency modulation method. It analyzes the role of different frequency components in the generation process and introduces a training-free method that uses a frequency-dependent weighting function with dynamic decay. This method allows for targeted semantic modifications while maintaining structural consistency, outperforming existing methods in preserving structure and enabling semantic updates.
本文提出了一种基于频率调制的方法来解决文本驱动图像生成中意外的整体结构变化问题。研究分析了生成过程中不同频率成分的作用,发现较低频率对于初始结构框架至关重要,而较高频率则贡献于精细纹理。该方法使用具有动态衰减的频率依赖加权函数来保持结构一致性的同时允许目标语义修改,实验表明该方法在保持结构和实现语义更新方面优于现有方法。
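The idea of a frequency-dependent weighting function with dynamic decay can be illustrated with a minimal sketch. The abstract does not give the actual weighting function, so everything here is an assumption chosen for illustration: the Gaussian low-pass shape, the exponential decay over steps, the parameters `sigma` and `decay`, and the use of a reference latent from the original prompt.

```python
import numpy as np

def frequency_weight(shape, step, num_steps, sigma=0.25, decay=5.0):
    """Per-frequency weight in [0, 1]: largest at low frequencies and early steps.

    The Gaussian shape and exponential decay are illustrative assumptions,
    not the weighting function used in the paper.
    """
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequency, cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequency, cycles per pixel
    radius = np.sqrt(fx ** 2 + fy ** 2)
    low_pass = np.exp(-(radius ** 2) / (2 * sigma ** 2))  # equals 1 at the DC term
    time_decay = np.exp(-decay * step / num_steps)        # equals 1 at step 0
    return low_pass * time_decay

def modulate(edit_latent, ref_latent, step, num_steps):
    """Blend the reference latent's low frequencies into the edited latent."""
    weight = frequency_weight(edit_latent.shape[-2:], step, num_steps)
    edit_f = np.fft.fft2(edit_latent)  # FFT over the last two (spatial) axes
    ref_f = np.fft.fft2(ref_latent)
    mixed = weight * ref_f + (1.0 - weight) * edit_f
    return np.fft.ifft2(mixed).real

rng = np.random.default_rng(0)
edit = rng.normal(size=(4, 8, 8))  # hypothetical noisy latent for the edited prompt
ref = rng.normal(size=(4, 8, 8))   # hypothetical noisy latent for the original prompt
early = modulate(edit, ref, step=0, num_steps=50)
late = modulate(edit, ref, step=49, num_steps=50)
```

In this sketch the early step pulls the coarse layout toward the reference, while the late step leaves the edited latent almost untouched, which mirrors the early-structure, late-texture observation described in the abstract.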
VeriSciQA: An Auto-Verified Dataset for Scientific Visual Question Answering
Authors: Yuyi Li, Daoyuan Chen, Zhen Wang, Yutong Lu, Yaliang Li
First: 2025-11-25T04:14:52+00:00 · Latest: 2026-02-11T08:50:03+00:00
Abstract
Large Vision-Language Models (LVLMs) show promise for scientific applications, yet open-source models still struggle with Scientific Visual Question Answering (SVQA), namely answering questions about figures from scientific papers. A key bottleneck is the lack of public, large-scale, high-quality SVQA datasets. Although recent work uses LVLMs to synthesize data at scale, we identify systematic errors in their resulting QA pairs, stemming from LVLMs' inherent limitations and information asymmetry between figures and text. To address these challenges, we propose a Cross-Modal verification framework that generates questions and answers purely from figure-citing paragraphs, then verifies them against the figures themselves, leveraging the inherent text-figure alignment in scientific papers to filter out erroneous QA pairs. We instantiate this framework to curate VeriSciQA, a dataset of 20,272 QA pairs spanning 20 scientific domains and 12 figure types. Difficulty assessment reveals a notable accuracy gap between the best open-source model (65%) and the best proprietary model (80.5%), demonstrating room for improvement. Moreover, models fine-tuned on VeriSciQA achieve consistent improvements on SVQA benchmarks, with performance gains that scale with data size, surpassing models trained on existing datasets. Human evaluation further validates the improved quality of VeriSciQA. These results demonstrate that continued data expansion via our scalable framework can further advance SVQA capability in the open-source community. Our dataset is publicly available at https://huggingface.co/datasets/datajuicer/VeriSciQA.
中文标题/摘要
标题:VeriSciQA:一个自动验证的科学视觉问答数据集
大型视觉-语言模型(LVLMs)在科学应用中显示出潜力,但开源模型仍然难以应对科学视觉问答(SVQA),即回答科学论文中图表的问题。一个关键瓶颈是缺乏公开、大规模、高质量的SVQA数据集。尽管最近的工作使用LVLMs大规模生成数据,但我们发现它们生成的问答对中存在系统性错误,源于LVLMs的固有限制以及图表与文本之间的信息不对称。为了解决这些挑战,我们提出了一种跨模态验证框架,该框架仅从引用图表的段落生成问题和答案,然后通过利用科学论文中固有的图文对齐来验证它们,从而筛选出错误的问答对。我们实例化了该框架来编纂VeriSciQA数据集,该数据集包含20,272个问答对,覆盖20个科学领域和12种图表类型。难度评估显示,最佳开源模型的准确率为65%,而最佳专有模型为80.5%,表明仍有改进空间。此外,使用VeriSciQA进行微调的模型在SVQA基准测试中表现出一致的改进,性能提升随着数据量的增加而增加,超过了在现有数据集上训练的模型。进一步的人类评估还验证了VeriSciQA的质量提升。这些结果表明,通过我们的可扩展框架继续扩展数据可以进一步推动开源社区的SVQA能力。我们的数据集可在https://huggingface.co/datasets/datajuicer/VeriSciQA上公开获取。
Summary / 总结
The research addresses the shortage of high-quality data for Scientific Visual Question Answering (SVQA) by proposing VeriSciQA, a dataset built with a cross-modal verification framework. The framework generates questions and answers from figure-citing paragraphs and verifies them against the figures themselves, filtering out erroneous pairs. VeriSciQA comprises 20,272 QA pairs across 20 scientific domains and 12 figure types. On the dataset, the best open-source model reaches 65% accuracy versus 80.5% for the best proprietary model. Models fine-tuned on VeriSciQA improve consistently on SVQA benchmarks, with gains that scale with data size. Human evaluation confirms the improved quality of VeriSciQA, suggesting that continued data expansion can further advance SVQA capability in the open-source community.
研究旨在通过解决高质量数据集缺乏的问题来提高科学视觉问答(SVQA)的能力。方法是使用跨模态验证框架从图例段落生成和验证问题与答案,并过滤掉错误的问答对。关键发现表明,VeriSciQA 数据集包含 20,272 个问答对,优于现有数据集,开源模型的准确率为 65%,而专有模型为 80.5%。在 VeriSciQA 上进行微调可以实现一致的性能提升,并且人工评估确认了数据集的质量提高。
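The generate-from-text, verify-against-figure loop can be sketched as follows. This is only a schematic under stated assumptions: `generate_from_text` and `answer_from_figure` are hypothetical callables standing in for model calls, and the exact-match comparison is a placeholder for whatever agreement check the authors actually use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

@dataclass
class QAPair:
    figure_id: str
    question: str
    answer: str

def build_verified_set(
    paragraphs: Iterable[Tuple[str, str]],
    generate_from_text: Callable[[str, str], List[QAPair]],  # hypothetical text-only generator
    answer_from_figure: Callable[[str, str], str],           # hypothetical figure-only answerer
) -> List[QAPair]:
    """Keep only QA pairs whose text-derived answer is confirmed from the figure."""
    verified = []
    for figure_id, paragraph in paragraphs:
        for qa in generate_from_text(figure_id, paragraph):
            figure_answer = answer_from_figure(qa.figure_id, qa.question)
            # Placeholder agreement check: case-insensitive exact match.
            if figure_answer.strip().lower() == qa.answer.strip().lower():
                verified.append(qa)
    return verified

# Toy stubs standing in for the two model calls.
def toy_generator(figure_id, paragraph):
    return [QAPair(figure_id, "Which method is highest?", "A"),
            QAPair(figure_id, "Which method is lowest?", "C")]

def toy_figure_reader(figure_id, question):
    return "a" if "highest" in question else "B"

kept = build_verified_set([("fig1", "As shown in Fig. 1, ...")], toy_generator, toy_figure_reader)
```

Only the first toy pair survives, because the figure-side answer disagrees with the text-derived answer for the second one.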
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Authors: Xiangyan Qu, Gaopeng Gou, Jiamin Zhuang, Jing Yu, Kun Song, Qihao Wang, Yili Li, Gang Xiong
Venue: CVPR
First: 2025-02-27T07:39:23+00:00 · Latest: 2026-02-11T08:05:30+00:00
Comments: Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025
Abstract
Vision-language models (VLMs) have made significant progress in image classification by training with large-scale paired image-text data. Their performances largely depend on the prompt quality. While recent methods show that visual descriptions generated by large language models (LLMs) enhance the generalization of VLMs, class-specific prompts may be inaccurate or lack discrimination due to the hallucination in LLMs. In this paper, we aim to find visually discriminative prompts for fine-grained categories with minimal supervision and no human-in-the-loop. An evolution-based algorithm is proposed to progressively optimize language prompts from task-specific templates to class-specific descriptions. Unlike optimizing templates, the search space shows an explosion in class-specific candidate prompts. This increases prompt generation costs, iterative times, and the overfitting problem. To this end, we first introduce several simple yet effective edit-based and evolution-based operations to generate diverse candidate prompts by one-time query of LLMs. Then, two sampling strategies are proposed to find a better initial search point and reduce traversed categories, saving iteration costs. Moreover, we apply a novel fitness score with entropy constraints to mitigate overfitting. In a challenging one-shot image classification setting, our method outperforms existing textual prompt-based methods and improves LLM-generated description methods across 13 datasets. Meanwhile, we demonstrate that our optimal prompts improve adapter-based methods and transfer effectively across different backbones.
中文标题/摘要
标题:ProAPO: 视觉分类中的渐进自动提示优化
视觉语言模型(VLMs)通过大规模配对图像-文本数据训练,在图像分类方面取得了显著进展。它们的表现很大程度上取决于提示的质量。虽然最近的方法表明,由大型语言模型(LLMs)生成的视觉描述可以增强VLMs的泛化能力,但类特定的提示可能由于LLMs的幻觉而不够准确或缺乏区分性。在本文中,我们旨在在最少监督和无需人工干预的情况下,找到细粒度类别的视觉区分性提示。提出了一种基于进化的算法,逐步优化从任务特定模板到类特定描述的语言提示。与优化模板不同,类特定候选提示的空间爆炸性增加了提示生成成本、迭代次数和过拟合问题。为此,我们首先引入了几种简单而有效的编辑和进化操作,通过一次查询LLMs生成多样化的候选提示。然后,提出了两种采样策略,以找到更好的初始搜索点并减少遍历的类别,从而节省迭代成本。此外,我们应用了一种新颖的具有熵约束的适应度评分来缓解过拟合。在具有挑战性的单次图像分类设置中,我们的方法优于现有的基于文本提示的方法,并且在13个数据集上改进了LLM生成描述的方法。同时,我们证明了我们的最优提示改进了基于适配器的方法,并且在不同的骨干网络之间具有良好的迁移性。
Summary / 总结
This paper addresses the challenge of generating visually discriminative prompts for fine-grained categories in image classification using an evolution-based algorithm. The method optimizes language prompts from task-specific templates to class-specific descriptions with minimal supervision. Key findings include outperforming existing textual prompt-based methods and improving LLM-generated description methods across 13 datasets in a one-shot image classification setting. Additionally, the optimal prompts enhance adapter-based methods and transfer effectively across different backbones.
本文提出了一种基于进化算法的方法,用于生成细粒度类别在图像分类中的视觉区分性提示。该方法从任务特定模板优化到类别特定描述,无需人工干预。它引入了基于编辑和进化操作来高效生成多样候选提示,并提出了两种采样策略以减少搜索空间并缓解过拟合。实验结果表明,所提出的方法在13个数据集上优于现有的文本提示方法,并且最佳提示也增强了基于适配器的方法,并且在不同骨干网络之间具有良好的迁移性。
Fake-HR1: Rethinking Reasoning of Vision Language Model for Synthetic Image Detection
Authors: Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu
Venue: ICASSP 2026
First: 2026-02-10T18:10:08+00:00 · Latest: 2026-02-11T07:32:53+00:00
Comments: Accepted by ICASSP 2026
Abstract
Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
中文标题/摘要
标题:Fake-HR1:重新思考视觉语言模型在合成图像检测中的推理机制
近期研究表明,在检测过程中引入链式思考(CoT)推理可以增强模型检测合成图像的能力。然而,过长的推理过程会带来显著的资源开销,包括令牌消耗和延迟,特别是在处理明显伪造的图像时尤为冗余。为解决这一问题,我们提出了一种大规模混合推理模型Fake-HR1,据我们所知,这是首个能够根据生成检测任务的特性自适应地决定是否需要推理的模型。为此,我们设计了一个两阶段训练框架:首先进行混合微调(HFT)以进行冷启动初始化,然后通过混合推理组策略优化(HGRPO)进行在线强化学习,以隐式学习何时选择合适的推理模式。实验结果表明,Fake-HR1能够在不同类型的查询中自适应地进行推理,不仅在推理能力和生成检测性能上超越现有语言模型,还显著提高了响应效率。
Summary / 总结
The research aims to improve the ability of vision language models to detect synthetic images by incorporating adaptive reasoning. The method involves a two-stage training framework: Hybrid Fine-Tuning for initialization and Hybrid-Reasoning Grouped Policy Optimization for online learning. The key finding is that Fake-HR1 outperforms existing models in both reasoning ability and generative detection performance, while also enhancing response efficiency.
研究旨在通过引入自适应推理来提高视觉语言模型检测合成图像的能力。方法包括两阶段训练框架:Hybrid Fine-Tuning进行初始化,Hybrid-Reasoning Grouped Policy Optimization进行在线学习。关键发现是Fake-HR1在推理能力和生成检测性能上都超过了现有模型,同时提高了响应效率。
LAP: Language-Action Pre-Training Enables Zero-shot Cross-Embodiment Transfer
Authors: Lihan Zha, Asher J. Hancock, Mingtong Zhang, Tenny Yin, Yixuan Huang, Dhruv Shah, Allen Z. Ren, Anirudha Majumdar
First: 2026-02-11T06:09:11+00:00 · Latest: 2026-02-11T06:09:11+00:00
Comments: Project website: https://lap-vla.github.io
Abstract
A long-standing goal in robotics is a generalist policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. Despite large-scale multi-embodiment pre-training, existing Vision-Language-Action models (VLAs) remain tightly coupled to their training embodiments and typically require costly fine-tuning. We introduce Language-Action Pre-training (LAP), a simple recipe that represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP requires no learned tokenizer, no costly annotation, and no embodiment-specific architectural design. Based on LAP, we present LAP-3B, which to the best of our knowledge is the first VLA to achieve substantial zero-shot transfer to previously unseen robot embodiments without any embodiment-specific fine-tuning. Across multiple novel robots and manipulation tasks, LAP-3B attains over 50% average zero-shot success, delivering roughly a 2x improvement over the strongest prior VLAs. We further show that LAP enables efficient adaptation and favorable scaling, while unifying action prediction and VQA in a shared language-action format that yields additional gains through co-training.
中文标题/摘要
标题:LAP:语言-动作预训练使零样本跨体态迁移成为可能
机器人领域的一个长期目标是能够零样本部署在新机器人体态上的通用策略,无需针对每个体态进行适应。尽管进行了大规模的多体态预训练,现有的视觉-语言-动作模型(VLAs)仍然紧密耦合于其训练体态,并且通常需要昂贵的微调。我们提出了语言-动作预训练(LAP),这是一种简单的配方,直接将低级机器人动作表示为自然语言,使动作监督与预训练的视觉-语言模型的输入-输出分布相一致。LAP 不需要学习分词器,不需要昂贵的标注,也不需要针对特定体态的架构设计。基于 LAP,我们提出了 LAP-3B,据我们所知,这是第一个无需任何体态特定微调即可实现显著零样本迁移至未见过的机器人体态的 VLAs。在多个新型机器人和操作任务上,LAP-3B 达到了超过 50% 的平均零样本成功率,相比之前最强的 VLAs 提供了大约 2 倍的改进。我们还展示了 LAP 能够实现高效的适应和有利的扩展,并在共享的语言-动作格式中统一了动作预测和 VQA,通过协同训练进一步提高了性能。
Summary / 总结
The research aims to develop a generalist robot policy that can be deployed zero-shot on new robot embodiments without per-embodiment adaptation. The Language-Action Pre-training (LAP) method represents low-level robot actions directly in natural language, aligning action supervision with the pre-trained vision-language model's input-output distribution. LAP-3B, based on LAP, achieves over 50% average zero-shot success across multiple novel robots and manipulation tasks, roughly a 2x improvement over the strongest prior VLAs, without any embodiment-specific fine-tuning. The method also enables efficient adaptation and unifies action prediction and VQA in a shared language-action format, yielding additional gains through co-training.
研究旨在开发一种无需适应即可部署到新体态的机器人通用策略。Language-Action Pre-training (LAP) 方法将机器人动作表示为自然语言,与预训练的视觉语言模型对齐。基于 LAP 的 LAP-3B 在多种机器人和任务上实现了超过 50% 的零样本成功率,性能比之前的最佳模型提高了约两倍。该方法无需学习分词器、昂贵的标注或体态特定的设计,展示了高效的适应性和有利的扩展性。
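To make "representing low-level actions in natural language" concrete, here is a minimal sketch of one possible serialization. The abstract does not specify LAP's actual text format, so the function name `action_to_text`, the axis convention, the centimeter units, and the phrasing are all assumptions made for illustration.

```python
def action_to_text(delta_xyz_m, gripper_open):
    """Describe an end-effector displacement (in meters) as a plain sentence.

    Hypothetical format for illustration; not the format used by LAP.
    """
    # Assumed axis convention: +x right, +y forward, +z up.
    axes = [("right", "left"), ("forward", "backward"), ("up", "down")]
    parts = []
    for value, (positive, negative) in zip(delta_xyz_m, axes):
        centimeters = round(value * 100, 1)
        if centimeters != 0:
            direction = positive if centimeters > 0 else negative
            parts.append(f"{abs(centimeters)} cm {direction}")
    motion = "move " + ", ".join(parts) if parts else "stay in place"
    gripper = "open" if gripper_open else "close"
    return f"{motion}; {gripper} the gripper"

sentence = action_to_text((0.02, 0.0, -0.015), gripper_open=False)
```

Because the output is ordinary text, a target of this kind needs no learned action tokenizer and can share a format with VQA-style supervision, which is the property the abstract emphasizes.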
Out of the box age estimation through facial imagery: A Comprehensive Benchmark of Vision-Language Models vs. out-of-the-box Traditional Architectures
Authors: Simiao Ren, Xingyu Shen, Ankit Raj, Albert Dai, Caroline Zhang, Yuan Xu, Zexi Chen, Siqi Wu, Chen Gong, Yuxin Zhang
First: 2026-02-08T04:44:31+00:00 · Latest: 2026-02-11T06:08:19+00:00
Abstract
Facial age estimation plays a critical role in content moderation, age verification, and deepfake detection. However, no prior benchmark has systematically compared modern vision-language models (VLMs) with specialized age estimation architectures. We present the first large-scale cross-paradigm benchmark, evaluating 34 models - 22 specialized architectures with publicly available pretrained weights and 12 general-purpose VLMs - across eight standard datasets (UTKFace, IMDB-WIKI, MORPH, AFAD, CACD, FG-NET, APPA-REAL, and AgeDB), totaling 1,100 test images per model. Our key finding is striking: zero-shot VLMs significantly outperform most specialized models, achieving an average mean absolute error (MAE) of 5.65 years compared to 9.88 years for non-LLM models. The best-performing VLM (Gemini 3 Flash Preview, MAE 4.32) surpasses the strongest non-LLM model (MiVOLO, MAE 5.10) by 15%. MiVOLO - unique in combining face and body features using Vision Transformers - is the only specialized model that remains competitive with VLMs. We further analyze age verification at the 18-year threshold and find that most non-LLM models exhibit false adult rates between 39% and 100% for minors, whereas VLMs reduce this to 16%-29%. Additionally, coarse age binning (8-9 classes) consistently increases MAE beyond 13 years. Stratified analysis across 14 age groups reveals that all models struggle most at extreme ages (under 5 and over 65). Overall, these findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation and suggest that future work should focus on distilling VLM capabilities into efficient specialized models.
中文标题/摘要
标题:基于面部图像的开箱即用年龄估计:视觉-语言模型与传统架构的全面基准比较
面部年龄估计在内容审核、年龄验证和深度伪造检测中起着关键作用。然而,此前没有系统地将现代视觉-语言模型(VLMs)与专门的年龄估计架构进行比较。我们首次提出了一个大规模跨范式基准,评估了34个模型——包括22个具有公开预训练权重的专门架构和12个通用视觉-语言模型——在八个标准数据集(UTKFace、IMDB-WIKI、MORPH、AFAD、CACD、FG-NET、APPA-REAL和AgeDB)上的表现,每个模型总计1100张测试图像。我们的主要发现令人惊讶:零样本VLMs显著优于大多数专门模型,平均绝对误差(MAE)为5.65年,而非LLM模型的MAE为9.88年。表现最佳的VLM(Gemini 3 Flash Preview,MAE 4.32)比最强的非LLM模型(MiVOLO,MAE 5.10)高出15%。MiVOLO——唯一一个结合了面部和身体特征的视觉变换器模型——是唯一一个与VLMs竞争的专门模型。我们进一步分析了18岁阈值下的年龄验证,发现大多数非LLM模型对未成年人的假成年率在39%到100%之间,而VLMs将这一比例降低到16%-29%。此外,粗略的年龄分组(8-9个类别)始终将MAE提高到13年以上。在14个年龄组的分层分析中,所有模型在极端年龄(小于5岁和大于65岁)的表现最差。总体而言,这些发现挑战了任务特定架构对于高性能年龄估计是必要的假设,并建议未来的工作应集中在将VLM能力提炼为高效的专门模型上。
Summary / 总结
The study benchmarks 34 models, comprising 22 specialized age estimation architectures and 12 general-purpose vision-language models (VLMs), across eight datasets. Zero-shot VLMs significantly outperform most specialized models, with an average MAE of 5.65 years compared to 9.88 years for non-LLM models. The best VLM, Gemini 3 Flash Preview, achieves an MAE of 4.32 years, 15% lower than the strongest non-LLM model, MiVOLO (MAE 5.10), which is the only specialized model that remains competitive. VLMs also perform better in age verification at the 18-year threshold, with false adult rates for minors of 16%-29% compared to 39%-100% for most non-LLM models. The findings challenge the assumption that task-specific architectures are necessary for high-performance age estimation.
研究旨在评估现代视觉语言模型(VLMs)与专门的年龄估计架构在面部年龄估计中的性能。它在八个数据集上对34个模型进行了基准测试,结果显示零样本VLMs显著优于专门模型,平均绝对误差(MAE)为5.65年,而非VLM模型的MAE为9.88年。最佳VLM Gemini 3 Flash Preview的MAE为4.32年,比最佳非VLM模型MiVOLO的5.10年高出15%。VLMs在年龄验证方面也表现出色,将未成年人的假成年率降低到16%-29%,而非VLM模型的假成年率在39%-100%之间。研究建议,特定任务的架构可能不是实现高性能年龄估计所必需的。
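The reported metrics can be reproduced in form with a short sketch. The definitions below are assumptions based on the usual meaning of the terms: MAE as the mean absolute difference in years, and the false adult rate as the share of true minors whose predicted age is at or above the threshold. The relative-improvement line uses the two MAE values quoted in the abstract.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    return sum(abs(t - p) for t, p in zip(true_ages, predicted_ages)) / len(true_ages)

def false_adult_rate(true_ages, predicted_ages, threshold=18):
    """Fraction of true minors predicted at or above the threshold (assumed definition)."""
    minors = [(t, p) for t, p in zip(true_ages, predicted_ages) if t < threshold]
    if not minors:
        return 0.0
    return sum(p >= threshold for _, p in minors) / len(minors)

def relative_improvement(baseline_mae, new_mae):
    """Relative MAE reduction of the new model over the baseline."""
    return (baseline_mae - new_mae) / baseline_mae

# MAE values quoted in the abstract: MiVOLO 5.10, Gemini 3 Flash Preview 4.32.
print(f"{relative_improvement(5.10, 4.32):.1%}")  # prints 15.3%
```

The 15% figure in the abstract is therefore a relative reduction in MAE (0.78 years out of 5.10), not an absolute accuracy gain.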
MapVerse: A Benchmark for Geospatial Question Answering on Diverse Real-World Maps
Authors: Sharat Bhat, Harshita Khandelwal, Tushar Kataria, Vivek Gupta
First: 2026-02-11T04:36:14+00:00 · Latest: 2026-02-11T04:36:14+00:00
Abstract
Maps are powerful carriers of structured and contextual knowledge, encompassing geography, demographics, infrastructure, and environmental patterns. Reasoning over such knowledge requires models to integrate spatial relationships, visual cues, real-world context, and domain-specific expertise, capabilities that current large language models (LLMs) and vision-language models (VLMs) still struggle to exhibit consistently. Yet, datasets used to benchmark VLMs on map-based reasoning remain narrow in scope, restricted to specific domains, and heavily reliant on artificially generated content (outputs from LLMs or pipeline-based methods), offering limited depth for evaluating genuine geospatial reasoning. To address this gap, we present MapVerse, a large-scale benchmark built on real-world maps. It comprises 11,837 human-authored question-answer pairs across 1,025 maps, spanning ten diverse map categories and multiple question categories for each. The dataset provides a rich setting for evaluating map reading, interpretation, and multimodal reasoning. We evaluate ten state-of-the-art models against our benchmark to establish baselines and quantify reasoning gaps. Beyond overall performance, we conduct fine-grained categorical analyses to assess model inference across multiple dimensions and investigate the visual factors shaping reasoning outcomes. Our findings reveal that while current VLMs perform competitively on classification-style tasks, both open- and closed-source models fall short on advanced tasks requiring complex spatial reasoning.
中文标题/摘要
标题:MapVerse:地理空间问答的大规模基准
地图是承载结构化和上下文知识的强大载体,包括地理、人口统计、基础设施和环境模式。在这些知识上进行推理需要模型整合空间关系、视觉线索、现实世界背景和领域特定的专业知识——而当前的大语言模型(LLMs)和视觉-语言模型(VLMs)在这些方面仍然难以一致地展现。然而,用于评估VLMs的地图推理基准数据集在范围上仍然狭窄,局限于特定领域,并且高度依赖于人工生成的内容(来自LLMs或管道方法的输出),这为评估真正的地理空间推理提供了有限的深度。为了解决这一差距,我们提出了MapVerse,一个基于真实地图的大规模基准。它包含11,837个人工撰写的问答对,覆盖1,025张地图,涵盖十个不同的地图类别和每个类别的多个问题类别。该数据集为评估地图阅读、解释和多模态推理提供了丰富的环境。我们评估了十种最先进的模型以建立基线并量化推理差距。除了整体性能外,我们还进行了细粒度的类别分析,以评估模型在多个维度上的推理能力,并探讨视觉因素如何影响推理结果。我们的研究发现,虽然当前的VLMs在分类任务上表现竞争,但开源和闭源模型在需要复杂空间推理的高级任务上表现不佳。
Summary / 总结
MapVerse is a benchmark for geospatial question answering on real-world maps, addressing the limitations of existing datasets by using 11,837 human-authored question-answer pairs across 1,025 maps from ten diverse categories. The benchmark evaluates ten state-of-the-art models, revealing that while current vision-language models perform well on classification tasks, they struggle with complex spatial reasoning tasks. The study provides insights into the reasoning gaps of these models across different categories and visual factors influencing their performance.
研究旨在通过真实地图评估模型处理地理空间知识的能力。MapVerse基准数据集包含11,837个问题-答案对,覆盖1,025张地图的十个不同类别。十种最先进的模型被评估,显示在分类任务上表现良好,但在需要复杂空间推理的高级任务上表现不佳。研究还提供了详细的类别分析,以了解模型在不同维度上的表现以及视觉因素如何影响推理结果。
SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning
Authors: Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu
First: 2026-02-10T06:57:12+00:00 · Latest: 2026-02-11T03:34:02+00:00
Abstract
Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model's reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
中文标题/摘要
标题:SpotAgent:通过代理推理在大型视觉语言模型中实现视觉地理定位
大型视觉语言模型(LVLMs)在地理定位方面展示了强大的推理能力,但在现实世界场景中,由于视觉线索稀疏、长尾且高度模糊,它们往往难以应对。以往的方法受限于内部知识,往往无法提供可验证的结果,在面对混淆证据时只能给出自信但未验证的预测。为解决这些挑战,我们提出了SpotAgent框架,该框架将地理定位形式化为一种代理推理过程,利用专家级推理来协同视觉解释与工具辅助验证。SpotAgent通过ReAct图利用外部工具(例如网络搜索、地图)积极探索和验证视觉线索。我们引入了一个三阶段后训练管道,首先通过监督微调(SFT)阶段进行基本对齐,然后通过多代理框架合成高质量轨迹进行代理冷启动阶段,旨在灌输工具调用的专业知识。随后,通过强化学习精炼模型的推理能力。我们提出了一种基于空间感知的动态过滤策略,以增强强化学习阶段的效率,优先处理基于空间难度的学习样本。在标准基准上的广泛实验表明,SpotAgent实现了最先进的性能,有效减少了幻觉,提供了精确且可验证的地理定位。
Summary / 总结
The research aims to improve the geo-localization capabilities of large vision-language models in real-world scenarios with sparse and ambiguous visual cues. SpotAgent is proposed as a framework that formalizes geo-localization into an agentic reasoning process, using external tools for verification. The method includes a 3-stage post-training pipeline: Supervised Fine-Tuning, Agentic Cold Start, and Reinforcement Learning, with a Spatially-Aware Dynamic Filtering strategy to enhance efficiency. Experimental results show that SpotAgent outperforms existing methods, providing precise and verifiable geo-localization while reducing hallucinations.
研究旨在提高大型视觉语言模型在稀疏和模糊视觉线索下的地理定位能力。SpotAgent 是一个将地理定位过程形式化为代理推理过程的框架,利用外部工具进行验证。方法包括三个阶段的后训练管道:监督微调、使用多代理框架的冷启动以及强化学习。关键发现表明,SpotAgent 在标准基准测试中表现出色,减少了幻觉并提供了精确且可验证的地理定位结果。