arXiv Paper Digest

Snapshot: 20260206_0347

When LLaVA Meets Objects: Token Composition for Vision-Language-Models

Authors: Soumya Jahagirdar, Walid Bousselham, Anna Kukleva, Hilde Kuehne

First: 2026-02-04T18:50:46+00:00 · Latest: 2026-02-04T18:50:46+00:00

Abstract

Current autoregressive Vision Language Models (VLMs) usually rely on a large number of visual tokens to represent images, which increases compute requirements, especially at inference time. To address this problem, we propose Mask-LLaVA, a framework that leverages different levels of visual features to create a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. While all tokens are used during training, we show that the resulting model can flexibly drop tokens at test time, in particular mask-based object tokens, so the number of tokens can be adapted during inference without retraining the model and without a significant drop in performance. We evaluate the proposed approach on a suite of standard benchmarks, showing results competitive with current token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens. Our analysis demonstrates that combining multi-level features enables efficient learning with fewer tokens while allowing dynamic token selection at test time with good performance.

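Chinese Title / Abstract (translated)

Title: When LLaVA Meets Objects: Token Composition for Vision-Language Models

Current autoregressive vision-language models (VLMs) typically rely on a large number of visual tokens to represent an image, which demands more compute at inference time. To address this, we propose Mask-LLaVA, a framework that uses visual features at different levels to build a compact yet information-rich visual representation for autoregressive VLMs. Specifically, we combine mask-based object representations with global tokens and local patch tokens. Although all tokens are used during training, the results show that the model can flexibly reduce the number of mask-based object tokens at test time, so the token count can be chosen dynamically at inference without retraining and without a significant drop in performance. We evaluate the method on a suite of standard benchmarks; its results are on par with current token-efficient methods, and it matches the original LLaVA baseline while using only a small number of visual tokens. Our analysis shows that combining multi-level features enables efficient training with fewer tokens and allows dynamic token selection at test time with good performance.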

Summary

The paper addresses the high computational cost of current autoregressive Vision Language Models (VLMs) by proposing Mask-LLaVA, which combines mask-based object representations, global tokens, and local patch tokens into a compact visual representation. The model can flexibly reduce the number of mask-based object tokens during inference without retraining and without significant performance loss. Experiments on standard benchmarks show that Mask-LLaVA is competitive with other token-efficient methods and comparable to the original LLaVA baseline while using only a fraction of the visual tokens.

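The paper proposes Mask-LLaVA, which combines mask-based object representations, global tokens, and local patch tokens into a compact visual representation to reduce the high compute demand of current autoregressive vision-language models (VLMs). At inference time the model can flexibly reduce the number of mask-based object tokens without retraining while maintaining performance. On standard benchmarks the method is competitive with other token-efficient methods and, with fewer visual tokens, reaches performance comparable to the original LLaVA baseline.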
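
The token composition described for Mask-LLaVA can be pictured with a small sketch. It is not the authors' implementation: the `VisualTokens` container, the `compose` helper, the toy two-dimensional embeddings, and the rule of keeping the first `max_mask_tokens` object tokens are all assumptions made for illustration, since the abstract does not say how object tokens are chosen when some are dropped.

```python
from dataclasses import dataclass
from typing import List

Token = List[float]  # one visual token embedding (toy-sized for illustration)


@dataclass
class VisualTokens:
    global_tokens: List[Token]
    patch_tokens: List[Token]
    mask_tokens: List[Token]  # assumption: ordered so that truncation keeps the most useful ones


def compose(tokens: VisualTokens, max_mask_tokens: int) -> List[Token]:
    """Concatenate global, patch, and at most `max_mask_tokens` object tokens."""
    if max_mask_tokens < 0:
        raise ValueError("max_mask_tokens must be non-negative")
    kept_masks = tokens.mask_tokens[:max_mask_tokens]
    return tokens.global_tokens + tokens.patch_tokens + kept_masks


tokens = VisualTokens(
    global_tokens=[[0.0, 1.0]],
    patch_tokens=[[0.1, 0.2]] * 4,
    mask_tokens=[[0.5, 0.5]] * 8,
)

# Training-style sequence: every token is present.
train_sequence = compose(tokens, max_mask_tokens=8)
# Test-time sequence: fewer object tokens, same model, no retraining.
test_sequence = compose(tokens, max_mask_tokens=2)
print(len(train_sequence), len(test_sequence))
```

The only knob that changes between the two calls is the object-token budget, which mirrors the claim that the token count can be adapted at inference.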

El Agente Estructural: An Artificially Intelligent Molecular Editor

Authors: Changhyeok Choi, Yunheng Zou, Marcel Müller, Han Hao, Yeonghun Kang, Juan B. Pérez-Sánchez, Ignacio Gustin, Hanyong Xu, Mohammad Ghazi Vakili, Chris Crebolder, Alán Aspuru-Guzik, Varinia Bernales

First: 2026-02-04T18:38:48+00:00 · Latest: 2026-02-04T18:38:48+00:00

Abstract

We present El Agente Estructural, a multimodal, natural-language-driven geometry-generation and manipulation agent for autonomous chemistry and molecular modelling. Unlike molecular generation or editing via generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design enables precise control over atomic or functional group replacements, atomic connectivity, and stereochemistry without the need to rebuild extensive core molecular frameworks. Through a series of representative case studies, we demonstrate that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, when combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking forward, the integration of Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances its capabilities by adding sophisticated tools for the generation and editing of three-dimensional structures.

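Chinese Title / Abstract (translated)

Title: El Agente Estructural: An Artificially Intelligent Molecular Editor

We introduce El Agente Estructural, a multimodal, natural-language-driven agent for geometry generation and manipulation in autonomous chemistry and molecular modelling. Unlike molecular generation or editing with generative models, Estructural mimics how human experts directly manipulate molecular systems in three dimensions by integrating a comprehensive set of domain-informed tools and vision-language models. This design allows precise control over atom or functional-group replacements, atomic connectivity, and stereochemistry without rebuilding extensive core molecular frameworks. Through a series of representative case studies, we show that Estructural enables chemically meaningful geometry manipulation across a wide range of real-world scenarios. These include site-selective functionalization, ligand binding, ligand exchange, stereochemically controlled structure construction, isomer interconversion, fragment-level structural analysis, image-guided generation of structures from schematic reaction mechanisms, and mechanism-driven geometry generation and modification. These examples illustrate how multimodal reasoning, combined with specialized geometry-aware tools, supports interactive and context-aware molecular modelling beyond structure generation. Looking ahead, integrating Estructural into El Agente Quntur, an autonomous multi-agent quantum chemistry platform, enhances that platform by adding sophisticated tools for generating and editing three-dimensional structures.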

Summary

El Agente Estructural is a multimodal agent that uses natural language to manipulate molecular structures, enabling precise control over atomic and functional group replacements, atomic connectivity, and stereochemistry. Through various case studies, it demonstrates effective geometry manipulation in diverse scenarios such as functionalization, ligand binding, and isomer interconversion. This approach supports interactive and context-aware molecular modeling beyond simple structure generation.

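El Agente Estructural is a multimodal agent that manipulates molecular structures through natural language, with precise control over atom and functional-group replacements, atomic connectivity, and stereochemistry. Through various case studies it demonstrates effective geometry manipulation in scenarios such as functionalization, ligand binding, and isomer interconversion. The approach combines vision-language models with specialized geometry-aware tools to support interactive and context-aware molecular modelling beyond simple structure generation. Integration with El Agente Quntur further extends its capabilities in quantum chemistry applications.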

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Authors: Qing'an Liu, Juntong Feng, Yuhao Wang, Xinzhe Han, Yujie Cheng, Yue Zhu, Haiwen Diao, Yunzhi Zhuge, Huchuan Lu

First: 2026-02-04T17:48:55+00:00 · Latest: 2026-02-04T17:48:55+00:00

Comments: 27 pages, 19 figures

Abstract

Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs, yet existing benchmarks predominantly focus on pure-text queries. In real-world scenarios, language also frequently appears as visualized text embedded in images, raising the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding domains. It evaluates visualized text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. Extensive evaluation of over 20 representative VLMs reveals a pronounced modality gap: models that perform well on pure-text queries often degrade substantially when equivalent semantic content is presented as visualized text. This gap is further amplified by increased perceptual difficulty, highlighting sensitivity to rendering variations despite unchanged semantics. Overall, VISTA-Bench provides a principled evaluation framework to diagnose this limitation and to guide progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.

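Chinese Title / Abstract (translated)

Title: VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?

Vision-language models (VLMs) have achieved impressive cross-modal understanding across textual and visual inputs, yet existing benchmarks focus mainly on pure-text queries. In the real world, language also often appears as visualized text embedded in images, which raises the question of whether current VLMs handle such inputs comparably. We introduce VISTA-Bench, a systematic benchmark spanning multimodal perception, reasoning, and unimodal understanding. It evaluates visualized-text understanding by contrasting pure-text and visualized-text questions under controlled rendering conditions. An extensive evaluation of more than 20 representative VLMs reveals a clear modality gap: models that do well on pure-text queries often degrade substantially when semantically equivalent content is presented as visualized text. The gap widens as perceptual difficulty increases, which highlights sensitivity to rendering variations even though the semantics are unchanged. Overall, VISTA-Bench provides a principled evaluation framework for diagnosing this limitation and for guiding progress toward more unified language representations across tokenized text and pixels. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.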

Summary

VISTA-Bench evaluates the performance of Vision-Language Models (VLMs) in understanding visualized text compared to pure text. By introducing a systematic benchmark across multimodal perception, reasoning, and unimodal understanding domains, the study reveals a significant modality gap where models excel in pure-text queries but perform poorly when the same semantic content is presented as visualized text, especially under increased perceptual difficulty. This highlights the need for more unified language representations. The source dataset is available at https://github.com/QingAnLiu/VISTA-Bench.

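VISTA-Bench evaluates how well vision-language models understand visualized text compared with pure text. The benchmark tests models on perception, reasoning, and unimodal understanding tasks, contrasting pure-text and visualized-text questions under controlled conditions. The key finding is that models that excel on pure-text queries show a marked performance drop when the same semantic content is presented as visualized text, especially under increased perceptual difficulty. This highlights the need for more unified language representations. The dataset is available at https://github.com/QingAnLiu/VISTA-Bench.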
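
One simple way to quantify the modality gap described for VISTA-Bench is the accuracy difference between paired pure-text and visualized-text questions. The sketch below assumes that definition; the function names and the toy correctness lists are hypothetical and are not taken from the benchmark's code or results.

```python
from typing import List


def accuracy(results: List[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(results) / len(results)


def modality_gap(pure_text: List[bool], visualized_text: List[bool]) -> float:
    """Pure-text accuracy minus visualized-text accuracy, in percentage points."""
    if len(pure_text) != len(visualized_text):
        raise ValueError("expected one visualized-text result per pure-text question")
    return 100.0 * (accuracy(pure_text) - accuracy(visualized_text))


# Toy, made-up correctness flags for four paired questions (not benchmark data).
pure = [True, True, True, False]
visualized = [True, False, False, False]
gap = modality_gap(pure, visualized)
print(f"modality gap: {gap:.1f} points")
```

Pairing the questions keeps the semantic content fixed, so a positive gap reflects the change of input modality and not a change in difficulty of the underlying question.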

Annotation Free Spacecraft Detection and Segmentation using Vision Language Models

Authors: Samet Hicsonmez, Jose Sosa, Dan Pineau, Inder Pal Singh, Arunkumar Rathinam, Abd El Rahman Shabayek, Djamila Aouada

Venue: ICRA 2026

First: 2026-02-04T16:07:29+00:00 · Latest: 2026-02-04T16:07:29+00:00

Comments: ICRA 2026

Abstract

Vision Language Models (VLMs) have demonstrated remarkable performance in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is particularly challenging due to factors such as low visibility, illumination variations, and object blending with planetary backgrounds. Developing methods that can detect and segment spacecraft and orbital targets without requiring extensive manual labeling is therefore of critical importance. In this work, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach begins by automatically generating pseudo-labels for a small subset of unlabeled real data with a pre-trained VLM. These pseudo-labels are then leveraged in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process leads to substantial performance gains over direct zero-shot VLM inference. Experimental evaluations on the SPARK-2024, SPEED+, and TANGO datasets on segmentation tasks demonstrate consistent improvements in average precision (AP) by up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.

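Chinese Title / Abstract (translated)

Title: Annotation Free Spacecraft Detection and Segmentation using Vision Language Models

Vision language models (VLMs) perform remarkably well in open-world zero-shot visual recognition. However, their potential in space-related applications remains largely unexplored. In the space domain, accurate manual annotation is especially challenging because of low visibility, illumination variations, and targets blending into planetary backgrounds. Methods that can detect and segment spacecraft and orbital targets without extensive manual labeling are therefore critically important. In this paper, we propose an annotation-free detection and segmentation pipeline for space targets using VLMs. Our approach first uses a pre-trained VLM to automatically generate pseudo-labels for a small subset of unlabeled real data. These pseudo-labels are then used in a teacher-student label distillation framework to train lightweight models. Despite the inherent noise in the pseudo-labels, the distillation process yields substantial gains over direct zero-shot VLM inference. Experimental evaluations on segmentation tasks on the SPARK-2024, SPEED+, and TANGO datasets show improvements in average precision (AP) of up to 10 points. Code and models are available at https://github.com/giddyyupp/annotation-free-spacecraft-segmentation.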

Summary

This work addresses the challenge of spacecraft detection and segmentation in space applications where manual annotation is difficult. It proposes an annotation-free method in which a pre-trained Vision Language Model (VLM) automatically generates pseudo-labels for a small subset of unlabeled data. These pseudo-labels are used in a teacher-student label distillation framework to train lightweight models, improving average precision by up to 10 points on the SPARK-2024, SPEED+, and TANGO datasets. The method shows consistent gains over direct zero-shot VLM inference, and the code and models are publicly available.

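The study uses vision language models (VLMs) to detect and segment spacecraft in space imagery without manual annotation. The method first generates pseudo-labels for a small subset of unlabeled data with a pre-trained VLM, then trains lightweight models in a teacher-student label distillation framework. Although the pseudo-labels are noisy, the method achieves consistent improvements in average precision of up to 10 points on segmentation tasks on the SPARK-2024, SPEED+, and TANGO datasets.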
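
The pseudo-labeling step can be sketched as a loop that asks a teacher for candidate detections on each unlabeled image and stores the result as training targets for a student. Everything named here is hypothetical: `toy_teacher` stands in for the pre-trained VLM, and the `min_score` confidence filter is a common way to limit label noise that the abstract does not mention, so it should be read as an assumption and not as part of the published pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


@dataclass
class PseudoLabel:
    image_id: str
    box: Box
    score: float


Teacher = Callable[[str], List[Tuple[Box, float]]]


def build_pseudo_labels(image_ids: List[str], teacher: Teacher,
                        min_score: float = 0.5) -> List[PseudoLabel]:
    """Collect teacher detections and keep those at or above `min_score`."""
    labels: List[PseudoLabel] = []
    for image_id in image_ids:
        for box, score in teacher(image_id):
            if score >= min_score:  # assumed noise filter, not from the paper
                labels.append(PseudoLabel(image_id, box, score))
    return labels


def toy_teacher(image_id: str) -> List[Tuple[Box, float]]:
    """Stand-in for a pre-trained VLM: one confident and one weak detection."""
    return [((0.1, 0.1, 0.5, 0.5), 0.9), ((0.2, 0.2, 0.3, 0.3), 0.2)]


labels = build_pseudo_labels(["img_0", "img_1"], toy_teacher)
print(len(labels))
```

The resulting list would then play the role of ground truth when training the lightweight student model.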

AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

Authors: Jin-Chuan Shi, Binhong Ye, Tao Liu, Junzhe He, Yangjinhui Xu, Xiaoyang Liu, Zeju Li, Hao Chen, Chunhua Shen

First: 2026-02-04T15:42:58+00:00 · Latest: 2026-02-04T15:42:58+00:00

Comments: 11 pages

Abstract

Reconstructing dynamic hand-object interactions from monocular videos is critical for dexterous manipulation data collection and creating realistic digital twins for robotics and VR. However, current methods face two prohibitive barriers: (1) reliance on neural rendering often yields fragmented, non-simulation-ready geometries under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization leads to frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we employ an agentic pipeline where a Vision-Language Model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of video occlusions. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy. We initialize the object pose at a single interaction onset frame using a foundation model and propagate it temporally by leveraging the strong visual similarity between our generated asset and video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos reveal that AGILE outperforms baselines in global geometric accuracy while demonstrating exceptional robustness on challenging sequences where prior art frequently collapses. By prioritizing physical validity, our method produces simulation-ready assets validated via real-to-sim retargeting for robotic applications.

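Chinese Title / Abstract (translated)

Title: AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation

Reconstructing dynamic hand-object interactions from monocular video is critical for collecting dexterous manipulation data and for creating realistic digital twins for robotics and VR. However, current methods face two major barriers: (1) reliance on neural rendering often produces fragmented, non-simulation-ready geometry under heavy occlusion, and (2) dependence on brittle Structure-from-Motion (SfM) initialization causes frequent failures on in-the-wild footage. To overcome these limitations, we introduce AGILE, a robust framework that shifts the paradigm from reconstruction to agentic generation for interaction learning. First, we use an agentic pipeline in which a vision-language model (VLM) guides a generative model to synthesize a complete, watertight object mesh with high-fidelity texture, independent of occlusions in the video. Second, bypassing fragile SfM entirely, we propose a robust anchor-and-track strategy: we initialize the object pose at a single interaction-onset frame with a foundation model and propagate it through time by exploiting the strong visual similarity between the generated asset and the video observations. Finally, a contact-aware optimization integrates semantic, geometric, and interaction-stability constraints to enforce physical plausibility. Extensive experiments on HO3D, DexYCB, and in-the-wild videos show that AGILE outperforms baselines in global geometric accuracy and remains robust on challenging sequences where prior methods frequently collapse. By prioritizing physical validity, our method produces simulation-ready assets, validated through real-to-sim retargeting for robotic applications.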

Summary

AGILE is a framework designed to reconstruct dynamic hand-object interactions from monocular videos, addressing the limitations of current methods by using an agentic generation approach. It employs a Vision-Language Model to guide a generative model for complete object mesh synthesis, independent of occlusions, and uses a robust anchor-and-track strategy for pose initialization and temporal propagation. The method incorporates contact-aware optimization to enforce physical plausibility. Experiments show AGILE outperforms baselines in geometric accuracy and robustness on challenging sequences.

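AGILE is a robust framework for reconstructing dynamic hand-object interactions from monocular video; it addresses the limitations of existing methods by shifting from reconstruction to agentic generation. A vision-language model guides a generative model to synthesize a complete object mesh, fragile SfM is bypassed entirely through a robust anchor-and-track strategy, and a contact-aware optimization integrates semantic, geometric, and interaction-stability constraints to enforce physical plausibility. Experiments show that AGILE outperforms baseline methods in geometric accuracy and in robustness on challenging sequences.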

Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Authors: George Yakushev, Nataliia Babina, Masoud Vahid Dastgerdi, Vyacheslav Zhdanovskiy, Denis Kuznedelev, Alina Shutova, Max Ryabinin

First: 2025-12-11T18:57:02+00:00 · Latest: 2026-02-04T15:33:49+00:00

Comments: Preprint, work in progress

Abstract

Many state-of-the-art LLMs are trained to think before giving their answer. Reasoning can greatly improve language model capabilities, but it also makes them less interactive: given a new input, a model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interactions. In contrast, humans can listen, think, and act asynchronously: we begin thinking about the problem while reading it and continue thinking while formulating the answer. In this work, we augment LLMs capable of reasoning to operate in a similar way without additional training. Our method uses the properties of positional embeddings to enable LLMs built for sequential generation to simultaneously think, listen, and write outputs. We evaluate our approach on math, commonsense, and safety reasoning: it allows models to generate accurate thinking-augmented answers while reducing time to first non-thinking token from minutes to ≤ 5 s and the overall real-time delays by up to 12×.

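Chinese Title / Abstract (translated)

Title: Asynchronous Reasoning: Training-Free Interactive Thinking LLMs

Many state-of-the-art LLMs think before giving an answer. Reasoning can greatly improve a language model's capabilities, but it also makes the model less interactive: given a new input, the model must stop thinking before it can respond. Real-world use cases such as voice-based or embodied assistants require an LLM agent to respond and adapt to additional information in real time, which is incompatible with sequential interaction. Humans, by contrast, can listen, think, and act asynchronously: we begin thinking about a problem while reading it and keep thinking while formulating the answer. In this work, we enable reasoning-capable LLMs to operate in a similar way without additional training. Our method uses the properties of positional embeddings so that LLMs built for sequential generation can think, listen, and write outputs at the same time. We evaluate the approach on math, commonsense, and safety reasoning: it lets models produce accurate thinking-augmented answers while reducing the time to the first non-thinking token from minutes to ≤ 5 s and overall real-time delays by up to 12×.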

Summary

This work addresses a limitation of state-of-the-art reasoning LLMs: they must finish thinking before responding, which makes them less interactive. The authors propose a training-free method that leverages positional embeddings so that LLMs can think, listen, and write outputs simultaneously. Evaluation on math, commonsense, and safety reasoning tasks shows that the approach reduces the time to first non-thinking token from minutes to at most 5 seconds and overall real-time delays by up to 12×.

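The study targets a limitation of current state-of-the-art LLMs: they must reason sequentially before responding, which makes them less interactive. The authors propose an asynchronous reasoning method that uses positional embeddings to let an LLM think, listen, and write at the same time, without additional training. The method substantially reduces the time to the first non-thinking token and the overall real-time delay, and it is shown to be effective on math, commonsense, and safety reasoning tasks.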

PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective

Authors: Haokui Zhang, Congyang Ou, Dawei Yan, Peng Wang, Qingsen Yan, Ying Li, Rong Xiao, Chunhua Shen

First: 2026-02-04T15:33:10+00:00 · Latest: 2026-02-04T15:33:10+00:00

Abstract

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics based on inter-visual-token similarity or cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: it recasts visual token compression as preserving output result invariance and selects tokens primarily by their importance to this goal. Specifically, vision tokens are reordered under the guidance of token-level gradient saliency generated by our layer-local proxy loss, a coarse constraint from the current layer to the final result. The most valuable vision tokens are then selected following the non-maximum suppression (NMS) principle. PIO-FVLM is training-free and compatible with FlashAttention, which makes it practical to apply and deploy. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches such as VisionZip as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67× prefill speedup, a 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× lower KV cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.

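Chinese Title / Abstract (translated)

Title: PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate inference has become a hot topic. However, most existing methods rely on heuristics built on similarity between visual tokens or on cross-modal visual-text similarity, which limits both compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives: it recasts visual token compression as preserving the invariance of the output result and selects tokens primarily by their importance to that goal. Specifically, visual tokens are reordered under the guidance of token-level gradient saliency produced by our layer-local proxy loss, a coarse constraint linking the current layer to the final result. The most valuable visual tokens are then selected following the non-maximum suppression (NMS) principle. PIO-FVLM is training-free and compatible with FlashAttention, which makes it friendly to practical application and deployment. It can be deployed on its own as an encoder-free method or combined with encoder compression approaches such as VisionZip as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains only 11.1% of visual tokens yet maintains 97.2% of the original performance, with a 2.67× prefill speedup, a 2.11× inference speedup, 6.22× lower FLOPs, and 6.05× lower KV cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.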

Summary

The paper proposes PIO-FVLM, a training-free method for visual token reduction in vision-language models (VLMs) that focuses on preserving output result invariance. It uses token-level gradient saliency generated by a layer-local proxy loss to reorder and select the most valuable vision tokens, achieving significant performance retention while accelerating inference. On LLaVA-Next-7B, PIO-FVLM retains 97.2% of the original performance with only 11.1% of visual tokens, providing a 2.67x prefill speedup, 2.11x inference speedup, 6.22x lower FLOPs, and 6.05x reduced KV Cache overhead.

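PIO-FVLM rethinks visual token reduction for VLM acceleration from the perspective of inference objectives, aiming to preserve the invariance of the output result. It reorders visual tokens by token-level gradient saliency and selects the most valuable tokens with NMS. On LLaVA-Next-7B, PIO-FVLM retains only 11.1% of visual tokens while maintaining 97.2% of the performance, with substantial speedups and reductions in FLOPs and KV cache overhead. The method is training-free and compatible with FlashAttention, which makes it suitable for practical deployment.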
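
The selection step (rank tokens by saliency, then keep them under a non-maximum suppression rule) can be illustrated with a generic greedy NMS over token features. This is a sketch under stated assumptions and not the PIO-FVLM code: the saliency scores are supplied by hand instead of coming from a layer-local proxy loss, and cosine similarity with a fixed `sim_threshold` is an assumed suppression criterion.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tokens(features: List[Sequence[float]], saliency: List[float],
                  keep: int, sim_threshold: float = 0.9) -> List[int]:
    """Greedy NMS: visit tokens by descending saliency and skip any token
    that is too similar to one already kept. Returns kept indices in order."""
    order = sorted(range(len(features)), key=lambda i: saliency[i], reverse=True)
    kept: List[int] = []
    for i in order:
        if len(kept) == keep:
            break
        if all(cosine(features[i], features[j]) < sim_threshold for j in kept):
            kept.append(i)
    return sorted(kept)


# Toy features: tokens 0 and 1 are near-duplicates, token 2 is distinct.
features = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.7, 0.7]]
saliency = [0.9, 0.8, 0.5, 0.3]
selected = select_tokens(features, saliency, keep=2)
print(selected)
```

In the toy input, the second-most salient token is suppressed because it nearly duplicates the most salient one, so a less salient but distinct token is kept instead.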

VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Authors: Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park

First: 2026-02-04T14:12:55+00:00 · Latest: 2026-02-04T14:12:55+00:00

Comments: A system description paper for the AVerImaTeC shared task at the Ninth FEVER Workshop (co-located with EACL 2026)

Abstract

This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.

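Chinese Title / Abstract (translated)

Title: VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN uses vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from a knowledge store enriched through additional web collection. To identify key information and resolve inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the next stage, question-answer pairs are produced from these reports. Finally, the Verdict Prediction agent produces the verification outcome from the image-text claim and the generated question-answer pairs. Our system ranked first across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.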

Summary

VILLAIN is a multimodal fact-checking system that verifies image-text claims using prompt-based multi-agent collaboration. It retrieves textual and visual evidence from a knowledge store and generates analysis reports to identify key information and address inconsistencies. The system then produces question-answer pairs and a final verification outcome. VILLAIN ranked first on all evaluation metrics in the AVerImaTeC shared task.

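VILLAIN is a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. It retrieves textual and visual evidence from a knowledge store and uses vision-language model agents to generate analysis reports. The system then produces question-answer pairs and a final verification outcome. VILLAIN ranked first on all evaluation metrics in the AVerImaTeC shared task.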
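
The staged flow described for VILLAIN (analysis reports, then question-answer pairs, then a verdict) can be sketched as a small pipeline of callables. The names `FactCheck`, `run_pipeline`, and `stub_agent`, the prompt strings, and the placeholder verdict are all hypothetical; each agent here is a stub standing in for a prompted vision-language model, and evidence retrieval is assumed to have already happened.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Agent = Callable[[str], str]  # prompt in, text out; stands in for a VLM call


@dataclass
class FactCheck:
    claim: str
    evidence: List[str]
    reports: List[str] = field(default_factory=list)
    qa_pairs: List[Tuple[str, str]] = field(default_factory=list)
    verdict: str = ""


def run_pipeline(claim: str, evidence: List[str], analysts: List[Agent],
                 qa_agent: Agent, verdict_agent: Agent) -> FactCheck:
    state = FactCheck(claim, evidence)
    joined = "\n".join(evidence)
    # Stage 1: each analysis agent writes a report over the retrieved evidence.
    state.reports = [agent(f"Claim: {claim}\nEvidence:\n{joined}") for agent in analysts]
    # Stage 2: turn each report into a question-answer pair.
    for report in state.reports:
        question = f"What does this report imply about the claim? {report}"
        state.qa_pairs.append((question, qa_agent(question)))
    # Stage 3: predict the verdict from the claim and the QA pairs.
    qa_text = "\n".join(f"Q: {q}\nA: {a}" for q, a in state.qa_pairs)
    state.verdict = verdict_agent(f"Claim: {claim}\n{qa_text}")
    return state


def stub_agent(tag: str) -> Agent:
    """Build a stand-in agent that only echoes a tag and the prompt length."""
    def agent(prompt: str) -> str:
        return f"[{tag}] processed {len(prompt)} chars"
    return agent


result = run_pipeline(
    claim="hypothetical claim about an image",
    evidence=["text snippet", "image caption"],
    analysts=[stub_agent("text"), stub_agent("image"), stub_agent("cross-modal")],
    qa_agent=stub_agent("qa"),
    verdict_agent=lambda prompt: "placeholder-verdict",
)
print(result.verdict)
```

Keeping each stage's output in one state object makes it easy to inspect intermediate reports and question-answer pairs when a verdict looks wrong.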

Understanding Degradation with Vision Language Model

Authors: Guanzhou Lan, Chenyi Liao, Yuqi Yang, Qianli Ma, Zhigang Wang, Dong Wang, Bin Zhao, Xuelong Li

First: 2026-02-04T13:51:15+00:00 · Latest: 2026-02-04T13:51:15+00:00

Comments: 17 pages

Abstract

Understanding visual degradations is a critical yet challenging problem in computer vision. While recent Vision-Language Models (VLMs) excel at qualitative description, they often fall short in understanding the parametric physics underlying image degradations. In this work, we redefine degradation understanding as a hierarchical structured prediction task, necessitating the concurrent estimation of degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in disparate spaces, we prove that they can be unified under one autoregressive next-token prediction paradigm, whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning using structured rewards. Furthermore, we show that DU-VLM can serve as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. We also introduce DU-110k, a large-scale dataset comprising 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments demonstrate that our approach significantly outperforms generalist baselines in both accuracy and robustness, exhibiting generalization to unseen distributions.

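Chinese Title / Abstract (translated)

Title: Understanding Degradation with Vision Language Model

Understanding visual degradations is a critical yet challenging problem in computer vision. Recent vision-language models (VLMs) excel at qualitative description but often fall short in understanding the parametric physics behind image degradations. In this paper, we redefine degradation understanding as a hierarchical structured prediction task that requires jointly estimating degradation types, parameter keys, and their continuous physical values. Although these sub-tasks operate in different spaces, we prove that they can be unified under a single autoregressive next-token prediction paradigm whose error is bounded by the value-space quantization grid. Building on this insight, we introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning with structured rewards. We also show that DU-VLM can act as a zero-shot controller for pre-trained diffusion models, enabling high-fidelity image restoration without fine-tuning the generative backbone. In addition, we introduce DU-110k, a large-scale dataset of 110,000 clean-degraded pairs with grounded physical annotations. Extensive experiments show that our approach significantly outperforms generalist baselines in accuracy and robustness and generalizes to unseen distributions.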

Summary

This study addresses the challenge of understanding image degradations by framing it as a hierarchical structured prediction task that requires estimating degradation types, parameter keys, and their continuous physical values. The authors introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning, which can also serve as a zero-shot controller for pre-trained diffusion models and thereby enable image restoration without fine-tuning the generative backbone. They also introduce the DU-110k dataset of 110,000 clean-degraded pairs. Experiments show better accuracy and robustness than generalist baselines.

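The study recasts image degradation understanding as a hierarchical structured prediction task. The authors introduce DU-VLM, a multimodal chain-of-thought model trained with supervised fine-tuning and reinforcement learning that predicts degradation types, parameter keys, and their continuous physical values. Used as a zero-shot controller for pre-trained diffusion models, DU-VLM enables image restoration, and the method outperforms generalist baselines in accuracy and robustness while generalizing well to unseen data distributions. The authors also present the DU-110k dataset of 110,000 clean-degraded image pairs for training and evaluation.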
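
The statement that the prediction error is bounded by the value-space quantization grid can be made concrete with a toy example. The sketch assumes a uniform grid and nearest-level rounding, in which case the round-trip error for an in-range value is at most half the grid step; the paper's actual tokenization and bound are not given in the abstract, and the range, level count, and sample values below are made up.

```python
from typing import List


def grid_step(low: float, high: float, levels: int) -> float:
    """Spacing of a uniform grid with `levels` points covering [low, high]."""
    if levels < 2 or high <= low:
        raise ValueError("need at least two levels and high > low")
    return (high - low) / (levels - 1)


def quantize(value: float, low: float, high: float, levels: int) -> int:
    """Index of the nearest grid point (the 'token' for a continuous value)."""
    step = grid_step(low, high, levels)
    clipped = min(max(value, low), high)
    return round((clipped - low) / step)


def dequantize(index: int, low: float, high: float, levels: int) -> float:
    """Continuous value represented by a grid index."""
    return low + index * grid_step(low, high, levels)


LOW, HIGH, LEVELS = 0.0, 5.0, 11  # hypothetical parameter range, step 0.5
STEP = grid_step(LOW, HIGH, LEVELS)

samples: List[float] = [0.0, 0.26, 1.74, 3.1, 5.0]
errors = [
    abs(dequantize(quantize(v, LOW, HIGH, LEVELS), LOW, HIGH, LEVELS) - v)
    for v in samples
]
max_error = max(errors)
print(max_error <= STEP / 2)
```

A finer grid (more levels) shrinks the step and therefore the worst-case error, at the cost of a larger value vocabulary.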

EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Authors: Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang, Börje F. Karlsson

First: 2026-02-04T13:04:56+00:00 · Latest: 2026-02-04T13:04:56+00:00

Abstract

Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into various, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that can predict locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial reasoning question-answering, and simulated environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.

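Chinese Title / Abstract (translated)

Title: EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models

Deploying humanoid robots in real-world environments is fundamentally challenging: it requires tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a new task, EgoActing, which requires directly grounding high-level instructions into varied, precise, spatially aware humanoid actions. We instantiate the task with EgoActor, a unified and scalable vision-language model (VLM) that predicts locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We use broad supervision over egocentric RGB-only data from real-world demonstrations, spatial-reasoning question answering, and simulated-environment demonstrations, which lets EgoActor make robust, context-aware decisions and perform fluent action inference (under 1 s) with both 8B and 4B parameter models. Extensive evaluations in simulated and real-world environments show that EgoActor effectively bridges abstract task planning and concrete motor execution while generalizing across diverse tasks and unseen environments.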

Summary

The paper addresses the challenge of deploying humanoid robots in real-world settings by proposing a novel task called EgoActing, which requires grounding high-level instructions into precise spatially aware actions. To achieve this, the authors introduce EgoActor, a unified vision-language model that predicts various humanoid actions including locomotion, head movements, manipulation commands, and human-robot interactions. The model is trained on egocentric RGB data, spatial reasoning, and simulated environment demonstrations, enabling it to make context-aware decisions and perform fluent action inference in under 1 second. Evaluations show that EgoActor effectively bridges abstract task planning and concrete motor execution, generalizing across diverse tasks and unseen environments.

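The paper proposes the EgoActing task, which aims to turn high-level instructions directly into concrete, spatially aware actions. It introduces EgoActor, a unified vision-language model that predicts actions such as locomotion, head movements, and manipulation commands. The model draws on egocentric RGB data, spatial reasoning, and simulated demonstrations to make context-aware decisions and performs fluent action inference in under 1 second. Evaluations show that EgoActor effectively connects abstract task planning with concrete action execution and applies across diverse tasks and unseen environments.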

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

Authors: Yufeng Zhong, Lei Chen, Xuanle Zhao, Wenkang Han, Liming Zheng, Jing Huang, Deyang Jiang, Yilin Cao, Lin Ma, Zhixiong Zeng

First: 2026-01-29T12:43:02+00:00 · Latest: 2026-02-04T12:53:33+00:00

Abstract

The development of large vision language models drives the demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods primarily focus on recognizing text elements from images or scanned documents (text-centric OCR), neglecting the identification of visual elements from visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In reality, these visually information-dense images are widespread on the internet and have significant real-world application value, such as data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic end-to-end OCR method that unifies text-centric OCR and vision-centric OCR. To this end, we construct comprehensive data engineering to cover a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. Moreover, we propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, since different domains require different output formats and expected outputs, the RL stage provides enough flexibility to customize reward signals for each domain, thereby improving cross-domain fusion and avoiding data conflicts. Experimental results demonstrate the effectiveness of OCRVerse, achieving competitive results across text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.

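Chinese Title / Abstract (translated)

Title: OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models

The development of large vision-language models drives demand for managing and applying massive amounts of multimodal data, making OCR technology, which extracts information from visual images, increasingly popular. However, existing OCR methods focus mainly on recognizing text elements in images or scanned documents (text-centric OCR) and neglect identifying visual elements in visually information-dense image sources (vision-centric OCR), such as charts, web pages, and scientific plots. In practice, such visually information-dense images are widespread on the internet and have significant real-world value, for example in data visualization and web page analysis. In this technical report, we propose OCRVerse, the first holistic end-to-end OCR method that unifies text-centric and vision-centric OCR. To this end, we build comprehensive data engineering that covers a wide range of text-centric documents, such as newspapers, magazines, and books, as well as vision-centric rendered composites, including charts, web pages, and scientific plots. We further propose a two-stage SFT-RL multi-domain training method for OCRVerse. SFT directly mixes cross-domain data to train the model and establish initial domain knowledge, while RL focuses on designing personalized reward strategies for the characteristics of each domain. Specifically, because different domains require different output formats and expected outputs, the RL stage is flexible enough to customize reward signals per domain, which improves cross-domain fusion and avoids data conflicts. Experimental results demonstrate the effectiveness of OCRVerse: it achieves competitive results on both text-centric and vision-centric data types, even comparable to large-scale open-source and closed-source models.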

Summary

OCRVerse is a holistic OCR method that integrates text-centric and vision-centric OCR in an end-to-end manner. It addresses the limitations of existing OCR methods by constructing a comprehensive dataset and proposing a two-stage SFT-RL training method. The method achieves competitive results across different types of OCR data, even matching those of large-scale models.

OCRVerse 是一种端到端的综合 OCR 方法，能够同时处理文本中心和视觉中心的 OCR 任务。它通过全面的数据工程涵盖了各种类型的文档和渲染复合体，并采用两阶段的 SFT-RL 训练方法来提高跨域融合效果。实验结果表明，OCRVerse 在不同数据类型上的表现具有竞争力，甚至可以媲美大规模开源和闭源模型。

Think3D: Thinking with Space for Spatial Reasoning

Authors: Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu

First: 2026-01-19T13:13:54+00:00 · Latest: 2026-02-04T12:38:43+00:00

Abstract

Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.

中文标题/摘要

标题：Think3D：利用空间思考进行空间推理

理解和推理物理世界需要空间智能：超越二维感知的几何、透视和空间关系的解读能力。尽管最近的视觉大型模型（VLMs）在视觉理解方面表现出色，但它们本质上仍然是二维感知者，并且在真正的三维推理方面存在困难。我们引入了Think3D，这是一种框架，使VLM代理能够利用三维空间进行思考。通过利用三维重建模型从图像或视频中恢复点云和相机姿态，Think3D使代理能够通过基于相机的操作和第一人称/全局视图切换主动操控空间，将空间推理转化为一个互动的三维推理过程。在无需额外训练的情况下，Think3D显著提高了GPT-4.1和Gemini 2.5 Pro等高级模型的空间推理性能，在BLINK多视图和MindCube上平均提高了7.8%，在VSI-Bench上提高了4.7%。我们进一步表明，那些在空间探索方面存在困难的小型模型可以从强化学习策略中显著受益，该策略使模型能够选择信息性的视角和操作。通过RL，工具使用带来的益处从+0.7%增加到+6.8%。我们的研究结果表明，无需训练的工具增强的空间探索是通向更灵活和类人三维推理的多模态代理的新途径，建立了多模态智能的新维度。代码和权重发布在https://github.com/zhangzaibin/spagent/

Summary / 总结

The research aims to enhance the spatial reasoning capabilities of vision large models (VLMs) by enabling them to think with 3D space. Think3D leverages 3D reconstruction models to allow VLM agents to manipulate 3D space through camera-based operations and viewpoint switching. This framework significantly improves the spatial reasoning performance of advanced models like GPT-4.1 and Gemini 2.5 Pro, with average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. Smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that helps them select informative viewpoints, increasing the benefit from tool usage from +0.7% to +6.8%. The study shows that training-free, tool-augmented spatial exploration can enhance multimodal agents' 3D reasoning capabilities.

研究旨在通过使视觉大型模型（VLMs）能够利用三维空间进行思考来增强其空间推理能力。Think3D 利用 3D 重建模型，使 VLM 代理能够通过相机操作和视角切换来操作三维空间。该框架显著提高了 GPT-4.1 和 Gemini 2.5 Pro 等先进模型的空间推理性能，在 BLINK 多视图和 MindCube 上平均提升 7.8%，在 VSI-Bench 上平均提升 4.7%。较小的模型可从强化学习策略中显著受益，该策略帮助它们选择有信息量的视角，使工具使用的益处从 +0.7% 增加到 +6.8%。研究显示，无需训练、工具增强的空间探索可以增强多模态代理的三维推理能力。
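
Editor's sketch: the camera-based operations described in the Think3D abstract amount to re-rendering a reconstructed point cloud from a new camera pose. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: the function names `yaw_rotation` and `project`, the pinhole intrinsics, and the three-point cloud are all hypothetical.

```python
import numpy as np

def yaw_rotation(angle_rad):
    """Rotation matrix about the camera's vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project(points_world, rotation, translation, focal, cx, cy):
    """Pinhole projection of world points into a camera with pose (R, t).

    rotation and translation map world coordinates to camera coordinates:
    p_cam = R @ p_world + t. Points behind the camera are dropped.
    """
    pts_cam = points_world @ rotation.T + translation
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam = pts_cam[in_front]
    u = focal * pts_cam[:, 0] / pts_cam[:, 2] + cx
    v = focal * pts_cam[:, 1] / pts_cam[:, 2] + cy
    return np.stack([u, v], axis=1)

# Hypothetical toy cloud: two points in front of the camera, one behind it.
cloud = np.array([[0.0, 0.0, 5.0], [1.0, 0.0, 5.0], [0.0, 0.0, -5.0]])

# Original viewpoint, then the same scene after turning the camera by 180 degrees.
identity_view = project(cloud, np.eye(3), np.zeros(3), focal=100.0, cx=64.0, cy=64.0)
turned_view = project(cloud, yaw_rotation(np.pi), np.zeros(3), focal=100.0, cx=64.0, cy=64.0)
```

Turning the camera changes which points are visible, which is the kind of viewpoint switch an agent could request before answering a spatial question.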

Less Precise Can Be More Reliable: A Systematic Evaluation of Quantization's Impact on CLIP Beyond Accuracy

Authors: Aymen Bouguerra, Daniel Montoya, Alexandra Gomez-Villa, Chokri Mraidha, Fabio Arnez

First: 2025-09-25T13:54:34+00:00 · Latest: 2026-02-04T09:44:34+00:00

Comments: Preprint

Abstract

Vision-Language Models (VLMs) such as CLIP have revolutionized zero-shot classification and safety-critical tasks, including Out-of-Distribution (OOD) detection. However, their high computational cost hinders efficient real-world deployment. While quantization is a standard solution for efficiency, its broader impact on reliability metrics beyond simple Top-1 accuracy remains critically under-explored. In this study, we conduct a large-scale evaluation of VLM quantization across a comprehensive experimental suite of over 700k evaluation runs with varying configurations. We find that, contrary to the assumption that quantization's noise degrades performance, it can simultaneously improve accuracy, calibration, OOD detection, and robustness to noise, though not to covariate shift or spurious correlations. We leverage these counterintuitive findings to characterize the mechanics of quantization beyond simple regularization: we show that quantization dampens high-rank spectral components, compelling the model to rely more heavily on robust, low-rank features. Ultimately, this spectral filtering effect drives the observed improvements in generalization and noise tolerance, establishing a pathway to deploy faster, more reliable VLMs by utilizing quantization beyond its conventional role.

中文标题/摘要

标题：更不精确可能更可靠：量化对CLIP超越准确性的影响的系统评估

视觉-语言模型（VLMs）如CLIP已经革新了零样本分类和包括离群值检测在内的安全关键任务。然而，它们的高计算成本阻碍了其在现实世界中的高效部署。虽然量化是提高效率的标准解决方案，但其对可靠性指标的影响远超简单的Top-1准确率，这一影响仍然被严重忽视。在本研究中，我们对VLM量化进行了大规模评估，涵盖了超过70万次不同配置的实验运行。我们发现，与量化噪声会降低性能的假设相反，量化可以同时提高准确率、校准、离群值检测和对噪声的鲁棒性，尽管对协变量变化或伪相关没有影响。我们利用这些出乎意料的发现来描述量化机制超越简单正则化的作用：我们证明量化抑制了高阶谱成分，迫使模型更多依赖于鲁棒的低阶特征。最终，这种谱过滤效应推动了观察到的泛化和噪声容忍度的改进，为通过超越传统角色的量化部署更快、更可靠的VLMs奠定了路径。

Summary / 总结

This study evaluates the impact of quantization on Vision-Language Models (VLMs) like CLIP, focusing on metrics beyond accuracy. Through a large-scale evaluation involving over 700k runs, the research reveals that quantization can simultaneously enhance accuracy, calibration, Out-of-Distribution (OOD) detection, and robustness to noise, contrary to the common belief that quantization degrades performance. The findings suggest that quantization filters out high-rank spectral components, promoting reliance on robust, low-rank features, which improves generalization and noise tolerance.

该研究评估了量化对视觉-语言模型（VLMs）如CLIP的影响，重点在于超越简单准确性的指标。通过涉及超过700,000次运行的大规模评估，研究发现，与量化会降低性能的常见假设相反，量化可以同时提升准确率、校准、分布外（OOD）检测和对噪声的鲁棒性。研究结果表明，量化过滤掉高秩谱成分，促使模型更多依赖于稳健的低秩特征，从而提高泛化能力和对噪声的容忍度。
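
Editor's sketch: one way to look at the spectral effect the abstract describes is to fake-quantize a weight matrix and compare its singular values before and after. The code below assumes a symmetric per-tensor uniform quantizer and a random stand-in matrix; neither is taken from the paper, and the script makes no claim about which singular values shrink. It only sets up the comparison.

```python
import numpy as np

def fake_quantize(weights, num_bits=8):
    """Symmetric per-tensor uniform quantization followed by dequantization."""
    weights = np.asarray(weights, dtype=np.float64)
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    max_abs = np.max(np.abs(weights))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(weights / scale), -qmax, qmax)
    return codes * scale, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32))             # hypothetical stand-in for a weight matrix
w_q, scale = fake_quantize(w, num_bits=4)

# Singular values before and after quantization, largest first, for inspection.
sv_full = np.linalg.svd(w, compute_uv=False)
sv_quant = np.linalg.svd(w_q, compute_uv=False)

# Rounding to the nearest grid point moves each weight by at most half a step.
max_error = np.max(np.abs(w - w_q))
```

Plotting `sv_full` against `sv_quant` for real CLIP weights, rather than random ones, would be the natural next step for checking the paper's spectral-filtering claim.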

SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

Authors: Zekun Li, Ning Wang, Tongxin Bai, Changwang Mei, Peisong Wang, Shuang Qiu, Jian Cheng

First: 2026-02-04T09:34:06+00:00 · Latest: 2026-02-04T09:34:06+00:00

Abstract

Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior accelerations often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\mathbf{> 5\times}$ faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to 1s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a $\mathbf{2.28\times}$ acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.

Summary / 总结

SparVAR is a training-free acceleration framework for Visual AutoRegressive (VAR) modeling that addresses the high computational complexity and latency issues by exploiting the properties of strong attention sinks, cross-scale activation similarity, and pronounced locality. It dynamically predicts the sparse attention pattern of high-resolution scales and constructs scale self-similar sparse attention, achieving more than 5 times faster forward speed than FlashAttention. Experiments show that SparVAR can reduce the generation time of an 8B model producing 1024x1024 images to 1 second without skipping scales, and it achieves a 1.57 times speed-up while preserving high-frequency details. Combined with scale-skipping strategies, SparVAR can achieve up to 2.28 times acceleration while maintaining competitive visual quality.

SparVAR 是一种无需训练的加速框架，用于解决 Visual AutoRegressive (VAR) 模型的计算复杂性问题，通过利用强注意力汇、跨尺度激活相似性和明显的局部性等特性。它从稀疏决策尺度动态预测稀疏注意力模式，并使用高效的索引映射机制，在大尺度下实现高效稀疏注意力计算。实验结果表明，SparVAR 可以将生成 1024x1024 图像的 8B 模型的生成时间减少到 1 秒，同时不跳过最后一级尺度，实现 1.57 倍的加速，同时保留高频细节。结合现有的尺度跳过策略，SparVAR 可以实现高达 2.28 倍的加速，同时保持竞争力的视觉生成质量。
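
Editor's sketch: the quartic growth mentioned in the SparVAR abstract can be checked with a simple count. A scale with side length r has r² tokens, and if each of them attends to every token of the current and earlier scales, the number of query-key pairs grows roughly as r⁴. The scale schedules below are hypothetical and chosen only to show the growth rate.

```python
# Count query-key pairs for next-scale prediction when every token of a scale
# attends to all tokens of that scale and of every earlier scale.

def attention_pairs(side_lengths):
    """Return the number of query-key pairs at each scale."""
    pairs = []
    history = 0  # tokens accumulated over earlier scales
    for side in side_lengths:
        queries = side * side          # tokens at the current scale
        keys = history + queries       # full attention over all scales so far
        pairs.append(queries * keys)
        history = keys
    return pairs

small = attention_pairs([1, 2, 4, 8, 16, 32])
large = attention_pairs([2, 4, 8, 16, 32, 64])

# Doubling every side length multiplies the last-scale cost by 2**4 = 16.
ratio = large[-1] / small[-1]
print(small[-1], large[-1], ratio)
```

This is why skipping or sparsifying the last few scales dominates the savings: the final scale alone accounts for most of the attention work.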

When and Where to Attack? Stage-wise Attention-Guided Adversarial Attack on Large Vision Language Models

Authors: Jaehyun Kwak, Nam Cao, Boryeong Cho, Segyu Lee, Sumyeong Ahn, Se-Young Yun

First: 2026-02-04T09:29:10+00:00 · Latest: 2026-02-04T09:29:10+00:00

Comments: Pre-print

Abstract

Adversarial attacks against Large Vision-Language Models (LVLMs) are crucial for exposing safety vulnerabilities in modern multimodal systems. Recent attacks based on input transformations, such as random cropping, suggest that spatially localized perturbations can be more effective than global image manipulation. However, randomly cropping the entire image is inherently stochastic and fails to use the limited per-pixel perturbation budget efficiently. We make two key observations: (i) regional attention scores are positively correlated with adversarial loss sensitivity, and (ii) attacking high-attention regions induces a structured redistribution of attention toward subsequent salient regions. Based on these findings, we propose Stage-wise Attention-Guided Attack (SAGA), an attention-guided framework that progressively concentrates perturbations on high-attention regions. SAGA enables more efficient use of constrained perturbation budgets, producing highly imperceptible adversarial examples while consistently achieving state-of-the-art attack success rates across ten LVLMs. The source code is available at https://github.com/jackwaky/SAGA.

中文标题/摘要

标题：何时何地发动攻击？分阶段注意力引导的对抗攻击

对抗攻击针对大型视觉语言模型（LVLMs）对于揭示现代多模态系统中的安全性漏洞至关重要。基于输入变换的攻击，如随机裁剪，表明空间局部扰动比全局图像操作更有效。然而，整个图像的随机裁剪本质上是随机的，并不能有效地利用有限的每个像素扰动预算。我们做出了两个关键观察：(i) 区域注意力分数与对抗损失敏感性正相关，(ii) 攻击高注意力区域会促使注意力向后续显著区域有结构地重新分配。基于这些发现，我们提出了分阶段注意力引导攻击（SAGA），这是一种注意力引导框架，逐步将扰动集中在高注意力区域。SAGA 使受约束的扰动预算更有效地使用，产生高度不可感知的对抗样本，同时在十种LVLMs上持续实现最先进的攻击成功率。源代码可在 https://github.com/jackwaky/SAGA 获取。

Summary / 总结

The research aims to improve adversarial attacks on large vision-language models by focusing on spatially localized perturbations. The method, Stage-wise Attention-Guided Attack (SAGA), uses regional attention scores to guide the attack, concentrating perturbations on high-attention regions to achieve high attack success rates while maintaining imperceptibility. Key findings show that SAGA outperforms previous methods across ten LVLMs, demonstrating a structured redistribution of attention and more efficient use of perturbation budgets.

研究旨在通过聚焦空间局部扰动来改进对大型视觉语言模型的对抗攻击。方法Stage-wise Attention-Guided Attack (SAGA) 使用区域注意力分数来引导攻击，将扰动集中在高注意力区域，从而实现高度不可感知的对抗样本，并在十种LVLM中达到最先进的攻击成功率。源代码可在 https://github.com/jackwaky/SAGA 获取。

Explicit Uncertainty Modeling for Active CLIP Adaptation with Dual Prompt Tuning

Authors: Qian-Wei Wang, Yaguang Song, Shu-Tao Xia

First: 2026-02-04T09:01:55+00:00 · Latest: 2026-02-04T09:01:55+00:00

Abstract

Pre-trained vision-language models such as CLIP exhibit strong transferability, yet adapting them to downstream image classification tasks under limited annotation budgets remains challenging. In active learning settings, the model must select the most informative samples for annotation from a large pool of unlabeled data. Existing approaches typically estimate uncertainty via entropy-based criteria or representation clustering, without explicitly modeling uncertainty from the model perspective. In this work, we propose a robust uncertainty modeling framework for active CLIP adaptation based on dual-prompt tuning. We introduce two learnable prompts in the textual branch of CLIP. The positive prompt enhances the discriminability of task-specific textual embeddings corresponding to light-weight tuned visual embeddings, improving classification reliability. Meanwhile, the negative prompt is trained in a reversed manner to explicitly model the probability that the predicted label is correct, providing a principled uncertainty signal for guiding active sample selection. Extensive experiments across different fine-tuning paradigms demonstrate that our method consistently outperforms existing active learning methods under the same annotation budget.

中文标题/摘要

标题：显式建模不确定性以适应具有双提示调优的主动CLIP

预训练的跨模态模型如CLIP表现出很强的迁移性，但在有限标注预算下将它们适应下游图像分类任务仍然具有挑战性。在主动学习设置中，模型必须从大量未标注数据中选择最具信息量的样本进行标注。现有方法通常通过基于熵的标准或表示聚类来估计不确定性，但没有从模型的角度显式建模不确定性。在本文中，我们基于双提示调优提出了一个鲁棒的不确定性建模框架以适应主动CLIP。我们在CLIP的文本分支中引入了两个可学习的提示。正提示增强了与轻量级调优视觉嵌入相对应的任务特定文本嵌入的可区分性，提高了分类可靠性。同时，负提示以相反的方式进行训练，以显式建模预测标签正确的概率，为指导主动样本选择提供一个原则性的不确定性信号。在不同微调范式的广泛实验中，我们的方法在相同的标注预算下始终优于现有的主动学习方法。

Summary / 总结

This work addresses the challenge of adapting pre-trained vision-language models like CLIP to downstream image classification tasks with limited labeled data. It proposes a robust uncertainty modeling framework using dual-prompt tuning. By introducing two learnable prompts in the textual branch of CLIP, the method enhances task-specific textual embeddings and explicitly models the probability of correct predictions, leading to improved classification reliability and better active sample selection. Experiments show that this approach outperforms existing active learning methods under the same annotation budget.

该研究旨在解决在有限标注数据下将预训练的视觉-语言模型CLIP适应到下游图像分类任务的挑战。提出了一种基于双提示调优的稳健不确定性建模框架。通过引入两个可学习的提示，该方法增强了任务特定的文本嵌入，并明确建模预测标签正确的概率，从而指导主动样本选择。实验表明，在相同的标注预算下，该方法优于现有主动学习方法。

Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

Authors: Qian-Wei Wang, Guanghao Meng, Ren Cai, Yaguang Song, Shu-Tao Xia

First: 2026-02-04T09:00:12+00:00 · Latest: 2026-02-04T09:00:12+00:00

Abstract

Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.

中文标题/摘要

标题：在无需人工标注的情况下微调预训练的跨模态模型

大规模的跨模态模型（VLMs）如CLIP表现出强大的零样本泛化能力，但将其适应下游任务通常需要昂贵的标注数据。现有的无监督自我训练方法依赖于伪标签，但往往遭受不可靠的置信度过滤、确认偏差和低置信度样本的利用不足。我们提出了一种协作微调（CoFT）框架，该框架通过双模型、跨模态协作机制利用未标注数据。CoFT引入了一种双提示学习策略，使用正向和负向文本提示以样本依赖的方式明确建模伪标签的清洁度，从而去除手工设计的阈值或噪声假设的需要。负向提示还正则化轻量级的视觉适应模块，提高在嘈杂监督下的鲁棒性。CoFT采用两阶段训练方案，从高置信度样本的参数高效微调过渡到由协作过滤的伪标签引导的全面微调。基于CoFT，CoFT+进一步通过迭代微调、动量对比学习和LLM生成的提示增强了适应性。广泛的实验表明，CoFT在现有无监督方法上以及少量监督基线上均表现出一致的改进。

Summary / 总结

The research aims to improve the adaptation of large vision-language models (VLMs) like CLIP to downstream tasks without requiring labeled data. The proposed Collaborative Fine-Tuning (CoFT) method uses a dual-model, cross-modal collaboration mechanism with dual-prompt learning to model pseudo-label cleanliness and improve robustness. CoFT employs a two-phase training scheme, starting with parameter-efficient fine-tuning on high-confidence samples and transitioning to full fine-tuning with collaboratively filtered pseudo-labels. Experiments show consistent improvements over existing unsupervised methods and few-shot supervised baselines.

研究旨在通过无需标注数据来改进大型视觉-语言模型（VLMs）如CLIP在下游任务中的适应性。提出的协作微调（CoFT）方法使用了双模型、跨模态协作机制和双提示学习来建模伪标签的清洁度并提高鲁棒性。CoFT采用两阶段训练方案，首先进行参数高效的高置信度样本微调，然后过渡到由协作过滤伪标签引导的全面微调。实验表明，CoFT在现有无监督方法和少量监督基线上表现出一致的改进。

Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Authors: Zipeng Zhu, Zhanghao Hu, Qinglin Zhu, Yuxi Hong, Yijun Liu, Jingyong Su, Yulan He, Lin Gui

First: 2026-02-04T08:13:01+00:00 · Latest: 2026-02-04T08:13:01+00:00

Comments: 9 pages, 5 figures

Abstract

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.

中文标题/摘要

标题：超越静态裁剪：层自适应视觉定位与解码增强

大型视觉-语言模型（LVLMs）通过将视觉片段与文本嵌入空间对齐取得了快速进展，但固定的视觉标记预算迫使图像调整为统一的预训练分辨率，往往抹去了细粒度的细节并导致了对语言先验的过度依赖而产生的幻觉。最近的注意力引导增强（例如裁剪或区域聚焦注意力分配）可以缓解这一问题，但它们通常依赖于在简单识别基准上经验性选择的“魔法层”，因此可能无法适用于复杂的推理任务。与这种静态假设相反，我们提出了一种视觉定位的动态视角。通过逐层敏感性分析，我们证明了视觉定位是一个动态过程：虽然简单的物体识别任务依赖于中间层，但复杂的视觉搜索和推理任务需要视觉信息在更深的层中重新激活。基于这一观察，我们引入了查询驱动的视觉激活（VAQ），这是一种通过测量注意力对输入查询的敏感性来识别最相关的注意力图层的度量标准。基于VAQ，我们进一步提出了LASER（层自适应注意力引导选择性视觉和解码增强推理），这是一种无需训练的推理过程，可以根据任务适当地选择视觉定位和问答所需的层。跨多种VQA基准的实验表明，LASER在不同复杂度的任务中显著提高了VQA的准确性。

Summary / 总结

The research aims to address the limitations of static cropping in visual localization and decoding enhancement within large vision-language models. The method involves a layer-wise sensitivity analysis to show that visual grounding is dynamic, requiring different layers for simple and complex tasks. LASER, a layer-adaptive attention-guided selective visual and decoding enhancement for reasoning, is proposed to adaptively select appropriate layers for visual localization and question answering, improving VQA accuracy across various benchmarks.

研究旨在解决静态裁剪在视觉定位和解码增强中的局限性。方法包括逐层敏感性分析，以识别与查询特定视觉定位最相关的层，从而开发出视觉激活查询（VAQ）。基于VAQ，提出了LASER（层适应性注意力引导的选择性视觉和解码增强以进行推理），该方法能够根据任务选择合适的层进行视觉定位和问答。实验结果表明，LASER在不同复杂度的VQA基准测试中显著提高了问答准确性。
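
Editor's sketch: the abstract describes VAQ as a score of how sensitive each layer's attention map is to the input query, with the highest-scoring layer used for localization. The snippet below illustrates that idea only. The exact VAQ definition is not given in the abstract, so the total-variation distance, the comparison against a generic baseline prompt, and the toy attention maps are all assumptions.

```python
import numpy as np

def normalize(attn):
    """Rescale an attention map over image patches so it sums to 1."""
    attn = np.asarray(attn, dtype=np.float64)
    return attn / attn.sum()

def query_sensitivity(attn_with_query, attn_with_baseline):
    """Total-variation distance between two attention maps (assumed metric)."""
    p, q = normalize(attn_with_query), normalize(attn_with_baseline)
    return 0.5 * float(np.abs(p - q).sum())

def select_layer(per_layer_query, per_layer_baseline):
    """Return the index of the most query-sensitive layer and all scores."""
    scores = [query_sensitivity(p, q)
              for p, q in zip(per_layer_query, per_layer_baseline)]
    return int(np.argmax(scores)), scores

# Hypothetical attention over 4 image patches at 3 layers.
uniform = [0.25, 0.25, 0.25, 0.25]
with_query = [uniform, [0.7, 0.1, 0.1, 0.1], [0.4, 0.2, 0.2, 0.2]]
with_baseline = [uniform, uniform, uniform]

best_layer, scores = select_layer(with_query, with_baseline)
```

In this toy case the middle layer reacts most strongly to the query, so its attention map would be the one used to crop or re-weight the image.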

MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

Authors: Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim

First: 2025-11-21T04:33:11+00:00 · Latest: 2026-02-04T07:29:14+00:00

Abstract

Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical chain-of-thought reasoning. However, existing privacy benchmarks remain structurally insufficient for this threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a bilingual multimodal dataset with synthetic individual profiles, where identifiers (e.g., faces, names) are linked to sensitive attributes. This design enables nine challenging tasks spanning attribute detection, cross-image re-identification, and chained inference. We conduct a large-scale evaluation of over 50 open-source and commercial VLMs. Our analysis shows that 60 percent of widely used VLMs can perform individual-level privacy reasoning with up to 80 percent accuracy, posing a significant threat to personal privacy. MultiPriv provides a foundation for developing and assessing privacy-preserving VLMs.

中文标题/摘要

标题：MultiPriv：视觉语言模型个体级隐私推理基准测试

现代视觉语言模型（VLMs）通过层次链式推理将碎片化的多模态数据与可识别的个体联系起来，从而带来重大的个体级隐私风险。然而，现有的隐私基准在结构上仍然不足以应对这一威胁，因为它们主要评估隐私感知，而未能解决更为关键的隐私推理风险：VLMs将分散的信息推断并链接起来构建个体档案的能力。为了解决这一差距，我们提出了MultiPriv，这是第一个旨在系统评估VLMs个体级隐私推理的基准测试。我们引入了隐私感知与推理（PPR）框架，并构建了一个双语多模态数据集，其中包含合成的个体档案，标识符（例如，面孔、姓名）与敏感属性相关联。这种设计使我们能够进行九项具有挑战性的任务，涵盖属性检测、跨图像再识别和链式推理。我们对超过50个开源和商用VLMs进行了大规模评估。我们的分析表明，60%的广泛使用的VLMs可以以高达80%的准确率进行个体级隐私推理，这构成了对个人隐私的重大威胁。MultiPriv为开发和评估隐私保护的VLMs提供了基础。

Summary / 总结

The research aims to address the individual-level privacy risks posed by Vision-Language Models (VLMs) through hierarchical chain-of-thought reasoning. To fill the gap in existing privacy benchmarks, the study introduces MultiPriv, a new benchmark that evaluates VLMs' ability to infer and link distributed information to construct individual profiles. The evaluation involves nine challenging tasks and a bilingual multimodal dataset with synthetic individual profiles. The results show that 60 percent of VLMs can perform individual-level privacy reasoning with up to 80 percent accuracy, highlighting a significant threat to personal privacy.

论文提出了MultiPriv，这是一个用于评估Vision-Language模型（VLM）个体级隐私推理能力的基准。它填补了现有隐私基准的空白，重点关注隐私推理这一关键风险，即VLM通过推理和链接分散的信息来构建个人档案。该基准使用包含合成个人档案的双语多模态数据集，并评估了超过50个VLM，结果显示60%的VLM可以以高达80%的准确率进行个体级隐私推理，突显了对个人隐私的重大威胁。

KVSmooth: Mitigating Hallucination in Multi-modal Large Language Models through Key-Value Smoothing

Authors: Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He

First: 2026-02-04T06:59:17+00:00 · Latest: 2026-02-04T06:59:17+00:00

Abstract

Despite the significant progress of Multimodal Large Language Models (MLLMs) across diverse tasks, hallucination -- corresponding to the generation of visually inconsistent objects, attributes, or relations -- remains a major obstacle to their reliable deployment. Unlike pure language models, MLLMs must ground their generation process in visual inputs. However, existing models often suffer from semantic drift during decoding, causing outputs to diverge from visual facts as the sequence length increases. To address this issue, we propose KVSmooth, a training-free and plug-and-play method that mitigates hallucination by performing attention-entropy-guided adaptive smoothing on hidden states. Specifically, KVSmooth applies an exponential moving average (EMA) to both keys and values in the KV-Cache, while dynamically quantifying the sink degree of each token through the entropy of its attention distribution to adaptively adjust the smoothing strength. Unlike computationally expensive retraining or contrastive decoding methods, KVSmooth operates efficiently during inference without additional training or model modification. Extensive experiments demonstrate that KVSmooth significantly reduces hallucination ($\mathit{CHAIR}_{S}$ from $41.8 \rightarrow 18.2$) while improving overall performance ($F_1$ score from $77.5 \rightarrow 79.2$), achieving higher precision and recall simultaneously. In contrast, prior methods often improve one at the expense of the other, validating the effectiveness and generality of our approach.

中文标题/摘要

标题：KVSmooth：通过键值平滑减轻多模态大型语言模型中的幻觉

尽管多模态大型语言模型（MLLMs）在多种任务中取得了显著进展，但在视觉输入不一致的对象、属性或关系生成方面——即幻觉——仍然是其可靠部署的主要障碍。与纯粹的语言模型不同，MLLMs 必须将其生成过程与视觉输入联系起来。然而，现有的模型在解码过程中经常出现语义漂移，导致输出随着序列长度的增加而与视觉事实相偏离。为了解决这一问题，我们提出了一种无需训练且即插即用的方法 KVSmooth，通过注意力-熵引导的自适应平滑来减轻幻觉。具体而言，KVSmooth 对 KV 缓存中的键和值应用指数移动平均（EMA），并通过每个令牌的注意力分布的熵动态量化其汇入程度，以自适应调整平滑强度。与计算成本高昂的重新训练或对比解码方法不同，KVSmooth 在推理过程中高效运行，无需额外的训练或模型修改。广泛的实验表明，KVSmooth 显著减少了幻觉（$\mathit{CHAIR}_{S}$ 从 $41.8 \rightarrow 18.2$），同时提高了整体性能（$F_1$ 分数从 $77.5 \rightarrow 79.2$），同时提高了精确度和召回率。相比之下，先前的方法往往在提高一个方面的同时牺牲另一个方面，这验证了我们方法的有效性和普适性。

Summary / 总结

KVSmooth is a training-free method that mitigates hallucination in multi-modal large language models by performing attention-entropy-guided adaptive smoothing on hidden states. It applies an exponential moving average to keys and values in the KV-Cache and dynamically adjusts the smoothing strength based on the entropy of the attention distribution. Experiments show that KVSmooth significantly reduces hallucination while improving overall performance, achieving higher precision and recall simultaneously, outperforming prior methods that often trade off one metric for another.

KVSmooth 是一种无需训练的方法，通过在隐藏状态上进行注意力-熵引导的自适应平滑来解决 MLLMs 中的幻觉问题。它对 KV 缓存中的键和值应用指数移动平均，并根据注意力分布的熵动态调整平滑强度。实验表明，KVSmooth 显著减少了幻觉并提高了整体性能，同时实现了更高的精确率和召回率，优于以往往往在一种指标上改进而牺牲另一种指标的方法。
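
Editor's sketch: the mechanism in the KVSmooth abstract, an exponential moving average over cached keys and values whose strength depends on attention entropy, can be illustrated in a few lines. This is not the authors' code. The abstract does not state how entropy maps to smoothing strength, so the linear mapping, its direction (higher entropy gives stronger smoothing), and the `max_strength` parameter are assumptions.

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy of one attention distribution, normalized to [0, 1]."""
    attn = np.asarray(attn, dtype=np.float64)
    attn = attn / attn.sum()
    nonzero = attn[attn > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    max_entropy = np.log(len(attn)) if len(attn) > 1 else 1.0
    return float(entropy / max_entropy)

def smooth_kv(keys, values, attn_rows, max_strength=0.5):
    """EMA-smooth per-token keys and values.

    keys, values: arrays of shape (T, d).
    attn_rows: list of T attention distributions, one per token.
    max_strength: hypothetical cap on the EMA weight given to history.
    """
    keys = np.asarray(keys, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    out_k, out_v = keys.copy(), values.copy()
    for t in range(1, len(keys)):
        # Assumed mapping: higher entropy -> stronger smoothing.
        lam = max_strength * attention_entropy(attn_rows[t])
        out_k[t] = (1.0 - lam) * keys[t] + lam * out_k[t - 1]
        out_v[t] = (1.0 - lam) * values[t] + lam * out_v[t - 1]
    return out_k, out_v

# Toy cache with three tokens; the last token has uniform (maximum-entropy) attention.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = keys * 2.0
attn_rows = [[1.0], [1.0, 0.0], [1.0 / 3, 1.0 / 3, 1.0 / 3]]
smoothed_k, smoothed_v = smooth_kv(keys, values, attn_rows)
```

Tokens with sharply peaked attention pass through unchanged here, while the diffuse third token is pulled halfway toward its predecessor.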

A Survey on Vision-Language-Action Models for Embodied AI

Authors: Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King

First: 2024-05-23T01:43:54+00:00 · Latest: 2026-02-04T06:41:11+00:00

Comments: Project page: https://github.com/yueen-ma/Awesome-VLA

Abstract

Embodied AI is widely recognized as a cornerstone of artificial general intelligence because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models and vision-language models, a new category of multimodal models -- referred to as vision-language-action models (VLAs) -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.

中文标题/摘要

标题：关于视觉-语言-行动模型在具身AI中的研究综述

具身AI被广泛认为是人工通用智能的基石，因为它涉及控制具身代理在物理世界中执行任务。在大型语言模型和视觉-语言模型取得成功的基础上，出现了一类新的多模态模型——称为视觉-语言-行动模型（VLAs），它们通过利用生成行动的独特能力来解决具身AI中的语言条件机器人任务。VLAs 的近期激增需要一个全面的综述来捕捉快速发展的景观。为此，我们提供了第一个关于VLAs的具身AI综述。本文提供了VLAs的详细分类，分为三大研究方向。第一个方向专注于VLAs的各个组件。第二个方向致力于开发基于VLAs的控制策略，擅长预测低级行动。第三个方向包括高级任务规划者，能够将长期任务分解为一系列子任务，从而引导VLAs遵循更广泛的用户指令。此外，我们还提供了相关资源的详细总结，包括数据集、模拟器和基准测试。最后，我们讨论了VLAs面临的挑战，并概述了具身AI的有希望的未来方向。与本文综述相关的精选资源库可在以下链接获取：https://github.com/yueen-ma/Awesome-VLA。

Summary / 总结

This survey explores vision-language-action models (VLAs) for embodied AI, motivated by the need to control embodied agents in physical tasks. The study categorizes VLAs into three lines of research: individual components, low-level action prediction, and high-level task planning. Key findings include the development of control policies for predicting actions and task planners for decomposing long-term tasks, along with an extensive summary of relevant resources and challenges in the field.

本文探讨了用于体态AI的视觉-语言-动作模型（VLAs），旨在控制物理世界中的体态代理。研究将VLAs分为三个主要领域：个体组件、低级动作预测和高级任务规划。关键发现包括开发用于预测动作的控制策略以及能够将长期任务分解为子任务的任务规划器，从而增强VLAs遵循用户指令的能力。还讨论了面临的挑战和未来方向。

AppleVLM: End-to-end Autonomous Driving with Advanced Perception and Planning-Enhanced Vision-Language Models

Authors: Yuxuan Han, Kunyuan Wu, Qianyi Shao, Renxiang Xiao, Zilu Wang, Cansen Jiang, Yi Xiao, Liang Hu, Yunjiang Lou

First: 2026-02-04T06:37:14+00:00 · Latest: 2026-02-04T06:37:14+00:00

Abstract

End-to-end autonomous driving has emerged as a promising paradigm integrating perception, decision-making, and control within a unified learning framework. Recently, Vision-Language Models (VLMs) have gained significant attention for their potential to enhance the robustness and generalization of end-to-end driving models in diverse and unseen scenarios. However, existing VLM-based approaches still face challenges, including suboptimal lane perception, language understanding biases, and difficulties in handling corner cases. To address these issues, we propose AppleVLM, an advanced perception and planning-enhanced VLM model for robust end-to-end driving. AppleVLM introduces a novel vision encoder and a planning strategy encoder to improve perception and decision-making. Firstly, the vision encoder fuses spatial-temporal information from multi-view images across multiple timesteps using a deformable transformer mechanism, enhancing robustness to camera variations and facilitating scalable deployment across different vehicle platforms. Secondly, unlike traditional VLM-based approaches, AppleVLM introduces a dedicated planning modality that encodes explicit Bird's-Eye-View spatial information, mitigating language biases in navigation instructions. Finally, a VLM decoder fine-tuned by a hierarchical Chain-of-Thought integrates vision, language, and planning features to output robust driving waypoints. We evaluate AppleVLM in closed-loop experiments on two CARLA benchmarks, achieving state-of-the-art driving performance. Furthermore, we deploy AppleVLM on an AGV platform and successfully showcase real-world end-to-end autonomous driving in complex outdoor environments.

中文标题/摘要

标题：AppleVLM：端到端自主驾驶的高级感知与规划增强视觉-语言模型

端到端自主驾驶已成为一种有前景的范式，将感知、决策和控制整合到统一的学习框架中。最近，视觉-语言模型（VLMs）因其在多种未见场景中增强端到端驾驶模型的鲁棒性和泛化能力的潜力而受到广泛关注。然而，现有的基于VLM的方法仍然面临挑战，包括车道感知不足、语言理解偏差以及难以处理边缘情况。为了解决这些问题，我们提出了一种先进的感知与规划增强的VLM模型AppleVLM，以提高端到端驾驶的鲁棒性。AppleVLM引入了一种新颖的视觉编码器和规划策略编码器，以改善感知和决策。首先，视觉编码器使用可变形Transformer（deformable transformer）机制融合多视图图像在多个时间步的时空信息，增强对相机变化的鲁棒性，并促进在不同车辆平台上的可扩展部署。其次，与传统的基于VLM的方法不同，AppleVLM引入了一种专门的规划模态，编码显式的鸟瞰图空间信息，减轻导航指令中的语言偏差。最后，通过层次链式思维微调的VLM解码器整合视觉、语言和规划特征，输出鲁棒的驾驶航点。我们在两个CARLA基准上的闭环实验中评估了AppleVLM，实现了最先进的驾驶性能。此外，我们在AGV平台上部署了AppleVLM，并成功展示了在复杂户外环境中的真实端到端自主驾驶。

Summary / 总结

AppleVLM is an advanced VLM model designed to enhance end-to-end autonomous driving by improving perception and decision-making. It introduces a vision encoder that fuses spatial-temporal information from multi-view images and a planning strategy encoder that encodes explicit spatial information, addressing issues like suboptimal lane perception and language biases. AppleVLM achieves state-of-the-art driving performance in closed-loop experiments on CARLA benchmarks and demonstrates successful real-world autonomous driving on an AGV platform.

AppleVLM 是一种先进的 VLM 模型，旨在通过提升感知和决策能力来增强端到端自动驾驶。它引入了一个融合多视图图像时空信息的视觉编码器和一个编码显式空间信息的规划策略编码器，解决了车道感知和语言偏见等问题。在 CARLA 基准测试和 AGV 平台上的实际部署中，AppleVLM 展现了其在多种场景下的优越性能和鲁棒性。

STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Authors: Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides

Venue: AAAI 2026

First: 2025-08-12T07:27:50+00:00 · Latest: 2026-02-04T06:14:03+00:00

Comments: This paper has been accepted at AAAI 2026. This is the author's extended version. The final version will appear in the official proceedings

Abstract

Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks.

中文标题/摘要

标题：STELAR-VISION：自拓扑意识高效学习的视觉语言对齐推理

视觉语言模型（VLMs）在推理方面取得了显著进展，但在处理复杂的多模态任务时常常遇到困难，并倾向于生成过于冗长的输出。一个关键限制是它们依赖于链式推理（CoT），尽管许多任务可以从树状或图状等替代拓扑结构中受益。为了解决这个问题，我们提出了STELAR-Vision，这是一种拓扑意识推理的训练框架。其核心是TopoAug，这是一种合成数据管道，能够丰富训练数据，使其包含多种拓扑结构。通过监督微调和强化学习，我们对Qwen2VL模型进行了后训练，兼顾准确性和效率。此外，我们还提出了节俭学习，该方法在几乎不损失准确性的前提下减少了输出长度。在MATH-V和VLM-S2H上，STELAR-Vision的准确率比其基础模型提高了9.7%，并且比更大的Qwen2VL-72B-Instruct高出7.3%。在五个离分布基准上，它比Phi-4-Multimodal-Instruct最多高出28.4%，比LLaMA-3.2-11B-Vision-Instruct最多高出13.2%，展示了强大的泛化能力。与仅链式训练相比，我们的方法在分布内数据集上的总体准确率提高了4.3%，并且在所有离分布基准上均持续优于仅链式训练。

Summary / 总结

STELAR-Vision is a training framework that addresses the limitations of vision-language models in handling complex multimodal tasks and generating verbose outputs. It introduces TopoAug, a synthetic data pipeline that enriches training with diverse topological structures, and Frugal Learning to reduce output length. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. It also outperforms Phi-4-Multimodal-Instruct and LLaMA-3.2-11B-Vision-Instruct on five out-of-distribution benchmarks, demonstrating strong generalization.

STELAR-Vision 是一种训练框架，旨在解决视觉语言模型（VLMs）在处理复杂多模态任务和生成冗长输出方面的局限性。它引入了 TopoAug，这是一种合成数据管道，用于在训练中丰富多样化的拓扑结构，并提出了节俭学习以减少输出长度。在 MATH-V 和 VLM-S2H 上，STELAR-Vision 的准确率比其基础模型提高了 9.7%，并比更大的 Qwen2VL-72B-Instruct 高出 7.3%；在五个离分布基准测试中，它优于 Phi-4-Multimodal-Instruct 和 LLaMA-3.2-11B-Vision-Instruct，展示了强大的泛化能力。

Same or Not? Enhancing Visual Perception in Vision-Language Models

Authors: Damiano Marsili, Aditya Mehta, Ryan Y. Lin, Georgia Gkioxari

First: 2025-12-29T16:43:47+00:00 · Latest: 2026-02-04T04:03:31+00:00

Comments: Project webpage: https://glab-caltech.github.io/twin/

Abstract

Vision-language models (VLMs) excel at broad visual understanding but remain coarse-grained, exhibit visual biases, and miss subtle visual details. Existing training corpora reinforce this limitation by emphasizing general recognition ("Is it a cat or a dog?") over fine-grained perception. To address this, we introduce a new training corpus and task designed to enhance the perceptual abilities of VLMs. TWIN is a large-scale dataset of 561,000 image-pair queries that task models to determine whether two visually similar images depict the same object, encouraging attention to nuanced visual cues. The dataset spans a diverse range of everyday objects across contexts, viewpoints, and appearances. Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks. To quantify these gains, we introduce FGVQA, a benchmark suite of 12,000 queries that repurposes fine-grained recognition and retrieval datasets from multiple domains. While existing VLMs struggle on FGVQA, when fine-tuned on TWIN they improve by up to 19.3%, without compromising performance on general VQA benchmarks. Finally, our TWIN dataset scales favorably with object annotations, and our analysis shows that scale is key to performance. We envision TWIN as a drop-in addition to open-source VLM training corpora, advancing perceptual precision of future models. Project webpage: https://glab-caltech.github.io/twin/

中文标题/摘要

标题：同或不同？提升视觉语言模型的视觉感知能力

视觉语言模型（VLMs）在广泛的视觉理解方面表现出色，但仍然较为粗略，存在视觉偏见，并且忽略了一些细微的视觉细节。现有的训练语料库通过强调一般识别（“是猫还是狗？”）而不是精细的感知，强化了这一局限性。为了解决这一问题，我们引入了一个新的训练语料库和任务，旨在增强VLMs的感知能力。TWIN是一个包含561,000个图像对查询的大规模数据集，要求模型判断两个视觉相似的图像是否描绘同一个物体，鼓励关注细微的视觉线索。该数据集涵盖了各种日常物体在不同上下文、视角和外观下的广泛范围。在TWIN上微调VLMs在精细识别方面取得了显著进步，即使在未见过的领域如艺术、动物、植物和地标也是如此。为了量化这些进步，我们引入了FGVQA基准测试套件，包含12,000个查询，重新利用了多个领域中的精细识别和检索数据集。虽然现有的VLMs在FGVQA上表现不佳，但在TWIN上微调后，性能提高了高达19.3%，而不会影响通用VQA基准测试的表现。最后，我们的TWIN数据集在对象注释方面具有可扩展性，我们的分析表明，规模是性能的关键。我们设想TWIN可以作为开源VLM训练语料库的即插即用添加，推动未来模型感知精度的提升。项目网页：https://glab-caltech.github.io/twin/

Summary / 总结

The paper introduces TWIN, a new dataset of 561,000 image-pair queries designed to enhance the fine-grained perception of vision-language models (VLMs). Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even in unseen domains such as art, animals, plants, and landmarks. On FGVQA, a new benchmark suite of 12,000 queries, VLMs fine-tuned on TWIN improve by up to 19.3% without sacrificing performance on general VQA benchmarks. The authors' analysis indicates that dataset scale is key to performance, and TWIN is proposed as a drop-in addition to open-source VLM training corpora to improve perceptual precision.

研究旨在通过引入一个新的训练数据集TWIN来提升视觉语言模型（VLMs）的细粒度视觉感知能力，TWIN包含561,000对图像查询，要求模型判断两张相似图像是否描绘同一个物体。在TWIN上微调VLMs可以显著提高它们在未见过的领域如艺术、动物、植物和地标中的细粒度识别能力，FGVQA基准测试集上的改进幅度可达19.3%，且不会影响通用VQA性能。研究还表明，规模对于性能至关重要，TWIN可以轻松集成到现有的VLM训练数据集中，以提升感知精度。

MAMBO-G: Magnitude-Aware Mitigation for Boosted Guidance

Authors: Shangwen Zhu, Qianyu Peng, Zhilei Shu, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Andy Zheng, Xinyu Cui, Jian Zhao, Ruili Feng, Fan Cheng

First: 2025-08-05T13:41:05+00:00 · Latest: 2026-02-04T03:53:17+00:00

Abstract

High-fidelity text-to-image and text-to-video generation typically relies on Classifier-Free Guidance (CFG), but achieving optimal results often demands computationally expensive sampling schedules. In this work, we propose MAMBO-G, a training-free acceleration framework that significantly reduces computational cost by dynamically optimizing guidance magnitudes. We observe that standard CFG schedules are inefficient, applying disproportionately large updates in early steps that hinder convergence speed. MAMBO-G mitigates this by modulating the guidance scale based on the update-to-prediction magnitude ratio, effectively stabilizing the trajectory and enabling rapid convergence. This efficiency is particularly vital for resource-intensive tasks like video generation. Our method serves as a universal plug-and-play accelerator, achieving up to 3x speedup on Stable Diffusion v3.5 (SD3.5) and 4x on Lumina. Most notably, MAMBO-G accelerates the 14B-parameter Wan2.1 video model by 2x while preserving visual fidelity, offering a practical solution for efficient large-scale video synthesis. Our implementation follows a mainstream open-source diffusion framework and is plug-and-play with existing pipelines.
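The abstract describes modulating the guidance scale by the update-to-prediction magnitude ratio but does not give the formula. The sketch below is only an illustration of that idea under stated assumptions. It uses the standard Classifier-Free Guidance combination `eps_uncond + scale * (eps_cond - eps_uncond)`. It takes the ratio to be the norm of the guidance update divided by the norm of the conditional prediction. The exponential damping rule and the `alpha` parameter are hypothetical choices, not the authors' method.

```python
import math


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def magnitude_aware_cfg(eps_uncond, eps_cond, scale, alpha=1.0, tiny=1e-12):
    """Combine unconditional and conditional predictions with a damped scale.

    eps_uncond, eps_cond: flat lists holding the two model predictions.
    scale: the nominal CFG guidance scale.
    alpha: hypothetical damping strength (an assumption of this sketch).
    Returns (guided_prediction, effective_scale, ratio).
    """
    # Guidance update direction used by standard CFG.
    delta = [c - u for c, u in zip(eps_cond, eps_uncond)]
    # Update-to-prediction magnitude ratio (assumed definition).
    ratio = l2_norm(delta) / (l2_norm(eps_cond) + tiny)
    # Hypothetical rule: shrink the scale toward 1 when the update is
    # large relative to the prediction, keep it unchanged when ratio is 0.
    effective_scale = 1.0 + (scale - 1.0) * math.exp(-alpha * ratio)
    guided = [u + effective_scale * d for u, d in zip(eps_uncond, delta)]
    return guided, effective_scale, ratio
```

When the two predictions agree, the ratio is zero and the nominal scale is used unchanged. When the update is large relative to the prediction, as the abstract says happens in early steps, the effective scale falls between 1 and the nominal value.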

中文标题/摘要

标题：MAMBO-G：基于幅度感知的增强引导抑制

高保真文本到图像和文本到视频生成通常依赖于无分类器引导（CFG），但获得最佳结果往往需要计算成本高昂的采样计划。在本文中，我们提出了一种名为MAMBO-G的无需训练的加速框架，通过动态优化引导幅度显著降低计算成本。我们观察到，标准的CFG计划效率低下，在早期步骤中应用了不成比例的大量更新，阻碍了收敛速度。MAMBO-G通过根据更新到预测幅度比例调节引导尺度，有效稳定轨迹并实现快速收敛。这种效率对于像视频生成这样资源密集的任务尤为重要。我们的方法作为一种通用即插即用加速器，在Stable Diffusion v3.5 (SD3.5) 上实现了最高3倍的加速，在Lumina上实现了4倍的加速。最值得注意的是，MAMBO-G将14B参数的Wan2.1视频模型加速了2倍，同时保持了视觉保真度，提供了一种高效的大型视频合成的实用解决方案。我们的实现遵循主流的开源扩散框架，并且可以与现有的管道无缝集成。

Summary / 总结

MAMBO-G is a training-free acceleration framework that optimizes guidance magnitudes to reduce computational cost in text-to-image and text-to-video generation. It dynamically adjusts the guidance scale based on the update-to-prediction magnitude ratio, stabilizing the trajectory and enabling faster convergence. MAMBO-G achieves up to 3x speedup on Stable Diffusion v3.5 and 4x on Lumina, and accelerates the 14B-parameter Wan2.1 video model by 2x while preserving visual fidelity.

MAMBO-G 是一个无需训练的框架，通过动态调整指导强度来减少文本到图像和文本到视频生成的计算成本。它基于更新与预测幅度比调整指导尺度，稳定轨迹并加速收敛。MAMBO-G 在 Stable Diffusion v3.5 上实现最高 3 倍加速，在 Lumina 上实现 4 倍加速，并将 14B 参数的 Wan2.1 视频模型加速 2 倍同时保持视觉保真度。

Invariance on Manifolds: Understanding Robust Visual Representations for Place Recognition

Authors: Jintao Cheng, Weibin Li, Zhijian He, Jin Wu, Chi Man Vong, Wei Zhang

First: 2026-01-31T18:12:29+00:00 · Latest: 2026-02-04T03:38:55+00:00

Comments: 14 pages, 5 figures

Abstract

Visual Place Recognition (VPR) demands representations robust to drastic environmental and viewpoint shifts. Current aggregation paradigms, however, either rely on data-hungry supervision or simplistic first-order statistics, often neglecting intrinsic structural correlations. In this work, we propose a Second-Order Geometric Statistics framework that inherently captures geometric stability without training. We conceptualize scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold, where perturbations manifest as tractable congruence transformations. By leveraging geometry-aware Riemannian mappings, we project these descriptors into a linearized Euclidean embedding, effectively decoupling signal structure from noise. Our approach introduces a training-free framework built upon fixed, pre-trained backbones, achieving strong zero-shot generalization without parameter updates. Extensive experiments confirm that our method achieves highly competitive performance against state-of-the-art baselines, particularly excelling in challenging zero-shot scenarios.
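The second-order pipeline described above can be written out as a short worked derivation. The notation here is assumed for illustration: $f_i$ are $N$ local features from the frozen backbone, $\epsilon I$ is a small regularizer that keeps the covariance positive definite, and the log-Euclidean map stands in for the "geometry-aware Riemannian mappings", which the abstract does not name.

```latex
\begin{align*}
% Covariance descriptor of a scene (a point on the SPD manifold)
\mu &= \frac{1}{N}\sum_{i=1}^{N} f_i, &
C &= \frac{1}{N-1}\sum_{i=1}^{N} (f_i-\mu)(f_i-\mu)^{\top} + \epsilon I \\
% A linear feature perturbation f_i -> A f_i acts on C by congruence
f_i &\mapsto A f_i &
\Longrightarrow\quad C &\mapsto A\,C\,A^{\top} \quad (\text{for } \epsilon = 0) \\
% Log-Euclidean map: matrix logarithm via the eigendecomposition
C &= U\,\mathrm{diag}(\lambda_1,\dots,\lambda_d)\,U^{\top} &
\log C &= U\,\mathrm{diag}(\log\lambda_1,\dots,\log\lambda_d)\,U^{\top} \\
% Euclidean embedding: upper triangle, off-diagonals scaled by sqrt(2)
v(C) &= \bigl[(\log C)_{11},\ \sqrt{2}\,(\log C)_{12},\ \dots,\ (\log C)_{dd}\bigr] &
\lVert v(C)\rVert_2 &= \lVert \log C\rVert_F
\end{align*}
```

The $\sqrt{2}$ scaling of the off-diagonal entries makes the Euclidean distance between two embeddings equal to the Frobenius distance between the corresponding matrix logarithms. This is what allows descriptors to be compared with ordinary nearest-neighbour search.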

中文标题/摘要

标题：流形上的不变性：理解用于地点识别的稳健视觉表示

视觉地点识别（VPR）要求表示对环境和视角的剧烈变化具有鲁棒性。然而，当前的聚合范式要么依赖于需要大量数据的监督，要么仅使用过于简单的一阶统计量，往往忽略了内在的结构相关性。在本文中，我们提出了一种二阶几何统计框架，该框架能够内在地捕捉几何稳定性，无需训练。我们将场景概念化为对称正定（SPD）流形上的协方差描述符，其中扰动表现为可处理的合同变换。通过利用几何感知的黎曼映射，我们将这些描述符投影到线性化的欧几里得嵌入中，有效地解耦信号结构和噪声。我们的方法建立在固定且预训练的骨干之上，无需参数更新即可实现强大的零样本泛化。广泛的实验表明，我们的方法相对于最先进的基线方法具有很强的竞争力，在具有挑战性的零样本场景中尤为突出。

Summary / 总结

This work addresses the challenge of robust visual place recognition (VPR) by proposing a Second-Order Geometric Statistics framework that captures geometric stability without requiring training. The method conceptualizes scenes as covariance descriptors on the Symmetric Positive Definite (SPD) manifold and projects them into a linearized Euclidean embedding using geometry-aware Riemannian mappings. Experiments show that this approach achieves strong zero-shot generalization and competitive performance compared to state-of-the-art methods, especially in challenging scenarios.

该研究旨在解决在剧烈环境和视角变化下鲁棒的视觉地点识别（VPR）问题。提出了一种二阶几何统计框架，无需训练即可捕捉几何稳定性。通过将场景概念化为对称正定（SPD）流形上的协方差描述符，并使用几何感知的黎曼映射，该方法将这些描述符投影到线性化的欧几里得嵌入中，从而分离信号结构和噪声。实验表明，该方法具有强大的零样本泛化能力，并取得了与最先进的基线方法相当的性能，在具有挑战性的场景中尤为突出。

WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models

Authors: Zijin Yang, Yu Sun, Kejiang Chen, Jiawei Zhao, Jun Jiang, Weiming Zhang, Nenghai Yu

First: 2026-01-29T12:14:32+00:00 · Latest: 2026-02-04T03:23:10+00:00

Abstract

Digital watermarking is essential for securing generated images from diffusion models. Accurate watermark evaluation is critical for algorithm development, yet existing methods have significant limitations: they lack a unified framework for both residual and semantic watermarks, provide results without interpretability, neglect comprehensive security considerations, and often use inappropriate metrics for semantic watermarks. To address these gaps, we propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models (VLMs). We redefine quality and security metrics for each watermark type: residual watermarks are evaluated by artifact strength and erasure resistance, while semantic watermarks are assessed through latent distribution shifts. Moreover, we introduce a three-stage training strategy to progressively enable the model to achieve classification, scoring, and interpretable text generation. Experiments show WMVLM outperforms state-of-the-art VLMs with strong generalization across datasets, diffusion models, and watermarking methods.

中文标题/摘要

标题：WMVLM：通过视觉语言模型评估扩散模型图像水印

数字水印对于保护来自扩散模型的生成图像的安全至关重要。准确的水印评估对于算法开发至关重要，但现有方法存在显著局限性：缺乏统一框架处理残差和语义水印，结果缺乏可解释性，忽视了全面的安全考虑，并且经常使用不合适的语义水印度量标准。为解决这些差距，我们提出了WMVLM，这是首个通过视觉语言模型（VLMs）统一且可解释的扩散模型图像水印评估框架。我们重新定义了每种水印类型的质量和安全性度量标准：残差水印通过伪影强度和擦除抗性进行评估，而语义水印则通过潜在分布偏移进行评估。此外，我们引入了三阶段训练策略，逐步使模型实现分类、评分和可解释的文本生成。实验表明，WMVLM优于最先进的VLMs，并在不同数据集、扩散模型和水印方法上具有强大的泛化能力。

Summary / 总结

The research aims to improve the evaluation of digital watermarks in images generated by diffusion models. WMVLM, a unified and interpretable framework, uses vision-language models to evaluate residual watermarks by artifact strength and erasure resistance, and semantic watermarks by latent distribution shifts. A three-stage training strategy progressively enables classification, scoring, and interpretable text generation. Experiments show that WMVLM outperforms state-of-the-art VLMs and generalizes well across datasets, diffusion models, and watermarking methods.

研究旨在利用视觉语言模型开发一种统一且可解释的扩散模型图像水印评估框架。方法引入了三阶段训练策略，并重新定义了残差水印和语义水印的质量和安全性指标。关键发现表明，WMVLM在各种数据集、扩散模型和水印技术上表现出强大的泛化能力，优于现有方法。

UniVRSE: Unified Vision-conditioned Response Semantic Entropy for Hallucination Detection in Medical Vision-Language Models

Authors: Zehui Liao, Shishuai Hu, Ke Zou, Mengyuan Jin, Yanning Zhang, Huazhu Fu, Liangli Zhen, Yong Xia

First: 2025-03-26T12:45:34+00:00 · Latest: 2026-02-04T03:14:16+00:00

Comments: Under Review. 12 pages, 2 figures

Abstract

Vision-language models (VLMs) have great potential for medical image understanding, particularly in Visual Report Generation (VRG) and Visual Question Answering (VQA), but they may generate hallucinated responses that contradict visual evidence, limiting clinical deployment. Although uncertainty-based hallucination detection methods are intuitive and effective, they are limited in medical VLMs. Specifically, Semantic Entropy (SE), effective in text-only LLMs, becomes less reliable in medical VLMs due to their overconfidence from strong language priors. To address this challenge, we propose UniVRSE, a Unified Vision-conditioned Response Semantic Entropy framework for hallucination detection in medical VLMs. UniVRSE strengthens visual guidance during uncertainty estimation by contrasting the semantic predictive distributions derived from an original image-text pair and a visually distorted counterpart, with higher entropy indicating hallucination risk. For VQA, UniVRSE works on the image-question pair, while for VRG, it decomposes the report into claims, generates verification questions, and applies vision-conditioned entropy estimation at the claim level. To evaluate hallucination detection, we propose a unified pipeline that generates responses on medical datasets and derives hallucination labels via factual consistency assessment. However, current evaluation methods rely on subjective criteria or modality-specific rules. To improve reliability, we introduce Alignment Ratio of Atomic Facts (ALFA), a novel method that quantifies fine-grained factual consistency. ALFA-derived labels provide ground truth for robust benchmarking. Experiments on six medical VQA/VRG datasets and three VLMs show UniVRSE significantly outperforms existing methods with strong cross-modal generalization.
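As a rough illustration of the quantity involved, the sketch below computes a discrete semantic entropy from sampled responses that have already been grouped into meaning-equivalent clusters. The cluster labels, the example data, and the use of simple frequency counts are assumptions. The abstract does not state how the distributions from the original and visually distorted inputs are contrasted, so the sketch only reports the two entropies side by side and does not combine them.

```python
import math
from collections import Counter


def semantic_entropy(cluster_ids):
    """Entropy (in nats) of the empirical distribution over semantic clusters.

    cluster_ids: one label per sampled response; responses that share a
    label are assumed to have been judged equivalent in meaning elsewhere.
    """
    total = len(cluster_ids)
    counts = Counter(cluster_ids)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical cluster labels for eight sampled answers to one question.
original_samples = ["effusion"] * 7 + ["normal"]
distorted_samples = ["effusion"] * 3 + ["normal"] * 3 + ["mass"] * 2

entropy_original = semantic_entropy(original_samples)
entropy_distorted = semantic_entropy(distorted_samples)
```

In this toy data the answers under the original image concentrate on one meaning, while the answers under the distorted image spread over three. The second entropy is therefore larger.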

中文标题/摘要

标题：UniVRSE：统一的视觉条件响应语义熵框架在医疗视觉语言模型中的幻觉检测

视觉语言模型（VLMs）在医学图像理解方面具有巨大潜力，特别是在视觉报告生成（VRG）和视觉问答（VQA）方面，但它们可能会生成与视觉证据相矛盾的幻觉响应，限制了临床应用。尽管基于不确定性的幻觉检测方法直观且有效，但在医疗VLMs中却受到限制。具体来说，在纯文本LLM中有效的语义熵（SE），在医疗VLMs中会因强语言先验导致的过度自信而变得不够可靠。为了解决这一挑战，我们提出了一种名为UniVRSE的统一的视觉条件响应语义熵框架，用于医疗VLMs中的幻觉检测。UniVRSE通过对比原始图像-文本对和视觉失真版本中获得的语义预测分布来增强不确定性估计过程中的视觉指导，熵值越高表示幻觉风险越大。对于VQA，UniVRSE在图像-问题对上工作；对于VRG，它将报告分解为声明，生成验证问题，并在声明级别应用视觉条件熵估计。为了评估幻觉检测，我们提出了一种统一的管道，该管道在医学数据集上生成响应，并通过事实一致性评估获得幻觉标签。然而，当前的评估方法依赖于主观标准或特定模态的规则。为了提高可靠性，我们引入了原子事实对齐率（ALFA）这一新颖方法，用于量化细粒度的事实一致性。ALFA衍生的标签为稳健基准测试提供了真实基准。在六个医学VQA/VRG数据集和三个VLMs上的实验表明，UniVRSE显著优于现有方法，并具有较强的跨模态泛化能力。

Summary / 总结

The paper introduces UniVRSE, a framework for hallucination detection in medical vision-language models. It addresses the issue of overconfidence in medical VLMs by using a vision-conditioned response semantic entropy method. UniVRSE works by contrasting semantic predictive distributions from original and visually distorted image-text pairs, with higher entropy indicating higher hallucination risk. The method is evaluated using a unified pipeline and a novel Alignment Ratio of Atomic Facts (ALFA) metric, showing superior performance across six medical VQA/VRG datasets and three VLMs compared to existing methods.

UniVRSE 是一个用于检测医疗视觉语言模型中幻觉的框架，通过带有视觉指导的语义熵估计来工作。它通过对比原始和视觉扭曲后的图像-文本对的语义预测分布来运作，熵值越高表示幻觉风险越高。实验结果显示，UniVRSE 在六个医疗数据集上的表现优于现有方法，并具有良好的跨模态泛化能力。

Finding Optimal Video Moment without Training: Gaussian Boundary Optimization for Weakly Supervised Video Grounding

Authors: Sunoh Kim, Kimin Yun, Daeho Um

First: 2026-02-03T04:01:12+00:00 · Latest: 2026-02-04T02:47:33+00:00

Comments: Accepted in IEEE TMM

Abstract

Weakly supervised temporal video grounding aims to localize query-relevant segments in untrimmed videos using only video-sentence pairs, without requiring ground-truth segment annotations that specify exact temporal boundaries. Recent approaches tackle this task by utilizing Gaussian-based temporal proposals to represent query-relevant segments. However, their inference strategies rely on heuristic mappings from Gaussian parameters to segment boundaries, resulting in suboptimal localization performance. To address this issue, we propose Gaussian Boundary Optimization (GBO), a novel inference framework that predicts segment boundaries by solving a principled optimization problem that balances proposal coverage and segment compactness. We derive a closed-form solution for this problem and rigorously analyze the optimality conditions under varying penalty regimes. Beyond its theoretical foundations, GBO offers several practical advantages: it is training-free and compatible with both single-Gaussian and mixture-based proposal architectures. Our experiments show that GBO significantly improves localization, achieving state-of-the-art results across standard benchmarks. Extensive experiments demonstrate the efficiency and generalizability of GBO across various proposal schemes. The code is available at https://github.com/sunoh-kim/gbo.
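To make the coverage-versus-compactness trade-off concrete, here is a minimal sketch under an assumed objective. The abstract does not state the paper's actual formulation or closed form. Assume a unit-peak Gaussian proposal g(t) = exp(-(t - mu)^2 / (2 sigma^2)) on normalized time [0, 1]. The sketch maximizes the integral of g over [start, end] minus penalty * (end - start). Setting the derivative with respect to each boundary to zero gives g(boundary) = penalty, so the boundaries are mu ± sigma * sqrt(2 ln(1 / penalty)) for 0 < penalty < 1. The function name and the `penalty` parameter are hypothetical.

```python
import math


def gaussian_boundaries(mu, sigma, penalty):
    """Closed-form segment boundaries for the assumed objective.

    mu, sigma: centre and width of the Gaussian proposal, with mu in [0, 1].
    penalty: per-unit-length cost of the segment (hypothetical parameter).
    Returns (start, end) clipped to the normalized video span [0, 1].
    """
    if penalty <= 0.0:
        # No cost for length: covering the whole video is optimal.
        return 0.0, 1.0
    if penalty >= 1.0:
        # The proposal never exceeds the cost: the segment collapses.
        return mu, mu
    # Boundaries lie where the Gaussian height equals the penalty.
    half_width = sigma * math.sqrt(2.0 * math.log(1.0 / penalty))
    return max(0.0, mu - half_width), min(1.0, mu + half_width)
```

With `penalty = exp(-0.5)` the half-width equals `sigma`, so a proposal with `mu = 0.5` and `sigma = 0.1` gives a segment of approximately (0.4, 0.6). Smaller penalties widen the segment and larger ones shrink it toward the centre.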

中文标题/摘要

标题：无需训练找到最优视频时刻：基于高斯边界的优化弱监督视频定位

弱监督时间视频定位旨在仅使用视频-句子对在未剪辑视频中定位查询相关的片段，而无需要求精确时间边界的地面真实片段注释。最近的方法通过使用基于高斯的时间提案来表示查询相关的片段来解决此任务。然而，它们的推理策略依赖于从高斯参数到片段边界的启发式映射，导致定位性能不佳。为了解决这一问题，我们提出了高斯边界优化（GBO），这是一种新颖的推理框架，通过求解平衡提案覆盖和片段紧凑性的原则性优化问题来预测片段边界。我们为该问题推导出闭式解，并在不同的惩罚制度下严格分析了最优条件。除了其理论基础外，GBO 还具有几个实际优势：它是无需训练的，并且与单高斯和混合提案架构兼容。我们的实验表明，GBO 显著提高了定位性能，实现了标准基准上的最新成果。广泛的实验表明，GBO 在各种提案方案中具有高效性和泛化性。代码可在 https://github.com/sunoh-kim/gbo/ 获取。

Summary / 总结

The paper addresses the challenge of weakly supervised temporal video grounding, where the goal is to localize query-relevant segments in untrimmed videos using video-sentence pairs. It proposes Gaussian Boundary Optimization (GBO), an inference framework that optimizes segment boundaries by solving a principled optimization problem, leading to improved localization performance compared to existing methods. Experiments show that GBO achieves state-of-the-art results on standard benchmarks and is efficient and generalizable across different proposal schemes.

论文提出了一种高斯边界优化（GBO）方法，通过求解一个兼顾提案覆盖和片段紧凑性的优化问题来预测片段边界。该方法无需训练且兼容多种提案架构，显著提高了定位性能，并在标准基准上达到了最先进的结果。

MA3DSG: Multi-Agent 3D Scene Graph Generation for Large-Scale Indoor Environments

Authors: Yirum Kim, Jaewoo Kim, Ue-Hwan Kim

First: 2026-02-04T02:39:57+00:00 · Latest: 2026-02-04T02:39:57+00:00

Abstract

Current 3D scene graph generation (3DSGG) approaches heavily rely on a single-agent assumption and small-scale environments, exhibiting limited scalability to real-world scenarios. In this work, we introduce the Multi-Agent 3D Scene Graph Generation (MA3DSG) model, the first framework designed to tackle this scalability challenge using multiple agents. We develop a training-free graph alignment algorithm that efficiently merges partial query graphs from individual agents into a unified global scene graph. Leveraging extensive analysis and empirical insights, our approach enables conventional single-agent systems to operate collaboratively without requiring any learnable parameters. To rigorously evaluate 3DSGG performance, we propose MA3DSG-Bench, a benchmark that supports diverse agent configurations, domain sizes, and environmental conditions, providing a more general and extensible evaluation framework. This work lays a solid foundation for scalable, multi-agent 3DSGG research.
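The abstract does not describe how the alignment is computed, so the following is only a toy sketch of what merging two partial scene graphs without learnable parameters could look like. It assumes each node carries a class label and a 3D centroid in a shared coordinate frame. It treats two nodes as the same object when the labels match and the centroids lie within a distance threshold. The data layout, the `max_dist` threshold, and the greedy matching are all hypothetical.

```python
import math


def merge_scene_graphs(graph_a, graph_b, max_dist=0.5):
    """Greedily merge graph_b into graph_a.

    Each graph is {"nodes": {node_id: (label, (x, y, z))},
                   "edges": {(src_id, relation, dst_id), ...}}.
    """
    merged_nodes = dict(graph_a["nodes"])
    remap = {}  # node id in graph_b -> node id in the merged graph
    for b_id, (label, position) in graph_b["nodes"].items():
        match = None
        for a_id, (a_label, a_position) in graph_a["nodes"].items():
            # Same class and nearby centroid: assume it is the same object.
            if a_label == label and math.dist(position, a_position) <= max_dist:
                match = a_id
                break
        if match is None:
            # Unmatched node: keep it under a prefixed id.
            match = f"b:{b_id}"
            merged_nodes[match] = (label, position)
        remap[b_id] = match
    merged_edges = set(graph_a["edges"])
    for src, relation, dst in graph_b["edges"]:
        # Rewrite graph_b edges so they point at merged node ids.
        merged_edges.add((remap[src], relation, remap[dst]))
    return {"nodes": merged_nodes, "edges": merged_edges}
```

For example, if one agent sees a chair next to a table and another sees a lamp standing on the same table, the merged graph keeps a single table node connected to both the chair and the lamp.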

中文标题/摘要

标题：MA3DSG：多智能体3D场景图生成用于大规模室内环境

当前的3D场景图生成（3DSGG）方法严重依赖单智能体假设和小规模环境，难以扩展到真实世界场景。本文引入了多智能体3D场景图生成（MA3DSG）模型，这是第一个使用多个智能体解决这一扩展性挑战的框架。我们开发了一种无需训练的图对齐算法，该算法能够高效地将各个智能体的局部查询图合并为一个统一的全局场景图。借助广泛的分析和经验洞察，我们的方法使传统的单智能体系统能够在无需任何可学习参数的情况下协同工作。为了严格评估3DSGG性能，我们提出了MA3DSG-Bench基准，该基准支持多种智能体配置、领域大小和环境条件，提供了一个更通用和可扩展的评估框架。本工作为可扩展的多智能体3DSGG研究奠定了坚实的基础。

Summary / 总结

The research addresses the scalability limitations of current 3D scene graph generation methods, which are typically designed for small-scale environments and single agents. The Multi-Agent 3D Scene Graph Generation (MA3DSG) model is introduced to handle large-scale indoor environments by using multiple agents. It includes a training-free graph alignment algorithm that merges partial query graphs from individual agents into a unified global scene graph. The approach allows conventional single-agent systems to operate collaboratively without additional learnable parameters. The MA3DSG-Bench benchmark is proposed to evaluate 3DSGG performance under various agent configurations and environmental conditions, providing a more general evaluation framework.

研究针对当前3D场景图生成方法局限于单智能体和小规模环境的问题，提出了多智能体3D场景图生成（MA3DSG）模型以处理大规模室内环境。该模型使用多个智能体，通过无需训练的图对齐算法将各个智能体的局部查询图合并为统一的全局场景图。该方法使传统的单智能体系统能够协作运行，无需任何可学习参数。还提出了MA3DSG-Bench基准，以在多种智能体配置和环境条件下评估3D场景图生成性能，提供了一个更通用和可扩展的评估框架。
