Post-hoc Probabilistic Vision-Language Models
Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp
Venue: ICLR 2026
First: 2024-12-08T18:16:13+00:00 · Latest: 2026-02-13T16:49:09+00:00
Comments: Published at ICLR 2026. Project page: https://aaltoml.github.io/BayesVLM/
Abstract
Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.
中文标题/摘要
标题:事后概率视觉-语言模型
视觉-语言模型(VLMs),如CLIP和SigLIP,在分类、检索和生成任务中取得了显著的成功。为此,VLMs将图像和文本描述确定性地映射到一个联合潜在空间,在该空间中使用余弦相似度评估它们的相似性。然而,在下游任务中,对输入的确定性映射无法捕捉由领域偏移引起的概念不确定性。在本文中,我们提出了一种不需要额外训练的VLMs事后不确定性估计方法。该方法利用VLMs最后几层的贝叶斯后验近似,并以解析方式量化余弦相似度的不确定性。我们展示了其在不确定性量化和主动学习支持集选择中的有效性。与基线相比,我们获得了改进且校准良好的预测不确定性、可解释的不确定性估计以及样本高效的主动学习。我们的结果表明,该方法在大规模模型的安全关键应用中具有前景。
Summary / 总结
This paper addresses the limitations of deterministic mappings in vision-language models (VLMs) by proposing a post-hoc, training-free method for uncertainty estimation. The method applies a Bayesian posterior approximation over the last layers of a VLM to analytically quantify uncertainties over cosine similarities in the joint latent space. Experiments show that this approach yields improved and well-calibrated predictive uncertainties, provides interpretable uncertainty estimates, and makes active learning more sample-efficient than baseline methods.
该研究针对视觉-语言模型(VLMs)中确定性映射的局限性,提出了一种无需额外训练的事后不确定性估计方法。该方法利用对VLMs最后几层的贝叶斯后验近似,以解析方式量化余弦相似度的不确定性。实验表明,与基线相比,这种方法能获得改进且校准良好的预测不确定性,提供可解释的不确定性估计,并使主动学习更具样本效率,在大规模模型的安全关键应用中具有前景。
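To make the idea of "uncertainty over a cosine similarity" concrete, here is a minimal sketch. It is not the paper's method: the abstract describes an analytic quantification, whereas this example swaps in plain Monte Carlo sampling. It also assumes, purely for illustration, that each embedding is an independent Gaussian with a diagonal covariance; the function names and parameters are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def sample_embedding(mean, std, rng):
    """Draw one embedding from an assumed diagonal Gaussian."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def cosine_similarity_moments(mean_img, std_img, mean_txt, std_txt,
                              n_samples=2000, seed=0):
    """Monte Carlo estimate of the mean and variance of the cosine
    similarity when both embeddings are uncertain (illustrative only)."""
    rng = random.Random(seed)
    sims = [
        cosine(sample_embedding(mean_img, std_img, rng),
               sample_embedding(mean_txt, std_txt, rng))
        for _ in range(n_samples)
    ]
    mean = sum(sims) / n_samples
    var = sum((s - mean) ** 2 for s in sims) / (n_samples - 1)
    return mean, var


if __name__ == "__main__":
    img, txt = [1.0, 0.2, 0.0], [0.9, 0.3, 0.1]
    for std in (0.01, 0.3):
        m, v = cosine_similarity_moments(img, [std] * 3, txt, [std] * 3)
        print(f"std={std}: mean={m:.3f}, variance={v:.5f}")
```

A wider embedding distribution gives a wider distribution over similarities, which is the signal a downstream task could use to flag inputs affected by domain shift.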
Implicit-Scale 3D Reconstruction for Multi-Food Volume Estimation from Monocular Images
Authors: Yuhao Chen, Gautham Vinod, Siddeshwar Raghavan, Talha Ibn Mahmud, Bruce Coburn, Jinge Ma, Fengqing Zhu, Jiangpeng He
Venue: IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI) 2026
First: 2026-02-13T15:52:39+00:00 · Latest: 2026-02-13T15:52:39+00:00
Comments: Paper accepted to 2026 IEEE Southwest Symposium on Image Analysis and Interpretation. The dataset can be downloaded at: https://www.kaggle.com/competitions/3d-reconstruction-from-monocular-multi-food-images/data
Abstract
We present Implicit-Scale 3D Reconstruction from Monocular Multi-Food Images, a benchmark dataset designed to advance geometry-based food portion estimation in realistic dining scenarios. Existing dietary assessment methods largely rely on single-image analysis or appearance-based inference, including recent vision-language models, which lack explicit geometric reasoning and are sensitive to scale ambiguity. This benchmark reframes food portion estimation as an implicit-scale 3D reconstruction problem under monocular observations. To reflect real-world conditions, explicit physical references and metric annotations are removed; instead, contextual objects such as plates and utensils are provided, requiring algorithms to infer scale from implicit cues and prior knowledge. The dataset emphasizes multi-food scenes with diverse object geometries, frequent occlusions, and complex spatial arrangements. The benchmark was adopted as a challenge at the MetaFood 2025 Workshop, where multiple teams proposed reconstruction-based solutions. Experimental results show that while strong vision-language baselines achieve competitive performance, geometry-based reconstruction methods provide both improved accuracy and greater robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
中文标题/摘要
标题:基于单目多食品图像的隐式尺度3D重建
我们提出了基于单目多食品图像的隐式尺度3D重建基准数据集,旨在推动在现实餐饮场景中基于几何的食品分量估算。现有的膳食评估方法主要依赖于单图像分析或基于外观的推理,包括最近的视觉-语言模型,这些方法缺乏明确的几何推理,且对尺度模糊敏感。该基准将食品分量估算重新定义为在单目观测下的隐式尺度3D重建问题。为了反映现实世界条件,移除了明确的物理参考和度量标注,而是提供了盘子和餐具等上下文对象,要求算法从隐式线索和先验知识中推断尺度。数据集强调了具有多种物体几何形状、频繁遮挡和复杂空间布局的多食品场景。该基准在2025年MetaFood研讨会中被采用,多个团队提出了基于重建的解决方案。实验结果表明,虽然强大的视觉-语言基线在性能上具有竞争力,但基于几何的重建方法在准确性和鲁棒性方面均有所提升,最佳方法在体积估算中的MAPE为0.21,在几何精度中的L1 Chamfer距离为5.7。
Summary / 总结
The research addresses the limitation of existing dietary assessment methods that rely on single-image analysis or appearance-based inference, which lack geometric reasoning and are scale-ambiguous. It introduces a benchmark dataset for implicit-scale 3D reconstruction from monocular multi-food images, emphasizing real-world conditions with diverse object geometries and complex spatial arrangements. Experimental results indicate that geometry-based reconstruction methods outperform vision-language baselines in terms of accuracy and robustness, with the top-performing approach achieving 0.21 MAPE in volume estimation and 5.7 L1 Chamfer Distance in geometric accuracy.
该研究将食物分量估计重新定义为单目观测下的隐式尺度3D重建问题,以克服现有方法缺乏几何推理且对尺度模糊敏感的局限。该基准数据集被用作MetaFood 2025研讨会的挑战赛,包含多食物场景、上下文对象和复杂的空间布局,并去除了显式的物理参考和度量标注。实验结果表明,基于几何的重建方法在准确性和鲁棒性上优于视觉-语言基线,最佳方法在体积估计上的MAPE为0.21,在几何精度上的L1 Chamfer距离为5.7。
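The volume metric quoted above, MAPE (mean absolute percentage error), can be sketched as follows. This assumes the reported 0.21 is expressed as a fraction rather than a percentage, which the abstract does not state explicitly; the function name and the sample volumes are made up for illustration and are not taken from the benchmark.

```python
def mape(true_volumes, predicted_volumes):
    """Mean absolute percentage error, returned as a fraction.

    Each error is normalised by its own ground-truth value, so every
    item counts equally regardless of its absolute size.
    """
    if not true_volumes or len(true_volumes) != len(predicted_volumes):
        raise ValueError("inputs must be non-empty and of equal length")
    relative_errors = [
        abs(pred - true) / abs(true)
        for true, pred in zip(true_volumes, predicted_volumes)
    ]
    return sum(relative_errors) / len(relative_errors)


if __name__ == "__main__":
    # Hypothetical volumes in millilitres for three food items.
    ground_truth = [100.0, 200.0, 50.0]
    estimates = [120.0, 150.0, 50.0]
    print(f"MAPE = {mape(ground_truth, estimates):.2f}")
```

In this toy case the per-item relative errors are 0.20, 0.25 and 0.00, so the score is about 0.15.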
Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding
Authors: Wenhui Liao, Hongliang Li, Pengyu Xie, Xinyu Cai, Yufan Shen, Yi Xin, Qi Qin, Shenglong Ye, Tianbin Li, Ming Hu, Junjun He, Yihao Liu, Wenhai Wang, Min Dou, Bin Fu, Botian Shi, Yu Qiao, Lianwen Jin
First: 2026-02-13T14:22:10+00:00 · Latest: 2026-02-13T14:22:10+00:00
Comments: Preliminary version of an ongoing project; the paper will be refined and extended in subsequent revisions
Abstract
Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, VLM-based end-to-end approaches have emerged as the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, as they must auto-regressively generate long token sequences when processing long-form documents. In this work, motivated by the extremely long outputs and complex layout structures commonly found in document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. Moreover, we further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region using the same draft-verify strategy. The final predictions are then assembled according to the natural reading order. Experimental results demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method provides a 2.42x lossless acceleration for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
中文标题/摘要
标题:文档解析视觉-语言模型的无训练加速方法:基于分层推测解码
文档解析是多模态理解中的基本任务,支持信息提取和智能文档分析等多种下游应用。得益于强大的语义建模和稳健的泛化能力,基于VLM的端到端方法近年来成为主流范式。然而,这些模型在处理长文档时往往遭受显著的推理延迟,因为它们必须自回归地生成长的标记序列。受文档解析中常见的极长输出和复杂布局结构的启发,我们在本文中提出了一种无训练且高效的加速方法。受推测解码的启发,我们采用一个轻量级的文档解析流水线作为草稿模型来预测未来的标记批次,而更准确的VLM则并行验证这些草稿预测。此外,我们进一步利用文档的布局结构特性,将每页划分为独立的区域,使用相同的草稿验证策略并行解码每个区域。最终预测结果根据自然阅读顺序组装。实验结果表明了我们方法的有效性:在通用的OmniDocBench上,我们的方法为dots.ocr模型提供了2.42倍无损加速,并在长文档解析任务上实现了高达4.89倍的加速。我们将会发布代码以促进可再现性和未来研究。
Summary / 总结
This work addresses the inference latency issue in VLM-based document parsing models by proposing a training-free acceleration method. The method uses a lightweight draft model to predict future tokens, which are then verified by a more accurate VLM in parallel. By partitioning document pages into independent regions, the approach enables parallel decoding, significantly reducing processing time. On the OmniDocBench, the method achieves a 2.42x lossless acceleration for the dots.ocr model and up to 4.89x acceleration on long-document parsing tasks.
本文提出了一种无需训练的加速方法,以解决基于VLM的文档解析模型的推理延迟问题。该方法使用一个轻量级的草稿模型预测未来的标记,再由更准确的VLM并行验证这些预测。通过将文档页面划分为独立区域,该方法实现了各区域的并行解码。实验表明,在OmniDocBench上,该方法为dots.ocr模型带来2.42倍的无损加速,在长文档解析任务上最高可达4.89倍。代码将被发布以促进可重复性和未来研究。
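The draft-and-verify loop behind speculative decoding can be sketched with toy stand-ins. This is a generic greedy variant, not the authors' pipeline. Both "models" are hypothetical functions over a fixed token list, and verification is simulated with a sequential loop, whereas a real VLM would score all draft positions in one parallel forward pass. The region-level parallelism described in the abstract is not shown.

```python
from typing import Callable, List, Tuple

EOS = "<eos>"


def verify_draft(prefix: List[str], draft: List[str],
                 target_next: Callable[[List[str]], str]) -> List[str]:
    """Keep draft tokens while they match the target model's greedy choice.
    At the first mismatch, emit the target's token instead and stop."""
    accepted: List[str] = []
    for proposed in draft:
        expected = target_next(prefix + accepted)
        accepted.append(expected)  # always the target's token: lossless
        if proposed != expected or expected == EOS:
            break
    return accepted


def speculative_decode(target_next, draft_propose,
                       block_size: int = 4,
                       max_tokens: int = 64) -> Tuple[List[str], int]:
    """Return the decoded tokens and the number of verification rounds."""
    output: List[str] = []
    rounds = 0
    while len(output) < max_tokens and (not output or output[-1] != EOS):
        draft = draft_propose(output, block_size)
        accepted = verify_draft(output, draft, target_next)
        if not accepted:  # empty draft: fall back to one target step
            accepted = [target_next(output)]
        output.extend(accepted)
        rounds += 1
    return output[:max_tokens], rounds


# Toy stand-ins (hypothetical): the target "knows" the reference parse,
# and the draft pipeline gets one token wrong.
REFERENCE = ["# Title", "para one", "para two", "| table |", EOS]
DRAFT_GUESS = ["# Title", "para one", "para tw0", "| table |", EOS]


def toy_target(prefix: List[str]) -> str:
    return REFERENCE[len(prefix)] if len(prefix) < len(REFERENCE) else EOS


def toy_draft(prefix: List[str], k: int) -> List[str]:
    return DRAFT_GUESS[len(prefix):len(prefix) + k]


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft)
    print(tokens, "in", rounds, "verification rounds")
```

Because every emitted token is the target model's own greedy choice, the output matches ordinary greedy decoding; the saving comes from needing fewer verification rounds than tokens (two rounds for five tokens in this toy run).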
Transporting Task Vectors across Different Architectures without Training
Authors: Filippo Rinaldi, Aniello Panariello, Giacomo Salici, Angelo Porrello, Simone Calderara
First: 2026-02-13T14:16:34+00:00 · Latest: 2026-02-13T14:16:34+00:00
Abstract
Adapting large pre-trained models to downstream tasks often produces task-specific parameter updates that are expensive to relearn for every model variant. While recent work has shown that such updates can be transferred between models with identical architectures, transferring them across models of different widths remains largely unexplored. In this work, we introduce Theseus, a training-free method for transporting task-specific updates across heterogeneous models. Rather than matching parameters directly, we characterize a task update by the functional effect it induces on intermediate representations. We formalize task-vector transport as a functional matching problem on observed activations and show that, after aligning representation spaces via orthogonal Procrustes analysis, it admits a stable closed-form solution that preserves the geometry of the update. We evaluate Theseus on vision and language models across different widths, showing consistent improvements over strong baselines without additional training or backpropagation. Our results show that task updates can be meaningfully transferred across architectures when task identity is defined functionally rather than parametrically.
中文标题/摘要
标题:无需训练在不同架构间运输任务向量
将大型预训练模型适应下游任务通常会产生特定于任务的参数更新,这些更新对于每个模型变体来说都十分昂贵,需要重新学习。虽然最近的研究表明,可以在具有相同架构的模型之间转移这些更新,但将它们转移到具有不同宽度的模型上仍然很少被探索。在本文中,我们引入了Theseus,这是一种无需训练的方法,用于在异构模型之间运输特定于任务的更新。我们不是直接匹配参数,而是通过它在中间表示上诱导的功能效应来表征任务更新。我们将任务向量的运输形式化为观察到的激活上的功能匹配问题,并通过正交普罗克拉斯蒂斯分析对表示空间进行对齐,从而得到一个稳定且封闭形式的解,该解保留了更新的几何结构。我们在不同宽度的视觉和语言模型上评估了Theseus,结果显示在没有额外训练或反向传播的情况下,它相对于强大的基线模型具有一致的改进。我们的结果表明,当任务身份以功能而非参数方式定义时,任务更新可以在架构之间有意义地转移。
Summary / 总结
The research aims to address the computational inefficiency of retraining task-specific updates for different model variants. The method, Theseus, is a training-free approach that transfers task-specific updates across models of different widths by characterizing updates based on their functional effect on intermediate representations. Experiments on vision and language models demonstrate consistent improvements over strong baselines without additional training or backpropagation, indicating that task updates can be meaningfully transferred across architectures when defined functionally rather than parametrically.
研究旨在解决为不同模型变体重新训练任务特定更新的计算效率问题。Theseus 是一种无需训练的方法,通过基于中间表示的功能效应来表征更新,从而在不同宽度的模型之间传输任务特定的更新。实验表明,该方法在视觉和语言模型上的一致改进超过了强基线,而无需额外的训练或反向传播,这表明当以功能而非参数方式定义任务时,任务更新可以在不同架构之间有意义地转移。
TFTF: Training-Free Targeted Flow for Conditional Sampling
Authors: Qianqian Qu, Jun S. Liu
First: 2026-02-13T13:41:35+00:00 · Latest: 2026-02-13T13:41:35+00:00
Abstract
We propose a training-free conditional sampling method for flow matching models based on importance sampling. Because a naïve application of importance sampling suffers from weight degeneracy in high-dimensional settings, we modify and incorporate a resampling technique in sequential Monte Carlo (SMC) during intermediate stages of the generation process. To encourage generated samples to diverge along distinct trajectories, we derive a stochastic flow with adjustable noise strength to replace the deterministic flow at the intermediate stage. Our framework requires no additional training, while providing theoretical guarantees of asymptotic accuracy. Experimentally, our method significantly outperforms existing approaches on conditional sampling tasks for MNIST and CIFAR-10. We further demonstrate the applicability of our approach in higher-dimensional, multimodal settings through text-to-image generation experiments on CelebA-HQ.
中文标题/摘要
标题:TFTF: 无需训练的目标导向流用于条件采样
我们提出了一种基于重要性采样的无需训练的条件采样方法,用于流匹配模型。由于在高维设置中直接应用重要性采样会遭受权重退化的问题,我们在生成过程的中间阶段引入并结合了顺序蒙特卡洛(SMC)中的重采样技术。为了鼓励生成样本沿不同的轨迹发散,我们推导出一种具有可调噪声强度的随机流,以替换中间阶段的确定性流。我们的框架不需要额外的训练,同时提供了渐近准确性的理论保证。实验表明,我们的方法在MNIST和CIFAR-10的条件采样任务上显著优于现有方法。我们还通过在CelebA-HQ上的文本到图像生成实验,进一步展示了我们方法在高维和多模态设置中的适用性。
Summary / 总结
The paper introduces TFTF, a training-free method for conditional sampling using importance sampling and resampling techniques in SMC. By incorporating a stochastic flow with adjustable noise, the method encourages diverse sample generation. Experiments show that TFTF outperforms existing approaches on MNIST and CIFAR-10, and it also demonstrates effectiveness in higher-dimensional, multimodal settings like text-to-image generation on CelebA-HQ.
该论文提出了一种名为TFTF的无训练条件采样方法,利用重要性采样和SMC中的重采样技术。通过引入具有可调噪声强度的随机流,该方法促进了样本的多样化生成。实验结果显示,TFTF在MNIST和CIFAR-10上的表现优于现有方法,并且在如CelebA-HQ的高维多模态文本到图像生成任务中也表现出有效性。
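The weight-degeneracy problem and the resampling remedy mentioned above can be illustrated with a standard sequential Monte Carlo step. This is a textbook sketch using plain multinomial resampling triggered by the effective sample size (ESS); the paper describes a modified resampling scheme that is not reproduced here, and the threshold value and function names are assumptions.

```python
import math
import random


def normalize_log_weights(log_weights):
    """Convert log importance weights to normalised weights, stably."""
    shift = max(log_weights)
    unnormalised = [math.exp(lw - shift) for lw in log_weights]
    total = sum(unnormalised)
    return [w / total for w in unnormalised]


def effective_sample_size(weights):
    """ESS = 1 / sum(w_i^2); equals N for uniform weights, about 1 when
    a single particle carries nearly all of the weight."""
    return 1.0 / sum(w * w for w in weights)


def resample_if_degenerate(particles, log_weights, threshold=0.5, rng=None):
    """Multinomial resampling when ESS drops below threshold * N.
    After resampling, all particles carry equal (zero log) weight."""
    rng = rng or random.Random(0)
    n = len(particles)
    weights = normalize_log_weights(log_weights)
    if effective_sample_size(weights) >= threshold * n:
        return list(particles), list(log_weights)
    indices = rng.choices(range(n), weights=weights, k=n)
    return [particles[i] for i in indices], [0.0] * n


if __name__ == "__main__":
    particles = ["a", "b", "c", "d"]
    print(resample_if_degenerate(particles, [0.0, 0.0, 0.0, 0.0]))
    print(resample_if_degenerate(particles, [0.0, -50.0, -50.0, -50.0]))
```

After resampling, the surviving particles are duplicates of one another, which is one way to see why the abstract pairs resampling with a stochastic flow: injected noise lets the copies diverge along distinct trajectories.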
RoadscapesQA: A Multitask, Multimodal Dataset for Visual Question Answering on Indian Roads
Authors: Vijayasri Iyer, Maahin Rathinagiriswaran, Jyothikamalesh S
First: 2026-02-13T12:27:31+00:00 · Latest: 2026-02-13T12:27:31+00:00
Abstract
Understanding road scenes is essential for autonomous driving, as it enables systems to interpret visual surroundings to aid in effective decision-making. We present Roadscapes, a multitask multimodal dataset consisting of up to 9,000 images captured in diverse Indian driving environments, accompanied by manually verified bounding boxes. To facilitate scalable scene understanding, we employ rule-based heuristics to infer various scene attributes, which are subsequently used to generate question-answer (QA) pairs for tasks such as object grounding, reasoning, and scene understanding. The dataset includes a variety of scenes from urban and rural India, encompassing highways, service roads, village paths, and congested city streets, captured in both daytime and nighttime settings. Roadscapes has been curated to advance research on visual scene understanding in unstructured environments. In this paper, we describe the data collection and annotation process, present key dataset statistics, and provide initial baselines for image QA tasks using vision-language models.
中文标题/摘要
标题:RoadscapesQA:面向印度道路的多任务多模态视觉问答数据集
理解道路场景对于自动驾驶至关重要,因为它使系统能够解释视觉环境以辅助有效的决策。我们介绍了Roadscapes,一个包含多达9,000张图像的多任务多模态数据集,这些图像在多样的印度驾驶环境中拍摄,并附有手动验证的边界框。为了促进可扩展的场景理解,我们使用基于规则的启发式方法推断各种场景属性,这些属性随后用于生成用于对象定位、推理和场景理解等任务的问题-答案(QA)对。该数据集包括来自印度城乡的各种场景,涵盖高速公路、服务路、乡村小路和拥挤的城市街道,并在白天和夜间拍摄。Roadscapes旨在推进在非结构化环境中视觉场景理解的研究。在本文中,我们描述了数据收集和注释过程,提供了关键的数据集统计信息,并提供了使用视觉-语言模型进行图像QA任务的初步基线。
Summary / 总结
The research aims to improve autonomous driving systems by developing a dataset for visual question answering on Indian roads. The dataset, Roadscapes, consists of up to 9,000 images with manually verified bounding boxes and rule-based heuristics to infer scene attributes, generating questions and answers for tasks like object grounding and scene understanding. Key findings include the dataset's diverse urban and rural Indian scenes and initial baselines for image QA tasks using vision-language models.
研究旨在通过开发印度道路的视觉问答数据集来提升自动驾驶系统的能力。Roadscapes数据集包含多达9,000张图像,附有手动验证的边界框,并使用基于规则的启发式方法推断场景属性,生成物体定位和场景理解等任务的问题和答案。主要发现包括数据集涵盖了印度城乡多样化的场景,并提供了使用视觉-语言模型的图像问答任务的初步基准。
Thinking Like a Radiologist: A Dataset for Anatomy-Guided Interleaved Vision Language Reasoning in Chest X-ray Interpretation
Authors: Yichen Zhao, Zelin Peng, Piao Yang, Xiaokang Yang, Wei Shen
First: 2026-02-13T11:49:32+00:00 · Latest: 2026-02-13T11:49:32+00:00
Abstract
Radiological diagnosis is a perceptual process in which careful visual inspection and language reasoning are repeatedly interleaved. Most medical large vision language models (LVLMs) perform visual inspection only once and then rely on text-only chain-of-thought (CoT) reasoning, which operates purely in the linguistic space and is prone to hallucination. Recent methods attempt to mitigate this issue by introducing visually related coordinates, such as bounding boxes. However, these remain a pseudo-visual solution: coordinates are still text and fail to preserve rich visual details like texture and density. Motivated by the interleaved nature of radiological diagnosis, we introduce MMRad-IVL-22K, the first large-scale dataset designed for natively interleaved visual language reasoning in chest X-ray interpretation. MMRad-IVL-22K reflects a repeated cycle of reasoning and visual inspection workflow of radiologists, in which visual rationales complement textual descriptions and ground each step of the reasoning process. MMRad-IVL-22K comprises 21,994 diagnostic traces, enabling systematic scanning across 35 anatomical regions. Experimental results on advanced closed-source LVLMs demonstrate that report generation guided by multimodal CoT significantly outperforms that guided by text-only CoT in clinical accuracy and report quality (e.g., 6% increase in the RadGraph metric), confirming that high-fidelity interleaved vision language evidence is a non-substitutable component of reliable medical AI. Furthermore, benchmarking across seven state-of-the-art open-source LVLMs demonstrates that models fine-tuned on MMRad-IVL-22K achieve superior reasoning consistency and report quality compared with both general-purpose and medical-specific LVLMs. The project page is available at https://github.com/qiuzyc/thinking_like_a_radiologist.
中文标题/摘要
标题:像放射科医生一样思考:用于胸部X光解读的解剖引导交织视觉语言推理数据集
放射学诊断是一个感知过程,在这个过程中,仔细的视觉检查和语言推理会反复交织进行。大多数医学大型视觉语言模型(LVLM)只进行一次视觉检查,然后依赖于纯语言的链式思考(CoT)推理,这种推理在语言空间中进行,容易产生幻觉。最近的方法试图通过引入与视觉相关的坐标(如边界框)来缓解这一问题。然而,这些方法仍然是伪视觉解决方案:坐标仍然是文本,无法保留丰富的视觉细节,如纹理和密度。受放射学诊断交织性质的启发,我们引入了MMRad-IVL-22K,这是第一个为胸部X光解读设计的用于原生交织视觉语言推理的大规模数据集。MMRad-IVL-22K反映了放射科医生反复的推理和视觉检查工作流程,其中视觉推理补充了文本描述,并为推理过程中的每一步提供依据。MMRad-IVL-22K包含21,994个诊断痕迹,使系统地扫描35个解剖区域成为可能。在高级闭源LVLM上的实验结果表明,由多模态CoT引导的报告生成在临床准确性和报告质量方面显著优于仅由文本CoT引导的报告生成(例如,RadGraph指标提高6%),这证实了高保真度的交织视觉语言证据是可靠医疗AI不可或缺的组成部分。此外,对七个最先进的开源LVLM的基准测试表明,使用MMRad-IVL-22K微调的模型在推理一致性和报告质量方面优于通用和医学专用的LVLM。项目页面可在https://github.com/qiuzyc/thinking_like_a_radiologist/获得。
PixelRush: Ultra-Fast, Training-Free High-Resolution Image Generation via One-step Diffusion
Authors: Hong-Phuc Lai, Phong Nguyen, Anh Tran
First: 2026-02-13T09:54:27+00:00 · Latest: 2026-02-13T09:54:27+00:00
Abstract
Pre-trained diffusion models excel at generating high-quality images but remain inherently limited by their native training resolution. Recent training-free approaches have attempted to overcome this constraint by introducing interventions during the denoising process; however, these methods incur substantial computational overhead, often requiring more than five minutes to produce a single 4K image. In this paper, we present PixelRush, the first tuning-free framework for practical high-resolution text-to-image generation. Our method builds upon the established patch-based inference paradigm but eliminates the need for multiple inversion and regeneration cycles. Instead, PixelRush enables efficient patch-based denoising within a low-step regime. To address artifacts introduced by patch blending in few-step generation, we propose a seamless blending strategy. Furthermore, we mitigate over-smoothing effects through a noise injection mechanism. PixelRush delivers exceptional efficiency, generating 4K images in approximately 20 seconds, a 10× to 35× speedup over state-of-the-art methods, while maintaining superior visual fidelity. Extensive experiments validate both the performance gains and the quality of outputs achieved by our approach.
中文标题/摘要
标题:PixelRush:通过一步扩散实现无训练的超高速高分辨率图像生成
预训练的扩散模型在生成高质量图像方面表现出色,但仍然受限于其固有的训练分辨率。最近的无训练方法试图通过在去噪过程中引入干预来克服这一限制;然而,这些方法会带来巨大的计算开销,通常需要超过五分钟才能生成一张4K图像。在本文中,我们提出了PixelRush,这是第一个无需调优的实用高分辨率文本到图像生成框架。我们的方法建立在现有的基于块的推理范式之上,但消除了多次反演和再生循环的需要。相反,PixelRush在低步数范围内实现了高效的基于块的去噪。为了应对少步生成中由块融合引入的伪影,我们提出了一种无缝融合策略。此外,我们通过噪声注入机制减轻过度平滑效果。PixelRush实现了卓越的效率,大约在20秒内生成4K图像,相比最先进的方法,速度提高了10到35倍,同时保持了更高的视觉保真度。广泛的实验验证了我们方法的性能提升和输出质量。
Summary / 总结
PixelRush is a tuning-free framework for generating high-resolution images efficiently. It builds on patch-based inference to avoid multiple cycles of inversion and regeneration, enabling fast generation of 4K images in about 20 seconds, which is 10 to 35 times faster than existing methods. The approach introduces a seamless blending strategy and noise injection to address artifacts and over-smoothing, respectively, while maintaining high visual fidelity.
PixelRush 是一种无需训练的高分辨率文本到图像生成框架,通过消除多次反向转换和再生周期来显著提高效率。它采用无缝混合策略和噪声注入机制来解决图像拼接带来的伪影和过度平滑问题。PixelRush 可以在约20秒内生成4K图像,比现有方法快10到35倍,同时保持高质量的视觉效果。
X-SYS: A Reference Architecture for Interactive Explanation Systems
Authors: Tobias Labarta, Nhi Hoang, Maximilian Dreyer, Jim Berend, Oleg Hein, Jackie Ma, Wojciech Samek, Sebastian Lapuschkin
First: 2026-02-13T09:24:03+00:00 · Latest: 2026-02-13T09:24:03+00:00
Comments: 18 pages, 8 figures
Abstract
The explainable AI (XAI) research community has proposed numerous technical methods, yet deploying explainability as systems remains challenging: Interactive explanation systems require both suitable algorithms and system capabilities that maintain explanation usability across repeated queries, evolving models and data, and governance constraints. We argue that operationalizing XAI requires treating explainability as an information systems problem where user interaction demands induce specific system requirements. We introduce X-SYS, a reference architecture for interactive explanation systems, that guides (X)AI researchers, developers and practitioners in connecting interactive explanation user interfaces (XUI) with system capabilities. X-SYS organizes around four quality attributes named STAR (scalability, traceability, responsiveness, and adaptability), and specifies a five-component decomposition (XUI Services, Explanation Services, Model Services, Data Services, Orchestration and Governance). It maps interaction patterns to system capabilities to decouple user interface evolution from backend computation. We implement X-SYS through SemanticLens, a system for semantic search and activation steering in vision-language models. SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability. Together, this work provides a reusable blueprint and concrete instantiation for interactive explanation systems supporting end-to-end design under operational constraints.
中文标题/摘要
标题:X-SYS:交互解释系统参考架构
可解释人工智能(XAI)研究社区提出了众多技术方法,但将可解释性部署为系统仍然具有挑战性:交互式解释系统需要合适的算法和系统能力,以保持解释在重复查询、模型和数据演变以及治理约束下的可用性。我们认为,实现XAI需要将可解释性视为信息系统问题,其中用户交互需求引发特定系统需求。我们介绍了X-SYS,一种交互式解释系统的参考架构,指导(X)AI研究人员、开发人员和从业人员将交互式解释用户界面(XUI)与系统能力连接起来。X-SYS围绕四个质量属性(可扩展性、可追溯性、响应性和适应性)组织,并规定了五部分分解(XUI服务、解释服务、模型服务、数据服务、编排和治理)。它将交互模式映射到系统能力,以解耦用户界面的演变与后端计算。我们通过SemanticLens系统实现了X-SYS,一个用于视觉语言模型的语义搜索和激活引导系统。SemanticLens展示了基于合同的服务边界如何实现独立演变,离线/在线分离如何确保响应性,持久状态管理如何支持可追溯性。这项工作一起提供了一个可重用的蓝图和一个具体的实例,支持在运营约束下端到端设计交互式解释系统。
Summary / 总结
The research aims to address the challenges of deploying explainable AI (XAI) systems by treating explainability as an information systems problem. X-SYS, a reference architecture, is introduced to guide the development of interactive explanation systems. It focuses on four quality attributes (scalability, traceability, responsiveness, and adaptability) and decomposes the system into five components. X-SYS maps interaction patterns to system capabilities, decoupling user interface evolution from backend computation. The implementation through SemanticLens demonstrates how contract-based service boundaries enable independent evolution, offline/online separation ensures responsiveness, and persistent state management supports traceability.
研究旨在通过将可解释AI (XAI) 视为信息系统问题来解决其实现挑战。引入了X-SYS参考架构来指导交互解释系统的开发,重点关注可扩展性、可追溯性、响应性和适应性四个质量属性。关键发现包括五组件分解和基于合同的服务边界的应用,这使得用户界面和后端计算可以独立进化,确保了响应性和可追溯性。
SGM: Safety Glasses for Multimodal Large Language Models via Neuron-Level Detoxification
Authors: Hongbo Wang, MaungMaung AprilPyone, Isao Echizen
First: 2025-12-17T03:31:36+00:00 · Latest: 2026-02-13T07:41:24+00:00
Abstract
Disclaimer: Samples in this paper may be harmful and cause discomfort.
Multimodal large language models (MLLMs) enable multimodal generation but inherit toxic, biased, and NSFW signals from weakly curated pretraining corpora, causing safety risks, especially under adversarial triggers that late, opaque training-free detoxification methods struggle to handle. We propose SGM, a white-box neuron-level multimodal intervention that acts like safety glasses for toxic neurons: it selectively recalibrates a small set of toxic expert neurons via expertise-weighted soft suppression, neutralizing harmful cross-modal activations without any parameter updates. We establish MM-TOXIC-QA, a multimodal toxicity evaluation framework, and compare SGM with existing detoxification techniques. Experiments on open-source MLLMs show that SGM mitigates toxicity in standard and adversarial conditions, cutting harmful rates from 48.2% to 2.5% while preserving fluency and multimodal reasoning. SGM is extensible, and its combined defenses, denoted as SGM*, integrate with existing detoxification methods for stronger safety performance, providing an interpretable, low-cost solution for toxicity-controlled multimodal generation.
中文标题/摘要
标题:SGM:通过神经元级去毒化为多模态大型语言模型提供安全眼镜
免责声明:本文中的样本可能有害并引起不适。
多模态大型语言模型(MLLMs)能够实现多模态生成,但会从未严格校准的预训练语料库中继承有毒、偏见和不适合公开的信号,导致安全风险,尤其是在对抗性触发下,晚期、不透明的无训练去毒化方法难以处理。我们提出SGM,一种白盒的神经元级多模态干预方法,它像安全眼镜一样作用于有毒神经元:通过专家加权软抑制,选择性地重新校准一小部分有毒专家神经元,无需任何参数更新即可消除有害的跨模态激活。我们建立了MM-TOXIC-QA,一个多模态毒性评估框架,并将SGM与现有去毒化技术进行了比较。在开源MLLM上的实验表明,SGM在标准和对抗条件下减轻了毒性,将有害率从48.2%降低到2.5%,同时保持了流畅性和多模态推理能力。SGM具有扩展性,其结合防御措施,标记为SGM*,与现有的去毒化方法结合使用,以增强安全性表现,提供一种可解释、低成本的毒性控制多模态生成解决方案。
Summary / 总结
SGM is a white-box neuron-level intervention method for multimodal large language models (MLLMs) that selectively recalibrates toxic expert neurons via expertise-weighted soft suppression. This method acts like safety glasses, neutralizing harmful cross-modal activations without parameter updates. Experiments show that SGM reduces harmful rates from 48.2% to 2.5% under both standard and adversarial conditions, while maintaining fluency and multimodal reasoning. SGM can be combined with existing detoxification techniques to enhance safety performance.
SGM 是一种白盒神经元级干预方法,用于多模态大型语言模型(MLLMs),通过专家加权软抑制选择性地重新校准有毒专家神经元。这种方法像安全眼镜一样,中和有害的跨模态激活,而不进行参数更新。实验表明,SGM 在标准和对抗条件下将有害率从 48.2% 降低到 2.5%,同时保持流畅性和多模态推理能力。SGM 可以与现有的去毒方法结合使用,以增强安全性表现。
IndicFairFace: Balanced Indian Face Dataset for Auditing and Mitigating Geographical Bias in Vision-Language Models
Authors: Aarish Shah Mohsin, Mohammed Tayyab Ilyas Khan, Mohammad Nadeem, Shahab Saquib Sohail, Erik Cambria, Jiechao Gao
First: 2026-02-13T06:41:03+00:00 · Latest: 2026-02-13T06:41:03+00:00
Abstract
Vision-Language Models (VLMs) are known to inherit and amplify societal biases from their web-scale training data, with Indians being particularly misrepresented. Existing fairness-aware datasets have significantly improved demographic balance across global race and gender groups, yet they continue to treat Indian as a single monolithic category. This oversimplification ignores the vast intra-national diversity across the 28 states and 8 Union Territories of India and leads to representational and geographical bias. To address this limitation, we present IndicFairFace, a novel and balanced face dataset comprising 14,400 images representing the geographical diversity of India. Images were sourced ethically from Wikimedia Commons and open-license web repositories and uniformly balanced across states and gender. Using IndicFairFace, we quantify intra-national geographical bias in prominent CLIP-based VLMs and reduce it using a post-hoc Iterative Nullspace Projection debiasing approach. We also show that the adopted debiasing approach does not adversely impact the existing embedding space, as the average drop in retrieval accuracy on benchmark datasets is less than 1.5 percent. Our work establishes IndicFairFace as the first benchmark to study geographical bias in VLMs for the Indian context.
中文标题/摘要
标题:IndicFairFace:平衡印度面部数据集,用于审计和缓解视觉-语言模型中的地理偏差
视觉-语言模型(VLMs)会从其网络规模的训练数据中继承并放大社会偏见,其中印度人群尤其被错误呈现。现有的公平意识数据集在改善全球种族和性别群体的人口统计平衡方面取得了显著进步,但它们仍然将印度人视为单一的整体类别。这种过度简化忽略了印度28个邦和8个联邦属地之间巨大的国内多样性,并导致了表征偏见和地理偏见。为了解决这一局限性,我们提出了IndicFairFace,这是一个新颖且平衡的面部数据集,包含14,400张图像,代表了印度的地理多样性。这些图像以合乎伦理的方式取自维基共享资源和开放许可的网络存储库,并在各邦和性别方面均匀平衡。使用IndicFairFace,我们量化了主流的基于CLIP的VLMs中的国内地理偏见,并通过事后的迭代零空间投影去偏方法减少了这种偏见。我们还表明,所采用的去偏方法不会对现有的嵌入空间产生负面影响,基准数据集上的检索准确率平均下降不到1.5%。我们的工作将IndicFairFace确立为首个研究印度语境下VLMs地理偏见的基准。
Summary / 总结
The research aims to address the geographical bias in Vision-Language Models (VLMs) concerning the Indian population, which is often misrepresented. To achieve this, the authors created IndicFairFace, a balanced dataset of 14,400 images representing the geographical diversity of India. By uniformly balancing images across states and genders, they quantified and mitigated intra-national geographical bias in CLIP-based VLMs using Iterative Nullspace Projection, with minimal impact on retrieval accuracy. The study highlights the importance of considering regional diversity in fairness-aware datasets for VLMs.
研究旨在通过关注印度背景来解决视觉-语言模型(VLMs)中的地理偏差问题,现有数据集将印度人视为单一类别,忽略了各州之间的多样性。通过从维基媒体公共领域和开放许可网络存储库中伦理地获取14,400张图像,创建了一个平衡的脸部数据集,代表了印度的地理多样性,并且在性别和各州之间均匀分布。使用IndicFairFace,研究量化了CLIP等主流VLMs中的国内地理偏差,并通过后处理的迭代零空间投影去偏方法减少了偏差,同时对基准数据集的检索准确性影响很小,平均下降不到1.5个百分点。
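The core operation in nullspace-projection debiasing is removing the component of an embedding that lies along a direction predictive of the protected attribute. The sketch below shows only that single projection step for one given direction. In Iterative Nullspace Projection the direction would come from a linear classifier trained on the current embeddings, and the train-then-project cycle is repeated; that training loop is omitted here and all names and values are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_out(embedding, direction):
    """Project an embedding onto the nullspace of `direction`:
    x' = x - (x . w / w . w) * w, so that x' . w == 0."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    scale = dot(embedding, direction) / norm_sq
    return [x - scale * w for x, w in zip(embedding, direction)]


def debias(embeddings, direction):
    """Apply the same projection to every embedding in a collection."""
    return [project_out(e, direction) for e in embeddings]


if __name__ == "__main__":
    # Hypothetical attribute direction and embedding.
    w = [1.0, 1.0, 0.0]
    x = [3.0, 1.0, 2.0]
    cleaned = project_out(x, w)
    print(cleaned, "dot with direction:", dot(cleaned, w))
```

Components orthogonal to the removed direction are left unchanged, which is consistent with the abstract's observation that retrieval accuracy drops only slightly after debiasing.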
SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation
Authors: Nobuhiro Ueda, Yuyang Dong, Krisztián Boros, Daiki Ito, Takuya Sera, Masafumi Oyamada
First: 2025-05-20T14:03:24+00:00 · Latest: 2026-02-13T05:59:36+00:00
Abstract
With the increasing adoption of Large Language Models (LLMs) and Vision-Language Models (VLMs), rich document analysis technologies for applications like Retrieval-Augmented Generation (RAG) and visual RAG are gaining significant attention. Recent research indicates that using VLMs yields better RAG performance, but processing rich documents remains a challenge since a single page contains large amounts of information. In this paper, we present SCAN (SemantiC Document Layout ANalysis), a novel approach that enhances both textual and visual Retrieval-Augmented Generation (RAG) systems that work with visually rich documents. It is a VLM-friendly approach that identifies document components with appropriate semantic granularity, balancing context preservation with processing efficiency. SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering contiguous components. We trained the SCAN model by fine-tuning object detection models on an annotated dataset. Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.4 points and visual RAG performance by up to 10.4 points, outperforming conventional approaches and even commercial document processing solutions.
中文标题/摘要
标题:SCAN: 语义文档布局分析以增强文本和视觉检索生成
随着大型语言模型(LLMs)和视觉语言模型(VLMs)的广泛应用,用于检索增强生成(RAG)和视觉RAG等应用的丰富文档分析技术正获得广泛关注。近期研究表明,使用VLMs可以提高RAG性能,但处理丰富文档仍是一个挑战,因为单页包含大量信息。本文提出了一种名为SCAN(语义文档布局分析)的新方法,该方法增强了处理丰富视觉文档的文本和视觉RAG系统。这是一种VLM友好的方法,能够以适当的语义粒度识别文档组件,平衡上下文保留与处理效率。SCAN采用粗粒度语义方法,将文档划分为包含连续组件的连贯区域。我们通过在标注数据集上微调对象检测模型来训练SCAN模型。我们的实验结果表明,SCAN在英语和日语数据集上的端到端文本RAG性能提高了9.4个点,视觉RAG性能提高了10.4个点,优于传统方法,甚至超过了商用文档处理解决方案。
Summary / 总结
SCAN is a novel approach that enhances Retrieval-Augmented Generation (RAG) systems for visually rich documents by identifying document components with appropriate semantic granularity. It uses a coarse-grained semantic approach to divide documents into coherent regions, improving both textual and visual RAG performance by up to 9.4 and 10.4 points respectively, outperforming conventional methods and commercial solutions.
SCAN 是一种新颖的方法,通过识别具有适当语义粒度的文档组件来增强丰富文档的文本和视觉 RAG 系统。它使用粗粒度的语义方法将文档划分为连贯的区域,与传统方法和商业解决方案相比,可以将端到端的文本 RAG 性能提高多达 9.4 个点,视觉 RAG 性能提高多达 10.4 个点。
Language-in-the-Loop Culvert Inspection on the Erie Canal
Authors: Yash Turkar, Yashom Dighe, Karthik Dantu
First: 2025-09-22T17:28:10+00:00 · Latest: 2026-02-13T05:03:29+00:00
Comments: First two authors contributed equally
Abstract
Culverts on canals such as the Erie Canal, built originally in 1825, require frequent inspections to ensure safe operation. Human inspection of culverts is challenging due to age, geometry, poor illumination, weather, and lack of easy access. We introduce VISION, an end-to-end, language-in-the-loop autonomy system that couples a web-scale vision-language model (VLM) with constrained viewpoint planning for autonomous inspection of culverts. Brief prompts to the VLM solicit open-vocabulary ROI proposals with rationales and confidences, stereo depth is fused to recover scale, and a planner -- aware of culvert constraints -- commands repositioning moves to capture targeted close-ups. Deployed on a quadruped in a culvert under the Erie Canal, VISION closes the see, decide, move, re-image loop on-board and produces high-resolution images for detailed reporting without domain-specific fine-tuning. In an external evaluation by New York Canal Corporation personnel, initial ROI proposals achieved 61.4% agreement with subject-matter experts, and final post-re-imaging assessments reached 80%, indicating that VISION converts tentative hypotheses into grounded, expert-aligned findings.
中文标题/摘要
标题:伊利运河上语言在环的涵洞检查
伊利运河等运河上的涵洞,最初于1825年建造,需要频繁检查以确保安全运行。由于年龄、几何形状、照明差、天气和难以接近,人工检查涵洞具有挑战性。我们介绍了VISION,这是一种端到端、循环语言的自主系统,将大规模网络视觉-语言模型(VLM)与受限视角规划相结合,用于自主检查涵洞。向VLM发出简短指令以获取具有理由和置信度的开放词汇ROI提案,融合立体深度以恢复比例,并在了解涵洞约束的情况下,指挥重新定位动作以捕捉目标特写。VISION部署在伊利运河下的四足机器人上,在机载上关闭“看、决定、移动、重新成像”循环,并生成高分辨率图像以进行详细报告,无需特定领域的微调。在纽约运河公司人员的外部评估中,初始ROI提案与专家达成61.4%的一致性,最终重新成像评估达到了80%,表明VISION将初步假设转化为基于事实、专家一致的发现。
Summary / 总结
The paper introduces VISION, an end-to-end autonomy system for inspecting culverts on the Erie Canal using a web-scale vision-language model and constrained viewpoint planning. The system generates open-vocabulary region-of-interest proposals with rationales and confidences, fuses stereo depth to recover scale, and plans repositioning moves to capture targeted close-ups. In an external evaluation, VISION's final assessments reached 80% agreement with subject-matter experts, demonstrating its capability to convert tentative hypotheses into expert-aligned findings without domain-specific fine-tuning.
论文介绍了VISION,这是一种使用网络规模的视觉语言模型和受限视角规划的端到端自主系统,用于检查伊利运河上的涵洞。该系统生成开放词汇的区域兴趣提案及其理由和置信度,融合立体深度以恢复尺度,并计划重新定位动作以捕捉目标特写。外部评估显示,初始ROI提案与专家一致率为61.4%,最终评估达到80%,表明VISION能够将假设转化为专家对齐的发现,无需特定领域的微调。
Free Lunch for Stabilizing Rectified Flow Inversion
Authors: Chenru Wang, Beier Zhu, Chi Zhang
Venue: ICLR 2026
First: 2026-02-12T11:42:36+00:00 · Latest: 2026-02-13T02:39:35+00:00
Comments: Accepted by ICLR 2026
Abstract
Rectified-Flow (RF)-based generative models have recently emerged as strong alternatives to traditional diffusion models, demonstrating state-of-the-art performance across various tasks. By learning a continuous velocity field that transforms simple noise into complex data, RF-based models not only enable high-quality generation, but also support training-free inversion, which facilitates downstream tasks such as reconstruction and editing. However, existing inversion methods, such as vanilla RF-based inversion, suffer from approximation errors that accumulate across timesteps, leading to unstable velocity fields and degraded reconstruction and editing quality. To address this challenge, we propose Proximal-Mean Inversion (PMI), a training-free gradient correction method that stabilizes the velocity field by guiding it toward a running average of past velocities, constrained within a theoretically derived spherical Gaussian. Furthermore, we introduce mimic-CFG, a lightweight velocity correction scheme for editing tasks, which interpolates between the current velocity and its projection onto the historical average, balancing editing effectiveness and structural consistency. Extensive experiments on PIE-Bench demonstrate that our methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the required number of neural function evaluations. Our approach achieves state-of-the-art performance on the PIE-Bench with enhanced efficiency and theoretical soundness.
中文标题/摘要
标题:稳定矫正流反转的免费午餐
基于矫正流(RF)的生成模型最近已成为传统扩散模型的强大替代品,在各种任务中表现出最先进的性能。通过学习将简单噪声转换为复杂数据的连续速度场,RF基模型不仅能够实现高质量的生成,还支持无需训练的反转,这有助于下游任务如重建和编辑。然而,现有的反转方法,如纯RF基反转,会因时间步长中的累积近似误差而导致不稳定的速度场和降级的重建和编辑质量。为了解决这一挑战,我们提出了邻近均值反转(PMI),这是一种无需训练的梯度校正方法,通过将其引导向过去速度的运行平均值来稳定速度场,该平均值受理论推导出的球形高斯约束。此外,我们还引入了模仿-CFG,这是一种轻量级的编辑任务速度校正方案,它在当前速度和其在历史平均值上的投影之间进行插值,平衡编辑效果和结构一致性。在PIE-Bench上的大量实验表明,我们的方法显著提高了反转稳定性、图像重建质量和编辑保真度,同时减少了所需的神经函数评估次数。我们的方法在PIE-Bench上实现了最先进的性能,具有增强的效率和理论严谨性。
Summary / 总结
This paper addresses the issue of unstable velocity fields in Rectified-Flow (RF)-based generative models, which can lead to degraded reconstruction and editing quality. To stabilize these models, the authors propose Proximal-Mean Inversion (PMI), a training-free method that corrects gradients by guiding the velocity field towards a running average of past velocities. Additionally, they introduce mimic-CFG, a lightweight editing scheme that balances editing effectiveness and structural consistency. Experiments on PIE-Bench show that these methods significantly improve inversion stability, image reconstruction quality, and editing fidelity, while reducing the number of neural function evaluations.
本文通过提出Proximal-Mean Inversion (PMI)和mimic-CFG来解决Rectified-Flow (RF)生成模型中的不稳定性问题。PMI通过梯度修正方法,将速度场引导至过去速度的运行平均值,限定在理论上推导出的球形高斯内。mimic-CFG进一步优化编辑任务,平衡效果和结构一致性。实验表明,这些方法显著提高了反演稳定性、图像重建质量和编辑保真度,同时减少了神经函数评估次数,实现了PIE-Bench上的最先进性能。
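The abstract above describes mimic-CFG as interpolating between the current velocity and its projection onto the historical average. The following is a minimal sketch of that single step only, not the authors' implementation: the function names, the plain arithmetic running mean, and the interpolation weight are assumptions made for illustration, and the spherical-Gaussian constraint used by PMI is not modeled here.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def update_running_mean(mean, count, velocity):
    """Incrementally update the arithmetic mean of the velocities seen so far."""
    count += 1
    mean = [m + (v - m) / count for m, v in zip(mean, velocity)]
    return mean, count

def project_onto(velocity, direction):
    """Orthogonal projection of `velocity` onto the line spanned by `direction`."""
    denom = dot(direction, direction)
    if denom == 0.0:
        return [0.0] * len(velocity)
    scale = dot(velocity, direction) / denom
    return [scale * d for d in direction]

def interpolate_velocity(velocity, mean, weight):
    """Blend the current velocity with its projection onto the running mean.

    weight = 0 keeps the current velocity; weight = 1 uses the projection only.
    (`weight` is a hypothetical knob, not a parameter named in the abstract.)
    """
    projected = project_onto(velocity, mean)
    return [(1.0 - weight) * v + weight * p for v, p in zip(velocity, projected)]

# Toy run: two past velocities pointing along the x-axis, then a diagonal one.
mean, count = [0.0, 0.0], 0
for past in ([1.0, 0.0], [1.0, 0.0]):
    mean, count = update_running_mean(mean, count, past)

corrected = interpolate_velocity([1.0, 1.0], mean, weight=0.5)
print(corrected)
```

In this toy run the off-axis component of the new velocity is halved, which is the intended trade-off: larger weights favor consistency with the history, smaller weights favor the current (edit-driven) velocity.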
What Matters in Building Vision-Language-Action Models for Generalist Robots
Authors: Xinghang Li, Peiyan Li, Long Qian, Minghuan Liu, Dong Wang, Jirong Liu, Bingyi Kang, Xiao Ma, Xinlong Wang, Di Guo, Tao Kong, Hanbo Zhang, Huaping Liu
First: 2024-12-18T17:07:20+00:00 · Latest: 2026-02-13T02:05:15+00:00
Comments: Project page: robovlms.github.io. Added limitations and future works. Fix categorization
Abstract
To utilize Foundation Vision Language Models (VLMs) for robotic tasks and motion planning, the community has proposed different methods for injecting action components into VLMs and building the Vision-Language-Action models (VLAs). In this work, we disclose the key factors that significantly influence the performance of VLA on robot manipulation problems and focus on answering three essential design choices: which backbone to select, how to formulate the VLA architectures, and when to add cross-embodiment data. The obtained results convince us firmly to explain why we prefer VLA and develop a new family of VLAs, RoboVLMs, which require very few manual designs and achieve a new state-of-the-art performance in three simulation tasks and real-world experiments. Through our extensive experiments, which include over 8 VLM backbones, 4 policy architectures, and over 600 distinct designed experiments, we provide a detailed guidebook for the future design of VLAs. In addition to the study, the highly flexible RoboVLMs framework, which supports easy integrations of new VLMs and free combinations of various design choices, is made public to facilitate future research. We open-source all details, including codes, models, datasets, and toolkits, along with detailed training and evaluation recipes at: robovlms.github.io.
中文标题/摘要
标题:构建通用机器人视觉语言行动模型的关键因素
为了利用基础视觉语言模型(VLMs)进行机器人任务和运动规划,社区提出了不同的方法将行动组件注入VLMs并构建视觉语言行动模型(VLAs)。在本文中,我们揭示了显著影响机器人操作问题中VLAs性能的关键因素,并重点关注回答三个基本设计选择:选择哪个骨干网络、如何构建VLAs架构以及何时添加跨体数据。获得的结果使我们坚信为什么我们更喜欢VLAs,并开发了一种新的VLAs家族RoboVLMs,它需要很少的手动设计并在三个模拟任务和真实世界实验中达到了新的最佳性能。通过包括超过8种VLM骨干网络、4种策略架构和超过600种不同设计实验的广泛实验,我们为未来VLAs的设计提供了详细的指南。除了研究之外,我们还公开了一个高度灵活的RoboVLMs框架,支持新VLM的轻松集成和各种设计选择的自由组合,以促进未来研究。我们开源了所有细节,包括代码、模型、数据集和工具包,以及详细的训练和评估方法:robovlms.github.io。
Summary / 总结
This study investigates the critical factors influencing the performance of Vision-Language-Action (VLA) models for robotic manipulation tasks. Through extensive experiments with various backbones, architectures, and data integration strategies, the authors propose a new family of VLA models, RoboVLMs, which achieve state-of-the-art performance. The work provides a comprehensive guide for future VLA design and open-sources all details and tools for the community to use and build upon.
该研究探讨了影响视觉-语言-动作(VLA)模型在机器人操作任务中性能的关键因素。通过超过8种视觉语言模型(VLM)骨干、4种策略架构以及超过600种不同设计的广泛实验,作者确定了选择骨干、构建VLA架构和使用跨体数据的最佳选择。研究导致了RoboVLMs的开发,该模型在模拟和真实世界实验中均达到了最先进的性能,且需要极少的手动设计。该框架已公开,以支持未来的研究和发展。
ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
Authors: Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo
Venue: AAAI 2026 poster
First: 2025-12-19T07:27:19+00:00 · Latest: 2026-02-13T01:46:03+00:00
Comments: Accepted for poster presentation at AAAI 2026
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model's temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
中文标题/摘要
标题:ProCache:基于约束的特征缓存与选择性计算以加速扩散变换器
扩散变换器(DiTs)在生成建模中取得了最先进的性能,但其高昂的计算成本阻碍了实时部署。虽然特征缓存通过利用时间冗余提供了一种无训练的加速解决方案,但现有方法存在两个关键局限性:(1)均匀的缓存间隔无法与DiT的时间非均匀动态对齐;(2)使用过大的缓存间隔进行简单的特征重用会导致严重的误差累积。在本文中,我们分析了去噪过程中DiT特征的演变,发现特征变化和误差传播在时间和深度上都高度变化。受此启发,我们提出了ProCache,这是一种基于约束的动态特征缓存框架,通过两个核心组件解决了这些问题:(i)一种约束感知的缓存模式搜索模块,通过离线约束采样生成非均匀激活时间表,以适应模型的时间特性;(ii)一种选择性计算模块,在深层块和高重要性标记中选择性地计算缓存段,以最小化误差累积,同时减少开销。在PixArt-alpha和DiT上的广泛实验表明,ProCache在几乎不降低质量的情况下实现了高达1.96倍和2.90倍的加速,显著优于先前的基于缓存的方法。
Summary / 总结
ProCache is a training-free dynamic feature caching framework designed to accelerate Diffusion Transformers (DiTs) by addressing two limitations of prior caching methods: uniform caching intervals and error accumulation from overly large intervals. It uses a constraint-aware caching pattern search module to generate non-uniform activation schedules and a selective computation module that recomputes deep blocks and high-importance tokens to limit error accumulation. Experiments on PixArt-alpha and DiT show up to 1.96x and 2.90x acceleration, respectively, with negligible quality degradation, outperforming prior caching-based methods.
ProCache 是一个无需训练的动态特征缓存框架,旨在通过解决均匀缓存间隔和错误累积过大的问题来加速扩散变换器(DiTs)。它使用约束感知的缓存模式搜索模块生成非均匀激活调度,并使用选择性计算模块来最小化错误传播。实验表明,ProCache 在 PixArt-alpha 和 DiT 上分别实现最高 1.96 倍和 2.90 倍的加速,同时保持质量基本不变,显著优于之前的基于缓存的方法。
Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Authors: Ara Yeroyan
Venue: SIGIR 2026
First: 2026-02-13T01:27:39+00:00 · Latest: 2026-02-13T01:27:39+00:00
Comments: 4 pages, 3 figures. Submitted to SIGIR 2026 Demonstrations Track. Project website: https://github.com/Ara-Yeroyan/visual-rag-toolkit
Abstract
Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings.
Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves NDCG and Recall @ 5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS); with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing - high resolution PDF to image conversion, optional margin/empty-region cropping and token hygiene (indexing only visual tokens) - and a reproducible evaluation pipeline, enabling rapid exploration of two-, three-, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k <= 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
中文标题/摘要
标题:视觉RAG工具包:通过无训练池化和多阶段检索扩展多向量视觉检索
多向量视觉检索器(例如,ColPali风格的后期交互模型)提供强大的准确性,但由于每页生成数千个向量,导致索引和检索成本不断增加,难以扩展。我们提出了视觉RAG工具包,这是一种实用系统,用于通过无训练、模型感知的池化和多阶段检索扩展视觉多向量检索。受Matryoshka嵌入的启发,我们的方法在补丁嵌入上执行静态空间池化——包括一种轻量级滑动窗口平均变体——以生成紧凑的瓷砖级和全局表示,用于快速候选生成,随后使用完整的多向量嵌入进行精确MaxSim重排序。我们的设计通过将每页存储的向量从数千个减少到几十个,减少了向量到向量的比较次数的平方,显著提高了检索效率,而无需在后训练、适配器或蒸馏后进行任何操作。在ColPali和ColSmol-500M等交互式模型的实验中,我们发现,在有限的ViDoRe v2基准语料库上,两阶段检索通常可以保持NDCG和Recall @ 5/10,仅轻微降级,同时显著提高吞吐量(约4倍QPS),主要在非常大的k值时敏感。该工具包还提供了稳健的预处理——高分辨率PDF到图像转换,可选的边距/空区域裁剪和标记卫生(仅索引视觉标记)——以及可重复的评估管道,使快速探索两阶段、三阶段和级联检索变体成为可能。通过强调在常见截止值(例如,k <= 10)的效率,该工具包降低了硬件门槛,使最先进的视觉检索在实践中更具可访问性。
Summary / 总结
The research aims to improve the scalability of multi-vector visual retrievers by reducing the number of vectors stored per page from thousands to dozens. The method uses training-free, model-aware pooling for fast candidate generation, followed by exact MaxSim reranking in a multi-stage retrieval pipeline. Experiments on the ViDoRe v2 benchmark show that two-stage retrieval typically preserves NDCG and Recall@5/10 with minimal degradation while improving throughput by approximately 4x, with quality most sensitive at very large k. The toolkit also includes preprocessing and a reproducible evaluation pipeline for rapid exploration of retrieval variants.
Visual RAG 工具包通过采用无需训练的池化和多阶段检索方法来解决多向量视觉检索的可扩展性问题。它使用静态空间池化生成紧凑的表示以快速生成候选,然后进行精确的 MaxSim 重排序。这种方法使向量间比较次数呈二次方级减少,吞吐量提升约 4 倍,且无需额外训练。该工具包还包含预处理步骤和可重复的评估管道,使先进的视觉检索更易于使用。
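To make the two-stage idea in this entry concrete (pooled vectors for cheap candidate generation, then exact MaxSim reranking on the full multi-vector embeddings), here is a small pure-Python sketch. It is illustrative only: the function names, the tile size, the candidate count, and the toy two-dimensional "patch embeddings" are assumptions and are not taken from the toolkit's code or API.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_pool(vectors, tile):
    """Average each consecutive group of `tile` patch vectors into one vector."""
    pooled = []
    for i in range(0, len(vectors), tile):
        group = vectors[i:i + tile]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled

def maxsim(query_vecs, page_vecs):
    """Late-interaction score: best match per query vector, summed over the query."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

def two_stage_search(query_vecs, pages, tile=2, n_candidates=2, top_k=1):
    # Stage 1: score every page against its compact pooled vectors.
    pooled = {pid: mean_pool(vecs, tile) for pid, vecs in pages.items()}
    coarse = sorted(pages, key=lambda pid: maxsim(query_vecs, pooled[pid]), reverse=True)
    candidates = coarse[:n_candidates]
    # Stage 2: exact MaxSim on the full vectors, but only for the candidates.
    reranked = sorted(candidates, key=lambda pid: maxsim(query_vecs, pages[pid]), reverse=True)
    return reranked[:top_k]

# Toy corpus: three "pages" with four 2-d patch vectors each (hypothetical data).
pages = {
    "a": [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]],
    "b": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    "c": [[-1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]],
}
query = [[1.0, 0.0]]
result = two_stage_search(query, pages)
print(result)
```

The saving comes from stage 1 comparing against far fewer vectors per page; exact scoring is then paid only for the shortlisted candidates, which is why quality depends on the shortlist being large enough relative to the requested k.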
On Robustness and Chain-of-Thought Consistency of RL-Finetuned VLMs
Authors: Rosie Zhao, Anshul Shah, Xiaoyu Zhu, Xinke Deng, Zhongyu Jiang, Yang Yang, Joerg Liebelt, Arnab Mondal
First: 2026-02-13T01:12:00+00:00 · Latest: 2026-02-13T01:12:00+00:00
Abstract
Reinforcement learning (RL) fine-tuning has become a key technique for enhancing large language models (LLMs) on reasoning-intensive tasks, motivating its extension to vision language models (VLMs). While RL-tuned VLMs improve on visual reasoning benchmarks, they remain vulnerable to weak visual grounding, hallucinations, and over-reliance on textual cues. We show that simple, controlled textual perturbations--misleading captions or incorrect chain-of-thought (CoT) traces--cause substantial drops in robustness and confidence, and that these effects are more pronounced when CoT consistency is taken into account across open-source multimodal reasoning models. Entropy-based metrics further show that these perturbations reshape model uncertainty and probability mass on the correct option, exposing model-specific trends in miscalibration. To better understand these vulnerabilities, we further analyze RL fine-tuning dynamics and uncover an accuracy-faithfulness trade-off: fine-tuning raises benchmark accuracy, but can simultaneously erode the reliability of the accompanying CoT and its robustness to contextual shifts. Although adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. Incorporating a faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies and robustness remains elusive. Together, these findings highlight the limitations of accuracy-only evaluations and motivate training and assessment protocols that jointly emphasize correctness, robustness, and the faithfulness of visually grounded reasoning.
中文标题/摘要
标题:关于RL微调VLMs的鲁棒性和链式思维一致性
强化学习(RL)微调已成为增强大型语言模型(LLMs)在推理密集型任务上的关键技术,推动了其向视觉语言模型(VLMs)的扩展。虽然RL调优的VLMs在视觉推理基准测试中有所改进,但它们仍然容易受到弱视觉定位、幻觉和过度依赖文本提示的影响。我们表明,简单的、受控的文本扰动——误导性的描述或错误的链式思维(CoT)轨迹——会导致鲁棒性和信心显著下降,而且当考虑开源多模态推理模型中的CoT一致性时,这些影响更为明显。基于熵的度量进一步表明,这些扰动重塑了模型对正确选项的不确定性及其概率分布,揭示了模型特定的失准趋势。为了更好地理解这些漏洞,我们进一步分析了RL微调动态,并发现准确性和忠实性之间存在权衡:微调提高了基准测试的准确性,但同时可能削弱伴随的CoT的可靠性及其对上下文变化的鲁棒性。尽管对抗性增强可以提高鲁棒性,但并不能单独防止忠实性漂移。引入一种忠实性意识的奖励可以恢复答案和推理之间的对齐,但与增强结合使用时,训练可能会导致策略的崩溃,鲁棒性仍然难以实现。这些发现共同强调了仅依赖准确性的评估的局限性,并促使制定同时强调正确性、鲁棒性和视觉定位推理忠实性的训练和评估协议。
Summary / 总结
This study investigates the robustness and chain-of-thought (CoT) consistency of reinforcement learning (RL)-finetuned vision language models (VLMs) on reasoning-intensive tasks. The research demonstrates that simple textual perturbations significantly reduce the models' robustness and confidence, with more pronounced effects when CoT consistency is considered. Entropy-based metrics reveal that these perturbations reshape model uncertainty, exposing model-specific trends in miscalibration. The study uncovers an accuracy-faithfulness trade-off in RL fine-tuning, where improvements in benchmark accuracy can erode the reliability of CoT and its robustness to contextual shifts. While adversarial augmentation improves robustness, it does not by itself prevent faithfulness drift. A faithfulness-aware reward can restore alignment between answers and reasoning, but when paired with augmentation, training risks collapsing onto shortcut strategies. The findings highlight the need for training and assessment protocols that emphasize correctness, robustness, and faithfulness in visually grounded reasoning.
研究探讨了强化学习(RL)微调的视觉语言模型(VLMs)在推理密集型任务中的稳健性和链式思维(CoT)一致性。研究显示,简单的文本扰动显著降低了模型的稳健性和信心,特别是在考虑CoT一致性时效果更为明显。熵基度量揭示了这些扰动重塑了模型的不确定性,暴露了模型在校准方面的趋势。研究揭示了RL微调中的准确性和忠实性之间的权衡,在提高基准准确性的过程中,可能会同时削弱CoT的可靠性和其对上下文变化的稳健性。虽然对抗性增强可以提高稳健性,但并不能防止忠实性漂移。忠实性感知的奖励可以恢复答案与推理的一致性,但与增强结合使用时,训练可能会陷入捷径策略,稳健性仍然难以实现。这些发现强调了仅依赖准确性的评估方法的局限性,并促使开发同时强调正确性、稳健性和视觉定位推理忠实性的训练和评估协议。
Layer-Specific Fine-Tuning for Improved Negation Handling in Medical Vision-Language Models
Authors: Ali Abbasi, Mehdi Taghipour, Rahmatollah Beheshti
Venue: ICML 2026
First: 2026-02-13T00:44:26+00:00 · Latest: 2026-02-13T00:44:26+00:00
Comments: 15 pages, 5 figures. Submitted to ICML 2026
Abstract
Negation is a fundamental linguistic operation in clinical reporting, yet vision-language models (VLMs) frequently fail to distinguish affirmative from negated medical statements. To systematically characterize this limitation, we introduce a radiology-specific diagnostic benchmark that evaluates polarity sensitivity under controlled clinical conditions, revealing that common medical VLMs consistently confuse negated and non-negated findings. To enable learning beyond simple condition absence, we further construct a contextual clinical negation dataset that encodes structured claims and supports attribute-level negations involving location and severity. Building on these resources, we propose Negation-Aware Selective Training (NAST), an interpretability-guided adaptation method that uses causal tracing effects (CTEs) to modulate layer-wise gradient updates during fine-tuning. Rather than applying uniform learning rates, NAST scales each layer's update according to its causal contribution to negation processing, transforming mechanistic interpretability signals into a principled optimization rule. Experiments demonstrate improved discrimination of affirmative and negated clinical statements without degrading general vision-language alignment, highlighting the value of causal interpretability for targeted model adaptation in safety-critical medical settings. Code and resources are available at https://github.com/healthylaife/NAST.
中文标题/摘要
标题:针对医学视觉语言模型中改进否定处理的分层微调
否定是临床报告中的一项基本语言操作,但视觉语言模型(VLMs)经常无法区分肯定陈述和否定陈述。为了系统地表征这一局限性,我们引入了一个放射学特定的诊断基准,该基准在受控的临床条件下评估极性敏感性,揭示了常见的医学VLMs经常混淆否定和非否定的发现。为了使学习超越简单的条件不存在,我们进一步构建了一个上下文临床否定数据集,该数据集编码了结构化的声明并支持涉及位置和严重程度的属性级否定。基于这些资源,我们提出了一个注意否定的有选择性微调方法(NAST),这是一种基于可解释性引导的适应方法,使用因果追踪效应(CTEs)在微调过程中调节逐层梯度更新。NAST 不是应用统一的学习率,而是根据每一层对否定处理的因果贡献来调整更新,将机制可解释性信号转化为一种原则性的优化规则。实验表明,在不降低视觉语言整体对齐的情况下,改进了对肯定和否定临床陈述的区分,突显了因果可解释性在安全关键医疗环境中的目标模型适应中的价值。代码和资源可在 https://github.com/healthylaife/NAST/ 获取。
Summary / 总结
The research aims to improve the handling of negation in medical vision-language models by systematically evaluating their performance on a radiology-specific diagnostic benchmark. The method involves constructing a contextual clinical negation dataset and proposing Negation-Aware Selective Training (NAST), which uses causal tracing effects to modulate layer-wise gradient updates during fine-tuning. The key experimental finding is that NAST enhances the model's ability to distinguish between affirmative and negated clinical statements without compromising general vision-language alignment, demonstrating the importance of causal interpretability in medical applications.
该研究通过引入放射学特定的诊断基准和上下文临床否定数据集,解决了医疗视觉语言模型在处理否定方面的问题。提出的Negation-Aware Selective Training (NAST) 方法使用因果追踪效应来调节层间梯度更新,在细调过程中改善了模型区分肯定和否定临床陈述的能力,同时不损害一般视觉语言性能。
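The abstract says NAST scales each layer's update according to its causal contribution to negation processing instead of using a uniform learning rate. A minimal sketch of that idea follows, using plain SGD on Python lists. The normalization (dividing by the largest causal tracing effect), the layer names, and the numbers are assumptions chosen for illustration; the paper's actual scaling rule may differ.

```python
def layer_scales(cte_by_layer):
    """Map per-layer causal tracing effects (CTEs) to multipliers in [0, 1].

    Assumption: negative effects are clipped to zero and the largest effect
    is normalized to 1, so the most causally relevant layer gets the full rate.
    """
    peak = max(cte_by_layer.values())
    if peak <= 0.0:
        return {name: 0.0 for name in cte_by_layer}
    return {name: max(value, 0.0) / peak for name, value in cte_by_layer.items()}

def scaled_sgd_step(params, grads, cte_by_layer, base_lr):
    """One SGD step where each layer's learning rate is base_lr times its scale."""
    scales = layer_scales(cte_by_layer)
    return {
        name: [w - base_lr * scales[name] * g for w, g in zip(params[name], grads[name])]
        for name in params
    }

# Hypothetical two-layer model with identical weights and gradients.
params = {"layer0": [1.0], "layer1": [1.0]}
grads = {"layer0": [1.0], "layer1": [1.0]}
cte = {"layer0": 0.5, "layer1": 2.0}  # layer1 matters more for negation here

updated = scaled_sgd_step(params, grads, cte, base_lr=0.5)
print(updated)
```

With equal gradients, the layer with the larger causal effect moves four times as far as the other one, which is the behavior the abstract contrasts with uniform learning rates.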
Continuous Diffusion Models Can Obey Formal Syntax
Authors: Jinwoo Kim, Taylor Berg-Kirkpatrick, Loris D'Antoni
First: 2026-02-12T22:55:05+00:00 · Latest: 2026-02-12T22:55:05+00:00
Abstract
Diffusion language models offer a promising alternative to autoregressive models due to their global, non-causal generation process, but their continuous latent dynamics make discrete constraints -- e.g., the output should be a JSON file that matches a given schema -- difficult to impose. We introduce a training-free guidance method for steering continuous diffusion language models to satisfy formal syntactic constraints expressed using regular expressions. Our approach constructs an analytic score estimating the probability that a latent state decodes to a valid string accepted by a given regular expression, and uses its gradient to guide sampling, without training auxiliary classifiers. The denoising process targets the base model conditioned on syntactic validity.
We implement our method in Diffinity on top of the PLAID diffusion model and evaluate it on 180 regular-expression constraints over JSON and natural-language benchmarks. Diffinity achieves 68-96% constraint satisfaction while incurring only a small perplexity cost relative to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality.
中文标题/摘要
标题:连续扩散模型可以遵循形式语法
扩散语言模型因其全局、非因果生成过程而成为自回归模型的有前途的替代方案,但其连续的潜在动态使得难以施加离散约束——例如,输出应是一个符合给定模式的JSON文件。我们提出了一种无需训练的指导方法,用于引导连续扩散语言模型满足使用正则表达式表达的形式语法约束。我们的方法构建了一个分析得分,估计潜在状态解码为由给定正则表达式接受的有效字符串的概率,并使用其梯度来引导采样,而不进行辅助分类器的训练。去噪过程针对基于语法规则有效的基础模型。
我们在Diffinity上实现该方法,基于PLAID扩散模型,并在JSON和自然语言基准上对180个正则表达式约束进行评估。Diffinity在满足约束和输出质量方面均优于自回归约束解码,仅相对于无约束采样引入了很小的困惑度成本。
Summary / 总结
The research aims to address the challenge of imposing discrete constraints on continuous diffusion models, which are known for their global, non-causal generation process. The method introduces a training-free guidance technique that uses an analytic score to estimate the probability of a latent state decoding to a valid string according to a given regular expression, guiding the sampling process. Experiments on 180 regular-expression constraints show that Diffinity, implemented on top of PLAID, achieves 68-96% constraint satisfaction with minimal impact on output quality compared to unconstrained sampling, outperforming autoregressive constrained decoding in both constraint satisfaction and output quality.
研究旨在解决在连续扩散模型中施加离散约束的问题,这些模型虽然具有全局生成过程,但在强制执行形式语法方面存在困难。方法引入了一种无需训练的引导技术,通过分析得分估计潜在状态解码为符合正则表达式的有效字符串的概率。该得分的梯度引导采样过程,确保语法有效性。该方法在PLAID上实现为Diffinity,实现了68-96%的约束满足率,并且对输出质量的影响很小,同时在各种基准测试中的约束遵守和输出质量方面优于自回归模型。
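The core quantity in this entry is an analytic score: the probability that a latent state decodes to a string accepted by a regular expression. One simple way to compute such a probability, sketched below, is dynamic programming over a deterministic finite automaton (DFA). This is an illustration under strong assumptions rather than the paper's construction: it assumes each position has an independent categorical distribution over symbols, and the DFA for the regular expression `ab*` is written by hand instead of being compiled from the pattern. The gradient-based guidance step is not shown.

```python
def acceptance_probability(position_dists, transitions, start, accepting):
    """Probability that independently sampled symbols form a string the DFA accepts.

    position_dists: list of {symbol: probability} dicts, one per position.
    transitions:    {(state, symbol): next_state} for a complete DFA.
    """
    state_mass = {start: 1.0}  # probability of being in each DFA state so far
    for dist in position_dists:
        next_mass = {}
        for state, mass in state_mass.items():
            for symbol, prob in dist.items():
                target = transitions[(state, symbol)]
                next_mass[target] = next_mass.get(target, 0.0) + mass * prob
        state_mass = next_mass
    return sum(mass for state, mass in state_mass.items() if state in accepting)

# Hand-written DFA for the regular expression `ab*` over the alphabet {a, b}.
# State 0: start, state 1: accepting, state 2: dead (rejecting sink).
transitions = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 2, (1, "b"): 1,
    (2, "a"): 2, (2, "b"): 2,
}

# Hypothetical per-position symbol distributions for a length-2 output.
position_dists = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.25, "b": 0.75},
]

p_valid = acceptance_probability(position_dists, transitions, start=0, accepting={1})
print(p_valid)
```

Here the only accepted length-2 string is "ab", so the result equals P(a at position 0) times P(b at position 1). Because the score is a sum of products of the per-position probabilities, it is differentiable with respect to them, which is what makes a gradient-guided variant conceivable.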
Self-Refining Vision Language Model for Robotic Failure Detection and Reasoning
Authors: Carl Qi, Xiaojie Wang, Silong Yong, Stephen Sheng, Huitan Mao, Sriram Srinivasan, Manikantan Nambi, Amy Zhang, Yesh Dattatreya
First: 2026-02-12T20:55:36+00:00 · Latest: 2026-02-12T20:55:36+00:00
Abstract
Reasoning about failures is crucial for building reliable and trustworthy robotic systems. Prior approaches either treat failure reasoning as a closed-set classification problem or assume access to ample human annotations. Failures in the real world are typically subtle, combinatorial, and difficult to enumerate, whereas rich reasoning labels are expensive to acquire. We address this problem by introducing ARMOR: Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning. We formulate detection and reasoning as a multi-task self-refinement process, where the model iteratively predicts detection outcomes and natural language reasoning conditioned on past outputs. During training, ARMOR learns from heterogeneous supervision - large-scale sparse binary labels and small-scale rich reasoning annotations - optimized via a combination of offline and online imitation learning. At inference time, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments across diverse environments show that ARMOR achieves state-of-the-art performance by improving over the previous approaches by up to 30% on failure detection rate and up to 100% in reasoning measured through LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning beyond predefined failure modes. We provide additional visualizations on our website: https://sites.google.com/utexas.edu/armor
中文标题/摘要
标题:自精炼视觉语言模型在机器人故障检测与推理中的应用
故障推理对于构建可靠和可信赖的机器人系统至关重要。先前的方法要么将故障推理视为封闭集分类问题,要么假设可以访问充足的人员注释。现实世界中的故障通常是微妙的、组合的且难以枚举,而丰富的推理标签则非常昂贵。我们通过引入ARMOR:基于多轮次的多任务模型来解决这一问题,用于机器人故障检测与推理。我们将检测与推理形式化为一个多任务自精炼过程,其中模型在迭代过程中根据过去输出预测检测结果和自然语言推理。在训练期间,ARMOR 从异构监督中学习——大规模稀疏二进制标签和小规模丰富的推理注释——通过结合离线和在线模仿学习进行优化。在推理时,ARMOR 生成多个精炼轨迹,并通过自我确定性度量选择最自信的预测。跨不同环境的实验表明,与之前的最佳方法相比,ARMOR 在故障检测率上提高了最多 30%,在通过大模型模糊匹配得分衡量的推理方面提高了 100%,展示了对异构监督和超出预定义故障模式的开放性推理的鲁棒性。我们还提供了额外的可视化结果,可在我们的网站上查看:https://sites.google.com/utexas.edu/armor
Summary / 总结
The research aims to improve the reliability of robotic systems by addressing the challenge of failure reasoning, where real-world failures are typically subtle, combinatorial, and difficult to enumerate. ARMOR, an Adaptive Round-based Multi-task mOdel for Robotic failure detection and reasoning, is introduced. It uses a multi-task self-refinement process and learns from both sparse binary labels and rich reasoning annotations. During inference, ARMOR generates multiple refinement trajectories and selects the most confident prediction via a self-certainty metric. Experiments show that ARMOR improves over previous methods by up to 30% in failure detection rate and by up to 100% in reasoning quality as measured by an LLM fuzzy match score, demonstrating robustness to heterogeneous supervision and open-ended reasoning.
研究旨在通过引入自提升的视觉-语言模型ARMOR,解决细微且组合式的故障问题,以提高机器人系统的可靠性。ARMOR 将故障检测和推理视为一个多任务自提升过程,从稀疏二元标签和丰富推理注释中学习。在推理时,它生成多个轨迹并选择最自信的预测。实验表明,与之前的方法相比,ARMOR 在故障检测率上最多提高了 30%,在以大模型模糊匹配得分衡量的推理方面最多提高了 100%,展示了对异构监督和开放性推理的强大适应性。
What does RL improve for Visual Reasoning? A Frankenstein-Style Analysis
Authors: Xirui Li, Ming Li, Tianyi Zhou
First: 2026-02-12T20:44:27+00:00 · Latest: 2026-02-12T20:44:27+00:00
Abstract
Reinforcement learning (RL) with verifiable rewards has become a standard post-training stage for boosting visual reasoning in vision-language models, yet it remains unclear what capabilities RL actually improves compared with supervised fine-tuning as cold-start initialization (IN). End-to-end benchmark gains conflate multiple factors, making it difficult to attribute improvements to specific skills. To bridge the gap, we propose a Frankenstein-style analysis framework including: (i) functional localization via causal probing; (ii) update characterization via parameter comparison; and (iii) transferability test via model merging. Instead, RL induces a consistent inference-time shift primarily in mid-to-late layers, and these mid-to-late refinements are both transferable (via merging) and necessary (via freezing) for RL gains. Overall, our results suggest that RL's reliable contribution in visual reasoning is not a uniform enhancement of visual perception, but a systematic refinement of mid-to-late transformer computation that improves vision-to-reasoning alignment and reasoning performance, highlighting the limitations of benchmark-only evaluation for understanding multimodal reasoning improvements.
中文标题/摘要
标题:强化学习对视觉推理有何改进?一种弗兰肯斯坦式分析
具有可验证奖励的强化学习(RL)已成为提升视觉语言模型视觉推理能力的标准后训练阶段,但与监督微调作为冷启动初始化(IN)相比,RL实际上改善了哪些能力仍不清楚。端到端基准测试的收益混杂了多个因素,使得难以将改进归因于特定技能。为弥合这一差距,我们提出了一种弗兰肯斯坦式分析框架,包括:(i) 功能定位通过因果探针;(ii) 更新表征通过参数比较;(iii) 转移测试通过模型合并。相反,RL 主要在中后期层引起一致的推理时变化,并且这些中后期的改进既可以通过合并转移,又可以通过冻结来实现以获得RL的收益。总体而言,我们的结果表明,RL在视觉推理中的可靠贡献不是视觉感知的均匀增强,而是对中后期变换器计算的系统性改进,提高了视觉到推理的对齐和推理性能,突显了仅基准评估理解多模态推理改进的局限性。
Summary / 总结
The study investigates what reinforcement learning (RL) specifically improves in visual reasoning compared to supervised fine-tuning. It proposes a Frankenstein-style analysis framework involving causal probing, parameter comparison, and model merging to dissect RL's impact. The key finding is that RL primarily shifts inference in mid-to-late layers, with these refinements being both transferable and necessary for RL gains in visual reasoning performance.
研究探讨了强化学习(RL)在视觉推理中相较于监督微调具体改进了哪些方面。提出了因果探针、参数比较和模型合并的Frankenstein分析框架来剖析RL的影响。主要发现是,RL主要在中间到后期层改变了推理,这些改进既是可转移的,也是必要的,对于RL在视觉推理性能上的提升至关重要。
Synthetic Image Detection with CLIP: Understanding and Assessing Predictive Cues
Authors: Marco Willi, Melanie Mathys, Michael Graber
First: 2026-02-12T20:21:32+00:00 · Latest: 2026-02-12T20:21:32+00:00
Comments: 11 figures; 23 pages
Abstract
Recent generative models produce near-photorealistic images, challenging the trustworthiness of photographs. Synthetic image detection (SID) has thus become an important area of research. Prior work has highlighted how synthetic images differ from real photographs--unfortunately, SID methods often struggle to generalize to novel generative models and often perform poorly in practical settings. CLIP, a foundational vision-language model which yields semantically rich image-text embeddings, shows strong accuracy and generalization for SID. Yet, the underlying relevant cues embedded in CLIP-features remain unknown. It is unclear, whether CLIP-based detectors simply detect strong visual artifacts or exploit subtle semantic biases, both of which would render them useless in practical settings or on generative models of high quality. We introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias in SID. Using an interpretable linear head with de-correlated activations and a text-grounded concept-model, we analyze what CLIP-based detectors learn. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark but only 0.92 on our high-quality diffusion dataset SynthCLIC, and generalization across generator families drops to as low as 0.37 mAP. We find that the detectors primarily rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering), rather than overt generator-specific artifacts. CLIP-based detectors perform well overall but generalize unevenly across diverse generative architectures. This highlights the need for continual model updates and broader training exposure, while reinforcing CLIP-based approaches as a strong foundation for more universal, robust SID.
中文标题/摘要
标题:使用CLIP进行合成图像检测:理解与评估预测线索
近期生成模型生成的图像几乎与真实照片无异,挑战了照片的可信度。合成图像检测(SID)因此成为一个重要研究领域。先前的工作强调了合成图像与真实照片之间的差异——不幸的是,SID方法往往难以泛化到新型生成模型,并且在实际应用中表现不佳。CLIP,一个基础的视觉-语言模型,能够生成语义丰富的图像-文本嵌入,显示出强大的准确性和泛化能力。然而,嵌入在CLIP特征中的相关线索仍然未知。不清楚基于CLIP的检测器是仅仅检测强烈的视觉伪影,还是利用微妙的语义偏差,这两种情况都会使它们在实际应用中或在高质量的生成模型上变得无用。我们引入了SynthCLIC,这是一个包含真实照片和来自最近扩散模型的高质量合成对应物的配对数据集,旨在减少SID中的语义偏差。通过一个可解释的线性头和去相关激活以及文本导向的概念模型,我们分析了基于CLIP的检测器学习的内容。基于CLIP的线性检测器在基于GAN的基准测试中达到0.96 mAP,但在我们的高质量扩散数据集SynthCLIC上仅达到0.92,不同生成器家族之间的泛化能力下降到最低0.37 mAP。我们发现,检测器主要依赖于高级摄影属性(例如,极简风格、镜头光晕或深度层叠),而不是明显的生成器特定伪影。基于CLIP的检测器总体表现良好,但在多种生成架构中泛化不均。这突显了持续模型更新和更广泛训练暴露的必要性,同时强化了基于CLIP的方法作为更通用、稳健的SID的强大基础。
Summary / 总结
This study investigates which cues CLIP-based detectors rely on for synthetic image detection (SID). The researchers introduce SynthCLIC, a paired dataset of real photographs and high-quality synthetic counterparts from recent diffusion models, designed to reduce semantic bias. CLIP-based linear detectors reach 0.96 mAP on a GAN-based benchmark and 0.92 mAP on SynthCLIC, but generalization across generator families drops to as low as 0.37 mAP. The detectors mainly rely on high-level photographic attributes (e.g., minimalist style, lens flare, or depth layering) rather than overt generator-specific artifacts, which points to the need for continual model updates and broader training exposure.
研究旨在理解并评估基于CLIP的合成图像检测器所依赖的预测线索。研究引入了SynthCLIC数据集,包含真实照片及其高质量的合成对应图像,以减少语义偏差。主要发现表明,基于CLIP的线性检测器在GAN基准上达到0.96 mAP,在SynthCLIC上达到0.92 mAP,但跨生成器家族的泛化最低降至0.37 mAP;检测器更多依赖高级摄影属性而非生成器特定的伪影,这表明它们在不同生成架构之间的泛化并不均衡。
MaskInversion: Localized Embeddings via Optimization of Explainability Maps
Authors: Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne
First: 2024-07-29T14:21:07+00:00 · Latest: 2026-02-12T19:14:40+00:00
Comments: Project page: https://walidbousselham.com/MaskInversion
Abstract
Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, so MaskInversion can be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation. We evaluate the proposed method on all those tasks on several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.
中文标题/摘要
标题:MaskInversion:通过优化解释性图的局部嵌入
视觉-语言基础模型如CLIP在全局视觉-语言对齐方面取得了巨大的成果,但在为特定图像区域创建表示方面仍有一些局限性。为了解决这一问题,我们提出了一种名为MaskInversion的方法,该方法利用预训练基础模型(如CLIP)的特征表示,在测试时根据掩码生成一个上下文感知的查询图像区域嵌入。MaskInversion从初始化一个嵌入令牌开始,并将其解释性图与查询掩码进行比较。然后,通过最小化其解释性图与查询掩码之间的差异来逐步优化嵌入令牌,以逼近查询区域。在此过程中,仅更新嵌入向量,而基础模型保持冻结状态,允许使用任何预训练模型。由于提取解释性图需要计算其梯度,这可能很昂贵,我们提出了一种梯度分解策略来简化这一计算。学习到的区域表示可以用于各种任务,包括开放词汇类检索、指示表达理解,以及局部描述生成和图像生成。我们在PascalVOC、MSCOCO、RefCOCO和OpenImagesV7等数据集上对所提出的方法进行了评估,并展示了其与现有最佳方法相比的能力。
Summary / 总结
MaskInversion generates context-aware embeddings for specific image regions using frozen pre-trained vision-language models such as CLIP. It initializes an embedding token and iteratively refines it by minimizing the discrepancy between the token's explainability map and the query mask. A gradient decomposition strategy reduces the cost of computing the explainability map's gradient. The method is evaluated on open-vocabulary class retrieval, referring expression comprehension, localized captioning, and image generation, and is compared with other state-of-the-art approaches on datasets including PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7.
MaskInversion 是一种方法,利用预训练的视觉-语言模型如 CLIP 生成特定图像区域的上下文感知嵌入,基于提供的掩码。它从初始化一个嵌入令牌开始,并通过最小化其解释性图与掩码之间的差异来逐步优化它以匹配查询区域。该方法在包括类别检索、表达理解、和图像生成等任务上进行了评估,展示了其与现有最佳方法相比的有效性。
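The refinement loop described above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the authors' implementation: the explainability map is replaced by a simple per-patch similarity between the embedding and frozen patch features (the paper derives the map from the foundation model), the discrepancy is a binary cross-entropy, and all names (`patch_features`, `query_mask`, `explainability_map_logits`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a 7x7 grid of frozen patch features and a binary query mask.
num_patches, dim = 49, 32
patch_features = rng.standard_normal((num_patches, dim))  # frozen, never updated
query_mask = np.zeros(num_patches)
query_mask[:12] = 1.0  # patches that belong to the queried region


def explainability_map_logits(embedding, features):
    """Stand-in map (assumption): similarity of the embedding to each frozen patch."""
    return features @ embedding


def mask_discrepancy(logits, mask):
    """Binary cross-entropy between sigmoid(logits) and the mask, in a stable form."""
    return float(np.mean(np.logaddexp(0.0, logits) - mask * logits))


embedding = np.zeros(dim)  # initial embedding token
step_size = 0.5
losses = []
for _ in range(200):
    logits = explainability_map_logits(embedding, patch_features)
    losses.append(mask_discrepancy(logits, query_mask))
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Gradient of the loss with respect to the embedding only.
    grad = patch_features.T @ (probs - query_mask) / num_patches
    embedding -= step_size * grad

final_map = 1.0 / (1.0 + np.exp(-(patch_features @ embedding)))
```

Only `embedding` changes inside the loop, which mirrors the frozen-backbone setup in the abstract; swapping in a real explainability method would change `explainability_map_logits` and the gradient computation, not the overall structure.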
Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-12T18:59:59+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling law for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce "boot-time compute" and a hierarchical verification inference pipeline for VLAs. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses a verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍然可能与给定的指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了基于指令的执行的测试时扩展定律,证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性,通常比独立扩展每个维度更有效地恢复正确的动作。为了利用这些扩展定律,我们提出了CoVer,一种视觉-语言-行动对齐的对比验证器,并展示了我们的架构随着额外计算资源和数据的增加而平滑扩展。然后,我们介绍了“启动时计算”和一个分层验证推理流水线,用于VLA。在部署时,我们的框架从视觉-语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进,在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer实现了14%的任务进展和9%的成功率改进。
Summary / 总结
This paper explores test-time verification as a way to improve alignment between actions and natural language instructions in Vision-Language-Action models. It demonstrates that jointly scaling the number of rephrased instructions and generated actions increases test-time sample diversity, often recovering correct actions more efficiently than scaling either dimension alone. The proposed contrastive verifier, CoVer, scales gracefully with additional compute and data; at deployment the framework precomputes diverse rephrased instructions, generates action candidates for each, and uses the verifier to select the high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, the verification approach shows 22% in-distribution and 13% out-of-distribution gains on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer achieves 14% gains in task progress and 9% in success rate.
本文探讨了测试时验证作为提高视觉-语言-动作模型中动作与自然语言指令之间对齐的方法。作者证明了同时扩展重述指令的数量和生成动作的数量可以增加测试时样本多样性,从而更有效地恢复正确的动作。他们引入了CoVer对比验证器,该验证器在额外资源下可以平滑扩展。所提出的框架,包括启动时计算和分层验证推理管道,在SIMPLER基准上的分布内和分布外设置中实现了显著的改进,进一步在实际实验中提高了性能。在PolaRiS基准上,CoVer展示了14%的任务进度提升和9%的成功率提升。
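The selection step over rephrased instructions and action candidates can be illustrated with a toy grid. This is a sketch under stated assumptions rather than CoVer itself: embeddings are random placeholders for learned encoders, the verifier score is a plain cosine similarity against a hypothetical task-context embedding, and every name below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16


def unit(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


# Hypothetical embeddings; in practice these would come from learned encoders.
num_instructions, num_actions = 4, 8
context_emb = unit(rng.standard_normal(dim))  # observation + original instruction
# Action candidates generated under each rephrased instruction.
action_emb = unit(rng.standard_normal((num_instructions, num_actions, dim)))

# Toy verifier (assumption): cosine similarity between context and each candidate.
scores = action_emb @ context_emb  # shape: (instructions, actions)

# Hierarchical selection: best action per instruction, then best instruction.
best_action_per_instruction = scores.argmax(axis=1)
best_score_per_instruction = scores.max(axis=1)
best_instruction = int(best_score_per_instruction.argmax())
best_action = int(best_action_per_instruction[best_instruction])
```

Growing `num_instructions` and `num_actions` together enlarges the grid the verifier searches, which is the joint scaling the abstract refers to.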
ForeAct: Steering Your VLA with Efficient Visual Foresight Planning
Authors: Zhuoyang Zhang, Shang Yang, Qinghao Hu, Luke J. Huang, James Hou, Yufei Sun, Yao Lu, Song Han
First: 2026-02-12T18:56:27+00:00 · Latest: 2026-02-12T18:56:27+00:00
Abstract
Vision-Language-Action (VLA) models convert high-level language instructions into concrete, executable actions, a task that is especially challenging in open-world environments. We present Visual Foresight Planning (ForeAct), a general and efficient planner that guides a VLA step-by-step using imagined future observations and subtask descriptions. With an imagined future observation, the VLA can focus on visuo-motor inference rather than high-level semantic reasoning, leading to improved accuracy and generalization. Our planner comprises a highly efficient foresight image generation module that predicts a high-quality 640×480 future observation from the current visual input and language instruction within only 0.33s on an H100 GPU, together with a vision-language model that reasons over the task and produces subtask descriptions for both the generator and the VLA. Importantly, state-of-the-art VLAs can integrate our planner seamlessly by simply augmenting their visual inputs, without any architectural modification. The foresight generator is pretrained on over 1 million multi-task, cross-embodiment episodes, enabling it to learn robust embodied dynamics. We evaluate our framework on a benchmark that consists of 11 diverse, multi-step real-world tasks. It achieves an average success rate of 87.4%, demonstrating a +40.9% absolute improvement over the π_0 baseline (46.5%) and a +30.3% absolute improvement over π_0 augmented with textual subtask guidance (57.1%).
中文标题/摘要
标题:ForeAct:使用高效视觉前瞻规划引导您的VLA
视觉-语言-行动(VLA)模型将高层语言指令转换为具体的可执行动作,在开放世界环境中是一项特别具有挑战性的任务。我们提出了视觉前瞻规划(ForeAct),这是一种通用且高效的规划器,它使用想象中的未来观察和子任务描述逐步引导VLA。借助想象中的未来观察,VLA 可以专注于视觉-运动推理而非高层语义推理,从而提高准确性和泛化能力。我们的规划器包括一个高效的前瞻图像生成模块,该模块可以在仅0.33秒内(使用H100 GPU)从当前视觉输入和语言指令中生成高质量的640×480未来观察图像,并与视觉-语言模型一起工作,该模型可以推理任务并为生成器和VLA生成子任务描述。重要的是,最先进的VLAs可以通过简单地增强其视觉输入无缝地集成我们的规划器,无需任何架构修改。前瞻生成器在超过100万个多任务、跨体态的场景中进行预训练,使其能够学习稳健的体态动力学。我们在一个由11个多样化的多步骤真实世界任务组成的基准上评估了我们的框架。它实现了87.4%的平均成功率,比π0基线(46.5%)绝对提高了40.9%,比π0结合文本子任务指导(57.1%)绝对提高了30.3%。
Summary / 总结
ForeAct is a Visual Foresight Planning system that guides Vision-Language-Action models step by step using imagined future observations and subtask descriptions. It combines a foresight image generator, which predicts a 640×480 future observation in 0.33 seconds on an H100 GPU, with a vision-language model that reasons over the task and produces subtask descriptions. On a benchmark of 11 diverse, multi-step real-world tasks, ForeAct achieves an average success rate of 87.4%, compared with 46.5% for the π_0 baseline and 57.1% for π_0 augmented with textual subtask guidance.
ForeAct 是一种视觉前瞻规划方法,通过使用想象中的未来观察来引导动作,提高开放世界环境中的准确性和泛化能力。该规划器包括一个快速的前瞻图像生成模块,可以在0.33秒内预测未来的观察结果,并且可以无缝集成到现有的VLAs中。ForeAct 在包含11个多样且多步骤的真实世界任务的基准测试中达到了87.4%的成功率,显示出对先前方法和文本子任务指导的显著改进。
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-12T17:32:59+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型
大型语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,将权重矩阵的每一列表示为共享低维子空间中的表示。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替代低秩分解,其中每个权重矩阵表示为一个稠密字典乘以一列稀疏系数矩阵。这产生了一种子空间并集模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重构误差,而不是权重空间误差。基于激活的格正交化重新表述了这个数据感知目标,将其转化为标准的字典学习问题,应用于转换后的权重上,我们支持层内压缩和组内相似投影之间的跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40%的压缩比下,相对于基于SVD的先进基线和强大的结构化剪枝基线,始终能够改善准确度-压缩和困惑度-压缩的权衡。这种结构化稀疏性使得稀疏-密集计算成为可能,并与稀疏系数的后训练量化集成。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models (LLMs) that uses a structured sparse decomposition to represent weight matrices, improving expressiveness while maintaining parameter efficiency. It optimizes the factorization to minimize functional reconstruction error of layer outputs using a calibration set, and supports per-layer compression and cross-layer dictionary sharing. Experiments show that CoSpaDi outperforms state-of-the-art SVD-based and structured pruning baselines in terms of accuracy and perplexity at 20-40% compression ratios.
CoSpaDi 是一种无需训练的大型语言模型(LLM)压缩框架,使用校准引导的稀疏字典学习将低秩分解替换为结构化稀疏分解,以在固定参数预算下提高表达能力和准确性。实验表明,CoSpaDi 在 Llama 和 Qwen 模型家族中,在 20-40% 压缩率下优于最先进的 SVD 基准和结构化剪枝基准。
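To make the decomposition concrete, here is a minimal NumPy sketch of a dense dictionary times a column-sparse coefficient matrix, fitted against calibration activations. Several choices are assumptions rather than details from the paper: the sparse coder is plain orthogonal matching pursuit, the dictionary update is a least-squares step (method of optimal directions), and the data-aware objective is reduced to a standard one through a Cholesky factor of the activation Gram matrix, using the identity ||X A||_F = ||L^T A||_F when X^T X = L L^T. All function and variable names are hypothetical.

```python
import numpy as np


def omp_column(D, w, s):
    """Greedy orthogonal matching pursuit: approximate w using at most s atoms of D."""
    residual, support = w.copy(), []
    coef = np.zeros(0)
    for _ in range(s):
        scores = np.abs(D.T @ residual)
        if support:
            scores[support] = -np.inf  # never pick the same atom twice
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(D[:, support], w, rcond=None)
        residual = w - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x


def sparse_code(D, W, s):
    """Column-sparse coefficients: each column of W gets its own subset of atoms."""
    return np.stack([omp_column(D, W[:, j], s) for j in range(W.shape[1])], axis=1)


def dictionary_learning(W, k, s, iters=5, seed=0):
    """Alternate sparse coding with a least-squares dictionary update."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((W.shape[0], k))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(iters):
        S = sparse_code(D, W, s)
        D = W @ np.linalg.pinv(S)
        D /= np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)
    return D, sparse_code(D, W, s)


rng = np.random.default_rng(1)
d_in, d_out, k, s = 16, 24, 12, 3
W = rng.standard_normal((d_in, d_out))  # toy layer computing x @ W
X = rng.standard_normal((64, d_in))     # small calibration batch

L = np.linalg.cholesky(X.T @ X)         # Gram matrix X^T X = L L^T
W_t = L.T @ W                           # transformed weights
D_t, S = dictionary_learning(W_t, k, s)
D = np.linalg.solve(L.T, D_t)           # map the dictionary back to weight space

functional_error = np.linalg.norm(X @ (W - D @ S))
transformed_error = np.linalg.norm(W_t - D_t @ S)
```

In this toy setting the factors store `d_in * k` dense values plus `s * d_out` nonzero coefficients (192 + 72, not counting indices) instead of the 384 entries of `W`, and each column is free to use a different set of atoms, which is the union-of-subspaces idea.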
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Authors: Yujun Zhou, Jingdong Yang, Yue Huang, Kehan Guo, Zoe Emory, Bikram Ghosh, Amita Bedar, Sujay Shekar, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
Venue: Nat Mach Intell 8, 20-31 (2026)
First: 2024-10-18T05:21:05+00:00 · Latest: 2026-02-12T17:29:23+00:00
Comments: Published at Nature Machine Intelligence
Abstract
Artificial Intelligence (AI) is revolutionizing scientific research, yet its growing integration into laboratory environments presents critical safety challenges. Large language models (LLMs) and vision language models (VLMs) now assist in experiment design and procedural guidance, yet their "illusion of understanding" may lead researchers to overtrust unsafe outputs. Here we show that current models remain far from meeting the reliability needed for safe laboratory operation. We introduce LabSafety Bench, a comprehensive benchmark that evaluates models on hazard identification, risk assessment, and consequence prediction across 765 multiple-choice questions and 404 realistic lab scenarios, encompassing 3,128 open-ended tasks. Evaluations on 19 advanced LLMs and VLMs show that no model evaluated on hazard identification surpasses 70% accuracy. While proprietary models perform well on structured assessments, they do not show a clear advantage in open-ended reasoning. These results underscore the urgent need for specialized safety evaluation frameworks before deploying AI systems in real laboratory settings.
中文标题/摘要
标题:实验室安全台:评估大型语言模型在科学实验室安全问题上的基准
人工智能(AI)正在革新科学研究,然而其在实验室环境中的日益融合带来了关键的安全挑战。大型语言模型(LLMs)和视觉语言模型(VLMs)现在协助实验设计和程序指导,但它们的“理解错觉”可能导致研究人员过度信任不安全的输出。我们展示当前模型远未达到实验室安全操作所需的可靠性。我们引入了LabSafety Bench,这是一个全面的基准,评估模型在765个多项选择题和404个现实实验室场景中的危害识别、风险评估和后果预测,涵盖3,128个开放式任务。对19个先进LLM和VLM的评估显示,在危害识别方面没有模型的准确率超过70%。虽然专有模型在结构化评估中表现良好,但在开放式推理中没有明显优势。这些结果强调了在实际实验室环境中部署AI系统之前迫切需要专门的安全评估框架。
Summary / 总结
The research aims to address the safety challenges posed by the integration of AI in scientific labs. It introduces LabSafety Bench, a benchmark evaluating models on hazard identification, risk assessment, and consequence prediction. Evaluations on 19 advanced LLMs and VLMs reveal that no model achieves over 70% accuracy in hazard identification, highlighting the need for specialized safety evaluation frameworks before deploying AI systems in labs.
研究旨在解决AI在科学实验室中集成带来的安全挑战。引入了LabSafety Bench基准,评估模型在危害识别、风险评估和后果预测方面的表现。对19种先进LLM和VLM的评估显示,没有模型在危害识别上的准确率超过70%,强调在将AI系统部署到实验室之前需要专门的安全评估框架。
LatentAM: Real-Time, Large-Scale Latent Gaussian Attention Mapping via Online Dictionary Learning
Authors: Junwoon Lee, Yulun Tian
First: 2026-02-12T17:25:00+00:00 · Latest: 2026-02-12T17:25:00+00:00
Comments: 8 pages, 5 figures
Abstract
We present LatentAM, an online 3D Gaussian Splatting (3DGS) mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Instead of distilling high-dimensional Vision-Language Model (VLM) embeddings using model-specific decoders, LatentAM proposes an online dictionary learning approach that is both model-agnostic and pretraining-free, enabling plug-and-play integration with different VLMs at test time. Specifically, our approach associates each Gaussian primitive with a compact query vector that can be converted into approximate VLM embeddings using an attention mechanism with a learnable dictionary. The dictionary is initialized efficiently from streaming observations and optimized online to adapt to evolving scene semantics under trust-region regularization. To scale to long trajectories and large environments, we further propose an efficient map management strategy based on voxel hashing, where optimization is restricted to an active local map on the GPU, while the global map is stored and indexed on the CPU to maintain bounded GPU memory usage. Experiments on public benchmarks and a large-scale custom dataset demonstrate that LatentAM attains significantly better feature reconstruction fidelity compared to state-of-the-art methods, while achieving near-real-time speed (12-35 FPS) on the evaluated datasets. Our project page is at: https://junwoonlee.github.io/projects/LatentAM
中文标题/摘要
标题:LatentAM:基于在线字典学习的大规模实时隐空间高斯注意力映射
我们提出了LatentAM,一种基于3D高斯点云(3DGS)的在线映射框架,可以从流式RGB-D观测中构建可扩展的隐空间特征图,用于开放词汇量的机器人感知。不同于使用模型特定解码器从高维视觉语言模型(VLM)嵌入中提取信息,LatentAM 提出了一个模型无关且无需预训练的在线字典学习方法,使得在测试时可以轻松集成不同的VLM。具体来说,我们的方法将每个高斯基元关联到一个紧凑的查询向量,该向量可以通过可学习字典的注意力机制转换为近似的VLM嵌入。字典从流式观测中高效初始化,并通过信任区域正则化在线优化以适应不断变化的场景语义。为了扩展到长轨迹和大型环境,我们进一步提出了一种基于体素哈希的高效地图管理策略,其中优化仅限于GPU上的活动局部地图,而全局地图则存储和索引在CPU上以保持GPU内存使用量的有界。在公共基准测试和大规模自定义数据集上的实验表明,LatentAM 在特征重构保真度方面显著优于现有方法,同时在评估的数据集上实现了接近实时的速度(12-35 FPS)。我们的项目页面位于:https://junwoonlee.github.io/projects/LatentAM
Summary / 总结
LatentAM is an online 3D Gaussian Splatting mapping framework that builds scalable latent feature maps from streaming RGB-D observations for open-vocabulary robotic perception. Each Gaussian primitive carries a compact query vector that an attention mechanism over a learnable dictionary converts into approximate Vision-Language Model embeddings; the dictionary is learned online, which makes the approach model-agnostic and pretraining-free. Experiments on public benchmarks and a large-scale custom dataset show significantly better feature reconstruction fidelity than state-of-the-art methods at near-real-time speed (12-35 FPS).
LatentAM 是一种在线 3D 高斯点绘制映射框架,可以从流式 RGB-D 观测中构建可扩展的潜在特征图,用于机器人感知。它使用在线字典学习方法将每个高斯原语与一个紧凑的查询向量关联起来,实现与不同视觉语言模型的模型无关集成。该框架在线优化字典以适应不断变化的场景语义,并使用体素哈希有效地管理地图内存。实验表明,LatentAM 在特征重构保真度和接近实时速度(12-35 FPS)方面优于现有方法。
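The query-to-embedding step can be sketched as softmax attention over dictionary atoms. The abstract does not specify the layout of the dictionary, so the split into keys and values, the scaling by the square root of the query width, and all sizes and names below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_gaussians, query_dim = 5, 8   # compact per-Gaussian query vectors
num_atoms, embed_dim = 32, 512    # dictionary size and VLM embedding width

queries = rng.standard_normal((num_gaussians, query_dim))
dict_keys = rng.standard_normal((num_atoms, query_dim))    # learnable (assumed)
dict_values = rng.standard_normal((num_atoms, embed_dim))  # learnable (assumed)


def decode_embeddings(q, keys, values):
    """Softmax attention over dictionary atoms turns queries into embeddings."""
    logits = q @ keys.T / np.sqrt(q.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


embeddings, attention = decode_embeddings(queries, dict_keys, dict_values)
```

Storing an 8-dimensional query per Gaussian instead of a 512-dimensional embedding is what keeps the map compact in this sketch; switching to a different VLM would only require a dictionary with a matching `embed_dim`.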
Chatting with Images for Introspective Visual Thinking
Authors: Junfei Wu, Jian Guan, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, Tieniu Tan
First: 2026-02-11T17:42:37+00:00 · Latest: 2026-02-12T16:49:33+00:00
Abstract
Current large vision-language models (LVLMs) typically rely on text-only reasoning based on a single-pass visual encoding, which often leads to loss of fine-grained visual information. The recently proposed "thinking with images" paradigm attempts to alleviate this limitation by manipulating images via external tools or code; however, the resulting visual states are often insufficiently grounded in linguistic semantics, impairing effective cross-modal alignment, particularly when visual semantics or geometric relationships must be reasoned over across distant regions or multiple images. To address these challenges, we propose "chatting with images", a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions, enabling tighter coupling between linguistic reasoning and visual state updates. We instantiate this paradigm in ViLaVT, a novel LVLM equipped with a dynamic vision encoder explicitly designed for such interactive visual reasoning, and train it with a two-stage curriculum combining supervised fine-tuning and reinforcement learning to promote effective reasoning behaviors. Extensive experiments across eight benchmarks demonstrate that ViLaVT achieves strong and consistent improvements, with particularly pronounced gains on complex multi-image and video-based spatial reasoning tasks.
中文标题/摘要
标题:通过图像交流进行反思性视觉思考
当前的大规模视觉-语言模型(LVLMs)通常依赖基于单次视觉编码的文本推理,这往往会导致细微视觉信息的丢失。最近提出的“通过图像思考”试图通过外部工具或代码操作图像来缓解这一限制;然而,由此产生的视觉状态往往缺乏语言语义的充分支撑,影响了跨模态对齐的有效性——特别是在需要在远处的区域或多个图像之间推理视觉语义或几何关系时。为了解决这些挑战,我们提出了一种新的“通过图像交流”框架,将视觉操作重新构想为语言引导的特征调制。在表达性语言提示的指导下,模型动态地对多个图像区域进行联合重新编码,从而增强了语言推理与视觉状态更新之间的耦合。我们通过ViLaVT这一新型LVLM实例化了这一范式,ViLaVT配备了一个明确设计用于此类交互式视觉推理的动态视觉编码器,并通过结合监督微调和强化学习的两阶段课程训练,促进有效的推理行为。在八个基准测试中的广泛实验表明,ViLaVT实现了显著且一致的改进,特别是在复杂的多图像和基于视频的空间推理任务上表现尤为突出。
Summary / 总结
This paper addresses a limitation of current large vision-language models (LVLMs): text-only reasoning over a single-pass visual encoding often loses fine-grained visual information. To improve cross-modal alignment, the authors propose "chatting with images", a framework that uses language prompts to guide dynamic joint re-encoding of multiple image regions. They instantiate it in ViLaVT, an LVLM with a dynamic vision encoder trained with a two-stage curriculum of supervised fine-tuning and reinforcement learning. Experiments across eight benchmarks show strong and consistent improvements, especially on complex multi-image and video-based spatial reasoning tasks.
研究旨在通过解决单次视觉编码过程中精细视觉信息丢失的问题,增强大型视觉-语言模型(LVLM)的推理能力。提出的‘与图像对话’框架将视觉操作重新定义为语言引导的特征调制,从而实现语言推理与视觉状态更新之间的更紧密耦合。ViLaVT 是这一范式的新型 LVLM,通过结合监督微调和强化学习的两阶段课程进行训练。实验结果显示,ViLaVT 在八个基准测试中表现出强大的一致改进,特别是在复杂的多图像和视频空间推理任务中表现尤为突出。