Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
First: 2026-02-18T18:55:02+00:00 · Latest: 2026-02-18T18:55:02+00:00
Comments: Project page: https://hero-humanoid.github.io/
Abstract
Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
中文标题/摘要
标题:类人机器人开放词汇视觉移动物体末端执行器控制学习
使用类人机器人在野外对任意物体进行视觉移动物体操作需要精确的末端执行器(EE)控制和通过视觉输入(例如RGB-D图像)对场景的广泛理解。现有方法基于现实世界的模仿学习,由于难以收集大规模训练数据集,因此表现出有限的泛化能力。本文提出了一种新的范式HERO,用于类人机器人物体移动物体操作,将大型视觉模型的强大泛化能力和开放词汇理解与模拟训练中的强大控制性能相结合。我们通过设计一种准确的残差感知末端执行器跟踪策略来实现这一点。该末端执行器跟踪策略结合了经典机器人学和机器学习。它使用a) 逆运动学将残差末端执行器目标转换为参考轨迹,b) 用于准确前向运动学的已学习神经前向模型,c) 目标调整,以及d) 重新规划。这些创新共同帮助我们将末端执行器跟踪误差减少了3.2倍。我们使用这种准确的末端执行器跟踪器构建了一个模块化移动物体系统,其中我们使用开放词汇大型视觉模型实现强大的视觉泛化。我们的系统能够在从办公室到咖啡馆的各种真实环境中操作,机器人能够可靠地操作各种日常物体(例如茶杯、苹果、玩具),这些物体位于43cm至92cm高度的表面上。在模拟和现实世界中的系统模块化和端到端测试表明我们提出的设计的有效性。我们认为本文中的进展可以为训练类人机器人与日常物体交互开辟新的途径。
Summary / 总结
The research aims to enable humanoid robots to perform accurate end-effector control and generalizable visual loco-manipulation of arbitrary objects. HERO, a new paradigm, combines large vision models with simulated training to achieve strong control performance. The system uses an accurate residual-aware end-effector tracking policy that integrates classical robotics with machine learning, reducing tracking error by 3.2x. The robot successfully manipulates various objects in diverse environments, demonstrating the system's effectiveness in real-world settings.
该论文提出了HERO,一种新的范式,使类人机器人能够在多种环境中执行物体操作。它结合了大型视觉模型的泛化能力和强大的控制性能,通过一个精确的末端执行器跟踪策略,包括逆运动学、学习神经前向模型、目标调整和重新规划。系统在现实世界中成功地操作了各种物体,展示了末端执行器跟踪误差显著减少(减少3.2倍)和在不同环境中的稳健视觉泛化能力。实验结果表明,该系统在模拟和现实世界中的模块化和端到端测试中表现出色。
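The abstract names goal adjustment as one component of the tracker but does not describe how it works. As a rough illustration only, the sketch below shows a generic feedback correction: the commanded target is shifted by the observed residual until the achieved position matches the goal. The `imperfect_tracker` function, its gain and offset, and the goal coordinates are all hypothetical stand-ins, not the authors' policy or data.

```python
from typing import Callable, Tuple

Vec = Tuple[float, ...]


def imperfect_tracker(command: Vec) -> Vec:
    """Hypothetical tracker: reaches 90% of the command plus a fixed offset (metres)."""
    gain = 0.9
    offset = (0.02, -0.01, 0.03)
    return tuple(gain * c + o for c, o in zip(command, offset))


def max_error(goal: Vec, achieved: Vec) -> float:
    """Largest per-axis distance between the goal and the achieved position."""
    return max(abs(g - a) for g, a in zip(goal, achieved))


def adjust_goal(goal: Vec, tracker: Callable[[Vec], Vec],
                iterations: int = 5) -> Tuple[Vec, Vec]:
    """Shift the commanded target by the observed residual, several times."""
    command = goal
    achieved = tracker(command)
    for _ in range(iterations):
        residual = tuple(g - a for g, a in zip(goal, achieved))
        command = tuple(c + r for c, r in zip(command, residual))
        achieved = tracker(command)
    return command, achieved


goal = (0.40, 0.10, 0.80)  # assumed end-effector goal, for illustration only
naive_error = max_error(goal, imperfect_tracker(goal))
_, achieved = adjust_goal(goal, imperfect_tracker)
adjusted_error = max_error(goal, achieved)
print(f"naive: {naive_error:.4f} m, adjusted: {adjusted_error:.2e} m")
```

With this toy tracker each iteration shrinks the residual by a factor of ten, because the assumed gain is 0.9; a real system would see noisier and less regular convergence.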
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
中文标题/摘要
标题:注意引导的多路径思考:重访视觉-语言推理
视觉-语言模型(VLMs)旨在通过联合利用视觉和文本模态进行推理。虽然为大型语言模型(LLMs)分配额外的推理时间计算已被证明是有效的,但在VLMs中实现类似的扩展仍然具有挑战性。一个关键障碍是视觉输入通常只在生成的开始阶段提供一次,而文本推理(例如,早期视觉摘要)是自回归生成的,这使得推理变得越来越以文本为主导,并允许早期视觉定位错误累积。此外,推理期间的视觉定位指导通常粗糙且嘈杂,这使得难以引导长时间文本的推理。为了解决这些挑战,我们提出了注意引导原则(SAP)选择。SAP 在高层次的推理原则上操作,而不是在标记级轨迹上,这使得在嘈杂反馈下稳定控制离散生成成为可能,同时允许后续推理步骤在需要重新定位时重新咨询视觉证据。此外,SAP 支持多路径推理,允许并行探索多种推理行为。SAP 是模型无关的,不需要额外的数据,也不需要额外的训练。实验证明,SAP 在与可比的标记生成预算下实现了竞争力的表现,特别是在减少对象幻觉方面,同时提供了比CoT风格的长序列推理更稳定和更低的响应延迟。
Summary / 总结
The paper addresses the challenge of scaling vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles to enable stable control over discrete generation and support multi-route inference. Experiments show that SAP reduces object hallucination, provides more stable reasoning, and has lower response latency compared to CoT-style reasoning under similar token-generation budgets.
论文通过提出Saliency-Aware Principle (SAP)选择来解决视觉语言模型(VLMs)中的有效视觉定位问题。SAP在高层次推理原则上操作,以实现对离散生成的稳定控制,并支持多路线推理。实验表明,SAP在相似的令牌生成预算下减少了物体幻觉,提供了更稳定的推理和更低的响应延迟。
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss that further strengthens the proposed personalized prompts. To enrich VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
中文标题/摘要
标题:MC-LLaVA:多概念个性化视觉语言模型
当前的视觉语言模型(VLMs)在各种任务上表现出色,如视觉问答。为了提升用户体验,最近的研究探讨了VLM的个性化,以理解用户提供的概念。然而,这些研究主要集中在单一概念上,忽视了多个概念的存在及其相互作用,这限制了其在实际中的应用。本文提出了一种多概念个性化范式MC-LLaVA。具体而言,MC-LLaVA采用多概念指令调优策略,在单个训练步骤中有效整合多个概念。为了降低训练成本,我们提出了一种个性化的文本提示,利用视觉标记信息初始化概念标记。此外,在推理过程中,我们引入了个性化的视觉提示,聚合位置图以增强识别和语义关联能力。为了进一步提高性能上限,我们引入了一个可选的辅助损失,更好地增强了提出的个性化提示。为了丰富VLM个性化研究,我们贡献了一个高质量的数据集。我们精心收集了来自电影的多角色和多物体图像,并手动创建了多概念场景下的问题-答案样本,具有更高的多样性。全面的实验表明,MC-LLaVA实现了令人印象深刻的多概念个性化响应,为VLM成为更好的用户助手铺平了道路。代码和数据集将在https://github.com/arctanxarc/MC-LLaVA发布。
Summary / 总结
This paper introduces MC-LLaVA, a multi-concept personalized vision-language model that employs a multi-concept instruction tuning strategy and personalized textual and visual prompts to enhance recognition and grounding capabilities. The model incorporates an optional auxiliary loss to further improve performance. Experiments show that MC-LLaVA effectively handles multi-concept scenarios, providing impressive personalized responses and advancing VLMs as user assistants.
本文提出了MC-LLaVA,一种多概念个性化视觉语言模型,采用多概念指令调优策略,在单一步骤中整合多个概念。它使用个性化文本和视觉提示来增强识别和语义关联能力,并可选地加入辅助损失以进一步提升性能。实验表明,MC-LLaVA 能生成出色的多概念个性化响应,推动了 VLMs 成为更好的用户助手的发展。提供了一个高质量的数据集和代码。
A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
First: 2026-02-18T16:41:32+00:00 · Latest: 2026-02-18T16:41:32+00:00
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
中文标题/摘要
标题:一种基于注意力特征适应的对比学习框架用于街景图像分类
街景图像属性分类是图像分类的重要下游任务,能够支持自动驾驶、城市分析和高精度地图构建等应用。无论是从零开始训练、使用预训练权重初始化还是微调大型模型,都仍然具有较高的计算需求。尽管预训练的视觉-语言模型如CLIP提供了丰富的图像表示,但现有的适应或微调方法往往依赖于它们的全局图像嵌入,限制了它们捕捉复杂、杂乱街景中细微、局部化的属性的能力。为了解决这一问题,我们提出了CLIP-MHAdapter,这是一种轻量级CLIP适应范式的变体,通过在补丁标记上附加一个具有多头自注意力的瓶颈MLP来建模补丁之间的依赖关系。CLIP-MHAdapter在Global StreetScapes数据集上的八个属性分类任务中实现了优于或可比的准确性,同时保持了较低的计算成本,并取得了新的最佳结果。代码可在https://github.com/SpaceTimeLab/CLIP-MHAdapter获取。
Summary / 总结
The paper proposes CLIP-MHAdapter, a method that enhances the CLIP framework with attention-based feature adaptation to improve street-view image classification. This approach uses a bottleneck MLP with multi-head self-attention on patch tokens to capture fine-grained attributes, achieving state-of-the-art results with low computational cost across eight attribute classification tasks on the Global StreetScapes dataset.
论文旨在提高街景图像属性分类的准确性,这对于自动驾驶和城市分析等应用至关重要。它提出了一种轻量级的适应方法CLIP-MHAdapter,通过在补丁标记上使用多头自注意力来增强预训练的CLIP模型,以捕捉细粒度的属性。CLIP-MHAdapter在Global StreetScapes数据集上的八个属性分类任务中取得了最先进的结果,使用了大约140万个可训练参数,同时保持了较低的计算成本。
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-02-18T15:52:04+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
中文标题/摘要
标题:FindAnything:任意词汇和对象中心的映射框架,用于机器人在任意环境中的探索
几何上精确且语义上丰富的地图表示对于在未知环境中部署机器人和任务规划具有不可估量的价值。然而,实时地对大规模未知环境进行开放词汇的语义理解仍然存在挑战,主要原因是计算需求。本文提出了一种名为FindAnything的开放世界映射框架,该框架将视觉-语言信息整合到密集的体积子地图中。通过使用视觉-语言特征,FindAnything结合了纯几何和开放词汇的语义信息,提高了理解水平。它通过在对象级别聚合特征来高效存储开放词汇信息。基于eSAM片段的像素级视觉-语言特征被聚合,并整合到对象中心的体积子地图中,从而提供了一种从开放词汇查询到三维几何的映射,这种映射在内存使用方面也是可扩展的。我们证明FindAnything在语义准确性方面与最先进的技术相当,同时速度更快且更节省内存,使其能够在大规模环境中部署,并在资源受限的设备上运行,例如MAVs。我们展示了FindAnything的实时能力使其在下游任务中具有实用性,例如在模拟的搜索和救援场景中自主MAV探索。
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to enhance semantic understanding. It uses object-level feature aggregation and eSAM segments to efficiently store open-vocabulary information, providing scalable memory usage. FindAnything matches the state-of-the-art in semantic accuracy but is faster and more memory-efficient, suitable for large-scale environments and resource-constrained devices like MAVs. It demonstrates real-time capabilities useful for tasks like autonomous MAV exploration in simulated Search and Rescue scenarios.
FindAnything 是一种将视觉-语言信息整合到密集体素子地图中的开放世界映射框架,以实现几何和语义理解。它使用对象级特征聚合和 eSAM 区段来高效存储开放词汇信息,提供可扩展的内存使用。FindAnything 在语义准确性上与最先进的技术相当,但速度更快且更节省内存,适用于大规模环境和资源受限的设备如 MAV。它展示了实时能力,适用于模拟搜索和救援场景中的自主 MAV 探索等任务。
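The object-level aggregation described in this entry can be pictured with a small sketch: per-pixel feature vectors are averaged within each segment, and a text query is matched to the segment whose aggregated feature has the highest cosine similarity. This is a minimal illustration under stated assumptions; the two-dimensional toy features, the segment IDs, and the function names are hypothetical, and a real system would use vision-language encoder outputs and eSAM masks instead.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate_by_segment(pixel_features, segment_ids):
    """Average pixel features per segment, then normalize each mean feature."""
    sums = {}
    counts = defaultdict(int)
    for feat, seg in zip(pixel_features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
        for i, x in enumerate(feat):
            sums[seg][i] += x
        counts[seg] += 1
    return {seg: l2_normalize([x / counts[seg] for x in total])
            for seg, total in sums.items()}


def best_match(query_feature, object_features):
    """Return the segment ID whose feature is most similar to the query."""
    q = l2_normalize(query_feature)
    return max(object_features,
               key=lambda seg: sum(a * b for a, b in zip(q, object_features[seg])))


# Toy data: four pixels, two segments, 2-D features (all values are made up).
pixel_features = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]]
segment_ids = [0, 0, 1, 1]
objects = aggregate_by_segment(pixel_features, segment_ids)
match = best_match([0.0, 1.0], objects)
print(f"{len(objects)} objects stored, query matches segment {match}")
```

Storing one vector per object rather than one per pixel is what makes the memory footprint scale with the number of objects instead of the image resolution.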
DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Authors: Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang, Yingnian Wu, Yin Yang, Abishek Sampath Kumar, Kenji Tashiro, Chenfanfu Jiang
First: 2026-02-18T14:45:15+00:00 · Latest: 2026-02-18T14:45:15+00:00
Abstract
Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extracts pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
中文标题/摘要
标题:DressWild:从野生图像中生成前馈姿态无关服装缝制模板
最近在服装模板生成方面的进展显示出有希望的进步。然而,现有的前馈方法在处理多样化的姿态和视角方面存在困难,而基于优化的方法则计算成本高且难以扩展。本文关注于服装建模和制造应用中所需的可编辑、可分离和可用于模拟的服装的缝制模板生成。我们提出了一种名为DressWild的新型前馈管道,可以从单张野生图像中重建符合物理规律的2D缝制模板及其对应的3D服装。给定输入图像,我们的方法利用视觉-语言模型(VLMs)在图像级别上归一化姿态变化,然后提取姿态感知的、3D启发式的服装特征。这些特征通过基于变换器的编码器融合,随后用于预测缝制模板参数,这些参数可以直接应用于物理模拟、纹理合成和多层虚拟试穿。广泛的实验表明,我们的方法能够在不使用多视角输入或迭代优化的情况下,从野生图像中稳健地恢复出多样化的缝制模板及其对应的3D服装,提供了一种高效且可扩展的现实服装模拟和动画解决方案。
Summary / 总结
This paper addresses the challenge of generating sewing patterns and 3D garments from in-the-wild images, which existing feed-forward methods struggle with due to diverse poses and viewpoints. The authors propose DressWild, a novel feed-forward pipeline that uses vision-language models to normalize pose variations and extract 3D-informed garment features. These features are then used to predict sewing pattern parameters, enabling direct application to physical simulation and texture synthesis. Experiments show that DressWild robustly generates diverse sewing patterns and 3D garments without needing multi-view inputs or iterative optimization, providing an efficient and scalable solution for realistic garment simulation and animation.
本文旨在解决从多样化的在野图像中生成缝制图案和3D服装的问题。它提出了DressWild,一个前馈管道,通过视觉语言模型来标准化姿态变化,并提取3D相关的服装特征。这些特征随后用于预测缝制图案参数,可以直接应用于物理模拟和纹理合成。实验表明,DressWild能够在无需多视角输入或迭代优化的情况下,稳健地生成多样化的缝制图案和3D服装,提供了一个高效且可扩展的解决方案,用于现实的服装模拟和动画。
Fast and Scalable Analytical Diffusion
Authors: Xinyi Shang, Peng Sun, Jingyu Lin, Zhiqiang Shen
First: 2026-02-18T14:41:09+00:00 · Latest: 2026-02-18T14:41:09+00:00
Abstract
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the "Golden Subset" for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a 71× speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
中文标题/摘要
标题:快速且可扩展的分析性扩散
分析性扩散模型通过将去噪分数形式化为经验贝叶斯后验均值,提供了一条数学上透明的生成建模路径。然而,这种可解释性付出了高昂的成本:标准形式要求在每个时间步长进行整个数据集扫描,其规模与数据集大小成线性关系。在本文中,我们首次系统地研究了这种可扩展性瓶颈。我们挑战了整个训练数据必不可少的假设,揭示了后验渐进集中现象:去噪分数的有效黄金支持不是静态的,而是随着信噪比增加从全局流形逐渐收缩到局部邻域。利用这一发现,我们提出了动态时间感知黄金子集扩散(GoldDiff)框架,该框架在不进行训练的情况下解耦推理复杂度与数据集大小。GoldDiff 不使用静态检索,而是采用粗到细机制动态确定推理所需的“黄金子集”。理论上,我们推导出严格的界保证我们的稀疏近似收敛到精确分数。实验上,GoldDiff 在 AFHQ 上实现了 71 倍的速度提升,同时匹配甚至超过了全扫描基线的性能。最值得注意的是,我们展示了分析性扩散首次成功扩展到 ImageNet-1K,解锁了大规模生成建模的可扩展、无需训练的范式。
Summary / 总结
This work addresses the scalability issue of analytical diffusion models by proposing Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), which decouples inference complexity from dataset size. By leveraging the phenomenon of Posterior Progressive Concentration, GoldDiff dynamically identifies a 'Golden Subset' for inference, achieving a 71 times speedup on AFHQ while maintaining or improving performance compared to full-scan baselines. Notably, GoldDiff successfully scales analytical diffusion to ImageNet-1K, enabling a scalable, training-free approach for large-scale generative modeling.
该研究通过提出动态时间感知黄金子集扩散(GoldDiff)方法,动态选择用于推理的数据子集,解决了分析性扩散模型的可扩展性问题。该方法利用了后验渐进集中现象,即随着信噪比的增加,去噪分数的有效支持会逐渐缩小。实验结果显示,GoldDiff 在 AFHQ 上实现了 71 倍的加速,同时保持或超越了全集基线的性能。特别地,它首次将分析性扩散扩展到 ImageNet-1K,开启了大规模生成建模的可扩展和无训练框架。
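To make the empirical-Bayes posterior mean concrete, the sketch below assumes the usual forward process x_t = alpha * x_0 + sigma * noise with a uniform prior over the training points, so that E[x_0 | x_t] is a softmax-weighted average of the data with weights proportional to exp(-||x_t - alpha * x_i||^2 / (2 * sigma^2)). It then compares the full scan against a k-nearest subset. The nearest-neighbour selection is a simple stand-in chosen for illustration, not the paper's coarse-to-fine mechanism, and the toy dataset and noise levels are made up.

```python
import math


def sq_dist(x_t, x, alpha):
    """Squared distance between the noisy sample and a scaled data point."""
    return sum((a - alpha * b) ** 2 for a, b in zip(x_t, x))


def posterior_mean(x_t, data, alpha, sigma):
    """Empirical-Bayes estimate of x_0: softmax-weighted average of the data."""
    logits = [-sq_dist(x_t, x, alpha) / (2 * sigma ** 2) for x in data]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [sum(w * x[d] for w, x in zip(weights, data)) / total
            for d in range(len(x_t))]


def nearest_subset(x_t, data, alpha, k):
    """Assumed stand-in for subset selection: the k closest data points."""
    return sorted(data, key=lambda x: sq_dist(x_t, x, alpha))[:k]


def subset_gap(x_t, data, alpha, sigma, k):
    """Largest per-dimension gap between full-scan and subset estimates."""
    full = posterior_mean(x_t, data, alpha, sigma)
    sparse = posterior_mean(x_t, nearest_subset(x_t, data, alpha, k), alpha, sigma)
    return max(abs(f - s) for f, s in zip(full, sparse))


# Toy dataset with two clusters (values are illustrative only).
data = [(-2.0, -2.0), (-2.1, -1.9), (-1.9, -2.1),
        (2.0, 2.0), (2.1, 1.9), (1.9, 2.1)]

high_snr_gap = subset_gap([1.8, 2.0], data, alpha=1.0, sigma=0.2, k=3)
low_snr_gap = subset_gap([0.3, 0.4], data, alpha=0.2, sigma=1.0, k=3)
print(f"high-SNR gap: {high_snr_gap:.2e}, low-SNR gap: {low_snr_gap:.2f}")
```

In this toy setting the three-point subset reproduces the full-scan estimate almost exactly at high signal-to-noise ratio, while at low signal-to-noise ratio the distant cluster still carries weight and the same subset is a poor approximation, which matches the concentration behaviour the abstract describes.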
SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis
Authors: Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin
Venue: IEEE Robotics and Automation Letters, 2026, pp. 1-8
First: 2025-03-13T11:23:13+00:00 · Latest: 2026-02-18T14:35:21+00:00
Abstract
Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning, but struggle with hallucinations, domain gaps and weak task-interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain-of-Thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or ground output clinically. Specifically, we propose a panel discussion mechanism to ensure task-specific agents collaborate synergistically and leverage task interdependencies. Similarly, we incorporate a retrieval-augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. Dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git.
中文标题/摘要
标题:SurgRAW:多智能体工作流与链式思考推理在机器人手术视频分析中的应用
机器人辅助手术(RAS)是现代手术的核心,推动了智能系统准确场景理解的需求。目前大多数现有的手术AI方法依赖于孤立的任务特定模型,导致了碎片化的管道,缺乏统一的理解和解释性。视觉-语言模型(VLMs)提供了强大的零样本推理能力,但在幻觉、领域差距和任务间依赖性建模方面存在困难。为了解决RAS场景理解缺乏统一数据的问题,我们引入了SurgCoTBench,这是第一个专注于RAS的推理基准,涵盖了14256个问答对,包含五个主要手术任务的帧级注释。基于SurgCoTBench,我们提出了SurgRAW,一种临床对齐的链式思考(CoT)驱动的多任务推理智能体工作流。SurgRAW采用分层推理工作流,其中协调者将手术场景理解分为两个推理流,并指导专门的智能体生成任务级推理,而高级智能体则捕捉工作流间的相互依赖性或临床验证输出。具体而言,我们提出了一种讨论机制,以确保任务特定的智能体能够协同工作并利用任务间的相互依赖性。同样,我们引入了检索增强生成模块,以丰富智能体的手术知识并缓解通用VLMs的领域差距。我们设计了基于手术领域的特定任务CoT提示,以确保临床对齐的推理、减少幻觉并增强可解释性。广泛的实验表明,SurgRAW超越了主流的VLMs和智能体系统,并在准确率上比监督模型高出14.61%。数据集和代码可在https://github.com/jinlab-imvr/SurgRAW.git 获取。
Summary / 总结
SurgRAW is a multi-agent workflow system for robotic surgical video analysis that uses a hierarchical reasoning approach with a chain-of-thought mechanism. It addresses the limitations of isolated task-specific models and is built on SurgCoTBench, a reasoning-focused benchmark for robotic-assisted surgery covering five major surgical tasks. In zero-shot multi-task reasoning, the system surpasses mainstream vision-language models and agentic systems and outperforms a supervised model by 14.61% accuracy.
研究旨在通过结合链式推理来提高对手术辅助机器人场景的理解。提出了SurgRAW多智能体系统,以零样本方式处理多个手术任务。该系统采用分层推理方法,其中协调者指导专门的智能体生成任务级推理,而高级智能体管理任务间的依赖关系。关键发现表明,SurgRAW在包含14,256个问答对的新基准数据集SurgCoTBench上的表现优于主流视觉语言模型和监督模型,准确率高出14.61%。
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
First: 2026-02-18T13:40:53+00:00 · Latest: 2026-02-18T13:40:53+00:00
Abstract
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
中文标题/摘要
标题:视觉自我精炼:一种像素导向的图表解析准确度提升范式
虽然大型多模态模型(LVLMs)在文本推理和自我修正方面表现出色,但这些优势对以视觉感知为中心的复杂任务,如图表解析,提供的帮助甚微。现有模型在处理视觉密集型图表时常常遇到困难,导致数据遗漏、对齐错误和幻觉等问题。受人类在阅读复杂图表时使用手指作为“视觉锚点”以确保准确性的启发,我们提出了一种新的范式——视觉自我精炼(VSR)。VSR的核心思想是使模型能够生成像素级定位输出,可视化这些输出,并将这些可视化结果反馈给模型本身,使其能够直观地检查和修正潜在的视觉感知错误。我们通过提出ChartVSR在图表解析领域实例化了VSR范式。该模型将解析过程分解为两个阶段:在精炼阶段,它通过视觉反馈迭代确保所有数据点的像素级定位的准确性;在解码阶段,它使用这些验证过的定位作为精确的视觉锚点来解析最终的结构化数据。为了应对现有基准的局限性,我们还构建了ChartP-Bench,这是一个新的、极具挑战性的图表解析基准。我们的工作还强调了VSR作为一种通用的视觉反馈机制,为提高各种视觉为中心任务的准确性提供了有希望的新方向。
Summary / 总结
The research aims to improve the accuracy of chart parsing by addressing the limitations of existing models in handling visually dense charts. The proposed Visual Self-Refine (VSR) paradigm enables a model to generate pixel-level localization outputs, visualize them, and use these visualizations to correct potential errors. ChartVSR, an instantiation of VSR in the domain of chart parsing, decomposes the process into a Refine Stage and a Decode Stage. The Refine Stage ensures the accuracy of all data points' pixel-level localizations using visual feedback, while the Decode Stage uses these verified localizations to parse structured data. The work also introduces ChartP-Bench, a new benchmark for chart parsing, to better evaluate the performance of chart parsing models.
研究旨在通过解决现有模型在处理视觉密集型图表时的局限性,提高图表解析的准确性。提出的视觉自我完善(VSR)范式使模型能够生成像素级定位输出,可视化这些输出,并利用这些可视化结果纠正潜在错误。ChartVSR 是 VSR 在图表解析中的实现,将过程分解为精炼阶段和解码阶段,确保精确的数据点定位和准确的结构化数据解析。研究引入了 ChartP-Bench,这是一个新的图表解析基准,用于评估所提出方法的有效性,显示出显著的准确性和可靠性提升。
Designing Production-Scale OCR for India: Multilingual and Domain-Specific Systems
Authors: Ali Faraz, Raja Kolla, Ashish Kulkarni, Shubham Agarwal
First: 2026-02-18T13:03:05+00:00 · Latest: 2026-02-18T13:03:05+00:00
Abstract
Designing Optical Character Recognition (OCR) systems for India requires balancing linguistic diversity, document heterogeneity, and deployment constraints. In this paper, we study two training strategies for building multilingual OCR systems with Vision-Language Models through the Chitrapathak series. We first follow a popular multimodal approach, pairing a generic vision encoder with a strong multilingual language model and training the system end-to-end for OCR. Alternatively, we explore fine-tuning an existing OCR model, even though it was not trained for the target languages. Through extensive evaluation on multilingual Indic OCR benchmarks and deployment-oriented metrics, we find that the second strategy consistently achieves better accuracy-latency trade-offs. Chitrapathak-2 achieves a 3-6x speedup over its predecessor while being state-of-the-art (SOTA) in Telugu (6.69 char ANLS) and second best in the rest. In addition, we present Parichay, an independent OCR model series designed specifically for 9 Indian government documents to extract structured key fields, achieving an 89.8% Exact Match score with faster inference. Together, these systems achieve SOTA performance and provide practical guidance for building production-scale OCR pipelines in the Indian context.
中文标题/摘要
标题:面向印度的生产规模OCR设计:多语言和领域特定系统
为印度设计光学字符识别(OCR)系统需要平衡语言多样性、文档异质性和部署约束。在本文中,我们研究了通过Chitrapathak系列构建多语言OCR系统的两种训练策略,利用视觉语言模型。首先,我们采用一种流行的多模态方法,将通用视觉编码器与强大的多语言语言模型配对,并端到端训练系统进行OCR。作为替代方案,我们探索了微调现有OCR模型的可能性,尽管这些模型未针对目标语言进行训练。通过在多语言印度OCR基准测试和部署导向的指标上进行广泛评估,我们发现第二种策略在准确性和延迟之间的一致性更好。Chitrapathak-2在泰卢固语上达到最先进的性能(6.69字符ANLS),在其他方面排名第二,并实现了3-6倍的速度提升。此外,我们还介绍了Parichay,这是一种专门针对9种印度政府文件的独立OCR模型系列,用于提取结构化关键字段,实现了89.8%的精确匹配分数,并且推理速度更快。这些系统共同实现了最先进的性能,并为在印度背景下构建生产规模的OCR流水线提供了实用指导。
Summary / 总结
This paper addresses the challenge of designing OCR systems for India by balancing linguistic diversity and document heterogeneity. Two training strategies were explored: a multimodal approach using a generic vision encoder and a strong multilingual language model, and fine-tuning an existing OCR model. The latter strategy was found to offer better accuracy-latency trade-offs, with Chitrapathak-2 achieving state-of-the-art performance in Telugu and competitive results in other languages. Additionally, Parichay, an OCR model series tailored for Indian government documents, achieved an 89.8% Exact Match score with faster inference. These systems provide practical insights for building production-scale OCR pipelines in India.
本文探讨了平衡印度语言多样性和文档异质性的OCR系统设计挑战。研究了两种训练策略:使用通用视觉编码器和强大的多语言语言模型的多模态方法,以及针对目标语言微调现有OCR模型。尽管未针对目标语言进行训练,但第二种策略在准确性和延迟之间取得了更好的权衡。Chitrapathak-2使用微调方法,相比前代产品提供了3-6倍的速度提升,并在泰卢固语中达到了最先进的字符ANLS 6.69。此外,Parichay专门设计用于印度政府文件的OCR模型系列,实现了89.8%的精确匹配分数,并具有更快的推理速度。
Vision and Language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning
Authors: Ross Greer, Maitrayee Keskar, Angel Martinez-Sanchez, Parthib Roy, Shashank Shriram, Mohan Trivedi
First: 2026-02-07T20:04:21+00:00 · Latest: 2026-02-18T11:33:50+00:00
Abstract
Vision-language models (VLMs) have recently emerged as powerful representation learning systems that align visual observations with natural language concepts, offering new opportunities for semantic reasoning in safety-critical autonomous driving. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines. We study three complementary system-level use cases. First, we introduce a lightweight, category-agnostic hazard screening approach leveraging CLIP-based image-text similarity to produce a low-latency semantic hazard signal. This enables robust detection of diverse and out-of-distribution road hazards without explicit object detection or visual question answering. Second, we examine the integration of scene-level vision-language embeddings into a transformer-based trajectory planning framework using the Waymo Open Dataset. Our results show that naively conditioning planners on global embeddings does not improve trajectory accuracy, highlighting the importance of representation-task alignment and motivating the development of task-informed extraction methods for safety-critical planning. Third, we investigate natural language as an explicit behavioral constraint on motion planning using the doScenes dataset. In this setting, passenger-style instructions grounded in visual scene elements suppress rare but severe planning failures and improve safety-aligned behavior in ambiguous scenarios. Taken together, these findings demonstrate that vision-language representations hold significant promise for autonomous driving safety when used to express semantic risk, intent, and behavioral constraints. Realizing this potential is fundamentally an engineering problem requiring careful system design and structured grounding rather than direct feature injection.
中文标题/摘要
标题:视觉与语言:新型表示与人工智能在驾驶场景安全评估和自动驾驶车辆规划中的应用
视觉-语言模型(VLMs)最近作为强大的表示学习系统出现,能够将视觉观察与自然语言概念对齐,为安全关键的自动驾驶中的语义推理提供了新的机会。本文探讨了将视觉-语言表示集成到感知、预测和规划管道中时,如何支持驾驶场景安全评估和决策制定。我们研究了三个互补的系统级用例。首先,我们介绍了一种基于CLIP的图像-文本相似性实现的轻量级、无类别危险筛查方法,以生成低延迟的语义危险信号。这使得在无需显式对象检测或视觉问答的情况下,能够稳健地检测多样且分布外的道路危险。其次,我们研究了将场景级视觉-语言嵌入集成到基于Transformer的轨迹规划框架中,使用Waymo开放数据集。我们的结果表明,直接将规划者条件化于全局嵌入并不能提高轨迹精度,突显了表示任务对齐的重要性,并激励开发针对安全关键规划的任务导向提取方法。第三,我们研究了自然语言作为运动规划显式行为约束的应用,使用doScenes数据集。在这种情况下,基于视觉场景元素的乘客风格指令抑制了罕见但严重的规划失败,并在模棱两可的场景中提高了安全导向的行为。综上所述,这些发现表明,当用于表达语义风险、意图和行为约束时,视觉-语言表示在自动驾驶安全中具有巨大的潜力。实现这一潜力本质上是一个工程问题,需要精心设计系统和结构化接地,而不是直接特征注入。
Summary / 总结
This paper explores how vision-language models (VLMs) can enhance driving scene safety assessment and autonomous vehicle planning. It introduces a lightweight hazard screening approach using CLIP-based image-text similarity for low-latency semantic hazard detection, and examines the integration of scene-level VLM embeddings into a transformer-based trajectory planning framework. The study finds that directly conditioning planners on global embeddings does not improve trajectory accuracy, emphasizing the need for task-informed extraction methods. Additionally, it investigates the use of natural language as a behavioral constraint in motion planning, showing that passenger-style instructions improve safety in ambiguous scenarios. Overall, the research highlights the potential of VLMs for autonomous driving safety through semantic risk and behavioral constraint expression.
本文探讨了视觉-语言模型(VLMs)如何增强驾驶场景安全评估和自动驾驶车辆规划。研究引入了一种基于CLIP的图像-文本相似性轻量级危害筛查方法,用于低延迟的语义危害检测,并考察了场景级VLM嵌入在基于Waymo开放数据集的变换器轨迹规划框架中的集成。研究发现,直接将规划者条件化在全局嵌入上并不能提高轨迹精度,强调了需要任务导向的提取方法。此外,研究还探讨了自然语言作为运动规划行为约束的应用,结果显示乘客风格的指令在模糊场景中提高了安全性。总体而言,研究强调了通过语义风险和行为约束表达VLMs在自动驾驶安全中的潜力,这需要精心的设计和结构化的接地,而不是直接的功能注入。
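The hazard screening idea in the entry above (image-text similarity as a semantic hazard signal) can be illustrated with a minimal sketch. It assumes the image and prompt embeddings were already produced by a CLIP-style encoder, which is not shown here. The softmax over hazard and benign prompts, the `temperature` value, and all function names are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hazard_score(image_emb, hazard_embs, benign_embs, temperature=0.07):
    """Share of softmax mass on hazard prompts (assumed scoring rule).

    image_emb:   embedding of the camera frame (assumed precomputed)
    hazard_embs: embeddings of hazard-describing text prompts
    benign_embs: embeddings of benign-scene text prompts
    """
    logits_h = [cosine(image_emb, e) / temperature for e in hazard_embs]
    logits_b = [cosine(image_emb, e) / temperature for e in benign_embs]
    m = max(logits_h + logits_b)  # subtract the max for numerical stability
    mass_h = sum(math.exp(v - m) for v in logits_h)
    mass_b = sum(math.exp(v - m) for v in logits_b)
    return mass_h / (mass_h + mass_b)
```

A score near 1 means the frame sits closer to the hazard prompts than to the benign ones. Any threshold applied to this score would need to be tuned on held-out data.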
GEPC: Group-Equivariant Posterior Consistency for Out-of-Distribution Detection in Diffusion Models
Authors: Yadang Alexis Rouzoumka, Jean Pinsolle, Eugénie Terreaux, Christèle Morisseau, Jean-Philippe Ovarlez, Chengfang Ren
First: 2026-01-30T08:58:13+00:00 · Latest: 2026-02-18T10:07:13+00:00
Comments: preprint
Abstract
Diffusion models learn a time-indexed score field $\mathbf{s}_θ(\mathbf{x}_t,t)$ that often inherits approximate equivariances (flips, rotations, circular shifts) from in-distribution (ID) data and convolutional backbones. Most diffusion-based out-of-distribution (OOD) detectors exploit score magnitude or local geometry (energies, curvature, covariance spectra) and largely ignore equivariances. We introduce Group-Equivariant Posterior Consistency (GEPC), a training-free probe that measures how consistently the learned score transforms under a finite group $\mathcal{G}$, detecting equivariance breaking even when score magnitude remains unchanged. At the population level, we propose the ideal GEPC residual, which averages an equivariance-residual functional over $\mathcal{G}$, and we derive ID upper bounds and OOD lower bounds under mild assumptions. GEPC requires only score evaluations and produces interpretable equivariance-breaking maps. On OOD image benchmark datasets, we show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while remaining computationally lightweight. On high-resolution synthetic aperture radar imagery where OOD corresponds to targets or anomalies in clutter, GEPC yields strong target-background separation and visually interpretable equivariance-breaking maps. Code is available at https://github.com/RouzAY/gepc-diffusion/.
中文标题/摘要
标题:GEPC:组等变后验一致性在扩散模型中检测分布外数据
扩散模型学习一个时间索引化的得分场 $\mathbf{s}_θ(\mathbf{x}_t,t)$,该场通常从分布内(ID)数据和卷积骨干中继承了近似的等变性(翻转、旋转、循环移位)。大多数基于扩散的分布外(OOD)检测器利用得分的大小或局部几何(能量、曲率、协方差谱),而很大程度上忽略了等变性。我们引入了组等变后验一致性(GEPC),这是一种无需训练的探针,用于测量在有限群 $\mathcal{G}$ 下学习到的得分如何一致地变换,即使得分大小保持不变也能检测等变性的破坏。在总体水平上,我们提出了理想的GEPC残差,它在 $\mathcal{G}$ 上平均了一个等变性残差函数,并在温和假设下推导出了ID的上界和OOD的下界。GEPC只需要得分评估,并生成可解释的等变性破坏图。在OOD图像基准数据集上,我们展示了GEPC在与最近的基于扩散的基线相比在AUROC上具有竞争力或改进,同时保持计算量轻。在高分辨率合成孔径雷达图像中,OOD对应于杂波中的目标或异常,GEPC实现了强大的目标-背景分离,并生成了可视觉解释的等变性破坏图。代码可在 https://github.com/RouzAY/gepc-diffusion/ 获取。
Summary / 总结
The paper introduces Group-Equivariant Posterior Consistency (GEPC), a training-free method for detecting out-of-distribution (OOD) samples in diffusion models. GEPC measures how consistently the learned score transforms under a finite group, detecting equivariance breaking even when score magnitude remains unchanged. Experiments on OOD image datasets show that GEPC achieves competitive or improved AUROC compared to recent diffusion-based baselines while being computationally lightweight. On high-resolution synthetic aperture radar imagery, GEPC provides strong target-background separation and interpretable equivariance-breaking maps.
论文提出了Group-Equivariant Posterior Consistency (GEPC),这是一种无需训练的方法,用于在扩散模型中检测分布外(OOD)样本。GEPC 测量了学习的分数在有限群下的变换一致性,即使分数幅度不变也能识别等变性破坏。在OOD 图像数据集上的实验表明,GEPC 在 AUROC 上与最近的扩散基线相比具有竞争力或更优,并且计算效率高。在高分辨率合成孔径雷达图像上,GEPC 提供了强大的目标-背景分离和可解释的等变性破坏图。
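The core quantity in the GEPC entry above is how far a score function is from commuting with a group action. A toy sketch of that idea follows, using circular shifts on a 1D signal as the finite group. The squared-error residual, the two toy score functions, and the omission of the time index $t$ are simplifying assumptions for illustration; the paper's exact functional may differ.

```python
def shift(x, k):
    """Circularly shift list x to the right by k positions."""
    k %= len(x)
    return x[-k:] + x[:-k] if k else list(x)

def equivariance_residual(score_fn, x, shifts):
    """Mean over group elements g of ||s(g.x) - g.s(x)||^2 (assumed form)."""
    total = 0.0
    for k in shifts:
        transformed_then_scored = score_fn(shift(x, k))
        scored_then_transformed = shift(score_fn(x), k)
        total += sum((a - b) ** 2
                     for a, b in zip(transformed_then_scored,
                                     scored_then_transformed))
    return total / len(shifts)

def laplacian_score(x):
    """Toy shift-equivariant 'score': a circular discrete Laplacian."""
    n = len(x)
    return [x[(i - 1) % n] - 2 * x[i] + x[(i + 1) % n] for i in range(n)]

def position_weighted_score(x):
    """Toy non-equivariant 'score': the weights depend on absolute position."""
    return [-(i + 1) * v for i, v in enumerate(x)]
```

With these toy functions, the residual vanishes for `laplacian_score` and is strictly positive for `position_weighted_score` on a generic input, which mirrors the idea of detecting equivariance breaking independently of score magnitude.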
BPP: Long-Context Robot Imitation Learning by Focusing on Key History Frames
Authors: Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, Aviral Kumar
First: 2026-02-16T18:49:56+00:00 · Latest: 2026-02-18T07:07:11+00:00
Abstract
Many robot tasks require attending to the history of past observations. For example, finding an item in a room requires remembering which places have already been searched. However, the best-performing robot policies typically condition only on the current observation, limiting their applicability to such tasks. Naively conditioning on past observations often fails due to spurious correlations: policies latch onto incidental features of training histories that do not generalize to out-of-distribution trajectories upon deployment. We analyze why policies latch onto these spurious correlations and find that this problem stems from limited coverage over the space of possible histories during training, which grows exponentially with horizon. Existing regularization techniques provide inconsistent benefits across tasks, as they do not fundamentally address this coverage problem. Motivated by these findings, we propose Big Picture Policies (BPP), an approach that conditions on a minimal set of meaningful keyframes detected by a vision-language model. By projecting diverse rollouts onto a compact set of task-relevant events, BPP substantially reduces distribution shift between training and deployment, without sacrificing expressivity. We evaluate BPP on four challenging real-world manipulation tasks and three simulation tasks, all requiring history conditioning. BPP achieves 70% higher success rates than the best comparison on real-world evaluations. Videos are available at https://bigpicturepolicies.github.io/
中文标题/摘要
标题:BPP:通过关注关键历史帧进行长上下文机器人模仿学习
许多机器人的任务需要关注过去的观察历史。例如,在房间里寻找物品需要记住已经搜索过的地方。然而,表现最佳的机器人策略通常仅依赖于当前的观察,限制了它们在这些任务中的应用。简单地依赖过去的观察往往由于虚假相关性而失败:策略会抓住训练历史中的偶然特征,这些特征在部署到新的分布时无法泛化。我们分析了为什么策略会抓住这些虚假相关性,并发现这个问题源于训练过程中对可能历史的覆盖范围有限,随着时间范围的增长而呈指数级增长。现有的正则化技术在不同任务中提供的益处不一致,因为它们没有从根本上解决这个问题。受这些发现的启发,我们提出了大图策略(BPP),该方法基于视觉-语言模型检测到的有意义的关键帧进行条件化。通过将多样化的演示投影到与任务相关的事件的紧凑集合上,BPP显著减少了训练和部署之间的分布偏移,而不会牺牲表达能力。我们在四个具有挑战性的现实世界操作任务和三个模拟任务上评估了BPP,所有任务都需要历史条件化。BPP在现实世界评估中的成功率比最佳对比方法高出70%。有关视频请参见https://bigpicturepolicies.github.io/
Summary / 总结
The research targets robot imitation learning for tasks that require memory of past observations, where policies that condition only on the current observation fall short and naive history conditioning fails because of spurious correlations. The proposed Big Picture Policies (BPP) method conditions on a minimal set of keyframes detected by a vision-language model, which reduces distribution shift between training and deployment. In real-world evaluations, BPP achieves 70% higher success rates than the best comparison method.
论文解决了机器人长期上下文模仿学习的问题,现有的最佳策略通常只考虑当前观察,限制了其在需要记忆过去观察的任务中的效果。提出了一种大图策略(BPP),该策略基于视觉语言模型检测的关键历史帧进行条件判断,减少了训练和部署之间的分布差异,并在真实世界的操作任务中实现了比现有方法高出70%的成功率。更多信息请参见https://bigpicturepolicies.github.io/
Training-Free Adaptation of Diffusion Models via Doob's $h$-Transform
Authors: Qijie Zhu, Zeqi Ye, Han Liu, Zhaoran Wang, Minshuo Chen
First: 2026-02-18T05:44:19+00:00 · Latest: 2026-02-18T05:44:19+00:00
Comments: 36 pages, 3 figures
Abstract
Adaptation methods have been a workhorse for unlocking the transformative power of pre-trained diffusion models in diverse applications. Existing approaches often abstract adaptation objectives as a reward function and steer diffusion models to generate high-reward samples. However, these approaches can incur high computational overhead due to additional training, or rely on stringent assumptions on the reward such as differentiability. Moreover, despite their empirical success, theoretical justification and guarantees are seldom established. In this paper, we propose DOIT (Doob-Oriented Inference-time Transformation), a training-free and computationally efficient adaptation method that applies to generic, non-differentiable rewards. The key framework underlying our method is a measure transport formulation that seeks to transport the pre-trained generative distribution to a high-reward target distribution. We leverage Doob's $h$-transform to realize this transport, which induces a dynamic correction to the diffusion sampling process and enables efficient simulation-based computation without modifying the pre-trained model. Theoretically, we establish a high probability convergence guarantee to the target high-reward distribution via characterizing the approximation error in the dynamic Doob's correction. Empirically, on D4RL offline RL benchmarks, our method consistently outperforms state-of-the-art baselines while preserving sampling efficiency.
中文标题/摘要
标题:无训练扩散模型的Doob的$h$-变换自适应方法
自适应方法一直是解锁预训练扩散模型在各种应用中变革性力量的重要工具。现有方法通常将自适应目标抽象为奖励函数,并引导扩散模型生成高奖励样本。然而,这些方法可能会因额外训练而导致高计算开销,或者依赖于奖励的严格假设,如可微性。此外,尽管它们在实验上取得了成功,但理论上的证明和保证却很少建立。在本文中,我们提出了DOIT(Doob导向的推理时变换),这是一种无训练且计算高效的自适应方法,适用于通用的非可微奖励。我们方法的核心框架是一种测度传输公式,旨在将预训练的生成分布传输到高奖励的目标分布。我们利用Doob的$h$-变换来实现这一传输,这会动态修正扩散采样过程,并允许高效的基于模拟的计算而不修改预训练模型。理论上,我们通过表征动态Doob修正的近似误差来建立以高概率收敛到目标高奖励分布的保证。实验上,在D4RL离线RL基准测试中,我们的方法在保持采样效率的同时,始终优于最先进的基线。
Summary / 总结
The paper introduces DOIT (Doob-Oriented Inference-time Transformation), a training-free adaptation method for diffusion models that uses Doob's $h$-transform to steer sampling toward high-reward outputs, including under non-differentiable rewards. It leaves the pre-trained model unmodified and comes with a high-probability convergence guarantee to the target high-reward distribution. Empirically, DOIT outperforms state-of-the-art baselines on D4RL offline RL benchmarks while preserving sampling efficiency.
论文提出了DOIT(Doob-Oriented Inference-time Transformation)方法,这是一种无需训练即可适应扩散模型的方法,能够高效处理非可微奖励。通过使用Doob的$h$-变换,该方法将预训练生成分布转换到高奖励目标分布,无需修改预训练模型即可实现高效的模拟。理论分析提供了高概率收敛保证,而实验结果表明DOIT在D4RL离线RL基准测试中优于现有基线方法,同时保持了采样效率。
Beyond Learning: A Training-Free Alternative to Model Adaptation
Authors: Namkyung Yoon, Kyeonghyun Yoo, Wooyong Jung, Sanghong Kim, Hwangnam Kim
First: 2026-02-18T05:17:44+00:00 · Latest: 2026-02-18T05:17:44+00:00
Comments: 7 pages, 3 figures, 5 tables. Preprint submitted to Pattern Recognition Letters
Abstract
Despite the continuous research and evolution of language models, they sometimes underperform previous versions. Existing approaches to overcome these challenges are resource-intensive, highlighting the need for alternatives that enable immediate action. We assume that each language model has a local module inside that is suitable for a specific function. First, this work identifies a set of modules showing consistent and local activation changes under an inference workload through activation-based analysis. Subsequently, we transplant an internal module that is properly activated for a specific task into the target model, leading to immediate and measurable functional changes without additional training or fine-tuning. To experimentally demonstrate the effectiveness of the transplant technique, we quantify the relationship between transplant strength and performance improvement under different conditions for two language models. In the cross-generation setting, we find that transplanting activation-selected modules can substantially improve the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. Moreover, in transplant experiments between a base model and its instruction-tuned counterpart, transplantation improves the underperforming model toward the stronger baseline, yielding up to about 2.33 times the target baseline with gap-based recovery reaching up to 100% in the best case. These results show that meaningful capacity transfer can be realized through the implantation of highly localized modules implied by language models. Overall, this work provides empirical evidence for task-localized modularity in language models and presents a new research area: model transplantation.
中文标题/摘要
标题:超越学习:无需训练的模型适应替代方案
尽管语言模型的研究和进化不断进行,但有时它们的表现会低于之前的版本。现有的克服这些挑战的方法资源密集型,突显了需要替代方案的必要性,这些替代方案能够立即采取行动。我们假设每个语言模型内部都有一个适合特定功能的本地模块。首先,这项工作通过基于激活的分析识别出一组在推理负载下表现出一致且局部激活变化的模块。随后,我们将一个适当激活以执行特定任务的内部模块移植到目标模型中,从而实现无需额外训练或微调的即时和可测量的功能变化。为了实验性地证明移植技术的有效性,我们量化了在不同条件下两种语言模型的移植强度与性能提升之间的关系。在跨代设置中,我们发现移植选择性激活的模块可以显著提高表现不佳的模型,达到目标基线的两倍以上,并实现超过100%的基于差距的恢复。此外,在基模型与其指令调优版本之间的移植实验中,移植提高了表现不佳的模型向更强的基线模型,达到目标基线约2.33倍的提升,在最佳情况下基于差距的恢复达到100%。这些结果表明,通过语言模型暗示的高度局部模块的植入可以实现有意义的能力转移。总体而言,这项工作提供了语言模型任务局部模块化的实证证据,并提出了一项新的研究领域:模型移植。
Summary / 总结
This work addresses the issue of underperforming language models by proposing a training-free method called model transplantation. It identifies and transplants specific modules from a well-performing model into a less effective one, achieving immediate performance improvements without additional training. Experiments show that transplanting activation-selected modules can significantly enhance the underperforming model, reaching up to twice the target baseline and achieving gap-based recovery above 100%. This method provides empirical evidence for task-localized modularity in language models and opens a new research area in model transplantation.
该研究针对语言模型的性能不足问题,提出了一种无需训练的模型移植方法。通过激活分析识别并移植表现较好的模型中的特定模块到表现较差的模型中,实现即时性能提升。实验表明,移植选定的激活模块可以显著增强表现较差的模型,使其达到目标基线的两倍以上,并且在最佳情况下,基于差距的恢复达到100%。该方法为语言模型中的任务局部模块化提供了实证证据,并开辟了模型移植这一新的研究领域。
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Authors: Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield
Venue: CVPR 2025 Oral
First: 2024-11-25T16:21:34+00:00 · Latest: 2026-02-18T04:26:35+00:00
Comments: CVPR 2025 (Oral); Project Website: https://chanh.ee/RoboSpatial
Abstract
Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D- ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
中文标题/摘要
标题:RoboSpatial:向2D和3D视觉语言模型传授空间理解能力以应用于机器人技术
空间理解是使机器人能够感知其周围环境、推理其环境并与其进行有意义互动的关键能力。在现代机器人技术中,这些能力越来越多地由视觉语言模型提供。然而,这些模型在空间推理任务中面临重大挑战,因为它们的训练数据基于一般用途的图像数据集,这些数据集通常缺乏复杂的空间理解。例如,数据集经常未能捕捉到参考框架理解,而有效的空间推理需要理解是从自我中心、世界中心还是对象中心的角度进行推理。为了解决这一问题,我们引入了RoboSpatial,这是一个用于机器人技术的空间理解大规模数据集。它包括真实室内和桌面上的场景,以3D扫描和自我中心图像的形式捕获,并附有与机器人相关的丰富空间信息进行注释。该数据集包括100万张图像、5000个3D扫描和300万注释的空间关系,2D自我中心图像与3D扫描的配对使其既适用于2D也适用于3D。我们的实验表明,使用RoboSpatial训练的模型在下游任务如空间功能预测、空间关系预测和机器人操作上优于基线模型。
Summary / 总结
The paper introduces RoboSpatial, a large-scale dataset for enhancing spatial understanding in robotics, addressing the limitations of existing general-purpose image datasets. The dataset includes real indoor and tabletop scenes, 3D scans, and egocentric images, with rich spatial annotations. Experiments demonstrate that models trained on RoboSpatial outperform baselines in tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
论文介绍了RoboSpatial数据集,旨在增强机器人领域的空间理解能力,解决现有通用图像数据集的局限性。该数据集包含真实的室内和桌面场景、3D扫描和第一人称视角图像,并附有丰富的空间标注信息。实验结果显示,使用RoboSpatial训练的模型在空间功能预测、空间关系预测和机器人操作等任务上优于基线模型。
LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Authors: Huadong Tang, Youpeng Zhao, Yan Huang, Min Xu, Jun Wang, Qiang Wu
First: 2024-11-30T05:49:42+00:00 · Latest: 2026-02-18T03:45:13+00:00
Abstract
It is widely agreed that open-vocabulary-based approaches outperform classical closed-set training solutions for recognizing unseen objects in images for semantic segmentation. Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets. However, the text prompts employed in these methods are short phrases based on fixed templates, failing to capture comprehensive object attributes. Moreover, while the CLIP model excels at exploiting image-level features, it is less effective at pixel-level representation, which is crucial for semantic segmentation tasks. In this work, we propose to alleviate the above-mentioned issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features. Specifically, our method employs large language models (LLMs) to generate enriched language prompts with diverse visual attributes for each category, including color, shape/size, and texture/material. Additionally, for enhanced visual feature extraction, the SAM model is adopted as a supplement to the CLIP visual encoder through a proposed learnable weighted fusion strategy. Built upon these techniques, our method, termed LMSeg, achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks. The code will be made available soon.
中文标题/摘要
标题:LMSeg:大规模模型在开放词汇语义分割中的潜力释放
普遍认为,基于开放词汇的方法在识别图像中未见过的对象时,比传统的封闭集训练解决方案表现更优。现有的开放词汇方法利用像CLIP这样的视觉-语言模型,通过大规模视觉-语言数据集预训练来对齐视觉特征和丰富的语义特征。然而,这些方法中使用的文本提示是基于固定模板的简短短语,无法捕捉到全面的对象属性。此外,虽然CLIP模型在利用图像级特征方面表现出色,但在像素级表示方面却不太有效,而像素级表示对于语义分割任务至关重要。在本文中,我们通过利用多个大规模模型来增强细粒度视觉特征与丰富语言特征之间的对齐,来缓解上述问题。具体而言,我们的方法使用大型语言模型(LLMs)为每个类别生成包含颜色、形状/大小和纹理/材料等多样视觉属性的丰富语言提示。此外,为了增强视觉特征提取,我们通过一个提出的可学习加权融合策略,采用SAM模型作为CLIP视觉编码器的补充。基于这些技术,我们的方法LMSeg在所有主要的开放词汇分割基准测试中均取得了最先进的性能。代码将在不久后开源。
Summary / 总结
This paper addresses the limitations of existing open-vocabulary approaches in semantic segmentation by proposing LMSeg, which uses large-scale models to enhance alignment between visual and linguistic features. It employs large language models to generate detailed language prompts and integrates the SAM model with CLIP for better pixel-level representation. LMSeg outperforms previous methods across major open-vocabulary segmentation benchmarks.
研究旨在通过解决现有方法的局限性,如简短的模板文本提示和CLIP模型在像素级表示上的不足,来改进开放词汇语义分割。LMSeg方法使用大型语言模型生成详细的文本提示,并通过可学习的加权融合策略将SAM模型与CLIP视觉编码器结合。这种方法显著增强了视觉和语言特征之间的对齐,使其在主要的开放词汇语义分割基准测试中达到最先进的性能。
Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
Authors: Jacky Kwok, Xilun Zhang, Mengdi Xu, Yuejiang Liu, Azalia Mirhoseini, Chelsea Finn, Marco Pavone
First: 2026-02-12T18:59:59+00:00 · Latest: 2026-02-18T03:42:48+00:00
Abstract
The long-standing vision of general-purpose robots hinges on their ability to understand and act upon natural language instructions. Vision-Language-Action (VLA) models have made remarkable progress toward this goal, yet their generated actions can still misalign with the given instructions. In this paper, we investigate test-time verification as a means to shrink the "intention-action gap." We first characterize the test-time scaling laws for embodied instruction following and demonstrate that jointly scaling the number of rephrased instructions and generated actions greatly increases test-time sample diversity, often recovering correct actions more efficiently than scaling each dimension independently. To capitalize on these scaling laws, we present CoVer, a contrastive verifier for vision-language-action alignment, and show that our architecture scales gracefully with additional computational resources and data. We then introduce CoVer-VLA, a hierarchical test-time verification pipeline using the trained verifier. At deployment, our framework precomputes a diverse set of rephrased instructions from a Vision-Language-Model (VLM), repeatedly generates action candidates for each instruction, and then uses the verifier to select the optimal high-level prompt and low-level action chunks. Compared to scaling policy pre-training on the same data, our verification approach yields 22% gains in-distribution and 13% out-of-distribution on the SIMPLER benchmark, with a further 45% improvement in real-world experiments. On the PolaRiS benchmark, CoVer-VLA achieves 14% gains in task progress and 9% in success rate.
中文标题/摘要
标题:扩展验证比扩展策略学习更能有效实现视觉-语言-行动对齐
通用机器人长期愿景依赖于它们理解和执行自然语言指令的能力。视觉-语言-行动(VLA)模型在这一目标上取得了显著进展,但它们生成的动作仍然可能与给定的指令不一致。在本文中,我们研究测试时验证作为缩小“意图-行动差距”的手段。我们首先表征了实体指令跟随的测试时扩展定律,证明了同时扩展重述指令的数量和生成动作的数量大大增加了测试时样本多样性,通常比独立扩展每个维度更有效地恢复正确动作。为了利用这些扩展定律,我们提出了CoVer,一种对比验证器,用于视觉-语言-行动对齐,并展示了我们的架构随着额外计算资源和数据的增加而平滑扩展。然后,我们引入了CoVer-VLA,一种分层测试时验证流水线,使用训练好的验证器。在部署时,我们的框架从视觉语言模型(VLM)预计算一组多样化的重述指令,反复为每条指令生成动作候选,然后使用验证器选择最优的高层提示和低层动作片段。与在相同数据上扩展策略预训练相比,我们的验证方法在SIMPLER基准测试中获得了22%的同分布改进和13%的异分布改进,在实际实验中进一步提高了45%。在PolaRiS基准测试中,CoVer-VLA实现了任务进展14%的提升和成功率9%的提升。
Summary / 总结
This paper explores test-time verification as a method to improve alignment between actions and natural language instructions for vision-language-action models. By characterizing scaling laws for embodied instruction following, the authors show that jointly scaling rephrased instructions and generated actions enhances test-time sample diversity. They introduce CoVer, a contrastive verifier, which scales effectively with additional resources. CoVer-VLA, a hierarchical verification pipeline, precomputes diverse rephrased instructions and selects optimal actions, leading to 22% in-distribution and 13% out-of-distribution gains on the SIMPLER benchmark, with further improvements in real-world experiments. On the PolaRiS benchmark, CoVer-VLA achieves 14% gains in task progress and 9% in success rate.
本文探讨了测试时验证作为提高视觉-语言-行动模型中行动与自然语言指令之间对齐的方法。研究表明,同时增加重述指令的数量和生成动作的数量可以提高测试时样本多样性,更有效地恢复正确的动作。提出的CoVer验证器能够随着资源的增加平滑扩展,而CoVer-VLA层次验证管道进一步提升了性能。与政策预训练相比,在SIMPLER基准上,验证方法在分布内和分布外分别取得了22%和13%的提升,在实际实验中进一步提高了45%。在PolaRiS基准上,CoVer-VLA实现了14%的任务进度提升和9%的成功率提升。
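The test-time verification loop described in the entry above (rephrase the instruction, sample several actions per rephrasing, let a verifier pick) can be sketched as a best-of-N search over both dimensions. This is a flat simplification of the hierarchical pipeline in the abstract, and `sample_action`, `verifier`, and the returned tuple layout are hypothetical interfaces introduced only for illustration.

```python
def select_with_verifier(instructions, sample_action, verifier, n_actions):
    """Return (score, instruction, action) with the highest verifier score.

    instructions:  rephrased instructions (assumed precomputed by a VLM)
    sample_action: callable(instruction) -> candidate action (policy stub)
    verifier:      callable(instruction, action) -> float, higher is better
    n_actions:     number of action candidates sampled per instruction
    """
    best = None
    for instruction in instructions:
        for _ in range(n_actions):
            action = sample_action(instruction)
            score = verifier(instruction, action)
            # Keep the first candidate seen on ties.
            if best is None or score > best[0]:
                best = (score, instruction, action)
    return best
```

The search costs `len(instructions) * n_actions` policy and verifier calls, so the two budgets can be traded off against each other, which is the joint scaling the abstract refers to.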
Evaluating Demographic Misrepresentation in Image-to-Image Portrait Editing
Authors: Huichan Seo, Minki Hong, Sieun Choi, Jihie Kim, Jean Oh
First: 2026-02-18T02:47:36+00:00 · Latest: 2026-02-18T02:47:36+00:00
Comments: 19 pages, 13 figures. Preprint
Abstract
Demographic bias in text-to-image (T2I) generation is well studied, yet demographic-conditioned failures in instruction-guided image-to-image (I2I) editing remain underexplored. We examine whether identical edit instructions yield systematically different outcomes across subject demographics in open-weight I2I editors. We formalize two failure modes: Soft Erasure, where edits are silently weakened or ignored in the output image, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. We introduce a controlled benchmark that probes demographic-conditioned behavior by generating and editing portraits conditioned on race, gender, and age using a diagnostic prompt set, and evaluate multiple editors with vision-language model (VLM) scoring and human evaluation. Our analysis shows that identity preservation failures are pervasive, demographically uneven, and shaped by implicit social priors, including occupation-driven gender inference. Finally, we demonstrate that a prompt-level identity constraint, without model updates, can substantially reduce demographic change for minority groups while leaving majority-group portraits largely unchanged, revealing asymmetric identity priors in current editors. Together, our findings establish identity preservation as a central and demographically uneven failure mode in I2I editing and motivate demographic-robust editing systems. Project page: https://seochan99.github.io/i2i-demographic-bias
中文标题/摘要
标题:评估图像到图像肖像编辑中的人口统计学误表征
文本到图像(T2I)生成中的人口统计学偏差已被广泛研究,但在指令引导的图像到图像(I2I)编辑中的人口统计学条件失败方面仍研究不足。我们探讨了在开放权重I2I编辑器中,相同的编辑指令是否会导致不同受试者人口统计学群体之间系统性不同的结果。我们形式化了两种失败模式:软抹除,其中编辑在输出图像中被无声地削弱或忽略;刻板印象替代,其中编辑引入了未请求的、与刻板印象一致的属性。我们引入了一个受控基准,通过使用诊断提示集生成和编辑基于种族、性别和年龄的肖像,来探究人口统计学条件下的行为,并使用视觉语言模型(VLM)评分和人工评估来评估多个编辑器。我们的分析表明,身份保护失败普遍存在,且在不同的人口统计学群体中分布不均,受到隐含的社会先入为主的观念的影响,包括职业驱动的性别推断。最后,我们证明,在不更新模型的情况下,提示级身份约束可以显著减少少数群体的人口统计学变化,而主要群体的肖像基本保持不变,揭示了当前编辑器中不对称的身份先入为主观念。综上,我们的研究结果确立了身份保护在I2I编辑中作为一种普遍且分布不均的失败模式,并推动了人口统计学稳健的编辑系统的发展。项目页面:https://seochan99.github.io/i2i-demographic-bias
Summary / 总结
This study evaluates demographic bias in instruction-guided image-to-image (I2I) portrait editing, asking whether identical edit instructions yield systematically different outcomes across subject demographics. It introduces a controlled benchmark that probes two failure modes: Soft Erasure, where edits are silently weakened or ignored, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent attributes. Vision-language model scoring and human evaluation reveal pervasive identity preservation failures that are demographically uneven and shaped by implicit social priors. A prompt-level identity constraint, applied without model updates, substantially reduces demographic change for minority groups while leaving majority-group portraits largely unchanged, which motivates demographic-robust editing systems.
该研究评估了图像到图像(I2I)编辑中的人口统计偏差,关注相同编辑指令在不同主体人口统计中的不同结果。研究使用种族、性别和年龄生成和编辑肖像,并发现身份保持失败在不同人口统计中普遍存在且分布不均。研究还表明,通过提示级身份约束可以在不改变主流群体肖像的情况下减少少数群体的种族变化,突显了需要人口统计稳健的编辑系统。
Language-Guided Invariance Probing of Vision-Language Models
Authors: Jae Joong Lee
First: 2025-11-17T15:35:49+00:00 · Latest: 2026-02-18T02:45:04+00:00
Comments: Pattern Recognition Letters 2026
Abstract
Recent vision-language models (VLMs) such as CLIP, OpenCLIP, EVA02-CLIP and SigLIP achieve strong zero-shot performance, but it is unclear how reliably they respond to controlled linguistic perturbations. We introduce Language-Guided Invariance Probing (LGIP), a benchmark that measures (i) invariance to meaning-preserving paraphrases and (ii) sensitivity to meaning-changing semantic flips in image-text matching. Using 40k MS COCO images with five human captions each, we automatically generate paraphrases and rule-based flips that alter object category, color or count, and summarize model behavior with an invariance error, a semantic sensitivity gap and a positive-rate statistic.
Across nine VLMs, EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for their flipped counterparts. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions to the human descriptions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics, indicating that LGIP provides a model-agnostic diagnostic for the linguistic robustness of VLMs beyond conventional accuracy scores.
中文标题/摘要
标题:语言引导的视觉-语言模型不变性探查
近期的视觉-语言模型(VLMs)如CLIP、OpenCLIP、EVA02-CLIP和SigLIP在零样本情况下表现出色,但它们对控制语言扰动的可靠响应尚不清楚。我们引入了语言引导的不变性探查(LGIP),这是一个基准测试,用于测量(i)对意义保留的同义句的不变性以及(ii)对图像-文本匹配中意义改变的语义翻转的敏感性。使用40000张MS COCO图像和每张图像五个手工生成的描述,我们自动生成了改变对象类别、颜色或数量的同义句和基于规则的翻转,并用不变性误差、语义敏感性差距和正率统计来总结模型的行为。在九种VLMs中,EVA02-CLIP和大型OpenCLIP变体位于有利的不变性-敏感性前沿,结合了低同义句诱导的变异性和始终高于翻转描述的原始描述的更高分数。相比之下,SigLIP和SigLIP2显示出更大的不变性误差,通常更偏好翻转描述,尤其是在对象和颜色编辑方面。这些失败在标准检索指标中往往是看不见的,表明LGIP为VLMs的语言鲁棒性提供了一种模型无关的诊断工具,超越了传统的准确率分数。
Summary / 总结
This study introduces Language-Guided Invariance Probing (LGIP), a benchmark that evaluates how vision-language models (VLMs) respond to controlled linguistic perturbations: invariance to meaning-preserving paraphrases and sensitivity to meaning-changing semantic flips. Using 40k MS COCO images with five human captions each, the study automatically generates paraphrases and rule-based flips of object category, color, or count. EVA02-CLIP and large OpenCLIP variants lie on a favorable invariance-sensitivity frontier, combining low paraphrase-induced variance with consistently higher scores for original captions than for flipped ones. In contrast, SigLIP and SigLIP2 show much larger invariance error and often prefer flipped captions, especially for object and color edits. These failures are largely invisible to standard retrieval metrics.
该研究引入了语言引导不变性探针(LGIP),通过测量视觉语言模型(VLMs)对同义句和语义翻转的不变性和敏感性来评估其语言鲁棒性。使用40k MS COCO图像和五个人工描述,该研究自动生成同义句和基于规则的翻转来改变物体类别、颜色或数量。关键发现表明,EVA02-CLIP和大型OpenCLIP变体在不变性-敏感性前沿表现良好,而SigLIP和SigLIP2表现出更大的不变性误差,并且经常偏好翻转描述,表明其语言鲁棒性较弱。
WINA: Weight Informed Neuron Activation for Accelerating Large Language Model Inference
Authors: Sihan Chen, Dan Zhao, Jongwoo Ko, Colby Banbury, Huiping Zhuang, Luming Liang, Pashmina Cameron, Tianyi Chen
First: 2025-05-26T02:37:32+00:00 · Latest: 2026-02-18T02:40:01+00:00
Abstract
The growing computational demands of large language models (LLMs) make efficient inference and activation strategies increasingly critical. While recent approaches, such as Mixture-of-Experts (MoE), leverage selective activation but require specialized training, training-free sparse activation methods offer broader applicability and superior resource efficiency through their plug-and-play design. However, many existing methods rely solely on hidden state magnitudes to determine activation, resulting in high approximation errors and suboptimal inference accuracy. To address these limitations, we propose WINA (Weight Informed Neuron Activation), a novel, simple, and training-free sparse activation framework that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices. We show that this leads to a sparsification strategy that obtains optimal approximation error bounds with theoretical guarantees tighter than existing techniques. Empirically, WINA also outperforms state-of-the-art methods (e.g., TEAL) by up to $2.94\%$ in average performance at the same sparsity levels, across a diverse set of LLM architectures and datasets. These results position WINA as a new performance frontier for training-free sparse activation in LLM inference, advancing training-free sparse activation methods and setting a robust baseline for efficient inference. The source code is available at https://github.com/microsoft/wina.
中文标题/摘要
标题:WINA:基于权重的信息神经激活加速大型语言模型推理
大型语言模型(LLMs)日益增长的计算需求使得高效的推理和激活策略变得越来越关键。虽然最近的方法,如混合专家(MoE),利用了选择性激活但需要专门的训练,无需训练的稀疏激活方法通过其即插即用的设计提供了更广泛的适用性和更好的资源效率。然而,许多现有方法仅依赖隐藏状态的幅度来确定激活,导致高近似误差和次优的推理准确性。为了解决这些限制,我们提出了WINA(Weight Informed Neuron Activation),这是一种新颖、简单且无需训练的稀疏激活框架,同时考虑了隐藏状态的幅度和权重矩阵的列向量的$\ell_2$-范数。我们表明,这导致了一种稀疏化策略,获得了比现有技术更紧的理论保证的最优近似误差界。实验上,WINA在相同的稀疏度水平下也优于最先进的方法(例如TEAL),平均性能高出2.94%。这些结果将WINA定位为训练无需训练的稀疏激活在LLM推理中的新性能前沿,推进了训练无需训练的稀疏激活方法,并为高效的推理设定了一个稳健的基准。源代码可在https://github.com/microsoft/wina/获取。
Summary / 总结
WINA is a novel training-free sparse activation framework for large language models (LLMs) that jointly considers hidden state magnitudes and the column-wise $\ell_2$-norms of weight matrices to achieve optimal approximation error bounds. Empirically, WINA outperforms existing methods like TEAL by up to 2.94% in average performance at the same sparsity levels across various LLM architectures and datasets, demonstrating its effectiveness in improving inference accuracy while maintaining resource efficiency.
WINA 是一种新颖的无需训练的稀疏激活框架,通过同时考虑隐藏状态的幅度和权重矩阵的列向量 $\ell_2$-范数来提高大型语言模型推理的效率。该方法实现了最优的近似误差边界,并且在相同的稀疏度水平下,平均性能比现有的先进方法如TEAL高出最多2.94%,适用于多种不同的 LLM 架构和数据集。
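The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the per-neuron score is the product of the hidden state magnitude and the column-wise $\ell_2$-norm of the weight matrix, and that the top-k scores are kept. All function names and the toy values are hypothetical.

```python
import math

def column_l2_norms(W):
    """l2 norm of each column of W, where W is a list of rows."""
    n_cols = len(W[0])
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(n_cols)]

def topk_mask(scores, k):
    """Binary mask that keeps the k largest scores."""
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in keep:
        mask[i] = 1
    return mask

def magnitude_mask(x, k):
    """Baseline criterion: hidden state magnitude only."""
    return topk_mask([abs(v) for v in x], k)

def weight_informed_mask(x, W, k):
    """Assumed joint criterion: |x_i| times the l2 norm of column i of W."""
    norms = column_l2_norms(W)
    return topk_mask([abs(v) * c for v, c in zip(x, norms)], k)

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def approx_error(W, x, mask):
    """l2 distance between the dense output W x and the sparsified output."""
    exact = matvec(W, x)
    sparse = matvec(W, [v * m for v, m in zip(x, mask)])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(exact, sparse)))

# Toy layer: the largest hidden state (index 0) feeds a nearly empty column,
# while a smaller hidden state (index 1) feeds a heavy column.
W = [[0.1, 5.0, 0.2],
     [0.1, 5.0, 0.1]]
x = [3.0, 1.0, 2.0]

err_magnitude = approx_error(W, x, magnitude_mask(x, 1))
err_weighted = approx_error(W, x, weight_informed_mask(x, W, 1))
```

In this toy case the magnitude-only mask keeps index 0 and discards almost all of the output, while the weight-informed mask keeps index 1 and leaves a much smaller approximation error.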
IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
Authors: Parsa Madinei, Srijita Karmakar, Russell Cohen Hoffing, Felix Gervitz, Miguel P. Eckstein
First: 2026-02-18T02:06:24+00:00 · Latest: 2026-02-18T02:06:24+00:00
Abstract
We introduce IRIS (Intent Resolution via Inference-time Saccades), a novel training-free approach that uses eye-tracking data in real-time to resolve ambiguity in open-ended VQA. Through a comprehensive user study with 500 unique image-question pairs, we demonstrate that fixations closest to the time participants start verbally asking their questions are the most informative for disambiguation in Large VLMs, more than doubling the accuracy of responses on ambiguous questions (from 35.2% to 77.2%) while maintaining performance on unambiguous queries. We evaluate our approach across state-of-the-art VLMs, showing consistent improvements when gaze data is incorporated in ambiguous image-question pairs, regardless of architectural differences. We release a new benchmark dataset to use eye movement data for disambiguated VQA, a novel real-time interactive protocol, and an evaluation suite.
中文标题/摘要
标题:IRIS:通过推断时的眼动扫描解决大型视觉-语言模型中开放型VQA的意图
我们引入了IRIS(通过推断时的眼动扫描进行意图解析),这是一种无需训练的新方法,利用实时的眼动追踪数据来解决开放型VQA中的歧义问题。通过一项包含500个独特图像-问题配对的全面用户研究,我们证明了参与者开始口头提问时最接近的注视点对于大型VLM中的去歧义化最具信息性,这将含糊问题的回答准确性提高了一倍多(从35.2%提高到77.2%),同时保持了对非含糊查询的性能。我们在最先进的VLM上评估了该方法,显示当将注视数据纳入含糊的图像-问题配对时,无论架构差异如何,都有一致的改进。我们发布了一个新的基准数据集,用于使用眼动数据进行去歧义化VQA,一种新的实时交互协议,以及一个评估套件。
Summary / 总结
IRIS is a training-free approach that uses eye-tracking data during inference to resolve ambiguity in open-ended VQA for large vision-language models. Through a user study with 500 image-question pairs, it was found that fixations near the start of verbal questioning are most informative, improving ambiguous question accuracy from 35.2% to 77.2% while maintaining performance on unambiguous queries. This method shows consistent improvements across different VLMs when gaze data is incorporated.
IRIS 是一种无需训练的方法,通过实时的眼动数据来解决开放性 VQA 中的歧义问题,使回答模糊问题的准确性从 35.2% 提高到 77.2%,同时保持对非模糊查询的性能。该方法在最先进的 VLM 上进行了评估,并且在引入眼动数据时显示了一致的改进。还发布了一个新的基准数据集和评估套件。
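The finding that fixations closest to the start of the spoken question are the most informative suggests a simple selection step, sketched below. The abstract does not say how the selected gaze data is passed to the VLM, so this covers only the temporal selection. The `Fixation` type, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Fixation:
    t: float  # fixation onset time in seconds (assumed unit)
    x: float  # horizontal image coordinate
    y: float  # vertical image coordinate

def fixations_near_onset(fixations, speech_onset, k=3):
    """Return the k fixations closest in time to speech onset, in temporal order."""
    ranked = sorted(fixations, key=lambda f: abs(f.t - speech_onset))
    return sorted(ranked[:k], key=lambda f: f.t)

# Hypothetical gaze trace; the participant starts speaking at t = 2.0 s.
trace = [
    Fixation(0.2, 10, 10),
    Fixation(1.0, 50, 40),
    Fixation(1.9, 120, 80),
    Fixation(2.1, 125, 82),
    Fixation(3.5, 300, 200),
]
picked = fixations_near_onset(trace, speech_onset=2.0, k=2)
```

The selected coordinates could then be attached to the question, for example as a crop or a marked region. That choice is left open here because the source does not specify it.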
OmniCT: Towards a Unified Slice-Volume LVLM for Comprehensive CT Analysis
Authors: Tianwei Lin, Zhongwei Qiu, Wenqiao Zhang, Jiang Liu, Yihan Xie, Mingjian Gao, Zhenxuan Fan, Zhaocheng Li, Sijing Li, Zhongle Xie, Peng LU, Yueting Zhuang, Yingda Xia, Ling Zhang, Beng Chin Ooi
First: 2026-02-18T00:42:41+00:00 · Latest: 2026-02-18T00:42:41+00:00
Abstract
Computed Tomography (CT) is one of the most widely used and diagnostically information-dense imaging modalities, covering critical organs such as the heart, lungs, liver, and colon. Clinical interpretation relies on both slice-driven local features (e.g., sub-centimeter nodules, lesion boundaries) and volume-driven spatial representations (e.g., tumor infiltration, inter-organ anatomical relations). However, existing Large Vision-Language Models (LVLMs) remain fragmented in CT slice versus volumetric understanding: slice-driven LVLMs show strong generalization but lack cross-slice spatial consistency, while volume-driven LVLMs explicitly capture volumetric semantics but suffer from coarse granularity and poor compatibility with slice inputs. The absence of a unified modeling paradigm constitutes a major bottleneck for the clinical translation of medical LVLMs. We present OmniCT, a powerful unified slice-volume LVLM for CT scenarios, which makes three contributions: (i) Spatial Consistency Enhancement (SCE): volumetric slice composition combined with tri-axial positional embedding introduces volumetric consistency, and an MoE hybrid projection enables efficient slice-volume adaptation; (ii) Organ-level Semantic Enhancement (OSE): segmentation and ROI localization explicitly align anatomical regions, emphasizing lesion- and organ-level semantics; (iii) MedEval-CT: the largest slice-volume CT dataset and hybrid benchmark, which integrates comprehensive metrics for unified evaluation. OmniCT consistently outperforms existing methods by a substantial margin across diverse clinical tasks and satisfies both micro-level detail sensitivity and macro-level spatial reasoning. More importantly, it establishes a new paradigm for cross-modal medical imaging understanding.
中文标题/摘要
标题:OmniCT:朝向统一的切片-体积LVLM的CT综合分析
计算机断层扫描(CT)是使用最广泛且诊断信息密集的成像技术之一,涵盖了心脏、肺、肝脏和结肠等重要器官。临床解释依赖于切片驱动的局部特征(如亚厘米结节、病灶边界)和体积驱动的空间表示(如肿瘤浸润、器官间解剖关系)。然而,现有的大型视觉-语言模型(LVLM)在CT切片与体积理解方面仍然碎片化:切片驱动的LVLM表现出强大的泛化能力,但缺乏跨切片的空间一致性,而体积驱动的LVLM则明确捕捉体积语义,但粒度过粗且与切片输入兼容性差。缺乏统一建模范式是医学LVLM临床转化的主要瓶颈。我们提出了OmniCT,这是一种强大的统一切片-体积LVLM,适用于CT场景,做出了三项贡献:(i)空间一致性增强(SCE):体积切片组成结合三轴位置嵌入引入体积一致性,MoE混合投影实现高效的切片-体积适应;(ii)器官级语义增强(OSE):分割和ROI定位明确对齐解剖区域,强调病灶和器官级语义;(iii)MedEval-CT:最大的切片-体积CT数据集和混合基准集整合了统一评估的全面指标。OmniCT在多种临床任务中一致优于现有方法,具有显著的领先优势,同时满足微观细节敏感性和宏观空间推理需求。更重要的是,它为跨模态医学成像理解建立了新的范式。
Summary / 总结
OmniCT is designed to address the limitations of existing Large Vision-Language Models (LVLMs) in CT analysis by integrating slice and volumetric understanding. It introduces Spatial Consistency Enhancement (SCE) through volumetric slice composition and tri-axial positional embedding, and Organ-level Semantic Enhancement (OSE) via segmentation and ROI localization. The model demonstrates superior performance across various clinical tasks, enhancing both micro-level detail sensitivity and macro-level spatial reasoning, and sets a new standard for cross-modal medical imaging understanding. The MedEval-CT dataset and hybrid benchmark are also introduced to facilitate unified evaluation of slice-volume CT analysis.
OmniCT旨在通过结合切片驱动和体积驱动的方法来解决现有大型视觉-语言模型在CT分析中的局限性。它通过体积切片组成和三轴位置嵌入引入空间一致性增强(SCE),并通过分割和ROI定位实现器官级别的语义增强(OSE)。该模型在各种临床任务中一致地优于现有方法,并支持详细和空间推理。此外,OmniCT还包括MedEval-CT数据集,用于统一评估切片-体积CT分析。
Quant VideoGen: Auto-Regressive Long Video Generation via 2-Bit KV-Cache Quantization
Authors: Haocheng Xi, Shuo Yang, Yilong Zhao, Muyang Li, Han Cai, Xingyang Li, Yujun Lin, Zhuoyang Zhang, Jintao Zhang, Xiuyu Li, Zhiying Xu, Jun Wu, Chenfeng Xu, Ion Stoica, Song Han, Kurt Keutzer
First: 2026-02-03T00:54:32+00:00 · Latest: 2026-02-17T23:49:23+00:00
Comments: 11 pages, 7 figures
Abstract
Despite rapid progress in autoregressive video diffusion, an emerging system-algorithm bottleneck limits both deployability and generation capability: KV cache memory. In autoregressive video generation models, the KV cache grows with generation history and quickly dominates GPU memory, often exceeding 30 GB, preventing deployment on widely available hardware. More critically, constrained KV cache budgets restrict the effective working memory, directly degrading long-horizon consistency in identity, layout, and motion. To address this challenge, we present Quant VideoGen (QVG), a training-free KV cache quantization framework for autoregressive video diffusion models. QVG leverages video spatiotemporal redundancy through Semantic-Aware Smoothing, producing low-magnitude, quantization-friendly residuals. It further introduces Progressive Residual Quantization, a coarse-to-fine multi-stage scheme that reduces quantization error while enabling a smooth quality-memory trade-off. Across LongCat Video, HY WorldPlay, and Self Forcing benchmarks, QVG establishes a new Pareto frontier between quality and memory efficiency, reducing KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead while consistently outperforming existing baselines in generation quality.
中文标题/摘要
标题:Quant VideoGen:通过2比特KV缓存量化实现自回归长视频生成
尽管在自回归视频扩散方面取得了快速进展,但新兴的系统算法瓶颈限制了部署能力和生成能力:KV缓存内存。在自回归视频生成模型中,KV缓存随着生成历史增长,并迅速占据GPU内存,经常超过30 GB,阻止在广泛可用的硬件上部署。更关键的是,受限的KV缓存预算限制了有效的内存工作空间,直接降低了长期一致性中的身份、布局和运动。为了解决这一挑战,我们提出了Quant VideoGen(QVG),一种无需训练的KV缓存量化框架,用于自回归视频扩散模型。QVG 通过语义感知平滑利用视频时空冗余,产生低幅度、量化友好的残差。它进一步引入了渐进残差量化,这是一种从粗到细的多阶段方案,减少了量化误差,同时允许平滑的质量与内存之间的权衡。在LongCat Video、HY WorldPlay 和 Self Forcing 基准测试中,QVG 在质量与内存效率之间建立了新的帕累托前沿,将KV缓存内存最多减少7.0倍,同时端到端延迟开销低于4%,并且在生成质量上始终优于现有基线。
Summary / 总结
Quant VideoGen (QVG) addresses the KV cache memory bottleneck in autoregressive video diffusion models with a training-free quantization framework that exploits video spatiotemporal redundancy. QVG reduces KV cache memory by up to 7.0 times with less than 4% end-to-end latency overhead and consistently outperforms existing baselines in generation quality on the LongCat Video, HY WorldPlay, and Self Forcing benchmarks.
Quant VideoGen (QVG) 通过利用视频的时空冗余性和引入渐进残差量化方案来解决自回归视频生成中的 KV 缓存内存瓶颈问题。该方法在 LongCat Video、HY WorldPlay 和 Self Forcing 等基准测试中将 KV 缓存内存最多减少 7.0 倍,端到端延迟开销低于 4%,并且生成质量始终优于现有基线。
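The coarse-to-fine idea behind Progressive Residual Quantization can be illustrated with a generic multi-stage residual quantizer. This is a sketch under stated assumptions, not the QVG algorithm: it uses plain uniform min-max quantization at 2 bits per stage, works on a flat list of floats, and omits the Semantic-Aware Smoothing step. All names and values are hypothetical.

```python
def quantize_uniform(values, bits=2):
    """Uniform min-max quantization to 2**bits levels; returns dequantized values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant input is represented exactly
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]

def progressive_residual_quantize(values, bits=2, stages=2):
    """Each stage quantizes what the previous stages failed to capture."""
    approx = [0.0] * len(values)
    residual = list(values)
    for _ in range(stages):
        q = quantize_uniform(residual, bits)
        approx = [a + b for a, b in zip(approx, q)]
        residual = [v - a for v, a in zip(values, approx)]
    return approx

def max_abs_error(a, b):
    return max(abs(u - v) for u, v in zip(a, b))

# Hypothetical cache entries.
kv = [0.0, 0.9, 2.2, 3.0, 1.4, 2.7]
err_one_stage = max_abs_error(kv, progressive_residual_quantize(kv, stages=1))
err_two_stage = max_abs_error(kv, progressive_residual_quantize(kv, stages=2))
```

Each extra stage costs more bits per value and shrinks the worst-case error, which is one way to expose a smooth quality-memory trade-off. With this particular uniform quantizer, a second 2-bit stage cuts the worst-case error by at least a factor of three.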
MoE-Spec: Expert Budgeting for Efficient Speculative Decoding
Authors: Bradley McDanel, Steven Li, Sruthikesh Surineni, Harshit Khaitan
First: 2026-02-17T22:02:36+00:00 · Latest: 2026-02-17T22:02:36+00:00
Comments: 12 pages, 10 figures
Abstract
Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10-30% higher throughput than state-of-the-art speculative decoding baselines (EAGLE-3) at comparable quality, with flexibility to trade accuracy for further latency reductions through tighter budgets.
中文标题/摘要
标题:MoE-Spec:专家预算化以提高推测性解码效率
推测性解码通过并行验证多个草稿令牌来加速大型语言模型(LLM)推理。然而,对于混合专家(MoE)模型,这种并行性引入了一个严重的瓶颈:庞大的草稿树激活了许多独特的专家,显著增加了内存压力,从而减少了推测性解码相对于自回归解码的速度提升。先前的方法在MoE验证变得昂贵时减少推测深度。我们提出MoE-Spec,一种无需训练的验证时专家预算方法,通过在每一层强制执行固定的专家容量限制来解耦推测深度和内存成本,仅加载对验证贡献最大的专家,并丢弃那些很少使用且驱动带宽开销的长尾专家。在多个模型规模和数据集上的实验表明,该方法在可比质量下比最先进的推测性解码基线(EAGLE-3)提高了10-30%的吞吐量,同时可以通过更严格的预算进一步减少延迟。
Summary / 总结
The research aims to make speculative decoding more efficient for Mixture-of-Experts (MoE) Large Language Models (LLMs), where large draft trees activate many unique experts and increase memory pressure. The proposed MoE-Spec method is training-free: at verification time it enforces a fixed expert capacity limit at each layer, loading only the experts that contribute most and dropping the long tail of rarely used ones. Experiments show 10-30% higher throughput than the speculative decoding baseline EAGLE-3 at comparable quality, and tighter budgets can trade accuracy for further latency reductions.
MoE-Spec 是一种无需训练的方法,用于提高 Mixture-of-Experts (MoE) 模型中推测性解码的效率。它在每一层限制使用的专家数量,专注于最关键的部分。实验表明,与现有推测性解码方法 EAGLE-3 相比,MoE-Spec 可以将吞吐量提高 10-30%,同时保持相似的质量。更紧的预算可以进一步减少延迟,但会牺牲一些准确性。
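A per-layer expert budget of this kind can be sketched as follows. The abstract does not define how an expert's contribution is measured or what happens to the gate weights of dropped experts. This sketch assumes contribution is the summed gate weight over all drafted tokens and renormalizes the surviving weights per token. The function name and routing values are hypothetical.

```python
def budget_experts(token_routes, budget):
    """Keep at most `budget` experts for one layer of a draft tree.

    token_routes: one dict per drafted token, mapping expert id -> gate weight.
    Returns the kept expert ids and the pruned, renormalized routes.
    """
    # Assumed contribution measure: total gate weight across drafted tokens.
    totals = {}
    for route in token_routes:
        for expert, weight in route.items():
            totals[expert] = totals.get(expert, 0.0) + weight

    # Highest total first; ties broken by expert id for determinism.
    ranked = sorted(totals, key=lambda e: (-totals[e], e))
    kept = set(ranked[:budget])

    pruned = []
    for route in token_routes:
        selected = {e: w for e, w in route.items() if e in kept}
        norm = sum(selected.values())
        # A token whose experts were all dropped gets an empty route here;
        # how the real method treats that case is not stated in the source.
        pruned.append({e: w / norm for e, w in selected.items()} if norm > 0 else {})
    return kept, pruned

# Four drafted tokens routed with top-2 gating over five experts (toy values).
routes = [
    {0: 0.6, 1: 0.4},
    {0: 0.7, 2: 0.3},
    {1: 0.5, 3: 0.5},
    {0: 0.8, 4: 0.2},
]
kept, pruned = budget_experts(routes, budget=2)
```

Without a budget this layer would load five experts. With `budget=2` it loads two, regardless of how many tokens the draft tree contains, which is how a fixed cap separates speculation depth from memory cost.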
Do Vision-Language Models Respect Contextual Integrity in Location Disclosure?
Authors: Ruixin Yang, Ethan Mendes, Arthur Wang, James Hays, Sauvik Das, Wei Xu, Alan Ritter
Venue: ICLR 2026
First: 2026-02-04T20:24:14+00:00 · Latest: 2026-02-17T21:21:53+00:00
Comments: Accepted by ICLR 2026. Code and data can be downloaded via https://github.com/99starman/VLM-GeoPrivacyBench
Abstract
Vision-language models (VLMs) have demonstrated strong performance in image geolocation, a capability further sharpened by frontier multimodal large reasoning models (MLRMs). This poses a significant privacy risk, as these widely accessible models can be exploited to infer sensitive locations from casually shared photos, often at street-level precision, potentially surpassing the level of detail the sharer consented or intended to disclose. While recent work has proposed applying a blanket restriction on geolocation disclosure to combat this risk, these measures fail to distinguish valid geolocation uses from malicious behavior. Instead, VLMs should maintain contextual integrity by reasoning about elements within an image to determine the appropriate level of information disclosure, balancing privacy and utility. To evaluate how well models respect contextual integrity, we introduce VLM-GEOPRIVACY, a benchmark that challenges VLMs to interpret latent social norms and contextual cues in real-world images and determine the appropriate level of location disclosure. Our evaluation of 14 leading VLMs shows that, despite their ability to precisely geolocate images, the models are poorly aligned with human privacy expectations. They often over-disclose in sensitive contexts and are vulnerable to prompt-based attacks. Our results call for new design principles in multimodal systems to incorporate context-conditioned privacy reasoning.
中文标题/摘要
标题:视觉语言模型在位置披露中是否尊重上下文完整性?
视觉语言模型(VLMs)在图像地理定位方面表现出强大的性能,进一步被前沿的多模态大型推理模型(MLRMs)所增强。这带来了重大的隐私风险,因为这些广泛可用的模型可以被利用从随意分享的照片中推断出敏感的位置信息,通常达到街道级别的精度,可能超过分享者同意或意图披露的详细程度。虽然最近的工作提出了对地理定位披露进行全面限制的方法以应对这一风险,但这些措施未能区分有效的地理定位使用和恶意行为。相反,VLMs 应通过推理图像中的元素来保持上下文完整性,以确定适当的信息披露水平,平衡隐私和实用性。为了评估模型是否尊重上下文完整性,我们引入了VLM-GeoPrivacy基准,该基准挑战VLMs解释现实世界图像中的潜在社会规范和上下文线索,并确定适当的位置披露水平。我们的评估显示,尽管这些模型能够精确地地理定位图像,但它们与人类的隐私期望严重不符。它们在敏感背景下经常过度披露信息,并且容易受到提示攻击。我们的结果呼吁在多模态系统中引入新的设计原则,以纳入基于上下文的隐私推理。
Summary / 总结
The research aims to assess whether vision-language models respect contextual integrity in location disclosure, given their strong performance in image geolocation. The study introduces VLM-GEOPRIVACY, a benchmark that tests models' ability to interpret social norms and contextual cues to determine appropriate location disclosure. The evaluation of 14 leading VLMs reveals that these models often over-disclose sensitive information and are vulnerable to prompt-based attacks, indicating a misalignment with human privacy expectations.
该研究探讨了视觉-语言模型在位置披露时是否尊重上下文完整性,引入了VLM-GEOPRIVACY基准来评估模型解释社会规范和上下文线索的能力。对14个领先VLMs的评估显示,尽管这些模型在地理定位方面非常精确,但它们经常在敏感情境下过度披露信息,并且容易受到提示攻击,表明与人类隐私期望存在偏差。
MedProbCLIP: Probabilistic Adaptation of Vision-Language Foundation Model for Reliable Radiograph-Report Retrieval
Authors: Ahmad Elallaf, Yu Zhang, Yuktha Priya Masupalli, Jeong Yang, Young Lee, Zechun Cao, Gongbo Liang
Venue: WACV
First: 2026-02-17T21:20:32+00:00 · Latest: 2026-02-17T21:20:32+00:00
Comments: Accepted to the 2026 Winter Conference on Applications of Computer Vision (WACV) Workshops
Abstract
Vision-language foundation models have emerged as powerful general-purpose representation learners with strong potential for multimodal understanding, but their deterministic embeddings often fail to provide the reliability required for high-stakes biomedical applications. This work introduces MedProbCLIP, a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. MedProbCLIP models image and text representations as Gaussian embeddings through a probabilistic contrastive objective that explicitly captures uncertainty and many-to-many correspondences between radiographs and clinical narratives. A variational information bottleneck mitigates overconfident predictions, while MedProbCLIP employs multi-view radiograph encoding and multi-section report encoding during training to provide fine-grained supervision for clinically aligned correspondence, yet requires only a single radiograph and a single report at inference. Evaluated on the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines, including CLIP, CXR-CLIP, and PCME++, in both retrieval and zero-shot classification. Beyond accuracy, MedProbCLIP demonstrates superior calibration, risk-coverage behavior, selective retrieval reliability, and robustness to clinically relevant corruptions, underscoring the value of probabilistic vision-language modeling for improving the trustworthiness and safety of radiology image-text retrieval systems.
中文标题/摘要
标题:MedProbCLIP: 视觉-语言基础模型的概率适应方法以实现可靠的胸部X光片-报告检索
视觉-语言基础模型已经发展成为强大的通用表示学习者,具有强大的多模态理解潜力,但它们的确定性嵌入往往无法提供高风险生物医学应用所需的可靠性。这项工作引入了MedProbCLIP,这是一种概率视觉-语言学习框架,用于胸部X光片和放射学报告的表示学习和双向检索。MedProbCLIP通过概率对比目标将图像和文本表示建模为高斯嵌入,该目标明确捕捉不确定性以及放射学图像与临床叙述之间的多对多对应关系。变分信息瓶颈减轻了过于自信的预测,而MedProbCLIP在训练期间采用多视角X光片编码和多部分报告编码,为临床对齐的对应关系提供细粒度监督,但在推理时仅需要一张X光片和一份报告。MedProbCLIP在MIMIC-CXR数据集上的表现优于确定性和概率基线,包括CLIP、CXR-CLIP和PCME++,在检索和零样本分类方面均表现优异。除了准确性,MedProbCLIP还展示了更好的校准、风险-覆盖行为、选择性检索可靠性以及对临床相关干扰的鲁棒性,突显了概率视觉-语言建模对于提高放射学图像-文本检索系统的可靠性和安全性的重要性。
Summary / 总结
MedProbCLIP is a probabilistic vision-language learning framework for chest X-ray and radiology report representation learning and bidirectional retrieval. It uses a probabilistic contrastive objective to model images and text as Gaussian embeddings, capturing uncertainty and many-to-many correspondences. A variational information bottleneck mitigates overconfident predictions, and multi-view radiograph encoding and multi-section report encoding provide fine-grained supervision during training, while inference requires only a single radiograph and a single report. On the MIMIC-CXR dataset, MedProbCLIP outperforms deterministic and probabilistic baselines (including CLIP, CXR-CLIP, and PCME++) in retrieval and zero-shot classification, and shows superior calibration and robustness to clinically relevant corruptions.
MedProbCLIP 是一种概率视觉-语言学习框架,旨在可靠地检索胸部 X 光片和放射学报告。它使用概率对比目标将图像和文本建模为高斯嵌入,捕捉不确定性及多对多对应关系。在训练过程中,MedProbCLIP 使用变分信息瓶颈和多视图编码提供细粒度监督,但在推理时仅需一张 X 光片和一份报告。在 MIMIC-CXR 数据集上,MedProbCLIP 在检索和零样本分类中优于确定性和概率基线,显示出更好的校准和对临床相关干扰的鲁棒性。
BTReport: A Framework for Brain Tumor Radiology Report Generation with Clinically Relevant Features
Authors: Juampablo E. Heras Rivera, Dickson T. Chen, Tianyi Ren, Daniel K. Low, Asma Ben Abacha, Alberto Santamaria-Pang, Mehmet Kurt
First: 2026-02-17T20:55:00+00:00 · Latest: 2026-02-17T20:55:00+00:00
Comments: Accepted to Medical Imaging with Deep Learning (MIDL) 2026
Abstract
Recent advances in radiology report generation (RRG) have been driven by large paired image-text datasets; however, progress in neuro-oncology has been limited due to a lack of open paired image-report datasets. Here, we introduce BTReport, an open-source framework for brain tumor RRG that constructs natural language radiology reports using deterministically extracted imaging features. Unlike existing approaches that rely on large general-purpose or fine-tuned vision-language models for both image interpretation and report composition, BTReport performs deterministic feature extraction for image analysis and uses large language models only for syntactic structuring and narrative formatting. By separating RRG into a deterministic feature extraction step and a report generation step, the generated reports are completely interpretable and less prone to hallucinations. We show that the features used for report generation are predictive of key clinical outcomes, including survival and IDH mutation status, and reports generated by BTReport are more closely aligned with reference clinical reports than existing baselines for RRG. Finally, we introduce BTReport-BraTS, a companion dataset that augments BraTS imaging with synthetically generated radiology reports produced with BTReport. Code for this project can be found at https://github.com/KurtLabUW/BTReport.
中文标题/摘要
标题:BTReport:一种基于临床相关特征的大脑肿瘤放射学报告生成框架
近年来,放射学报告生成(RRG)的进步得益于大规模的成像-文本配对数据集;然而,由于缺乏开放的成像-报告配对数据集,神经肿瘤学的进步受到了限制。在此,我们介绍了BTReport,这是一个开源的大脑肿瘤RRG框架,该框架使用确定性提取的成像特征构建自然语言放射学报告。与现有依赖于大型通用或微调视觉-语言模型进行图像解释和报告编写的方案不同,BTReport 对图像进行确定性特征提取,并仅使用大型语言模型进行句法结构化和叙述格式化。通过将RRG分为确定性特征提取步骤和报告生成步骤,生成的报告完全可解释且较少出现幻觉。我们展示了用于报告生成的特征可以预测关键临床结果,包括生存率和IDH突变状态,并且由BTReport生成的报告比现有RRG基线更接近参考临床报告。最后,我们介绍了BTReport-BraTS,这是一个补充BraTS影像数据集的合成放射学报告数据集,这些报告是使用BTReport生成的。此项目的代码可以在 https://github.com/KurtLabUW/BTReport/ 找到。
Summary / 总结
BTReport is an open-source framework for brain tumor radiology report generation that uses deterministically extracted imaging features for image analysis and large language models only for syntactic structuring and narrative formatting. Separating feature extraction from report generation makes the reports interpretable and less prone to hallucinations. The extracted features are predictive of key clinical outcomes, including survival and IDH mutation status, and the generated reports align more closely with reference clinical reports than existing baselines. BTReport-BraTS, a companion dataset, augments BraTS imaging with radiology reports synthetically generated by BTReport.
BTReport 是一个用于脑肿瘤放射学报告生成的框架,通过确定性提取影像特征进行报告组成,仅使用大型语言模型进行语法结构和叙述格式化。这种方法生成的报告具有更高的可解释性,且较少出现幻觉。关键发现包括生成特征对临床结果(如生存率和IDH突变状态)具有预测能力,以及由BTReport生成的报告与参考临床报告相比更为一致。BTReport-BraTS 是一个配套数据集,为BraTS影像数据生成了合成的放射学报告。
Automated Re-Identification of Holstein-Friesian Cattle in Dense Crowds
Authors: Phoenix Yu, Tilo Burghardt, Andrew W Dowsey, Neill W Campbell
First: 2026-02-17T19:25:50+00:00 · Latest: 2026-02-17T19:25:50+00:00
Comments: 32 pages, 13 figures, 5 tables
Abstract
Holstein-Friesian detection and re-identification (Re-ID) methods capture individuals well when targets are spatially separate. However, existing approaches, including YOLO-based species detection, break down when cows group closely together. This is particularly prevalent for species with outline-breaking coat patterns. To boost both effectiveness and transferability in this setting, we propose a new detect-segment-identify pipeline that leverages the Open-Vocabulary Weight-free Localisation and the Segment Anything models as pre-processing stages alongside Re-ID networks. To evaluate our approach, we publish a collection of nine days of CCTV data filmed on a working dairy farm. Our methodology overcomes detection breakdown in dense animal groupings, achieving 98.93% accuracy. This significantly outperforms current oriented bounding box-driven and SAM species detection baselines, with accuracy improvements of 47.52% and 27.13%, respectively. We show that unsupervised contrastive learning can build on this to yield 94.82% Re-ID accuracy on our test data. Our work demonstrates that Re-ID in crowded scenarios is both practical and reliable in working farm settings with no manual intervention. Code and dataset are provided for reproducibility.
中文标题/摘要
标题:密集群体中荷斯坦-弗里斯兰奶牛的自动再识别
荷斯坦-弗里斯兰奶牛的检测和再识别(Re-ID)方法在目标空间分离时表现良好。然而,现有的方法,包括基于YOLO的物种检测,在奶牛紧密聚集时会失效。这在具有断开轮廓毛皮图案的物种中尤为明显。为了在这一场景中提高有效性和可移植性,我们提出了一种新的检测-分割-识别流水线,该流水线利用了Open-Vocabulary Weight-free Localisation和Segment Anything模型作为预处理阶段,同时结合了Re-ID网络。为了评估我们的方法,我们发布了一个在运营奶牛场拍摄的九天的闭路电视数据集。我们的方法克服了密集动物群组中的检测失效,准确率达到98.93%。这在当前基于定向边界框的方法以及SAM物种检测基线中分别提高了47.52%和27.13%的准确性。我们展示了无监督对比学习可以在此基础上实现94.82%的Re-ID准确率。我们的工作证明,在运营农场环境中,Re-ID在拥挤场景中既是可行的,又是可靠的,无需人工干预。我们提供了代码和数据集以实现可重复性。
Summary / 总结
The research aims to improve detection and re-identification (Re-ID) of Holstein-Friesian cattle in dense groups, where existing methods break down. The proposed detect-segment-identify pipeline uses the Open-Vocabulary Weight-free Localisation and Segment Anything models as pre-processing stages before Re-ID networks. On nine days of CCTV data from a working dairy farm, the pipeline reaches 98.93% accuracy in dense groupings, improving on oriented bounding box-driven and SAM species detection baselines by 47.52% and 27.13%, respectively. Building on this, unsupervised contrastive learning yields 94.82% Re-ID accuracy on the test data. This work demonstrates practical and reliable re-identification in crowded farm settings without manual intervention.
研究旨在改善密集群体中荷斯坦-弗里斯兰奶牛的再识别,现有方法在此场景下失效。提出了一种新的检测-分割-识别流水线,利用Open-Vocabulary Weight-free Localisation和Segment Anything模型作为预处理阶段。在来自运营奶牛场的九天闭路电视数据上进行评估,准确率达到98.93%,分别比基于定向边界框的基线和SAM物种检测基线高出47.52%和27.13%。在此基础上,无监督对比学习在测试数据上实现了94.82%的再识别准确率。这项工作证明了在无需人工干预的情况下,在运营农场环境中进行拥挤场景下的再识别是可行且可靠的。
Can Vision-Language Models See Squares? Text-Recognition Mediates Spatial Reasoning Across Three Model Families
Authors: Yuval Levental
First: 2026-02-17T19:06:19+00:00 · Latest: 2026-02-17T19:06:19+00:00
Comments: 9 pages, 3 figures, 2 tables. Workshop-length paper
Abstract
We present a simple experiment that exposes a fundamental limitation in vision-language models (VLMs): the inability to accurately localize filled cells in binary grids when those cells lack textual identity. We generate fifteen 15x15 grids with varying density (10.7%-41.8% filled cells) and render each as two image types -- text symbols (. and #) and filled squares without gridlines -- then ask three frontier VLMs (Claude Opus, ChatGPT 5.2, and Gemini 3 Thinking) to transcribe them. In the text-symbol condition, Claude and ChatGPT achieve approximately 91% cell accuracy and 84% F1, while Gemini achieves 84% accuracy and 63% F1. In the filled-squares condition, all three models collapse to 60-73% accuracy and 29-39% F1. Critically, all conditions pass through the same visual encoder -- the text symbols are images, not tokenized text. The text-vs-squares F1 gap ranges from 34 to 54 points across models, demonstrating that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway. Each model exhibits a distinct failure mode in the squares condition -- systematic under-counting (Claude), massive over-counting (ChatGPT), and template hallucination (Gemini) -- but all share the same underlying deficit: severely degraded spatial localization for non-textual visual elements.
中文标题/摘要
标题:视觉-语言模型能识别正方形吗?文本识别介导跨三种模型家族的空间推理
我们提出一个简单的实验,揭示了视觉-语言模型(VLMs)的一个根本性局限:无法准确定位二元网格中的填充单元格,尤其是这些单元格没有文本标识时。我们生成了十五个15x15的网格,密度从10.7%到41.8%不等,并以两种图像类型呈现——文本符号(. 和 #)和填充正方形(无网格线)。然后,我们让三个前沿的VLMs(Claude Opus、ChatGPT 5.2和Gemini 3 Thinking)进行转录。在文本符号条件下,Claude和ChatGPT分别达到约91%的单元格准确率和84%的F1值,而Gemini达到84%的准确率和63%的F1值。在填充正方形条件下,所有模型的准确率下降到60%-73%,F1值下降到29%-39%。关键的是,所有条件都通过相同的视觉编码器——文本符号是图像,而不是分词文本。文本与正方形的F1值差距在模型间从34到54分不等,表明VLMs的行为似乎具有高保真度的文本识别路径,其在空间推理中的表现远超其原生视觉路径。每个模型在正方形条件下表现出不同的失败模式——系统性低估(Claude)、巨大高估(ChatGPT)和模板幻觉(Gemini),但所有模型都共享相同的根本缺陷:非文本视觉元素的空间定位严重退化。
Summary / 总结
This study exposes a limitation in vision-language models (VLMs): they cannot accurately localize filled cells in binary grids when those cells lack textual identity. The author generated fifteen 15x15 grids with varying density (10.7%-41.8% filled cells), rendered each as text symbols and as filled squares without gridlines, and asked three frontier VLMs to transcribe them. In the text-symbol condition the models reach roughly 84-91% cell accuracy and 63-84% F1; in the filled-squares condition all three collapse to 60-73% accuracy and 29-39% F1. Because both conditions pass through the same visual encoder, the 34-54 point F1 gap suggests that VLMs behave as if they possess a high-fidelity text-recognition pathway for spatial reasoning that dramatically outperforms their native visual pathway.
研究揭示了视觉语言模型(VLMs)的一个关键局限性,即它们无法准确识别二元网格中没有文本标识的填充单元格。实验涉及将15x15网格以不同密度渲染为文本符号和填充方块,并要求三个领先的VLMs(Claude Opus、ChatGPT 5.2和Gemini 3 Thinking)进行转录。虽然文本符号的转录达到了较高的准确率和F1分数,但在填充方块条件下,模型的表现显著下降,表明它们依赖于文本识别来进行空间推理,而这种能力远超其视觉处理能力。
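The two reported metrics can diverge sharply on sparse grids, which a small scoring sketch makes concrete. The abstract does not spell out the metric definitions, so this assumes cell accuracy is the fraction of matching cells and F1 treats filled cells as the positive class. The function names and the 3x3 example are hypothetical.

```python
def parse_grid(text):
    """Parse a '.'/'#' transcription into rows of booleans (True = filled)."""
    return [[ch == "#" for ch in line] for line in text.strip().splitlines()]

def cell_metrics(truth, pred):
    """Cell accuracy and F1 over filled cells for two equally sized grids."""
    tp = fp = fn = tn = 0
    for t_row, p_row in zip(truth, pred):
        for t, p in zip(t_row, p_row):
            if t and p:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}

truth_text = "#..\n.#.\n..#"
pred_text = "#..\n...\n.##"  # one filled cell missed, one hallucinated
metrics = cell_metrics(parse_grid(truth_text), parse_grid(pred_text))
```

Here two wrong cells out of nine leave accuracy at 7/9 while F1 falls to 2/3, because the many correctly empty cells inflate accuracy but do not count toward F1. The same effect would explain why accuracy in the 60-73% range can coexist with F1 in the 29-39% range on grids where most cells are empty.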