arXiv 论文速递

2026-02-25 04:04
Snapshot: 20260225_0404
NovaPlan: Zero-Shot Long-Horizon Manipulation via Closed-Loop Video Language Planning
Authors: Jiahui Fu, Junyu Nan, Lingfeng Sun, Hongyu Li, Jianing Qian, Jennifer L. Barry, Kris Kitani, George Konidaris
First: 2026-02-23T18:35:18+00:00 · Latest: 2026-02-23T18:35:18+00:00
Comments: 25 pages, 15 figures. Project webpage: https://nova-plan.github.io/
Abstract
Solving long-horizon tasks requires robots to integrate high-level semantic reasoning with low-level physical interaction. While vision-language models (VLMs) and video generation models can decompose tasks and imagine outcomes, they often lack the physical grounding necessary for real-world execution. We introduce NovaPlan, a hierarchical framework that unifies closed-loop VLM and video planning with geometrically grounded robot execution for zero-shot long-horizon manipulation. At the high level, a VLM planner decomposes tasks into sub-goals and monitors robot execution in a closed loop, enabling the system to recover from single-step failures through autonomous re-planning. To compute low-level robot actions, we extract and utilize both task-relevant object keypoints and human hand poses as kinematic priors from the generated videos, and employ a switching mechanism to choose the better one as a reference for robot actions, maintaining stable execution even under heavy occlusion or depth inaccuracy. We demonstrate the effectiveness of NovaPlan on three long-horizon tasks and the Functional Manipulation Benchmark (FMB). Our results show that NovaPlan can perform complex assembly tasks and exhibit dexterous error recovery behaviors without any prior demonstrations or training. Project page: https://nova-plan.github.io/
中文标题/摘要
标题:NovaPlan:通过闭环视频语言规划实现零样本长时程操作
解决长时程任务需要机器人将高层次语义推理与低层次物理交互相结合。尽管视觉-语言模型(VLM)和视频生成模型可以分解任务并想象结果,但它们往往缺乏现实世界执行所需的物理基础。我们提出了NovaPlan,这是一种分层框架,将闭环VLM和视频规划与几何上接地的机器人执行统一起来,以实现零样本长时程操作。在高层次上,VLM规划器将任务分解为子目标,并在闭环中监控机器人执行,使系统能够通过自主重新规划从单步失败中恢复。为了计算低层次的机器人动作,我们从生成的视频中提取并利用与任务相关的物体关键点和人类手部姿态作为运动学先验,并采用切换机制选择更好的一个作为机器人动作的参考,即使在严重遮挡或深度不准确的情况下也能保持稳定的执行。我们在三个长时程任务和功能性操作基准(FMB)上展示了NovaPlan的有效性。我们的结果表明,NovaPlan可以在没有任何先验演示或训练的情况下执行复杂的装配任务并表现出灵巧的错误恢复行为。项目页面:https://nova-plan.github.io/
Summary / 总结
NovaPlan is a hierarchical framework that integrates closed-loop vision-language planning and geometrically grounded robot execution for zero-shot long-horizon manipulation. It decomposes tasks into sub-goals and monitors robot execution, allowing for autonomous re-planning in case of failures. To compute low-level actions, it uses task-relevant object keypoints and human hand poses as kinematic priors, switching between them based on their quality. NovaPlan demonstrates effectiveness on complex assembly tasks and the Functional Manipulation Benchmark without prior demonstrations or training.
NovaPlan 是一个层次框架,将视觉语言模型和视频规划与几何上接地的机器人执行相结合,用于零样本长时程操作。它将任务分解为子目标,并使用闭环机制监控和重新规划机器人动作,以从单步失败中恢复。为了计算低级动作,它从生成的视频中提取并利用物体关键点和人类手部姿势,并使用切换机制选择最佳参考用于机器人动作。NovaPlan 在复杂装配任务和错误恢复方面展示了有效性,无需任何先验演示或训练。
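The closed-loop idea in the NovaPlan abstract (decompose, execute, monitor, re-plan on failure) and the keypoint/hand-pose switch can be sketched roughly as follows. This is only an illustration: `plan`, `execute`, `max_replans`, and the score-based `pick_reference` rule are hypothetical stand-ins, not the authors' interfaces.

```python
from typing import Callable, List


def run_closed_loop(
    plan: Callable[[str, List[str]], List[str]],
    execute: Callable[[str], bool],
    task: str,
    max_replans: int = 3,
) -> List[str]:
    """Run sub-goals in order; after a failed step, ask the planner again."""
    done: List[str] = []
    queue = plan(task, done)          # hypothetical planner: returns remaining sub-goals
    replans = 0
    while queue:
        sub_goal = queue.pop(0)
        if execute(sub_goal):         # hypothetical monitor: True means the step succeeded
            done.append(sub_goal)
            continue
        if replans == max_replans:
            raise RuntimeError(f"gave up on {sub_goal!r}")
        replans += 1
        queue = plan(task, done)      # re-plan from the progress made so far
    return done


def pick_reference(keypoint_score: float, hand_score: float) -> str:
    """Assumed switching rule: follow whichever kinematic prior scores higher."""
    return "object_keypoints" if keypoint_score >= hand_score else "hand_pose"
```

The loop keeps the planner stateless: everything it needs to recover is the task string plus the list of completed sub-goals.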
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
Authors: Yunze Man, Shihao Wang, Guowen Zhang, Johan Bjorck, Zhiqi Li, Liang-Yan Gui, Jim Fan, Jan Kautz, Yu-Xiong Wang, Zhiding Yu
First: 2025-11-25T18:59:45+00:00 · Latest: 2026-02-23T17:59:38+00:00
Comments: Tech report. Project page: https://nvlabs.github.io/LocateAnything3D/
Abstract
To act in the world, a model must name what it sees and know where it is in 3D. Today's vision-language models (VLMs) excel at open-ended 2D description and grounding, yet multi-object 3D detection remains largely missing from the VLM toolbox. We present LocateAnything3D, a VLM-native recipe that casts 3D detection as a next-token prediction problem. The key is a short, explicit Chain-of-Sight (CoS) sequence that mirrors how humans reason from images: find an object in 2D, then infer its distance, size, and pose. The decoder first emits 2D detections as a visual chain-of-thought, then predicts 3D boxes under an easy-to-hard curriculum: across objects, a near-to-far order reduces early ambiguity and matches ego-centric utility; within each object, a center-from-camera, dimensions, and rotation factorization ranks information by stability and learnability. This VLM-native interface preserves open-vocabulary and visual-prompting capability without specialized heads. On the challenging Omni3D benchmark, our model achieves state-of-the-art results, with 38.90 AP_3D, surpassing the previous best by +13.98 absolute improvement even when the baseline is given ground-truth 2D boxes. It also generalizes zero-shot to held-out categories with strong robustness. By turning 3D detection into a disciplined next-token problem, LocateAnything3D offers a practical foundation for models to perceive in 3D.
中文标题/摘要
标题:LocateAnything3D:基于视线链的视觉-语言3D检测
为了在世界中行动,模型必须命名它所看到的并知道其在3D中的位置。今天的视觉-语言模型(VLM)在开放的2D描述和语义定位方面表现出色,但多对象3D检测仍然缺乏于VLM工具箱中。我们提出了LocateAnything3D,这是一种VLM原生的方法,将3D检测视为下一个标记预测问题。关键在于一个简短明确的视线链(CoS)序列,这反映了人类如何从图像中推理:先在2D中找到一个物体,然后推断其距离、大小和姿态。解码器首先以视觉链的方式发出2D检测,然后在容易到困难的课程中预测3D框:在不同物体之间,从近到远的顺序减少了早期的模糊性并匹配了以自我为中心的实用性;在每个物体内部,从相机中心、尺寸和旋转的分解按稳定性和可学习性排序信息。这种VLM原生的接口保留了开放词汇量和视觉提示的能力,而无需专门的头部。在具有挑战性的Omni3D基准测试中,我们的模型达到了最先进的结果,3D AP得分为38.90,超越了之前的最佳结果13.98的绝对改进,即使基线模型得到了真实2D框。它还以强大的鲁棒性在零样本情况下推广到未见过的类别。通过将3D检测转化为一个有纪律的下一个标记问题,LocateAnything3D为模型提供了一个感知3D的实用基础。
Summary / 总结
The research aims to enable vision-language models to perform multi-object 3D detection, a capability largely absent in current models. The method casts 3D detection as a next-token prediction problem using a Chain-of-Sight (CoS) sequence that mimics human reasoning. The model first generates 2D detections and then predicts 3D boxes in an easy-to-hard curriculum. On the Omni3D benchmark, the model achieves state-of-the-art results with 38.90 AP_3D, surpassing the previous best by 13.98 points even when the baseline is given ground-truth 2D boxes. It also demonstrates strong zero-shot generalization to new categories.
研究旨在使视觉语言模型能够执行多目标3D检测,目前这还是一个缺失的功能。方法是将3D检测视为下一个标记预测问题,使用链式视线(CoS)序列来模仿人类的推理过程。模型首先生成2D检测,然后在从易到难的课程中预测3D框。在Omni3D基准测试中,该模型取得了最先进的结果,AP_3D得分为38.90,比之前的最佳结果高出13.98分。此外,它还展示了对新类别的强零样本泛化能力。
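The Chain-of-Sight ordering described for LocateAnything3D (2D detections first, then 3D boxes sorted near-to-far, each factored into center, dimensions, and rotation) can be illustrated with a small serializer. The token strings, the `Detection` fields, and the use of Euclidean distance as the "near-to-far" key are assumptions made for this sketch; the paper's actual vocabulary is not given in the abstract.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    label: str
    box2d: Tuple[int, int, int, int]      # x1, y1, x2, y2 in pixels
    center: Tuple[float, float, float]    # camera-frame position
    dims: Tuple[float, float, float]      # box size
    yaw: float                            # rotation, radians


def chain_of_sight(dets: List[Detection]) -> List[str]:
    """Serialize detections: all 2D boxes, then per-object 3D fields, near to far."""
    ordered = sorted(dets, key=lambda d: math.dist((0.0, 0.0, 0.0), d.center))
    tokens = [f"2d:{d.label}:{','.join(map(str, d.box2d))}" for d in ordered]
    for d in ordered:
        # Within an object: center first, then dimensions, then rotation.
        tokens.append(f"center:{d.label}:" + ",".join(f"{v:.2f}" for v in d.center))
        tokens.append(f"dims:{d.label}:" + ",".join(f"{v:.2f}" for v in d.dims))
        tokens.append(f"rot:{d.label}:{d.yaw:.2f}")
    return tokens
```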
StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
Authors: Zanxi Ruan, Qiuyu Kong, Songqun Gao, Yiming Wang, Marco Cristani
Venue: CVPR 2026
First: 2026-02-23T17:57:37+00:00 · Latest: 2026-02-23T17:57:37+00:00
Comments: Accepted by CVPR 2026
Abstract
Edge-based representations are fundamental cues for visual understanding, a principle rooted in early vision research and still central today. We extend this principle to vision-language alignment, showing that isolating and aligning structural cues across modalities can greatly benefit fine-tuning on long, detail-rich captions, with a specific focus on improving cross-modal retrieval. We introduce StructXLIP, a fine-tuning alignment paradigm that extracts edge maps (e.g., Canny), treating them as proxies for the visual structure of an image, and filters the corresponding captions to emphasize structural cues, making them "structure-centric". Fine-tuning augments the standard alignment loss with three structure-centric losses: (i) aligning edge maps with structural text, (ii) matching local edge regions to textual chunks, and (iii) connecting edge maps to color images to prevent representation drift. From a theoretical standpoint, while standard CLIP maximizes the mutual information between visual and textual embeddings, StructXLIP additionally maximizes the mutual information between multimodal structural representations. This auxiliary optimization is intrinsically harder, guiding the model toward more robust and semantically stable minima, enhancing vision-language alignment. Beyond outperforming current competitors on cross-modal retrieval in both general and specialized domains, our method serves as a general boosting recipe that can be integrated into future approaches in a plug-and-play manner. Code and pretrained models are publicly available at: https://github.com/intelligolabs/StructXLIP.
中文标题/摘要
标题:StructXLIP:通过多模态结构线索增强视觉语言模型
基于边的表示是视觉理解的基本线索,这一原则根植于早期视觉研究,并且至今仍然至关重要。我们将其原则扩展到视觉语言对齐中,表明隔离和对齐跨模态的结构线索可以极大地提高对长、细节丰富的描述进行微调的效果,特别是改善跨模态检索。我们引入了StructXLIP,这是一种微调对齐范式,提取边缘图(例如Canny),将其视为图像视觉结构的代理,并过滤相应的描述以强调结构线索,使其“结构为中心”。微调通过增加三种结构为中心的损失来增强标准对齐损失:(i)将边缘图与结构文本对齐,(ii)匹配局部边缘区域与文本片段,(iii)将边缘图与颜色图像连接以防止表示漂移。从理论角度来看,虽然标准CLIP最大化视觉和文本嵌入之间的互信息,但StructXLIP还最大化了多模态结构表示之间的互信息。这种辅助优化本质上更难,引导模型向更稳健和语义稳定的极小值发展,增强视觉语言对齐。除了在通用和专门领域中均优于当前竞争对手的跨模态检索之外,我们的方法还提供了一种通用的增强配方,可以以即插即用的方式集成到未来的方案中。代码和预训练模型可在:https://github.com/intelligolabs/StructXLIP公开获取。
Summary / 总结
The research aims to enhance vision-language models by incorporating structural cues from edge-based representations, which are fundamental for visual understanding. The method, StructXLIP, introduces a fine-tuning alignment paradigm that extracts edge maps and filters captions to emphasize structural cues, aligning them with textual descriptions. Key experimental findings show that StructXLIP outperforms existing models on cross-modal retrieval tasks in both general and specialized domains, demonstrating its effectiveness as a general boosting technique for vision-language models.
研究旨在通过引入边缘图等结构线索来增强视觉语言模型,这些线索对于视觉理解至关重要。方法StructXLIP引入了一种微调对齐范式,提取边缘图并与描述进行对齐,强调结构线索。实验结果表明,StructXLIP在通用和专门领域上的跨模态检索任务中均优于现有模型,通过增加结构中心的损失来提高语义稳定性和鲁棒性。
Do Large Language Models Understand Data Visualization Principles?
Authors: Martin Sinnona, Valentin Bonas, Viviana Siless, Emmanuel Iarussi
First: 2026-02-23T17:51:06+00:00 · Latest: 2026-02-23T17:51:06+00:00
Abstract
Data visualization principles, derived from decades of research in design and perception, ensure proper visual communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they and their vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. Constraint-based systems encode these principles as logical rules for precise automated checks, but translating them into formal specifications demands expert knowledge. This motivates leveraging LLMs and VLMs as principle checkers that can reason about visual design directly, bypassing the need for symbolic rule specification. In this paper, we present the first systematic evaluation of both LLMs and VLMs on their ability to reason about visualization principles, using hard verification ground truth derived from Answer Set Programming (ASP). We compiled a set of visualization principles expressed as natural-language statements and generated a controlled dataset of approximately 2,000 Vega-Lite specifications annotated with explicit principle violations, complemented by over 300 real-world Vega-Lite charts. We evaluated both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our results highlight both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception. They also reveal an interesting asymmetry: frontier models tend to be more effective at correcting violations than at detecting them reliably.
中文标题/摘要
标题:大型语言模型理解数据可视化原则吗?
数据可视化原则源自数十年在设计和感知方面的研究,确保了视觉传达的正确性。尽管先前的研究表明大型语言模型(LLMs)能够生成图表或标记误导性图表,但尚不清楚它们及其视觉-语言对应物(VLMs)是否能够直接推理和执行可视化原则。基于约束的系统将这些原则编码为逻辑规则以进行精确的自动化检查,但将其转化为形式规范需要专家知识。这促使我们利用LLMs和VLMs作为原则检查器,可以直接推理视觉设计,而无需符号规则规范。在本文中,我们首次系统地评估了LLMs和VLMs在推理可视化原则方面的能力,使用来自回答集编程(ASP)的严格验证地面真相。我们编译了一组用自然语言表述的可视化原则,并生成了一个包含约2,000个Vega-Lite规范的数据集,这些规范被明确标注了原则违反情况,同时还包括了超过300个真实世界的Vega-Lite图表。我们评估了检查和修复任务,评估模型检测原则违反情况和纠正有缺陷的图表规范的能力。我们的工作突显了大型(视觉-)语言模型作为灵活的可视化设计验证器和编辑器的潜力,同时也揭示了与符号求解器在更复杂的视觉感知方面存在的持续差距。它们还揭示了一个有趣的不对称性:前沿模型在纠正违反情况方面比在可靠地检测它们方面更有效。
Summary / 总结
This paper evaluates whether large language models (LLMs) and vision-language models (VLMs) can reason about and enforce data visualization principles, using a dataset of Vega-Lite specifications annotated with principle violations. The study finds that while these models can effectively correct flawed chart specifications, they struggle to reliably detect principle violations, highlighting their potential as flexible validators and editors but also indicating a need for improvement in nuanced visual perception tasks.
本文评估了大型语言模型(LLMs)和视觉语言模型(VLMs)在理解和应用数据可视化原则方面的能力。研究受到无需专家知识进行符号规则指定的需求驱动,使用了约2,000个带有原则违反标注的Vega-Lite规范数据集和超过300个真实世界的图表。评估结果显示,尽管这些模型能够有效地纠正违反原则的情况,但在可靠地检测原则违反方面却存在困难,这既突显了它们作为灵活的验证器和编辑器的潜力,也揭示了与符号求解器在视觉感知的细腻方面相比存在的显著差距。
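To make the "checking" and "fixing" tasks concrete, here is a minimal rule-based sketch for one commonly cited principle (bar charts should keep a zero baseline on their quantitative axis). The choice of this principle and the function names are assumptions for illustration; the abstract does not list the principles the authors compiled. The only Vega-Lite detail relied on is the `scale.zero` property of an encoding channel.

```python
import copy
from typing import Any, Dict, List

Spec = Dict[str, Any]


def _mark_type(spec: Spec) -> Any:
    # Vega-Lite allows "mark": "bar" or "mark": {"type": "bar", ...}.
    mark = spec.get("mark")
    return mark.get("type") if isinstance(mark, dict) else mark


def check_bar_zero_baseline(spec: Spec) -> List[str]:
    """Return the channels of a bar chart whose quantitative scale drops zero."""
    if _mark_type(spec) != "bar":
        return []
    violations = []
    for channel in ("x", "y"):
        enc = spec.get("encoding", {}).get(channel, {})
        scale = enc.get("scale") or {}
        if enc.get("type") == "quantitative" and scale.get("zero") is False:
            violations.append(channel)
    return violations


def fix_bar_zero_baseline(spec: Spec) -> Spec:
    """Return a corrected copy; the input specification is left untouched."""
    fixed = copy.deepcopy(spec)
    for channel in check_bar_zero_baseline(spec):
        fixed["encoding"][channel]["scale"]["zero"] = True
    return fixed
```

A symbolic checker like this is exact but needs one hand-written rule per principle, which is the cost the paper proposes to avoid by asking LLMs and VLMs to do the checking directly.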
Training-Free Generative Modeling via Kernelized Stochastic Interpolants
Authors: Florentin Coeurdoux, Etienne Lempereur, Nathanaël Cuvelle-Magar, Thomas Eboli, Stéphane Mallat, Anastasia Borovykh, Eric Vanden-Eijnden
First: 2026-02-23T17:26:09+00:00 · Latest: 2026-02-23T17:26:09+00:00
Abstract
We develop a kernel method for generative modeling within the stochastic interpolant framework, replacing neural network training with linear systems. The drift of the generative SDE is $\hat b_t(x) = \nabla\varphi(x)^\top \eta_t$, where $\eta_t \in \mathbb{R}^P$ solves a $P\times P$ system computable from data, with $P$ independent of the data dimension $d$. Since estimates are inexact, the diffusion coefficient $D_t$ affects sample quality; the optimal $D_t^*$ from Girsanov diverges at $t=0$, but this poses no difficulty and we develop an integrator that handles it seamlessly. The framework accommodates diverse feature maps -- scattering transforms, pretrained generative models, etc. -- enabling training-free generation and model combination. We demonstrate the approach on financial time series, turbulence, and image generation.
中文标题/摘要
标题:基于核随机插值的无训练生成建模
我们提出了一种在随机插值框架内进行生成建模的核方法,用线性系统替代神经网络训练。生成SDE的漂移为$\hat b_t(x) = \nabla\varphi(x)^\top \eta_t$,其中$\eta_t \in \mathbb{R}^P$是可从数据中计算出的$P\times P$系统解,且$P$与数据维度$d$无关。由于估计不精确,扩散系数$D_t$影响样本质量;从Girsanov公式得出的最佳$D_t^*$在$t=0$时发散,但这并不构成问题,我们开发了一种积分器来无缝处理这一问题。该框架支持多种特征映射——散射变换、预训练生成模型等,实现无训练生成和模型组合。我们在金融时间序列、湍流和图像生成方面进行了演示。
Summary / 总结
This paper presents a kernel-based method for generative modeling using stochastic interpolants, eliminating the need for neural network training. The drift of the generative SDE is defined by a linear system computable from data, with a diffusion coefficient that affects sample quality. The approach is flexible, supporting various feature maps and enabling training-free generation and model combination. Experiments on financial time series, turbulence, and image generation show promising results.
该论文提出了一种使用随机插值的核方法进行生成建模,无需进行神经网络训练。生成SDE的漂移由可从数据计算的线性系统定义,扩散系数影响样本质量。该方法灵活,支持多种特征映射,实现无需训练的生成和模型组合。在金融时间序列、湍流和图像生成上的实验展示了有希望的结果。
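The drift formula in the abstract, the Jacobian of a feature map transposed and applied to a coefficient vector, can be evaluated in a few lines once a feature map is fixed. The sketch below assumes a random-Fourier-style map with $P$ cosine features (a choice made here for illustration, not one of the maps named in the abstract) and treats the coefficient vector `eta` as given; the $P\times P$ linear system that produces it is not reproduced.

```python
import numpy as np


def features(x: np.ndarray, W: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Assumed feature map: P cosine features of a d-dimensional point."""
    return np.cos(W @ x + c)


def drift(x: np.ndarray, W: np.ndarray, c: np.ndarray, eta: np.ndarray) -> np.ndarray:
    """Evaluate the Jacobian of the feature map, transposed, times eta."""
    jac = -np.sin(W @ x + c)[:, None] * W   # shape (P, d): row j is the gradient of feature j
    return jac.T @ eta                      # shape (d,)
```

Note that the cost of one drift evaluation scales with $P \cdot d$, while the size of the system defining `eta` depends on $P$ only, which matches the abstract's remark that $P$ is independent of $d$.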
HeatPrompt: Zero-Shot Vision-Language Modeling of Urban Heat Demand from Satellite Images
Authors: Kundan Thota, Xuanhao Mu, Thorsten Schlachter, Veit Hagenmeyer
First: 2026-02-23T17:22:54+00:00 · Latest: 2026-02-23T17:22:54+00:00
Abstract
Accurate heat-demand maps play a crucial role in decarbonizing space heating, yet most municipalities lack the detailed building-level data needed to calculate them. We introduce HeatPrompt, a zero-shot vision-language energy modeling framework that estimates annual heat demand using semantic features extracted from satellite images, basic Geographic Information System (GIS), and building-level features. We give pretrained Large Vision Language Models (VLMs) a domain-specific prompt to act as an energy planner and to extract, from the RGB satellite image, visual attributes that correspond to the thermal load, such as roof age and building density. A Multi-Layer Perceptron (MLP) regressor trained on these captions shows an $R^2$ uplift of 93.7% and shrinks the mean absolute error (MAE) by 30% compared to the baseline model. Qualitative analysis shows that high-impact tokens align with high-demand zones, offering lightweight support for heat planning in data-scarce regions.
中文标题/摘要
标题:HeatPrompt:基于卫星图像的零样本城市热需求视觉语言建模
准确的热需求地图在实现空间供暖去碳化方面起着关键作用,但大多数地方政府缺乏所需的详细建筑级数据来计算这些地图。我们提出了HeatPrompt,这是一种零样本视觉语言能源建模框架,利用从卫星图像、基本地理信息系统(GIS)和建筑级特征中提取的语义特征来估算年度热需求。我们通过领域特定的提示预训练大型视觉语言模型(VLMs),使其充当能源规划者,并从RGB卫星图像中提取与热负荷对应的视觉属性,如屋顶年龄、建筑密度等。基于这些描述的多层感知机(MLP)回归器在基线模型上显示出93.7%的$R^2$提升,并将平均绝对误差(MAE)减少了30%。定性分析表明,高影响标记与高需求区域对齐,在数据稀缺地区为热规划提供轻量级支持。
Summary / 总结
The research aims to address the lack of detailed building-level data for accurate heat-demand maps, which are essential for decarbonizing space heating. HeatPrompt, a zero-shot vision-language energy modeling framework, uses satellite images and basic GIS data to estimate annual heat demand. By leveraging pretrained VLMs with a domain-specific prompt, the system extracts semantic features from satellite images, and an MLP regressor improves the $R^2$ score by 93.7% and reduces the MAE by 30% compared to the baseline model. Qualitative analysis indicates that the system effectively identifies high-demand zones, providing support for heat planning in data-scarce regions.
研究旨在解决缺乏详细建筑级数据的问题,这些数据对于空间供暖的去碳化至关重要。HeatPrompt 是一种零样本视觉语言能源建模框架,利用卫星图像和基本的GIS数据来估算年度热需求。通过使用带有领域特定提示的预训练VLM提取相关视觉属性,并使用MLP回归器,该系统实现了93.7%的$R^2$提升和30%的MAE降低,相比基线模型显示出其在数据稀缺地区进行热规划的有效性。
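The HeatPrompt figures are relative changes in two standard regression metrics. The helper functions below show how such numbers are typically computed; reading the "$R^2$ uplift" as a relative change over the baseline $R^2$ is an assumption, since the abstract does not spell out the formula, and no values from the paper are reproduced here.

```python
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_change(new: float, base: float) -> float:
    """Signed relative change of a metric with respect to its baseline value."""
    return (new - base) / abs(base)
```

Under this reading, a large relative uplift in $R^2$ can coexist with a modest absolute $R^2$ when the baseline value is small, so both the baseline and the final score are needed to interpret the percentage.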
Training-Free Safe Denoisers for Safe Use of Diffusion Models
Authors: Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park
First: 2025-02-11T23:14:39+00:00 · Latest: 2026-02-23T17:09:14+00:00
Comments: NeurIPS 2025. Code: https://github.com/MingyuKim87/Safe_Denoiser
Abstract
There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content or generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by heavily relying on text-based negative prompts or extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints needed to be excluded) to avoid specific regions of data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not safe, leading to our $\textit{safe}$ denoiser which ensures its final samples are away from the area to be negated. Inspired by the derivation, we develop a practical algorithm that successfully produces high-quality samples while avoiding negation areas of the data distribution in text-conditional, class-conditional, and unconditional image generation scenarios. These results hint at the great potential of our training-free safe denoiser for using DMs more safely.
中文标题/摘要
标题:无需训练的安全去噪器以安全使用扩散模型
随着强大的扩散模型(DMs)的安全性问题日益引起关注,它们经常被误用以生成不适当、不适合工作的(NSFW)内容或生成受版权保护的材料或个人希望被遗忘的数据。许多现有方法通过大量依赖文本负面提示或重新训练DMs来消除某些特征或样本来解决这些问题。在本文中,我们采取了完全不同的方法,通过利用否定集(例如,不安全的图像、受版权保护的数据或需要排除的数据点)直接修改采样轨迹,以避免数据分布中的特定区域,而无需重新训练或微调DMs。我们正式推导了安全和不安全的去噪样本之间的关系,从而得到我们的“安全”去噪器,确保其最终样本远离需要否定的区域。受推导的启发,我们开发了一种实用算法,在文本条件、类别条件和无条件图像生成场景中成功生成高质量样本,同时避免数据分布的否定区域。这些结果暗示了我们无需训练的安全去噪器在更安全地使用DMs方面的巨大潜力。
Summary / 总结
This paper addresses the safety concerns of diffusion models by proposing a training-free safe denoiser that directly modifies the sampling trajectory using a negation set to avoid specific regions of the data distribution. The method derives the relationship between the expected denoised samples that are safe and those that are not, so that the final samples stay away from the area to be negated. Experiments show that high-quality samples are produced without retraining or fine-tuning the diffusion models, demonstrating the potential for safer use in text-conditional, class-conditional, and unconditional image generation.
该论文提出一种无需训练的安全去噪器来解决扩散模型的安全问题,该方法利用否定集直接修改采样轨迹,以避开数据分布中的特定区域,而无需重新训练或微调模型。该方法在文本条件、类别条件和无条件图像生成场景中得到验证,展示了其在生成高质量安全样本方面的有效性。
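One way to see why a relationship between "safe" and "unsafe" expected denoised samples exists at all is the law of total expectation. The short derivation below is a generic identity written in notation chosen here ($S$ for the event that the clean sample is safe, $S^c$ for the negation set); it is not claimed to be the exact form derived in the paper.

```latex
% Split the usual denoiser E[x_0 | x_t] by whether x_0 is safe (S) or not (S^c):
\mathbb{E}[x_0 \mid x_t]
  = P(S \mid x_t)\,\mathbb{E}[x_0 \mid x_t, S]
  + P(S^c \mid x_t)\,\mathbb{E}[x_0 \mid x_t, S^c].

% Solve for the safe denoiser, using P(S | x_t) = 1 - P(S^c | x_t):
\mathbb{E}[x_0 \mid x_t, S]
  = \frac{\mathbb{E}[x_0 \mid x_t] - P(S^c \mid x_t)\,\mathbb{E}[x_0 \mid x_t, S^c]}
         {1 - P(S^c \mid x_t)}.

% Equivalent "push away" form with \beta_t = P(S^c | x_t) / (1 - P(S^c | x_t)):
\mathbb{E}[x_0 \mid x_t, S]
  = \mathbb{E}[x_0 \mid x_t]
  + \beta_t \bigl(\mathbb{E}[x_0 \mid x_t] - \mathbb{E}[x_0 \mid x_t, S^c]\bigr).
```

In the last form the safe denoiser is the ordinary denoiser shifted away from the expected unsafe sample, with a weight that grows as the noisy point becomes more likely to have come from the negation set.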
Contextual Safety Reasoning and Grounding for Open-World Robots
Authors: Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar, George J. Pappas
First: 2026-02-23T15:51:23+00:00 · Latest: 2026-02-23T15:51:23+00:00
Abstract
Robots are increasingly operating in open-world environments where safe behavior depends on context: the same hallway may require different navigation strategies when crowded versus empty, or during an emergency versus normal operations. Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment. We address this gap via CORE, a safety framework that enables online contextual reasoning, grounding, and enforcement without prior knowledge of the environment (e.g., maps or safety specifications). CORE uses a vision-language model (VLM) to continuously reason about context-dependent safety rules directly from visual observations, grounds these rules in the physical environment, and enforces the resulting spatially-defined safe sets via control barrier functions. We provide probabilistic safety guarantees for CORE that account for perceptual uncertainty, and we demonstrate through simulation and real-world experiments that CORE enforces contextually appropriate behavior in unseen environments, significantly outperforming prior semantic safety methods that lack online contextual reasoning. Ablation studies validate our theoretical guarantees and underscore the importance of both VLM-based reasoning and spatial grounding for enforcing contextual safety in novel settings. We provide additional resources at https://zacravichandran.github.io/CORE.
中文标题/摘要
标题:开放世界机器人的情境安全推理与接地
机器人越来越多地在开放世界环境中操作,其安全行为依赖于情境:同一走廊在拥挤和空旷时需要不同的导航策略,或者在紧急情况和正常操作时需要不同的策略。传统安全方法在用户指定的情境下强制执行固定的约束,限制了它们处理实际部署中开放的情境变化的能力。我们通过CORE,一种安全框架,解决了这一缺口,该框架能够在没有先验环境知识(例如地图或安全规范)的情况下进行在线的情境推理、接地和执行。CORE 使用视觉语言模型(VLM)从视觉观察中连续推理情境相关的安全规则,将这些规则与物理环境对接,并通过控制障碍函数实现空间定义的安全集的执行。我们为CORE提供了考虑感知不确定性的情境概率安全保证,并通过仿真和实际实验表明,CORE能够在未见过的环境中执行适当的情境行为,显著优于缺乏在线情境推理的先前语义安全方法。消融研究验证了我们的理论保证,并强调了基于VLM的推理和空间对接对于在新环境中执行情境安全的重要性。有关更多信息,请参见https://zacravichandran.github.io/CORE。
Summary / 总结
The paper addresses the challenge of ensuring safe behavior in open-world environments for robots, where safety depends on context. It introduces CORE, a safety framework that uses a vision-language model to reason about context-dependent safety rules from visual observations, ground these rules in the physical environment, and enforce them via control barrier functions. CORE provides probabilistic safety guarantees and outperforms previous methods in enforcing contextually appropriate behavior in unseen environments through both simulation and real-world experiments. Ablation studies confirm the importance of VLM-based reasoning and spatial grounding for enforcing contextual safety in novel settings.
论文旨在解决机器人在开放环境中的安全行为问题,其中安全取决于上下文。提出了一种名为CORE的安全框架,该框架利用视觉语言模型从视觉观察中推理上下文相关的安全规则,将这些规则与物理环境对接,并通过控制屏障函数进行执行。CORE提供了概率安全保证,并通过仿真和真实世界实验在未见过的环境中表现出色,优于缺乏上下文推理的先前语义安全方法。消融研究证实了基于视觉语言模型的推理和空间对接对于在新环境中执行上下文安全的重要性。
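The enforcement step in CORE relies on control barrier functions (CBFs). As a rough illustration of what a CBF safety filter does, the sketch below handles the simplest case: a single-integrator robot that must stay outside one circular region. The disc, the gain `alpha`, and the function name are hypothetical; in CORE the safe sets come from VLM reasoning and spatial grounding, which is not modelled here.

```python
import numpy as np


def cbf_filter(
    x: np.ndarray,
    u_nom: np.ndarray,
    center: np.ndarray,
    radius: float,
    alpha: float = 1.0,
) -> np.ndarray:
    """Minimally modify u_nom so that grad_h . u + alpha * h >= 0 holds.

    Barrier: h(x) = ||x - center||^2 - radius^2, non-negative outside the disc.
    Dynamics: single integrator, x_dot = u.
    """
    diff = x - center
    h = float(diff @ diff - radius ** 2)
    grad_h = 2.0 * diff
    slack = float(grad_h @ u_nom + alpha * h)
    if slack >= 0.0:
        return u_nom                        # nominal command already satisfies the constraint
    norm_sq = float(grad_h @ grad_h)
    if norm_sq == 0.0:
        return np.zeros_like(u_nom)         # degenerate case at the disc centre: stop
    # Closed-form solution of the one-constraint quadratic program.
    return u_nom - (slack / norm_sq) * grad_h
```

With a single affine constraint the usual quadratic program reduces to a projection, which is why no solver is needed in this toy case.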
A Replicate-and-Quantize Strategy for Plug-and-Play Load Balancing of Sparse Mixture-of-Experts LLMs
Authors: Zijie Liu, Jie Peng, Jinhao Duan, Zirui Liu, Kaixiong Zhou, Mingfu Liang, Luke Simon, Xi Liu, Zhaozhuo Xu, Tianlong Chen
First: 2026-02-23T15:11:16+00:00 · Latest: 2026-02-23T15:11:16+00:00
Abstract
Sparse Mixture-of-Experts (SMoE) architectures are increasingly used to scale large language models efficiently, delivering strong accuracy under fixed compute budgets. However, SMoE models often suffer from severe load imbalance across experts, where a small subset of experts receives most tokens while others are underutilized. Prior work has focused mainly on training-time solutions such as routing regularization or auxiliary losses, leaving inference-time behavior, which is critical for deployment, less explored. We present a systematic analysis of expert routing during inference and identify three findings: (i) load imbalance persists and worsens with larger batch sizes, (ii) selection frequency does not reliably reflect expert importance, and (iii) overall expert workload and importance can be estimated using a small calibration set. These insights motivate inference-time mechanisms that rebalance workloads without retraining or router modification. We propose Replicate-and-Quantize (R&Q), a training-free and near-lossless framework for dynamic workload rebalancing. In each layer, heavy-hitter experts are replicated to increase parallel capacity, while less critical experts and replicas are quantized to remain within the original memory budget. We also introduce a Load-Imbalance Score (LIS) to measure routing skew by comparing heavy-hitter load to an equal allocation baseline. Experiments across representative SMoE models and benchmarks show up to 1.4x reduction in imbalance with accuracy maintained within +/-0.6%, enabling more predictable and efficient inference.
中文标题/摘要
标题:一种复制和量化策略用于插件式稀疏混合专家LLM负载均衡
稀疏混合专家(SMoE)架构被广泛用于高效扩展大型语言模型,能够在固定计算预算下提供强大的准确性。然而,SMoE模型经常面临严重的专家负载不平衡问题,其中一小部分专家接收大部分令牌,而其他专家则未被充分利用。先前的工作主要集中在训练时的解决方案,如路由正则化或辅助损失,而对部署至关重要的推理时行为关注较少。我们对推理时的专家路由进行了系统分析,并得出三点发现:(i) 负载不平衡持续存在,并随批次大小增大而加剧,(ii) 选择频率不能可靠地反映专家的重要性,(iii) 整体专家工作量和重要性可以通过一个小的校准集来估计。这些见解促使我们开发无需重新训练或修改路由器的推理时机制,以重新平衡工作量。我们提出了复制和量化(R&Q)框架,这是一种无需训练且几乎无损的动态工作量重新平衡框架。在每一层中,复制负载最重的专家以增加并行容量,而较不关键的专家和复制体则进行量化以保持在原始内存预算内。我们还引入了负载不平衡分数(LIS),通过将重负载专家的负载与等分配基准进行比较来衡量路由偏差。在代表性SMoE模型和基准上的实验表明,不平衡最多可减少1.4倍,同时准确率保持在±0.6%以内,从而实现更可预测和高效的推理。
Summary / 总结
The paper addresses the issue of load imbalance in Sparse Mixture-of-Experts (SMoE) models during inference, proposing a Replicate-and-Quantize (R&Q) strategy to rebalance workloads without retraining. The authors find that load imbalance increases with larger batch sizes, selection frequency does not reliably reflect expert importance, and overall workload can be estimated using a small calibration set. The R&Q method replicates heavy-hitter experts to increase parallel capacity and quantizes less critical experts to maintain memory budget, achieving up to 1.4x reduction in imbalance with minimal accuracy loss.
论文研究了稀疏混合专家(SMoE)模型在推理过程中存在的负载不平衡问题,提出了一种名为Replicate-and-Quantize (R&Q)的策略来动态重新平衡工作负载。研究发现,随着批量大小的增加,负载不平衡会加剧,选择频率并不能准确反映专家的重要性,而整体工作负载可以通过一个小的校准集进行估算。R&Q框架通过复制关键专家来增加并行容量,并通过量化次要专家来保持内存预算,实现了高达1.4倍的负载不平衡减少,并且几乎没有准确率损失。
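A toy calculation helps show how replication reduces routing skew. The score below (heaviest load divided by the equal-allocation share) is a hypothetical stand-in for the paper's Load-Imbalance Score, whose exact definition the abstract does not give, and the replication rule simply splits each heavy hitter's tokens evenly across two copies. Quantization of the remaining experts, which keeps memory within budget, is not modelled.

```python
from typing import List, Sequence


def load_imbalance_score(token_counts: Sequence[float]) -> float:
    """Assumed metric: heaviest slot load relative to a perfectly equal split."""
    equal_share = sum(token_counts) / len(token_counts)
    return max(token_counts) / equal_share


def replicate_heavy_hitters(token_counts: Sequence[float], top_k: int) -> List[float]:
    """Duplicate the top_k busiest experts; each copy serves half of the tokens."""
    ranked = sorted(range(len(token_counts)), key=lambda i: token_counts[i], reverse=True)
    heavy = set(ranked[:top_k])
    slots: List[float] = []
    for i, count in enumerate(token_counts):
        if i in heavy:
            slots.extend([count / 2, count / 2])
        else:
            slots.append(float(count))
    return slots
```

The token counts would come from a small calibration set, in line with the paper's finding that such a set suffices to estimate expert workload.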
Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery
Authors: Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li
Venue: CVPR 2026
First: 2026-02-23T14:51:09+00:00 · Latest: 2026-02-23T14:51:09+00:00
Comments: 15 pages, accepted by CVPR 2026
Abstract
Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for the GCD task are usually built on multi-modality representation learning, which depends heavily on inter-modality alignment. However, few of them impose a proper intra-modality alignment to generate a desired underlying structure of representation distributions. In this paper, we propose a novel and effective multi-modal representation learning framework for GCD via Semi-Supervised Rate Reduction, called SSR$^2$-GCD, which learns cross-modality representations with desired structural properties by emphasizing the proper alignment of intra-modality relationships. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets, demonstrating the superior performance of our approach.
中文标题/摘要
标题:基于半监督率减少的多模态表示学习以实现泛化类别发现
泛化类别发现(GCD)旨在识别已知和未知类别,仅给定已知类别的部分标签,这构成了一个具有挑战性的开放集识别问题。当前GCD任务的先进方法通常基于多模态表示学习,这高度依赖于跨模态对齐。然而,很少有方法能够适当对齐内模态关系以生成所需的表示分布结构。在本文中,我们提出了一种新的半监督率减少的多模态表示学习框架SSR$^2$-GCD,以基于强调适当对齐内模态关系来学习具有所需结构特性的跨模态表示。此外,为了增强知识迁移,我们通过利用视觉语言模型提供的跨模态对齐来整合提示候选。我们在通用和细粒度基准数据集上进行了广泛的实验,证明了我们方法的优越性能。
Summary / 总结
The paper addresses the challenge of Generalized Category Discovery (GCD) by proposing SSR$^2$-GCD, a multi-modal representation learning framework that emphasizes intra-modality alignment to generate structured representation distributions. By integrating prompt candidates from Vision Language Models, the approach enhances knowledge transfer. Experiments on generic and fine-grained datasets show superior performance compared to state-of-the-art methods.
论文提出了SSR$^2$-GCD框架,这是一种多模态表示学习方法,旨在解决泛化类别发现(GCD)问题。该框架强调内模态关系的对齐,并通过视觉语言模型提供的跨模态对齐整合提示候选,以增强知识迁移。在基准数据集上的实验结果表明,SSR$^2$-GCD在通用和细粒度类别中均优于现有方法。
Make Some Noise: Unsupervised Remote Sensing Change Detection Using Latent Space Perturbations
Authors: Blaž Rolih, Matic Fučka, Filip Wolf, Luka Čehovin Zajc
First: 2026-02-23T14:27:36+00:00 · Latest: 2026-02-23T14:27:36+00:00
Abstract
Unsupervised change detection (UCD) in remote sensing aims to localise semantic changes between two images of the same region without relying on labelled data during training. Most recent approaches rely either on frozen foundation models in a training-free manner or on training with synthetic changes generated in pixel space. Both strategies inherently rely on predefined assumptions about change types, typically introduced through handcrafted rules, external datasets, or auxiliary generative models. Due to these assumptions, such methods fail to generalise beyond a few change types, limiting their real-world usage, especially in rare or complex scenarios. To address this, we propose MaSoN (Make Some Noise), an end-to-end UCD framework that synthesises diverse changes directly in the latent feature space during training. It generates changes that are dynamically estimated using feature statistics of target data, enabling diverse yet data-driven variation aligned with the target domain. It also easily extends to new modalities, such as SAR. MaSoN generalises strongly across diverse change types and achieves state-of-the-art performance on five benchmarks, improving the average F1 score by 14.1 percentage points. Project page: https://blaz-r.github.io/mason_ucd
中文标题/摘要
标题:制造一些噪音:使用潜在空间扰动的无监督遥感变化检测
遥感中的无监督变化检测(UCD)旨在在无需在训练过程中使用标记数据的情况下,定位同一区域的两幅图像之间的语义变化。大多数最近的方法要么以无需训练的方式使用冻结的基础模型,要么使用在像素空间生成的合成变化进行训练。这两种策略都不可避免地依赖于对变化类型的预设假设,通常通过手工规则、外部数据集或辅助生成模型引入。由于这些假设,此类方法无法泛化到少数几种变化类型之外,限制了它们在现实世界中的应用,尤其是在罕见或复杂场景中的应用。为了解决这个问题,我们提出了MaSoN(制造一些噪音),这是一种端到端的UCD框架,在训练过程中直接在潜在特征空间中合成多样的变化。它使用目标数据的特征统计动态估计变化,从而实现与目标领域对齐的多样化且数据驱动的变化。它还很容易扩展到新的模态,如SAR。MaSoN在多种变化类型上表现出强大的泛化能力,并在五个基准测试中实现了最先进的性能,平均F1分数提高了14.1个百分点。项目页面:https://blaz-r.github.io/mason_ucd
Summary / 总结
The research aims to improve unsupervised change detection in remote sensing by addressing the limitations of existing methods that rely on predefined assumptions about change types. MaSoN (Make Some Noise) is proposed as an end-to-end framework that synthesizes diverse changes directly in the latent feature space, enabling data-driven variation aligned with the target domain. MaSoN outperforms existing methods, achieving a 14.1 percentage point improvement in average F1 score across five benchmarks.
论文提出了MaSoN(Make Some Noise)框架,该框架直接在潜在特征空间中生成多样化的变化,避免了对变化类型的预设假设,而是使用目标数据的特征统计来动态估计变化,从而在各种变化类型上表现出更强的泛化能力。MaSoN在五个基准测试中表现优于现有方法,平均F1分数提高了14.1个百分点。
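The core idea attributed to MaSoN, synthesising changes in latent space with noise scaled by statistics of the target features, can be sketched in a few lines. Everything concrete here is an assumption for illustration: Gaussian noise, per-channel standard deviation as the statistic, and a binary mask that doubles as the pseudo change label. The abstract does not specify the actual perturbation scheme.

```python
import numpy as np


def perturb_latents(
    feats: np.ndarray,
    mask: np.ndarray,
    rng: np.random.Generator,
    scale: float = 1.0,
) -> np.ndarray:
    """Inject noise into a (C, H, W) feature map inside a boolean (H, W) mask.

    The noise amplitude follows the per-channel spread of the features themselves,
    so the synthetic change stays on the scale of the target data.
    """
    std = feats.std(axis=(1, 2), keepdims=True)            # shape (C, 1, 1)
    noise = rng.standard_normal(feats.shape) * std * scale
    return feats + noise * mask[None]                       # unchanged outside the mask
```

Training a change head on pairs of original and perturbed features, with the mask as the target, would then need no labelled changes.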
ApET: Approximation-Error Guided Token Compression for Efficient VLMs
Authors: Qiankun Ma, Ziyao Zhang, Haofei Wang, Jie Chen, Zhen Song, Hairong Zheng
First: 2026-02-23T14:15:37+00:00 · Latest: 2026-02-23T14:15:37+00:00
Comments: CVPR 2026
Abstract
Recent Vision-Language Models (VLMs) have demonstrated remarkable multimodal understanding capabilities, yet the redundant visual tokens incur prohibitive computational overhead and degrade inference efficiency. Prior studies typically rely on [CLS] attention or text-vision cross-attention to identify and discard redundant visual tokens. Despite promising results, such solutions are prone to introducing positional bias and, more critically, are incompatible with efficient attention kernels such as FlashAttention, limiting their practical deployment for VLM acceleration. In this paper, we step away from attention dependencies and revisit visual token compression from an information-theoretic perspective, aiming to maximally preserve visual information without any attention involvement. We present ApET, an Approximation-Error guided Token compression framework. ApET first reconstructs the original visual tokens with a small set of basis tokens via linear approximation, then leverages the approximation error to identify and drop the least informative tokens. Extensive experiments across multiple VLMs and benchmarks demonstrate that ApET retains 95.2% of the original performance on image-understanding tasks and even attains 100.4% on video-understanding tasks, while compressing the token budgets by 88.9% and 87.5%, respectively. Thanks to its attention-free design, ApET seamlessly integrates with FlashAttention, enabling further inference acceleration and making VLM deployment more practical. Code is available at https://github.com/MaQianKun0/ApET.
中文标题/摘要
标题:ApET:基于近似误差的标记压缩框架以提高高效VLMs
近期的视觉-语言模型(VLMs)展示了显著的多模态理解能力,但冗余的视觉标记导致了巨大的计算开销并降低了推理效率。先前的研究通常依赖于[CLS]注意或文本-视觉交叉注意来识别并丢弃冗余的视觉标记。尽管取得了有希望的结果,但这些解决方案容易引入位置偏差,并且与高效的注意力内核(如FlashAttention)不兼容,限制了它们在VLM加速中的实际部署。本文中,我们从信息论的角度出发,不再依赖注意力机制,旨在在没有任何注意力参与的情况下最大限度地保留视觉信息。我们提出了ApET,一种基于近似误差的标记压缩框架。ApET首先通过线性逼近用少量的基础标记重建原始的视觉标记,然后利用近似误差来识别并丢弃最不信息性的标记。在多个VLMs和基准测试中的广泛实验表明,ApET在图像理解任务中保留了95.2%的原始性能,在视频理解任务中甚至达到了100.4%,同时分别压缩了88.9%和87.5%的标记预算。由于其无注意力设计,ApET可以无缝集成到FlashAttention中,进一步加速推理并使VLM部署更加实际。代码可在https://github.com/MaQianKun0/ApET/获取。
Summary / 总结
This paper addresses the issue of computational overhead in Vision-Language Models (VLMs) caused by redundant visual tokens. It introduces ApET, an Approximation-Error guided Token compression framework that reconstructs original visual tokens using a small set of basis tokens and then drops the least informative tokens based on approximation error. Experiments show that ApET retains 95.2% of original performance on image-understanding tasks and 100.4% on video-understanding tasks while compressing token budgets by 88.9% and 87.5%, respectively, and integrates seamlessly with FlashAttention for further acceleration.
论文针对Vision-Language Models (VLMs)中冗余视觉令牌导致的高计算开销和推理效率降低的问题,提出了ApET,一种基于逼近误差的令牌压缩框架。该框架通过线性逼近重建视觉令牌,然后根据逼近误差移除最不相关信息的令牌。实验表明,ApET在图像理解任务中保持了95.2%的原始性能,在视频理解任务中甚至提高了0.4%的性能,同时将令牌预算分别减少了88.9%和87.5%。由于其无注意力设计,ApET可以无缝集成到FlashAttention中,进一步提高推理速度并使VLM部署更加实用。
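The two ApET steps described above (linear approximation with a few basis tokens, then error-guided dropping) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names are hypothetical, the basis tokens are chosen by uniform stride purely for illustration, and the reading that a token with small approximation error is the "least informative" one is an assumption, since the abstract does not spell out either choice.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def approximation_errors(tokens, basis_idx):
    """Residual norm of each token after least-squares projection onto span(basis tokens)."""
    # Gram-Schmidt: build an orthonormal basis for the span of the chosen basis tokens.
    ortho = []
    for i in basis_idx:
        v = list(tokens[i])
        for q in ortho:
            c = _dot(v, q)
            v = [x - c * y for x, y in zip(v, q)]
        norm = math.sqrt(_dot(v, v))
        if norm > 1e-9:  # skip basis tokens that are linearly dependent on earlier ones
            ortho.append([x / norm for x in v])
    # The projection residual is the error of the best linear approximation of each token.
    errors = []
    for t in tokens:
        r = list(t)
        for q in ortho:
            c = _dot(r, q)
            r = [x - c * y for x, y in zip(r, q)]
        errors.append(math.sqrt(_dot(r, r)))
    return errors


def compress(tokens, num_basis, keep):
    """Return the sorted indices of the tokens that are kept."""
    # Assumption: basis tokens are picked by uniform stride (illustrative only).
    stride = max(1, len(tokens) // num_basis)
    basis_idx = list(range(0, len(tokens), stride))[:num_basis]
    errors = approximation_errors(tokens, basis_idx)
    # Assumption: tokens the basis already reconstructs well (small error) are
    # treated as least informative, so the largest-error tokens are kept first.
    basis_set = set(basis_idx)
    candidates = [i for i in range(len(tokens)) if i not in basis_set]
    candidates.sort(key=lambda i: errors[i], reverse=True)
    extra = candidates[:max(0, keep - len(basis_idx))]
    return sorted(basis_idx + extra)
```

Because the selection uses only token values and no attention maps, a sketch like this stays compatible with fused attention kernels, which is the property the abstract highlights.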
Open-vocabulary 3D scene perception in industrial environments
Authors: Keno Moenck, Adrian Philip Florea, Julian Koch, Thorsten Schüppstuhl
First: 2026-02-23T13:22:51+00:00 · Latest: 2026-02-23T13:22:51+00:00
Abstract
Autonomous vision applications in production, intralogistics, or manufacturing environments require perception capabilities beyond a small, fixed set of classes. Recent open-vocabulary methods, leveraging 2D Vision-Language Foundation Models (VLFMs), target this task but often rely on class-agnostic segmentation models pre-trained on non-industrial datasets (e.g., household scenes). In this work, we first demonstrate that such models fail to generalize, performing poorly on common industrial objects. Therefore, we propose a training-free, open-vocabulary 3D perception pipeline that overcomes this limitation. Instead of using a pre-trained model to generate instance proposals, our method simply generates masks by merging pre-computed superpoints based on their semantic features. We then evaluate the domain-adapted VLFM "IndustrialCLIP" on a representative 3D industrial workshop scene for open-vocabulary querying. Our qualitative results demonstrate successful segmentation of industrial objects.
中文标题/摘要
标题:工业环境中的开放词汇3D场景感知
生产、仓储物流或制造环境中自主视觉应用需要超越小型固定类别集的感知能力。最近的开放词汇方法利用2D视觉-语言基础模型(VLFM)针对此任务,但通常依赖于在非工业数据集(如家庭场景)上预训练的类别无关分割模型。在本文中,我们首先证明了这些模型无法泛化,在常见工业物体上表现不佳。因此,我们提出了一种无需训练的开放词汇3D感知管道,克服了这一限制。我们的方法不使用预训练模型生成实例提案,而是通过合并基于语义特征的超点生成掩码。随后,我们在一个代表性的3D工业车间场景上评估了领域适应的VLFM“IndustrialCLIP”用于开放词汇查询。我们的定性结果表明成功分割了工业物体。
Summary / 总结
This work addresses open-vocabulary 3D scene perception in industrial environments, where existing methods that rely on class-agnostic segmentation models pre-trained on non-industrial datasets perform poorly on common industrial objects. The authors propose a training-free pipeline that generates masks by merging pre-computed superpoints based on their semantic features, rather than using a pre-trained model to generate instance proposals. A qualitative evaluation with the domain-adapted VLFM "IndustrialCLIP" on a representative 3D industrial workshop scene shows successful segmentation of industrial objects.
该研究旨在解决工业环境中自主视觉系统需要识别超出固定类别范围的广泛物体的需求。研究发现,现有的开放词汇方法依赖于在非工业数据集上预训练的类无差别模型,无法在工业环境中很好地泛化。为解决这一问题,作者提出了一种无需训练的方法,通过基于语义特征的预计算超点生成掩码,实现了对工业物体的成功分割。该方法使用领域适应的VLFM 'IndustrialCLIP' 在一个典型的3D工业车间场景中进行评估,展示了在开放词汇查询中的良好表现。
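The idea of merging pre-computed superpoints by semantic similarity can be sketched with a union-find structure. This is only an illustration under stated assumptions: the adjacency list, the cosine-similarity criterion, the 0.9 threshold, and the function names are all hypothetical, as the abstract does not describe the actual merging rule.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def merge_superpoints(features, adjacency, threshold=0.9):
    """Group superpoints into masks.

    features:  one semantic feature vector per superpoint
    adjacency: (i, j) index pairs of spatially neighbouring superpoints (assumed input)
    threshold: hypothetical cosine-similarity cut-off for merging
    """
    parent = list(range(len(features)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in adjacency:
        if cosine(features[i], features[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(features)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Each returned group is a list of superpoint indices that would form one instance mask; no segmentation network is involved, which matches the training-free framing.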
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding
Authors: Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo
Venue: ICLR 2026
First: 2025-05-23T11:48:48+00:00 · Latest: 2026-02-23T12:51:16+00:00
Abstract
Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with operators, noise, and anatomical structures. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.
中文标题/摘要
标题:U2-BENCH:在超声理解方面评估大型视觉-语言模型
超声是一种广泛使用的成像技术,在全球医疗保健中至关重要,但由于操作者、噪声和解剖结构的差异导致其解释具有挑战性。尽管大型视觉-语言模型(LVLMs)在自然和医学领域展示了令人印象深刻的多模态能力,但它们在超声方面的表现尚未得到充分探索。我们介绍了U2-BENCH,这是首个全面评估LVLMs在超声理解方面的基准,涵盖分类、检测、回归和文本生成任务。U2-BENCH汇集了15个解剖区域的7,241个病例,并定义了8个临床启发的任务,如诊断、视图识别、病灶定位、临床价值评估和报告生成,覆盖了50个超声应用场景。我们评估了23个最先进的LVLMs,包括开源和闭源、通用和医学专用模型。我们的结果显示了在图像级分类上的强大性能,但在空间推理和临床语言生成方面仍存在持续挑战。U2-BENCH为医学超声成像这一独特多模态领域中的LVLM研究提供了一个严格的统一测试平台。
Summary / 总结
U2-BENCH is the first benchmark to evaluate large vision-language models (LVLMs) on ultrasound understanding, covering classification, detection, regression, and text generation tasks. It includes 7,241 cases from 15 anatomical regions and 50 ultrasound scenarios, assessing 23 state-of-the-art LVLMs. The results show strong performance in image-level classification but highlight challenges in spatial reasoning and clinical language generation.
U2-BENCH 是首个用于评估大型视觉-语言模型(LVLM)在超声理解上的基准,涵盖分类、检测、回归和文本生成任务。它包含来自15个解剖区域的7,241个病例和50个超声场景,定义了8个临床导向的任务。研究评估了23个最先进的LVLM,并发现其在图像级分类上有较强表现,但在空间推理和临床语言生成方面存在挑战。U2-BENCH 为推进医疗超声成像领域的LVLM研究提供了严格的测试平台。
TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
Authors: Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang
First: 2026-02-23T12:18:26+00:00 · Latest: 2026-02-23T12:18:26+00:00
Abstract
Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
中文标题/摘要
标题:TraceVision:轨迹感知的视觉语言模型以实现类人的空间理解
近年来,大型视觉语言模型(LVLMs)在图像理解和自然语言生成方面表现出色。然而,当前的方法主要集中在全局图像理解上,难以模拟人类的视觉注意力轨迹,并解释描述与特定区域之间的关联。我们提出了一种名为TraceVision的统一视觉语言模型,该模型在端到端框架中整合了轨迹感知的空间理解。TraceVision采用轨迹感知视觉感知(TVP)模块,实现视觉特征与轨迹信息的双向融合。我们设计了几何简化来从原始轨迹中提取语义关键点,并提出了一种三阶段训练管道,其中轨迹指导描述生成和区域定位。我们将TraceVision扩展到轨迹引导的分割和视频场景理解,实现了跨帧跟踪和时间注意力分析。我们构建了基于推理的交互式局部叙述(RILN)数据集,以增强逻辑推理和可解释性。在轨迹引导的描述、文本引导的轨迹预测、理解和分割的广泛实验中,TraceVision达到了最先进的性能,为直观的空间交互和可解释的视觉理解奠定了基础。
Summary / 总结
The research aims to improve vision-language models so that they better simulate human visual attention and spatial understanding. TraceVision integrates trajectory-aware spatial understanding into an end-to-end framework using a Trajectory-aware Visual Perception (TVP) module and a three-stage training pipeline. The model achieves state-of-the-art performance on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation.
研究旨在提升视觉语言模型,使其更好地模拟人类的视觉注意力和空间理解。TraceVision 在端到端框架中整合了轨迹感知的空间理解,使用轨迹感知视觉感知模块进行视觉和轨迹信息的双向融合。该模型在轨迹引导的描述生成、文本引导的轨迹预测和理解方面表现出色,达到了最先进的性能,为直观的空间交互和可解释的视觉理解奠定了基础。
MedDIFT: Multi-Scale Diffusion-Based Correspondence in 3D Medical Imaging
Authors: Xingyu Zhang, Anna Reithmeir, Fryderyk Kögl, Rickmer Braren, Julia A. Schnabel, Daniel M. Lang
First: 2025-12-05T09:53:07+00:00 · Latest: 2026-02-23T12:05:15+00:00
Comments: Updated results
Abstract
Accurate spatial correspondence between medical images is essential for longitudinal analysis, lesion tracking, and image-guided interventions. Medical image registration methods rely on local intensity-based similarity measures, which fail to capture global semantic structure and often yield mismatches in low-contrast or anatomically variable regions. Recent advances in diffusion models suggest that their intermediate representations encode rich geometric and semantic information. We present MedDIFT, a training-free 3D correspondence framework that leverages multi-scale features from a pretrained latent medical diffusion model as voxel descriptors. MedDIFT fuses diffusion activations into rich voxel-wise descriptors and matches them via cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT shows promising capability in identifying anatomical correspondence without requiring any task-specific model training. Ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance. Code is available online.
中文标题/摘要
标题:MedDIFT:基于扩散的3D医学成像多尺度对应
医学图像之间的准确空间对应对于纵向分析、病灶追踪和图像引导干预至关重要。医学图像配准方法依赖于局部基于强度的相似性度量,这无法捕捉全局语义结构,往往在低对比度或解剖结构变化区域产生错配。最近在扩散模型方面的进展表明,它们的中间表示编码了丰富的几何和语义信息。我们提出MedDIFT,这是一种无需训练的3D对应框架,利用预训练的医学扩散模型的多尺度特征作为体素描述符。MedDIFT将扩散激活融合到丰富的体素级描述符中,并通过余弦相似度进行匹配,可选地加入局部搜索先验。在公开的肺CT数据集上,MedDIFT展示了在无需任何特定任务模型训练的情况下识别解剖对应的能力。消融实验表明,多级特征融合和适度的扩散噪声可以提高性能。代码已在线发布。
Summary / 总结
MedDIFT is a training-free 3D correspondence framework that uses multi-scale features from a pretrained latent medical diffusion model to identify anatomical correspondences in medical images. It fuses diffusion activations into rich voxel-wise descriptors and matches them using cosine similarity, with an optional local-search prior. On a publicly available lung CT dataset, MedDIFT shows promising capability in identifying anatomical correspondences without task-specific model training, and ablation experiments confirm that multi-level feature fusion and modest diffusion noise improve performance.
MedDIFT 是一个无需训练的 3D 医学图像对应框架,利用预训练的医疗扩散模型的多尺度特征。它将扩散激活融合成体素级描述符,并使用余弦相似性进行匹配,可选地带有局部搜索先验。在肺部 CT 数据集上,MedDIFT 证明了在无需特定任务训练的情况下有效识别解剖对应关系,消融实验显示多级特征融合和扩散噪声增强性能。
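The matching step (cosine similarity over voxel descriptors, with an optional local-search prior) can be sketched as follows. This is an illustrative sketch rather than the released code. The dictionary layout, the function names, and the interpretation of the local-search prior as a cubic window around the query position are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def match_voxel(query_desc, query_pos, target_descs, search_radius=None):
    """Find the target voxel whose descriptor best matches the query descriptor.

    target_descs:  dict mapping (x, y, z) -> descriptor vector (assumed layout)
    search_radius: if set, only voxels within this Chebyshev distance of
                   query_pos are considered (assumed form of the local-search prior)
    """
    best_pos, best_sim = None, -2.0
    for pos, desc in target_descs.items():
        if search_radius is not None:
            if max(abs(p - q) for p, q in zip(pos, query_pos)) > search_radius:
                continue
        sim = cosine(query_desc, desc)
        if sim > best_sim:
            best_pos, best_sim = pos, sim
    return best_pos, best_sim
```

With `search_radius=None` the search is global; a small radius trades recall for robustness against distant voxels with similar descriptors.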
Improving Outdoor Multi-cell Fingerprinting-based Positioning via Mobile Data Augmentation
Authors: Tony Chahoud, Lorenzo Mario Amorosa, Riccardo Marini, Luca De Nardis
First: 2025-09-23T09:09:45+00:00 · Latest: 2026-02-23T11:57:41+00:00
Abstract
Accurate outdoor positioning in cellular networks is hindered by sparse, heterogeneous measurement collections and the high cost of exhaustive site surveys. This paper introduces a lightweight, modular mobile data augmentation framework designed to enhance multi-cell fingerprinting-based positioning using operator-collected minimization of drive test (MDT) records. The proposed approach decouples spatial and radio-feature synthesis: kernel density estimation (KDE) models the empirical spatial distribution to generate geographically coherent synthetic locations, while a k-nearest-neighbor (KNN)-based block produces augmented per-cell radio fingerprints. The architecture is intentionally training-free, interpretable, and suitable for distributed or on-premise operator deployments, supporting privacy-aware workflows. We both validate each augmentation module independently and assess its end-to-end impact on fingerprinting-based positioning using a real-world MDT dataset provided by an Italian mobile network operator across diverse urban and peri-urban scenarios. Results show that the proposed KDE-KNN augmentation consistently improves positioning performance with respect to state-of-the-art approaches, reducing the median positioning error by up to 30% in the most sparsely sampled or structurally complex regions. We also observe region-dependent saturation effects, which emerge most rapidly in scenarios with high user density where the information gain from additional synthetic samples quickly diminishes. Overall, the framework offers a practical, low-complexity path to enhance operator positioning services using existing mobile data traces.
中文标题/摘要
标题:通过移动数据增强提高基于多小区指纹定位的室外定位精度
在蜂窝网络中实现准确的室外定位受到稀疏、异质测量集和详尽现场调查成本高的限制。本文介绍了一种轻量级、模块化的移动数据增强框架,旨在通过运营商收集的最小化路测记录(MDT)记录来增强基于多小区指纹的定位。所提出的方法将空间和射频特征合成解耦:核密度估计(KDE)模型实测的空间分布以生成地理上一致的合成位置,而基于k近邻(KNN)的方法生成每个小区的增强射频指纹。该架构故意不依赖于训练,具有可解释性,并适合分布式或本地部署,支持隐私保护的工作流程。我们分别验证了每个增强模块的有效性,并使用意大利移动网络运营商提供的实际MDT数据集评估其对基于指纹的定位的端到端影响,在多种城市和近郊场景中。结果表明,与最先进的方法相比,所提出的KDE-KNN增强方法在最稀疏采样或结构复杂区域的一致提高了定位性能,定位误差中位数最多降低了30%。我们还观察到区域依赖的饱和效应,在用户密度高的场景中,额外合成样本的信息增益迅速减少。总体而言,该框架提供了一条实用的、低复杂度的途径,利用现有的移动数据轨迹来增强运营商的定位服务。
Summary / 总结
This paper addresses the challenge of accurate outdoor positioning in cellular networks by proposing a lightweight mobile data augmentation framework. The framework uses operator-collected MDT records to enhance multi-cell fingerprinting-based positioning through two modules: KDE for spatial distribution modeling and KNN for radio fingerprint augmentation. Experimental results show that this approach reduces median positioning error by up to 30% in sparse or complex regions, outperforming state-of-the-art methods. However, region-specific saturation effects are observed, particularly in high-user-density areas where additional synthetic samples provide diminishing returns.
本文提出了一种轻量级的移动数据增强框架,以解决蜂窝网络中户外定位的准确性问题。该框架利用运营商收集的MDT记录生成合成位置和增强的小区射频指纹。KDE模型空间分布,而KNN生成每个小区的射频指纹。该方法使用实际的MDT数据集进行验证,结果显示在稀疏或复杂区域的中位定位误差最多可减少30%。
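The decoupled KDE and KNN steps can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline. The record layout, the function names, the bandwidth and `k` defaults, and the choice to average neighbour values per cell are assumptions. Sampling from a Gaussian KDE is done by picking an observed point uniformly and adding Gaussian noise with standard deviation equal to the bandwidth.

```python
import math
import random


def sample_kde_location(points, bandwidth, rng):
    # Drawing from a Gaussian KDE: choose a data point, then perturb it.
    x, y = rng.choice(points)
    return (x + rng.gauss(0.0, bandwidth), y + rng.gauss(0.0, bandwidth))


def knn_fingerprint(location, records, k):
    """Average the per-cell measurements of the k records nearest to `location`.

    records: list of ((x, y), {cell_id: measurement}) tuples (assumed layout)
    """
    nearest = sorted(records, key=lambda r: math.dist(location, r[0]))[:k]
    sums, counts = {}, {}
    for _, fingerprint in nearest:
        for cell, value in fingerprint.items():
            sums[cell] = sums.get(cell, 0.0) + value
            counts[cell] = counts.get(cell, 0) + 1
    # Assumption: a plain mean over the neighbours that observed each cell.
    return {cell: sums[cell] / counts[cell] for cell in sums}


def augment(records, n_synthetic, bandwidth=10.0, k=3, seed=0):
    """Return n_synthetic (location, fingerprint) pairs; hyperparameters are hypothetical."""
    rng = random.Random(seed)
    points = [pos for pos, _ in records]
    synthetic = []
    for _ in range(n_synthetic):
        loc = sample_kde_location(points, bandwidth, rng)
        synthetic.append((loc, knn_fingerprint(loc, records, k)))
    return synthetic
```

Neither step has trainable parameters, which is consistent with the training-free and interpretable design described in the abstract.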
VGGT-MPR: VGGT-Enhanced Multimodal Place Recognition in Autonomous Driving Environments
Authors: Jingyi Xu, Zhangshuo Qi, Zhongmiao Yan, Xuyu Gao, Qianyun Jiao, Songpengcheng Xia, Xieyuanli Chen, Ling Pei
First: 2026-02-23T11:33:56+00:00 · Latest: 2026-02-23T11:33:56+00:00
Abstract
In autonomous driving, robust place recognition is critical for global localization and loop closure detection. While inter-modality fusion of camera and LiDAR data in multimodal place recognition (MPR) has shown promise in overcoming the limitations of unimodal counterparts, existing MPR methods largely rely on hand-crafted fusion strategies and heavily parameterized backbones that require costly retraining. To address this, we propose VGGT-MPR, a multimodal place recognition framework that adopts the Visual Geometry Grounded Transformer (VGGT) as a unified geometric engine for both global retrieval and re-ranking. In the global retrieval stage, VGGT extracts geometrically-rich visual embeddings through prior depth-aware and point map supervision, and densifies sparse LiDAR point clouds with predicted depth maps to improve structural representation. This enhances the discriminative ability of fused multimodal features and produces global descriptors for fast retrieval. Beyond global retrieval, we design a training-free re-ranking mechanism that exploits VGGT's cross-view keypoint-tracking capability. By combining mask-guided keypoint extraction with confidence-aware correspondence scoring, our proposed re-ranking mechanism effectively refines retrieval results without additional parameter optimization. Extensive experiments on large-scale autonomous driving benchmarks and our self-collected data demonstrate that VGGT-MPR achieves state-of-the-art performance, exhibiting strong robustness to severe environmental changes, viewpoint shifts, and occlusions. Our code and data will be made publicly available.
中文标题/摘要
标题:VGGT-MPR:增强的多模态地点识别在自主驾驶环境中的应用
在自主驾驶中,稳健的地点识别对于全局定位和环路闭合检测至关重要。虽然多模态地点识别(MPR)中相机和LiDAR数据的跨模态融合显示出克服单模态限制的潜力,但现有MPR方法主要依赖手工设计的融合策略和需要大量重新训练的复杂骨干网络。为了解决这一问题,我们提出了一种采用视觉几何引导变换器(VGGT)作为统一几何引擎的多模态地点识别框架,用于全局检索和再排序。在全局检索阶段,VGGT通过先验深度感知和点图监督提取丰富的视觉嵌入,并使用预测的深度图稠密化稀疏LiDAR点云,以提高结构表示。这增强了融合多模态特征的区分能力,并生成全局描述符以实现快速检索。除了全局检索之外,我们设计了一种无需训练的再排序机制,利用VGGT的跨视图关键点跟踪能力。通过结合掩码引导的关键点提取和置信度感知对应关系评分,我们提出的再排序机制有效地细化了检索结果,而无需额外的参数优化。在大规模自主驾驶基准测试和我们自行收集的数据上进行的广泛实验表明,VGGT-MPR达到了最先进的性能,表现出对严重环境变化、视角偏移和遮挡的强大鲁棒性。我们的代码和数据将公开提供。
Summary / 总结
VGGT-MPR is a multimodal place recognition framework that uses the Visual Geometry Grounded Transformer (VGGT) to enhance the discriminative ability of fused camera and LiDAR data. It improves structural representation by densifying sparse LiDAR point clouds and extracts geometrically-rich visual embeddings. The framework also includes a training-free re-ranking mechanism that refines retrieval results using keypoint tracking and confidence scoring. Experiments show that VGGT-MPR outperforms existing methods in various challenging scenarios, demonstrating strong robustness to environmental changes and occlusions.
VGGT-MPR 是一种使用视觉几何引导变换器(VGGT)进行全局检索和重新排名的多模态地点识别框架。通过提取几何丰富的视觉嵌入并稠密化稀疏的 LiDAR 点云来增强辨别能力。重新排名机制使用掩码引导的关键点提取和置信度感知的对应评分来优化检索结果,而无需额外的参数优化。实验表明,VGGT-MPR 在各种具有挑战性的场景中表现出色,具有很强的鲁棒性。
HDR Reconstruction Boosting with Training-Free and Exposure-Consistent Diffusion
Authors: Yo-Tin Lin, Su-Kai Chen, Hou-Ning Hu, Yen-Yu Lin, Yu-Lun Liu
Venue: WACV 2026
First: 2026-02-23T10:57:22+00:00 · Latest: 2026-02-23T10:57:22+00:00
Comments: WACV 2026. Project page: https://github.com/EusdenLin/HDR-Reconstruction-Boosting
Abstract
Single LDR to HDR reconstruction remains challenging for over-exposed regions where traditional methods often fail due to complete information loss. We present a training-free approach that enhances existing indirect and direct HDR reconstruction methods through diffusion-based inpainting. Our method combines text-guided diffusion models with SDEdit refinement to generate plausible content in over-exposed areas while maintaining consistency across multi-exposure LDR images. Unlike previous approaches requiring extensive training, our method seamlessly integrates with existing HDR reconstruction techniques through an iterative compensation mechanism that ensures luminance coherence across multiple exposures. We demonstrate significant improvements in both perceptual quality and quantitative metrics on standard HDR datasets and in-the-wild captures. Results show that our method effectively recovers natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines. Project page: https://github.com/EusdenLin/HDR-Reconstruction-Boosting
中文标题/摘要
标题:基于扩散的无训练和曝光一致HDR重建增强
单LDR到HDR重建在过度曝光区域仍然具有挑战性,传统方法往往因信息完全丢失而失败。我们提出了一种无训练方法,通过基于扩散的修复增强现有的间接和直接HDR重建方法。该方法结合了文本引导的扩散模型和SDEdit细化,生成过度曝光区域的合理内容,同时保持多曝光LDR图像的一致性。与需要大量训练的先前方法不同,我们的方法通过迭代补偿机制无缝地与现有的HDR重建技术集成,确保多曝光之间的亮度一致性。我们在标准HDR数据集和野外捕获中展示了在感知质量和定量指标上的显著改进。结果表明,我们的方法在挑战性场景中有效恢复了自然细节,同时保留了现有HDR重建管道的优势。项目页面:https://github.com/EusdenLin/HDR-Reconstruction-Boosting
Summary / 总结
This paper addresses the challenge of reconstructing high dynamic range (HDR) images from low dynamic range (LDR) images, particularly in over-exposed regions where traditional methods often fail. The authors propose a training-free approach that uses diffusion-based inpainting to enhance both indirect and direct HDR reconstruction methods. By integrating text-guided diffusion models with SDEdit refinement, the method generates plausible content in over-exposed areas while maintaining consistency across multiple exposures. The results show significant improvements in perceptual quality and quantitative metrics, effectively recovering natural details in challenging scenarios while preserving the advantages of existing HDR reconstruction pipelines.
本文解决了从低动态范围(LDR)图像重建高动态范围(HDR)图像的挑战,特别是在传统方法往往失败的过曝区域。作者提出了一种无需训练的方法,通过基于扩散的修复来增强间接和直接的HDR重建方法。通过结合文本引导的扩散模型和SDEdit精修,该方法在多曝光LDR图像之间生成合理的内容,同时保持一致性。结果显示,在感知质量和定量指标方面取得了显著改进,有效地在具有挑战性的场景中恢复了自然细节,同时保留了现有HDR重建管道的优势。
Seeing Clearly, Reasoning Confidently: Plug-and-Play Remedies for Vision Language Model Blindness
Authors: Xin Hu, Haomiao Ni, Yunbei Zhang, Jihun Hamm, Zechen Li, Zhengming Ding
Venue: CVPR 2026
First: 2026-02-23T09:02:40+00:00 · Latest: 2026-02-23T09:02:40+00:00
Comments: Accepted by CVPR 2026
Abstract
Vision language models (VLMs) have achieved remarkable success in broad visual understanding, yet they remain challenged by object-centric reasoning on rare objects due to the scarcity of such instances in pretraining data. While prior efforts alleviate this issue by retrieving additional data or introducing stronger vision encoders, these methods are still computationally intensive when finetuning VLMs and do not fully exploit the original training data. In this paper, we introduce an efficient plug-and-play module that substantially improves VLMs' reasoning over rare objects by refining visual tokens and enriching input text prompts, without finetuning the VLM. Specifically, we propose to learn multi-modal class embeddings for rare objects by leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, compensating for limited training examples. These embeddings refine the visual tokens in VLMs through a lightweight attention-based enhancement module that improves fine-grained object details. In addition, we use the learned embeddings as object-aware detectors to generate informative hints, which are injected into the text prompts to help guide the VLM's attention toward relevant image regions. Experiments on two benchmarks show consistent and substantial gains for pretrained VLMs in rare object recognition and reasoning. Further analysis reveals how our method strengthens the VLM's ability to focus on and reason about rare objects.
中文标题/摘要
标题:清晰地看,自信地推理:视觉语言模型视觉盲点的即插即用解决方案
视觉语言模型(VLMs)在广泛的视觉理解方面取得了显著的成功,但在处理预训练数据中罕见对象的以对象为中心的推理方面仍面临挑战。尽管先前的努力通过检索额外数据或引入更强的视觉编码器来缓解这一问题,但这些方法在微调VLMs时仍然计算密集,并且没有充分利用原始训练数据。在本文中,我们介绍了一个高效的即插即用模块,该模块通过细化视觉标记和丰富输入文本提示,显著提高了VLMs对罕见对象的推理能力,而无需对VLMs进行微调。具体而言,我们通过利用视觉基础模型的先验知识和同义词增强的文本描述来学习罕见对象的多模态类嵌入,以弥补有限的训练样本。这些嵌入通过一个轻量级的基于注意力的增强模块来细化VLMs中的视觉标记,从而改善了细粒度的对象细节。此外,我们使用学习到的嵌入作为对象感知检测器来生成信息性提示,这些提示被注入到文本提示中,以帮助引导VLM的注意力朝向相关图像区域。在两个基准上的实验表明,对于预训练的VLMs,在罕见对象识别和推理方面都取得了持续且显著的改进。进一步的分析揭示了我们方法如何增强VLM专注于和推理罕见对象的能力。
Summary / 总结
This paper addresses the challenge of vision language models (VLMs) in reasoning about rare objects, which are underrepresented in their pretraining data. It introduces an efficient plug-and-play module that refines visual tokens and enriches text prompts without requiring VLMs to be fine-tuned. By leveraging prior knowledge from vision foundation models and synonym-augmented text descriptions, the module learns multi-modal class embeddings for rare objects, which are then used to enhance the visual tokens and generate informative hints for the VLM. Experiments on two benchmarks demonstrate consistent improvements in rare object recognition and reasoning for pretrained VLMs.
本文针对视觉语言模型(VLMs)在处理稀有物体时的推理能力不足问题,提出了一个高效的插件模块,通过细化视觉标记和丰富输入文本提示来增强VLMs,而无需进行微调。该方法利用视觉基础模型和同义词增强的文本描述来学习稀有物体的多模态类嵌入,然后用于改进物体细节并生成有助于引导VLM关注相关图像区域的提示。实验结果表明,该方法在两个基准测试中显著提高了预训练VLMs对稀有物体识别和推理的能力。
OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot
Authors: Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, Huan Wang
First: 2025-10-08T08:19:15+00:00 · Latest: 2026-02-23T08:33:05+00:00
Abstract
Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.
中文标题/摘要
标题:OBS-Diff:针对扩散模型的一次性精确剪枝
大规模文本到图像扩散模型虽然强大,但计算成本高昂。现有的一次性网络剪枝方法由于扩散模型的迭代去噪特性,难以直接应用于它们。为解决这一问题,本文提出了一种名为OBS-Diff的新颖一次性剪枝框架,能够实现大规模文本到图像扩散模型的无训练精确压缩。具体而言,(i) OBS-Diff 重新激活经典的Optimal Brain Surgeon (OBS),将其适应现代扩散模型的复杂架构,并支持多种剪枝粒度,包括无结构、N:M半结构和结构(MHA头和FFN神经元)稀疏性;(ii) 为了使剪枝标准与扩散过程的迭代动态相一致,通过从误差累积的角度出发,我们提出了一种新颖的时间步感知海森矩阵构建方法,引入了对数递减加权方案,赋予早期时间步更高的重要性,以减轻潜在的误差累积;(iii) 此外,还提出了一种计算高效的分组顺序剪枝策略,以摊销昂贵的校准过程。大量实验表明,OBS-Diff 实现了扩散模型的一次性剪枝的最新成果,能够在最小化视觉质量下降的情况下实现推理加速。
Summary / 总结
This paper addresses the high computational cost of large-scale text-to-image diffusion models by presenting OBS-Diff, a novel one-shot pruning framework. OBS-Diff adapts the Optimal Brain Surgeon method to support various sparsity types and incorporates a timestep-aware Hessian construction to align pruning criteria with the iterative dynamics of diffusion models. The framework also proposes a computationally efficient group-wise sequential pruning strategy. Experiments demonstrate that OBS-Diff achieves state-of-the-art one-shot pruning, providing significant inference acceleration with minimal impact on visual quality.
本文提出了一种名为OBS-Diff的新颖单次剪枝框架,以解决大规模文本到图像扩散模型的高计算成本问题。OBS-Diff将Optimal Brain Surgeon方法适应现代扩散模型架构,支持多种稀疏类型。它还提出了一种时间步长感知的Hessian构造方法,带有对数递减加权方案,以使剪枝与扩散过程动力学对齐,并提出了一种计算高效的分组顺序剪枝策略。实验表明,OBS-Diff实现了最先进的单次剪枝,提供了显著的推理加速,同时对视觉质量的影响最小。
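The timestep-aware Hessian can be pictured as a weighted sum of per-timestep input second moments, with weights that decay logarithmically so that earlier denoising steps count more. The sketch below is an assumption-laden illustration: the weight formula `log(T - t + 1)`, its normalization, and the function names are hypothetical, since the abstract only states that the scheme is a logarithmic decrease favouring earlier timesteps.

```python
import math


def log_decrease_weights(num_steps):
    """Normalized weights that shrink logarithmically from the first to the last step.

    Hypothetical form: w_t is proportional to log(T - t + 1) for t = 0..T-1,
    where t = 0 is the earliest denoising step.
    """
    raw = [math.log(num_steps - t + 1) for t in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def timestep_weighted_hessian(activations_per_step):
    """Accumulate H = sum_t w_t * mean_x(x x^T) over calibration inputs.

    activations_per_step[t] is a list of input vectors seen by one layer at step t.
    """
    weights = log_decrease_weights(len(activations_per_step))
    dim = len(activations_per_step[0][0])
    hessian = [[0.0] * dim for _ in range(dim)]
    for w, batch in zip(weights, activations_per_step):
        for x in batch:
            for i in range(dim):
                for j in range(dim):
                    hessian[i][j] += w * x[i] * x[j] / len(batch)
    return hessian
```

In an OBS-style pruner, a matrix like this would replace the unweighted layer-input Hessian when ranking weights for removal; the pruning step itself is not shown here.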
VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense
Authors: Nadav Kadvil, Ayellet Tal
First: 2026-02-23T07:39:43+00:00 · Latest: 2026-02-23T07:39:43+00:00
Abstract
Large Vision-Language Models (LVLMs) can be vulnerable to adversarial images that subtly bias their outputs toward plausible yet incorrect responses. We introduce a general, efficient, and training-free defense that combines image transformations with agentic data consolidation to recover correct model behavior. A key component of our approach is a two-stage detection mechanism that quickly filters out the majority of clean inputs. We first assess image consistency under content-preserving transformations at negligible computational cost. For more challenging cases, we examine discrepancies in a text-embedding space. Only when necessary do we invoke a powerful LLM to resolve attack-induced divergences. A key idea is to consolidate multiple responses, leveraging both their similarities and their differences. We show that our method achieves state-of-the-art accuracy while maintaining notable efficiency: most clean images skip costly processing, and even in the presence of numerous adversarial examples, the overhead remains minimal.
中文标题/摘要
标题:VALD:多阶段视觉攻击检测以实现高效的LVLM防御
大型视觉-语言模型(LVLMs)可能会受到微妙偏倚其输出的对抗图像的攻击,导致它们产生合理但错误的响应。我们提出了一种通用、高效且无需训练的防御方法,该方法结合了图像变换与代理数据整合,以恢复正确的模型行为。我们方法的关键组成部分是一种两阶段检测机制,可以快速过滤掉大部分干净输入。我们首先在几乎不增加计算成本的情况下,评估内容保持变换下的图像一致性。对于更具挑战性的案例,我们则在文本嵌入空间中检查差异。只有在必要时,我们才会调用强大的LLM来解决攻击引起的偏差。一个关键思想是整合多个响应,利用它们的相似性和差异性。我们展示了我们的方法在保持显著效率的同时达到了最先进的准确性:大多数干净图像可以跳过昂贵的处理,即使存在大量对抗样本,额外的开销也保持在最小。
Summary / 总结
The research aims to develop an efficient defense mechanism against adversarial images that can mislead large vision-language models. The method employs a two-stage detection system combining content-preserving image transformations and text-embedding analysis. For challenging cases, a powerful LLM is used to resolve discrepancies. Key findings show that the approach achieves state-of-the-art accuracy with minimal overhead, as most clean images bypass expensive processing steps.
研究提出了VALD,一种针对大型视觉-语言模型(LVLM)的防御机制,这些模型容易受到对抗性图像的影响。该方法采用两阶段检测过程,首先评估在内容保持变换下的图像一致性,然后在更复杂的情况下检查文本嵌入差异。只有必要时才调用强大的LLM来解决攻击引起的偏差。该方法通过整合多个响应来恢复正确的模型行为。实验表明,VALD在保持最小开销的同时实现了最先进的准确性,即使存在大量对抗性示例也是如此。
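The cascade described above (a cheap image-level check, then a text-embedding check, then LLM consolidation only when needed) can be sketched as a function over pluggable components. Every callable and threshold here is a hypothetical placeholder supplied by the caller; the abstract does not specify how each stage is scored, so this shows only the control flow.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def defend(image, answer_fn, transforms, image_score_fn, embed_fn,
           consolidate_fn, tau_img, tau_txt):
    """Return (answer, route). All callables and thresholds are assumed inputs.

    image_score_fn: cheap inconsistency score under content-preserving transforms
    embed_fn:       maps a text answer to an embedding vector
    consolidate_fn: stand-in for the LLM call that merges several answers
    """
    # Stage 1: cheap filter; most clean inputs are expected to exit here.
    if image_score_fn(image) <= tau_img:
        return answer_fn(image), "stage1-clean"

    # Stage 2: compare answers on the original and transformed images in embedding space.
    answers = [answer_fn(image)] + [answer_fn(t(image)) for t in transforms]
    embeddings = [embed_fn(a) for a in answers]
    max_gap = max((1.0 - cosine(embeddings[0], e) for e in embeddings[1:]), default=0.0)
    if max_gap <= tau_txt:
        return answers[0], "stage2-clean"

    # Stage 3: only now pay for the expensive consolidation step.
    return consolidate_fn(answers), "consolidated"
```

The ordering is what keeps overhead low: the expensive call is reached only when both cheaper checks flag a discrepancy.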
Object-WIPER : Training-Free Object and Associated Effect Removal in Videos
Authors: Saksham Singh Kushwaha, Sayan Nag, Yapeng Tian, Kuldeep Kulkarni
Venue: CVPR 2026
First: 2026-01-10T02:28:31+00:00 · Latest: 2026-02-23T07:21:22+00:00
Comments: Accepted to CVPR 2026. Project Page: https://sakshamsingh1.github.io/object_wiper_webpage/
Abstract
In this paper, we introduce Object-WIPER, a training-free framework for removing dynamic objects and their associated visual effects from videos, and inpainting them with semantically consistent and temporally coherent content. Our approach leverages a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens describing the target object and its effects, we localize relevant visual tokens via visual-text cross-attention and visual self-attention. This produces an intermediate effect mask that we fuse with the user mask to obtain a final foreground token mask to replace. We first invert the video through the DiT to obtain structured noise, then reinitialize the masked tokens with Gaussian noise while preserving background tokens. During denoising, we copy values for the background tokens saved during inversion to maintain scene fidelity. To address the lack of suitable evaluation, we introduce a new object removal metric that rewards temporal consistency among foreground tokens across consecutive frames, coherence between foreground and background tokens within each frame, and dissimilarity between the input and output foreground tokens. Experiments on DAVIS and a newly curated real-world associated effect benchmark (WIPER-Bench) show that Object-WIPER surpasses both training-based and training-free baselines in terms of the metric, achieving clean removal and temporally stable reconstruction without any retraining. Our new benchmark, source code, and pre-trained models will be publicly available.
中文标题/摘要
标题:Object-WIPER : 视频中无训练对象及其相关视觉效果去除
在本文中,我们介绍了Object-WIPER,一种无需训练的框架,用于从视频中去除动态对象及其相关视觉效果,并用语义一致且时间连贯的内容进行修复。我们的方法利用了一个预训练的文本到视频扩散变换器(DiT)。给定输入视频、用户提供的对象掩码以及描述目标对象及其效果的查询标记,我们通过视觉-文本交叉注意力和视觉自我注意力定位相关视觉标记。这生成了一个中间效果掩码,我们将其与用户掩码融合以获得最终的前景标记掩码进行替换。我们首先通过DiT反转视频以获得结构化噪声,然后用高斯噪声重新初始化被遮罩的标记,同时保留背景标记。在去噪过程中,我们复制反转过程中保存的背景标记的值以保持场景的连贯性。为了解决缺乏合适评估的问题,我们引入了一个新的对象去除度量标准,该标准奖励前景标记在连续帧中的时间一致性、前景标记与背景标记在每个帧内的连贯性以及输入和输出前景标记之间的差异性。在DAVIS和一个新编纂的真实世界相关效果基准(WIPER-Bench)上的实验表明,Object-WIPER在该度量标准上超过了基于训练和无训练的基线,实现了干净的去除和时间稳定的重建,无需任何重新训练。我们的新基准、源代码和预训练模型将公开提供。
Summary / 总结
Object-WIPER is a training-free framework for removing dynamic objects and their effects from videos, using a pre-trained text-to-video diffusion transformer (DiT). Given an input video, a user-provided object mask, and query tokens, the framework localizes relevant visual tokens and produces an intermediate effect mask. It then inverts the video to obtain structured noise, reinitializes masked tokens with Gaussian noise, and maintains background tokens during denoising. Experiments show that Object-WIPER outperforms both training-based and training-free baselines, achieving clean removal and temporally stable reconstruction without retraining.
Object-WIPER 是一个无需训练的框架,用于从视频中移除动态对象及其效果,利用预训练的文本到视频扩散变换器(DiT)。给定输入视频、用户提供的对象掩码和查询标记,该框架定位相关视觉标记并生成中间效果掩码。通过 DiT 对视频进行反转以获取结构化噪声,并用高斯噪声重新初始化掩码标记,同时保留背景标记。该框架实现了干净的移除和时间上稳定的重建,在DAVIS 和一个新的真实世界关联效果基准(WIPER-Bench)上优于基于训练和无需训练的基线。
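The token handling described in the abstract (fuse the masks, reinitialize foreground tokens with Gaussian noise, keep background tokens from inversion) can be illustrated with a small sketch. This is not the authors' implementation: tokens are plain lists of floats, the fusion rule is assumed to be a logical OR because the abstract only says the masks are "fused", and the helper names `fuse_masks`, `reinitialize_foreground`, and `restore_background` are hypothetical.

```python
import random


def fuse_masks(user_mask, effect_mask):
    # Assumption: fusion is a per-token logical OR of the two masks.
    return [u or e for u, e in zip(user_mask, effect_mask)]


def reinitialize_foreground(inverted_tokens, foreground_mask, rng):
    # Foreground tokens get fresh Gaussian noise; background tokens keep
    # the structured noise obtained from inversion.
    result = []
    for token, is_foreground in zip(inverted_tokens, foreground_mask):
        if is_foreground:
            result.append([rng.gauss(0.0, 1.0) for _ in token])
        else:
            result.append(list(token))
    return result


def restore_background(denoised_tokens, saved_tokens, foreground_mask):
    # During denoising, background positions are overwritten with the
    # values saved during inversion so the scene stays unchanged.
    return [
        list(denoised) if is_foreground else list(saved)
        for denoised, saved, is_foreground
        in zip(denoised_tokens, saved_tokens, foreground_mask)
    ]


user_mask = [True, False, False, False]
effect_mask = [False, True, False, False]  # e.g. a shadow next to the object
fused = fuse_masks(user_mask, effect_mask)

inverted = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
start = reinitialize_foreground(inverted, fused, random.Random(0))

# Stand-in for one denoising step that perturbs every token.
denoised = [[value + 1.0 for value in token] for token in start]
restored = restore_background(denoised, inverted, fused)
```

In this toy run only the first two tokens (object and effect) are regenerated; the last two are carried over from inversion at every step.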
Can a Teenager Fool an AI? Evaluating Low-Cost Cosmetic Attacks on Age Estimation Systems
Authors: Xingyu Shen, Tommy Duong, Xiaodong An, Zengqi Zhao, Zebang Hu, Haoyu Hu, Ziyou Wang, Finn Guo, Simiao Ren
First: 2026-02-23T06:13:52+00:00 · Latest: 2026-02-23T06:13:52+00:00
Comments: 13 pages, 6 figures
Abstract
Age estimation systems are increasingly deployed as gatekeepers for age-restricted online content, yet their robustness to cosmetic modifications has not been systematically evaluated. We investigate whether simple, household-accessible cosmetic changes, including beards, grey hair, makeup, and simulated wrinkles, can cause AI age estimators to classify minors as adults. To study this threat at scale without ethical concerns, we simulate these physical attacks on 329 facial images of individuals aged 10 to 21 using a VLM image editor (Gemini 2.5 Flash Image). We then evaluate eight models from our prior benchmark: five specialized architectures (MiVOLO, Custom-Best, Herosan, MiViaLab, DEX) and three vision-language models (Gemini 3 Flash, Gemini 2.5 Flash, GPT-5-Nano). We introduce the Attack Conversion Rate (ACR), defined as the fraction of images predicted as minor at baseline that flip to adult after attack, a population-agnostic metric that does not depend on the ratio of minors to adults in the test set. Our results reveal that a synthetic beard alone achieves 28 to 69 percent ACR across all eight models; combining all four attacks shifts predicted age by +7.7 years on average across all 329 subjects and reaches up to 83 percent ACR; and vision-language models exhibit lower ACR (59 to 71 percent) than specialized models (63 to 83 percent) under the full attack, although the ACR ranges overlap and the difference is not statistically tested. These findings highlight a critical vulnerability in deployed age-verification pipelines and call for adversarial robustness evaluation as a mandatory criterion for model selection.
中文标题/摘要
标题:青少年能否愚弄AI?低成本化妆品攻击对年龄估计系统的评估
年龄估计系统越来越多地被部署为年龄限制在线内容的门卫,但它们对化妆品修改的鲁棒性尚未系统评估。我们研究了简单的、家庭可获取的化妆品变化,包括胡须、灰发、化妆和模拟皱纹,是否能使AI年龄估计器将未成年人分类为成年人。为了在没有伦理问题的情况下大规模研究这一威胁,我们使用VLM图像编辑器(Gemini 2.5 Flash Image)在329张10至21岁个体的面部图像上模拟这些物理攻击。然后,我们评估了我们先前基准中的八种模型:五种专门架构(MiVOLO、Custom-Best、Herosan、MiViaLab、DEX)和三种视觉-语言模型(Gemini 3 Flash、Gemini 2.5 Flash、GPT-5-Nano)。我们引入了攻击转换率(ACR),定义为基线预测为未成年人的图像中翻转为成人的比例,这是一种不依赖于测试集中未成年人与成年人比例的群体无关度量。我们的结果显示,单独使用合成胡须在所有八种模型中实现了28%至69%的ACR;结合所有四种攻击使预测年龄平均增加7.7岁,达到83%的ACR;视觉-语言模型在完整攻击下的ACR(59%至71%)低于专门模型(63%至83%),尽管ACR范围重叠且差异未进行统计测试。这些发现突显了部署中的年龄验证管道中的一个关键漏洞,并呼吁将对抗鲁棒性评估作为模型选择的强制性标准。
Summary / 总结
This study tests whether simple cosmetic changes (beards, grey hair, makeup, and simulated wrinkles) can make AI age estimators classify minors as adults. Using a VLM image editor, the researchers simulated these changes on 329 facial images of individuals aged 10 to 21 and evaluated eight models: five specialized architectures and three vision-language models. A synthetic beard alone achieves an Attack Conversion Rate (ACR) of 28 to 69 percent, where ACR is the fraction of images predicted as minor at baseline that flip to adult after attack. Combining all four attacks shifts predicted age by +7.7 years on average and reaches up to 83 percent ACR. Under the full attack, vision-language models show lower ACR (59 to 71 percent) than specialized models (63 to 83 percent), although the ranges overlap and the difference is not statistically tested.
研究评估了简单化妆改变,如胡须、灰发、化妆和皱纹,对AI年龄估计系统的欺骗效果。研究人员使用VLM图像编辑器在329张10至21岁个体的面部图像上模拟了这些攻击。在八种模型中,单独使用合成胡须的攻击转换率(ACR)为28%到69%,而结合所有攻击后,预测年龄平均提高了7.7岁,视觉语言模型在全面攻击下的ACR低于专门模型,但ACR范围重叠且统计差异未被测试。
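The ACR definition quoted in the abstract can be made concrete with a short sketch. The adult threshold of 18 years and the function name `attack_conversion_rate` are assumptions for illustration; the abstract does not state the cutoff used, and the ages below are made-up values, not results from the paper.

```python
def attack_conversion_rate(baseline_ages, attacked_ages, adult_threshold=18.0):
    """Fraction of images predicted as minor at baseline that flip to adult."""
    if len(baseline_ages) != len(attacked_ages):
        raise ValueError("baseline and attacked predictions must be paired")
    minors_at_baseline = 0
    flipped_to_adult = 0
    for before, after in zip(baseline_ages, attacked_ages):
        if before < adult_threshold:
            minors_at_baseline += 1
            if after >= adult_threshold:
                flipped_to_adult += 1
    if minors_at_baseline == 0:
        return float("nan")  # ACR is undefined without baseline minors
    return flipped_to_adult / minors_at_baseline


# Hypothetical predicted ages for five images, before and after the attack.
baseline = [15.0, 16.5, 17.2, 19.0, 14.0]
attacked = [19.5, 17.0, 21.0, 25.0, 18.0]
acr = attack_conversion_rate(baseline, attacked)
```

Because the denominator counts only images predicted as minor at baseline, the fourth image (already predicted adult) does not affect the result, which is why the metric does not depend on the ratio of minors to adults in the test set.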
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
Authors: Omprakash Chakraborty, Jose Dolz, Ismail Ben Ayed
First: 2026-02-23T05:47:28+00:00 · Latest: 2026-02-23T05:47:28+00:00
Abstract
Vision language models (VLMs) have demonstrated remarkable generalization across diverse tasks, yet their performance remains constrained by the quality and geometry of the textual prototypes used to represent classes. Standard zero shot classifiers, derived from frozen text encoders and handcrafted prompts, may yield correlated or weakly separated embeddings that limit task specific discriminability. We introduce ORION, a text encoder fine tuning framework that improves pretrained VLMs using only class names. Our method optimizes, via low rank adaptation, a novel loss integrating two terms, one promoting pairwise orthogonality between the textual representations of the classes of a given task and the other penalizing deviations from the initial class prototypes. Furthermore, we provide a probabilistic interpretation of our orthogonality penalty, connecting it to the general maximum likelihood estimation (MLE) principle via Huygens theorem. We report extensive experiments on 11 benchmarks and three large VLM backbones, showing that the refined textual embeddings yield powerful replacements for the standard CLIP prototypes. Added as plug and play module on top of various state of the art methods, and across different prediction settings (zero shot, few shot and test time adaptation), ORION improves the performance consistently and significantly.
中文标题/摘要
标题:ORION:通用VLM适配的正交文本编码
视觉语言模型(VLMs)在多种任务上展示了出色的泛化能力,但其性能仍受限于用于表示类别的文本原型的质量和几何结构。标准的零样本分类器,源自冻结的文本编码器和手工制作的提示,可能会产生相关或弱分离的嵌入,从而限制了任务特定的可区分性。我们提出了ORION,一种仅使用类别名称来改进预训练VLMs的文本编码微调框架。我们的方法通过低秩适应优化了一种新颖的损失函数,该函数整合了两个项,一个促进给定任务类别文本表示之间的成对正交性,另一个惩罚与初始类别原型的偏差。此外,我们提供了我们正交性惩罚的概率解释,通过惠更斯定理将其与一般最大似然估计(MLE)原则联系起来。我们在11个基准和三个大型VLM骨干网络上进行了广泛的实验,表明改进后的文本嵌入可以作为标准CLIP原型的强大替代品。作为各种最先进的方法的即插即用模块,并在不同的预测设置(零样本、少量样本和测试时适应)下,ORION能够一致且显著地提高性能。
Summary / 总结
The research aims to improve vision language models (VLMs) by refining the quality and geometry of the textual prototypes that represent classes. ORION, a text encoder fine-tuning framework, uses only class names: via low-rank adaptation, it optimizes a loss with two terms, one promoting pairwise orthogonality between the class text representations and the other penalizing deviations from the initial class prototypes. Experiments on 11 benchmarks with three large VLM backbones show that the refined embeddings are strong replacements for the standard CLIP prototypes and that ORION consistently improves performance in zero-shot, few-shot, and test-time adaptation settings.
ORION 是一种仅使用类别名称来增强预训练视觉语言模型的文本编码微调框架。它通过促进正交性和惩罚与初始原型的偏差来优化文本表示。在11个基准上的广泛实验表明,ORION 在零样本、少量样本和测试时适应设置中提高了任务特定的可区分性,并且优于标准的CLIP原型。
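A two-term objective of the kind the abstract describes can be sketched as follows. The exact form of ORION's loss is not given in the abstract, so everything here is an assumption: squared cosine similarity as the orthogonality term, squared Euclidean distance to the initial prototypes as the anchor term, and the hypothetical weight `anchor_weight`. The low-rank adaptation used to optimize it is omitted.

```python
import math


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonality_anchor_loss(prototypes, initial_prototypes, anchor_weight=0.1):
    current = [normalize(p) for p in prototypes]
    initial = [normalize(p) for p in initial_prototypes]
    # Term 1 (assumed form): squared cosine similarity over all class pairs,
    # which is zero exactly when the prototypes are pairwise orthogonal.
    orthogonality = 0.0
    for i in range(len(current)):
        for j in range(i + 1, len(current)):
            orthogonality += dot(current[i], current[j]) ** 2
    # Term 2 (assumed form): squared distance to the initial prototypes.
    anchor = sum(
        sum((c - r) ** 2 for c, r in zip(cur, ref))
        for cur, ref in zip(current, initial)
    )
    return orthogonality + anchor_weight * anchor


correlated = [[1.0, 0.0], [1.0, 1.0]]  # initial prototypes at 45 degrees
orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # candidate after adaptation
loss_start = orthogonality_anchor_loss(correlated, correlated)
loss_moved = orthogonality_anchor_loss(orthogonal, correlated)
```

In this toy case, moving the second prototype to be orthogonal to the first lowers the total loss even though the anchor term is no longer zero; a larger `anchor_weight` would favor staying close to the initial prototypes.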
MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models
Authors: Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji
First: 2026-02-23T04:32:52+00:00 · Latest: 2026-02-23T04:32:52+00:00
Comments: CVPR2026
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related images, existing benchmarks rarely address the challenges of multi-image context generation, focusing mainly on text-to-image or single-image editing tasks. In this work, we introduce MICON-Bench, a comprehensive benchmark covering six tasks that evaluate cross-image composition, contextual reasoning, and identity preservation. We further propose an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification of semantic and visual consistency, where a multimodal large language model (MLLM) serves as a verifier. Additionally, we present Dynamic Attention Rebalancing (DAR), a training-free, plug-and-play mechanism that dynamically adjusts attention during inference to enhance coherence and reduce hallucinations. Extensive experiments on various state-of-the-art open-source models demonstrate both the rigor of MICON-Bench in exposing multi-image reasoning challenges and the efficacy of DAR in improving generation quality and cross-image coherence. Github: https://github.com/Angusliuuu/MICON-Bench.
中文标题/摘要
标题:MICON-Bench:统一多模态模型中多图像上下文图像生成的基准测试与增强
统一多模态模型(UMMs)的最新进展使图像理解和生成能力取得了显著进步。然而,尽管像Gemini-2.5-Flash-Image这样的模型展示了在处理多张相关图像时进行推理的能力,但现有的基准测试很少关注多图像上下文生成的挑战,主要集中在文本到图像或单图像编辑任务上。在本文中,我们引入了**MICON-Bench**,这是一个涵盖六个任务的综合基准测试,用于评估跨图像合成、上下文推理和身份保留。我们还提出了一种基于MLLM的评估-检查点框架,用于自动验证语义和视觉一致性,其中多模态大型语言模型(MLLM)作为验证器。此外,我们提出了**动态注意力重新平衡(DAR)**,这是一种无需训练、即插即用的机制,在推理过程中动态调整注意力以增强连贯性并减少幻觉。在各种最先进的开源模型上的广泛实验表明,MICON-Bench在揭示多图像推理挑战方面的严谨性以及DAR在提高生成质量和跨图像连贯性方面的有效性。Github: https://github.com/Angusliuuu/MICON-Bench.
Summary / 总结
MICON-Bench is a new benchmark for evaluating multi-image context generation in Unified Multimodal Models, covering six tasks including cross-image composition and contextual reasoning. It introduces an MLLM-driven Evaluation-by-Checkpoint framework for automatic verification and a Dynamic Attention Rebalancing (DAR) mechanism to enhance generation quality. Experiments show that MICON-Bench effectively highlights multi-image reasoning challenges and that DAR improves cross-image coherence and reduces hallucinations.
MICON-Bench 是一个新的基准,用于评估统一多模态模型(UMMs)中的多图像上下文生成,解决了现有基准的局限性。它包括六个任务来评估跨图像合成、上下文推理和身份保持。研究提出了一种基于多模态大型语言模型(MLLM)的自动验证框架 Evaluation-by-Checkpoint,并引入了动态注意力重新平衡(DAR)机制,该机制在不进行训练的情况下增强生成质量和跨图像一致性。实验表明,MICON-Bench 在突出多图像推理挑战方面有效,而 DAR 能够提高生成质量和跨图像一致性。
Forgetting-Resistant and Lesion-Aware Source-Free Domain Adaptive Fundus Image Analysis with Vision-Language Model
Authors: Zheang Huai, Hui Tang, Hualiang Wang, Xiaomeng Li
First: 2026-02-23T03:29:54+00:00 · Latest: 2026-02-23T03:29:54+00:00
Comments: 10 pages
Abstract
Source-free domain adaptation (SFDA) aims to adapt a model trained in the source domain to perform well in the target domain, with only unlabeled target domain data and the source model. Taking into account that conventional SFDA methods are inevitably error-prone under domain shift, recently greater attention has been directed to SFDA assisted with off-the-shelf foundation models, e.g., vision-language (ViL) models. However, existing works of leveraging ViL models for SFDA confront two issues: (i) Although mutual information is exploited to consider the joint distribution between the predictions of ViL model and the target model, we argue that the forgetting of some superior predictions of the target model still occurs, as indicated by the decline of the accuracies of certain classes during adaptation; (ii) Prior research disregards the rich, fine-grained knowledge embedded in the ViL model, which offers detailed grounding for fundus image diagnosis. In this paper, we introduce a novel forgetting-resistant and lesion-aware (FRLA) method for SFDA of fundus image diagnosis with ViL model. Specifically, a forgetting-resistant adaptation module explicitly preserves the confident predictions of the target model, and a lesion-aware adaptation module yields patch-wise predictions from ViL model and employs them to help the target model be aware of the lesion areas and leverage the ViL model's fine-grained knowledge. Extensive experiments show that our method not only significantly outperforms the vision-language model, but also achieves consistent improvements over the state-of-the-art methods. Our code will be released.
中文标题/摘要
标题:基于视觉语言模型的抗遗忘和病变感知源无域适应视网膜图像分析
源无域适应(SFDA)旨在仅使用目标域的未标记数据和源模型的情况下,将源域训练的模型适应到目标域中表现良好。考虑到传统SFDA方法在域转移下不可避免地存在错误,最近越来越多的研究开始关注使用现成的基础模型(如视觉语言模型)辅助的SFDA。然而,现有利用视觉语言模型进行SFDA的工作面临两个问题:(i)尽管利用互信息考虑了视觉语言模型和目标模型预测之间的联合分布,但我们认为目标模型的一些优秀预测仍然会发生遗忘,这体现在某些类别的准确性在适应过程中下降;(ii)先前的研究忽视了视觉语言模型中嵌入的丰富、细粒度的知识,这些知识为视网膜图像诊断提供了详细的依据。在本文中,我们提出了一种新颖的抗遗忘和病变感知(FRLA)方法,用于使用视觉语言模型进行视网膜图像诊断的源无域适应。具体而言,一个抗遗忘适应模块明确地保留了目标模型的自信预测,而一个病变感知适应模块从视觉语言模型中产生局部预测,并利用这些预测帮助目标模型意识到病变区域并利用视觉语言模型的细粒度知识。广泛的实验表明,我们的方法不仅显著优于视觉语言模型,而且在最先进的方法上也实现了持续改进。我们的代码将被发布。
Summary / 总结
This paper addresses source-free domain adaptation (SFDA) for fundus image diagnosis, where a source-trained model is adapted using only unlabeled target-domain data and the source model. The authors introduce a forgetting-resistant and lesion-aware (FRLA) method that uses a vision-language (ViL) model: one module explicitly preserves the target model's confident predictions to limit forgetting, and another uses patch-wise ViL predictions to make the target model aware of lesion areas. Experiments show that the approach significantly outperforms the ViL model and achieves consistent improvements over state-of-the-art methods.
本文提出了一种遗忘抵抗和病灶感知(FRLA)方法,用于使用视觉-语言(ViL)模型进行眼底图像分析的源无域适应(SFDA)。该方法解决了ViL模型中的遗忘和知识利用不足的问题。遗忘抵抗适应模块保留了目标模型的自信预测,而病灶感知适应模块提供斑块级别的预测以突出病灶区域,并利用细粒度知识。实验表明,所提出的方法不仅优于ViL模型,还优于最先进的SFDA方法。
VIRTUE: Visual-Interactive Text-Image Universal Embedder
Authors: Wei-Yao Wang, Kazuya Tateishi, Qiyu Wu, Shusuke Takahashi, Yuki Mitsufuji
Venue: ICLR 2026
First: 2025-10-01T05:11:54+00:00 · Latest: 2026-02-23T02:47:15+00:00
Comments: ICLR 2026. 25 pages. Project page: https://sony.github.io/virtue/
Abstract
Multimodal representation learning models have demonstrated successful operation across complex tasks, and the integration of vision-language models (VLMs) has further enabled embedding models with instruction-following capabilities. However, existing embedding models lack visual-interactive capabilities to specify regions of interest from users (e.g., point, bounding box, mask), which have been explored in generative models to broaden their human-interactive applicability. Equipping embedding models with visual interactions not only would unlock new applications with localized grounding of user intent, which remains unexplored, but also enable the models to learn entity-level information within images to complement their global representations for conventional embedding tasks. In this paper, we propose a novel Visual-InteRactive Text-Image Universal Embedder (VIRTUE) that extends the capabilities of the segmentation model and the vision-language model to the realm of representation learning. In VIRTUE, the segmentation model can process visual prompts that pinpoint specific regions within an image, thereby enabling the embedder to handle complex and ambiguous scenarios more precisely. To evaluate the visual-interaction ability of VIRTUE, we introduce a large-scale Segmentation-and-Scene Caption Retrieval (SCaR) benchmark comprising 1M samples that aims to retrieve the text caption by jointly considering the entity with a specific object and image scene. VIRTUE consistently achieves a state-of-the-art performance with significant improvements across 36 universal MMEB (3.1%-8.5%) and five visual-interactive SCaR (15.2%-20.3%) tasks.
中文标题/摘要
标题:VIRTUE:视觉互动文本图像通用嵌入器
多模态表示学习模型在复杂任务中表现出色,而视觉语言模型(VLMs)的集成进一步使嵌入模型具备了指令遵循能力。然而,现有的嵌入模型缺乏视觉互动能力,无法从用户处指定感兴趣区域(例如,点、边界框、掩码),这些能力已在生成模型中得到探索,以扩大其与人类的互动适用性。为嵌入模型配备视觉互动不仅能够解锁新的具有局部用户意图定位的应用,而且能够使模型在图像中学习实体级信息,以补充其全局表示,从而更好地完成传统嵌入任务。在本文中,我们提出了一种名为Visual-InteRactive Text-Image Universal Embedder (VIRTUE) 的新型视觉互动文本图像通用嵌入器,它将分割模型和视觉语言模型的能力扩展到了表示学习领域。在VIRTUE中,分割模型可以处理指向图像内特定区域的视觉提示,从而使嵌入器能够更精确地处理复杂和模糊的场景。为了评估VIRTUE的视觉互动能力,我们引入了一个包含100万样本的大型分割和场景描述检索(SCaR)基准,旨在通过同时考虑特定对象和图像场景来检索文本描述。VIRTUE在36项通用MMEB任务(3.1%-8.5%)和五项视觉互动SCaR任务(15.2%-20.3%)中均实现了最先进的性能。
Summary / 总结
VIRTUE is a novel multimodal embedding model that integrates visual-interactive capabilities, allowing users to specify regions of interest through visual prompts. It extends the segmentation model and vision-language model to representation learning, enabling more precise handling of complex and ambiguous scenarios. VIRTUE outperforms existing models in both universal multimodal embedding benchmarks (3.1%-8.5% improvement) and visual-interactive segmentation-and-scene caption retrieval tasks (15.2%-20.3% improvement).
研究旨在通过将视觉交互能力整合到现有的嵌入模型中,增强多模态表示学习,这些模型缺乏从用户指定感兴趣区域的能力。VIRTUE,一种新颖的视觉-交互式文本-图像通用嵌入器,将分割模型和视觉语言模型扩展到表示学习领域,使其能够更精确地处理复杂和模棱两可的场景。VIRTUE 在通用多模态嵌入基准任务和视觉交互式分割和场景描述检索任务中表现出优越性能,显著优于现有模型。
Decoupling Vision and Language: Codebook Anchored Visual Adaptation
Authors: Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu
First: 2026-02-23T02:39:26+00:00 · Latest: 2026-02-23T02:39:26+00:00
Comments: 17 pages, accepted to CVPR2026 main conference
Abstract
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
中文标题/摘要
标题:解耦视觉与语言:基于码本的视觉适应
大型视觉-语言模型(LVLMs)使用其视觉编码器将图像转换为下游推理的表示,但编码器在医学图像诊断或细粒度分类等特定领域视觉任务中往往表现不佳,其中表示错误可能会传递给语言模型,导致错误的响应。现有适应方法通过投影调整或参数高效更新修改编码器和语言模型之间的连续特征接口,但仍然将两个组件耦合在一起,并且每当编码器改变时都需要重新对齐。我们引入了CRAFT(码本调节微调),这是一种轻量级方法,使用离散码本对编码器进行微调,将视觉表示锚定到稳定的空间中,从而在不修改模型其他部分的情况下实现领域适应。这种解耦设计允许适应后的编码器无缝提升不同语言架构的LVLMs的性能,只要它们共享相同的码本。实验证明,CRAFT在包括VQARAD和PlantVillage在内的10个特定领域基准测试中平均提高了13.51%,同时保留了LLM的语言能力,并优于在连续标记上操作的同类方法。
Summary / 总结
The research aims to improve the performance of large vision-language models (LVLMs) in domain-specific visual tasks by decoupling the vision encoder from the language model. CRAFT, a lightweight method, fine-tunes the encoder using a discrete codebook, anchoring visual representations to a stable token space. This approach avoids modifying other parts of the model and achieves an average performance gain of 13.51% across 10 domain-specific benchmarks while preserving linguistic capabilities and outperforming methods that operate on continuous tokens.
研究旨在通过解耦视觉编码器和语言模型来提高大型视觉-语言模型(LVLM)在特定领域视觉任务中的性能。CRAFT 是一种轻量级方法,通过使用离散码本来微调编码器,将视觉表示锚定到一个稳定的令牌空间。这种方法避免了修改模型的其他部分,并在10个特定领域基准测试中实现了平均13.51%的性能提升,同时保留了语言能力并优于操作连续令牌的方法。
UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment
Authors: Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
First: 2026-02-23T02:24:55+00:00 · Latest: 2026-02-23T02:24:55+00:00
Comments: 26 pages
Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy (κ = 0.45) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
中文标题/摘要
标题:UrbanAlign:后验语义校准以实现VLM-人类偏好对齐
在特定领域任务中使视觉语言模型(VLM)输出与人类偏好对齐通常需要微调或强化学习,两者都需要标注数据和GPU计算。我们表明,在主观感知任务中,这种对齐可以通过无需任何模型训练来实现:VLM已经是强大的概念提取器,但决策校准较差,可以通过外部手段弥补这一差距。我们提出了一种无需训练的后验概念瓶颈管道,由三个紧密耦合的阶段组成:概念挖掘、多智能体结构化评分和几何校准,统一在一个端到端维度优化循环中。可解释的评估维度从少量的人类注释中挖掘;观察者-辩手-法官链从冻结的VLM中提取稳健的连续概念评分;并在混合视觉语义流形上使用局部加权岭回归校准这些评分与人类评分。应用于城市感知作为UrbanAlign,该框架在Place Pulse 2.0的六个类别上实现了72.2%的准确率($κ=0.45$),优于最佳监督基线+15.1个百分点,优于未校准的VLM评分+16.3个百分点,具有完整的维度级可解释性和零模型权重修改。
Summary / 总结
The research aims to align vision-language model outputs with human preferences in urban perception tasks without model training, which typically demands labelled data and GPU compute. The method is a training-free post-hoc concept-bottleneck pipeline with three stages: concept mining, multi-agent structured scoring, and geometric calibration. It achieves 72.2% accuracy (κ = 0.45) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by 15.1 percentage points and uncalibrated VLM scoring by 16.3 percentage points.
研究旨在通过后处理概念瓶颈管道,无需模型训练,将视觉语言模型的输出与城市感知任务中的人类偏好对齐。该方法UrbanAlign包括概念挖掘、多代理结构评分和几何校准三个阶段。该方法在Place Pulse 2.0数据集上达到了72.2%的准确率,显著优于监督基线和未校准的VLM评分。
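The calibration stage relies on locally-weighted ridge regression, which can be sketched in one dimension. This is a simplified illustration, not UrbanAlign's pipeline: it uses a single scalar concept score instead of the hybrid visual-semantic manifold, and the Gaussian kernel, the `bandwidth` and `ridge` parameters, and the function name are assumptions.

```python
import math


def locally_weighted_ridge_predict(xs, ys, query, bandwidth=1.0, ridge=0.1):
    """Fit y ~ intercept + slope * x around `query` and predict at `query`.

    Training points are weighted by a Gaussian kernel on their distance to
    the query; the ridge penalty applies to the slope only.
    """
    weights = [
        math.exp(-((x - query) ** 2) / (2.0 * bandwidth ** 2)) for x in xs
    ]
    s_w = sum(weights)
    s_wx = sum(w * x for w, x in zip(weights, xs))
    s_wy = sum(w * y for w, y in zip(weights, ys))
    s_wxx = sum(w * x * x for w, x in zip(weights, xs))
    s_wxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    # Closed-form solution of the 2x2 weighted ridge normal equations.
    det = s_w * (s_wxx + ridge) - s_wx * s_wx
    intercept = (s_wy * (s_wxx + ridge) - s_wx * s_wxy) / det
    slope = (s_w * s_wxy - s_wx * s_wy) / det
    return intercept + slope * query


# Hypothetical VLM concept scores and the human ratings they should match.
scores = [0.0, 1.0, 2.0, 3.0, 4.0]
ratings = [1.0, 3.0, 5.0, 7.0, 9.0]

exact = locally_weighted_ridge_predict(scores, ratings, 2.0, ridge=0.0)
smoothed = locally_weighted_ridge_predict(scores, ratings, 2.5, ridge=1.0)
```

With `ridge=0.0` the fit recovers the exactly linear toy relationship; a positive `ridge` shrinks the local slope, which trades some fidelity for stability when only a handful of human annotations fall near the query.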