Going Down Memory Lane: Scaling Tokens for Video Stream Understanding with Dynamic KV-Cache Memory
Authors: Vatsal Agarwal, Saksham Suri, Matthew Gwilliam, Pulkit Kumar, Abhinav Shrivastava
First: 2026-02-20T18:59:50+00:00 · Latest: 2026-02-20T18:59:50+00:00
Comments: Project page: see https://vatsalag99.github.io/memstream/
Abstract
Streaming video understanding requires models to robustly encode, store, and retrieve information from a continuous video stream to support accurate video question answering (VQA). Existing state-of-the-art approaches rely on key-value caching to accumulate frame-level information over time, but use a limited number of tokens per frame, leading to the loss of fine-grained visual details. In this work, we propose scaling the token budget to enable more granular spatiotemporal understanding and reasoning. First, we find that current methods are ill-equipped to handle dense streams: their feature encoding causes query-frame similarity scores to increase over time, biasing retrieval toward later frames. To address this, we introduce an adaptive selection strategy that reduces token redundancy while preserving local spatiotemporal information. We further propose a training-free retrieval mixture-of-experts that leverages external models to better identify relevant frames. Our method, MemStream, achieves +8.0% on CG-Bench, +8.5% on LVBench, and +2.4% on VideoMME (Long) over ReKV with Qwen2.5-VL-7B.
中文标题/摘要
标题:沿着记忆之路:通过动态KV缓存记忆扩展视频流理解的令牌
视频流理解需要模型从连续视频流中稳健地编码、存储和检索信息,以支持准确的视频问答(VQA)。现有最先进的方法依赖于键值缓存来随着时间累积帧级信息,但每帧使用的令牌数量有限,导致丢失了细粒度的视觉细节。在本工作中,我们提出扩展令牌预算以实现更精细的空间-时间理解和推理。首先,我们发现当前方法无法有效处理密集流:它们的特征编码导致查询帧相似度分数随时间增加,偏向于后期帧的检索。为了解决这个问题,我们引入了一种自适应选择策略,减少令牌冗余同时保留局部空间-时间信息。我们还提出了一种无需训练的检索专家混合模型,利用外部模型更好地识别相关帧。我们的方法MemStream在CG-Bench上提高了8.0%,在LVBench上提高了8.5%,在VideoMME(长)上提高了2.4%,超过ReKV与Qwen2.5-VL-7B。
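Illustrative sketch (not from the paper): the abstract describes retrieving relevant frames from a cache by query-frame similarity. The snippet below shows plain top-k cosine retrieval over per-frame key vectors. The names `retrieve_frames` and `frame_keys` are hypothetical, and MemStream's adaptive token selection and retrieval mixture-of-experts are not modeled here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_frames(query, frame_keys, k):
    """Return the indices of the k frames most similar to the query, in temporal order."""
    ranked = sorted(
        range(len(frame_keys)),
        key=lambda i: cosine(query, frame_keys[i]),
        reverse=True,
    )
    # Re-sort the winners by index so downstream attention sees frames chronologically.
    return sorted(ranked[:k])
```

If similarity scores drift upward over time, as the abstract reports for dense streams, a ranking like this one would favor later frames regardless of content, which is the bias the paper sets out to correct.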
CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
Authors: Xia Su, Ruiqi Chen, Benlin Liu, Jingwei Ma, Zonglin Di, Ranjay Krishna, Jon Froehlich
First: 2026-02-20T18:46:27+00:00 · Latest: 2026-02-20T18:46:27+00:00
Abstract
Vision-Language Models (VLMs) have shown remarkable progress in Vision-Language Navigation (VLN), offering new possibilities for navigation decision-making that could benefit both robotic platforms and human users. However, real-world navigation is inherently conditioned by the agent's mobility constraints. For example, a sweeping robot cannot traverse stairs, while a quadruped can. We introduce Capability-Conditioned Navigation (CapNav), a benchmark designed to evaluate how well VLMs can navigate complex indoor spaces given an agent's specific physical and operational capabilities. CapNav defines five representative human and robot agents, each described with physical dimensions, mobility capabilities, and environmental interaction abilities. CapNav provides 45 real-world indoor scenes, 473 navigation tasks, and 2365 QA pairs to test whether VLMs can traverse indoor environments based on agent capabilities. We evaluate 13 modern VLMs and find that current VLMs' navigation performance drops sharply as mobility constraints tighten, and that even state-of-the-art models struggle with obstacle types that require reasoning on spatial dimensions. We conclude by discussing the implications for capability-aware navigation and the opportunities for advancing embodied spatial reasoning in future VLMs. The benchmark is available at https://github.com/makeabilitylab/CapNav
中文标题/摘要
标题:CapNav:基于能力条件的室内导航视觉语言模型基准测试
视觉语言模型(VLMs)在视觉语言导航(VLN)方面取得了显著进展,为导航决策提供了新的可能性,这不仅有助于机器人平台,也惠及人类用户。然而,现实世界的导航本质上受到代理移动约束的条件限制。例如,清洁机器人无法穿越楼梯,而四足机器人可以。我们引入了基于能力条件的导航(CapNav),这是一种基准测试,旨在评估VLMs在给定代理特定物理和操作能力的情况下,如何导航复杂的室内空间。CapNav 定义了五种代表性的人类和机器人代理,每种代理都描述了其物理尺寸、移动能力和环境交互能力。CapNav 提供了 45 个真实世界的室内场景、473 个导航任务和 2365 个问答对,以测试 VLMs 是否可以根据代理能力穿越室内环境。我们评估了 13 种现代 VLMs,发现当前 VLM 的导航性能随着移动约束的收紧而急剧下降,即使是最先进的模型也难以应对需要在空间维度上进行推理的障碍类型。最后,我们讨论了能力感知导航的含义以及未来 VLMs 中增强实体空间推理的机会。基准测试可在 https://github.com/makeabilitylab/CapNav 获取
Summary / 总结
The study introduces CapNav, a benchmark to evaluate Vision-Language Models (VLMs) in navigating indoor environments based on the physical capabilities of agents. It defines five human and robot agents with specific mobility and interaction abilities and provides 45 real-world scenes and 473 navigation tasks. The evaluation of 13 modern VLMs shows that their performance significantly decreases with tighter mobility constraints, and even state-of-the-art models struggle with spatial reasoning required for certain obstacles.
研究旨在评估视觉语言模型(VLMs)在基于不同代理的物理和操作约束条件下导航室内环境的能力。该研究引入了CapNav基准,其中包括五种具有不同物理尺寸和移动能力的人类和机器人代理,以及45个真实世界的室内场景,包含473项导航任务和2365对问答。对13种现代VLMs的评估显示,随着移动约束条件的收紧,其导航性能显著下降,即使是最先进的模型在处理某些障碍物类型所需的空间推理方面也存在困难。
ConformalNL2LTL: Translating Natural Language Instructions into Temporal Logic Formulas with Conformal Correctness Guarantees
Authors: David Smith Sundarsingh, Jun Wang, Jyotirmoy V. Deshmukh, Yiannis Kantaros
First: 2025-04-22T20:32:34+00:00 · Latest: 2026-02-20T17:50:01+00:00
Abstract
Linear Temporal Logic (LTL) is a widely used task specification language for autonomous systems. To mitigate the significant manual effort and expertise required to define LTL-encoded tasks, several methods have been proposed for translating Natural Language (NL) instructions into LTL formulas; these methods, however, lack correctness guarantees. To address this, we propose a new NL-to-LTL translation method, called ConformalNL2LTL, that achieves user-defined translation success rates on unseen NL commands. Our method constructs LTL formulas iteratively by solving a sequence of open-vocabulary question-answering (QA) problems using large language models (LLMs). These QA tasks are handled collaboratively by a primary and an auxiliary model. The primary model answers each QA instance while quantifying uncertainty via conformal prediction; when it is insufficiently certain according to user-defined confidence thresholds, it requests assistance from the auxiliary model and, if necessary, from the user. We demonstrate theoretically and empirically that ConformalNL2LTL achieves the desired translation accuracy while minimizing user intervention.
中文标题/摘要
标题:ConformalNL2LTL:使用符合正确性保证的自然语言指令翻译成时间逻辑公式的方法
线性时间逻辑(LTL)是一种广泛用于自主系统任务规范的语言。为了减轻定义LTL编码任务所需的大量手动努力和专业知识,已经提出了几种将自然语言(NL)指令翻译成LTL公式的办法,但这些方法缺乏正确性保证。为了解决这个问题,我们提出了一种新的NL到LTL翻译方法,称为ConformalNL2LTL,该方法可以在未见过的NL命令上实现用户定义的翻译成功率。我们的方法通过使用大型语言模型(LLMs)迭代构建LTL公式,解决一系列开放词汇的问答(QA)问题。这些QA任务由一个主模型和一个辅助模型协作处理。主模型回答每个QA实例并量化不确定性通过符合性预测;当其不确定性根据用户定义的信心阈值不足时,它会请求辅助模型的帮助,必要时还会请求用户帮助。我们从理论上和实验上证明,ConformalNL2LTL可以实现所需的翻译准确性,同时将用户干预降至最低。
Summary / 总结
The paper introduces ConformalNL2LTL, a method for translating natural language instructions into Linear Temporal Logic (LTL) formulas with user-defined success rates. It uses large language models to iteratively solve open-vocabulary QA problems, with a primary model providing answers and quantifying uncertainty through conformal prediction. If the primary model is uncertain, it seeks assistance from an auxiliary model or the user. Experiments show that ConformalNL2LTL achieves the desired translation accuracy with minimal user intervention.
论文旨在解决将自然语言指令翻译成线性时序逻辑(LTL)公式时缺乏正确性保证的问题。提出了一种名为ConformalNL2LTL的方法,该方法利用大型语言模型通过解决开放词汇量的问答问题来逐步构建LTL公式。主模型使用保形预测(conformal prediction)量化不确定性,并在必要时寻求辅助模型或用户的帮助。实验表明,ConformalNL2LTL能够在最少用户干预的情况下实现所需的翻译准确性。
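Illustrative sketch (not from the paper): the abstract says the primary model quantifies uncertainty via conformal prediction and asks for help when it is not certain enough. One common way to do this, assumed here, is split conformal prediction with the nonconformity score 1 - p(answer): calibrate a threshold on held-out examples, build a prediction set for each new question, and answer only when that set holds exactly one candidate. All function names are hypothetical, and the paper's actual score function and help protocol may differ.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration nonconformity scores at miscoverage alpha."""
    n = len(cal_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every candidate must be kept.
        return float("inf")
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, q_hat):
    """Keep every candidate answer whose score 1 - p does not exceed the threshold."""
    return [answer for answer, p in probs.items() if 1.0 - p <= q_hat]


def answer_or_ask(probs, q_hat):
    """Answer when the prediction set is a singleton; otherwise defer to a helper."""
    candidates = prediction_set(probs, q_hat)
    if len(candidates) == 1:
        return ("answer", candidates[0])
    return ("ask_for_help", candidates)
```

Here `cal_scores` would hold 1 - p(true answer) for each calibration question. Under the usual exchangeability assumption, the prediction set contains the correct answer with probability at least 1 - alpha, which is the kind of user-defined guarantee the abstract refers to.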
Zero-shot Interactive Perception
Authors: Venkatesh Sripada, Frank Guerin, Amir Ghalamzan
First: 2026-02-20T17:30:25+00:00 · Latest: 2026-02-20T17:30:25+00:00
Comments: Original manuscript submitted on April 24, 2025. Timestamped and publicly available on OpenReview: https://openreview.net/forum?id=7MhpFcr5Nx
Abstract
Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions, (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that executes pushing, pulling, or grasping based on VLM output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.
中文标题/摘要
标题:零样本交互感知
交互感知(IP)使机器人能够通过物理与物体互动并改变环境状态来提取其工作空间中的隐藏信息并执行操作计划——这对于解决复杂、部分可观测场景中的遮挡和模糊至关重要。我们提出了零样本IP(ZS-IP),这是一种新颖的框架,将多策略操作(推和抓取)与记忆驱动的视觉语言模型(VLM)结合,以指导机器人的互动并解决语义查询。ZS-IP 结合了三个关键组件:(1)增强观察(EO)模块,该模块通过常规关键点和我们提出的推线——一种针对推操作定制的新型二维视觉增强,增强了VLM的视觉感知;(2)记忆引导的操作模块,通过上下文查找强化语义推理;(3)基于VLM输出执行推、拉或抓取的机器人控制器。与针对拾取和放置优化的基于网格的增强不同,推线捕捉接触丰富的操作的利用方式,显著提高了推操作的性能。我们在具有不同遮挡和任务复杂度的7-DOF Franka Panda 手臂上评估了ZS-IP。我们的实验表明,ZS-IP 在推操作任务中优于被动和视角基于的感知技术,如基于标记的视觉提示(MOKA),同时保持非目标元素的完整性。
Summary / 总结
The research aims to enhance robotic perception in complex, partially observable environments by integrating multi-strategy manipulation with a memory-driven Vision Language Model. The proposed Zero-Shot Interactive Perception (ZS-IP) framework includes an Enhanced Observation module that uses pushlines for pushing actions, a memory-guided action module for semantic reasoning, and a robotic controller for executing actions. Experiments show that ZS-IP outperforms passive and viewpoint-based techniques, especially in pushing tasks, while maintaining the integrity of non-target elements.
研究旨在提升机器人在复杂场景中解决遮挡和模糊问题的交互感知能力。方法是引入零样本交互感知(ZS-IP),结合多策略操作与记忆驱动的视觉语言模型(VLM)。关键组件包括增强观察模块,使用推线增强视觉感知,记忆引导动作模块进行语义推理,以及基于VLM输出的机器人控制器。实验表明,ZS-IP 在推任务中优于被动和视角基的感知技术,同时保持非目标元素的完整性。
Simplifying Outcomes of Language Model Component Analyses with ELIA
Authors: Aaron Louis Eidt, Nils Feldhus
First: 2026-02-20T14:45:27+00:00 · Latest: 2026-02-20T14:45:27+00:00
Comments: EACL 2026 System Demonstrations. GitHub: https://github.com/aaron0eidt/ELIA
Abstract
While mechanistic interpretability has developed powerful tools to analyze the internal workings of Large Language Models (LLMs), their complexity has created an accessibility gap, limiting their use to specialists. We address this challenge by designing, building, and evaluating ELIA (Explainable Language Interpretability Analysis), an interactive web application that simplifies the outcomes of various language model component analyses for a broader audience. The system integrates three key techniques -- Attribution Analysis, Function Vector Analysis, and Circuit Tracing -- and introduces a novel methodology: using a vision-language model to automatically generate natural language explanations (NLEs) for the complex visualizations produced by these methods. The effectiveness of this approach was empirically validated through a mixed-methods user study, which revealed a clear preference for interactive, explorable interfaces over simpler, static visualizations. A key finding was that the AI-powered explanations helped bridge the knowledge gap for non-experts; a statistical analysis showed no significant correlation between a user's prior LLM experience and their comprehension scores, suggesting that the system reduced barriers to comprehension across experience levels. We conclude that an AI system can indeed simplify complex model analyses, but its true power is unlocked when paired with thoughtful, user-centered design that prioritizes interactivity, specificity, and narrative guidance.
中文标题/摘要
标题:使用ELIA简化语言模型组件分析结果
虽然机制可解释性已经开发出强大的工具来分析大型语言模型(LLMs)的内部工作原理,但其复杂性造成了可访问性差距,限制了其在非专家中的使用。我们通过设计、构建和评估ELIA(可解释语言解释分析)来应对这一挑战,ELIA是一个交互式网络应用程序,简化了各种语言模型组件分析的结果,使其适用于更广泛的受众。该系统整合了三种关键技术——归因分析、功能向量分析和电路追踪,并引入了一种新的方法:使用视觉语言模型自动生成复杂可视化结果的自然语言解释(NLEs)。通过混合方法用户研究实证验证了这种方法的有效性,研究结果表明,交互式、可探索的界面比简单的静态可视化更受欢迎。一个关键发现是,AI驱动的解释有助于填补非专家的知识空白;统计分析显示,用户对LLM的经验与其理解分数之间没有显著相关性,这表明该系统降低了不同经验水平下的理解障碍。我们得出结论,AI系统确实可以简化复杂的模型分析,但其真正的力量在于与以用户为中心的设计相结合,优先考虑互动性、具体性和叙述性指导。
Summary / 总结
The paper addresses the complexity of analyzing the internal workings of Large Language Models (LLMs) by introducing ELIA, an interactive web application that simplifies these analyses. ELIA integrates Attribution Analysis, Function Vector Analysis, and Circuit Tracing, and uses a vision-language model to generate natural language explanations for complex visualizations. A user study showed that interactive, explorable interfaces were preferred over static visualizations, and that AI-generated explanations helped non-experts understand the analyses: comprehension scores showed no significant correlation with users' prior LLM experience.
论文通过引入ELIA(解释性语言可解释性分析)交互式网络应用来简化大型语言模型(LLM)的分析,该应用结合了归因分析、功能向量分析和电路追踪等技术,并使用视觉语言模型自动生成自然语言解释以解释复杂的可视化结果。用户研究显示,交互式界面比静态可视化更受欢迎,AI生成的解释帮助非专家在没有LLM经验的情况下理解分析结果,表明ELIA降低了不同专业水平下的理解障碍。
Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Authors: Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin, Bang Yang, Yuexian Zou
First: 2026-02-20T14:13:22+00:00 · Latest: 2026-02-20T14:13:22+00:00
Abstract
Recent work on test-time scaling for large language model (LLM) reasoning typically assumes that allocating more inference-time computation uniformly improves correctness. However, prior studies show that reasoning uncertainty is highly localized: a small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Motivated by this observation, we propose Thinking by Subtraction, a confidence-driven contrastive decoding approach that improves reasoning reliability through targeted token-level intervention. Our method, Confidence-Driven Contrastive Decoding (CCD), detects low-confidence tokens during decoding and intervenes selectively at these positions. It constructs a contrastive reference by replacing high-confidence tokens with minimal placeholders, and refines predictions by subtracting this reference distribution at low-confidence locations. Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead. As a training-free method, CCD enhances reasoning reliability through targeted low-confidence intervention without computational redundancy. Our code will be made available at: https://github.com/bolo-web/CCD.
中文标题/摘要
标题:通过减法思考:基于信心的对比解码在LLM推理中的应用
大型语言模型(LLM)推理的测试时扩展工作通常假设分配更多的推理时间计算会均匀地提高正确性。然而,先前的研究表明,推理不确定性是高度局部化的:一小部分低信心标记不成比例地导致推理错误和不必要的输出扩展。受这一观察的启发,我们提出了基于信心的对比解码方法——通过目标化的标记级干预来提高推理可靠性。我们的方法,基于信心的对比解码,在解码过程中检测低信心标记,并在这些位置选择性地进行干预。它通过用最小占位符替换高信心标记来构建对比参考,并通过在低信心位置减去该参考分布来细化预测。实验表明,基于信心的对比解码(CCD)在数学推理基准测试中显著提高了准确性,同时大幅减少了输出长度,且计算量开销极小。作为一种无需训练的方法,CCD 通过目标化的低信心干预提高了推理可靠性,而无需计算冗余。我们的代码将在以下地址开源:https://github.com/bolo-web/CCD。
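Illustrative sketch (not from the paper): the abstract describes intervening only at low-confidence positions by subtracting a reference distribution. The snippet below shows one decoding step under two assumptions that the abstract does not confirm: confidence is taken as the maximum softmax probability, and the subtraction uses the common contrastive-decoding form (1 + beta) * main - beta * reference on logits. The names `ccd_step`, `tau`, and `beta` are hypothetical, and construction of the placeholder-based reference is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ccd_step(main_logits, ref_logits, tau=0.5, beta=1.0):
    """Pick the next token id; return (token_id, intervened).

    If the main model is confident (max probability >= tau), decode greedily.
    Otherwise subtract the reference logits before taking the argmax.
    """
    probs = softmax(main_logits)
    confidence = max(probs)
    if confidence >= tau:
        return probs.index(confidence), False
    adjusted = [(1 + beta) * m - beta * r for m, r in zip(main_logits, ref_logits)]
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return best, True
```

Because the reference pass is only consulted at low-confidence positions, most tokens are decoded exactly as in ordinary greedy decoding.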
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Authors: Mirae Kim, Seonghun Jeong, Youngjun Kwak
First: 2026-02-20T11:40:41+00:00 · Latest: 2026-02-20T11:40:41+00:00
Comments: Accepted at LREC 2026
Abstract
Jailbreaking poses a significant risk to the deployment of Large Language Models (LLMs) and Vision Language Models (VLMs). VLMs are particularly vulnerable because they process both text and images, creating broader attack surfaces. However, available resources for jailbreak detection are scarce, particularly in finance. To address this gap, we present FENCE, a bilingual (Korean-English) multimodal dataset for training and evaluating jailbreak detectors in financial applications. FENCE emphasizes domain realism through finance-relevant queries paired with image-grounded threats. Experiments with commercial and open-source VLMs reveal consistent vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models displaying greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models. FENCE provides a focused resource for advancing multimodal jailbreak detection in finance and for supporting safer, more reliable AI systems in sensitive domains. Warning: This paper includes example data that may be offensive.
中文标题/摘要
标题:FENCE:一个金融和多模态逃逸检测数据集
逃逸攻击对大型语言模型(LLMs)和视觉语言模型(VLMs)的部署构成了重大风险。VLMs特别容易受到攻击,因为它们处理文本和图像,从而扩大了攻击面。然而,可用于逃逸检测的资源稀缺,尤其是在金融领域。为了解决这一缺口,我们提出了FENCE,这是一个双语(韩语-英语)多模态数据集,用于训练和评估金融应用中的逃逸检测器。FENCE通过与金融相关的问题配对图像相关的威胁,强调了领域的真实性。对商业和开源VLMs的实验揭示了持续的漏洞,GPT-4o显示出可测量的攻击成功率,开源模型则表现出更大的暴露度。基于FENCE训练的基本检测器在内部准确性上达到了99%,并在外部基准测试中保持了强大的性能,突显了该数据集在训练可靠检测模型方面的稳健性。FENCE为金融领域多模态逃逸检测的进步提供了集中资源,并支持更安全、更可靠的AI系统在敏感领域的应用。警告:本文包括可能具有冒犯性的示例数据。
Summary / 总结
The paper introduces FENCE, a bilingual (Korean-English) multimodal dataset for detecting jailbreaks in financial applications, addressing the lack of resources for this task. The dataset includes finance-relevant queries paired with image-grounded threats to enhance domain realism. Experiments show that both commercial and open-source VLMs have vulnerabilities, with GPT-4o showing measurable attack success rates and open-source models showing greater exposure. A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and performs well on external benchmarks, highlighting the dataset's effectiveness for training reliable jailbreak detectors.
该论文介绍了FENCE,一个双语(韩英)多模态数据集,用于检测金融应用中的越狱。它解决了这一领域资源稀缺的问题,特别是金融领域。实验显示各种VLM存在一致性漏洞,GPT-4o显示出可测量的攻击成功率。基于FENCE训练的基本检测器在内部分布上实现了高准确率,并在外部基准测试中表现出色,突显了该数据集的有效性,用于训练可靠的越狱检测器。
MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Authors: Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
First: 2026-02-18T21:28:56+00:00 · Latest: 2026-02-20T10:41:16+00:00
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.
中文标题/摘要
标题:MALLVI:一种综合机器人操作的多智能体框架
使用大型语言模型(LLMs)进行机器人操作的任务规划是一个新兴领域。先前的方法依赖于专门的模型、微调或提示调优,并且通常以开环方式运行,缺乏稳健的环境反馈,使其在动态环境中变得脆弱。MALLVi 提出了一种多智能体大型语言和视觉框架,能够通过闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVi 生成可执行的原子动作供机器人执行。执行动作后,视觉语言模型(VLM)评估环境反馈并决定是否重复该过程或进行下一步。MALLVi 不使用单一模型,而是协调分解器、定位器、思考者和反思者等专门智能体来管理感知、定位、推理和高级规划。可选的描述者智能体提供初始状态的视觉记忆。反思者通过仅重新激活相关智能体来支持有针对性的错误检测和恢复,避免全面重新规划。在模拟和真实世界环境中的实验表明,迭代闭环多智能体协调可以提高泛化能力并增加零样本操作任务的成功率。代码可在 https://github.com/iman1234ahmadi/MALLVI 获取。
Summary / 总结
MALLVi is a multi-agent framework that integrates large language models and vision for robotic manipulation, enabling closed-loop feedback-driven task planning. Given a natural language instruction and an image, MALLVi generates and executes atomic actions, with a Vision Language Model evaluating feedback to decide on next steps. Experiments show improved generalization and success rates in zero-shot manipulation tasks through iterative closed-loop coordination of specialized agents.
MALLVi 是一个多代理框架,利用大型语言和视觉模型实现闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVi 生成可执行的动作,并使用视觉语言模型评估环境反馈,决定是否重复或继续。该框架包括感知、定位、推理和规划的专业代理,还有一个可选的视觉记忆代理。实验表明,迭代的闭环多代理协调可以提高零样本操作任务的成功率和泛化能力。
LeafNet: A Large-Scale Dataset and Comprehensive Benchmark for Foundational Vision-Language Understanding of Plant Diseases
Authors: Khang Nguyen Quoc, Phuong D. Dao, Luyl-Da Quach
First: 2026-02-14T08:10:27+00:00 · Latest: 2026-02-20T09:41:11+00:00
Comments: 26 pages, 13 figures and 8 tables
Abstract
Foundation models and vision-language pre-training have significantly advanced Vision-Language Models (VLMs), enabling multimodal processing of visual and linguistic data. However, their application in domain-specific agricultural tasks, such as plant pathology, remains limited due to the lack of large-scale, comprehensive multimodal image--text datasets and benchmarks. To address this gap, we introduce LeafNet, a comprehensive multimodal dataset, and LeafBench, a visual question-answering benchmark developed to systematically evaluate the capabilities of VLMs in understanding plant diseases. The dataset comprises 186,000 leaf digital images spanning 97 disease classes, paired with metadata, generating 13,950 question-answer pairs spanning six critical agricultural tasks. The questions assess various aspects of plant pathology understanding, including visual symptom recognition, taxonomic relationships, and diagnostic reasoning. Benchmarking 12 state-of-the-art VLMs on our LeafBench dataset, we reveal substantial disparity in their disease understanding capabilities. Our study shows performance varies markedly across tasks: binary healthy--diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Direct comparison between vision-only models and VLMs demonstrates the critical advantage of multimodal architectures: fine-tuned VLMs outperform traditional vision models, confirming that integrating linguistic representations significantly enhances diagnostic precision. These findings highlight critical gaps in current VLMs for plant pathology applications and underscore the need for LeafBench as a rigorous framework for methodological advancement and progress evaluation toward reliable AI-assisted plant disease diagnosis. Code is available at https://github.com/EnalisUs/LeafBench.
中文标题/摘要
标题:LeafNet:植物病害基础视觉-语言理解的大规模数据集和全面基准
基础模型和视觉-语言预训练显著推动了视觉-语言模型(VLMs)的发展,使其能够处理视觉和语言数据。然而,由于缺乏大规模、全面的多模态图像-文本数据集和基准,它们在特定农业任务中的应用,如植物病理学,仍然受到限制。为解决这一问题,我们引入了LeafNet,一个全面的多模态数据集,以及LeafBench,一个视觉问答基准,旨在系统评估VLMs在理解植物病害方面的能力。该数据集包含186,000张叶子的数字图像,涵盖97个病害类别,配以元数据,生成了13,950个问题-答案对,覆盖六个关键农业任务。问题评估了植物病理学理解的各个方面,包括视觉症状识别、分类关系和诊断推理。在我们的LeafBench数据集上对12个最先进的VLMs进行基准测试,我们揭示了它们在病害理解能力上的巨大差异。我们的研究表明,任务之间的表现差异显著:二元健康-病害分类的准确率超过90%,而细粒度病原体和物种识别的准确率低于65%。视觉模型与VLMs之间的直接比较表明,多模态架构具有关键优势:微调后的VLMs优于传统的视觉模型,证实了整合语言表示显著提高了诊断精度。这些发现突显了当前VLMs在植物病理学应用中的关键差距,并强调了LeafBench作为严格框架的重要性,用于方法学进步和评估以实现可靠的AI辅助植物病害诊断。代码可在https://github.com/EnalisUs/LeafBench/获取。
Summary / 总结
The research introduces LeafNet, a large-scale multimodal dataset for plant diseases, and LeafBench, a benchmark for evaluating Vision-Language Models (VLMs) in understanding plant pathology. The dataset includes 186,000 leaf images spanning 97 disease classes, with 13,950 question-answer pairs covering six agricultural tasks. Benchmarking 12 state-of-the-art VLMs on LeafBench, the study reveals varying performance across tasks: binary healthy-diseased classification exceeds 90% accuracy, while fine-grained pathogen and species identification remains below 65%. Fine-tuned VLMs outperform traditional vision-only models in diagnostic precision, yet the results highlight the need for further development of VLMs for plant pathology applications.
研究引入了LeafNet,这是一个包含186,000张叶片图像和13,950个问题-答案对的大规模多模态数据集,涵盖97种疾病类别。在LeafBench基准上对12个最先进的VLM进行评估,研究揭示了不同任务之间的性能差异,二元分类表现优于细粒度识别。多模态架构在诊断精度上优于仅视觉模型,突显了在植物病理应用中进一步开发VLM的需求。
OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models
Authors: Ling Lin, Yang Bai, Heng Su, Congcong Zhu, Yaoxing Wang, Yang Zhou, Huazhu Fu, Jingrun Chen
First: 2026-02-20T09:34:21+00:00 · Latest: 2026-02-20T09:34:21+00:00
Comments: 54 pages, 21 figures
Abstract
Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.
中文标题/摘要
标题:OODBench:大规模视觉-语言模型的离分布数据基准
现有的视觉-语言模型(VLMs)通过在大规模数据集上进行训练,取得了显著的进步,通常是在数据独立且同分布(IID)的假设下进行的。然而,在现实世界中,期望所有由AI系统处理的数据都满足这一假设往往是不切实际的。此外,未能适当处理离分布(OOD)对象可能会在实际应用中(例如自动驾驶或医疗辅助)引入安全风险。不幸的是,当前的研究尚未提供有效的基准,可以全面评估VLMs在面对OOD数据时的表现。因此,我们提出了OODBench,这是一种主要自动化的方法,只需少量的人工验证,用于构建新的基准并评估VLMs处理OOD数据的能力。OODBench包含40000个实例级别的OOD实例类别对,我们展示了即使在底层图像类别常见的情况下,当前的VLMs在OODBench上的表现仍然显著下降。此外,我们提出了一种可靠的自动化评估指标,该指标采用从基础到高级的问题递进方式,以更全面地评估不同难度问题受OOD数据的影响。最后,我们总结了重要的发现和见解,以促进未来在获取和评估OOD数据方面的研究。
Summary / 总结
OODBench is proposed to evaluate the performance of Visual-Language Models (VLMs) on out-of-distribution (OOD) data, which is crucial for real-world applications. The method involves an automated approach with minimal human verification and includes 40K instance-level OOD instance-category pairs. Key findings show that current VLMs still experience significant performance degradation on OOD data, even for common image categories. Additionally, a reliable automated assessment metric is introduced to better evaluate the impact of OOD data on questions of varying difficulty.
OODBench 是一个用于评估大型视觉-语言模型(VLMs)在处理分布外(OOD)数据时性能的基准。该方法采用自动化方式并辅以少量的人工验证,包含40K实例级别的OOD实例类别对。主要发现表明,当前的VLMs在OOD数据上表现出显著的性能下降,即使对于常见的图像类别也是如此,并提出了一种新的自动化评估指标,以更好地评估不同难度级别问题受OOD数据影响的程度。
Predict to Skip: Linear Multistep Feature Forecasting for Efficient Diffusion Transformers
Authors: Hanshuai Cui, Zhiqing Tang, Qianli Ma, Zhi Yao, Weijia Jia
First: 2026-02-20T09:33:59+00:00 · Latest: 2026-02-20T09:33:59+00:00
Abstract
Diffusion Transformers (DiT) have emerged as a widely adopted backbone for high-fidelity image and video generation, yet their iterative denoising process incurs high computational costs. Existing training-free acceleration methods rely on feature caching and reuse under the assumption of temporal stability. However, reusing features for multiple steps may lead to latent drift and visual degradation. We observe that model outputs evolve smoothly along much of the diffusion trajectory, enabling principled predictions rather than naive reuse. Based on this insight, we propose PrediT, a training-free acceleration framework that formulates feature prediction as a linear multistep problem. We employ classical linear multistep methods to forecast future model outputs from historical information, combined with a corrector that activates in high-dynamics regions to prevent error accumulation. A dynamic step modulation mechanism adaptively adjusts the prediction horizon by monitoring the feature change rate. Together, these components enable substantial acceleration while preserving generation fidelity. Extensive experiments validate that our method achieves up to 5.54× latency reduction across various DiT-based image and video generation models, while incurring negligible quality degradation.
中文标题/摘要
标题:Predict to Skip: 线性多步特征预测以提高扩散变换器的效率
扩散变换器(DiT)已成为高保真图像和视频生成的广泛采用的基础架构,但其迭代去噪过程会带来高昂的计算成本。现有的无需训练加速方法依赖于特征缓存和重用,前提是时间上的稳定性。然而,在多个步骤中重用特征可能会导致潜在的漂移和视觉退化。我们观察到,模型输出在扩散轨迹的大部分过程中会平滑地演变,这使得可以进行有原则的预测而不是简单的重用。基于这一洞察,我们提出了一个无需训练的加速框架PrediT,将特征预测形式化为线性多步问题。我们使用经典的线性多步方法从历史信息中预测未来的模型输出,并结合一个在高动态区域激活的校正器以防止误差累积。动态步长调节机制通过监测特征变化率来适应性地调整预测范围。这些组件共同实现了显著的加速,同时保持生成保真度。广泛的实验验证了我们的方法在各种基于DiT的图像和视频生成模型中实现了高达5.54×的延迟减少,而几乎不降低质量。
Summary / 总结
The research aims to address the high computational costs of Diffusion Transformers (DiT) by proposing PrediT, a training-free acceleration framework. PrediT formulates feature prediction as a linear multistep problem, using classical linear multistep methods to forecast future model outputs from historical information. A corrector is activated in high-dynamics regions to prevent error accumulation, and a dynamic step modulation mechanism adjusts the prediction horizon. Experiments show that PrediT achieves up to 5.54 times latency reduction with negligible quality degradation across various DiT-based models.
研究针对Diffusion Transformers (DiT)在图像和视频生成中的高计算成本,提出了一种无需训练的加速框架PrediT。PrediT通过线性多步方法基于历史信息预测未来模型输出,并在高动态区域使用校正器防止误差累积,同时通过动态步长调节机制调整预测范围。实验表明,PrediT可以将延迟最多减少5.54倍,同时保持质量基本不变。
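Illustrative sketch (not from the paper): the abstract frames feature prediction as a linear multistep problem and adjusts the prediction horizon by monitoring the feature change rate. The snippet below uses the classical two-step Adams-Bashforth formula with backward differences standing in for derivatives, which is an assumption: the abstract does not say which multistep scheme, order, or change-rate measure PrediT uses. The names `ab2_forecast`, `change_rate`, `should_skip`, and `max_rate` are hypothetical.

```python
def ab2_forecast(history):
    """Forecast the next feature vector from the last three, assuming uniform steps.

    Two-step Adams-Bashforth: y[n+1] = y[n] + 1.5 * d[n] - 0.5 * d[n-1],
    where d[n] = y[n] - y[n-1] is a backward-difference slope estimate.
    """
    oldest, middle, newest = history[-3], history[-2], history[-1]
    return [c + 1.5 * (c - b) - 0.5 * (b - a) for a, b, c in zip(oldest, middle, newest)]


def change_rate(prev, curr):
    """Relative L2 change between two consecutive feature vectors."""
    diff = sum((c - p) ** 2 for p, c in zip(prev, curr)) ** 0.5
    scale = sum(c * c for c in curr) ** 0.5
    return diff / scale if scale > 0 else float("inf")


def should_skip(history, max_rate=0.1):
    """Skip the full model call only with enough history and a slowly changing feature."""
    return len(history) >= 3 and change_rate(history[-2], history[-1]) <= max_rate
```

When `should_skip` returns False, a real pipeline would run the full model and append its true output to the history, which plays the role the abstract assigns to the corrector in high-dynamics regions.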
GIFT: A Framework Towards Global Interpretable Faithful Textual Explanations of Vision Classifiers
Authors: Éloi Zablocki, Valentin Gerard, Amaia Cardiel, Eric Gaussier, Matthieu Cord, Eduardo Valle
First: 2024-11-23T16:52:22+00:00 · Latest: 2026-02-20T09:05:18+00:00
Comments: TMLR 2026 (featured certification)
Abstract
Understanding the decision processes of deep vision models is essential for their safe and trustworthy deployment in real-world settings. Existing explainability approaches, such as saliency maps or concept-based analyses, often suffer from limited faithfulness, local scope, or ambiguous semantics. We introduce GIFT, a post-hoc framework that aims to derive Global, Interpretable, Faithful, and Textual explanations for vision classifiers. GIFT begins by generating a large set of faithful, local visual counterfactuals, then employs vision-language models to translate these counterfactuals into natural-language descriptions of visual changes. These local explanations are aggregated by a large language model into concise, human-readable hypotheses about the model's global decision rules. Crucially, GIFT includes a verification stage that quantitatively assesses the causal effect of each proposed explanation by performing image-based interventions, ensuring that the final textual explanations remain faithful to the model's true reasoning process. Across diverse datasets, including the synthetic CLEVR benchmark, the real-world CelebA faces, and the complex BDD driving scenes, GIFT reveals not only meaningful classification rules but also unexpected biases and latent concepts driving model behavior. Altogether, GIFT bridges the gap between local counterfactual reasoning and global interpretability, offering a principled approach to causally grounded textual explanations for vision models.
中文标题/摘要
标题:GIFT:一种面向全局可解释的视觉分类器文本解释框架
理解深度视觉模型的决策过程对于它们在现实世界环境中的安全和可信部署至关重要。现有的可解释性方法,如显著图或概念分析,往往在忠实性、局部范围或语义模糊性方面存在局限。我们提出了GIFT,一种后验框架,旨在为视觉分类器生成全局、可解释、忠实且文本化的解释。GIFT首先生成大量忠实的局部视觉反事实,然后利用视觉语言模型将这些反事实转化为自然语言描述的视觉变化。这些局部解释由大型语言模型聚合为简洁的人类可读的假设,关于模型的全局决策规则。关键的是,GIFT包括一个验证阶段,通过基于图像的干预定量评估每个提出的解释的因果效应,确保最终的文本解释忠实于模型的真实推理过程。在包括合成CLEVR基准、真实世界的CelebA人脸和复杂的BDD驾驶场景在内的多种数据集上,GIFT不仅揭示了有意义的分类规则,还揭示了驱动模型行为的意外偏见和潜在概念。总体而言,GIFT弥合了局部反事实推理与全局可解释性的差距,提供了一种基于因果关系的视觉模型文本解释的原理性方法。
Summary / 总结
GIFT is a post-hoc framework designed to provide global, interpretable, and faithful textual explanations for vision classifiers. It generates local visual counterfactuals, uses vision-language models to translate them into natural-language descriptions, and aggregates these into concise hypotheses with a large language model. A verification stage measures the causal effect of each proposed explanation through image-based interventions, so the final explanations stay faithful to the model's reasoning. Across CLEVR, CelebA, and BDD, GIFT reveals meaningful classification rules as well as unexpected biases and latent concepts influencing model decisions, bridging the gap between local counterfactual reasoning and global interpretability.
GIFT 是一个框架,旨在为视觉分类器提供全局、可解释且忠实的文本解释。它生成局部视觉反事实,并使用视觉语言模型将其翻译成自然语言描述,然后由大型语言模型将其聚合为简洁的假设。验证阶段确保这些解释忠实于模型的推理过程。GIFT 在各种数据集上揭示了有意义的分类规则和意想不到的偏差,填补了局部和全局可解释性的差距。
Dual-Channel Attention Guidance for Training-Free Image Editing Control in Diffusion Transformers
Authors: Guandong Li, Mengxia Ye
First: 2026-02-20T06:24:20+00:00 · Latest: 2026-02-20T06:24:20+00:00
Abstract
Training-free control over editing intensity is a critical requirement for diffusion-based image editing models built on the Diffusion Transformer (DiT) architecture. Existing attention manipulation methods focus exclusively on the Key space to modulate attention routing, leaving the Value space -- which governs feature aggregation -- entirely unexploited. In this paper, we first reveal that both Key and Value projections in DiT's multi-modal attention layers exhibit a pronounced bias-delta structure, where token embeddings cluster tightly around a layer-specific bias vector. Building on this observation, we propose Dual-Channel Attention Guidance (DCAG), a training-free framework that simultaneously manipulates both the Key channel (controlling where to attend) and the Value channel (controlling what to aggregate). We provide a theoretical analysis showing that the Key channel operates through the nonlinear softmax function, acting as a coarse control knob, while the Value channel operates through linear weighted summation, serving as a fine-grained complement. Together, the two-dimensional parameter space $(δ_k, δ_v)$ enables more precise editing-fidelity trade-offs than any single-channel method. Extensive experiments on the PIE-Bench benchmark (700 images, 10 editing categories) demonstrate that DCAG consistently outperforms Key-only guidance across all fidelity metrics, with the most significant improvements observed in localized editing tasks such as object deletion (4.9% LPIPS reduction) and object addition (3.2% LPIPS reduction).
中文标题/摘要
标题:扩散变换器中的双通道注意力引导无训练图像编辑控制
扩散基图像编辑模型在扩散变换器(DiT)架构上构建时,无训练控制编辑强度是一个关键要求。现有的注意力操作方法仅专注于Key空间来调节注意力路由,而完全忽略了Value空间——它控制特征聚合。在本文中,我们首先揭示了DiT多模态注意力层中的Key和Value投影表现出明显的偏差-增量结构,其中令牌嵌入紧密围绕特定层的偏差向量聚类。基于这一观察,我们提出了双通道注意力引导(DCAG),这是一种无训练框架,可以同时操作Key通道(控制注意力的焦点)和Value通道(控制聚合的内容)。我们提供了理论分析,表明Key通道通过非线性softmax函数操作,作为粗略的控制旋钮,而Value通道通过线性加权求和操作,作为精细的补充。两者结合的二维参数空间$(δ_k, δ_v)$能够比任何单通道方法提供更精确的编辑保真度权衡。在PIE-Bench基准(700张图像,10种编辑类别)上的广泛实验表明,DCAG在所有保真度指标上都优于仅Key引导,特别是在局部编辑任务如对象删除(LPIPS降低4.9%)和对象添加(LPIPS降低3.2%)方面表现最为显著。
Summary / 总结
This paper addresses the need for training-free control over editing intensity in diffusion-based image editing models using the Diffusion Transformer (DiT) architecture. It introduces Dual-Channel Attention Guidance (DCAG), which manipulates both the Key and Value channels to control attention routing and feature aggregation, respectively. Experiments on PIE-Bench show that DCAG outperforms Key-only guidance across all fidelity metrics, with the largest gains in localized editing tasks: LPIPS drops by 4.9% for object deletion and 3.2% for object addition.
本文针对基于扩散变换器(DiT)架构的图像编辑模型中训练-free 控制编辑强度的需求,提出了一种双通道注意引导(DCAG)方法,同时操控 Key 和 Value 通道以控制注意力路由和特征聚合。实验表明,DCAG 在局部编辑任务中表现更优,相比现有方法,对象删除任务的 LPIPS 减少了 4.9%,对象添加任务的 LPIPS 减少了 3.2%。
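The two channels described in the DCAG abstract can be illustrated with a small sketch. The abstract does not give the exact update rule, so the form below is an assumption: each Key or Value vector is split into a shared bias (here the per-layer mean over tokens) plus a token-specific delta, and the delta is rescaled by (1 + δ). The names scale_delta and guided_attention are hypothetical and are not taken from the paper or its code.

```python
import math


def mean_vector(rows):
    """Per-dimension mean over tokens; stands in for the layer-specific bias."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scale_delta(rows, delta):
    """Assumed guidance rule: keep the bias, rescale each token's deviation from it."""
    bias = mean_vector(rows)
    return [[b + (1.0 + delta) * (x - b) for x, b in zip(row, bias)] for row in rows]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(query, keys, values, delta_k=0.0, delta_v=0.0):
    """Single-query attention with separate Key-channel and Value-channel guidance."""
    k = scale_delta(keys, delta_k)    # Key channel: changes where attention goes
    v = scale_delta(values, delta_v)  # Value channel: changes what gets aggregated
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, row)) / scale for row in k]
    weights = softmax(scores)         # nonlinear in delta_k
    # Weighted sum is linear in delta_v for fixed attention weights.
    return [sum(w * row[j] for w, row in zip(weights, v)) for j in range(len(v[0]))]
```

Under this assumed rule, the output moves linearly with delta_v while the attention weights stay fixed, whereas delta_k passes through the softmax, which matches the coarse-versus-fine distinction drawn in the abstract.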
UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models
Authors: Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Xiangnan Wu, Zichen Wen, Bowen Fang, Tao Yu, Zhengbo Zhang, Yingda Li, Kai Wang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang
First: 2026-02-20T06:22:21+00:00 · Latest: 2026-02-20T06:22:21+00:00
Abstract
Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Network (FFN) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, it reinjects key observation information into the next layer's Feed-Forward Network (FFN) through attention retrieval. This mechanism helps VLAs better attend to observations during inference, enabling more confident and faithful action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
中文标题/摘要
标题:UAOR:面向视觉-语言-动作模型的不确定性感知观测重注入
视觉-语言-动作(VLA)模型利用预训练的视觉-语言模型(VLM)作为骨干,将图像和指令映射到动作,展示了广泛可移植的机器人操作潜力。为了提高性能,现有方法通常会引入额外的观测线索(例如,深度图、点云)或辅助模块(例如,物体检测器、编码器),以实现更精确和可靠的任务执行,但这些通常需要昂贵的数据收集和额外的训练。受语言模型中的前馈网络(FFN)可以作为“键值记忆”的发现启发,我们提出了一种有效的、无需训练且即插即用的模块——不确定性感知观测重注入(UAOR),用于VLA模型。具体而言,当当前的语言模型层表现出高不确定性,通过动作熵衡量时,它会通过注意力检索将关键观测信息重新注入到下一层的前馈网络(FFN)中。这种机制有助于VLA在推理过程中更好地关注观测信息,从而实现更自信和忠实的动作生成。全面的实验表明,我们的方法在模拟和真实世界任务中能够一致地改进各种VLA模型,且具有最小的开销。值得注意的是,UAOR消除了对额外观测线索或模块的需求,使其成为现有VLA管道的多功能且实用的插件。项目页面位于https://uaor.jiabingyang.cn/
Summary / 总结
The research aims to enhance the performance of Vision-Language-Action (VLA) models by proposing an Uncertainty-aware Observation Reinjection (UAOR) module, which improves action generation by reinjecting key observation information into the next layer's Feed-Forward Network (FFN) when the current layer exhibits high uncertainty. Experiments demonstrate that UAOR consistently improves various VLA models in both simulation and real-world tasks with minimal overhead and without requiring additional observation cues or modules.
研究旨在通过提出一种不确定性感知观测重注入(UAOR)模块来提升视觉-语言-动作(VLA)模型的性能。该模块在当前层表现出高不确定性时,会将关键观测信息注入到下一层的前馈网络中,而无需额外的训练。实验表明,UAOR能够以最小的开销提升各种VLA模型在模拟和真实世界任务中的表现,并且不需要额外的观测线索或模块。
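The gating idea in the UAOR abstract (reinject observation features only when action entropy is high) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the entropy threshold, the additive update, and the function names are all hypothetical, and the hidden state and observation features are plain Python lists rather than model tensors.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of an action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def retrieve_observation(hidden, obs_feats):
    """Attention retrieval: the hidden state queries the observation features."""
    scale = math.sqrt(len(hidden))
    weights = softmax(
        [sum(h * f for h, f in zip(hidden, feat)) / scale for feat in obs_feats]
    )
    return [sum(w * feat[j] for w, feat in zip(weights, obs_feats))
            for j in range(len(hidden))]


def maybe_reinject(hidden, action_probs, obs_feats, threshold=1.0):
    """Return the input to the next layer's FFN.

    Assumption: when entropy exceeds `threshold`, the retrieved observation
    vector is simply added to the hidden state; otherwise nothing changes.
    """
    if entropy(action_probs) <= threshold:
        return list(hidden)
    retrieved = retrieve_observation(hidden, obs_feats)
    return [h + r for h, r in zip(hidden, retrieved)]
```

With a confident distribution such as [0.97, 0.01, 0.01, 0.01] the hidden state passes through unchanged, while a uniform distribution over four actions (entropy ln 4) triggers the reinjection.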
Mod-Adapter: Tuning-Free and Versatile Multi-concept Personalization via Modulation Adapter
Authors: Weizhi Zhong, Huan Yang, Zheng Liu, Huiguo He, Zijian He, Xuesong Niu, Di Zhang, Guanbin Li
Venue: ICLR 2026
First: 2025-05-24T09:21:32+00:00 · Latest: 2026-02-20T05:08:23+00:00
Comments: Accepted by ICLR 2026, project page: https://weizhi-zhong.github.io/Mod-Adapter
Abstract
Personalized text-to-image generation aims to synthesize images of user-provided concepts in diverse contexts. Despite recent progress in multi-concept personalization, most are limited to object concepts and struggle to customize abstract concepts (e.g., pose, lighting). Some methods have begun exploring multi-concept personalization supporting abstract concepts, but they require test-time fine-tuning for each new concept, which is time-consuming and prone to overfitting on limited training images. In this work, we propose a novel tuning-free method for multi-concept personalization that can effectively customize both object and abstract concepts without test-time fine-tuning. Our method builds upon the modulation mechanism in pre-trained Diffusion Transformers (DiTs) model, leveraging the localized and semantically meaningful properties of the modulation space. Specifically, we propose a novel module, Mod-Adapter, to predict concept-specific modulation direction for the modulation process of concept-related text tokens. It introduces vision-language cross-attention for extracting concept visual features, and Mixture-of-Experts (MoE) layers that adaptively map the concept features into the modulation space. Furthermore, to mitigate the training difficulty caused by the large gap between the concept image space and the modulation space, we introduce a VLM-guided pre-training strategy that leverages the strong image understanding capabilities of vision-language models to provide semantic supervision signals. For a comprehensive comparison, we extend a standard benchmark by incorporating abstract concepts. Our method achieves state-of-the-art performance in multi-concept personalization, supported by quantitative, qualitative, and human evaluations.
中文标题/摘要
标题:Mod-Adapter: 无需微调的多功能多概念个性化方法
个性化文本到图像生成旨在根据用户提供的概念在多种上下文中合成图像。尽管在多概念个性化方面取得了进展,但大多数方法仅限于对象概念,并难以定制抽象概念(例如姿态、照明)。一些方法开始探索支持抽象概念的多概念个性化,但它们需要在测试时为每个新概念进行微调,这既耗时又容易过拟合。在本文中,我们提出了一种新的无需微调的多概念个性化方法,该方法可以在不进行测试时微调的情况下有效定制对象和抽象概念。我们的方法基于预训练扩散变换器(DiTs)模型中的调制机制,利用调制空间中的局部和语义有意义的属性。具体而言,我们提出了一种新的模块Mod-Adapter,用于预测与概念相关的文本标记的调制过程的概念特定调制方向。它引入了视觉-语言交叉注意力以提取概念视觉特征,并使用Mixture-of-Experts(MoE)层将概念特征自适应地映射到调制空间。此外,为了缓解由于概念图像空间和调制空间之间的巨大差距而导致的训练难度,我们引入了一种基于视觉-语言模型的强大图像理解能力的预训练策略,以提供语义监督信号。为了进行全面比较,我们扩展了一个标准基准,加入了抽象概念。我们的方法在多概念个性化方面达到了最先进的性能,得到了定量、定性和人类评估的支持。
Summary / 总结
This work introduces Mod-Adapter, a tuning-free method for multi-concept personalization in text-to-image generation. It builds on the modulation mechanism in pre-trained Diffusion Transformers, using a Mod-Adapter module to predict concept-specific modulation directions for concept-related text tokens. The module uses vision-language cross-attention to extract concept visual features and Mixture-of-Experts layers to map them into the modulation space. A VLM-guided pre-training strategy provides semantic supervision. Mod-Adapter achieves state-of-the-art multi-concept personalization for both object and abstract concepts, as evidenced by quantitative, qualitative, and human evaluations.
研究旨在开发一种无需调优的多概念个性化方法,能够在文本到图像生成中同时定制物体和抽象概念,无需在测试时进行微调。该方法Mod-Adapter引入了一个新的模块,使用视觉-语言交叉注意力和Mixture-of-Experts层来预测概念特定的调制方向。结合基于视觉-语言模型的预训练策略,这种方法能够有效定制概念。实验结果表明,Mod-Adapter在多概念个性化方面优于现有方法,得到了定量、定性和人类评估的支持。
VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration
Authors: Jaeyoon Jung, Yejun Yoon, Kunwoo Park
First: 2026-02-04T14:12:55+00:00 · Latest: 2026-02-20T03:51:49+00:00
Comments: A system description paper for the AVerImaTeC shared task at the Ninth FEVER Workshop (co-located with EACL 2026)
Abstract
This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.
中文标题/摘要
标题:VILLAIN在AVerImaTeC:通过多智能体协作验证图像-文本声明
本文描述了VILLAIN,一种通过基于提示的多智能体协作来验证图像-文本声明的多模态事实核查系统。在AVerImaTeC共享任务中,VILLAIN在事实核查的多个阶段使用了视觉-语言模型智能体。文本和视觉证据从通过额外网络收集丰富后的知识库中检索。为了识别关键信息并解决证据项之间的不一致,模态特定和跨模态智能体生成分析报告。在后续阶段,基于这些报告生成问题-答案对。最后,判决预测智能体根据图像-文本声明和生成的问题-答案对生成验证结果。我们的系统在所有评估指标中均排名第一。源代码可在https://github.com/ssu-humane/VILLAIN公开获取。
Summary / 总结
VILLAIN is a multimodal fact-checking system that verifies image-text claims using prompt-based multi-agent collaboration. It retrieves textual and visual evidence from a knowledge store and generates analysis reports to identify key information and address inconsistencies. The system then produces question-answer pairs and a final verification outcome. VILLAIN ranked first on the leaderboard across all evaluation metrics in the AVerImaTeC shared task at the Ninth FEVER Workshop (co-located with EACL 2026).
VILLAIN 是一个多模态事实核查系统,通过基于提示的多智能体协作验证图像-文本声明。它从知识库中检索文本和视觉证据,并使用模态特定和跨模态智能体生成分析报告。这些报告用于生成问题-答案对,然后由判决预测智能体根据图像-文本声明和生成的问题-答案对来确定验证结果。该系统在第九届FEVER研讨会(与EACL 2026联合举办)的AVerImaTeC共享任务中,在所有评估指标上排名第一。
ROCKET: Residual-Oriented Multi-Layer Alignment for Spatially-Aware Vision-Language-Action Models
Authors: Guoheng Sun, Tingting Du, Kaixi Feng, Chenxiang Luo, Xingguo Ding, Zheyu Shen, Ziyao Wang, Yexiao He, Ang Li
First: 2026-02-20T03:06:22+00:00 · Latest: 2026-02-20T03:06:22+00:00
Abstract
Vision-Language-Action (VLA) models enable instruction-following robotic manipulation, but they are typically pretrained on 2D data and lack 3D spatial understanding. An effective approach is representation alignment, where a strong vision foundation model is used to guide a 2D VLA model. However, existing methods usually apply supervision at only a single layer, failing to fully exploit the rich information distributed across depth; meanwhile, naïve multi-layer alignment can cause gradient interference. We introduce ROCKET, a residual-oriented multi-layer representation alignment framework that formulates multi-layer alignment as aligning one residual stream to another. Concretely, ROCKET employs a shared projector to align multiple layers of the VLA backbone with multiple layers of a powerful 3D vision foundation model via a layer-invariant mapping, which reduces gradient conflicts. We provide both theoretical justification and empirical analyses showing that a shared projector is sufficient and outperforms prior designs, and further propose a Matryoshka-style sparse activation scheme for the shared projector to balance multiple alignment losses. Our experiments show that, combined with a training-free layer selection strategy, ROCKET requires only about 4% of the compute budget while achieving 98.5% state-of-the-art success rate on LIBERO. We further demonstrate the superior performance of ROCKET across LIBERO-Plus and RoboTwin, as well as multiple VLA models. The code and model weights can be found at https://github.com/CASE-Lab-UMD/ROCKET-VLA.
中文标题/摘要
标题:ROCKET:面向空间感知的残差导向多层对齐的视觉-语言-动作模型
视觉-语言-动作(VLA)模型能够实现指令跟随的机器人操作,但它们通常是在2D数据上进行预训练,缺乏3D空间理解。一种有效的方法是表示对齐,其中使用一个强大的视觉基础模型来引导2D VLA模型。然而,现有方法通常只在单一层上应用监督,未能充分利用分布在深度中的丰富信息;同时,简单的多层对齐会导致梯度干扰。我们提出了ROCKET,一种面向残差的多层表示对齐框架,将多层对齐形式化为对齐一个残差流到另一个残差流。具体而言,ROCKET 使用一个共享的投影器,通过层不变映射将VLA主干的多层与一个强大的3D视觉基础模型的多层对齐,从而减少梯度冲突。我们提供了理论依据和实证分析,表明共享的投影器是足够的,并优于先前的设计,进一步提出了一种马特罗什卡风格的稀疏激活方案,用于共享投影器以平衡多个对齐损失。我们的实验表明,结合无训练层选择策略,ROCKET 只需要约4%的计算预算,而在LIBERO上实现了98.5%的最新技术水平成功率。我们还展示了ROCKET 在LIBERO-Plus、RoboTwin 以及多个VLA模型上的优越性能。代码和模型权重可以在 https://github.com/CASE-Lab-UMD/ROCKET-VLA 中找到。
Summary / 总结
ROCKET is a residual-oriented multi-layer representation alignment framework designed to enhance 3D spatial understanding in Vision-Language-Action models. It uses a shared projector to align multiple layers of the VLA backbone with multiple layers of a 3D vision foundation model, reducing gradient conflicts. Combined with a training-free layer selection strategy, ROCKET reaches a state-of-the-art 98.5% success rate on LIBERO while using only about 4% of the compute budget. It also outperforms previous methods on LIBERO-Plus and RoboTwin across different VLA models.
ROCKET 是一种残差导向的多层表示对齐框架,旨在增强 Vision-Language-Action 模型的 3D 空间理解能力。它使用共享投影器将 VLA 主干的多层与 3D 视觉基础模型对齐,减少梯度冲突。实验表明,ROCKET 在 LIBERO 上实现了 98.5% 的成功率,仅需 4% 的计算预算,并且在各种 VLA 模型和数据集上表现出优越性能。
ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks
Authors: Ahmad ALBarqawi, Mahmoud Nazzal, Issa Khalil, Abdallah Khreishah, NhatHai Phan
First: 2025-07-24T02:04:58+00:00 · Latest: 2026-02-20T01:29:18+00:00
Abstract
The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) Text explanations within a Graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data, as it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through the use of multi-level feature extraction across spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy to detect sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when it detects user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, and reflects the model's superior ability to generalize to unseen, fine-tuned variations of stable diffusion models. As for robustness, ViGText achieves an increase of 11.1% in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
中文标题/摘要
标题:ViGText:结合视觉语言模型解释和图神经网络的深度假信息检测
随着深度假信息技术的迅速发展,这种能够生成逼真但虚假的数字内容的技术威胁到了媒体的真实性。传统的深度假信息检测方法往往难以应对高超且定制化的深度假信息,特别是在泛化能力和对抗恶意攻击的鲁棒性方面。本文介绍了一种名为ViGText的新方法,该方法在基于图的框架中结合了图像与视觉大型语言模型(VLLM)文本解释,以提高深度假信息检测能力。ViGText的创新之处在于它将详细的解释与视觉数据相结合,提供了比字幕更具有上下文感知的分析,因为字幕往往缺乏具体性且无法揭示细微的不一致。ViGText系统地将图像划分为块,构建图像和文本图,并使用图神经网络(GNNs)进行分析以识别深度假信息。通过在空间和频域中进行多级特征提取,ViGText捕捉到的细节增强了其对复杂深度假信息的鲁棒性和准确性。广泛的实验表明,ViGText在泛化能力上显著提升,并在检测用户定制的深度假信息时实现了显著的性能提升。具体来说,在泛化评估中,平均F1分数从72.45%提高到98.32%,反映了模型在未见过的稳定扩散模型微调变体上的优越泛化能力。在鲁棒性方面,ViGText的召回率比其他深度假信息检测方法提高了11.1%。面对利用其基于图的架构的针对性攻击时,ViGText将分类性能下降限制在不到4%。ViGText通过详细的视觉和文本分析,为检测深度假信息设立了新标准,有助于确保媒体的真实性与信息的完整性。
Summary / 总结
ViGText is a deepfake detection approach that integrates images with Vision Large Language Model (VLLM) text explanations in a graph-based framework analyzed with Graph Neural Networks (GNNs). This method improves the detection of sophisticated deepfakes by providing context-aware analysis and capturing subtle inconsistencies. Experiments show that ViGText significantly enhances generalization and robustness: average F1 rises from 72.45% to 98.32% under generalization evaluation, and recall increases by 11.1% compared to other approaches.
ViGText 是一种结合了视觉大型语言模型文本解释和图神经网络的新颖深伪检测方法。它通过提供更具体的上下文分析和捕捉细微不一致来改进传统方法。实验表明,ViGText 显著提高了泛化能力和鲁棒性,F1 分数从 72.45% 提高到 98.32%,召回率提高了 11.1%,超过了其他方法。
Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
Authors: Dhruba Ghosh, Yuhui Zhang, Ludwig Schmidt
First: 2026-02-19T22:07:29+00:00 · Latest: 2026-02-19T22:07:29+00:00
Abstract
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue. These improvements are evident in a wide range of VLMs built on a variety of base models, alignment architectures, and training data. However, recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge. We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks. Through a series of ablation experiments, we find that using a better LLM improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. Furthermore, we find that the pretraining stage is also vital to fine-grained performance, particularly when the language model weights are unfrozen during pretraining. These insights pave the way for enhancing fine-grained visual understanding and vision-centric capabilities in VLMs.
中文标题/摘要
标题:理解视觉语言模型的细粒度知识能力
视觉语言模型(VLMs)在视觉问答基准测试、视觉推理、文档理解和多模态对话等多个领域取得了显著进展。这些改进体现在多种基于不同基础模型、对齐架构和训练数据的VLMs中。然而,最近的研究表明,这些模型在传统的图像分类基准测试中落后,这些测试考察的是细粒度的视觉知识。我们测试了大量近期的VLMs在细粒度分类基准测试上,并识别出连接细粒度知识和其他视觉基准之间差距的潜在因素。通过一系列消融实验,我们发现使用更好的语言模型(LLM)可以同等提高所有基准测试的分数,而更好的视觉编码器则不成比例地提高了细粒度分类性能。此外,我们发现预训练阶段对细粒度性能也至关重要,尤其是在预训练期间解冻语言模型权重时。这些见解为增强VLMs的细粒度视觉理解和视觉中心能力铺平了道路。
Summary / 总结
The study examines the fine-grained knowledge capabilities of vision-language models (VLMs) by testing them on fine-grained classification benchmarks. Through ablation experiments, it finds that a better language model improves all benchmark scores equally, while a better vision encoder disproportionately improves fine-grained classification performance. The pretraining stage, especially when language model weights are unfrozen, is also vital for fine-grained performance. These findings point to ways of improving fine-grained visual understanding in VLMs.
研究旨在通过在细粒度分类基准上测试视觉语言模型(VLMs)来理解其细粒度知识能力。通过消融实验发现,更好的语言模型能同等提升所有基准分数,而更好的视觉编码器则特别增强了细粒度分类性能。预训练阶段,尤其是语言模型权重在预训练期间被解冻时,对细粒度性能至关重要。这些发现指导了VLMs中细粒度视觉理解的改进。
Enabling Training-Free Text-Based Remote Sensing Segmentation
Authors: Jose Sosa, Danila Rukhovich, Anis Kacem, Djamila Aouada
First: 2026-02-19T20:05:56+00:00 · Latest: 2026-02-19T20:05:56+00:00
Abstract
Recent advances in Vision Language Models (VLMs) and Vision Foundation Models (VFMs) have opened new opportunities for zero-shot text-guided segmentation of remote sensing imagery. However, most existing approaches still rely on additional trainable components, limiting their generalisation and practical applicability. In this work, we investigate to what extent text-based remote sensing segmentation can be achieved without additional training, by relying solely on existing foundation models. We propose a simple yet effective approach that integrates contrastive and generative VLMs with the Segment Anything Model (SAM), enabling a fully training-free or lightweight LoRA-tuned pipeline. Our contrastive approach employs CLIP as mask selector for SAM's grid-based proposals, achieving state-of-the-art open-vocabulary semantic segmentation (OVSS) in a completely zero-shot setting. In parallel, our generative approach enables reasoning and referring segmentation by generating click prompts for SAM using GPT-5 in a zero-shot setting and a LoRA-tuned Qwen-VL model, with the latter yielding the best results. Extensive experiments across 19 remote sensing benchmarks, including open-vocabulary, referring, and reasoning-based tasks, demonstrate the strong capabilities of our approach. Code will be released at https://github.com/josesosajs/trainfree-rs-segmentation.
中文标题/摘要
标题:无需训练的基于文本的遥感分割
近期视觉语言模型(VLMs)和视觉基础模型(VFMs)的进步为遥感图像的零样本文本引导分割提供了新的机会。然而,大多数现有方法仍然依赖额外的可训练组件,限制了它们的泛化能力和实际应用。在本文中,我们研究了在不依赖额外训练的情况下,基于文本的遥感分割能达到何种程度,仅依靠现有的基础模型。我们提出了一种简单而有效的方法,将对比生成的VLMs与Segment Anything Model (SAM) 结合,实现完全无需训练或轻量级LoRA调优的管道。我们的对比方法使用CLIP作为SAM网格提案的掩码选择器,在完全零样本设置下实现了最先进的开放词汇语义分割(OVSS)。同时,我们的生成方法通过使用GPT-5生成点击提示和LoRA调优的Qwen-VL模型在零样本设置下实现推理和引用分割,后者取得了最佳结果。广泛的实验涵盖了19个遥感基准,包括开放词汇、引用和基于推理的任务,展示了我们方法的强大能力。代码将在https://github.com/josesosajs/trainfree-rs-segmentation上发布。
Summary / 总结
This work explores text-based remote sensing segmentation without additional training, leveraging existing foundation models. It proposes a method combining contrastive and generative Vision Language Models with the Segment Anything Model (SAM). The contrastive approach uses CLIP to select among SAM's grid-based mask proposals, achieving state-of-the-art open-vocabulary semantic segmentation in a zero-shot setting. The generative approach produces click prompts for SAM using either zero-shot GPT-5 or a LoRA-tuned Qwen-VL model, with the LoRA-tuned Qwen-VL yielding the best results. Experiments across 19 remote sensing benchmarks validate the method on open-vocabulary, referring, and reasoning-based tasks.
该研究探索了无需额外训练即可实现基于文本的遥感分割,利用现有的基础模型。提出了一种结合对比和生成视觉语言模型与Segment Anything Model的方法,实现了零样本设置下的最佳开放词汇语义分割。对比方法使用CLIP进行掩码选择,而生成方法使用GPT-5和LoRA调优的Qwen-VL模型生成点击提示,后者表现最佳。在19个遥感基准测试中的实验验证了该方法在各种任务中的有效性。
CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild
Authors: Balamurugan Thambiraja, Omid Taheri, Radek Danecek, Giorgio Becherini, Gerard Pons-Moll, Justus Thies
First: 2026-02-19T19:02:22+00:00 · Latest: 2026-02-19T19:02:22+00:00
Comments: ICLR2026; Project page: https://balamuruganthambiraja.github.io/CLUTCH/
Abstract
Hands play a central role in daily life, yet modeling natural hand motions remains underexplored. Existing methods that tackle text-to-hand-motion generation or hand animation captioning rely on studio-captured datasets with limited actions and contexts, making them costly to scale to "in-the-wild" settings. Further, contemporary models and their training schemes struggle to capture animation fidelity with text-motion alignment. To address this, we (1) introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text, and (2) propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. To build 3D-HIW, we propose a data annotation pipeline that combines vision-language models (VLMs) and state-of-the-art 3D hand trackers, and apply it to a large corpus of egocentric action videos covering a wide range of scenarios. To fully capture motion in-the-wild, CLUTCH employs SHIFT, a part-modality decomposed VQ-VAE, which improves generalization and reconstruction fidelity. Finally, to improve animation quality, we introduce a geometric refinement stage, where CLUTCH is co-supervised with a reconstruction loss applied directly to decoded hand motion parameters. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling. Code, data and models will be released.
中文标题/摘要
标题:CLUTCH:用于解锁自然文本条件手部动作建模的上下文化语言模型
手在日常生活中扮演着重要角色,但自然手部动作建模仍处于探索阶段。现有方法在处理文本到手部动作生成或手部动画描述时依赖于有限动作和上下文的录音棚捕捉数据集,这使得它们难以扩展到“真实世界”环境。此外,当前模型及其训练方案难以捕捉文本与动作的对齐精度。为解决这一问题,我们(1)引入了“野外3D手部”(3D-HIW)数据集,包含32000个3D手部动作序列及其对齐的文本,(2)提出了一种基于LLM的手部动画系统CLUTCH,该系统包含两项关键创新:(a)一种新颖的VQ-VAE架构SHIFT,用于标记手部动作,(b)几何细化阶段以微调LLM。为了构建3D-HIW,我们提出了一种数据注释流水线,结合了视觉语言模型(VLMs)和最先进的3D手部追踪器,并将其应用于大量第一人称视角动作视频,涵盖多种场景。为了全面捕捉“真实世界”中的动作,CLUTCH采用了SHIFT,这是一种部分模态分解的VQ-VAE,提高了泛化能力和重建精度。最后,为了提高动画质量,我们引入了一个几何细化阶段,其中CLUTCH与直接应用于解码手部动作参数的重构损失共同监督。实验表明,CLUTCH在文本到动作和动作到文本任务上达到了最先进的性能,建立了首个可扩展的“真实世界”手部动作建模基准。代码、数据和模型将被发布。
When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To systematically study it, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures of 9.4% and improves task success by 17.2% on average.
中文标题/摘要
标题:视觉优先于语言:评估和缓解VLAs中的反事实失败
视觉-语言-行动模型(VLAs)承诺将语言指令应用于机器人控制,但在实践中往往未能忠实执行语言指令。当面对缺乏强烈场景特定监督的指令时,VLAs会遭受反事实失败:它们基于由数据集偏差引起的视觉捷径行动,反复执行已学得的行为,并选择在训练期间频繁出现的对象,而不管语言意图如何。为了系统地研究这一问题,我们引入了LIBERO-CF,这是第一个用于VLAs的反事实基准,通过在视觉上合理的LIBERO布局下分配替代指令来评估语言跟随能力。我们的评估表明,反事实失败在最先进的VLAs中普遍存在但尚未得到充分探索。我们提出了反事实行动指导(CAG),这是一种简单而有效的双分支推理方案,明确地在VLAs中正则化语言条件。CAG结合了一个标准的VLA策略和一个未受语言条件的视觉-行动(VA)模块,在行动选择期间进行反事实比较。这种设计减少了对视觉捷径的依赖,提高了对未观察任务的鲁棒性,并且不需要额外的演示或对现有架构或预训练模型进行修改。广泛的实验表明,它可以在各种VLAs中实现即插即用集成,并且具有一致的改进。例如,在LIBERO-CF中,CAG在语言跟随准确性上提高了9.7%,在未观察任务上的任务成功率提高了3.6%,使用无训练策略,配以VA模型时,进一步提高了15.5%和8.5%。在实际应用中,CAG将反事实失败减少了9.4%,并将任务成功率平均提高了17.2%。
Summary / 总结
The paper addresses counterfactual failures in Vision-Language-Action models (VLAs), where models act on vision shortcuts induced by dataset biases rather than on language instructions. It introduces LIBERO-CF, a benchmark that evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. The authors propose Counterfactual Action Guidance (CAG), a dual-branch inference scheme that combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module. CAG reduces counterfactual failures and improves language-following accuracy and task success on under-observed tasks, without requiring additional demonstrations or modifications to existing architectures or pretrained models.
论文探讨了Vision-Language-Action模型(VLAs)中的反事实失败问题,即模型基于视觉偏差而非语言指令行动。作者引入了LIBERO-CF基准,通过提供在视觉上合理的替代指令来评估语言跟随能力。作者提出了一种名为Counterfactual Action Guidance(CAG)的双分支推理方案,该方案提高了鲁棒性并减少了反事实失败,实现了在未观察任务上的语言跟随准确性和任务成功率的显著提升,且无需额外训练或修改现有模型架构。
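The dual-branch comparison behind CAG can be sketched for a discrete action set. The abstract does not state the combination rule, so the form below is an assumption modeled on classifier-free guidance: the language-conditioned log-probability is pushed away from the language-unconditioned one by a guidance weight. The function names and the weight parameter are hypothetical.

```python
import math


def guided_scores(vla_probs, va_probs, weight=1.0):
    """Score each action by contrasting the two branches.

    vla_probs: action probabilities from the language-conditioned VLA policy.
    va_probs:  action probabilities from the language-unconditioned VA branch.
    All probabilities are assumed to be strictly positive.
    Assumed rule: score = log p_vla + weight * (log p_vla - log p_va).
    """
    scores = []
    for p_lang, p_vis in zip(vla_probs, va_probs):
        log_lang = math.log(p_lang)
        log_vis = math.log(p_vis)
        scores.append(log_lang + weight * (log_lang - log_vis))
    return scores


def select_action(vla_probs, va_probs, weight=1.0):
    """Pick the index of the highest-scoring action."""
    scores = guided_scores(vla_probs, va_probs, weight)
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, with vla_probs = [0.5, 0.45, 0.05] and va_probs = [0.7, 0.2, 0.1], a weight of 0 keeps the visually favored action 0, while a weight of 1 selects action 1, whose probability rises most once language is taken into account.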
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
First: 2026-02-19T18:54:32+00:00 · Latest: 2026-02-19T18:54:32+00:00
Comments: Code at: https://github.com/vila-lab/M-Attack-V2
Abstract
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
中文标题/摘要
标题:通过细粒度细节目标推动黑盒LVLM攻击前沿
大型视觉-语言模型(LVLMs)的黑盒对抗攻击由于缺乏梯度和复杂的多模态边界而具有挑战性。尽管先前的基于转移的方法,如M-Attack,通过源和目标图像的局部切片级匹配表现良好,但我们发现这会导致梯度在迭代中高度变化且几乎正交,违反了局部一致对齐并破坏了优化。我们将其归因于(i)ViT翻译敏感性导致尖峰梯度和(ii)源和目标切片之间的结构不对称性。我们将局部匹配重新表述为源变换和目标语义的不对称期望,并构建了M-Attack的梯度去噪升级版。在源侧,多切片对齐(MCA)在每次迭代中从多个独立采样的局部视图中平均梯度以减少方差。在目标侧,辅助目标对齐(ATA)用来自语义相关分布的小辅助集替换激进的目标增强,产生更平滑、方差更低的目标流形。我们进一步将动量重新解释为块动量,回放历史切片梯度;结合精细的块大小集合(PE+),这加强了可转移的方向。这些模块一起形成了M-Attack-V2,这是一个简单的模块化增强,显著提高了前沿LVLM的基于转移的黑盒攻击成功率:将Claude-4.0的成功率从8%提升到30%,Gemini-2.5-Pro从83%提升到97%,GPT-5从98%提升到100%,超越了先前的黑盒LVLM攻击。代码和数据可在:https://github.com/vila-lab/M-Attack-V2公开获取。
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
中文标题/摘要
标题:IntRec:基于意图的对比精炼检索
从复杂场景中检索用户指定的对象仍然是一个具有挑战性的任务,尤其是在查询模糊或涉及多个相似对象的情况下。现有的开放式词汇检测器以单次操作的方式工作,缺乏根据用户反馈精炼预测的能力。为了解决这个问题,我们提出了IntRec,这是一种基于用户反馈进行预测精炼的交互式对象检索框架。其核心是一个意图状态(IS),它维护了正锚点(确认的线索)和负约束(被拒绝的假设)的双重记忆集。对比对齐函数通过最大化与正线索的相似性并惩罚被拒绝的对象来对候选对象进行排名,从而在杂乱的场景中实现细粒度的歧义消解。我们的交互式框架在不增加额外监督的情况下显著提高了检索准确性。在LVIS上,IntRec达到了35.4 AP,分别比OVMR、CoDet和CAKE高出+2.3、+3.7和+0.5。在具有挑战性的LVIS-模糊基准上,它在单次纠正反馈后提高了7.9 AP的性能,每次交互的额外延迟少于30毫秒。
Summary / 总结
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. It uses an Intent State that maintains positive anchors and negative constraints, and a contrastive alignment function that ranks candidate objects. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP, respectively. On LVIS-Ambiguous, a single round of corrective feedback improves on its one-shot baseline by +7.9 AP, with less than 30 ms of added latency per interaction.
IntRec 是一种基于用户反馈的交互式物体检索框架,通过维护正锚点和负约束的意图状态来细化预测,解决复杂场景中模糊查询的挑战。在 LVIS 数据集上,IntRec 的 AP 达到 35.4,优于 OVMR、CoDet 和 CAKE 等现有方法。在 LVIS-Ambiguous 难题基准测试中,使用一次反馈后性能提高了 7.9 AP,每次交互的额外延迟不到 30 毫秒。
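The ranking idea in the IntRec abstract (reward similarity to confirmed cues, penalize similarity to rejected hypotheses) can be illustrated with a small sketch. This is not the authors' alignment function: the cosine similarity, the max-over-memory aggregation, the `penalty` weight, and all function names are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def intent_score(candidate, positives, negatives, penalty=1.0):
    """Hypothetical score: best match to a positive anchor minus a penalty
    for the best match to a rejected hypothesis."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg

def rank_candidates(candidates, positives, negatives, penalty=1.0):
    """Return candidate names sorted from most to least compatible with the intent state."""
    return sorted(
        candidates,
        key=lambda name: intent_score(candidates[name], positives, negatives, penalty),
        reverse=True,
    )
```

Under these assumptions, each round of feedback appends one embedding to `positives` or `negatives` and re-ranks the same candidates, so no detector retraining is involved.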
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks arriving incrementally need to retrain the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
中文标题/摘要
标题:具有灾难性遗忘鲁棒性的一次性增量联邦学习
现代大数据系统生成大量异构且地理上分散的流数据,规模庞大且隐私敏感,使得集中化变得困难。虽然联邦学习(FL)提供了一种增强隐私的训练机制,但它假设静态数据流,并在多轮中学习协作模型,这使得在通信受限场景中处理增量数据的学习变得具有挑战性。本文提出了一次性增量联邦学习(OSI-FL),这是第一个解决通信开销和灾难性遗忘双重挑战的FL框架。OSI-FL通过在单个通信轮次中由每个客户端的冻结视觉-语言模型(VLM)生成类别特定嵌入,然后由服务器端预训练的扩散模型合成与客户端数据分布相似的新数据样本,在服务器端进行训练。然而,仍存在两个挑战:i) 任务以增量方式到达需要重新训练全局模型,ii) 随着未来任务的到达,重新训练模型会导致灾难性遗忘。为此,我们通过选择性样本保留(SSR)增强训练,根据样本损失识别并保留每个类别和任务对的最具有信息性的前p个样本。SSR通过确保代表性保留样本在后续迭代中被纳入训练来限制遗忘。实验结果表明,OSI-FL在三个基准数据集上的类增量和域增量场景中均优于传统和一次性FL方法的基线。
Summary / 总结
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), in which each client communicates category-specific embeddings in a single round and a pre-trained diffusion model at the server uses them to synthesize new training data. The method includes Selective Sample Retention (SSR), which mitigates catastrophic forgetting by retaining the top-p most informative samples per category and task pair, selected by sample loss. Experimental results show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
本文解决了增量数据下联邦学习中的通信开销和灾难性遗忘问题。提出了One-Shot增量联邦学习(OSI-FL),该方法从客户端向服务器传输类别特定的嵌入,服务器据此合成新数据进行训练。为了缓解灾难性遗忘,作者提出了选择性样本保留(SSR)方法,该方法根据样本损失保留每个类别和任务对中最信息性的样本。实验结果表明,OSI-FL在三个基准数据集上的类增量和域增量场景中均优于传统和单次联邦学习方法。
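The retention step described for SSR (keep the top-p samples per category and task pair, selected by sample loss) can be sketched as follows. The abstract does not say whether p is a count or a fraction, nor whether "most informative" means highest or lowest loss; this sketch assumes p is a count and that higher loss is more informative. The function name and the sample dictionary layout are hypothetical.

```python
from collections import defaultdict

def select_retained(samples, p):
    """Group samples by (task, category) and keep the p highest-loss samples per group.

    Each sample is a dict with keys "task", "category", and "loss" (assumed layout).
    """
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["task"], sample["category"])].append(sample)

    retained = {}
    for key, group in groups.items():
        # Assumption: a higher loss marks a more informative sample.
        by_loss = sorted(group, key=lambda s: s["loss"], reverse=True)
        retained[key] = by_loss[:p]
    return retained
```

The retained samples would then be mixed into the training data for later tasks, which is how the abstract describes SSR bounding forgetting.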
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme · Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn · NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
中文标题/摘要
标题:ReplaceMe:通过深度剪枝和Transformer块线性化简化网络
我们引入了ReplaceMe,这是一种通用的无需训练的深度剪枝方法,能够有效将Transformer块替换为线性操作,同时在低压缩比下保持高性能。与需要额外训练或微调的传统剪枝方法不同,我们的方法仅需一个小型校准数据集来估计线性变换,该变换近似于剪枝后的块。估计出的线性映射可以无缝地与剩余的Transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe在所有无需训练的方法中表现最佳,并且在涉及大量重新训练/微调和架构修改的最新剪枝方法中保持了高度竞争力。应用于多个大型语言模型(LLMs),ReplaceMe在开放基准测试中实现了高达25%的剪枝,同时保留了原始模型约90%的性能,无需任何训练或修复步骤,从而减少了计算开销。我们提供了一个开源库,实现了ReplaceMe以及几种最新的深度剪枝技术,可在https://github.com/mts-ai/ReplaceMe 获取。
Summary / 总结
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios. Unlike conventional methods that require additional training, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks; this mapping can be merged into the remaining blocks without adding parameters. Experiments show that ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art methods that involve extensive retraining and architectural changes. Applied to large language models, ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps.
ReplaceMe 是一种无需训练的深度剪枝方法,通过将变压器块替换为线性操作来保持低压缩比下的高性能。不同于需要额外训练的传统剪枝方法,ReplaceMe 使用一个小的校准数据集来估计一个线性变换,该变换近似于剪枝后的块。实验表明,ReplaceMe 在无需训练或修复步骤的情况下,优于其他无需训练的剪枝方法,并且与最先进的剪枝方法保持竞争力。应用于大型语言模型时,ReplaceMe 可以实现高达 25% 的剪枝,同时具有最小的计算开销。
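One way to read "estimate a linear transformation from a small calibration dataset" is as an ordinary least-squares fit. The abstract does not state the objective, so the formulation below is an assumption for illustration. Here X (n × d) stacks the hidden states entering the pruned span for n calibration tokens, Y (n × d) stacks the hidden states leaving it, and T (d × d) is the replacement map; all three symbols are notation introduced for this sketch.

```latex
\[
T^{\ast} \;=\; \arg\min_{T \in \mathbb{R}^{d \times d}} \lVert X T - Y \rVert_F^{2}
\;=\; (X^{\top} X)^{-1} X^{\top} Y
\qquad \text{(assuming } X^{\top} X \text{ is invertible)}
\]
```

Under the same assumption, if the block preceding the pruned span ends in a linear projection with weight W, then applying T afterwards equals a single projection with weight W T. That is one way the mapping could be merged without adding parameters.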
Boosting Medical Visual Understanding From Multi-Granular Language Learning
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Venue: ICLR 2026
First: 2025-11-20T00:24:26+00:00 · Latest: 2026-02-19T18:27:29+00:00
Comments: Accepted by ICLR 2026. 40 pages
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
中文标题/摘要
标题:从多粒度语言学习增强医学视觉理解
近期在图像-文本预训练方面的进展显著提升了视觉理解能力,通过视觉和文本表示的对齐。对比语言-图像预训练(CLIP)在多模态学习中发挥了关键作用。然而,其对齐方式仅限于单一标签和单一粒度,限制了其在复杂领域(如医学成像)中的效果,其中图像往往对应多个高级标签(例如,疾病类别)和不同粒度的注释(例如,诊断描述,临床解释)。为解决这一问题,我们提出了多粒度语言学习(MGLL),这是一种对比学习框架,旨在提高多标签和跨粒度对齐。MGLL 利用结构化的多标签监督,整合不同粒度的文本描述,并引入软标签监督和点对点约束以增强对齐。MGLL 使用平滑的Kullback-Leibler(KL)散度确保跨粒度一致性,同时保持计算效率作为视觉-语言模型的即插即用模块。在我们构建的大规模多粒度数据集上预训练,并在多个数据集上进行评估,MGLL 在下游任务中优于其他最先进的方法。代码可在 https://github.com/HUANGLIZI/MGLL/ 获取。
Summary / 总结
The research aims to improve medical visual understanding by addressing the limitations of single-granularity alignment in existing models like CLIP. MGLL, a contrastive learning framework, is proposed to enhance multi-label and cross-granularity alignment through structured multi-label supervision, integrated textual descriptions, and soft-label supervision with point-wise constraints. MGLL outperforms other state-of-the-art methods in downstream tasks when pretrained on large-scale multi-granular datasets.
研究旨在通过解决现有方法如CLIP在单一粒度对齐方面的局限性,提高医学视觉理解。MGLL作为一种对比学习框架,通过结构化的多标签监督、集成的文本描述和带有点对点约束的软标签监督来增强多标签和跨粒度对齐。MGLL在大规模多粒度数据集上预训练后,在下游任务中优于其他最先进的方法。
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy: the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
中文标题/摘要
标题:AI游戏商店:通过人类游戏评估机器通用智能的可扩展、开放性方法
在技术飞速发展的时代,严格评估机器智能与人类通用智能的广泛谱系相比变得越来越重要且具有挑战性。传统的AI基准测试通常仅评估人类活动有限范围内的狭窄能力。大多数基准测试也是静态的,随着开发人员显式或隐式地对其进行优化,它们很快就会饱和。我们提出了一种评估AI系统中类似人类的通用智能的更有前途的方法:通过一种特别强大的通用游戏玩法形式:研究它们如何以及如何很好地玩和学习玩所有可能的人类游戏,与具有相同经验水平、时间或其他资源的人类玩家进行比较。我们定义“人类游戏”为人类设计供人类玩的游戏,并认为这个所有此类游戏的空间——“人类游戏多元宇宙”——是评估的合适空间。为了实现这一愿景的第一步,我们引入了AI游戏商店,这是一个使用人类在环的LLM构建的可扩展和开放性平台,通过自动获取和适应来自流行的人类数字游戏平台的标准和容器化游戏环境变体来合成新的代表性人类游戏。作为概念验证,我们基于Apple App Store和Steam的热门排行榜生成了100个此类游戏,并对七个前沿的视觉语言模型(VLMs)进行了短片段的游戏评估。最好的模型在大多数游戏中的人类平均得分中仅达到了不到10%,尤其是在挑战世界模型学习、记忆和规划的游戏方面表现尤为困难。最后,我们提出了构建AI游戏商店的下一步,作为一种实际的测量和推动机器向人类通用智能发展的方法。
Summary / 总结
The paper proposes the AI GameStore, a scalable platform for evaluating machine general intelligence by comparing AI systems with human players across a wide range of human-designed games. It argues that the space of all such games, the 'Multiverse of Human Games', is a suitable basis for assessing human-like general intelligence. In initial experiments with seven vision-language models on 100 synthesized games, the best models scored less than 10% of the human average on the majority of games and struggled most with games that require world-model learning, memory, and planning.
研究旨在通过广泛的人类设计游戏来全面评估机器智能,与人类一般智能进行比较。方法是创建一个名为AI GameStore的可扩展平台,使用LLM和人类在环中生成基于流行数字游戏平台的新游戏。关键发现表明,先进的视觉语言模型表现不佳,在大多数游戏中仅达到人类平均得分的不到10%,尤其是在需要复杂世界模型学习、记忆和规划能力的游戏方面表现尤为差强人意。
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-19T17:30:28+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型
大型语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,将权重矩阵的每一列表示为共享低维子空间中的表示。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替代低秩分解,其中每个权重矩阵表示为一个稠密字典乘以列稀疏系数矩阵。这产生了一种子空间模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重构误差,而不是权重空间误差。基于激活的格正交化将这个数据感知的目标重新表述为标准的字典学习问题,应用于转换后的权重上,我们支持层内压缩和跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40% 压缩比下,相对于基于SVD的先进基线和强大的结构化剪枝基线,始终能够改善准确度-压缩和困惑度-压缩的权衡。这种结构化稀疏性使得稀疏-密集计算成为可能,并与稀疏系数的后训练量化集成。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models that replaces low-rank factorization with a structured sparse decomposition, improving expressiveness at a fixed parameter budget. It optimizes the factorization on a small calibration set to minimize the functional reconstruction error of layer outputs, and supports both per-layer compression and cross-layer dictionary sharing. Across Llama and Qwen models, CoSpaDi improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based and structured pruning baselines at 20-40% compression ratios. The structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
CoSpaDi 是一种无需训练的大型语言模型压缩框架,使用结构化稀疏分解替换低秩分解,提高表达能力同时保持准确性。它使用校准集优化因子分解,以最小化层输出的功能重构误差,并支持逐层压缩和跨层字典共享。在 Llama 和 Qwen 模型中,CoSpaDi 在 20-40% 压缩比下优于最先进的 SVD 基准和结构化剪枝基准,同时提升准确性和困惑度。结构化稀疏性允许高效稀疏-密集计算,并与稀疏系数的后训练量化集成。
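The step from a data-aware objective to a standard dictionary learning problem can be made concrete with a short derivation. The notation is assumed for this sketch and is not taken from the paper: X (n × d_in) holds calibration activations, W (d_in × d_out) is the weight matrix, D is the dense dictionary, S is the coefficient matrix with columns s_j, and k is a per-column sparsity budget. Using a Cholesky factor L of the Gram matrix is one possible choice of transform, not necessarily the one the authors use.

```latex
\[
\begin{aligned}
&\min_{D,\,S}\; \lVert X W - X D S \rVert_F^{2}
\quad \text{s.t.} \quad \lVert s_j \rVert_0 \le k \ \text{ for every column } s_j \text{ of } S, \\[4pt]
&\text{with } G = X^{\top} X = L L^{\top}: \qquad
\lVert X (W - D S) \rVert_F^{2}
= \operatorname{tr}\!\big( (W - D S)^{\top} G \,(W - D S) \big)
= \lVert L^{\top} W - (L^{\top} D)\, S \rVert_F^{2}.
\end{aligned}
\]
```

Under these assumptions the activations enter only through G, so the problem becomes sparse coding of the transformed weights L^T W against the transformed dictionary L^T D.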
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
中文标题/摘要
标题:LATA:拉普拉斯辅助的归纳适应性转换以提高医学VLM中的校准不确定性
医学视觉-语言模型(VLMs)在医学成像中具有强大的零样本识别能力,但它们在领域转移下的可靠性取决于有保证的校准不确定性。分割一致预测(SCP)提供了有限样本覆盖率,但预测集通常变得很大(效率低),并且类别间的覆盖率不平衡(高类别条件覆盖率差距,CCV),特别是在少量样本、类别不平衡的情况下;此外,直接适应校准标签会破坏可交换性并使保证失效。我们提出了LATA(拉普拉斯辅助的归纳适应性转换),这是一种无需训练和标签的改进方法,通过在图像-图像k-NN图上平滑零样本概率,使用少量CCC更新,通过确定性变换保持SCP的有效性。我们还引入了一种失败感知的校准分数,将其插入视觉-语言不确定性(ViLU)框架中,提供实例级的难度和标签合理性,以提高固定覆盖率下的预测集效率和类别间的平衡。LATA是黑盒的(不更新VLM),计算量轻(窗口化归纳,无反向传播),并包括一个可选的先验旋钮,可以完全不使用标签运行,或者如果需要,可以使用校准边缘信息在标签指导下运行。在三个医学VLM和九个下游任务上,LATA始终减少集合大小和CCV,同时匹配或收紧目标覆盖率,优于先前的归纳基线,并缩小与使用标签方法的差距,同时使用更少的计算资源。全面的消融实验和定性分析表明,LATA在不牺牲可交换性的情况下增强了零样本预测。
Summary / 总结
LATA (Laplacian-Assisted Transductive Adaptation) is a training- and label-free method that refines split conformal prediction for medical vision-language models by smoothing zero-shot probabilities using a k-NN graph, which reduces prediction set size and improves class-wise balance while maintaining finite-sample coverage. It achieves this with minimal computational overhead and outperforms previous transductive baselines across multiple medical VLMs and tasks.
LATA(Laplacian-Assisted Transductive Adaptation)旨在通过在不更新视觉语言模型(VLM)的情况下对校准和测试数据进行精炼,提高医疗VLMs在域转移下的可靠性。它通过小数量的CCCP均场更新在图像-图像k-NN图上平滑零样本概率,保持分裂校准预测的有效性。LATA引入了一种失败感知的校准分数,以增强预测集的效率和类别间的平衡。实验结果显示,LATA在三个医疗VLMs和九个下游任务上减少了集合大小和类别条件下的覆盖差距,同时保持或提高了目标覆盖范围,且计算成本更低,优于之前的归纳方法。
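For readers unfamiliar with split conformal prediction, the baseline that LATA refines, the calibration step can be sketched in a few lines. This shows standard SCP only, not LATA's graph smoothing or its failure-aware score. The nonconformity score (one minus the probability of the true class) is a common choice assumed here, and the function names are hypothetical.

```python
import math

def conformal_threshold(cal_probs, cal_labels, alpha):
    """Return the score threshold giving roughly (1 - alpha) coverage.

    cal_probs: list of per-class probability lists for calibration examples.
    cal_labels: list of true class indices for the same examples.
    """
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1)(1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1.0 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points: every class is kept
    return scores[rank - 1]

def prediction_set(probs, threshold):
    """Keep every class whose nonconformity score is within the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]
```

A smaller threshold gives smaller prediction sets. In this sketch the threshold depends on the data only through the calibration scores, which is why adapting a model to the calibration labels would undermine the coverage guarantee, as the abstract notes.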