arXiv Paper Digest

2026-02-22 03:39
Snapshot: 20260222_0339
When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this failure mode systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate plug-and-play integration across diverse VLAs with consistent improvements. For example, on LIBERO-CF, CAG improves $π_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
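The dual-branch comparison can be sketched as a guidance-style re-scoring of candidate actions. This is a minimal sketch, assuming (the abstract does not spell this out) that CAG contrasts action log-probabilities from the language-conditioned VLA branch and the language-unconditioned VA branch; the `guidance` weight is a hypothetical knob, not a documented parameter.

```python
import numpy as np

def cag_score(logp_vla, logp_va, guidance=1.0):
    """Counterfactual Action Guidance (sketch): score each candidate
    action by how much the language-conditioned VLA policy prefers it
    beyond the language-unconditioned VA branch."""
    logp_vla = np.asarray(logp_vla, dtype=float)
    logp_va = np.asarray(logp_va, dtype=float)
    # Actions favored purely by visual shortcuts also score high under
    # the VA branch, so subtracting it down-weights shortcut behaviors.
    return logp_vla + guidance * (logp_vla - logp_va)

def select_action(logp_vla, logp_va, guidance=1.0):
    """Pick the action with the highest counterfactually-guided score."""
    return int(np.argmax(cag_score(logp_vla, logp_va, guidance)))
```

With `guidance=0` the scheme degenerates to the plain VLA policy; a positive weight flips the choice when the VA branch reveals that the top action is a vision shortcut rather than a language-driven decision.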
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
First: 2026-02-19T18:54:32+00:00 · Latest: 2026-02-19T18:54:32+00:00
Comments: Code at: https://github.com/vila-lab/M-Attack-V2
Abstract
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
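The source-side variance reduction (MCA) can be illustrated with a minimal sketch: average the attack gradient over several independently sampled local views per iteration. The crop-sampling scheme and the `grad_fn` interface below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def multi_crop_gradient(grad_fn, image, num_crops=8, crop_frac=0.6, rng=None):
    """Multi-Crop Alignment (sketch): average gradients from several
    independently sampled local views to reduce the variance of
    crop-level matching. `grad_fn(view)` is assumed to return a
    gradient of the same shape as `view`; it is scattered back to the
    crop's location in the full image here."""
    rng = np.random.default_rng(rng)
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    acc = np.zeros_like(image, dtype=float)
    for _ in range(num_crops):
        y = int(rng.integers(0, h - ch + 1))
        x = int(rng.integers(0, w - cw + 1))
        view = image[y:y + ch, x:x + cw]
        acc[y:y + ch, x:x + cw] += grad_fn(view)  # accumulate per-crop gradient
    return acc / num_crops  # averaging smooths the spike-like crop gradients
```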
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
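The contrastive alignment over the Intent State's dual memories can be sketched as follows. The mean-similarity scoring and the `margin` penalty weight are illustrative assumptions standing in for the paper's actual alignment function.

```python
import numpy as np

def intent_score(candidates, positives, negatives, margin=1.0):
    """Contrastive alignment (sketch): rank candidate object embeddings
    by similarity to confirmed positive anchors while penalizing
    similarity to rejected negative constraints."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    pos = cos(candidates, positives).mean(axis=1) if len(positives) else 0.0
    neg = cos(candidates, negatives).mean(axis=1) if len(negatives) else 0.0
    return pos - margin * neg  # higher = better match to user intent
```

Each round of feedback grows the positive or negative memory set, so re-ranking with this score progressively disambiguates similar objects without any retraining.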
Summary
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. It uses an Intent State maintaining positive anchors and negative constraints to rank candidate objects. On LVIS, IntRec outperforms existing methods by +2.3 to +3.7 AP, and it improves performance by +7.9 AP on the LVIS-Ambiguous benchmark with minimal latency.
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) tasks arriving incrementally need to retrain the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
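The SSR retention rule can be sketched directly from the description: keep the top-p fraction of highest-loss (most informative) samples per (category, task) pair. The tuple layout of `samples` is a hypothetical interface, not the paper's data format.

```python
from collections import defaultdict

def selective_sample_retention(samples, top_p=0.25):
    """Selective Sample Retention (sketch): retain the top-p fraction
    of highest-loss samples per (category, task) pair, so that
    representative samples re-enter training as new tasks arrive.
    `samples` is a list of (category, task, loss, payload) tuples."""
    groups = defaultdict(list)
    for cat, task, loss, payload in samples:
        groups[(cat, task)].append((loss, payload))
    retained = []
    for (cat, task), items in groups.items():
        items.sort(key=lambda t: t[0], reverse=True)  # highest loss first
        k = max(1, int(len(items) * top_p))           # keep at least one
        retained.extend((cat, task, loss, p) for loss, p in items[:k])
    return retained
```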
Summary
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), which communicates category-specific embeddings from clients to a server for synthesizing new data. To mitigate catastrophic forgetting, the authors propose Selective Sample Retention (SSR) to retain the most informative samples. Experimental results show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
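At its core, estimating the replacement map reduces to fitting a linear transformation on calibration activations. A minimal sketch follows, assuming a plain least-squares fit between the pruned blocks' input and output activations; the actual method may use a different estimator or regularization.

```python
import numpy as np

def estimate_replacement(x_in, x_out):
    """ReplaceMe-style estimation (sketch): fit a linear map T so that
    x_in @ T approximates the pruned blocks' outputs x_out, using only
    a small calibration set (closed-form least squares, no training).
    The resulting T can be folded into an adjacent weight matrix."""
    t, *_ = np.linalg.lstsq(x_in, x_out, rcond=None)
    return t
```

Because T is a plain matrix, merging it with the remaining transformer blocks adds no parameters: it multiplies into the next block's input projection.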
Summary
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with linear operations to maintain high performance with low compression ratios. Unlike conventional pruning methods that require additional training, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks. Experiments show that ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods, achieving up to 25% pruning with minimal performance loss and no additional computational overhead.
Boosting Medical Visual Understanding From Multi-Granular Language Learning
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Venue: ICLR 2026
First: 2025-11-20T00:24:26+00:00 · Latest: 2026-02-19T18:27:29+00:00
Comments: Accepted by ICLR 2026. 40 pages
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
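The cross-granularity consistency term can be sketched as a temperature-smoothed KL divergence between two granularities' predicted distributions. The exact smoothing MGLL uses may differ; the temperature `tau` is an assumed parameter for illustration.

```python
import numpy as np

def smooth_kl(p_logits, q_logits, tau=2.0, eps=1e-8):
    """Smoothed KL divergence (sketch): temperature-softened KL used to
    align prediction distributions across annotation granularities
    (e.g., diagnostic description vs. clinical explanation)."""
    def softmax(z):
        z = np.asarray(z, dtype=float) / tau  # temperature smoothing
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```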
Summary
The research aims to improve medical visual understanding by addressing the limitations of single-granularity alignment in existing models like CLIP. MGLL, a contrastive learning framework, is proposed to enhance multi-label and cross-granularity alignment through structured multi-label supervision, integrated textual descriptions, and soft-label supervision. The method uses smooth KL divergence for cross-granularity consistency and is implemented as a plug-and-play module. Experiments show that MGLL outperforms other state-of-the-art methods in downstream tasks when pretrained on large-scale multi-granular datasets.
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
Summary
The paper aims to rigorously evaluate machine intelligence by comparing it to human general intelligence through a scalable platform called AI GameStore. This platform uses large language models and human-in-the-loop synthesis to create new human games, aiming to assess how well AI systems can play and learn all conceivable human games. Key findings show that the best models achieved less than 10% of the human average score on most games, particularly struggling with games that challenge world-model learning, memory, and planning.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-19T17:30:28+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
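The union-of-subspaces factorization W ≈ D S can be sketched with a greedy per-column sparse coding step. This toy version codes in weight space against a fixed dictionary; the real method learns D jointly and minimizes a functional (output-space) objective, which this sketch omits.

```python
import numpy as np

def sparse_code(w, dictionary, k=4):
    """CoSpaDi-style decomposition (sketch): represent each column of w
    as a combination of at most k dictionary atoms (columns of
    `dictionary`), giving a column-sparse coefficient matrix s with
    w approximately equal to dictionary @ s."""
    n_atoms = dictionary.shape[1]
    s = np.zeros((n_atoms, w.shape[1]))
    for j in range(w.shape[1]):
        col = w[:, j]
        # pick the k atoms most correlated with this column, then solve
        # a small least-squares problem restricted to those atoms
        idx = np.argsort(-np.abs(dictionary.T @ col))[:k]
        coef, *_ = np.linalg.lstsq(dictionary[:, idx], col, rcond=None)
        s[idx, j] = coef
    return s
```

Because different columns select different atom subsets, the model spans a union of subspaces rather than the single shared subspace a rank-k SVD would impose.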
Summary
CoSpaDi is a training-free compression framework for large language models that uses a structured sparse decomposition to represent weight matrices, improving expressiveness and accuracy at a fixed parameter budget. It optimizes the factorization using a calibration set to minimize functional reconstruction error of layer outputs, and supports both per-layer and cross-layer dictionary sharing. Experiments show that CoSpaDi outperforms state-of-the-art SVD-based and structured pruning baselines across Llama and Qwen models at 20-40% compression ratios.
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction-set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
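The graph-smoothing step can be sketched as a few mean-field updates over a k-NN adjacency matrix: each sample's belief is pulled toward its neighbors' average belief while staying anchored to its zero-shot prediction. This simplified unary-plus-neighbor update is an assumption about the form of the paper's CCCP iterations, not their exact implementation.

```python
import numpy as np

def laplacian_smooth(probs, adj, alpha=0.5, iters=3):
    """LATA-style refinement (sketch): mean-field updates that smooth
    zero-shot class probabilities over an image-image k-NN graph.
    `probs` is (n_samples, n_classes); `adj` is a binary adjacency
    matrix over the joint calibration and test pool."""
    probs = np.asarray(probs, dtype=float)
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True).clip(min=1.0)
    p = probs.copy()
    for _ in range(iters):
        neighbor = (adj @ p) / deg  # average belief of graph neighbors
        logits = np.log(probs + 1e-8) + alpha * np.log(neighbor + 1e-8)
        logits -= logits.max(axis=1, keepdims=True)
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
    return p
```

Since the same deterministic transform is applied to every sample in the pool, exchangeability between calibration and test points is preserved, which is what keeps the SCP guarantee intact.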
Summary
LATA (Laplacian-Assisted Transductive Adaptation) is proposed to improve the reliability of medical vision-language models (VLMs) under domain shift by refining the joint calibration and test pool. It uses a Laplacian-assisted transductive adaptation method that operates without requiring label information and minimally impacts computational resources. LATA reduces prediction set size and class-wise coverage variance while maintaining or improving target coverage, outperforming previous transductive methods and reducing the gap to label-using methods across multiple medical VLMs and tasks.
Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval
Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós
First: 2026-02-19T14:10:55+00:00 · Latest: 2026-02-19T14:10:55+00:00
Comments: Submitted for ICPR Review
Abstract
Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.
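The per-constraint verification idea can be sketched over a toy scene representation: instead of one opaque similarity score, each atomic truth in the query is checked and reported separately. The dict-based scene and the tuple constraint forms are hypothetical stand-ins for the paper's scene graphs and generated verification code.

```python
def verify_query(scene, constraints):
    """Atomic-constraint verification (sketch): check each constraint
    of a parsed query against a detected scene and report which are
    satisfied and which remain unmet.
    `scene` maps object labels to detected counts; constraints are
    ('exists', label) or ('count', label, n) tuples (hypothetical forms)."""
    report = {}
    for c in constraints:
        if c[0] == "exists":
            report[c] = scene.get(c[1], 0) > 0
        elif c[0] == "count":
            report[c] = scene.get(c[1], 0) == c[2]
    return report
```

Marking satisfied versus unmet constraints is what makes the retrieval result verifiable: a match is returned together with exactly which parts of the query it grounds.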
Summary
This paper addresses the challenges of complex and precise queries in image retrieval by integrating formal verification into deep learning. The proposed framework uses graph-based verification methods and neural code generation to support open-vocabulary natural language queries. It explicitly verifies each atomic truth in the user query against the retrieved content, providing transparent and accountable results. The approach aims to move beyond the ambiguity of vector representations and offer more reliable and verifiable retrieval outcomes.
Selective Training for Large Vision Language Models via Visual Information Gain
Authors: Seulbi Lee, Sangheum Hwang
First: 2026-02-19T09:12:21+00:00 · Latest: 2026-02-19T09:12:21+00:00
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
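The core quantity can be sketched as the per-token log-likelihood gain from conditioning on the image, which is equivalent to a reduction in log-perplexity. How the paper aggregates token-level gains into a sample-level score is assumed here to be a simple mean.

```python
def visual_information_gain(logp_with_image, logp_text_only):
    """Visual Information Gain (sketch): per-token reduction in
    prediction uncertainty contributed by the image, measured as the
    difference in token log-likelihood between the image-conditioned
    and text-only predictions. Positive = visually grounded token."""
    return [lv - lt for lv, lt in zip(logp_with_image, logp_text_only)]

def sample_vig(logp_with_image, logp_text_only):
    """Assumed sample-level aggregate: mean token-level gain."""
    gains = visual_information_gain(logp_with_image, logp_text_only)
    return sum(gains) / len(gains)
```

Tokens whose log-probability barely changes when the image is removed (e.g., function words) get near-zero VIG, while visually grounded tokens such as colors or spatial relations score high, which is what the selective training scheme exploits.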
Summary
This study addresses the issue of language bias in large vision language models (LVLMs) by introducing Visual Information Gain (VIG), a metric that quantifies the reduction in prediction uncertainty from visual inputs. The authors propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens, leading to improved visual grounding and reduced language bias with less supervision compared to existing methods.
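The VIG metric as described (a perplexity-based reduction in prediction uncertainty) can be sketched as the per-token drop in negative log-likelihood when the image is supplied. The function names and the selection threshold below are illustrative, not the paper's API:

```python
def token_vig(nll_without_image, nll_with_image):
    """Per-token Visual Information Gain: the reduction in negative
    log-likelihood when visual input is provided."""
    return [a - b for a, b in zip(nll_without_image, nll_with_image)]

def sample_vig(nll_without_image, nll_with_image):
    """Sample-level VIG as the difference in mean NLL, i.e. the log
    of the perplexity ratio between the two conditions."""
    n = len(nll_with_image)
    return sum(nll_without_image) / n - sum(nll_with_image) / n

def select_high_vig(tokens, nll_without_image, nll_with_image, threshold=0.5):
    """Keep only tokens whose VIG exceeds a threshold, mimicking the
    proposed selective-training scheme (the threshold is illustrative)."""
    gains = token_vig(nll_without_image, nll_with_image)
    return [t for t, g in zip(tokens, gains) if g > threshold]
```

Visually grounded tokens (colors, spatial relations) would show large gains, while tokens predictable from text alone would show near-zero gains and be down-weighted.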
Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
Authors: Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova
First: 2026-02-06T09:32:10+00:00 · Latest: 2026-02-19T08:33:25+00:00
Comments: 17 pages, 11 figures
Abstract
The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
Summary
This paper addresses the challenge of ensuring the robustness of AI-generated content (AIGC) detectors against anti-forensics attacks. The authors propose ForgeryEraser, a framework that performs universal anti-forensics attacks by leveraging multi-modal guidance to manipulate forged image embeddings within the feature space of Vision-Language Models (VLMs). Experiments show that ForgeryEraser significantly degrades the performance of advanced AIGC detectors and causes forensic models to generate explanations consistent with authentic images for forged images. The framework does not require access to the target AIGC detectors and is effective on both global synthesis and local editing benchmarks.
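The multi-modal guidance loss is described as attracting forged-image embeddings toward text-derived authentic anchors while repelling them from forgery anchors. A minimal contrastive sketch of that objective (the function names and the absence of a margin term are assumptions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guidance_loss(img_emb, authentic_anchor, forgery_anchor):
    """Lower is better: pull the forged-image embedding toward the
    text-derived 'authentic' anchor, push it away from the 'forgery'
    anchor, all within a shared VLM (e.g. CLIP-like) feature space."""
    return cosine(img_emb, forgery_anchor) - cosine(img_emb, authentic_anchor)
```

Minimizing this over an adversarial perturbation of the forged image would need no access to any downstream detector, only to the shared backbone, which is the transferability argument the abstract makes.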
B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
Authors: Hiromichi Kamata, Samuel Arthur Munro, Fuminori Homma
First: 2026-02-19T07:14:52+00:00 · Latest: 2026-02-19T07:14:52+00:00
Comments: Project page: https://sony.github.io/B3-Seg-project/
Abstract
Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
Summary
B$^3$-Seg is a method for camera-free and training-free 3D Gaussian Splatting (3DGS) segmentation, addressing the limitations of existing methods that require predefined camera viewpoints or costly retraining. It uses Beta-Bernoulli Bayesian updates and analytic Expected Information Gain (EIG) to select views, ensuring a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments show that B$^3$-Seg achieves competitive results to high-cost supervised methods while operating end-to-end within a few seconds, enabling practical, interactive 3DGS segmentation.
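The sequential Beta-Bernoulli updates and analytic EIG have a compact closed form: with a Beta(a, b) belief over a Gaussian's foreground probability, the mutual information between the next binary vote and the parameter is expressible with digamma functions. A self-contained sketch (the digamma implementation and the per-Gaussian framing are illustrative; the paper's exact parameterization may differ):

```python
import math

def digamma(x: float) -> float:
    """Digamma via recurrence plus an asymptotic series (accurate to ~1e-8)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def update(a: float, b: float, vote: int):
    """Conjugate Beta-Bernoulli update from one binary mask vote."""
    return a + vote, b + (1 - vote)

def eig(a: float, b: float) -> float:
    """Analytic expected information gain: mutual information between
    the next Bernoulli vote and the unknown foreground probability."""
    p = a / (a + b)
    h_pred = -p * math.log(p) - (1 - p) * math.log(1 - p)
    # E[theta*log(theta)] and E[(1-theta)*log(1-theta)] under Beta(a, b)
    e1 = p * (digamma(a + 1) - digamma(a + b + 1))
    e2 = (1 - p) * (digamma(b + 1) - digamma(a + b + 1))
    return h_pred + e1 + e2
```

Greedy next-view selection would then pick the view maximizing summed EIG over its visible Gaussians; the abstract's $(1{-}1/e)$ guarantee follows from the monotonicity and submodularity of that objective.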
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
Authors: Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan
Venue: WACV 2026
First: 2026-02-08T06:13:19+00:00 · Latest: 2026-02-19T03:41:54+00:00
Comments: Accepted at WACV 2026
Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation and keep key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model to reduce temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
Summary
VFace is a training-free method for high-quality video face swapping, integrating seamlessly with diffusion-based image face swapping approaches. It uses Frequency Spectrum Attention Interpolation to maintain key identity characteristics, Target Structure Guidance via attention injection to align structural features, and Flow-Guided Attention Temporal Smoothening to ensure spatiotemporal coherence. The method does not require additional training or fine-tuning and significantly improves temporal consistency and visual fidelity in video face swapping.
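The frequency-spectrum interpolation idea, mixing features in the frequency domain so identity-bearing content survives generation, can be illustrated with a simple low/high-frequency blend of two feature maps. The cutoff and the choice of which band comes from which map are assumptions, not VFace's actual attention-level scheme:

```python
import numpy as np

def frequency_blend(source_feat: np.ndarray, target_feat: np.ndarray,
                    cutoff: float = 0.25) -> np.ndarray:
    """Keep low-frequency content (coarse identity/structure) from one
    2D feature map and high-frequency detail from the other."""
    fs = np.fft.fftshift(np.fft.fft2(source_feat))
    ft = np.fft.fftshift(np.fft.fft2(target_feat))
    h, w = source_feat.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    low_band = radius <= cutoff * min(h, w) / 2
    blended = np.where(low_band, fs, ft)
    return np.real(np.fft.ifft2(np.fft.ifftshift(blended)))
```

VFace applies interpolation to attention rather than raw features, but the band-split intuition is the same: separating spectral bands lets identity cues and fine detail be controlled independently.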
StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection
Authors: Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae
First: 2026-02-19T03:35:24+00:00 · Latest: 2026-02-19T03:35:24+00:00
Abstract
Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap. We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor phi(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization. StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
Summary
The paper addresses the limitations of max pooling in unsupervised anomaly detection, which often results in overlapping scores between normal and anomalous regions. StructCore proposes a training-free method that computes a structural descriptor from the anomaly score map to capture distributional and spatial characteristics. This descriptor is then refined using a diagonal Mahalanobis calibration, achieving high AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, indicating robust image-level anomaly detection.
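The pipeline described, a low-dimensional structural descriptor phi(S) of the score map plus a diagonal Mahalanobis calibration fit on train-good samples, can be sketched as follows. The particular statistics chosen for phi are illustrative, not the paper's exact descriptor:

```python
import numpy as np

def phi(score_map: np.ndarray) -> np.ndarray:
    """Toy structural descriptor phi(S): a few distributional and
    spatial statistics of an anomaly score map, richer than the
    single extreme value used by max pooling."""
    s = np.sort(score_map.ravel())
    top = s[-max(1, s.size // 100):]          # top-1% scores, not just the max
    ys, xs = np.nonzero(score_map >= top.min())
    spread = float(ys.std() + xs.std())       # spatial spread of high scores
    return np.array([s[-1], top.mean(), s.mean(), s.std(), spread])

def calibrated_score(desc, mu, var, eps=1e-8):
    """Diagonal Mahalanobis distance to train-good descriptor statistics."""
    return float(np.sqrt(((desc - mu) ** 2 / (var + eps)).sum()))
```

Usage: compute phi over score maps of good training images, take their per-dimension mean and variance, then score test maps by `calibrated_score`; pixel-level localization is untouched.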
Narrow fine-tuning erodes safety alignment in vision-language agents
Authors: Idhant Gulati, Shivam Raval
First: 2026-02-18T22:47:28+00:00 · Latest: 2026-02-18T22:47:28+00:00
Comments: 24 pages, 11 figures
Abstract
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
Summary
The research explores the challenge of maintaining safety alignment in lifelong multimodal agents that continuously adapt through post-training. By fine-tuning aligned vision-language models on narrow-domain harmful datasets, severe misalignment emerges that generalizes across tasks and modalities. Experiments on Gemma3-4B show that misalignment increases with LoRA rank and that multimodal evaluation detects higher misalignment than text-only evaluation. Even a small amount of harmful data in training can significantly degrade alignment. The study suggests that robust continual learning frameworks are needed to preserve safety alignment in deployed settings.
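The finding that misalignment concentrates in roughly 10 principal components motivates the activation-steering mitigation the authors evaluate: projecting hidden states off a low-dimensional "harmful" subspace. A minimal sketch (the orthonormal-rows convention and projection-removal form are assumptions about the setup):

```python
import numpy as np

def steer(activation: np.ndarray, harmful_dirs: np.ndarray,
          alpha: float = 1.0) -> np.ndarray:
    """Remove the component of a hidden activation that lies in the
    subspace spanned by the orthonormal rows of harmful_dirs
    (e.g. the top principal components of harmful-vs-benign shifts)."""
    proj = harmful_dirs.T @ (harmful_dirs @ activation)
    return activation - alpha * proj
```

As the abstract notes, such steering substantially reduces, but does not eliminate, the learned harmful behavior, consistent with the behavior not being perfectly linear in activation space.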
MALLVI: a multi agent framework for integrated generalized robotics manipulation
Authors: Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
First: 2026-02-18T21:28:56+00:00 · Latest: 2026-02-18T21:28:56+00:00
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI.
Summary
MALLVi is a multi-agent framework that integrates large language and vision models for robotic manipulation, providing closed-loop feedback to handle dynamic environments. Given a natural language instruction and an image, MALLVi generates actions, evaluates feedback, and iteratively refines the process. The framework uses specialized agents for perception, localization, reasoning, and planning, with an optional visual memory agent. Experiments show improved generalization and success rates in zero-shot manipulation tasks compared to prior approaches that rely on open-loop methods or single models.
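The closed-loop cycle described, generate an atomic action, execute, let a VLM-style verifier judge the outcome, then retry or advance, reduces to a nested loop. Every agent below is a stand-in callable, not MALLVi's actual interface:

```python
def closed_loop(instruction, observe, agents, max_retries=5):
    """Execute each sub-goal produced by the Decomposer, retrying an
    action until the Reflector judges the environment feedback as
    success (or retries run out), then move to the next sub-goal."""
    for subgoal in agents["decomposer"](instruction):
        for _ in range(max_retries):
            action = agents["thinker"](subgoal, observe())
            agents["execute"](action)
            if agents["reflector"](subgoal, observe()):
                break  # feedback confirms the sub-goal; advance
    return observe()
```

The Reflector's targeted recovery corresponds to re-entering only the inner loop for a failed sub-goal instead of replanning the whole sequence.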
DODO: Discrete OCR Diffusion Models
Authors: Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman
First: 2026-02-18T20:59:22+00:00 · Latest: 2026-02-18T20:59:22+00:00
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; those introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
Summary
The research aims to improve the efficiency of Optical Character Recognition (OCR) by addressing the computational limitations of autoregressive models. The method introduces DODO, a discrete OCR diffusion model that decomposes generation into blocks to mitigate synchronization errors, enabling parallel decoding. This approach achieves near state-of-the-art accuracy while offering up to 3x faster inference compared to autoregressive baselines.
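Block discrete diffusion decoding, the mechanism DODO uses to avoid the synchronization errors of global diffusion, can be caricatured as follows: positions within the active block are filled in parallel over a few refinement steps while completed blocks stay fixed. `predict` is a stand-in for the model (current sequence in, per-position token guesses out), not DODO's API:

```python
MASK = "<mask>"

def block_diffusion_decode(predict, seq_len, block_size=4, steps=2):
    """Toy block-wise discrete diffusion decoding. Contrast with
    autoregressive decoding, which needs one forward pass per token;
    here each block needs only `steps` passes regardless of width."""
    out = [MASK] * seq_len
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        for _ in range(steps):
            guesses = predict(out)
            out[start:end] = guesses[start:end]  # parallel within the block
    return out
```

With `steps` much smaller than `block_size`, the forward-pass count drops roughly by `block_size / steps`, which is the source of the reported up-to-3x speedup, and it works for OCR precisely because the visual input makes the target sequence near-deterministic.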
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Authors: Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk
Venue: ICLR 2026
First: 2025-08-16T12:26:44+00:00 · Latest: 2026-02-18T20:57:18+00:00
Comments: Accepted to The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
Summary
This work introduces Bongard-RWR+, a dataset of 5,400 real-world-like images generated via a vision language model pipeline, to evaluate abstract visual reasoning in Bongard Problems. The study uses state-of-the-art VLMs for binary and multiclass classification and textual answer generation, finding that VLMs can recognize coarse-grained concepts but struggle with fine-grained ones, indicating limitations in reasoning capabilities.
Can Vision-Language Models Answer Face to Face Questions in the Real-World?
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
Venue: ICLR 2026
First: 2025-03-25T05:13:12+00:00 · Latest: 2026-02-18T20:15:27+00:00
Comments: ICLR 2026 paper
Abstract
AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
Summary
The research aims to evaluate whether vision-language models can answer real-time questions about live scenes captured by a camera. The study introduces the Qualcomm Interactive Video Dataset (IVD) to benchmark these models, showing that current models perform poorly compared to humans and highlighting areas for improvement through fine-tuning. Fine-tuning is found to significantly enhance the models' perceptual skills.
Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Authors: Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad
First: 2026-02-18T19:00:07+00:00 · Latest: 2026-02-18T19:00:07+00:00
Abstract
Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals is limited. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
First: 2026-02-18T18:55:02+00:00 · Latest: 2026-02-18T18:55:02+00:00
Comments: Project page: https://hero-humanoid.github.io/
Abstract
Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
Summary
This paper introduces HERO, a new paradigm for humanoid robots to perform object manipulation in diverse environments. It combines the generalization of large vision models with strong control performance from simulated training. The key method involves an accurate residual-aware end-effector tracking policy that reduces tracking error by 3.2x. The system successfully manipulates various objects in real-world settings like offices and coffee shops, demonstrating effective loco-manipulation capabilities across different heights and object types.
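Of the four ingredients listed for the residual-aware tracker (inverse kinematics, a learned forward model, goal adjustment, replanning), goal adjustment is the simplest to sketch: offset the commanded target by the residual between where the forward model predicts the end-effector will land and where it should. The function and its form are illustrative assumptions, not HERO's implementation:

```python
def adjust_goal(goal, predicted_ee, gain=1.0):
    """Residual-aware goal adjustment: shift the commanded target by
    the predicted tracking error (goal minus the forward model's
    predicted end-effector pose) so execution lands on the true goal."""
    return [g + gain * (g - p) for g, p in zip(goal, predicted_ee)]
```

Iterating predict-adjust-replan in this way is one standard route to shrinking systematic tracking error, which the paper reports cutting by 3.2x with its full policy.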
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose \emph{Saliency-Aware Principle} (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enable stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
Summary / 总结
The paper addresses the challenge of scaling inference-time computation in vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles to enable stable control over discrete generation under noisy feedback and to support multi-route inference. Experimental results show that SAP reduces object hallucination, yields more stable reasoning, and has lower response latency than CoT-style long sequential reasoning under comparable token-generation budgets.
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss, better enhancing the proposed personalized prompts. To advance VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at \href{https://github.com/arctanxarc/MC-LLaVA}{https://github.com/arctanxarc/MC-LLaVA}.
Summary / 总结
This paper introduces MC-LLaVA, a multi-concept personalized vision-language model that addresses the limitations of single-concept personalization by integrating multiple concepts in a single training step. It employs a multi-concept instruction tuning strategy and uses personalized textual and visual prompts to enhance recognition and grounding capabilities. Experimental results show that MC-LLaVA can generate impressive multi-concept personalized responses, improving the real-world applicability of vision-language models as user assistants.
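The personalized textual prompt in MC-LLaVA initializes concept tokens from visual token information to cut training cost. A minimal, hypothetical sketch of that warm-start idea is below; the concept masks, dimensions, and the visual-to-text projection `w_proj` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def init_concept_tokens(visual_tokens, concept_masks, w_proj):
    """Initialize one concept token per user-provided concept from the mean of
    that concept's visual tokens, projected into text-embedding space.
    `w_proj` is a hypothetical visual-to-text projection, not from the paper."""
    return {name: visual_tokens[mask].mean(axis=0) @ w_proj
            for name, mask in concept_masks.items()}

rng = np.random.default_rng(0)
vis = rng.normal(size=(196, 32))            # 14x14 patch grid, 32-dim toy features
masks = {"<cat>": np.arange(196) < 50,      # toy regions occupied by each concept
         "<mug>": np.arange(196) >= 150}
w_proj = rng.normal(size=(32, 24)) * 0.1    # toy visual (32-d) -> text (24-d) map
tokens = init_concept_tokens(vis, masks, w_proj)
print(tokens["<cat>"].shape)                # (24,)
```

Compared with random embeddings, such data-driven initialization gives the subsequent instruction tuning a warm start for each concept token.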
A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
First: 2026-02-18T16:41:32+00:00 · Latest: 2026-02-18T16:41:32+00:00
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
Summary / 总结
The paper addresses the challenge of classifying attributes in street-view images, crucial for applications like autonomous driving. It proposes CLIP-MHAdapter, which enhances CLIP by adding a bottleneck MLP with multi-head self-attention on patch tokens to capture fine-grained attributes. With only about 1.4 million trainable parameters, the method achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost.
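The adapter's core idea, a bottleneck MLP preceded by multi-head self-attention over patch tokens to model inter-patch dependencies, can be sketched minimally in NumPy. Everything below (identity Q/K/V projections, dimensions, residual placement, mean pooling) is an illustrative assumption, not the released implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, n_heads):
    """Toy multi-head self-attention over patch tokens (identity projections
    for brevity; a real adapter would learn W_q, W_k, W_v, W_o)."""
    n, d = tokens.shape
    dh = d // n_heads
    heads = []
    for h in range(n_heads):
        x = tokens[:, h * dh:(h + 1) * dh]        # (n, dh) slice per head
        attn = softmax(x @ x.T / np.sqrt(dh))     # (n, n) inter-patch weights
        heads.append(attn @ x)
    return np.concatenate(heads, axis=1)          # (n, d)

def mh_adapter(patch_tokens, w_down, w_up, n_heads=4):
    """Self-attention over patch tokens, then a bottleneck MLP,
    each with a residual connection; mean-pool for classification."""
    mixed = patch_tokens + multi_head_self_attention(patch_tokens, n_heads)
    hidden = np.maximum(mixed @ w_down, 0.0)      # down-project + ReLU
    out = hidden @ w_up + mixed                   # up-project + residual
    return out.mean(axis=0)                       # pooled feature for a head

rng = np.random.default_rng(0)
tokens = rng.normal(size=(49, 64))     # 7x7 patch grid, 64-dim toy features
w_down = rng.normal(size=(64, 16)) * 0.1
w_up = rng.normal(size=(16, 64)) * 0.1
feat = mh_adapter(tokens, w_down, w_up)
print(feat.shape)                      # (64,)
```

The bottleneck (64 → 16 → 64 here) is what keeps the trainable parameter count small while the attention step restores the localized, inter-patch context that a global CLIP embedding discards.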
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-02-18T15:52:04+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is also scalable in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps for robot exploration. It uses object-level aggregation of features to efficiently store open-vocabulary information and provides a scalable mapping from open-vocabulary queries to 3D geometry. FindAnything matches the state-of-the-art in semantic accuracy but is faster and more memory-efficient, enabling its use in large-scale environments and resource-constrained devices like MAVs. It demonstrates real-time capabilities useful for tasks like autonomous MAV exploration in simulated Search and Rescue scenarios.
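The object-level feature aggregation that keeps the memory footprint low can be illustrated with a toy example: average the pixelwise vision-language features inside each segment into one unit-norm descriptor, then answer open-vocabulary queries by cosine similarity. The 8-dim random features, mean pooling, and query construction below are illustrative assumptions, not the paper's pipeline:

```python
import numpy as np

def aggregate_object_features(pixel_feats, segment_ids):
    """Aggregate pixelwise vision-language features per segment, so each
    object stores one descriptor instead of a dense feature volume."""
    objects = {}
    for seg in np.unique(segment_ids):
        f = pixel_feats[segment_ids == seg].mean(axis=0)
        objects[int(seg)] = f / np.linalg.norm(f)  # unit-normalize for cosine
    return objects

def query_objects(objects, text_embedding):
    """Rank objects by cosine similarity to an open-vocabulary query embedding."""
    q = text_embedding / np.linalg.norm(text_embedding)
    scores = {seg: float(f @ q) for seg, f in objects.items()}
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 8))     # 100 "pixels", 8-dim toy embeddings
segs = np.repeat([0, 1, 2, 3], 25)    # 4 toy segments (eSAM stand-ins)
objs = aggregate_object_features(feats, segs)
query = objs[2] + 0.01 * rng.normal(size=8)  # query near segment 2's descriptor
best, scores = query_objects(objs, query)
print(best)                           # the query matches its own segment
```

Memory then scales with the number of objects rather than the number of voxels carrying raw feature vectors, which is the property that makes the map deployable on MAV-class hardware.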
DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Authors: Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang, Yingnian Wu, Yin Yang, Abishek Sampath Kumar, Kenji Tashiro, Chenfanfu Jiang
First: 2026-02-18T14:45:15+00:00 · Latest: 2026-02-18T14:45:15+00:00
Abstract
Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
Summary / 总结
This paper addresses the challenge of generating sewing patterns from diverse in-the-wild images. It introduces DressWild, a feed-forward method that uses vision-language models to normalize pose variations and extract 3D-informed garment features. These features are then used to predict sewing pattern parameters, enabling direct application to physical simulation and texture synthesis. Experiments show that DressWild can robustly generate diverse sewing patterns and corresponding 3D garments without needing multi-view inputs or iterative optimization, providing an efficient and scalable solution for realistic garment simulation and animation.
Fast and Scalable Analytical Diffusion
Authors: Xinyi Shang, Peng Sun, Jingyu Lin, Zhiqiang Shen
First: 2026-02-18T14:41:09+00:00 · Latest: 2026-02-18T14:41:09+00:00
Abstract
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the ''Golden Subset'' for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a $\bf 71 \times$ speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
Summary / 总结
This work addresses the scalability issue of analytical diffusion models by proposing Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), which dynamically identifies a smaller subset of data for inference, leading to a 71-fold speedup on AFHQ while maintaining or improving performance compared to full-scan baselines. Notably, GoldDiff enables the first successful scaling of analytical diffusion to ImageNet-1K, providing a scalable and training-free approach for large-scale generative modeling.
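The phenomenon GoldDiff exploits is easy to reproduce on toy data. The sketch below (plain Gaussian-kernel weights on 2-D points, ignoring the signal-scaling factor; not the paper's coarse-to-fine retrieval) computes the empirical-Bayes posterior mean and shows that the number of points needed to cover 99% of the posterior mass shrinks as the noise level drops:

```python
import numpy as np

def posterior_mean(x_t, data, sigma):
    """Empirical-Bayes posterior mean denoiser: a softmax-weighted average
    of training points, with weights from Gaussian likelihoods."""
    d2 = ((data - x_t) ** 2).sum(axis=1)
    logw = -d2 / (2 * sigma ** 2)
    w = np.exp(logw - logw.max())     # shift for numerical stability
    w /= w.sum()
    return w @ data, w

def golden_subset_size(w, mass=0.99):
    """Smallest number of points whose weights cover `mass` of the posterior."""
    ws = np.sort(w)[::-1]
    return int(np.searchsorted(np.cumsum(ws), mass) + 1)

rng = np.random.default_rng(2)
data = rng.normal(size=(5000, 2))            # toy 2-D "training set"
x_t = data[0] + 0.05 * rng.normal(size=2)    # noisy observation near one point

# Posterior Progressive Concentration: the effective support shrinks
# from the whole dataset to a local neighborhood as noise decreases.
_, w_hi = posterior_mean(x_t, data, sigma=2.0)   # high noise: broad posterior
_, w_lo = posterior_mean(x_t, data, sigma=0.1)   # low noise: concentrated
print(golden_subset_size(w_hi), golden_subset_size(w_lo))
```

GoldDiff's speedup comes from evaluating the weighted sum only over such a dynamically retrieved subset instead of scanning all of `data` at every timestep.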
SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis
Authors: Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin
Venue: IEEE Robotics and Automation Letters, 2026, pp. 1-8
First: 2025-03-13T11:23:13+00:00 · Latest: 2026-02-18T14:35:21+00:00
Abstract
Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning, but struggle with hallucinations, domain gaps and weak task-interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain-of-Thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or ground outputs clinically. Specifically, we propose a panel discussion mechanism to ensure task-specific agents collaborate synergistically and leverage task interdependencies. Similarly, we incorporate a retrieval-augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. The dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git .
Summary / 总结
SurgRAW is a multi-agent workflow system that uses chain-of-thought reasoning for robotic surgical video analysis. It addresses the fragmented, task-specific nature of existing surgical AI methods and introduces SurgCoTBench, a reasoning-focused benchmark for RAS. SurgRAW employs a hierarchical reasoning approach with specialized agents and an orchestrator to handle task interdependencies and improve interpretability. Experiments show that SurgRAW surpasses mainstream VLMs and agentic systems, and outperforms a supervised model by 14.61% in accuracy on RAS scene-understanding tasks.
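At a skeleton level, the orchestrator-plus-panel pattern can be sketched as below; the stream name, toy agents, and majority-vote panel are stand-ins for illustration, not SurgRAW's actual prompts, agents, or routing logic:

```python
def orchestrate(question, agents, router, panel):
    """Hypothetical SurgRAW-style dispatch: route a query to a reasoning
    stream, collect task-level answers from that stream's specialized
    agents, then let a panel step reconcile them into one output."""
    stream = router(question)
    answers = {name: agent(question) for name, agent in agents[stream].items()}
    return panel(answers)

# Toy instantiation: one "perception" stream whose agents disagree, resolved
# by a majority-vote stand-in for the panel discussion mechanism.
agents = {
    "perception": {"instrument": lambda q: "grasper",
                   "tissue":     lambda q: "grasper",
                   "phase":      lambda q: "clipping"},
}
router = lambda q: "perception"
panel = lambda ans: max(set(ans.values()), key=list(ans.values()).count)
print(orchestrate("Which instrument is in view?", agents, router, panel))  # grasper
```

In the paper the panel step is a structured discussion among agents rather than a vote, but the control flow (route, fan out, reconcile) is the same shape.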
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
First: 2026-02-18T13:40:53+00:00 · Latest: 2026-02-18T13:40:53+00:00
Abstract
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
Summary / 总结
The research aims to improve the accuracy of chart parsing by addressing the limitations of existing models in handling visually dense charts. The proposed Visual Self-Refine (VSR) paradigm enables a model to generate pixel-level localization outputs, visualize them, and use these visualizations to correct potential errors. ChartVSR, an instantiation of VSR in chart parsing, decomposes the process into a Refine Stage and a Decode Stage, ensuring precise data point localizations and accurate structured data parsing. The study introduces ChartP-Bench, a new benchmark for chart parsing, to evaluate the model's performance on complex visual tasks.
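The Refine-then-Decode control flow can be sketched as a feedback loop. The toy "parser" below, which halves its localization error whenever it receives visual feedback, is purely an illustrative stand-in for an LVLM inspecting renderings of its own anchors:

```python
def visual_self_refine(points, propose, check, tol=1.0, max_rounds=8):
    """Refine Stage of a VSR-style loop (hypothetical interface): render the
    model's pixel-level anchors, inspect the error, and re-propose until every
    anchor passes, then hand the verified anchors to the Decode Stage."""
    anchors = propose(points)
    for _ in range(max_rounds):
        errors = check(points, anchors)          # "visualize and inspect"
        if max(errors) < tol:                    # all anchors accepted
            break
        anchors = propose(points, feedback=errors)
    return anchors

def make_toy_parser(initial_offset=8.0):
    """Stand-in for an LVLM that overshoots localization by a fixed offset
    but halves its error whenever visual feedback is provided."""
    state = {"offset": initial_offset}
    def propose(points, feedback=None):
        if feedback is not None:
            state["offset"] /= 2                 # self-correction step
        return [p + state["offset"] for p in points]
    return propose

true_px = [10.0, 42.0, 77.0]                     # ground-truth pixel positions
check = lambda pts, anchors: [abs(a - p) for p, a in zip(pts, anchors)]
anchors = visual_self_refine(true_px, make_toy_parser(), check)
print(max(abs(a - p) for a, p in zip(anchors, true_px)))  # 0.5, below tol=1.0
```

The decode step then parses structured data conditioned on these verified anchors, so downstream values inherit the pixel-level accuracy established in the loop.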