When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice often fail to faithfully follow language. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act based on vision shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training regardless of language intent. To study this problem systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs that evaluates language following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate its plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks using a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
中文标题/摘要
标题:视觉优先于语言:评估和缓解VLAs中的反事实失败
视觉-语言-行动模型(VLAs)承诺将语言指令应用于机器人控制,但在实践中往往未能忠实执行语言指令。当面对缺乏强烈场景特定监督的指令时,VLAs会遭受反事实失败:它们基于由数据集偏差引起的视觉捷径行动,反复执行已学得的行为,并选择在训练期间频繁出现的对象,而不考虑语言意图。为了系统地研究这一问题,我们引入了LIBERO-CF,这是第一个用于评估VLAs语言跟随能力的反事实基准,通过在视觉上合理的LIBERO布局下分配替代指令来进行评估。我们的评估揭示了反事实失败在最先进的VLAs中普遍存在但尚未充分探索。我们提出了反事实行动指导(CAG),这是一种简单而有效的双分支推理方案,明确地在VLAs中正则化语言条件。CAG结合了一个标准的VLA策略和一个未受语言条件的视觉-行动(VA)模块,在行动选择期间进行反事实比较。这种设计减少了对视觉捷径的依赖,提高了对未观察任务的鲁棒性,并且无需额外演示或修改现有架构或预训练模型。广泛的实验表明,它可以在各种VLAs中实现即插即用集成,并且具有持续改进。例如,在LIBERO-CF中,CAG在语言跟随准确性上提高了9.7%,在未观察任务上的任务成功率提高了3.6%,使用无训练策略,配以VA模型时,进一步提高了15.5%和8.5%。在实际世界评估中,CAG将反事实失败减少了9.4%,并将任务成功率平均提高了17.2%。
Summary / 总结
This paper addresses counterfactual failures in Vision-Language-Action models (VLAs), where models act on visual biases rather than on the language instruction. The authors introduce LIBERO-CF, a benchmark that evaluates language following by assigning alternative instructions under visually plausible LIBERO layouts. They propose Counterfactual Action Guidance (CAG), a dual-branch inference scheme that strengthens language conditioning and reduces reliance on visual shortcuts without additional demonstrations or architectural changes. On LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language following accuracy and 3.6% in task success on under-observed tasks with a training-free strategy, rising to 15.5% and 8.5% when paired with a Vision-Action model. In real-world evaluations it reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
论文探讨了Vision-Language-Action模型(VLAs)中的反事实失败问题,即模型基于视觉偏见而非语言指令行动。研究引入了LIBERO-CF基准来评估这一问题,并提出了一种双分支推理方案Counterfactual Action Guidance (CAG),该方案提高了语言跟随准确性和任务成功率,尤其是在未观察到的任务上。CAG减少了反事实失败并增强了鲁棒性,无需额外训练或修改现有模型结构。
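A minimal sketch of the dual-branch comparison that CAG describes, under a loudly stated assumption: the abstract does not give the combination rule, so the sketch borrows the extrapolation form used in classifier-free guidance. The function name `guided_action`, the weight `w`, and the example action vectors are all hypothetical.

```python
from typing import List, Sequence


def guided_action(vla_action: Sequence[float],
                  va_action: Sequence[float],
                  w: float = 1.5) -> List[float]:
    """Combine a language-conditioned and a language-unconditioned action.

    Assumed rule (not taken from the paper): start from the Vision-Action
    (VA) prediction and move w times the difference toward the VLA
    prediction. w = 1 returns the VLA action unchanged; w > 1 amplifies
    the part of the action that depends on the instruction.
    """
    if len(vla_action) != len(va_action):
        raise ValueError("action vectors must have the same length")
    return [va + w * (vla - va) for vla, va in zip(vla_action, va_action)]


# Hypothetical 3-DoF end-effector deltas.
vla = [0.10, 0.00, -0.20]  # conditioned on image + instruction
va = [0.10, 0.05, -0.05]   # conditioned on image only (the visual shortcut)
action = guided_action(vla, va, w=2.0)
```

Dimensions where both branches agree are left untouched, while dimensions where the instruction changes the prediction are pushed further from the vision-only behavior.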
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
First: 2026-02-19T18:54:32+00:00 · Latest: 2026-02-19T18:54:32+00:00
Comments: Code at: https://github.com/vila-lab/M-Attack-V2
Abstract
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
中文标题/摘要
标题:通过细粒度细节目标推动黑盒LVLM攻击前沿
大型视觉-语言模型(LVLMs)的黑盒对抗攻击由于缺乏梯度和复杂的多模态边界而具有挑战性。尽管先前的基于转移的方法,如M-Attack,通过源和目标图像的局部切片级匹配表现良好,但我们发现这会导致梯度在迭代中高度变化且几乎正交,违反了局部一致对齐并破坏了优化。我们将其归因于(i)ViT翻译敏感性导致尖峰梯度和(ii)源和目标切片之间的结构不对称性。我们将局部匹配重新表述为源变换和目标语义的非对称期望,并构建了M-Attack的梯度去噪升级版。在源侧,多切片对齐(MCA)在每次迭代中从多个独立采样的局部视图中平均梯度以减少方差。在目标侧,辅助目标对齐(ATA)用来自语义相关分布的小辅助集替换激进的目标增强,产生更平滑、方差更低的目标流形。我们进一步将动量重新解释为块动量,回放历史切片梯度;结合精细的块大小集合(PE+),这加强了可转移的方向。这些模块共同构成了M-Attack-V2,这是一个简单的模块化增强,显著提高了前沿LVLM的基于转移的黑盒攻击成功率:将Claude-4.0的成功率从8%提升到30%,Gemini-2.5-Pro从83%提升到97%,GPT-5从98%提升到100%,超越了先前的黑盒LVLM攻击。代码和数据可在:https://github.com/vila-lab/M-Attack-V2公开获取。
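The variance-reduction argument behind Multi-Crop Alignment in M-Attack-V2 can be illustrated with a toy simulation. This is only a sketch under assumptions: a scalar "gradient" with additive Gaussian noise stands in for the per-crop gradient, and `noisy_gradient` and `multi_crop_gradient` are hypothetical helpers, not functions from the released code.

```python
import random
import statistics


def noisy_gradient(true_grad: float, noise_std: float, rng: random.Random) -> float:
    """Stand-in for the gradient obtained from one random local crop."""
    return true_grad + rng.gauss(0.0, noise_std)


def multi_crop_gradient(true_grad: float, noise_std: float,
                        num_crops: int, rng: random.Random) -> float:
    """Average the gradients of several independently sampled crops."""
    grads = [noisy_gradient(true_grad, noise_std, rng) for _ in range(num_crops)]
    return sum(grads) / num_crops


rng = random.Random(0)  # fixed seed so the comparison is repeatable
single = [multi_crop_gradient(1.0, 0.5, 1, rng) for _ in range(2000)]
averaged = [multi_crop_gradient(1.0, 0.5, 8, rng) for _ in range(2000)]

var_single = statistics.pvariance(single)      # roughly noise_std ** 2
var_averaged = statistics.pvariance(averaged)  # roughly noise_std ** 2 / 8
```

For independent noise, averaging K crops divides the variance by about K, which is the effect the abstract attributes to averaging multiple local views per iteration.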
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
中文标题/摘要
标题:IntRec:基于意图的对比精炼检索
从复杂场景中检索用户指定的对象仍然是一个具有挑战性的任务,尤其是在查询模糊或涉及多个相似对象的情况下。现有的开放式词汇检测器以单次操作的方式工作,缺乏根据用户反馈精炼预测的能力。为了解决这个问题,我们提出了IntRec,这是一种基于用户反馈进行预测精炼的交互式对象检索框架。其核心是一个意图状态(IS),它维护了正锚点(确认的线索)和负约束(被拒绝的假设)的双重记忆集。对比对齐函数通过最大化与正线索的相似性并惩罚被拒绝的对象来对候选对象进行排名,从而在杂乱的场景中实现细粒度的去模糊化。我们的交互式框架在不增加额外监督的情况下显著提高了检索准确性。在LVIS上,IntRec达到了35.4 AP,分别比OVMR、CoDet和CAKE高出+2.3、+3.7和+0.5。在具有挑战性的LVIS-模糊基准上,它在单次纠正反馈后提高了7.9 AP的性能,每次交互的额外延迟少于30毫秒。
Summary / 总结
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. Its Intent State maintains positive anchors and negative constraints, and a contrastive alignment function uses them to rank candidate objects. On LVIS, IntRec reaches 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5 AP respectively. On the LVIS-Ambiguous benchmark, it improves on its one-shot baseline by +7.9 AP after a single corrective feedback, with less than 30 ms of added latency per interaction.
IntRec 是一种交互式对象检索框架,通过用户反馈来细化预测,提高准确性。它使用意图状态维护正锚和负约束,并使用对比对齐函数来排名候选对象。在 LVIS 上,IntRec 的 AP 比现有方法高出 2.3 到 3.7,且在 LVIS-Ambiguous 基准上通过单次纠正反馈提高了 7.9 的 AP,同时增加了不到 30 毫秒的延迟。
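The ranking step in IntRec can be sketched as follows. The scoring rule here (best cosine similarity to a positive anchor minus a weighted best similarity to a rejected hypothesis) is an assumption chosen for illustration; the abstract only states that similarity to positive cues is maximized and rejected ones are penalized. The names `intent_score`, `rank`, `penalty`, and the 2-D embeddings are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def intent_score(candidate: Sequence[float],
                 positives: List[Sequence[float]],
                 negatives: List[Sequence[float]],
                 penalty: float = 1.0) -> float:
    """Reward similarity to confirmed cues, penalize similarity to rejections."""
    pos = max((cosine(candidate, p) for p in positives), default=0.0)
    neg = max((cosine(candidate, n) for n in negatives), default=0.0)
    return pos - penalty * neg


def rank(candidates: Dict[str, Sequence[float]],
         positives: List[Sequence[float]],
         negatives: List[Sequence[float]]) -> List[str]:
    """Return candidate names sorted from best to worst score."""
    return sorted(candidates,
                  key=lambda name: intent_score(candidates[name], positives, negatives),
                  reverse=True)


candidates = {"mug_left": [1.0, 0.1], "mug_right": [0.9, 0.4]}
positives = [[1.0, 0.2]]  # e.g. the embedding of the initial query
first = rank(candidates, positives, negatives=[])
# The user rejects the top result, so it becomes a negative constraint.
second = rank(candidates, positives, negatives=[candidates["mug_left"]])
```

In this toy case a single rejection is enough to flip the order of two near-identical candidates, which mirrors the one-feedback correction reported on LVIS-Ambiguous.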
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed data streams that are privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning with incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. OSI-FL communicates category-specific embeddings, devised by a frozen vision-language model (VLM) from each client in a single communication round, which a pre-trained diffusion model at the server uses to synthesize new data similar to the client's data distribution. The synthesized samples are used on the server for training. However, two challenges still persist: i) incrementally arriving tasks require retraining the global model, and ii) as future tasks arrive, retraining the model introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category and task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in further iterations. The experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
中文标题/摘要
标题:抵御灾难性遗忘的一次性增量联邦学习
现代大数据系统生成大量异构且地理分散的数据流,规模庞大且隐私敏感,使得集中化变得困难。虽然联邦学习(FL)提供了一种增强隐私的训练机制,但它假设静态数据流,并在多轮次中学习协作模型,这使得在通信受限场景中处理增量数据的学习变得具有挑战性。本文提出了一次性增量联邦学习(OSI-FL),这是第一个解决通信开销和灾难性遗忘双重挑战的FL框架。OSI-FL通过单次通信轮次,将每个客户端的冻结视觉语言模型(VLM)生成的类别特定嵌入发送给服务器,服务器端的预训练扩散模型使用这些嵌入合成与客户端数据分布相似的新数据样本。这些合成样本在服务器端用于训练。然而,仍存在两个挑战:i) 任务以增量方式到达需要重新训练全局模型,ii) 随着未来任务的到达,重新训练模型会导致灾难性遗忘。为此,我们通过选择性样本保留(SSR)增强训练,该方法基于样本损失识别并保留每个类别和任务对中最信息丰富的top-p个样本。SSR通过确保代表性保留样本在后续迭代中被纳入训练来限制遗忘。实验结果表明,OSI-FL在三个基准数据集上的类增量和域增量场景中均优于基线方法,包括传统的和一次性FL方法。
Summary / 总结
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), which communicates category-specific embeddings from each client in a single round, and uses a pre-trained diffusion model to synthesize new data for server-side training. To mitigate forgetting, the authors propose Selective Sample Retention (SSR), which retains the most informative samples per category and task. Experiments show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
本文解决了增量数据下联邦学习中的通信开销和灾难性遗忘问题。提出了增量联邦学习(OSI-FL),通过每个客户端在单轮中传输类别特定的嵌入,使用预训练的扩散模型生成新数据。为了缓解灾难性遗忘,框架中包含选择性样本保留(SSR),根据样本损失保留每个类别和任务对中最信息性的前p个样本。实验表明,OSI-FL在三个基准数据集上的类增量和域增量场景中均优于传统和单次联邦学习方法。
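Selective Sample Retention in OSI-FL can be sketched as a group-and-select step. Two assumptions are made here and are not confirmed by the abstract: "most informative" is read as "highest loss", and `top_p` is treated as an integer count per (category, task) pair. The `Sample` record and its field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Sample(NamedTuple):
    sample_id: str
    category: str
    task: int
    loss: float


def selective_sample_retention(samples: List[Sample],
                               top_p: int) -> Dict[Tuple[str, int], List[Sample]]:
    """Keep the top_p highest-loss samples for every (category, task) pair."""
    groups: Dict[Tuple[str, int], List[Sample]] = defaultdict(list)
    for s in samples:
        groups[(s.category, s.task)].append(s)
    retained = {}
    for key, group in groups.items():
        # Assumption: a higher loss marks a more informative sample.
        ordered = sorted(group, key=lambda s: s.loss, reverse=True)
        retained[key] = ordered[:top_p]
    return retained


samples = [
    Sample("a", "cat", 0, 0.2),
    Sample("b", "cat", 0, 1.5),
    Sample("c", "cat", 0, 0.9),
    Sample("d", "dog", 0, 0.4),
    Sample("e", "cat", 1, 0.1),
]
retained = selective_sample_retention(samples, top_p=2)
```

The retained buffer would then be mixed into training when a later task arrives, which is how the abstract describes SSR bounding forgetting.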
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code: https://github.com/mts-ai/replaceme · Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn · NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
中文标题/摘要
标题:ReplaceMe:通过深度剪枝和Transformer块线性化实现网络简化
我们引入了ReplaceMe,这是一种通用的无需训练的深度剪枝方法,能够有效将Transformer块替换为线性操作,同时在低压缩比下保持高性能。与需要额外训练或微调的传统剪枝方法不同,我们的方法仅需要一个小规模的校准数据集来估计线性变换,该变换近似于剪枝后的块。估计出的线性映射可以无缝地与剩余的Transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe在所有无需训练的方法中表现最佳,并且在涉及大量重新训练/微调和架构修改的最新剪枝方法中保持了高度竞争力。应用于多个大型语言模型(LLMs),ReplaceMe在开放基准测试中实现了高达25%的剪枝,同时保留了原始模型约90%的性能,无需任何训练或修复步骤,从而减少了计算开销。我们提供了一个开源库,实现了ReplaceMe以及几种最先进的深度剪枝技术,可在https://github.com/mts-ai/ReplaceMe 获取。
Summary / 总结
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation while maintaining high performance at low compression ratios. It uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks; the transformation is merged into the remaining blocks, so no additional network parameters are needed. Experiments show ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods that rely on retraining or fine-tuning. On several large language models it prunes up to 25% of the model while retaining approximately 90% of the original performance on open benchmarks, without training or healing steps.
ReplaceMe 是一种无需训练的深度剪枝方法,通过将变压器块替换为线性操作来保持高性能。它使用一个小的校准数据集来估计一个线性变换,以近似剪枝后的块,不需要额外的网络参数。实验表明,ReplaceMe 在性能上优于其他无需训练的方法,并且与涉及大量重新训练/微调和架构修改的最新剪枝方法保持竞争力。它在大型语言模型上实现了高达 25% 的剪枝,同时性能损失最小,无需训练或修复步骤,导致低计算开销。
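The calibration step in ReplaceMe, estimating a linear map that approximates the pruned blocks, can be illustrated with ordinary least squares. This is a sketch under assumptions rather than the paper's exact objective: `x_in` and `x_out` are hypothetical calibration activations entering and leaving the pruned span, and the map is fitted by solving the normal equations in pure Python on a 2-dimensional toy example.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("calibration activations are rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_replacement(x_in: Matrix, x_out: Matrix) -> Matrix:
    """Least-squares T minimizing ||x_in @ T - x_out||_F (normal equations)."""
    xt = transpose(x_in)
    return solve(matmul(xt, x_in), matmul(xt, x_out))


# Toy calibration set: rows are tokens, columns are hidden dimensions.
x_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
true_t = [[2.0, 0.0], [1.0, 3.0]]  # the "pruned blocks" are exactly linear here
x_out = matmul(x_in, true_t)
t_hat = fit_linear_replacement(x_in, x_out)
```

Because the fitted map is a single matrix, it can be folded into an adjacent linear layer, which is consistent with the abstract's claim that no extra parameters are introduced. Real transformer blocks are not linear, so the fit would be approximate rather than exact as in this toy case.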
Boosting Medical Visual Understanding From Multi-Granular Language Learning
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Venue: ICLR 2026
First: 2025-11-20T00:24:26+00:00 · Latest: 2026-02-19T18:27:29+00:00
Comments: Accepted by ICLR 2026. 40 pages
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
中文标题/摘要
标题:从多粒度语言学习增强医学视觉理解
近期在图像-文本预训练方面的进展显著提升了视觉理解能力,通过视觉和文本表示的对齐。对比语言-图像预训练(CLIP)在多模态学习中发挥了关键作用。然而,其对单标签、单粒度对齐的侧重限制了其在医学成像等复杂领域中的效果,因为医学图像往往对应多个高级标签(例如,疾病类别),并且不同注释粒度(例如,诊断描述、临床解释)之间存在差异。为了解决这一问题,我们提出了多粒度语言学习(MGLL),这是一种对比学习框架,旨在提高多标签和跨粒度对齐。MGLL 利用结构化的多标签监督,整合不同粒度的文本描述,并引入软标签监督和点对点约束以增强对齐。MGLL 使用平滑的Kullback-Leibler(KL)散度确保跨粒度一致性,同时保持计算效率作为视觉-语言模型的即插即用模块。在我们构建的大规模多粒度数据集上进行预训练,并在多个数据集上进行评估,MGLL 在下游任务中优于其他最先进的方法。代码可在 https://github.com/HUANGLIZI/MGLL/ 获取。
Summary / 总结
The research aims to improve medical visual understanding by addressing the limitations of single-granularity alignment in existing methods like CLIP. MGLL, a contrastive learning framework, is proposed to enhance multi-label and cross-granularity alignment through structured multi-label supervision, integrated textual descriptions, and soft-label supervision with point-wise constraints. Experiments show that MGLL outperforms other state-of-the-art methods on various downstream tasks when pretrained on large-scale multi-granular datasets.
研究旨在通过解决现有方法如CLIP在单一粒度对齐方面的局限性,提高医学视觉理解。提出了多粒度语言学习(MGLL)框架,通过结构化的多标签监督、集成的文本描述和带点对点约束的软标签监督,增强多标签和跨粒度对齐。MGLL在大规模多粒度数据集上预训练后,在下游任务中优于其他最先进的方法。
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy, the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
中文标题/摘要
标题:AI游戏商店:通过人类游戏评估机器通用智能的可扩展、开放性方法
在技术飞速发展的时代,严格评估机器智能与人类通用智能的广泛谱系相比变得越来越重要且具有挑战性。传统的AI基准测试通常仅评估人类活动有限范围内的狭窄能力。大多数基准测试也是静态的,随着开发人员显式或隐式地对其进行优化,它们很快就会饱和。我们提出了一种评估AI系统中类似人类的通用智能的更有前途的方法:通过一种特别强大的通用游戏玩法形式:研究它们如何以及如何很好地玩和学习玩所有可能的人类游戏,与具有相同经验水平、时间或其他资源的人类玩家进行比较。我们定义“人类游戏”为人类设计供人类玩的游戏,并认为这个所有此类游戏的空间——“人类游戏多元宇宙”——是评估的合适空间。为了实现这一愿景的第一步,我们引入了AI游戏商店,这是一个使用人类在环中与LLM结合的可扩展和开放性平台,通过自动获取和适应来自流行的人类数字游戏平台的标准和容器化游戏环境变体来合成新的代表性人类游戏。作为概念验证,我们基于Apple App Store和Steam的热门排行榜生成了100个此类游戏,并对七个前沿的视觉-语言模型(VLMs)进行了短片段的游戏评估。最好的模型在大多数游戏中的人类平均得分中仅达到了不到10%,尤其是在挑战世界模型学习、记忆和规划的游戏方面表现尤为困难。最后,我们提出了构建AI游戏商店的下一步,作为一种实际的测量和推动机器向人类通用智能发展的方法。
Summary / 总结
The research aims to evaluate machine intelligence more comprehensively by comparing it to human general intelligence through a wide range of games. The method involves creating a scalable platform, AI GameStore, which uses LLMs and human-in-the-loop to generate new human games. Key findings show that the best models achieved less than 10% of the human average score on most games, particularly struggling with those that require world-model learning, memory, and planning skills.
研究旨在通过广泛的游戏来全面评估机器智能,将其与人类的通用智能进行比较。方法是创建一个名为AI GameStore的可扩展平台,该平台使用LLM和人类在环中合成新的人类游戏。关键发现表明,最佳模型在大多数游戏中的得分不到人类平均分的10%,特别是在需要世界模型学习、记忆和规划的游戏方面表现尤为糟糕。
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-19T17:30:28+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
中文标题/摘要
标题:CoSpaDi: 通过校准引导的稀疏字典学习压缩大型语言模型
大型语言模型(LLMs)的后训练压缩通常依赖于低秩权重近似,即将权重矩阵的每一列表示在共享的低维子空间中。这种策略计算效率高,但其背后的约束对于异构投影权重来说可能过于僵硬,可能会导致不必要的准确度损失。我们提出了CoSpaDi(通过稀疏字典学习压缩),这是一种无需训练的框架,用结构化稀疏分解替代低秩分解,其中每个权重矩阵表示为一个稠密字典乘以一列稀疏系数矩阵。这产生了一种子空间模型:权重矩阵的列表示为不同字典原子子集的线性组合,从而在固定参数预算下提高表达能力。CoSpaDi 是校准引导的:使用一个小的校准集,我们优化分解以最小化层输出的功能重构误差,而不是权重空间误差。激活衍生的Gram正交化将这种数据感知目标重新表述为转换后的权重上的标准字典学习问题,并且我们支持层内压缩和组内相似投影之间的跨层字典共享。在Llama和Qwen模型家族中,CoSpaDi 在20-40%的压缩比下,始终优于基于SVD的先进基线和强大的结构化剪枝基线,提高了准确度-压缩和困惑度-压缩的权衡。由此产生的结构化稀疏性使稀疏-密集计算成为可能,并与稀疏系数的后训练量化集成。
Summary / 总结
CoSpaDi is a training-free compression framework for large language models that uses a structured sparse decomposition to represent weight matrices, improving expressiveness and accuracy at a fixed parameter budget. It optimizes the factorization to minimize functional reconstruction error of layer outputs using a calibration set and supports per-layer compression and cross-layer dictionary sharing. Experiments on Llama and Qwen models show that CoSpaDi outperforms state-of-the-art SVD-based and structured pruning baselines at 20-40% compression ratios in terms of accuracy and perplexity trade-offs.
CoSpaDi 是一种无需训练的大型语言模型压缩框架,它使用结构化的稀疏分解来替代低秩分解。它将每个权重矩阵表示为一个稠密字典乘以一列稀疏系数矩阵,从而形成一个子空间并集模型,在固定参数预算下具有更好的表达能力。CoSpaDi 通过校准集来最小化层输出的功能重构误差,并支持逐层压缩和跨层字典共享。实验表明,CoSpaDi 在 20-40% 压缩比下在准确性和困惑度方面优于基于 SVD 的最新基准和强结构化剪枝基准。
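The "fixed parameter budget" comparison in the CoSpaDi abstract can be made concrete with a short worked calculation. The matrix size, rank, dictionary size, and sparsity below are hypothetical values chosen so that both factorizations store the same number of weights; storage for the sparse indices is ignored, which is a simplification.

```python
def low_rank_params(rows: int, cols: int, rank: int) -> int:
    """Weights stored by W ~ U @ V with U (rows x rank) and V (rank x cols)."""
    return rank * (rows + cols)


def sparse_dict_params(rows: int, cols: int, atoms: int, nnz_per_col: int) -> int:
    """Weights stored by W ~ D @ S: a dense dictionary D (rows x atoms) plus
    nnz_per_col nonzero coefficients in each column of S (indices ignored)."""
    return rows * atoms + cols * nnz_per_col


rows = cols = 4096
budget_low_rank = low_rank_params(rows, cols, rank=1024)
budget_dictionary = sparse_dict_params(rows, cols, atoms=1536, nnz_per_col=512)
```

Both settings store 8,388,608 weights. Under the low-rank model every column lies in the same 1024-dimensional subspace, whereas under the dictionary model each column picks its own 512 atoms out of 1536, which is the union-of-subspaces flexibility the abstract refers to.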
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage unbalanced (a high class-conditioned coverage gap, CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant using calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
中文标题/摘要
标题:LATA:拉普拉斯辅助的归纳适应性转换以提高医学VLM中的校准不确定性
医学视觉-语言模型(VLMs)在医学成像中具有强大的零样本识别能力,但它们在领域转移下的可靠性取决于具有保证的校准不确定性。分割一致预测(SCP)提供了有限样本覆盖率,但预测集通常变得很大(效率低),并且类别间的覆盖率不平衡(高类别条件覆盖率差距,CCV),尤其是在少量样本、类别不平衡的情况下;此外,直接适应校准标签会破坏可交换性并使保证失效。我们提出了\texttt{\textbf{LATA}}(拉普拉斯辅助的归纳适应性转换),这是一种无需训练和标签的改进方法,通过在图像-图像k-NN图上平滑零样本概率,使用少量CCC更新,通过确定性变换保持SCP的有效性。我们还引入了一种\textit{失败感知}的校准分数,将其插入视觉-语言不确定性(ViLU)框架中,提供实例级的难度和标签可验证性,以提高固定覆盖率下的预测集效率和类别间的平衡。\texttt{\textbf{LATA}}是黑盒的(不更新VLM),计算量小(窗口化转换,无反向传播),并包括一个可选的先验旋钮,可以完全不使用标签运行,或者如果需要,可以使用校准边缘信息的标签指导变体运行。在\textbf{三个}医学VLM和\textbf{九}个下游任务上,\texttt{\textbf{LATA}}始终减少了集合大小和CCV,同时匹配或收紧目标覆盖率,优于先前的归纳基线,并缩小了与使用标签方法的差距,同时使用了更少的计算资源。全面的消融分析和定性分析表明,\texttt{\textbf{LATA}}在不牺牲可交换性的情况下增强了零样本预测。
Summary / 总结
LATA (Laplacian-Assisted Transductive Adaptation) improves the reliability of medical vision-language models (VLMs) under domain shift by refining zero-shot probabilities over the joint calibration and test pool, without updating the VLM or requiring labels. It smooths the probabilities over an image-image k-NN graph with a small number of CCCP mean-field updates, a deterministic transform that preserves the validity of split conformal prediction, and it introduces a failure-aware conformal score to improve prediction set efficiency and class-wise balance. Experiments on three medical VLMs across nine tasks show that LATA reduces set size and the class-conditioned coverage gap while matching or tightening target coverage, outperforming previous transductive baselines with far less compute.
LATA(Laplacian-Assisted Transductive Adaptation)是一种无需训练和标签的方法,通过使用k-NN图平滑零样本概率并应用确定性变换来精炼医疗视觉-语言模型的联合校准和测试池,同时保持分裂校准预测的有效性。它引入了一种失败感知的校准分数,以提高预测集的效率和类别间的平衡。LATA在三个医疗VLM和九个下游任务上一致地减少了集合大小和类别条件覆盖差异,同时匹配或收紧目标覆盖范围,优于先前的归纳基线,并使用更少的计算资源。
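LATA builds on split conformal prediction (SCP), so a sketch of plain SCP may help when reading the entry above. The code uses the common nonconformity score of one minus the probability of the true class; LATA's graph smoothing and failure-aware score are not implemented here, and the calibration probabilities are made-up numbers.

```python
import math
from typing import List, Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Finite-sample quantile: the ceil((n + 1) * (1 - alpha))-th smallest score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points: include every class
    return sorted(cal_scores)[k - 1]


def prediction_set(probs: Sequence[float], threshold: float) -> List[int]:
    """All classes whose score 1 - p does not exceed the threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= threshold]


# Probability the model assigned to the true class on 9 calibration images.
cal_probs_true = [0.9, 0.8, 0.7, 0.6, 0.95, 0.85, 0.75, 0.65, 0.55]
cal_scores = [1.0 - p for p in cal_probs_true]
q = conformal_threshold(cal_scores, alpha=0.2)  # target coverage of 80%

test_probs = [0.7, 0.2, 0.1]
pred = prediction_set(test_probs, q)
```

Sharper zero-shot probabilities lower the calibration scores and therefore the threshold, which shrinks prediction sets; this is the lever a refinement such as LATA acts on while leaving the conformal step itself unchanged.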
Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval
Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós
First: 2026-02-19T14:10:55+00:00 · Latest: 2026-02-19T14:10:55+00:00
Comments: Submitted for ICPR Review
Abstract
Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.
中文标题/摘要
标题:视觉模型检查:基于图的视觉模式图推理及其在图像检索中的应用
信息检索是现代数字产业的基础。尽管近年来自然语言搜索在嵌入式模型和大规模预训练的推动下取得了显著进展,但该领域仍面临重大挑战。具体而言,涉及复杂关系、对象组合或精确约束(如身份、数量和比例)的查询往往在当前框架中仍未得到解决或可靠性不足。在本文中,我们提出了一种新的框架,通过将基于图的验证方法与神经代码生成相结合,将形式验证整合到基于深度学习的图像检索中。我们的方法旨在支持开放词汇的自然语言查询,同时产生既可信又可验证的结果。通过将检索结果置于形式推理的系统中,我们超越了向量表示中常见的模糊性和近似性。我们的框架不接受不确定性作为既定事实,而是明确验证用户查询中的每个原子事实与检索内容之间的关系。这不仅使我们能够返回匹配结果,还能够识别并标记哪些特定约束得到满足,哪些未得到满足,从而提供一个更加透明和负责任的检索过程,同时提升基于最流行嵌入式模型方法的结果。
Summary / 总结
The paper proposes a novel framework for image retrieval that integrates formal verification with deep learning. It uses graph-based verification methods and neural code generation to support open-vocabulary natural language queries and produce verifiable results. The approach aims to address the limitations of current frameworks in handling complex relationships and precise constraints. Key findings include the ability to explicitly verify each atomic truth in the user query against the retrieved content, leading to more transparent and accountable retrieval results.
论文提出了一种将形式验证与深度学习结合的图像检索框架,使用基于图的验证方法和神经代码生成来支持开放词汇的自然语言查询并产生可验证的结果。该方法旨在解决当前框架在处理复杂关系和精确约束方面的局限性。主要发现包括能够明确验证用户查询中的每个原子真理与检索内容之间的关系,从而实现更透明和负责任的检索结果。
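The per-constraint verification described in the Visual Model Checking entry can be sketched in a heavily simplified form. The paper combines graph-based verification with neural code generation; the sketch below assumes only a flat dictionary of object counts per image and hand-written predicates, so `Scene`, `check`, and the example constraints are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Scene = Dict[str, int]  # hypothetical detector output: object label -> count
Constraint = Tuple[str, Callable[[Scene], bool]]


def check(scene: Scene, constraints: List[Constraint]) -> Dict[str, bool]:
    """Evaluate every atomic constraint and report which ones hold."""
    return {description: predicate(scene) for description, predicate in constraints}


# Query: "two dogs with at least one person, and more dogs than cats".
constraints: List[Constraint] = [
    ("exactly two dogs", lambda s: s.get("dog", 0) == 2),
    ("at least one person", lambda s: s.get("person", 0) >= 1),
    ("more dogs than cats", lambda s: s.get("dog", 0) > s.get("cat", 0)),
]

scene = {"dog": 2, "cat": 3}
report = check(scene, constraints)
satisfied = [name for name, ok in report.items() if ok]
unmet = [name for name, ok in report.items() if not ok]
```

Returning the per-constraint report alongside the image is what lets a retrieval system mark which parts of a query are satisfied and which remain unmet, instead of exposing only a single similarity score.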
Selective Training for Large Vision Language Models via Visual Information Gain
Authors: Seulbi Lee, Sangheum Hwang
First: 2026-02-19T09:12:21+00:00 · Latest: 2026-02-19T09:12:21+00:00
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, they typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
中文标题/摘要
标题:大型视觉语言模型的选择性训练通过视觉信息增益
大型视觉语言模型(LVLMs)取得了显著进展,但它们往往存在语言偏见,生成答案时不依赖视觉证据。尽管先前的工作通过解码策略、架构修改或精心策划的指令数据试图缓解这一问题,但它们通常缺乏一个定量衡量单个训练样本或令牌从图像中实际受益多少的指标。在本文中,我们引入了视觉信息增益(VIG),这是一种基于困惑度的度量,用于衡量视觉输入提供的预测不确定性减少量。VIG 允许在样本和令牌级别进行精细分析,有效突出显示视觉基础的元素,如颜色、空间关系和属性。利用这一点,我们提出了一种基于 VIG 的选择性训练方案,优先选择高 VIG 样本和令牌。这种方法通过专注于仅视觉信息丰富的样本和令牌,提高了视觉基础并减轻了语言偏见,实现了显著减少监督下的优越性能。
Summary / 总结
This work addresses the language bias in Large Vision Language Models (LVLMs) by introducing Visual Information Gain (VIG), a metric that quantifies the reduction in prediction uncertainty from visual input. The authors propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens, improving visual grounding and reducing language bias. This method achieves better performance with less supervision by focusing on visually informative elements.
本文通过引入视觉信息增益(VIG)这一度量视觉输入减少预测不确定性的方法,解决了大型视觉语言模型(LVLM)中的语言偏见问题。作者提出了一种基于VIG的选择性训练方案,优先选择高VIG的样本和令牌,从而提高视觉定位并减少语言偏见。该方法通过专注于视觉信息丰富的元素,实现了更好的性能并减少了监督需求。
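The abstract describes VIG as a perplexity-based measure of how much the image reduces prediction uncertainty. A minimal sketch of one way such a quantity could be computed follows, assuming per-token log-probabilities of the reference answer are already available with and without the image. The function names, the exact formula, the threshold, and the toy numbers are illustrative assumptions, not the paper's definition.

```python
import math

def token_vig(logp_with_image, logp_without_image):
    """Per-token gain: log p(token | image, text) - log p(token | text)."""
    if len(logp_with_image) != len(logp_without_image):
        raise ValueError("log-prob sequences must have equal length")
    return [w - wo for w, wo in zip(logp_with_image, logp_without_image)]

def perplexity(logps):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    return math.exp(-sum(logps) / len(logps))

def sample_vig(logp_with_image, logp_without_image):
    """Sample-level gain: log PPL(no image) - log PPL(image).

    Under this assumed definition it equals the mean per-token gain.
    """
    return (math.log(perplexity(logp_without_image))
            - math.log(perplexity(logp_with_image)))

# Toy log-probabilities (hypothetical): the image helps tokens 1 and 2 only.
with_img = [-0.1, -0.2, -0.3]
without_img = [-0.1, -2.0, -1.5]

gains = token_vig(with_img, without_img)
# Keep tokens whose gain exceeds an assumed threshold of 0.5 nats.
selected = [i for i, g in enumerate(gains) if g > 0.5]
```

In a selective-training setup of the kind the abstract outlines, indices such as `selected` could be used to mask the loss so that only visually informative tokens contribute.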
Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
Authors: Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova
First: 2026-02-06T09:32:10+00:00 · Latest: 2026-02-19T08:33:25+00:00
Comments: 17 pages, 11 figures
Abstract
The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss that drives forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
中文标题/摘要
标题:针对图像伪造检测的多模态指导通用反取证攻击
AI生成内容(AIGC)技术的迅速发展对真实性评估提出了重大挑战。然而,现有的评估协议很大程度上忽视了反取证攻击,未能确保最先进的AIGC检测器在实际应用中的全面鲁棒性。为弥补这一差距,我们提出了一种名为ForgeryEraser的框架,该框架能够在不访问目标AIGC检测器的情况下执行通用反取证攻击。我们揭示了一种源自系统性依赖视觉语言模型(VLMs)作为共享骨干(例如CLIP)的对抗性漏洞,下游AIGC检测器继承了这些公开模型的特征空间。我们设计了一种多模态指导损失,而不是传统的logit优化,以驱动伪造图像嵌入向文本衍生的真实锚点靠拢,从而消除伪造痕迹,同时将它们排斥在伪造锚点之外。广泛的实验表明,ForgeryEraser在全局合成和局部编辑基准上对高级AIGC检测器造成了显著的性能下降。此外,ForgeryEraser促使可解释的取证模型为伪造图像生成与真实图像一致的解释。我们的代码将公开发布。
Summary / 总结
The paper addresses the challenge of ensuring the robustness of AI-generated content (AIGC) detectors against anti-forensics attacks. It proposes ForgeryEraser, a framework that performs universal anti-forensics attacks by leveraging a multi-modal guidance loss to manipulate forged image embeddings within the feature space of Vision-Language Models (VLMs). Experiments show that ForgeryEraser significantly degrades the performance of advanced AIGC detectors and causes forensic models to generate explanations consistent with authentic images for forged images.
论文旨在确保AI生成内容(AIGC)检测器在面对反取证攻击时的鲁棒性。提出了一种名为ForgeryEraser的框架,能够在无需访问目标AIGC检测器的情况下执行通用反取证攻击。通过利用多模态引导损失,ForgeryEraser将伪造图像嵌入在视觉-语言模型的特征空间中,使其模仿真实图像,从而显著降低高级AIGC检测器的性能。实验结果显示,在全局合成和局部编辑基准测试中,性能有显著下降,并且这些攻击使取证模型为伪造图像生成与真实图像一致的解释。
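The guidance loss described above pulls a forged image's embedding toward text-derived "authentic" anchors and pushes it away from "forgery" anchors. The sketch below shows one plausible form of such an objective using cosine similarity on plain Python lists. The loss form, the function names, and the two-dimensional toy vectors are assumptions made for illustration; the paper's actual loss and its use of a VLM encoder are not reproduced here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def guidance_loss(image_emb, authentic_anchors, forgery_anchors):
    """Assumed loss: mean similarity to forgery anchors minus
    mean similarity to authentic anchors (lower is 'more authentic')."""
    pull = sum(cosine(image_emb, a) for a in authentic_anchors) / len(authentic_anchors)
    push = sum(cosine(image_emb, f) for f in forgery_anchors) / len(forgery_anchors)
    return push - pull

# Hypothetical 2-D anchors standing in for text embeddings.
authentic = [[1.0, 0.0]]
forgery = [[0.0, 1.0]]

loss_near_authentic = guidance_loss([1.0, 0.0], authentic, forgery)
loss_near_forgery = guidance_loss([0.0, 1.0], authentic, forgery)
```

An attack in this style would perturb the image so that its embedding lowers this loss; the optimization loop and perturbation budget are left out because the abstract does not specify them.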
B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
Authors: Hiromichi Kamata, Samuel Arthur Munro, Fuminori Homma
First: 2026-02-19T07:14:52+00:00 · Latest: 2026-02-19T07:14:52+00:00
Comments: Project page: https://sony.github.io/B3-Seg-project/
Abstract
Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
中文标题/摘要
标题:B$^3$-Seg:无需相机、无需训练的3DGS分割方法,基于贝塔-伯努利贝叶斯更新和分析期望信息增益
交互式3D高斯点云分割(3DGS)分割对于电影和游戏生产中的实时编辑预重建资产至关重要。然而,现有方法依赖于预定义的相机视角、真实标签或昂贵的重新训练,使其在低延迟使用中不切实际。我们提出了B$^3$-Seg(基于贝塔-伯努利贝叶斯分割的3DGS),这是一种在无需相机和无需训练条件下进行开放词汇3DGS分割的快速且理论基础的方法。我们的方法将分割重新表述为贝塔-伯努利贝叶斯更新的序列,并通过分析期望信息增益(EIG)主动选择下一个视角。这种贝叶斯表述保证了EIG的自适应单调性和次模性,从而产生贪婪的最优视图采样策略的$(1{-}1/e)$近似。在多个数据集上的实验表明,B$^3$-Seg在几秒钟内进行端到端分割时,其结果与高成本监督方法相当。结果表明,B$^3$-Seg能够实现具有可证明信息效率的实用、交互式3DGS分割。
Summary / 总结
B$^3$-Seg is a method for camera-free and training-free 3D Gaussian Splatting (3DGS) segmentation, which is crucial for real-time editing in film and game production. It uses Beta-Bernoulli Bayesian updates and analytic Expected Information Gain to select views, ensuring efficient and adaptive segmentation. Experiments show that B$^3$-Seg achieves results comparable to high-cost supervised methods while operating within a few seconds, enabling practical, interactive 3DGS segmentation.
B$^3$-Seg 是一种无需相机和无需训练的 3D 高斯泼溅(3DGS)分割方法,解决了现有方法需要预设相机视角、真实标签或昂贵的重新训练的问题。它使用贝塔-伯努利贝叶斯更新和分析期望信息增益来选择视角,确保了接近最优视角采样策略的贪婪 $(1{-}1/e)$ 近似。实验表明,B$^3$-Seg 在几秒内端到端运行时,能达到与高成本监督方法相当的结果,实现了实用的、交互式的 3DGS 分割。
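The two ingredients named in the abstract, Beta-Bernoulli updates and an analytic information gain, can be illustrated with a generic textbook construction. The sketch below keeps a Beta(alpha, beta) belief over one binary label and computes the mutual information between that belief and the next binary observation in closed form. It assumes integer pseudo-counts so that digamma differences reduce to harmonic numbers, and it is not claimed to match the paper's EIG formula or its view-selection procedure.

```python
import math

def harmonic(n):
    """n-th harmonic number H_n (H_0 = 0); psi(n + 1) = H_n - Euler gamma."""
    return sum(1.0 / k for k in range(1, n + 1))

def update(alpha, beta, hits, misses):
    """Conjugate Beta-Bernoulli update with integer foreground/background counts."""
    return alpha + hits, beta + misses

def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

def expected_information_gain(alpha, beta):
    """Mutual information between theta ~ Beta(alpha, beta) and one Bernoulli draw.

    EIG = H(E[theta]) - E[H(theta)], with the second term in closed form
    for integer alpha and beta (an assumption of this sketch).
    """
    n = alpha + beta
    mean = alpha / n
    h_n = harmonic(n)
    expected_entropy = (mean * (h_n - harmonic(alpha))
                        + (1.0 - mean) * (h_n - harmonic(beta)))
    return binary_entropy(mean) - expected_entropy

prior = (1, 1)                        # uniform prior over the label probability
posterior = update(*prior, 9, 9)      # hypothetical evidence from earlier views
gain_before = expected_information_gain(*prior)
gain_after = expected_information_gain(*posterior)
```

The gain shrinks as evidence accumulates, which is the property a greedy next-view selector would exploit: views covering still-uncertain regions score higher than views covering regions that are already well determined.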
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
Authors: Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan
Venue: WACV 2026
First: 2026-02-08T06:13:19+00:00 · Latest: 2026-02-19T03:41:54+00:00
Comments: Accepted at WACV 2026
Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation and keep key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features from the target frame to the generation. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence without modifying the underlying diffusion model, reducing the temporal inconsistencies typically encountered in frame-wise generation. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
中文标题/摘要
标题:VFace:基于扩散模型的无训练面部换脸方法
我们提出了一种无训练、即插即用的方法VFace,用于高质量的视频面部换脸。它可以无缝集成到基于扩散模型的图像面部换脸方法中。首先,我们引入了一种频谱注意力插值技术,以促进生成并保持关键身份特征的完整性。其次,我们通过即插即用的注意力注入实现目标结构引导,以更好地对齐目标帧的结构特征与生成。第三,我们提出了一种流引导注意力时空平滑机制,该机制在不修改底层扩散模型的情况下,通过减少帧间生成中常见的时间不一致性来增强时空一致性。我们的方法不需要额外的训练或针对特定视频的微调。大量实验表明,我们的方法显著提高了时间一致性和视觉保真度,提供了一种实用且模块化的视频面部换脸解决方案。我们的代码可在https://github.com/Sanoojan/VFace获取。
Summary / 总结
VFace is a training-free approach for high-quality video face swapping, integrating seamlessly with diffusion-based image face swapping methods. It uses Frequency Spectrum Attention Interpolation for generation and maintaining key identity features, Target Structure Guidance via attention injection for better structural alignment, and Flow-Guided Attention Temporal Smoothening to enhance spatiotemporal coherence. Experiments demonstrate significant improvements in temporal consistency and visual fidelity without requiring additional training or fine-tuning.
VFace 是一种无需训练的方法,用于高质量的视频换脸,可以无缝集成到基于扩散模型的图像换脸方法中。它使用频谱注意力插值来保留关键的身份特征,通过注意力注入实现目标结构引导以对齐结构特征,并通过流动引导的注意力时空平滑机制增强时空一致性。实验结果显示,该方法在时空一致性与视觉保真度方面有显著提升,无需额外训练或特定视频微调。
StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection
Authors: Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae
First: 2026-02-19T03:35:24+00:00 · Latest: 2026-02-19T03:35:24+00:00
Abstract
Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap.
We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor phi(S) that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization.
StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
中文标题/摘要
标题:StructCore:基于结构的图像级评分方法用于无监督异常检测的训练-free 方法
最大池化是基于记忆库的无监督异常检测(UAD)中将异常评分图转换为图像级决策的默认标准。然而,由于它依赖于单一的极端响应,因此会丢弃异常证据在图像中分布和结构的大部分信息,通常导致正常和异常评分重叠。
我们提出了一种训练-free、基于结构的图像级评分方法StructCore,它超越了最大池化。给定一个异常评分图,StructCore 计算一个低维结构描述符 φ(S),捕捉分布性和空间特性,并通过从训练良好样本中估计的对角马氏距离校准来细化图像级评分,而不修改像素级定位。
StructCore 在 MVTec AD 上实现了图像级 AUROC 分数 99.6%,在 VisA 上实现了 98.4%,通过利用最大池化忽略的结构特征展示了稳健的图像级异常检测。
Summary / 总结
The paper addresses the limitation of max pooling in unsupervised anomaly detection, which often leads to overlapping normal and anomalous scores due to its reliance on a single extreme response. StructCore proposes a training-free method that computes a low-dimensional structural descriptor from the anomaly score map to capture distributional and spatial characteristics. It then refines image-level scoring using a diagonal Mahalanobis calibration, achieving high AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, indicating robust image-level anomaly detection.
论文针对最大池化方法在无监督异常检测中的局限性,即由于依赖单一极端响应而导致正常和异常得分重叠。StructCore 提出了一种无需训练的方法,通过从异常得分图中计算低维结构描述符来捕捉分布性和空间特性,并使用从训练良好样本中估计的对角马氏距离校准来细化图像级评分,实现了在 MVTec AD 上的图像级 AUROC 分数为 99.6%,在 VisA 上为 98.4%,表明通过利用最大池化遗漏的结构特征实现了稳健的图像级异常检测。
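The pipeline above, a structural descriptor computed from the anomaly score map followed by a diagonal Mahalanobis calibration on train-good samples, can be sketched in a few lines. The four descriptor features chosen here (max, mean, standard deviation, top-10% mean), the variance floor `eps`, and the toy 2x2 maps are assumptions for illustration; the paper's descriptor phi(S) is not specified in the abstract.

```python
import math
from statistics import mean, pstdev

def descriptor(score_map):
    """Hypothetical structural descriptor of a 2-D anomaly score map."""
    flat = sorted((v for row in score_map for v in row), reverse=True)
    k = max(1, len(flat) // 10)          # size of the top-10% slice
    return [flat[0], mean(flat), pstdev(flat), mean(flat[:k])]

def fit_diagonal(descriptors, eps=1e-6):
    """Per-dimension mean and variance estimated from train-good descriptors."""
    dims = list(zip(*descriptors))
    mu = [mean(d) for d in dims]
    var = [pstdev(d) ** 2 + eps for d in dims]   # eps avoids division by zero
    return mu, var

def mahalanobis_diag(x, mu, var):
    """Mahalanobis distance under a diagonal covariance."""
    return math.sqrt(sum((xi - m) ** 2 / v for xi, m, v in zip(x, mu, var)))

# Toy train-good score maps and one anomalous map (hypothetical values).
good_maps = [
    [[0.1, 0.1], [0.1, 0.2]],
    [[0.1, 0.2], [0.1, 0.1]],
    [[0.2, 0.1], [0.1, 0.15]],
]
bad_map = [[0.9, 0.8], [0.1, 0.1]]

mu, var = fit_diagonal([descriptor(m) for m in good_maps])
good_scores = [mahalanobis_diag(descriptor(m), mu, var) for m in good_maps]
bad_score = mahalanobis_diag(descriptor(bad_map), mu, var)
```

Because only the image-level score is recomputed, the pixel-level score map is left untouched, consistent with the abstract's statement that localization is not modified.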
Narrow fine-tuning erodes safety alignment in vision-language agents
Authors: Idhant Gulati, Shivam Raval
First: 2026-02-18T22:47:28+00:00 · Latest: 2026-02-18T22:47:28+00:00
Comments: 24 pages, 11 figures
Abstract
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
中文标题/摘要
标题:狭窄微调会侵蚀视觉语言代理的安全对齐
终身多模态代理必须通过后训练不断适应新任务,但这在获取能力与保持安全对齐之间造成了根本性的紧张关系。我们证明,对狭窄领域有害数据集进行对齐的视觉语言模型的微调会导致严重的泛化不良对齐,这种不良对齐在与任务和模态无关的任务中广泛存在。通过在Gemma3-4B上的实验,我们表明不良对齐与LoRA秩单调增加,并且多模态评估揭示了显著更高的不良对齐(在r=128时为70.71±1.22),而仅文本评估为(41.19±2.51),这表明单一模态的安全基准可能低估了视觉语言模型中的对齐退化。关键的是,即使训练混合数据中只有10%的有害数据也会导致显著的对齐退化。几何分析表明,有害行为占据了一个非常低维度的子空间,大部分不良对齐信息被10个主成分捕获。为了缓解不良对齐,我们评估了两种策略:良性狭窄微调和激活基于的引导。虽然这两种方法显著减少了不良对齐,但都没有完全消除学习到的有害行为。我们的研究结果强调了需要稳健的持续学习框架,因为当前的后训练范式可能无法在部署后充分保持对齐。
Summary / 总结
This study investigates the impact of narrow fine-tuning on the safety alignment of vision-language agents. It demonstrates that fine-tuning on narrow-domain harmful datasets leads to severe misalignment that generalizes across unrelated tasks and modalities. Experiments on Gemma3-4B show that misalignment increases monotonically with LoRA rank and that multimodal evaluation reveals higher misalignment than text-only evaluation. The study also finds that even 10% harmful data in the training mixture causes substantial alignment degradation, and that harmful behaviors are concentrated in a low-dimensional subspace. Mitigation strategies such as benign narrow fine-tuning and activation-based steering reduce but do not fully eliminate misalignment, highlighting the need for robust continual learning frameworks.
研究探讨了窄范围微调对视觉语言代理安全对齐的影响。研究显示,对有害数据进行微调会导致严重的对齐偏差,这种偏差会在不同任务和模态中泛化。实验表明,对齐偏差随微调秩的增加而增加,且多模态评估揭示的对齐偏差高于仅文本评估。即使训练数据中包含少量有害数据,也会导致显著的对齐降解。研究建议,当前的后训练方法可能无法在部署后充分保持安全对齐。
MALLVI: a multi agent framework for integrated generalized robotics manipulation
Authors: Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
First: 2026-02-18T21:28:56+00:00 · Latest: 2026-02-18T21:28:56+00:00
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine tuning, or prompt tuning, and often operate in an open loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed loop feedback driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents, Decomposer, Localizer, Thinker, and Reflector, to manage perception, localization, reasoning, and high level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real world settings show that iterative closed loop multi agent coordination improves generalization and increases success rates in zero shot manipulation tasks. Code available at https://github.com/iman1234ahmadi/MALLVI.
中文标题/摘要
标题:MALLVI:一种集成通用机器人操作的多智能体框架
使用大型语言模型(LLMs)进行机器人操作的任务规划是一个新兴领域。先前的方法依赖于专门的模型、微调或提示调优,并且通常以开环方式运行,缺乏稳健的环境反馈,使其在动态环境中变得脆弱。我们提出了MALLVi,一种多智能体大型语言和视觉框架,能够通过闭环反馈驱动的机器人操作。给定自然语言指令和环境图像,MALLVi生成可执行的原子动作供机器人操作执行。执行动作后,视觉语言模型(VLM)评估环境反馈并决定是否重复该过程或进行下一步。MALLVi 不使用单一模型,而是协调分解器、定位器、思考者和反思者等专门智能体来管理感知、定位、推理和高级规划。可选的描述者智能体提供初始状态的视觉记忆。反思者通过重新激活相关智能体来支持有针对性的错误检测和恢复,避免全面重新规划。在模拟和真实世界环境中的实验表明,迭代闭环多智能体协调可以提高通用性和零样本操作任务的成功率。代码可在 https://github.com/iman1234ahmadi/MALLVI 获取。
Summary / 总结
MALLVi is a multi-agent framework that uses large language and vision models to enable closed-loop feedback-driven robotic manipulation. Given a natural language instruction and an image, MALLVi generates atomic actions for a robot, which are then evaluated by a Vision Language Model to decide the next steps. Key findings show that iterative closed-loop coordination improves generalization and success rates in zero-shot manipulation tasks.
MALLVi 是一个多代理框架,利用大型语言和视觉模型进行闭环机器人操作。给定自然语言指令和图像后,MALLVi 生成原子动作,然后执行。视觉语言模型在每次操作后评估环境,并决定是否重复或继续。MALLVi 包含感知、定位、推理和规划的专业代理,可选的描述代理提供初始状态的视觉记忆。实验表明,MALLVi 在零样本操作任务中提高了泛化能力和成功率。
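The closed-loop behavior described in this entry (generate atomic actions, execute, let a VLM judge the outcome, then repeat or proceed) can be sketched as a small control loop. All names below are hypothetical stand-ins: `decompose`, `execute`, and `verify` are plain callables rather than the framework's Decomposer, robot interface, or Reflector, and retrying only the failed step is a simplification of the targeted recovery the abstract mentions.

```python
def run_closed_loop(instruction, decompose, execute, verify, max_retries=3):
    """Run each atomic step until the verifier accepts it or retries run out."""
    log = []
    for step in decompose(instruction):
        for attempt in range(1, max_retries + 1):
            observation = execute(step)
            ok = verify(step, observation)
            log.append((step, attempt, ok))
            if ok:
                break                      # proceed to the next step
        else:
            return False, log              # step never succeeded
    return True, log

# Toy stand-ins: the "place" step fails on its first attempt only.
attempts = {}

def toy_execute(step):
    attempts[step] = attempts.get(step, 0) + 1
    return not (step.startswith("place") and attempts[step] == 1)

def toy_verify(step, observation):
    return observation                     # a real verifier would query a VLM

success, log = run_closed_loop(
    "pick cup then place cup",
    lambda text: text.split(" then "),
    toy_execute,
    toy_verify,
)
```

The log makes the difference from open-loop execution visible: the failed placement is detected and retried without re-running the pick step.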
DODO: Discrete OCR Diffusion Models
Authors: Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman
First: 2026-02-18T20:59:22+00:00 · Latest: 2026-02-18T20:59:22+00:00
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLMs) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
中文标题/摘要
标题:DODO: 离散OCR扩散模型
光学字符识别(OCR)是数字化信息的基本任务,是视觉数据与文本理解之间的关键桥梁。尽管现代视觉-语言模型(VLM)在这个领域取得了高精度,但它们主要依赖于自回归解码,这在长文档中变得计算成本高昂且速度缓慢,因为每次生成一个新词都需要进行一次顺序前向传递。我们发现了一个关键机会来克服这一瓶颈:与开放生成不同,OCR是一个高度确定的任务,其中视觉输入严格决定了一个唯一的输出序列,理论上可以通过扩散模型实现高效的并行解码。然而,我们表明现有的掩码扩散模型无法利用这一潜力;这些模型在像描述图像这样的灵活任务中引入的结构不稳定性,在OCR的严格、精确匹配要求下是灾难性的。为了弥合这一差距,我们提出了DODO,这是第一个利用块离散扩散的VLM,从而解锁其在OCR中的加速潜力。通过将生成分解为块,DODO减轻了全局扩散的同步错误。实验证明,我们的方法在接近当前最佳精度的同时,比自回归基线快3倍的推理速度。
Summary / 总结
The paper addresses the computational inefficiency of autoregressive decoding in Optical Character Recognition (OCR) tasks, especially for long documents. It introduces DODO, a discrete OCR diffusion model that decomposes generation into blocks to mitigate synchronization errors, achieving near state-of-the-art accuracy with up to 3x faster inference compared to autoregressive models.
该论文提出了DODO,一种离散OCR扩散模型,旨在解决OCR任务中自回归解码的计算效率问题。通过利用块离散扩散,DODO实现了接近当前最佳的准确率,并且比自回归基线快3倍的推理速度,克服了现有掩码扩散模型在确定性任务如OCR中的结构不稳定性问题。
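The efficiency argument in this entry rests on counting sequential forward passes: autoregressive decoding needs one pass per token, while block-wise diffusion decodes a whole block in a fixed number of denoising steps. The back-of-the-envelope sketch below makes that arithmetic explicit. The block size, step count, and document length are made-up values, and pass counts are only a proxy for wall-clock time, so the resulting ratio should not be read as the paper's reported speedup.

```python
import math

def autoregressive_passes(num_tokens):
    """One sequential forward pass per generated token."""
    return num_tokens

def block_diffusion_passes(num_tokens, block_size, steps_per_block):
    """Blocks are produced in order; each block takes a fixed number of steps."""
    return math.ceil(num_tokens / block_size) * steps_per_block

# Hypothetical setting: a 1024-token page, 32-token blocks, 8 steps per block.
ar = autoregressive_passes(1024)
bd = block_diffusion_passes(1024, block_size=32, steps_per_block=8)
pass_ratio = ar / bd
```

The ratio improves when blocks get larger or denoising steps get fewer, which is also where the abstract locates the risk: too much parallelism reintroduces the synchronization errors that exact-match OCR cannot tolerate.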
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Authors: Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk
Venue: ICLR 2026
First: 2025-08-16T12:26:44+00:00 · Latest: 2026-02-18T20:57:18+00:00
Comments: Accepted to The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. In contrast, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
中文标题/摘要
标题:Bongard-RWR+: 实际世界中细粒度概念的表示
邦德问题(BPs)为抽象视觉推理(AVR)提供了一个具有挑战性的测试平台,要求模型仅从少量示例中识别视觉概念,并用自然语言描述它们。早期的BP基准使用合成的黑白图,可能无法完全捕捉现实场景的复杂性。随后的BP数据集使用了真实世界的图像,但代表的概念可以通过高层图像特征识别,降低了任务的复杂性。不同的是,最近发布的Bongard-RWR数据集旨在使用细粒度的真实世界图像表示原始BP中的抽象概念。然而,由于其手动构建,数据集大小仅限于60个实例,限制了评估的稳健性。在本文中,我们引入了Bongard-RWR+,这是一个由5400个实例组成的BP数据集,使用视觉语言模型(VLM)管道生成的真实世界样例图像表示原始BP的抽象概念。基于Bongard-RWR,我们使用Pixtral-12B描述手动精选的图像并生成与底层概念对齐的新描述,使用Flux.1-dev从这些描述中合成图像,并手动验证生成的图像是否忠实反映了预期的概念。我们评估了最先进的VLMs在不同BP形式下的表现,包括二元和多类分类,以及文本答案生成。我们的发现表明,虽然VLMs可以识别粗粒度的视觉概念,但在区分细粒度概念方面却表现不佳,突显了它们推理能力的局限性。
Summary / 总结
This work addresses the challenge of abstract visual reasoning in Bongard Problems by introducing Bongard-RWR+, a dataset of 5,400 instances built from real-world-like images generated via a vision language model pipeline. The pipeline uses Pixtral-12B to describe manually curated images and produce new concept-aligned descriptions, Flux.1-dev to synthesize images from those descriptions, and manual verification of their fidelity to the intended concepts. Key experimental findings show that state-of-the-art vision language models can recognize coarse-grained concepts but struggle with fine-grained ones, indicating limitations in their reasoning abilities.
该研究引入了包含5,400个通过视觉语言模型生成的现实世界图像的Bongard-RWR+数据集,以评估Bongard问题中的抽象视觉推理。该数据集旨在代表原始BPs中的细粒度概念。研究对最先进的视觉语言模型进行了二分类和多分类分类任务以及文本生成任务的评估,并发现这些模型在细粒度概念上表现不佳,表明它们在推理能力上的局限性。
Can Vision-Language Models Answer Face to Face Questions in the Real-World?
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
Venue: ICLR 2026
First: 2025-03-25T05:13:12+00:00 · Latest: 2026-02-18T20:15:27+00:00
Comments: ICLR 2026 paper
Abstract
AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
中文标题/摘要
标题:视觉-语言模型能否在现实场景中面对面回答问题?
近年来,AI模型在描述和回答关于现实世界图像的问题方面取得了显著进展。它们还在使用音频输入与用户进行实时对话方面取得了进步。这引发了一个问题:我们是否已经达到了这样的程度,即连接了摄像头和麦克风的AI模型能够实时与用户就摄像头前正在发生的场景和事件进行对话?这是AI领域长期以来的一个目标,也是现实世界中的AI助手和类人机器人与人类在日常情况下互动的前提。在本项工作中,我们引入了一个新的数据集和基准——高通互动视频数据集(IVD),以评估现有模型支持这些能力的程度,并确定这些能力可以通过微调到何种程度被赋予。该数据集基于一个简单的问答设置,用户提出问题,系统需要基于摄像头和音频输入实时回答。我们展示了现有模型在该任务上的表现远远落后于人类水平,并指出了性能差距的主要来源。然而,我们还展示了对于许多所需的感知技能,通过这种形式的数据进行微调可以显著缩小这一差距。
Summary / 总结
This study investigates whether current vision-language models can answer questions in real time about live scenes captured by a camera and microphone. The researchers introduce the Qualcomm Interactive Video Dataset (IVD) to evaluate these models. Existing models fall far behind human performance on this task, and the study identifies the main sources of the gap. However, fine-tuning on this form of data significantly reduces the gap for many of the required perceptual skills.
该研究探讨了当前的视觉-语言模型是否能够实时回答由摄像头捕捉到的实时场景的问题。受开发现实世界AI助手和类人机器人目标的驱动,研究人员创建了Qualcomm互动视频数据集(IVD)来评估这些模型。研究发现,现有模型的表现远低于人类,并指出了性能差距的主要原因,但也展示了通过微调可以在实时问答方面显著提高其能力。
Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Authors: Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad
First: 2026-02-18T19:00:07+00:00 · Latest: 2026-02-18T19:00:07+00:00
Abstract
Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such data at the scale required to cover the vast space of potential counterfactuals is difficult. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
中文标题/摘要
标题:三思而行:学习使用双重反事实一致性进行因果推理
尽管大型语言模型(LLMs)在推理基准测试中表现出色,但在面对反事实问题时却显得脆弱,这表明它们在因果推理方面存在弱点。虽然最近的研究表明,标记的反事实任务可以作为LLMs因果推理能力的基准,但由于需要覆盖反事实潜在空间的巨大规模,生成此类数据受到限制。在本研究中,我们引入了双重反事实一致性(DCC),这是一种轻量级的推理时方法,用于衡量和引导LLMs进行因果推理的能力。DCC 不需要标记的反事实数据,而是验证模型执行因果干预和反事实预测这两种因果推理重要元素的能力。使用DCC,我们评估了各种领先LLMs在一系列推理任务和干预中的因果推理能力。此外,我们展示了DCC作为无训练测试时拒绝采样标准的有效性,并证明它可以跨多个模型家族直接提高推理任务的性能。
Summary / 总结
This paper addresses the brittleness of large language models (LLMs) when dealing with counterfactual questions, indicating a need for better causal reasoning. The authors introduce double counterfactual consistency (DCC), an inference-time method that measures and guides LLMs' causal reasoning without requiring labeled counterfactual data. DCC checks a model's ability to perform causal interventions and counterfactual predictions, and, used as a training-free test-time rejection sampling criterion, it is shown to improve performance on reasoning tasks across multiple model families.
本文探讨了大型语言模型(LLMs)在处理反事实问题时的脆弱性,表明需要改进其因果推理能力。作者引入了双反事实一致性(DCC)方法,该方法在无需标注数据的情况下评估和提升LLMs的因果推理能力。DCC衡量模型执行因果干预和反事实预测的能力,并且证明了它可以在不同模型类型上直接提高因果推理任务的性能。
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
First: 2026-02-18T18:55:02+00:00 · Latest: 2026-02-18T18:55:02+00:00
Comments: Project page: https://hero-humanoid.github.io/
Abstract
Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
中文标题/摘要
标题:类人机器人开放词汇视觉移动物体末端执行器控制学习
使用类人机器人在野外对任意物体进行视觉移动物体操作需要精确的末端执行器(EE)控制和通过视觉输入(例如RGB-D图像)对场景的广泛理解。现有方法基于现实世界的模仿学习,由于难以收集大规模训练数据集,因此表现出有限的泛化能力。本文提出了一种新的范式HERO,用于类人机器人物体移动物体操作,结合了大型视觉模型的强大泛化能力和开放词汇理解与模拟训练中强大的控制性能。我们通过设计一种准确的残差感知末端执行器跟踪策略来实现这一点。该末端执行器跟踪策略结合了经典机器人学和机器学习。它使用a) 逆运动学将残差末端执行器目标转换为参考轨迹,b) 用于准确前向运动学的已学习神经前向模型,c) 目标调整,以及d) 重新规划。这些创新共同帮助我们将末端执行器跟踪误差减少了3.2倍。我们使用这种准确的末端执行器跟踪器构建了一个模块化移动物体系统,其中使用开放词汇大型视觉模型实现强大的视觉泛化。我们的系统能够在从办公室到咖啡馆的各种真实世界环境中操作,机器人能够可靠地操作各种日常物体(例如茶杯、苹果、玩具),这些物体位于43cm到92cm高度的表面上。在模拟和现实世界中的系统模块化和端到端测试表明我们提出的设计的有效性。我们认为本文中的进展可以为训练类人机器人与日常物体交互开辟新的训练方式。
Summary / 总结
This paper introduces HERO, a new paradigm for humanoid robots to perform object manipulation tasks in diverse environments. The system combines the strengths of large vision models and simulated training to achieve accurate end-effector control. Key innovations include an accurate residual-aware end-effector tracking policy that reduces tracking errors by 3.2 times. The system successfully manipulates various objects in real-world settings such as offices and coffee shops, demonstrating strong visual generalization and control performance. Comprehensive tests in both simulation and the real world validate the effectiveness of the proposed approach.
本文提出了HERO,一种结合了大型视觉模型的强大泛化能力和精确末端执行器控制的新范式。HERO 使用一种残差感知的末端执行器跟踪策略,包括逆运动学、学习神经前向模型、目标调整和重新规划。这种方法将末端执行器跟踪误差减少了3.2倍。该系统利用开放词汇量的大型视觉模型,在多种真实环境(如办公室和咖啡馆)中实现了可靠地操作各种物体,证明了其在仿真和真实世界中的有效性。
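One of the components listed in the abstract is goal adjustment: correcting the commanded end-effector target using the observed residual between where the hand should be and where it ends up. The one-dimensional sketch below illustrates that idea with a deliberately biased toy tracker. The tracker model, the update rule, and the numbers are assumptions for illustration; the paper's policy additionally uses inverse kinematics, a learned forward model, and replanning, none of which are modeled here.

```python
def adjust_goal(goal, tracker, iterations=3):
    """Shift the commanded target by the observed residual each iteration."""
    command = goal
    errors = []
    for _ in range(iterations):
        reached = tracker(command)         # where the end-effector ends up
        residual = goal - reached          # remaining tracking error
        errors.append(abs(residual))
        command += residual                # compensate on the next command
    return command, errors

def biased_tracker(command):
    """Toy low-level tracker with a gain error and a constant offset (metres)."""
    return 0.9 * command + 0.02

command, errors = adjust_goal(0.5, biased_tracker)
```

With this toy tracker the residual shrinks by roughly a factor of ten per iteration, which shows why feeding the measured residual back into the target can reduce tracking error without retraining the low-level controller.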
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
中文标题/摘要
标题:注意力感知多路径思考:重访视觉-语言推理
视觉-语言模型(VLMs)旨在通过联合利用视觉和文本模态进行推理。虽然为大型语言模型(LLMs)分配额外的推理时间计算已被证明是有效的,但在VLMs中实现类似的扩展仍然具有挑战性。一个关键障碍是视觉输入通常只在生成开始时提供一次,而文本推理(例如,早期视觉摘要)是自回归生成的,这使得推理变得越来越以文本为主导,并允许早期视觉定位错误累积。此外,推理期间的视觉定位指导通常粗糙且嘈杂,这使得在长文本上引导推理变得困难。为了解决这些挑战,我们提出了\emph{注意力感知原则}(SAP)选择。SAP 在高层次推理原则而非标记级轨迹上操作,这使得在嘈杂反馈下稳定控制离散生成成为可能,同时允许后续推理步骤在需要重新咨询视觉证据时重新访问视觉证据。此外,SAP 支持多路径推理,允许并行探索多种推理行为。SAP 是模型无关的,不需要额外的数据,也不需要额外的训练。实验证明,SAP 在与可比的标记生成预算下实现了竞争力的表现,特别是在减少对象幻觉方面,同时提供了比CoT风格的长序列推理更稳定的推理和更低的响应延迟。
Summary / 总结
The paper addresses the difficulty of scaling inference-time computation in vision-language models (VLMs), where reasoning becomes increasingly text-dominated and early visual grounding errors accumulate. It proposes Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles rather than token-level trajectories to enable stable control over discrete generation under noisy feedback. SAP supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors, and requires no additional training or data. Empirical results show that, under comparable token-generation budgets, SAP reduces object hallucination, yields more stable reasoning, and has lower response latency than CoT-style long sequential reasoning.
论文提出了一种Saliency-Aware Principle (SAP)选择方法,该方法在高阶推理原则上操作,以在嘈杂的反馈下稳定控制离散生成。SAP支持多路线推理,允许并行探索多种推理行为。实验结果表明,SAP可以减少物体幻觉,提供更稳定的推理,并具有较低的响应延迟,优于长序列推理。
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss that further enhances the proposed personalized prompts. To support VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
中文标题/摘要
标题:MC-LLaVA:多概念个性化视觉语言模型
当前的视觉语言模型(VLMs)在各种任务上表现出色,如视觉问答。为了提升用户体验,最近的研究探讨了VLM的个性化,以理解用户提供的概念。然而,这些研究主要集中在单一概念上,忽视了多个概念的存在及其相互作用,这限制了其在实际中的应用。本文提出了一种多概念个性化范式MC-LLaVA。具体而言,MC-LLaVA采用多概念指令调优策略,在单个训练步骤中有效整合多个概念。为了降低训练成本,我们提出了一种个性化的文本提示,利用视觉标记信息初始化概念标记。此外,在推理过程中,我们引入了个性化的视觉提示,聚合位置图以增强识别和语义关联能力。为了进一步提高性能上限,我们引入了一个可选的辅助损失,更好地增强了提出的个性化提示。为了丰富VLM个性化研究,我们贡献了一个高质量的数据集。我们精心收集了来自电影的多角色和多物体的图像,并手动创建了多概念场景下的问题-答案样本,具有更高的多样性。全面的实验表明,MC-LLaVA实现了令人印象深刻的多概念个性化响应,为VLMs成为更好的用户助手铺平了道路。代码和数据集将在https://github.com/arctanxarc/MC-LLaVA发布。
Summary / 总结
This paper addresses the limitation of single-concept personalization in vision-language models (VLMs) by proposing MC-LLaVA, a multi-concept personalization paradigm. It employs a multi-concept instruction tuning strategy and uses personalized textual and visual prompts to integrate multiple concepts during training and inference. Experimental results show that MC-LLaVA can generate impressive multi-concept personalized responses, enhancing the real-world applicability of VLMs as user assistants.
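This paper proposes MC-LLaVA, a multi-concept personalization paradigm that addresses the limitations of single-concept personalization in vision-language models (VLMs). MC-LLaVA uses a multi-concept instruction tuning strategy together with personalized textual and visual prompts to integrate multiple concepts during training and inference. The model also introduces an optional auxiliary loss to strengthen the personalized prompts. Experimental results show that MC-LLaVA can generate impressive multi-concept personalized responses, improving the practical applicability of VLMs as user assistants. A high-quality dataset is also contributed to support research in this area.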
A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
First: 2026-02-18T16:41:32+00:00 · Latest: 2026-02-18T16:41:32+00:00
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
中文标题/摘要
标题:一种基于注意力特征适应的对比学习框架用于街景图像分类
街景图像属性分类是图像分类的重要下游任务,能够支持自动驾驶、城市分析和高精度地图构建等应用。无论是从零开始训练、使用预训练权重初始化还是微调大型模型,该任务仍然计算量大。尽管预训练的视觉-语言模型如CLIP提供了丰富的图像表示,但现有的适应或微调方法往往依赖于它们的全局图像嵌入,限制了它们捕捉复杂、杂乱街景中细微、局部化的属性的能力。为了解决这个问题,我们提出了CLIP-MHAdapter,这是一种轻量级CLIP适应范式的变体,通过在补丁标记上附加一个具有多头自注意力的瓶颈MLP来建模补丁之间的依赖关系。CLIP-MHAdapter在Global StreetScapes数据集上的八个属性分类任务中实现了优于或竞争力的准确性,同时保持了较低的计算成本。代码可在https://github.com/SpaceTimeLab/CLIP-MHAdapter获取。
Summary / 总结
The research aims to improve the accuracy of street-view image attribute classification for applications like autonomous driving and urban analytics. The method introduces CLIP-MHAdapter, which enhances CLIP by adding a bottleneck MLP with multi-head self-attention on patch tokens to capture fine-grained attributes. This approach achieves state-of-the-art results with only 1.4 million trainable parameters, demonstrating superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset while maintaining low computational cost.
研究旨在提高街景图像属性分类的准确性,这对于自动驾驶和城市分析等应用至关重要。方法引入了CLIP-MHAdapter,通过在patch token上添加具有多头自注意力的瓶颈MLP来增强CLIP,以捕捉细粒度的属性。该方法在Global StreetScapes数据集上的八个属性分类任务中实现了优于或可竞争的准确性,达到了新的最佳性能,同时保持了较低的计算成本。该模型约有140万可训练参数。
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-02-18T15:52:04+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
中文标题/摘要
标题:FindAnything:任意词汇和对象中心的映射框架,用于机器人在任意环境中的探索
几何准确且语义丰富的地图表示已被证明对于在未知环境中部署机器人和任务规划至关重要。然而,实时、开放词汇的大型未知环境语义理解仍然存在挑战,主要由于计算需求。本文提出了一种名为FindAnything的开放世界映射框架,将视觉-语言信息整合到密集的体积子地图中。通过使用视觉-语言特征,FindAnything结合了纯几何和开放词汇的语义信息,提高了理解水平。它通过在对象级别聚合特征来高效存储开放词汇信息。基于eSAM片段的像素级视觉-语言特征被聚合,并整合到对象中心的体积子地图中,提供了一种从开放词汇查询到3D几何的映射,同时在内存使用方面也具有可扩展性。我们证明FindAnything在语义准确性方面与最新技术相当,但速度更快且更节省内存,使其能够在大型环境中部署,并在资源受限的设备上运行,如MAVs。我们展示了FindAnything的实时能力使其适用于下游任务,例如在模拟的搜索和救援场景中自主MAV探索。
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to achieve both geometric and semantic understanding. It aggregates pixelwise vision-language features over eSAM segments and integrates the resulting object-level features into object-centric volumetric submaps, providing a memory-scalable mapping from open-vocabulary queries to 3D geometry. FindAnything matches the state-of-the-art in semantic accuracy while being substantially faster and more memory-efficient, making it suitable for large-scale environments and resource-constrained devices such as MAVs. Its real-time capabilities are shown to be useful for downstream tasks such as autonomous MAV exploration in a simulated Search and Rescue scenario.
FindAnything 是一种将视觉-语言信息集成到密集体素子地图中的开放世界映射框架,以实现几何和语义理解。它通过对象级聚合视觉-语言特征来高效存储开放词汇信息,并将这些特征集成到以对象为中心的体素子地图中。实验结果显示,FindAnything 在语义准确性上与最先进的技术相当,但速度更快且更节省内存,适用于大规模环境和资源受限设备如 MAV。它被证明适用于下游任务,如模拟搜索和救援场景中的自主 MAV 探索。
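The object-level aggregation described above can be illustrated with a minimal sketch. It assumes mean pooling of per-pixel features within each segment and cosine similarity against a query embedding; the function names, the pooling rule, and the toy data are hypothetical and not taken from the FindAnything implementation.

```python
import math
from collections import defaultdict


def aggregate_by_segment(features, segment_ids):
    """Average per-pixel feature vectors within each segment.

    features: list of per-pixel feature vectors (lists of floats).
    segment_ids: one segment label per pixel.
    Returns a dict mapping segment id to its mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for feat, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
        for d, value in enumerate(feat):
            sums[seg][d] += value
        counts[seg] += 1
    return {seg: [v / counts[seg] for v in total] for seg, total in sums.items()}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy data: four pixels with 2-D features, split into two segments.
pixel_features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
segments = [0, 0, 1, 1]
object_features = aggregate_by_segment(pixel_features, segments)

# Stand-in for a text-query embedding (assumption: same feature space).
query = [0.0, 1.0]
best = max(object_features, key=lambda s: cosine(object_features[s], query))
```

Storing one vector per segment rather than one per pixel is what keeps memory use bounded by the number of objects in this sketch.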
DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Authors: Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang, Yingnian Wu, Yin Yang, Abishek Sampath Kumar, Kenji Tashiro, Chenfanfu Jiang
First: 2026-02-18T14:45:15+00:00 · Latest: 2026-02-18T14:45:15+00:00
Abstract
Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extracts pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
中文标题/摘要
标题:DressWild:从野生图像中生成前馈姿态无关服装缝制模板
服装模板生成的最新进展显示了令人鼓舞的进展。然而,现有的前馈方法在处理多样化的姿态和视角方面存在困难,而基于优化的方法则计算成本高且难以扩展。本文关注于服装建模和制造应用中所需的可编辑、可分离和可用于模拟的服装的缝制模板生成。我们提出了一种名为DressWild的新型前馈管道,可以从单张野生图像中重建符合物理规律的2D缝制模板及其对应的3D服装。给定输入图像,我们的方法利用视觉-语言模型(VLMs)在图像级别上归一化姿态变化,然后提取姿态感知的、3D启发式的服装特征。这些特征通过基于变换器的编码器融合,随后用于预测缝制模板参数,这些参数可以直接应用于物理模拟、纹理合成和多层虚拟试穿。广泛的实验表明,我们的方法无需多视角输入或迭代优化即可稳健地从野生图像中恢复出多样化的缝制模板及其对应的3D服装,提供了一种高效且可扩展的现实服装模拟和动画解决方案。
Summary / 总结
This paper addresses the challenge of generating sewing patterns for diverse poses in in-the-wild images. It introduces DressWild, a feed-forward pipeline that uses vision-language models to normalize pose variations and extract 3D-informed garment features. These features are then used to predict sewing pattern parameters, which can be directly applied to physical simulation and virtual try-on. Experiments show that DressWild can robustly generate diverse sewing patterns and 3D garments from single images without the need for multi-view inputs or iterative optimization, providing an efficient and scalable solution for garment simulation and animation.
本文解决了从野生图像生成缝制图案的挑战,这些图像由于多样化的姿态和视角往往难以用现有的前馈方法处理。作者提出了DressWild,这是一种新颖的前馈管道,使用视觉语言模型来归一化姿态变化并提取3D相关的服装特征。这些特征随后用于预测缝制图案参数,可以直接应用于物理模拟和纹理合成。实验表明,DressWild可以从单张图像中稳健地生成多样化的缝制图案和对应的3D服装,无需多视角输入或迭代优化,提供了一种高效且可扩展的解决方案,用于现实的服装模拟和动画。
Fast and Scalable Analytical Diffusion
Authors: Xinyi Shang, Peng Sun, Jingyu Lin, Zhiqiang Shen
First: 2026-02-18T14:41:09+00:00 · Latest: 2026-02-18T14:41:09+00:00
Abstract
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the "Golden Subset" for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a 71× speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
中文标题/摘要
标题:快速且可扩展的分析性扩散
分析性扩散模型通过将去噪分数形式化为经验贝叶斯后验均值,提供了一条数学上透明的生成建模路径。然而,这种可解释性付出了高昂的成本:标准形式要求在每个时间步长进行整个数据集扫描,其规模与数据集大小成线性关系。在本文中,我们首次系统地研究了这一可扩展性瓶颈。我们挑战了整个训练数据必不可少的假设,揭示了后验渐进集中现象:去噪分数的有效黄金支持并非静态的,而是随着信噪比增加从全局流形逐渐收缩到局部邻域。利用这一发现,我们提出了动态时间感知黄金子集扩散(GoldDiff),这是一种无需训练的框架,将推理复杂度与数据集大小脱钩。GoldDiff 不使用静态检索,而是采用粗到细机制动态定位“黄金子集”进行推理。理论上,我们推导出严格的界确保我们的稀疏近似收敛到精确分数。实验上,GoldDiff 在 AFHQ 上实现了 71 倍的速度提升,同时匹配甚至超越了全扫描基线的性能。最值得注意的是,我们展示了分析性扩散首次成功扩展到 ImageNet-1K,解锁了一种可扩展的、无需训练的大规模生成建模范式。
Summary / 总结
This paper addresses the scalability bottleneck of analytical diffusion models, whose standard formulation scans the full dataset at every timestep. It proposes Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size through a coarse-to-fine mechanism that dynamically selects a subset of the data. The method leverages the phenomenon of Posterior Progressive Concentration, where the effective support of the denoising score shrinks from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Empirically, GoldDiff achieves a 71× speedup on AFHQ while matching or improving on full-scan baselines, and scales analytical diffusion to ImageNet-1K for the first time.
该研究通过提出动态时间感知黄金子集扩散(GoldDiff)方法,解决了分析性扩散模型的可扩展性问题,该方法在推理时动态选择数据子集而非使用完整数据集。该方法利用了后验渐进集中现象,即随着信噪比增加,去噪分数的有效支持会逐渐缩小。实验结果显示,GoldDiff 在 AFHQ 上实现了 71 倍的加速,同时匹配或超越了全扫描基线方法。特别地,它首次成功将分析性扩散模型扩展到 ImageNet-1K,开启了大规模生成建模的可扩展、无训练框架。
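To make the empirical-Bayes posterior mean concrete, the sketch below computes the denoised estimate as a softmax-weighted average of training points, first over the full dataset and then over only the k points nearest to the noisy sample. It assumes a forward process of the form x_t = alpha * x_0 + sigma * noise; the function names, the nearest-neighbour subset rule, and the toy data are illustrative assumptions and not GoldDiff's actual coarse-to-fine retrieval.

```python
import math


def posterior_mean(x_t, data, alpha, sigma):
    """Empirical-Bayes denoiser: softmax-weighted mean of the training points."""
    # Log-weight of each training point x_i under x_t ~ N(alpha * x_i, sigma^2 I).
    logits = [
        -sum((xt - alpha * xi) ** 2 for xt, xi in zip(x_t, x)) / (2 * sigma ** 2)
        for x in data
    ]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(x_t)
    return [sum(w * x[d] for w, x in zip(weights, data)) / total for d in range(dim)]


def subset_posterior_mean(x_t, data, alpha, sigma, k):
    """Same estimate, restricted to the k training points closest to x_t.

    The nearest-neighbour rule is a hypothetical stand-in for a learned or
    coarse-to-fine subset selection.
    """
    def sq_dist(x):
        return sum((xt - alpha * xi) ** 2 for xt, xi in zip(x_t, x))

    subset = sorted(data, key=sq_dist)[:k]
    return posterior_mean(x_t, subset, alpha, sigma)


# Toy 2-D dataset and a low-noise sample near the point [1, 1].
data = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [-4.0, 3.0]]
x_t = [0.9, 1.1]
full = posterior_mean(x_t, data, alpha=1.0, sigma=0.3)
sparse = subset_posterior_mean(x_t, data, alpha=1.0, sigma=0.3, k=2)
```

At low noise the softmax weights of distant points are vanishingly small, so the two-point estimate is numerically indistinguishable from the full scan in this toy case. This is the intuition behind the support shrinking as the signal-to-noise ratio increases.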
SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis
Authors: Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin
Venue: IEEE Robotics and Automation Letters, 2026, pp. 1-8
First: 2025-03-13T11:23:13+00:00 · Latest: 2026-02-18T14:35:21+00:00
Abstract
Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning, but struggle with hallucinations, domain gaps and weak task-interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14,256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain-of-Thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or ground output clinically. Specifically, we propose a panel discussion mechanism to ensure task-specific agents collaborate synergistically and leverage task interdependencies. Similarly, we incorporate a retrieval-augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. Dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git.
中文标题/摘要
标题:SurgRAW:多智能体工作流与链式思考推理在机器人手术视频分析中的应用
机器人辅助手术(RAS)是现代手术的核心,推动了智能系统准确场景理解的需求。目前大多数现有的手术AI方法依赖于孤立的任务特定模型,导致了碎片化的管道,缺乏可解释性,并且没有统一理解RAS场景。视觉-语言模型(VLMs)提供了强大的零样本推理能力,但在幻觉、领域差距和任务间依赖建模方面存在困难。为了解决RAS场景理解缺乏统一数据的问题,我们引入了SurgCoTBench,这是第一个专注于RAS的推理基准,涵盖了14256个问答对,包含五个主要手术任务的帧级注释。基于SurgCoTBench,我们提出了SurgRAW,一种临床对齐的链式思考(CoT)驱动的多任务推理智能体工作流。SurgRAW采用分层推理工作流,其中协调者将手术场景理解分为两个推理流,并指导专门的智能体生成任务级推理,而高级智能体则捕捉工作流间的依赖性或临床验证输出。具体而言,我们提出了一种讨论机制,以确保任务特定的智能体能够协同工作并利用任务间的依赖性。同样,我们引入了检索增强生成模块,以丰富智能体的手术知识并缓解通用VLMs的领域差距。我们设计了基于手术领域的特定任务CoT提示,以确保临床对齐的推理、减少幻觉并增强可解释性。广泛的实验表明,SurgRAW超越了主流的VLMs和智能体系统,并且在准确率上比监督模型高出14.61%。数据集和代码可在https://github.com/jinlab-imvr/SurgRAW.git 获取。
Summary / 总结
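SurgRAW is a multi-agent workflow for robotic surgical video analysis driven by Chain-of-Thought reasoning. To address the lack of unified data for RAS scene understanding, the authors introduce SurgCoTBench, a benchmark with 14,256 QA pairs and frame-level annotations across five surgical tasks. SurgRAW employs a hierarchical reasoning workflow and task-specific CoT prompts to keep reasoning clinically aligned and reduce hallucinations. In zero-shot multi-task reasoning, it surpasses mainstream Vision-Language Models and agentic systems, and outperforms a supervised model by 14.61% in accuracy.
By introducing the reasoning-focused benchmark SurgCoTBench and the multi-agent workflow SurgRAW, the paper addresses the need for unified scene understanding in robotic-assisted surgery (RAS). SurgRAW uses hierarchical reasoning in which an orchestrator directs specialized agents to generate task-level reasoning while higher-level agents capture workflow interdependencies. The system includes a panel discussion mechanism and a retrieval-augmented generation module to strengthen collaboration and enrich surgical knowledge. Experiments show that SurgRAW surpasses mainstream VLMs and agentic systems, and outperforms a supervised model by 14.61% in accuracy.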
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
First: 2026-02-18T13:40:53+00:00 · Latest: 2026-02-18T13:40:53+00:00
Abstract
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
中文标题/摘要
标题:视觉自我精炼:一种基于像素的图表解析新范式
虽然大型多模态语言模型(LVLMs)在文本推理和自我修正方面表现出色,但这些优势对以视觉感知为中心的复杂任务,如图表解析,提供的帮助有限。现有模型在处理视觉密集型图表时常常遇到困难,导致数据遗漏、对齐错误和幻觉等问题。受人类在阅读复杂图表时使用手指作为“视觉锚点”以确保准确性的启发,我们提出了一种新的范式——视觉自我精炼(VSR)。VSR的核心思想是使模型能够生成像素级定位输出,可视化这些输出,并将这些可视化结果反馈给模型本身,使其能够直观地检查和修正潜在的视觉感知错误。我们通过提出ChartVSR在图表解析领域实例化了VSR范式。该模型将解析过程分解为两个阶段:在精炼阶段,它通过视觉反馈迭代确保所有数据点的像素级定位的准确性;在解码阶段,它使用这些验证过的定位作为精确的视觉锚点来解析最终的结构化数据。为了克服现有基准的局限性,我们还构建了ChartP-Bench,这是一个新的、极具挑战性的图表解析基准。我们的工作还强调了VSR作为一种通用的视觉反馈机制,为提高各种视觉为中心任务的准确性提供了有希望的新方向。
Summary / 总结
The paper proposes Visual Self-Refine (VSR), a paradigm in which a model generates pixel-level localization outputs, visualizes them, and feeds the visualizations back to itself to inspect and correct its own visual perception errors. The paradigm is instantiated for chart parsing as ChartVSR, which decomposes parsing into a Refine Stage that iteratively verifies the pixel-level localization of every data point and a Decode Stage that uses the verified localizations as visual anchors to parse the final structured data. The approach targets visually dense charts, where existing models tend to omit, misalign, or hallucinate data. The authors also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing.
本文针对大型视觉语言模型(LVLMs)在复杂视觉任务如图表解析中的局限性,如数据遗漏和错位等问题,提出了一种新的视觉自我校正(VSR)范式,使模型能够生成像素级定位输出,通过可视化反馈纠正潜在错误。提出的ChartVSR模型将解析过程分解为精炼阶段和解码阶段,确保像素级定位的准确性,并使用这些验证的定位作为视觉锚点进行结构化数据解析。此外,本文还引入了ChartP-Bench,这是一种新的基准测试,用于更有效地评估图表解析模型。