When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
Authors: Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, Mingyu Ding
First: 2026-02-19T18:59:20+00:00 · Latest: 2026-02-19T18:59:20+00:00
Comments: Website: https://vla-va.github.io/
Abstract
Vision-Language-Action models (VLAs) promise to ground language instructions in robot control, yet in practice they often fail to follow language faithfully. When presented with instructions that lack strong scene-specific supervision, VLAs suffer from counterfactual failures: they act on visual shortcuts induced by dataset biases, repeatedly executing well-learned behaviors and selecting objects frequently seen during training, regardless of language intent. To study this failure mode systematically, we introduce LIBERO-CF, the first counterfactual benchmark for VLAs, which evaluates language-following capability by assigning alternative instructions under visually plausible LIBERO layouts. Our evaluation reveals that counterfactual failures are prevalent yet underexplored across state-of-the-art VLAs. We propose Counterfactual Action Guidance (CAG), a simple yet effective dual-branch inference scheme that explicitly regularizes language conditioning in VLAs. CAG combines a standard VLA policy with a language-unconditioned Vision-Action (VA) module, enabling counterfactual comparison during action selection. This design reduces reliance on visual shortcuts, improves robustness on under-observed tasks, and requires neither additional demonstrations nor modifications to existing architectures or pretrained models. Extensive experiments demonstrate plug-and-play integration across diverse VLAs and consistent improvements. For example, on LIBERO-CF, CAG improves $\pi_{0.5}$ by 9.7% in language-following accuracy and 3.6% in task success on under-observed tasks with a training-free strategy, with further gains of 15.5% and 8.5%, respectively, when paired with a VA model. In real-world evaluations, CAG reduces counterfactual failures by 9.4% and improves task success by 17.2% on average.
Summary
The research addresses counterfactual failures in Vision-Language-Action models (VLAs), where models act on visual biases rather than language instructions. The study introduces LIBERO-CF, a benchmark that evaluates language-following capability by assigning alternative instructions under visually plausible scenarios. The key finding is that CAG, a dual-branch inference scheme, effectively mitigates these failures by reducing reliance on visual shortcuts and improving robustness on under-observed tasks, achieving consistent gains in language-following accuracy and task success across diverse VLAs in both simulation and real-world evaluations.
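The dual-branch scheme is reminiscent of classifier-free guidance applied to action prediction. A minimal sketch, assuming a linear combination rule (the function name, the extrapolation form, and the `weight` parameter are illustrative, not the paper's exact formulation):

```python
import numpy as np

def counterfactual_guidance(a_vla, a_va, weight=1.5):
    """Toy guidance rule: push the predicted action away from the
    language-unconditioned (VA) branch and toward the language-
    conditioned (VLA) branch. Hypothetical; the paper's actual
    combination rule may differ."""
    a_vla = np.asarray(a_vla, dtype=float)
    a_va = np.asarray(a_va, dtype=float)
    return a_va + weight * (a_vla - a_va)

# With weight=1.0 the guidance reduces to the plain VLA action.
a = counterfactual_guidance([0.2, 0.4], [0.1, 0.4], weight=1.0)
```

Weights above 1.0 amplify the action components that depend on the language input, which is one plausible way to suppress vision-only shortcuts.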
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
Authors: Xiaohan Zhao, Zhaoyi Li, Yaxin Luo, Jiacheng Cui, Zhiqiang Shen
First: 2026-02-19T18:54:32+00:00 · Latest: 2026-02-19T18:54:32+00:00
Comments: Code at: https://github.com/vila-lab/M-Attack-V2
Abstract
Black-box adversarial attacks on Large Vision-Language Models (LVLMs) are challenging due to missing gradients and complex multimodal boundaries. While prior state-of-the-art transfer-based approaches like M-Attack perform well using local crop-level matching between source and target images, we find this induces high-variance, nearly orthogonal gradients across iterations, violating coherent local alignment and destabilizing optimization. We attribute this to (i) ViT translation sensitivity that yields spike-like gradients and (ii) structural asymmetry between source and target crops. We reformulate local matching as an asymmetric expectation over source transformations and target semantics, and build a gradient-denoising upgrade to M-Attack. On the source side, Multi-Crop Alignment (MCA) averages gradients from multiple independently sampled local views per iteration to reduce variance. On the target side, Auxiliary Target Alignment (ATA) replaces aggressive target augmentation with a small auxiliary set from a semantically correlated distribution, producing a smoother, lower-variance target manifold. We further reinterpret momentum as Patch Momentum, replaying historical crop gradients; combined with a refined patch-size ensemble (PE+), this strengthens transferable directions. Together these modules form M-Attack-V2, a simple, modular enhancement over M-Attack that substantially improves transfer-based black-box attacks on frontier LVLMs: boosting success rates on Claude-4.0 from 8% to 30%, Gemini-2.5-Pro from 83% to 97%, and GPT-5 from 98% to 100%, outperforming prior black-box LVLM attacks. Code and data are publicly available at: https://github.com/vila-lab/M-Attack-V2.
Summary
This paper addresses the challenges of black-box adversarial attacks on Large Vision-Language Models (LVLMs) by reformulating local matching and introducing gradient-denoising techniques. The authors find that prior methods induce high-variance gradients, leading to unstable optimization. They propose M-Attack-V2, which includes Multi-Crop Alignment (MCA) and Auxiliary Target Alignment (ATA) to reduce gradient variance and produce a smoother target manifold. The results show significant improvements in attack success rates, with boosts from 8% to 30% on Claude-4.0, from 83% to 97% on Gemini-2.5-Pro, and from 98% to 100% on GPT-5, outperforming previous black-box LVLM attacks.
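Of the proposed modules, Multi-Crop Alignment is the most mechanical: average the matching-loss gradients from several independently sampled local views each iteration to reduce variance. A toy sketch with a stand-in gradient function (the crop sampling scheme and the placeholder loss gradient are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def crop_grad(image, crop):
    """Stand-in for the gradient of a crop-level matching loss:
    here simply the crop's deviation from a target value, scattered
    back into a full-size gradient map. Purely illustrative."""
    g = np.zeros_like(image)
    y, x, s = crop
    g[y:y + s, x:x + s] = image[y:y + s, x:x + s] - 1.0
    return g

def mca_step(image, n_crops=8, size=4):
    """Multi-Crop Alignment (sketch): average gradients from several
    independently sampled local views per iteration."""
    h, w = image.shape
    grads = []
    for _ in range(n_crops):
        y = int(rng.integers(0, h - size + 1))
        x = int(rng.integers(0, w - size + 1))
        grads.append(crop_grad(image, (y, x, size)))
    return np.mean(grads, axis=0)

g = mca_step(np.ones((16, 16)))
```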
IntRec: Intent-based Retrieval with Contrastive Refinement
Authors: Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, Yue Lu
First: 2026-02-19T18:50:53+00:00 · Latest: 2026-02-19T18:50:53+00:00
Abstract
Retrieving user-specified objects from complex scenes remains a challenging task, especially when queries are ambiguous or involve multiple similar objects. Existing open-vocabulary detectors operate in a one-shot manner, lacking the ability to refine predictions based on user feedback. To address this, we propose IntRec, an interactive object retrieval framework that refines predictions based on user feedback. At its core is an Intent State (IS) that maintains dual memory sets for positive anchors (confirmed cues) and negative constraints (rejected hypotheses). A contrastive alignment function ranks candidate objects by maximizing similarity to positive cues while penalizing rejected ones, enabling fine-grained disambiguation in cluttered scenes. Our interactive framework provides substantial improvements in retrieval accuracy without additional supervision. On LVIS, IntRec achieves 35.4 AP, outperforming OVMR, CoDet, and CAKE by +2.3, +3.7, and +0.5, respectively. On the challenging LVIS-Ambiguous benchmark, it improves performance by +7.9 AP over its one-shot baseline after a single corrective feedback, with less than 30 ms of added latency per interaction.
Summary
IntRec is an interactive object retrieval framework that refines predictions based on user feedback, addressing the challenge of ambiguous queries in complex scenes. It uses an Intent State maintaining positive anchors and negative constraints, and a contrastive alignment function to rank candidates. On LVIS, IntRec outperforms existing methods by +2.3 to +3.7 AP, and shows significant improvement (+7.9 AP) on the LVIS-Ambiguous benchmark with minimal latency.
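The contrastive alignment step can be sketched as ranking candidates by similarity to confirmed cues minus a penalty for rejected hypotheses. A minimal sketch, assuming cosine similarity, mean/max pooling, and a `margin` weight (none of which are confirmed by the abstract):

```python
import numpy as np

def intent_score(candidates, positives, negatives, margin=1.0):
    """Contrastive alignment (sketch): score each candidate embedding
    by mean similarity to positive anchors minus a penalty for its
    closest rejected hypothesis."""
    def cos(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T
    pos = cos(candidates, positives).mean(axis=1)
    neg = cos(candidates, negatives).max(axis=1) if len(negatives) else 0.0
    return pos - margin * neg

cands = np.array([[1.0, 0.0], [0.0, 1.0]])
scores = intent_score(cands,
                      positives=np.array([[1.0, 0.1]]),   # confirmed cue
                      negatives=np.array([[0.0, 1.0]]))   # rejected hypothesis
best = int(np.argmax(scores))
```

After each round of feedback, the positive and negative memory sets grow and the candidates are simply re-scored, which is consistent with the sub-30 ms per-interaction latency the paper reports.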
Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
Authors: Obaidullah Zaland, Zulfiqar Ahmad Khan, Monowar Bhuyan
First: 2026-02-19T18:44:23+00:00 · Latest: 2026-02-19T18:44:23+00:00
Comments: Accepted for publication in the IEEE International Conference on Big Data (IEEE BigData) 2025
Abstract
Modern big-data systems generate massive, heterogeneous, and geographically dispersed streams that are large-scale and privacy-sensitive, making centralization challenging. While federated learning (FL) provides a privacy-enhancing training mechanism, it assumes a static data flow and learns a collaborative model over multiple rounds, making learning from incremental data challenging in limited-communication scenarios. This paper presents One-Shot Incremental Federated Learning (OSI-FL), the first FL framework that addresses the dual challenges of communication overhead and catastrophic forgetting. In a single communication round, each client sends category-specific embeddings derived by a frozen vision-language model (VLM); a pre-trained diffusion model at the server then uses these embeddings to synthesize new data resembling the client's data distribution, and the synthesized samples are used for training on the server. However, two challenges persist: i) incrementally arriving tasks require retraining the global model, and ii) retraining on future tasks introduces catastrophic forgetting. To this end, we augment training with Selective Sample Retention (SSR), which identifies and retains the top-p most informative samples per category-task pair based on sample loss. SSR bounds forgetting by ensuring that representative retained samples are incorporated into training in later iterations. Experimental results indicate that OSI-FL outperforms baselines, including traditional and one-shot FL approaches, in both class-incremental and domain-incremental scenarios across three benchmark datasets.
Summary
This paper addresses the challenges of communication overhead and catastrophic forgetting in federated learning with incremental data. It introduces One-Shot Incremental Federated Learning (OSI-FL), which communicates category-specific embeddings from clients and uses a pre-trained diffusion model to synthesize new data. To mitigate catastrophic forgetting, the method incorporates Selective Sample Retention (SSR) to retain top-p most informative samples. Experimental results show that OSI-FL outperforms traditional and one-shot FL approaches in both class-incremental and domain-incremental scenarios across three benchmark datasets.
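Selective Sample Retention reduces, per (category, task) pair, to a top-p selection ranked by sample loss. A minimal sketch (ranking by highest loss as "most informative" is an assumption; the abstract does not specify the direction):

```python
from collections import defaultdict

def selective_sample_retention(samples, top_p=0.5):
    """SSR (sketch): for each (category, task) pair, keep the top-p
    fraction of samples ranked by loss, treating high-loss samples
    as the most informative ones (an assumption)."""
    buckets = defaultdict(list)
    for s in samples:
        buckets[(s["category"], s["task"])].append(s)
    retained = []
    for group in buckets.values():
        group.sort(key=lambda s: s["loss"], reverse=True)
        k = max(1, int(len(group) * top_p))  # keep at least one sample
        retained.extend(group[:k])
    return retained

data = [{"category": "cat", "task": 0, "loss": l} for l in (0.9, 0.1, 0.5, 0.3)]
kept = selective_sample_retention(data, top_p=0.5)
```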
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Authors: Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Venue: NeurIPS 2025
First: 2025-05-05T17:47:42+00:00 · Latest: 2026-02-19T18:32:53+00:00
Comments: This work was accepted and presented at NeurIPS 2025. Code is available at https://github.com/mts-ai/replaceme Reviews at OpenReview: https://openreview.net/forum?id=zEj1FSYCRn NeurIPS 2025 Proceedings: https://openreview.net/pdf?id=zEj1FSYCRn
Abstract
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps, resulting in minimal computational overhead. We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe
Summary
ReplaceMe is a training-free depth pruning method that replaces transformer blocks with linear operations, maintaining high performance with low compression ratios. Unlike conventional pruning methods that require additional training, ReplaceMe uses a small calibration dataset to estimate a linear transformation that approximates the pruned blocks. Experiments show that ReplaceMe outperforms other training-free approaches and remains competitive with state-of-the-art pruning methods, achieving up to 25% pruning with minimal computational overhead on large language models without any training or healing steps.
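The core estimation step admits a compact sketch: given calibration activations entering and leaving the blocks to be pruned, fit the replacement linear map by least squares. This is illustrative only; ReplaceMe's actual estimator and objective may differ in details such as regularization:

```python
import numpy as np

def estimate_replacement(x_in, y_out):
    """Estimate a linear map T approximating a pruned span of blocks,
    min_T ||x_in @ T - y_out||_F, from calibration activations.
    x_in: (n_tokens, d) inputs to the pruned blocks.
    y_out: (n_tokens, d) outputs of the pruned blocks."""
    t, *_ = np.linalg.lstsq(x_in, y_out, rcond=None)
    return t

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))          # toy calibration activations
true_t = rng.normal(size=(8, 8))       # ground-truth map (exactly linear here)
t_hat = estimate_replacement(x, x @ true_t)
```

Because the fitted map is itself linear, it can be folded into the adjacent weight matrices, which is why no extra parameters remain at inference time.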
Boosting Medical Visual Understanding From Multi-Granular Language Learning
Authors: Zihan Li, Yiqing Wang, Sina Farsiu, Paul Kinahan
Venue: ICLR 2026
First: 2025-11-20T00:24:26+00:00 · Latest: 2026-02-19T18:27:29+00:00
Comments: Accepted by ICLR 2026. 40 pages
Abstract
Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple high-level labels (e.g., disease categories) across different annotation granularities (e.g., diagnostic description, clinical explanation). To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. MGLL leverages structured multi-label supervision, integrates textual descriptions across granularities, and introduces soft-label supervision with point-wise constraints to enhance alignment. MGLL employs smooth Kullback-Leibler (KL) divergence to ensure cross-granularity consistency while maintaining computational efficiency as a plug-and-play module for vision-language models. Pretrained on our constructed large-scale multi-granular datasets and evaluated across multiple datasets, MGLL outperforms other state-of-the-art methods in downstream tasks. The code is available at https://github.com/HUANGLIZI/MGLL.
Summary
The research aims to improve medical visual understanding by addressing the limitations of single-granularity alignment in existing methods like CLIP. The proposed Multi-Granular Language Learning (MGLL) framework enhances multi-label and cross-granularity alignment through structured multi-label supervision, integrated textual descriptions, and soft-label supervision with point-wise constraints. MGLL uses smooth Kullback-Leibler divergence to ensure consistency across granularities while maintaining computational efficiency. Experiments show that MGLL outperforms other state-of-the-art methods in downstream tasks.
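The cross-granularity consistency term can be sketched as a KL divergence between smoothed prediction distributions from two annotation granularities. Temperature smoothing as the "smooth" mechanism is an assumption based on the abstract, not a confirmed detail:

```python
import numpy as np

def smooth_kl(p_logits, q_logits, tau=2.0, eps=1e-9):
    """Cross-granularity consistency (sketch): KL divergence between
    temperature-smoothed prediction distributions from two
    granularities (e.g. diagnostic vs. clinical descriptions)."""
    def softened(z):
        z = np.asarray(z, dtype=float) / tau
        z = z - z.max()            # numerical stability
        e = np.exp(z)
        return e / e.sum()
    p, q = softened(p_logits), softened(q_logits)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

d = smooth_kl([2.0, 0.0, -1.0], [2.0, 0.0, -1.0])  # identical granularities
```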
AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
Authors: Lance Ying, Ryan Truong, Prafull Sharma, Kaiya Ivy Zhao, Nathan Cloos, Kelsey R. Allen, Thomas L. Griffiths, Katherine M. Collins, José Hernández-Orallo, Phillip Isola, Samuel J. Gershman, Joshua B. Tenenbaum
First: 2026-02-19T18:17:25+00:00 · Latest: 2026-02-19T18:17:25+00:00
Comments: 29 pages, 14 figures
Abstract
Rigorously evaluating machine intelligence against the broad spectrum of human general intelligence has become increasingly important and challenging in this era of rapid technological advance. Conventional AI benchmarks typically assess only narrow capabilities in a limited range of human activity. Most are also static, quickly saturating as developers explicitly or implicitly optimize for them. We propose that a more promising way to evaluate human-like general intelligence in AI systems is through a particularly strong form of general game playing: studying how and how well they play and learn to play all conceivable human games, in comparison to human players with the same level of experience, time, or other resources. We define a "human game" to be a game designed by humans for humans, and argue for the evaluative suitability of this space of all such games people can imagine and enjoy -- the "Multiverse of Human Games". Taking a first step towards this vision, we introduce the AI GameStore, a scalable and open-ended platform that uses LLMs with humans-in-the-loop to synthesize new representative human games, by automatically sourcing and adapting standardized and containerized variants of game environments from popular human digital gaming platforms. As a proof of concept, we generated 100 such games based on the top charts of Apple App Store and Steam, and evaluated seven frontier vision-language models (VLMs) on short episodes of play. The best models achieved less than 10% of the human average score on the majority of the games, and especially struggled with games that challenge world-model learning, memory and planning. We conclude with a set of next steps for building out the AI GameStore as a practical way to measure and drive progress toward human-like general intelligence in machines.
Summary
The research aims to evaluate machine intelligence more comprehensively by comparing it to human general intelligence through a wide range of games designed for humans. The method involves creating a scalable platform, AI GameStore, using LLMs and human input to generate new games from popular digital platforms. Key findings show that advanced models perform poorly, achieving less than 10% of human scores, especially in games that test world-model learning, memory, and planning.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Authors: Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis
First: 2025-09-26T08:55:09+00:00 · Latest: 2026-02-19T17:30:28+00:00
Abstract
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace. This strategy is computationally efficient but the underlying constraint can be overly rigid for heterogeneous projection weights and may incur avoidable accuracy loss. We propose CoSpaDi (Compression via Sparse Dictionary Learning), a training-free framework that replaces low-rank factorization with a structured sparse decomposition in which each weight matrix is represented as a dense dictionary multiplied by a column-sparse coefficient matrix. This yields a union-of-subspaces model: the columns of the weight matrix are represented as linear combinations of different subsets of dictionary atoms, improving expressiveness at a fixed parameter budget. CoSpaDi is calibration-guided: using a small calibration set, we optimize the factorization to minimize functional reconstruction error of layer outputs rather than weight-space error. An activation-derived Gram orthonormalization reformulates this data-aware objective into a standard dictionary learning problem on transformed weights, and we support both per-layer compression and cross-layer dictionary sharing within groups of similar projections. Across Llama and Qwen model families, CoSpaDi consistently improves the accuracy-compression and perplexity-compression trade-offs over state-of-the-art SVD-based baselines and strong structured pruning baselines at 20-40% compression ratios. The resulting structured sparsity enables sparse-dense computation and integrates with post-training quantization of the sparse coefficients.
Summary
CoSpaDi is a training-free compression framework for large language models that uses a structured sparse decomposition to represent weight matrices, improving expressiveness and accuracy at a fixed parameter budget. It optimizes the factorization using a small calibration set to minimize functional reconstruction error of layer outputs, and supports per-layer and cross-layer dictionary sharing. Experiments show that CoSpaDi outperforms state-of-the-art SVD-based and structured pruning baselines across Llama and Qwen model families at 20-40% compression ratios, while enabling efficient sparse-dense computation and post-training quantization of the sparse coefficients.
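The structured sparse decomposition W ≈ DC with a column-sparse C can be sketched with a simple alternating scheme: sparse-code each column of W against the current dictionary, then update the dictionary by least squares. This toy version works purely in weight space; CoSpaDi's calibration-guided objective additionally weights the fit by activations, which is elided here:

```python
import numpy as np

def sparse_code_column(d, w_col, k):
    """Select the k dictionary atoms most correlated with the column,
    then least-squares fit on that support (a one-shot OMP-like step)."""
    idx = np.argsort(-np.abs(d.T @ w_col))[:k]
    coef, *_ = np.linalg.lstsq(d[:, idx], w_col, rcond=None)
    c = np.zeros(d.shape[1])
    c[idx] = coef
    return c

def cospadi_sketch(w, n_atoms=16, k=4, iters=5, seed=0):
    """Toy structured sparse factorization W ≈ D @ C with at most k
    nonzeros per column of C, alternating sparse coding and a
    dictionary least-squares update. Illustrative only."""
    rng = np.random.default_rng(seed)
    d = rng.normal(size=(w.shape[0], n_atoms))
    c = None
    for _ in range(iters):
        c = np.stack([sparse_code_column(d, w[:, j], k)
                      for j in range(w.shape[1])], axis=1)
        # Dictionary update: min_D ||W - D C||_F given C.
        d_t, *_ = np.linalg.lstsq(c.T, w.T, rcond=None)
        d = d_t.T
    return d, c

w = np.random.default_rng(1).normal(size=(12, 10))
d, c = cospadi_sketch(w)
err = np.linalg.norm(w - d @ c) / np.linalg.norm(w)
```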
LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
Authors: Behzad Bozorgtabar, Dwarikanath Mahapatra, Sudipta Roy, Muzammal Naseer, Imran Razzak, Zongyuan Ge
First: 2026-02-19T16:45:38+00:00 · Latest: 2026-02-19T16:45:38+00:00
Comments: 18 pages, 6 figures, 4 tables
Abstract
Medical vision-language models (VLMs) are strong zero-shot recognizers for medical imaging, but their reliability under domain shift hinges on calibrated uncertainty with guarantees. Split conformal prediction (SCP) offers finite-sample coverage, yet prediction sets often become large (low efficiency) and class-wise coverage becomes unbalanced, i.e., a high class-conditioned coverage gap (CCV), especially in few-shot, imbalanced regimes; moreover, naively adapting to calibration labels breaks exchangeability and voids guarantees. We propose LATA (Laplacian-Assisted Transductive Adaptation), a training- and label-free refinement that operates on the joint calibration and test pool by smoothing zero-shot probabilities over an image-image k-NN graph using a small number of CCCP mean-field updates, preserving SCP validity via a deterministic transform. We further introduce a failure-aware conformal score that plugs into the vision-language uncertainty (ViLU) framework, providing instance-level difficulty and label plausibility to improve prediction-set efficiency and class-wise balance at fixed coverage. LATA is black-box (no VLM updates), compute-light (windowed transduction, no backprop), and includes an optional prior knob that can run strictly label-free or, if desired, in a label-informed variant that uses calibration marginals once. Across three medical VLMs and nine downstream tasks, LATA consistently reduces set size and CCV while matching or tightening target coverage, outperforming prior transductive baselines and narrowing the gap to label-using methods, while using far less compute. Comprehensive ablations and qualitative analyses show that LATA sharpens zero-shot predictions without compromising exchangeability.
Summary
LATA (Laplacian-Assisted Transductive Adaptation) is a training- and label-free method that refines the joint calibration and test pool of medical vision-language models by smoothing zero-shot probabilities using a k-NN graph and deterministic transform, while preserving the validity of split conformal prediction. It introduces a failure-aware conformal score to improve prediction set efficiency and class-wise balance. LATA consistently reduces set size and class-conditioned coverage gap while matching or tightening target coverage across three medical VLMs and nine downstream tasks, outperforming prior transductive baselines with less compute.
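The transductive refinement can be sketched as mean-field smoothing of zero-shot probabilities over an image-image k-NN graph. The update rule below is a simplified stand-in for LATA's CCCP updates and omits the windowing and conformal machinery:

```python
import numpy as np

def laplacian_smooth(probs, feats, k=3, alpha=0.5, steps=5):
    """Mean-field smoothing (sketch): mix each sample's own zero-shot
    log-probabilities with its k nearest neighbors' current beliefs,
    renormalizing at every step. probs: (n, C); feats: (n, d)."""
    n = len(probs)
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-edges
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((n, n))
    for i in range(n):
        adj[i, nbrs[i]] = 1.0 / k            # row-normalized k-NN graph
    q = np.array(probs, dtype=float)
    logp = np.log(np.clip(q, 1e-9, None))
    for _ in range(steps):
        z = logp + alpha * (adj @ q)         # unary term + neighbor message
        z = z - z.max(axis=1, keepdims=True)
        q = np.exp(z)
        q = q / q.sum(axis=1, keepdims=True)
    return q

# Two visual clusters; the second sample is ambiguous but sits next
# to a confident class-0 neighbor, so smoothing pulls it toward class 0.
feats = np.array([[1.0, 0.0], [0.99, 0.14], [0.0, 1.0], [0.14, 0.99]])
probs = np.array([[0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.1, 0.9]])
q = laplacian_smooth(probs, feats, k=1)
```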
Visual Model Checking: Graph-Based Inference of Visual Routines for Image Retrieval
Authors: Adrià Molina, Oriol Ramos Terrades, Josep Lladós
First: 2026-02-19T14:10:55+00:00 · Latest: 2026-02-19T14:10:55+00:00
Comments: Submitted for ICPR Review
Abstract
Information retrieval lies at the foundation of the modern digital industry. While natural language search has seen dramatic progress in recent years largely driven by embedding-based models and large-scale pretraining, the field still faces significant challenges. Specifically, queries that involve complex relationships, object compositions, or precise constraints such as identities, counts and proportions often remain unresolved or unreliable within current frameworks. In this paper, we propose a novel framework that integrates formal verification into deep learning-based image retrieval through a synergistic combination of graph-based verification methods and neural code generation. Our approach aims to support open-vocabulary natural language queries while producing results that are both trustworthy and verifiable. By grounding retrieval results in a system of formal reasoning, we move beyond the ambiguity and approximation that often characterize vector representations. Instead of accepting uncertainty as a given, our framework explicitly verifies each atomic truth in the user query against the retrieved content. This allows us to not only return matching results, but also to identify and mark which specific constraints are satisfied and which remain unmet, thereby offering a more transparent and accountable retrieval process while boosting the results of the most popular embedding-based approaches.
Summary
This paper addresses the limitations of current image retrieval systems, particularly in handling complex queries involving object compositions and precise constraints. It introduces a framework that combines graph-based verification methods with neural code generation to enhance the trustworthiness and verifiability of retrieval results. The approach supports open-vocabulary natural language queries and explicitly verifies each atomic truth in the user query against the retrieved content, providing a more transparent and accountable retrieval process.
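Stripped of the neural code generation that compiles queries, the verification idea reduces to checking atomic predicates against a retrieved image's scene graph and reporting which are satisfied. The scene-graph format and the predicates below are hypothetical illustrations:

```python
def verify_query(scene, constraints):
    """Check each atomic constraint of a query against a scene graph
    and report per-constraint satisfaction (sketch). Rather than a
    single match score, the retrieval result carries an explicit
    satisfied/unmet verdict for every atomic truth."""
    return {name: bool(pred(scene)) for name, pred in constraints.items()}

# Hypothetical scene graph: detected objects plus pairwise relations.
scene = {
    "objects": [{"cls": "dog"}, {"cls": "dog"}, {"cls": "ball"}],
    "relations": [("dog", "left_of", "ball")],
}
report = verify_query(scene, {
    "two_dogs": lambda s: sum(o["cls"] == "dog" for o in s["objects"]) == 2,
    "dog_left_of_ball": lambda s: ("dog", "left_of", "ball") in s["relations"],
    "has_cat": lambda s: any(o["cls"] == "cat" for o in s["objects"]),
})
```

In the paper's framing, the predicates would be generated from the natural language query rather than written by hand, but the per-constraint report is what makes the retrieval transparent and accountable.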
本文针对当前图像检索系统在处理涉及复杂对象组合和精确约束的查询方面的局限性,提出了一种结合图基验证方法和神经代码生成的框架,以增强检索结果的可信度和可验证性。该方法支持开放词汇的自然语言查询,并明确验证用户查询中的每个原子事实与检索内容之间的关系,从而提供一个更透明和负责任的检索过程。
Selective Training for Large Vision Language Models via Visual Information Gain
Authors: Seulbi Lee, Sangheum Hwang
First: 2026-02-19T09:12:21+00:00 · Latest: 2026-02-19T09:12:21+00:00
Abstract
Large Vision Language Models (LVLMs) have achieved remarkable progress, yet they often suffer from language bias, producing answers without relying on visual evidence. While prior work attempts to mitigate this issue through decoding strategies, architectural modifications, or curated instruction data, these approaches typically lack a quantitative measure of how much individual training samples or tokens actually benefit from the image. In this work, we introduce Visual Information Gain (VIG), a perplexity-based metric that measures the reduction in prediction uncertainty provided by visual input. VIG enables fine-grained analysis at both sample and token levels, effectively highlighting visually grounded elements such as colors, spatial relations, and attributes. Leveraging this, we propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens. This approach improves visual grounding and mitigates language bias, achieving superior performance with significantly reduced supervision by focusing exclusively on visually informative samples and tokens.
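A perplexity-based gain of this kind can be computed directly from per-token log-probabilities of the ground-truth answer under two conditions (with and without the image). The sketch below assumes those two log-probability arrays are already available from the model; the aggregation (mean over tokens) is an illustrative choice.

```python
import numpy as np

def visual_information_gain(logp_with_image, logp_text_only):
    """Token-level VIG sketch: reduction in surprisal of the ground-truth
    answer tokens when the image is provided. Positive values mean the image
    reduced prediction uncertainty for that token."""
    logp_with_image = np.asarray(logp_with_image, dtype=float)
    logp_text_only = np.asarray(logp_text_only, dtype=float)
    token_vig = logp_with_image - logp_text_only
    sample_vig = float(token_vig.mean())   # one scalar per training sample
    return token_vig, sample_vig

def select_high_vig(samples, k):
    """Selective-training step: keep the k samples whose answers benefit most
    from the visual input (each sample dict carries a precomputed 'vig')."""
    return sorted(samples, key=lambda s: s["vig"], reverse=True)[:k]
```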
Summary
This study addresses the issue of language bias in large vision language models by introducing Visual Information Gain (VIG), a metric that quantifies the reduction in prediction uncertainty from visual input. The authors propose a VIG-guided selective training scheme that prioritizes high-VIG samples and tokens, leading to improved visual grounding and reduced language bias with less supervision.
Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
Authors: Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova
First: 2026-02-06T09:32:10+00:00 · Latest: 2026-02-19T08:33:25+00:00
Comments: 17 pages, 11 figures
Abstract
The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
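An attract/repel objective of the kind described can be sketched in a few lines. This is a hedged reading of the multi-modal guidance loss, not the paper's exact formulation: it pulls the forged-image embedding toward a text-derived "authentic" anchor in cosine space while pushing it away from a "forgery" anchor.

```python
import numpy as np

def cosine(u, v):
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def guidance_loss(image_emb, authentic_anchor, forgery_anchor, margin=0.0):
    """Illustrative attract/repel loss in a shared VLM embedding space:
    low when the embedding aligns with the authentic anchor and is at least
    `margin` away (in cosine similarity) from the forgery anchor."""
    attract = 1.0 - cosine(image_emb, authentic_anchor)
    repel = max(0.0, cosine(image_emb, forgery_anchor) - margin)
    return attract + repel
```

Minimizing this over perturbations of the input image (e.g., by gradient descent through the VLM encoder) is the step this sketch leaves out.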
Summary
The paper addresses the challenge of ensuring the robustness of AI-generated content (AIGC) detectors against anti-forensics attacks. It introduces ForgeryEraser, a framework that performs universal anti-forensics attacks without access to the target detectors by leveraging a multi-modal guidance loss to manipulate image embeddings within the feature space of Vision-Language Models (VLMs). The experiments show that ForgeryEraser significantly degrades the performance of advanced AIGC detectors and causes explainable forensic models to generate misleading explanations for forged images.
B$^3$-Seg: Camera-Free, Training-Free 3DGS Segmentation via Analytic EIG and Beta-Bernoulli Bayesian Updates
Authors: Hiromichi Kamata, Samuel Arthur Munro, Fuminori Homma
First: 2026-02-19T07:14:52+00:00 · Latest: 2026-02-19T07:14:52+00:00
Comments: Project page: https://sony.github.io/B3-Seg-project/
Abstract
Interactive 3D Gaussian Splatting (3DGS) segmentation is essential for real-time editing of pre-reconstructed assets in film and game production. However, existing methods rely on predefined camera viewpoints, ground-truth labels, or costly retraining, making them impractical for low-latency use. We propose B$^3$-Seg (Beta-Bernoulli Bayesian Segmentation for 3DGS), a fast and theoretically grounded method for open-vocabulary 3DGS segmentation under camera-free and training-free conditions. Our approach reformulates segmentation as sequential Beta-Bernoulli Bayesian updates and actively selects the next view via analytic Expected Information Gain (EIG). This Bayesian formulation guarantees the adaptive monotonicity and submodularity of EIG, which produces a greedy $(1{-}1/e)$ approximation to the optimal view sampling policy. Experiments on multiple datasets show that B$^3$-Seg achieves results competitive with high-cost supervised methods while completing end-to-end segmentation within a few seconds. The results demonstrate that B$^3$-Seg enables practical, interactive 3DGS segmentation with provable information efficiency.
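The Beta-Bernoulli machinery above can be sketched concretely. The paper derives the EIG analytically; this illustration approximates the same quantity, the mutual information between one more Bernoulli observation and the Beta-distributed parameter, by grid quadrature, and uses a toy "which view to query next" loop. All interfaces here are assumptions for the example.

```python
import numpy as np

def beta_update(alpha, beta, observed_inside):
    """Conjugate update: a Bernoulli observation ('does this view support the
    queried label?') updates the Beta posterior in closed form."""
    return (alpha + 1, beta) if observed_inside else (alpha, beta + 1)

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def expected_information_gain(alpha, beta, grid=4096):
    """EIG of one more Bernoulli draw under Beta(alpha, beta):
    I = H(E[theta]) - E[H(theta)], approximated here by quadrature."""
    theta = (np.arange(grid) + 0.5) / grid
    w = theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
    w /= w.sum()
    mean = float((w * theta).sum())
    return bernoulli_entropy(mean) - float((w * bernoulli_entropy(theta)).sum())

def select_next_view(view_posteriors):
    """Greedy step: query the view whose next observation is most informative."""
    return max(view_posteriors,
               key=lambda v: expected_information_gain(*view_posteriors[v]))
```

The greedy policy is exactly where the abstract's submodularity argument bites: each step picks the maximum-EIG view, and submodularity yields the $(1{-}1/e)$ guarantee.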
Summary
B$^3$-Seg is a method for camera-free and training-free 3DGS segmentation that uses Beta-Bernoulli Bayesian updates and analytic Expected Information Gain to select views. It achieves competitive results comparable to high-cost supervised methods and operates within a few seconds, enabling practical, interactive 3DGS segmentation with provable information efficiency.
VFace: A Training-Free Approach for Diffusion-Based Video Face Swapping
Authors: Sanoojan Baliah, Yohan Abeysinghe, Rusiru Thushara, Khan Muhammad, Abhinav Dhall, Karthik Nandakumar, Muhammad Haris Khan
Venue: WACV 2026
First: 2026-02-08T06:13:19+00:00 · Latest: 2026-02-19T03:41:54+00:00
Comments: Accepted at WACV 2026
Abstract
We present a training-free, plug-and-play method, namely VFace, for high-quality face swapping in videos. It can be seamlessly integrated with image-based face swapping approaches built on diffusion models. First, we introduce a Frequency Spectrum Attention Interpolation technique to facilitate generation while keeping key identity characteristics intact. Second, we achieve Target Structure Guidance via plug-and-play attention injection to better align the structural features of the target frame with the generated output. Third, we present a Flow-Guided Attention Temporal Smoothening mechanism that enforces spatiotemporal coherence, reducing the temporal inconsistencies typically encountered in frame-wise generation without modifying the underlying diffusion model. Our method requires no additional training or video-specific fine-tuning. Extensive experiments show that our method significantly enhances temporal consistency and visual fidelity, offering a practical and modular solution for video-based face swapping. Our code is available at https://github.com/Sanoojan/VFace.
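Frequency-spectrum interpolation can be illustrated on plain 2-D arrays: blend the low-frequency band (coarse identity/structure) between two feature maps while keeping the target's high frequencies. The cutoff, weighting, and the idea of applying it to attention maps are assumptions for this sketch; VFace's exact bands and weights may differ.

```python
import numpy as np

def frequency_interpolate(src, tgt, cutoff_frac=0.25, low_weight=0.8):
    """Blend low frequencies of `src` into `tgt` via the 2-D FFT, leaving
    the target's high frequencies untouched (illustrative sketch)."""
    assert src.shape == tgt.shape
    Fs, Ft = np.fft.fft2(src), np.fft.fft2(tgt)
    h, w = src.shape
    # Low frequencies sit at the corners of the unshifted FFT layout.
    fy = np.minimum(np.arange(h), h - np.arange(h))[:, None] / h
    fx = np.minimum(np.arange(w), w - np.arange(w))[None, :] / w
    low = (fy < cutoff_frac) & (fx < cutoff_frac)
    blended = np.where(low, low_weight * Fs + (1 - low_weight) * Ft, Ft)
    return np.real(np.fft.ifft2(blended))
```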
Summary
VFace is a training-free method for high-quality video face swapping, integrating seamlessly with diffusion-based image face swapping approaches. It uses Frequency Spectrum Attention Interpolation, Target Structure Guidance, and Flow-Guided Attention Temporal Smoothening to enhance identity preservation and temporal consistency. Experiments demonstrate significant improvements in temporal consistency and visual fidelity without requiring additional training or fine-tuning.
StructCore: Structure-Aware Image-Level Scoring for Training-Free Unsupervised Anomaly Detection
Authors: Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae
First: 2026-02-19T03:35:24+00:00 · Latest: 2026-02-19T03:35:24+00:00
Abstract
Max pooling is the de facto standard for converting anomaly score maps into image-level decisions in memory-bank-based unsupervised anomaly detection (UAD). However, because it relies on a single extreme response, it discards most information about how anomaly evidence is distributed and structured across the image, often causing normal and anomalous scores to overlap.
We propose StructCore, a training-free, structure-aware image-level scoring method that goes beyond max pooling. Given an anomaly score map, StructCore computes a low-dimensional structural descriptor $\phi(S)$ that captures distributional and spatial characteristics, and refines image-level scoring via a diagonal Mahalanobis calibration estimated from train-good samples, without modifying pixel-level localization.
StructCore achieves image-level AUROC scores of 99.6% on MVTec AD and 98.4% on VisA, demonstrating robust image-level anomaly detection by exploiting structural signatures missed by max pooling.
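The descriptor-plus-calibration pipeline can be sketched end to end. The particular statistics chosen for $\phi(S)$ here (max, mean, std, top-k mean, spatial spread) are illustrative assumptions; the paper's exact descriptor may differ, but the diagonal Mahalanobis calibration against train-good statistics is as described.

```python
import numpy as np

def structural_descriptor(score_map, topk=10):
    """phi(S) sketch: distributional and spatial statistics of an anomaly map."""
    S = np.asarray(score_map, dtype=float)
    flat = np.sort(S.ravel())
    h, w = S.shape
    ys, xs = np.mgrid[0:h, 0:w]
    wsum = S.sum() + 1e-12
    cy, cx = (ys * S).sum() / wsum, (xs * S).sum() / wsum
    spread = (((ys - cy) ** 2 + (xs - cx) ** 2) * S).sum() / wsum
    return np.array([flat[-1], flat.mean(), flat.std(), flat[-topk:].mean(), spread])

def fit_calibration(train_good_maps):
    """Per-dimension mean/std of phi(S) over train-good score maps."""
    feats = np.stack([structural_descriptor(m) for m in train_good_maps])
    return feats.mean(axis=0), feats.std(axis=0) + 1e-6

def image_score(score_map, mu, sigma):
    """Diagonal Mahalanobis distance of phi(S) from the train-good statistics;
    larger means more structurally unusual than normal images."""
    d = (structural_descriptor(score_map) - mu) / sigma
    return float(np.sqrt((d ** 2).sum()))
```

Note how the first descriptor entry is exactly the max-pooling score; the extra entries are the structural signatures max pooling discards.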
Summary
The paper addresses the limitations of max pooling in unsupervised anomaly detection, which often leads to overlapping scores between normal and anomalous regions. It introduces StructCore, a training-free method that computes a structural descriptor from anomaly score maps to capture distributional and spatial characteristics. This descriptor is then calibrated using a diagonal Mahalanobis distance estimated from good samples, resulting in improved image-level anomaly detection with AUROC scores of 99.6% on MVTec AD and 98.4% on VisA.
Narrow fine-tuning erodes safety alignment in vision-language agents
Authors: Idhant Gulati, Shivam Raval
First: 2026-02-18T22:47:28+00:00 · Latest: 2026-02-18T22:47:28+00:00
Comments: 24 pages, 11 figures
Abstract
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment ($70.71 \pm 1.22$ at $r=128$) than text-only evaluation ($41.19 \pm 2.51$), suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10\% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors occupy a remarkably low-dimensional subspace, with the majority of misalignment information captured in 10 principal components. To mitigate misalignment, we evaluate two strategies: benign narrow fine-tuning and activation-based steering. While both approaches substantially reduce misalignment, neither completely removes the learned harmful behaviors. Our findings highlight the need for robust continual learning frameworks, as current post-training paradigms may not sufficiently preserve alignment in post-deployment settings.
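The geometric claim, that misalignment information concentrates in roughly 10 principal components, corresponds to measuring how much variance of the relevant activation vectors the top-k components capture. A minimal, library-agnostic version of that measurement (on synthetic data, with the choice of activations left abstract) looks like this:

```python
import numpy as np

def variance_in_top_components(activations, k=10):
    """Fraction of total variance of the row vectors (e.g., activation
    differences between harmful and benign prompts) captured by the top-k
    principal components, via SVD of the centered data matrix."""
    X = np.asarray(activations, dtype=float)
    X = X - X.mean(axis=0)
    s = np.linalg.svd(X, compute_uv=False)
    var = s ** 2
    return float(var[:k].sum() / var.sum())
```

A value near 1.0 for small k is what "occupies a low-dimensional subspace" means operationally.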
Summary
The study investigates the impact of narrow fine-tuning on the safety alignment of vision-language agents. It demonstrates that fine-tuning on harmful datasets leads to significant misalignment that generalizes across tasks and modalities. Experiments on Gemma3-4B show that misalignment increases with LoRA rank and that multimodal evaluation reveals higher misalignment than text-only evaluation. Even a small fraction of harmful data causes substantial alignment degradation, and the harmful behaviors concentrate in a low-dimensional subspace. The tested mitigations, benign narrow fine-tuning and activation-based steering, reduce but do not fully remove the learned harmful behaviors.
MALLVI: a multi agent framework for integrated generalized robotics manipulation
Authors: Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani, Babak Khalaj
First: 2026-02-18T21:28:56+00:00 · Latest: 2026-02-18T21:28:56+00:00
Abstract
Task planning for robotic manipulation with large language models (LLMs) is an emerging area. Prior approaches rely on specialized models, fine-tuning, or prompt tuning, and often operate in an open-loop manner without robust environmental feedback, making them fragile in dynamic settings. We present MALLVi, a Multi Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation. Given a natural language instruction and an image of the environment, MALLVi generates executable atomic actions for a robot manipulator. After action execution, a Vision Language Model (VLM) evaluates environmental feedback and decides whether to repeat the process or proceed to the next step. Rather than using a single model, MALLVi coordinates specialized agents (Decomposer, Localizer, Thinker, and Reflector) to manage perception, localization, reasoning, and high-level planning. An optional Descriptor agent provides visual memory of the initial state. The Reflector supports targeted error detection and recovery by reactivating only relevant agents, avoiding full replanning. Experiments in simulation and real-world settings show that iterative closed-loop multi-agent coordination improves generalization and increases success rates in zero-shot manipulation tasks. Code is available at https://github.com/iman1234ahmadi/MALLVI.
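The closed loop described above can be sketched as a skeleton with every agent stubbed out. The function names and the retry policy are illustrative, not MALLVi's API; the point is the control flow: decompose, act, check with a VLM, retry the failing step rather than replanning everything.

```python
def closed_loop_manipulation(instruction, observe, agents, max_retries=3):
    """MALLVi-style loop with stubbed agents: the Decomposer splits the
    instruction into atomic steps; each step is localized, reasoned about,
    executed, and then verified from a fresh observation before moving on."""
    steps = agents["decomposer"](instruction)
    log = []
    for step in steps:
        for attempt in range(max_retries):
            target = agents["localizer"](step, observe())
            action = agents["thinker"](step, target)
            agents["executor"](action)
            if agents["reflector"](step, observe()):   # VLM feedback: done?
                log.append((step, attempt + 1))
                break
        else:
            raise RuntimeError(f"step failed after {max_retries} attempts: {step}")
    return log
```

Only the failing step is retried, which mirrors the Reflector's targeted recovery: the inner loop reactivates the relevant agents instead of restarting the whole plan.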
Summary
MALLVi is a multi-agent framework that uses large language models and vision to enable closed-loop feedback-driven robotic manipulation. Given natural language instructions and environmental images, MALLVi generates executable actions, evaluates feedback, and iteratively refines the process. Experiments show improved generalization and success rates in zero-shot manipulation tasks compared to open-loop approaches.
DODO: Discrete OCR Diffusion Models
Authors: Sean Man, Roy Ganz, Roi Ronen, Shahar Tsiper, Shai Mazor, Niv Nayman
First: 2026-02-18T20:59:22+00:00 · Latest: 2026-02-18T20:59:22+00:00
Abstract
Optical Character Recognition (OCR) is a fundamental task for digitizing information, serving as a critical bridge between visual data and textual understanding. While modern Vision-Language Models (VLM) have achieved high accuracy in this domain, they predominantly rely on autoregressive decoding, which becomes computationally expensive and slow for long documents as it requires a sequential forward pass for every generated token. We identify a key opportunity to overcome this bottleneck: unlike open-ended generation, OCR is a highly deterministic task where the visual input strictly dictates a unique output sequence, theoretically enabling efficient, parallel decoding via diffusion models. However, we show that existing masked diffusion models fail to harness this potential; they introduce structural instabilities that are benign in flexible tasks, like captioning, but catastrophic for the rigid, exact-match requirements of OCR. To bridge this gap, we introduce DODO, the first VLM to utilize block discrete diffusion and unlock its speedup potential for OCR. By decomposing generation into blocks, DODO mitigates the synchronization errors of global diffusion. Empirically, our method achieves near state-of-the-art accuracy while enabling up to 3x faster inference compared to autoregressive baselines.
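Block-wise discrete diffusion decoding can be illustrated with a toy loop: the sequence is filled block by block, and within a block several masked positions are committed in parallel per step instead of one token per forward pass. The `predict` stub, confidence-based commit rule, and block sizes below are assumptions for the sketch, not DODO's actual decoder.

```python
MASK = "<mask>"

def block_diffusion_decode(predict, length, block_size, steps_per_block):
    """Toy block discrete diffusion decoder. `predict` stands in for the
    model: it maps (current tokens, position) -> (proposed token, confidence).
    Each step commits the most confident half of a block's masked positions
    in parallel, which is the source of the speedup over autoregression."""
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        block = range(start, min(start + block_size, length))
        for _ in range(steps_per_block):
            masked = [i for i in block if tokens[i] == MASK]
            if not masked:
                break
            proposals = {i: predict(tokens, i) for i in masked}
            keep = max(1, len(masked) // 2)
            for i in sorted(masked, key=lambda i: -proposals[i][1])[:keep]:
                tokens[i] = proposals[i][0]
        for i in block:                  # fill any positions still masked
            if tokens[i] == MASK:
                tokens[i] = predict(tokens, i)[0]
    return tokens
```

Restricting each refinement pass to one block at a time is the sketch's analogue of how blockwise generation avoids the global synchronization errors the abstract describes.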
Summary
The research aims to improve the efficiency of Optical Character Recognition (OCR) by leveraging diffusion models, which can enable parallel decoding and speed up the process. Unlike existing masked diffusion models, DODO, the proposed method, uses block discrete diffusion to mitigate synchronization errors and achieve near state-of-the-art accuracy with up to 3x faster inference compared to autoregressive baselines.
Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Authors: Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk
Venue: ICLR 2026
First: 2025-08-16T12:26:44+00:00 · Latest: 2026-02-18T20:57:18+00:00
Comments: Accepted to The Fourteenth International Conference on Learning Representations (ICLR 2026)
Abstract
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, although the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just $60$ instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of $5\,400$ instances that represent original BP abstract concepts using real-world-like images generated via a vision language model (VLM) pipeline. Building on Bongard-RWR, we employ Pixtral-12B to describe manually curated images and generate new descriptions aligned with the underlying concepts, use Flux.1-dev to synthesize images from these descriptions, and manually verify that the generated images faithfully reflect the intended concepts. We evaluate state-of-the-art VLMs across diverse BP formulations, including binary and multiclass classification, as well as textual answer generation. Our findings reveal that while VLMs can recognize coarse-grained visual concepts, they consistently struggle with discerning fine-grained concepts, highlighting limitations in their reasoning capabilities.
Summary
This work introduces Bongard-RWR+, a dataset of 5,400 real-world-like images generated via a vision language model pipeline, to represent fine-grained abstract concepts from original Bongard Problems. The study evaluates state-of-the-art VLMs on various BP formulations and finds that while these models can recognize coarse-grained concepts, they struggle with fine-grained ones, indicating limitations in reasoning capabilities.
Can Vision-Language Models Answer Face to Face Questions in the Real-World?
Authors: Reza Pourreza, Rishit Dagli, Apratim Bhattacharyya, Sunny Panchal, Guillaume Berger, Roland Memisevic
Venue: ICLR 2026
First: 2025-03-25T05:13:12+00:00 · Latest: 2026-02-18T20:15:27+00:00
Comments: ICLR 2026 paper
Abstract
AI models have made significant strides in recent years in their ability to describe and answer questions about real-world images. They have also made progress in the ability to converse with users in real-time using audio input. This raises the question: have we reached the point where AI models, connected to a camera and microphone, can converse with users in real-time about scenes and events that are unfolding live in front of the camera? This has been a long-standing goal in AI and is a prerequisite for real-world AI assistants and humanoid robots to interact with humans in everyday situations. In this work, we introduce a new dataset and benchmark, the Qualcomm Interactive Video Dataset (IVD), which allows us to assess the extent to which existing models can support these abilities, and to what degree these capabilities can be instilled through fine-tuning. The dataset is based on a simple question-answering setup, where users ask questions that the system has to answer, in real-time, based on the camera and audio input. We show that existing models fall far behind human performance on this task, and we identify the main sources for the performance gap. However, we also show that for many of the required perceptual skills, fine-tuning on this form of data can significantly reduce this gap.
Summary
The research aims to evaluate whether current vision-language models can answer questions in real-time about live scenes captured by a camera. The study introduces the Qualcomm Interactive Video Dataset (IVD) to benchmark these models. Existing models perform poorly compared to humans, but fine-tuning on this dataset can improve their performance in several perceptual skills.
Better Think Thrice: Learning to Reason Causally with Double Counterfactual Consistency
Authors: Victoria Lin, Xinnuo Xu, Rachel Lawrence, Risa Ueno, Amit Sharma, Javier Gonzalez, Niranjani Prasad
First: 2026-02-18T19:00:07+00:00 · Latest: 2026-02-18T19:00:07+00:00
Abstract
Despite their strong performance on reasoning benchmarks, large language models (LLMs) have proven brittle when presented with counterfactual questions, suggesting weaknesses in their causal reasoning ability. While recent work has demonstrated that labeled counterfactual tasks can be useful benchmarks of LLMs' causal reasoning, producing such data at the scale required to cover the vast potential space of counterfactuals remains impractical. In this work, we introduce double counterfactual consistency (DCC), a lightweight inference-time method for measuring and guiding the ability of LLMs to reason causally. Without requiring labeled counterfactual data, DCC verifies a model's ability to execute two important elements of causal reasoning: causal intervention and counterfactual prediction. Using DCC, we evaluate the causal reasoning abilities of various leading LLMs across a range of reasoning tasks and interventions. Moreover, we demonstrate the effectiveness of DCC as a training-free test-time rejection sampling criterion and show that it can directly improve performance on reasoning tasks across multiple model families.
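One plausible reading of a consistency check in this spirit can be sketched as follows: answer a question, answer its intervened (counterfactual) version, then undo the intervention and check that the original answer is recovered. The stub interfaces (`model`, `intervene`, `revert`) and the specific check are illustrative assumptions, not the paper's exact DCC procedure.

```python
def dcc_consistent(model, question, intervene, revert):
    """Consistency probe: a model that reasons causally should recover its
    original answer once an intervention on the question is undone.
    Returns (is_consistent, answers) for inspection."""
    original = model(question)
    counterfactual = model(intervene(question))
    recovered = model(revert(intervene(question)))
    return recovered == original, {"original": original,
                                   "counterfactual": counterfactual,
                                   "recovered": recovered}

def rejection_sample(candidates, check):
    """Training-free test-time filter: keep only candidate answers that pass
    the consistency check."""
    return [c for c in candidates if check(c)]
```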
Summary
This work addresses the brittleness of large language models (LLMs) when dealing with counterfactual questions, indicating a need for better causal reasoning. The authors introduce double counterfactual consistency (DCC), a method that evaluates LLMs' ability to perform causal intervention and counterfactual prediction without requiring labeled data. Experiments show that DCC can effectively test and enhance LLMs' causal reasoning across various tasks and model types.
Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Authors: Runpei Dong, Ziyan Li, Xialin He, Saurabh Gupta
First: 2026-02-18T18:55:02+00:00 · Latest: 2026-02-18T18:55:02+00:00
Comments: Project page: https://hero-humanoid.github.io/
Abstract
Visual loco-manipulation of arbitrary objects in the wild with humanoid robots requires accurate end-effector (EE) control and a generalizable understanding of the scene via visual inputs (e.g., RGB-D images). Existing approaches are based on real-world imitation learning and exhibit limited generalization due to the difficulty in collecting large-scale training datasets. This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots that combines the strong generalization and open-vocabulary understanding of large vision models with strong control performance from simulated training. We achieve this by designing an accurate residual-aware EE tracking policy. This EE tracking policy combines classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, c) goal adjustment, and d) replanning. Together, these innovations help us cut down the end-effector tracking error by 3.2x. We use this accurate end-effector tracker to build a modular system for loco-manipulation, where we use open-vocabulary large vision models for strong visual generalization. Our system is able to operate in diverse real-world environments, from offices to coffee shops, where the robot is able to reliably manipulate various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43cm to 92cm in height. Systematic modular and end-to-end tests in simulation and the real world demonstrate the effectiveness of our proposed design. We believe the advances in this paper can open up new ways of training humanoid robots to interact with daily objects.
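The residual-aware tracking loop (IK to joint references, learned forward model, goal adjustment, replanning) can be sketched in a few lines. All callables below are stubs standing in for the paper's components, and the simple additive goal adjustment is an illustrative assumption:

```python
import numpy as np

def track_end_effector(target, ik, forward_model, controller_step,
                       max_iters=50, tol=1e-3):
    """Residual-aware EE tracking sketch: inverse kinematics turns the
    (residually adjusted) goal into joint references, a learned forward model
    predicts where those joints actually place the end effector, and the goal
    is shifted by the prediction error before replanning."""
    adjusted_goal = np.asarray(target, dtype=float).copy()
    q = ik(adjusted_goal)
    for _ in range(max_iters):
        predicted_ee = forward_model(q)
        residual = np.asarray(target, dtype=float) - predicted_ee
        if np.linalg.norm(residual) < tol:
            break
        adjusted_goal = adjusted_goal + residual       # goal adjustment
        q = controller_step(q, ik(adjusted_goal))      # replan toward new refs
    return q, float(np.linalg.norm(np.asarray(target, dtype=float)
                                   - forward_model(q)))
```

With a systematically biased plant, the loop converges because shifting the goal by the residual cancels the bias that plain IK would leave in place, which is the intuition behind the abstract's tracking-error reduction.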
中文标题/摘要
标题:类人机器人开放词汇视觉移动物体末端执行器控制学习
使用类人机器人在野外对任意物体进行视觉移动物体操作需要精确的末端执行器(EE)控制和通过视觉输入(例如RGB-D图像)对场景的广泛理解。现有方法基于现实世界的模仿学习,由于难以收集大规模训练数据集,因此表现出有限的泛化能力。本文提出了一种新的范式HERO,用于类人机器人物体移动物体操作,结合了大型视觉模型的强大泛化能力和开放词汇理解与模拟训练中的强大控制性能。我们通过设计一种准确的残差感知末端执行器跟踪策略来实现这一点。该末端执行器跟踪策略结合了经典机器人学与机器学习。它使用a) 逆运动学将残差末端执行器目标转换为参考轨迹,b) 用于准确前向运动学的已学习神经前向模型,c) 目标调整,以及d) 重新规划。这些创新共同帮助我们将末端执行器跟踪误差减少了3.2倍。我们使用这种准确的末端执行器跟踪器构建了一个模块化移动物体系统,其中使用开放词汇大型视觉模型实现强大的视觉泛化。我们的系统能够在从办公室到咖啡馆等多样化的现实环境中操作,机器人能够可靠地操作各种日常物体(例如茶杯、苹果、玩具),这些物体位于43cm至92cm高度的表面上。在模拟和现实世界中的系统模块化和端到端测试表明我们提出的设计的有效性。我们认为本文中的进展可以为训练类人机器人与日常物体交互开辟新的训练方式。
Summary / 总结
This paper introduces HERO, a new paradigm for humanoid robot object manipulation that combines strong generalization from large vision models with accurate end-effector control through a residual-aware tracking policy. The policy integrates classical robotics with machine learning, using inverse kinematics, a neural forward model, goal adjustment, and replanning. This approach reduces end-effector tracking error by 3.2 times. The system, which uses open-vocabulary large vision models for visual understanding, successfully manipulates various objects in diverse real-world environments, demonstrating its effectiveness in both simulation and the real world.
This paper proposes HERO, a new paradigm that enables humanoid robots to perform object loco-manipulation in diverse environments. The system combines the strong generalization of large vision models with precise end-effector control. Key innovations include inverse kinematics, a learned neural forward model, goal adjustment, and replanning, which together reduce end-effector tracking error by 3.2x and let the robot reliably manipulate various objects in real-world settings such as offices and coffee shops.
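The abstract above names the tracker's ingredients but not their mechanics. As a loose illustration of (a) inverse kinematics on residual EE targets and (d) replanning, here is a damped least-squares IK loop on a toy 2-link planar arm. Everything in it — the link lengths, damping constant, and the residual clipping used as a crude stand-in for goal adjustment — is an assumption for illustration, not HERO's implementation; the analytic `fk` plays the role the paper assigns to a learned neural forward model.

```python
import numpy as np

# Toy link lengths (assumptions, not from the paper).
L1, L2 = 0.3, 0.25

def fk(q):
    """Analytic forward kinematics of a 2-link planar arm; in HERO this role
    is played by a learned neural forward model."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def jacobian(q):
    """Analytic Jacobian of fk with respect to the two joint angles."""
    s1, s12 = np.sin(q[0]), np.sin(q[0] + q[1])
    c1, c12 = np.cos(q[0]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

def track_ee(q, target, steps=200, damping=1e-2, clip=0.2):
    """Damped least-squares IK loop: turn the residual EE target into joint
    updates, clipping large residuals (a crude stand-in for goal adjustment)
    and replanning from the current state at every step."""
    for _ in range(steps):
        residual = target - fk(q)
        n = np.linalg.norm(residual)
        if n < 1e-4:
            break
        step = residual if n < clip else residual * (clip / n)
        J = jacobian(q)
        q = q + J.T @ np.linalg.solve(J @ J.T + damping * np.eye(2), step)
    return q

q_final = track_ee(np.array([0.5, 0.5]), np.array([0.35, 0.25]))
err = np.linalg.norm(np.array([0.35, 0.25]) - fk(q_final))
```

The damping term keeps the update well-conditioned near kinematic singularities, which is why DLS is the usual choice over a raw Jacobian inverse in tracking loops like this.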
Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Authors: Mingjia Shi, Yinhan He, Yaochen Zhu, Jundong Li
First: 2026-02-18T18:49:56+00:00 · Latest: 2026-02-18T18:49:56+00:00
Comments: preprint 10 pages, 4 figures
Abstract
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, vanilla guidance for visual grounding during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose \emph{Saliency-Aware Principle} (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, which enables stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, enabling parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Empirical results show that SAP achieves competitive performance, especially in reducing object hallucination, under comparable token-generation budgets while yielding more stable reasoning and lower response latency than CoT-style long sequential reasoning.
Chinese Title / Abstract
Title: Saliency-Aware Multi-Route Thinking: Revisiting Vision-Language Reasoning
Vision-language models (VLMs) aim to reason by jointly leveraging visual and textual modalities. While allocating additional inference-time computation has proven effective for large language models (LLMs), achieving similar scaling in VLMs remains challenging. A key obstacle is that visual inputs are typically provided only once at the start of generation, while textual reasoning (e.g., early visual summaries) is generated autoregressively, causing reasoning to become increasingly text-dominated and allowing early visual grounding errors to accumulate. Moreover, visual-grounding guidance during inference is often coarse and noisy, making it difficult to steer reasoning over long texts. To address these challenges, we propose Saliency-Aware Principle (SAP) selection. SAP operates on high-level reasoning principles rather than token-level trajectories, enabling stable control over discrete generation under noisy feedback while allowing later reasoning steps to re-consult visual evidence when renewed grounding is required. In addition, SAP supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors. SAP is model-agnostic and data-free, requiring no additional training. Experiments show that SAP achieves competitive performance under comparable token-generation budgets, especially in reducing object hallucination, while yielding more stable reasoning and faster responses than CoT-style long sequential reasoning.
Summary / 总结
The paper addresses the challenge of scaling vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles to enable stable control over discrete generation under noisy feedback. SAP supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors. Empirical results show that SAP reduces object hallucination, provides more stable reasoning, and has lower response latency compared to CoT-style long sequential reasoning.
The paper tackles effective visual grounding in vision-language models (VLMs) by proposing Saliency-Aware Principle (SAP) selection, which operates on high-level reasoning principles to stably control discrete generation under noisy feedback. SAP supports multi-route inference, allowing parallel exploration of diverse reasoning behaviors. Experiments show that SAP reduces object hallucination and delivers more stable reasoning with lower response latency than CoT-style long sequential reasoning.
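The abstract does not specify how routes are scored, so the following is only a hedged toy picture of multi-route selection: score each candidate reasoning route against a visual-saliency embedding and keep the best-aligned one. The cosine-similarity scorer and the embeddings are illustrative assumptions, not SAP's actual mechanism.

```python
import numpy as np

def saliency_score(route_emb, visual_emb):
    """Toy grounding score: cosine similarity between a candidate route's
    embedding and a visual-saliency embedding (an assumed stand-in for
    SAP's real saliency feedback)."""
    return float(route_emb @ visual_emb /
                 (np.linalg.norm(route_emb) * np.linalg.norm(visual_emb)))

def select_principle(routes, visual_emb):
    """Multi-route selection: explore candidate reasoning routes in parallel
    and keep the one best aligned with visual saliency."""
    scores = [saliency_score(r, visual_emb) for r in routes]
    return int(np.argmax(scores)), scores

visual = np.array([1.0, 0.0, 0.0])          # toy saliency embedding
routes = [np.array([0.1, 1.0, 0.0]),        # text-dominated route
          np.array([0.9, 0.2, 0.1]),        # visually grounded route
          np.array([-0.5, 0.3, 0.8])]       # poorly grounded route
best, scores = select_principle(routes, visual)
```

Selecting over whole routes rather than individual tokens is the point the abstract makes: a noisy scalar score is too coarse to steer token-by-token decoding but is stable enough to rank a handful of parallel candidates.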
MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Ming Lu, Tianyi Jiang, Kai Zeng, Yulin Luo, Jiajun Cao, Hao Liang, Ying Chen, Qi She, Shanghang Zhang, Wentao Zhang
First: 2024-11-18T16:33:52+00:00 · Latest: 2026-02-18T18:33:19+00:00
Abstract
Current vision-language models (VLMs) show exceptional abilities across diverse tasks, such as visual question answering. To enhance user experience, recent studies have investigated VLM personalization to understand user-provided concepts. However, they mainly focus on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA employs a multi-concept instruction tuning strategy, effectively integrating multiple concepts in a single training step. To reduce the training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. Additionally, we introduce a personalized visual prompt during inference, aggregating location maps for enhanced recognition and grounding capabilities. To further push the performance upper bound, we incorporate an optional auxiliary loss, better enhancing the proposed personalized prompts. To enrich VLM personalization research, we contribute a high-quality dataset. We carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring superior diversity. Comprehensive experiments demonstrate that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at \href{https://github.com/arctanxarc/MC-LLaVA}{https://github.com/arctanxarc/MC-LLaVA}.
Chinese Title / Abstract
Title: MC-LLaVA: Multi-Concept Personalized Vision-Language Model
Current vision-language models (VLMs) perform impressively across diverse tasks such as visual question answering. To improve user experience, recent studies have explored VLM personalization to understand user-provided concepts. However, they focus mainly on single concepts, neglecting the existence and interplay of multiple concepts, which limits real-world applicability. This paper proposes MC-LLaVA, a multi-concept personalization paradigm. Specifically, MC-LLaVA adopts a multi-concept instruction tuning strategy that effectively integrates multiple concepts in a single training step. To reduce training costs, we propose a personalized textual prompt that uses visual token information to initialize concept tokens. In addition, we introduce a personalized visual prompt at inference time, aggregating location maps to strengthen recognition and grounding. To further raise the performance upper bound, we incorporate an optional auxiliary loss that better enhances the proposed personalized prompts. To enrich VLM personalization research, we contribute a high-quality dataset: we carefully collect images with multiple characters and objects from movies and manually create question-answer samples for multi-concept scenarios, featuring high diversity. Comprehensive experiments show that MC-LLaVA achieves impressive multi-concept personalized responses, paving the way for VLMs to become better user assistants. The code and dataset will be released at https://github.com/arctanxarc/MC-LLaVA.
Summary / 总结
This paper introduces MC-LLaVA, a multi-concept personalized vision-language model that addresses the limitations of single-concept personalization by integrating multiple concepts in a single training step. It employs a multi-concept instruction tuning strategy and uses personalized textual and visual prompts to enhance recognition and grounding capabilities. Experimental results show that MC-LLaVA provides impressive multi-concept personalized responses, improving the real-world applicability of vision-language models as user assistants.
This paper proposes MC-LLaVA, a multi-concept personalized vision-language model that addresses the limitations of single-concept personalization by integrating multiple concepts in a single training step. The model adopts a multi-concept instruction tuning strategy and uses personalized textual and visual prompts to strengthen recognition and grounding. Experiments show that MC-LLaVA delivers strong multi-concept personalized responses, improving the practicality of vision-language models as user assistants. A high-quality dataset is also contributed to support this research.
A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Authors: Qi You, Yitai Cheng, Zichao Zeng, James Haworth
First: 2026-02-18T16:41:32+00:00 · Latest: 2026-02-18T16:41:32+00:00
Abstract
Street-view image attribute classification is a vital downstream task of image classification, enabling applications such as autonomous driving, urban analytics, and high-definition map construction. It remains computationally demanding whether training from scratch, initialising from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP offer rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture fine-grained, localised attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention operating on patch tokens to model inter-patch dependencies. With approximately 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset, attaining new state-of-the-art results while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
Chinese Title / Abstract
Title: A Contrastive Learning Framework Empowered by Attention-based Feature Adaptation for Street-View Image Classification
Street-view image attribute classification is an important downstream task of image classification, supporting applications such as autonomous driving, urban analytics, and high-definition map construction. The task remains computationally demanding whether training from scratch, initializing from pre-trained weights, or fine-tuning large models. Although pre-trained vision-language models such as CLIP provide rich image representations, existing adaptation or fine-tuning methods often rely on their global image embeddings, limiting their ability to capture the fine-grained, localized attributes essential in complex, cluttered street scenes. To address this, we propose CLIP-MHAdapter, a variant of the current lightweight CLIP adaptation paradigm that appends a bottleneck MLP equipped with multi-head self-attention over patch tokens to model inter-patch dependencies. With roughly 1.4 million trainable parameters, CLIP-MHAdapter achieves superior or competitive accuracy across eight attribute classification tasks on the Global StreetScapes dataset while maintaining low computational cost. The code is available at https://github.com/SpaceTimeLab/CLIP-MHAdapter.
Summary / 总结
The research aims to improve the accuracy of street-view image attribute classification for applications like autonomous driving and urban analytics. The method introduces CLIP-MHAdapter, which enhances CLIP by adding a bottleneck MLP with multi-head self-attention on patch tokens to capture local features. This approach achieves state-of-the-art results across eight attribute classification tasks on the Global StreetScapes dataset with minimal computational cost, using about 1.4 million trainable parameters.
The research aims to improve street-view image attribute classification, which is essential for applications such as autonomous driving. The method introduces CLIP-MHAdapter, which augments CLIP with a bottleneck MLP with multi-head self-attention over patch tokens to capture fine-grained attributes. It achieves state-of-the-art results on eight attribute classification tasks of the Global StreetScapes dataset at low computational cost, using only about 1.4 million trainable parameters.
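The adapter described above — a bottleneck MLP with multi-head self-attention over patch tokens — can be sketched in NumPy as follows. The dimensions, random initialization, and exact placement of the residual connections are assumptions for illustration; the real CLIP-MHAdapter sits on frozen CLIP patch tokens and its weights are trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mh_self_attention(x, n_heads, Wq, Wk, Wv, Wo):
    """Multi-head self-attention over patch tokens x of shape (N, d):
    this is the piece that models inter-patch dependencies."""
    N, d = x.shape
    dh = d // n_heads
    q = (x @ Wq).reshape(N, n_heads, dh).transpose(1, 0, 2)   # (heads, N, dh)
    k = (x @ Wk).reshape(N, n_heads, dh).transpose(1, 0, 2)
    v = (x @ Wv).reshape(N, n_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))    # (heads, N, N)
    out = (attn @ v).transpose(1, 0, 2).reshape(N, d)
    return out @ Wo

def adapter(tokens, n_heads=4, bottleneck=16):
    """Bottleneck MLP with MHSA at the bottleneck width, plus a residual
    connection back onto the (frozen) patch tokens. Weights here are random
    placeholders; in the paper they are the ~1.4M trainable parameters."""
    d = tokens.shape[1]
    Wdown = rng.normal(0, 0.02, (d, bottleneck))
    Wq, Wk, Wv, Wo = [rng.normal(0, 0.02, (bottleneck, bottleneck)) for _ in range(4)]
    Wup = rng.normal(0, 0.02, (bottleneck, d))
    h = np.maximum(tokens @ Wdown, 0.0)                       # down-project + ReLU
    h = h + mh_self_attention(h, n_heads, Wq, Wk, Wv, Wo)     # inter-patch mixing
    return tokens + h @ Wup                                   # up-project + residual

patch_tokens = rng.normal(size=(49, 32))   # e.g. a 7x7 patch grid, toy 32-dim features
out = adapter(patch_tokens)
```

Running attention at the bottleneck width rather than the full CLIP width is what keeps the parameter count small while still letting patches exchange information.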
FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Authors: Sebastián Barbas Laina, Simon Boche, Sotiris Papatheodorou, Simon Schaefer, Jaehyung Jung, Stefan Leutenegger
First: 2025-04-11T15:12:05+00:00 · Latest: 2026-02-18T15:52:04+00:00
Comments: 11 pages, 5 figures
Abstract
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. Nevertheless, real-time, open-vocabulary semantic understanding of large-scale unknown environments still presents open challenges, mainly due to computational requirements. In this paper we present FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to the use of vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It proposes an efficient storage of open-vocabulary information through the aggregation of features at the object level. Pixelwise vision-language features are aggregated based on eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is scalable also in terms of memory usage. We demonstrate that FindAnything performs on par with the state-of-the-art in terms of semantic accuracy while being substantially faster and more memory-efficient, allowing its deployment in large-scale environments and on resource-constrained devices, such as MAVs. We show that the real-time capabilities of FindAnything make it useful for downstream tasks, such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project Page: https://ethz-mrl.github.io/findanything/.
Chinese Title / Abstract
Title: FindAnything: Open-Vocabulary and Object-Centric Mapping for Robot Exploration in Any Environment
Geometrically accurate and semantically expressive map representations have proven invaluable for robot deployment and task planning in unknown environments. However, real-time, open-vocabulary semantic understanding of large-scale unknown environments still poses open challenges, mainly due to computational requirements. This paper presents FindAnything, an open-world mapping framework that incorporates vision-language information into dense volumetric submaps. Thanks to vision-language features, FindAnything combines pure geometric and open-vocabulary semantic information for a higher level of understanding. It stores open-vocabulary information efficiently by aggregating features at the object level: pixelwise vision-language features are aggregated over eSAM segments, which are in turn integrated into object-centric volumetric submaps, providing a mapping from open-vocabulary queries to 3D geometry that is also scalable in memory usage. We show that FindAnything performs on par with the state of the art in semantic accuracy while being substantially faster and more memory-efficient, enabling deployment in large-scale environments and on resource-constrained devices such as MAVs. Its real-time capability makes it useful for downstream tasks such as autonomous MAV exploration in a simulated Search and Rescue scenario. Project page: https://ethz-mrl.github.io/findanything/
Summary / 总结
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to achieve both geometric and semantic understanding. It uses object-level feature aggregation and eSAM segments to efficiently store open-vocabulary information, providing scalable memory usage. FindAnything matches the state-of-the-art in semantic accuracy but is faster and more memory-efficient, suitable for large-scale environments and resource-constrained devices like MAVs. It demonstrates real-time capabilities useful for tasks such as autonomous MAV exploration in simulated Search and Rescue scenarios.
FindAnything is an open-world mapping framework that integrates vision-language information into dense volumetric submaps to strengthen semantic understanding of unknown environments. It uses object-level feature aggregation over eSAM segments to store open-vocabulary information efficiently, with scalable memory usage. FindAnything matches the state of the art in semantic accuracy while being faster and more memory-efficient, suiting large-scale environments and resource-constrained devices such as MAVs, and its real-time capability supports downstream tasks such as autonomous MAV exploration in a simulated Search and Rescue scenario.
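The object-level aggregation idea can be illustrated with a minimal sketch: average pixelwise vision-language features within each segment (a stand-in for eSAM segments), keep one normalized embedding per object, and answer open-vocabulary queries by cosine similarity. The feature dimension and toy data below are assumptions, not the framework's actual representations.

```python
import numpy as np

def aggregate_object_features(pix_feats, seg_ids):
    """Average pixelwise vision-language features within each segment,
    yielding one normalized embedding per object -- far less memory than
    keeping a per-pixel (or per-voxel) feature map."""
    objects = {}
    for sid in np.unique(seg_ids):
        f = pix_feats[seg_ids == sid].mean(axis=0)
        objects[int(sid)] = f / np.linalg.norm(f)
    return objects

def best_object(objects, text_emb):
    """Answer an open-vocabulary query by cosine similarity against the
    per-object embeddings."""
    t = text_emb / np.linalg.norm(text_emb)
    return max(objects, key=lambda sid: float(objects[sid] @ t))

# Toy features: two segments clustered around orthogonal directions.
feats = np.vstack([np.tile([1.0, 0.0, 0.0, 0.0], (5, 1)),
                   np.tile([0.0, 1.0, 0.0, 0.0], (5, 1))])
segs = np.array([0] * 5 + [1] * 5)
objs = aggregate_object_features(feats, segs)
hit = best_object(objs, np.array([0.1, 0.9, 0.0, 0.0]))  # query near segment 1
```

Storing one embedding per object instead of per pixel is what makes the memory footprint scale with the number of objects rather than the map resolution.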
DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Authors: Zeng Tao, Ying Jiang, Yunuo Chen, Tianyi Xie, Huamin Wang, Yingnian Wu, Yin Yang, Abishek Sampath Kumar, Kenji Tashiro, Chenfanfu Jiang
First: 2026-02-18T14:45:15+00:00 · Latest: 2026-02-18T14:45:15+00:00
Abstract
Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and difficult to scale. This paper focuses on sewing pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extract pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and subsequently used to predict sewing pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments demonstrate that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
Chinese Title / Abstract
Title: DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images
Recent advances in garment pattern generation have shown promising progress. However, existing feed-forward methods struggle with diverse poses and viewpoints, while optimization-based approaches are computationally expensive and hard to scale. This paper focuses on sewing-pattern generation for garment modeling and fabrication applications that demand editable, separable, and simulation-ready garments. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image. Given an input image, our method leverages vision-language models (VLMs) to normalize pose variations at the image level, then extracts pose-aware, 3D-informed garment features. These features are fused through a transformer-based encoder and used to predict sewing-pattern parameters, which can be directly applied to physical simulation, texture synthesis, and multi-layer virtual try-on. Extensive experiments show that our approach robustly recovers diverse sewing patterns and the corresponding 3D garments from in-the-wild images without requiring multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
Summary / 总结
This paper addresses the challenge of generating sewing patterns and 3D garments from diverse in-the-wild images. It introduces DressWild, a feed-forward method that uses vision-language models to normalize pose variations and extract 3D-informed garment features. These features are then used to predict sewing pattern parameters, which can be directly applied to physical simulation and texture synthesis. Experiments show that DressWild can robustly generate diverse sewing patterns and 3D garments without needing multi-view inputs or iterative optimization, providing an efficient and scalable solution for realistic garment simulation and animation.
This paper addresses the challenge of generating sewing patterns and 3D garments from diverse in-the-wild images. It proposes DressWild, a feed-forward method that uses vision-language models to normalize pose variations and extract 3D-informed garment features. These features are then used to predict sewing-pattern parameters that can be applied directly to physical simulation and texture synthesis. Experiments show that DressWild robustly generates diverse sewing patterns and 3D garments without multi-view inputs or iterative optimization, offering an efficient and scalable solution for realistic garment simulation and animation.
Fast and Scalable Analytical Diffusion
Authors: Xinyi Shang, Peng Sun, Jingyu Lin, Zhiqiang Shen
First: 2026-02-18T14:41:09+00:00 · Latest: 2026-02-18T14:41:09+00:00
Abstract
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation necessitates a full-dataset scan at every timestep, scaling linearly with dataset size. In this work, we present the first systematic study addressing this scalability bottleneck. We challenge the prevailing assumption that the entire training data is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the ''Golden Subset'' for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a $\bf 71 \times$ speedup on AFHQ while matching or achieving even better performance than full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
Chinese Title / Abstract
Title: Fast and Scalable Analytical Diffusion
Analytical diffusion models offer a mathematically transparent path to generative modeling by formulating the denoising score as an empirical-Bayes posterior mean. However, this interpretability comes at a prohibitive cost: the standard formulation requires a full-dataset scan at every timestep, scaling linearly with dataset size. In this paper, we present the first systematic study of this scalability bottleneck. We challenge the prevailing assumption that the entire training set is necessary, uncovering the phenomenon of Posterior Progressive Concentration: the effective golden support of the denoising score is not static but shrinks asymptotically from the global manifold to a local neighborhood as the signal-to-noise ratio increases. Capitalizing on this, we propose Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), a training-free framework that decouples inference complexity from dataset size. Instead of static retrieval, GoldDiff uses a coarse-to-fine mechanism to dynamically pinpoint the "Golden Subset" for inference. Theoretically, we derive rigorous bounds guaranteeing that our sparse approximation converges to the exact score. Empirically, GoldDiff achieves a 71x speedup on AFHQ while matching or even exceeding the performance of full-scan baselines. Most notably, we demonstrate the first successful scaling of analytical diffusion to ImageNet-1K, unlocking a scalable, training-free paradigm for large-scale generative modeling.
Summary / 总结
This work addresses the scalability issue of analytical diffusion models by proposing Dynamic Time-Aware Golden Subset Diffusion (GoldDiff), which dynamically identifies a 'Golden Subset' for inference, reducing the need for a full-dataset scan. The method leverages the phenomenon of Posterior Progressive Concentration, where the effective support of the denoising score shrinks as the signal-to-noise ratio increases. Empirically, GoldDiff achieves a 71 times speedup on AFHQ while maintaining or even improving performance compared to full-scan baselines, and successfully scales analytical diffusion to ImageNet-1K for large-scale generative modeling.
This work addresses the scalability of analytical diffusion models by proposing Dynamic Time-Aware Golden Subset Diffusion (GoldDiff). It exploits Posterior Progressive Concentration: as the signal-to-noise ratio increases, the effective support of the denoising score shrinks from the global manifold to a local neighborhood. GoldDiff dynamically selects a "Golden Subset" for inference, achieving a 71x speedup on AFHQ while matching or exceeding full-scan baselines. Most notably, it scales analytical diffusion to ImageNet-1K for the first time, opening a scalable, training-free framework for large-scale generative modeling.
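The empirical-Bayes posterior mean and the subset approximation it motivates can be sketched directly: under x_t = x_0 + σ·ε, the posterior mean is a softmax-weighted average of the training points, and at low noise almost all of that weight concentrates on a few neighbors of x_t. The nearest-neighbor subset rule below is a simplified stand-in for GoldDiff's coarse-to-fine mechanism, and the Gaussian toy data is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_mean(x_t, data, sigma, subset=None):
    """Empirical-Bayes posterior mean E[x0 | x_t] for x_t = x0 + sigma*eps;
    the denoising score is then (posterior_mean(x_t) - x_t) / sigma**2."""
    pts = data if subset is None else data[subset]
    logits = -np.sum((x_t - pts) ** 2, axis=1) / (2.0 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ pts

def golden_subset(x_t, data, k):
    """Coarse selection step: keep the k nearest points, which hold almost
    all posterior weight at low noise (Posterior Progressive Concentration)."""
    return np.argsort(np.sum((x_t - data) ** 2, axis=1))[:k]

data = rng.normal(size=(5000, 8))            # toy "dataset"
x_t = data[0] + 0.1 * rng.normal(size=8)     # low-noise (high-SNR) sample
sub = golden_subset(x_t, data, k=32)
full = posterior_mean(x_t, data, sigma=0.1)
approx = posterior_mean(x_t, data, sigma=0.1, subset=sub)
gap = np.linalg.norm(full - approx)          # sparse approximation error
```

The sketch evaluates 32 points instead of 5,000 yet matches the full-scan posterior mean to numerical precision at this noise level, which is the mechanism behind the paper's speedup: the excluded points carry exponentially negligible weight.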
SurgRAW: Multi-Agent Workflow with Chain of Thought Reasoning for Robotic Surgical Video Analysis
Authors: Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhu Zhuo, Zhitao Zeng, Evangelos B. Mazomenos, Yueming Jin
Venue: IEEE Robotics and Automation Letters, 2026, pp. 1-8
First: 2025-03-13T11:23:13+00:00 · Latest: 2026-02-18T14:35:21+00:00
Abstract
Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of RAS scene. Vision-Language Models (VLMs) offer strong zero-shot reasoning, but struggle with hallucinations, domain gaps and weak task-interdependency modeling. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned Chain-of-Thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow where an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or ground output clinically. Specifically, we propose a panel discussion mechanism to ensure task-specific agents collaborate synergistically and leverage on task interdependencies. Similarly, we incorporate a retrieval-augmented generation module to enrich agents with surgical knowledge and alleviate domain gaps in general VLMs. We design task-specific CoT prompts grounded in surgical domain to ensure clinically aligned reasoning, reduce hallucinations and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% accuracy. The dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git .
Chinese Title / Abstract
Title: SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Robotic Surgical Video Analysis
Robotic-assisted surgery (RAS) is central to modern surgery, driving the need for intelligent systems with accurate scene understanding. Most existing surgical AI methods rely on isolated, task-specific models, leading to fragmented pipelines with limited interpretability and no unified understanding of the RAS scene. Vision-language models (VLMs) offer strong zero-shot reasoning but struggle with hallucinations, domain gaps, and weak modeling of task interdependencies. To address the lack of unified data for RAS scene understanding, we introduce SurgCoTBench, the first reasoning-focused benchmark in RAS, covering 14,256 QA pairs with frame-level annotations across five major surgical tasks. Building on SurgCoTBench, we propose SurgRAW, a clinically aligned chain-of-thought (CoT) driven agentic workflow for zero-shot multi-task reasoning in surgery. SurgRAW employs a hierarchical reasoning workflow in which an orchestrator divides surgical scene understanding into two reasoning streams and directs specialized agents to generate task-level reasoning, while higher-level agents capture workflow interdependencies or clinically ground the output. Specifically, we propose a panel-discussion mechanism to ensure task-specific agents collaborate synergistically and exploit task interdependencies, and we incorporate a retrieval-augmented generation module to enrich the agents with surgical knowledge and alleviate the domain gaps of general VLMs. We design task-specific CoT prompts grounded in the surgical domain to ensure clinically aligned reasoning, reduce hallucinations, and enhance interpretability. Extensive experiments show that SurgRAW surpasses mainstream VLMs and agentic systems and outperforms a supervised model by 14.61% in accuracy. The dataset and code are available at https://github.com/jinlab-imvr/SurgRAW.git
Summary / 总结
The paper introduces SurgCoTBench, the first reasoning-focused benchmark for robotic-assisted surgery (RAS), and proposes SurgRAW, a multi-agent workflow with chain-of-thought reasoning to address the limitations of isolated task-specific models. SurgRAW employs a hierarchical reasoning approach, where an orchestrator directs specialized agents to generate task-level reasoning and higher-level agents capture interdependencies, reducing hallucinations and improving interpretability. Experiments show that SurgRAW outperforms mainstream vision-language models and exceeds a supervised model by 14.61% in accuracy.
The study introduces the reasoning-focused SurgCoTBench benchmark and the SurgRAW multi-agent workflow to meet the need for unified understanding in RAS scene analysis. SurgRAW adopts a hierarchical reasoning workflow in which a panel-discussion mechanism lets task-specific agents collaborate and a retrieval-augmented generation module enriches them with surgical knowledge. Experiments show that SurgRAW surpasses mainstream vision-language models on RAS scene understanding tasks and outperforms a supervised model by 14.61% in accuracy.
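The orchestrator-and-agents structure described above can be pictured with a deliberately tiny dispatch sketch. The agent names, canned answers, and the "repeat a round so agents see each other's outputs" loop are all assumptions standing in for SurgRAW's VLM-backed agents and its panel-discussion mechanism, not the paper's implementation.

```python
# Toy specialized agents; real SurgRAW agents are VLM-backed and reason over frames.
def instrument_agent(frame, context):
    # A real agent would analyze the frame; here we return a canned finding.
    return {"task": "instrument", "answer": f"forceps in {frame}"}

def action_agent(frame, context):
    # In the panel-discussion round, `context` would let this agent condition
    # on the instrument agent's finding (unused in this toy).
    return {"task": "action", "answer": f"dissection in {frame}"}

AGENTS = {"instrument": instrument_agent, "action": action_agent}

def orchestrator(frame, tasks, rounds=2):
    """Hierarchical workflow sketch: dispatch each task to its specialized
    agent, then run another round in which agents can see each other's
    outputs (a crude stand-in for the panel-discussion mechanism)."""
    context = {}
    for _ in range(rounds):
        context = {t: AGENTS[t](frame, context) for t in tasks}
    return context

out = orchestrator("frame_017", ["instrument", "action"])
```

The point of the second round is task interdependency: an action classifier that knows which instrument is present can rule out implausible actions, which is the kind of cross-task signal the paper's panel discussion exploits.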
Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
First: 2026-02-18T13:40:53+00:00 · Latest: 2026-02-18T13:40:53+00:00
Abstract
While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
Chinese Title / Abstract
Title: Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
While large vision-language models (LVLMs) excel at textual reasoning and self-correction, these strengths offer little benefit for complex tasks centered on visual perception, such as chart parsing. Existing models often struggle with visually dense charts, leading to data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to stay accurate when reading complex charts, we propose a new paradigm called Visual Self-Refine (VSR). The core idea of VSR is to let a model generate pixel-level localization outputs, visualize them, and feed the visualizations back to itself, so it can intuitively inspect and correct its own potential visual perception errors. We instantiate VSR in the chart-parsing domain with ChartVSR, which decomposes parsing into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' pixel-level localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To overcome the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging chart-parsing benchmark. Our work further highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for improving accuracy across a wide range of vision-centric tasks.
Summary / 总结
The paper addresses the challenge of accurate chart parsing by proposing Visual Self-Refine (VSR), a new paradigm that enables models to generate and visualize pixel-level localization outputs, which are then fed back to the model for self-correction. The method decomposes the parsing process into a Refine Stage and a Decode Stage, ensuring the accuracy of data points and using these localizations as visual anchors for parsing. The authors demonstrate the effectiveness of ChartVSR, their instantiation of VSR in the domain of chart parsing, and introduce ChartP-Bench, a new benchmark to evaluate chart parsing models, showing significant improvements in accuracy over existing methods.
The work improves chart-parsing accuracy by addressing the limitations of existing models on visually dense charts. The proposed Visual Self-Refine (VSR) paradigm lets a model generate pixel-level localization outputs, visualize them, and use the visualizations to correct its own potential errors. ChartVSR, the instantiation of VSR for chart parsing, decomposes the process into a Refine Stage, which uses visual feedback to verify the pixel-level localizations of all data points, and a Decode Stage, which uses these verified localizations as precise visual anchors to parse the final structured data. The work also introduces ChartP-Bench, a new chart-parsing benchmark, and highlights VSR as an effective general-purpose visual feedback mechanism.
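The refine-then-decode loop can be made concrete with a toy 1-D "chart": noisy first-pass localizations are iteratively snapped to the strongest nearby pixel (the self-inspection step), and only the verified anchors are used to read off values. The canvas, bar positions, and snapping rule are assumptions for illustration; the real ChartVSR performs this loop with an LVLM on rendered chart images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "chart": bar values rendered at known pixel positions
# (stand-ins for real charts; positions and values are assumptions).
true_px = np.array([10, 30, 50, 70])
values = np.array([3.0, 7.0, 5.0, 9.0])
canvas = np.zeros(80)
canvas[true_px] = values

def refine(pred_px, canvas, n_iters=3, window=4):
    """Refine Stage sketch: 'visualize' each predicted localization by
    inspecting the canvas around it, then snap to the strongest nearby pixel."""
    pred = pred_px.copy()
    for _ in range(n_iters):
        for i, p in enumerate(pred):
            lo, hi = max(0, p - window), min(len(canvas), p + window + 1)
            pred[i] = lo + int(np.argmax(canvas[lo:hi]))
    return pred

def decode(pred_px, canvas):
    """Decode Stage sketch: read structured values at the verified anchors."""
    return canvas[pred_px]

noisy = true_px + rng.integers(-3, 4, size=4)   # imperfect first-pass localizations
refined = refine(noisy, canvas)
parsed = decode(refined, canvas)
```

Decoding only after the localizations are verified is the core of the paradigm: errors are caught at the cheap, checkable localization step rather than propagating into the final structured output.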