PANC: Prior-Aware Normalized Cut for Object Segmentation
Authors: Juan Gutiérrez, Victor Gutiérrez-Garcia, José Luis Blanco-Murillo
First: 2026-02-06T18:07:20+00:00 · Latest: 2026-02-06T18:07:20+00:00
Abstract
Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics.
We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Building on the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality.
Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). It is particularly effective in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on the CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
中文标题/摘要
标题:PANC:先验归一化切分用于对象分割
完全无监督的分割管道通常会寻找最显眼的对象,如果存在的话。因此,文献中报道的大多数方法会生成非确定性的分区,这些分区对初始化、种子顺序和阈值启发式方法敏感。
我们提出了一种弱监督的谱分割框架PANC,该框架使用少量注释的视觉标记来生成稳定、可控和可重复的对象掩码。从TokenCut方法出发,我们通过将少量先验与锚节点结合来增强标记-标记亲和图。通过操纵图的拓扑结构,我们偏向于与注释一致的谱特征空间。我们的方法保留了由密集自监督视觉特征强制执行的全局分组,用注释的标记换取更高的可重复性、用户控制和分割质量。
使用每数据集5到30个注释,我们的无需训练方法在标准基准(如DUTS-TE、ECSSD、MS COCO)上实现了弱监督和无监督方法中的最佳性能。在密集标签成本高或类内差异细微的领域,它表现出色。我们在同质、细粒度和纹理受限领域报告了强而可靠的结果,分别在CrackForest (CFD)、CUB-200-2011和HAM10000数据集上实现了96.8%(+14.43%超过SotA)、78.0%(+0.2%)和78.8%(+0.37%)的平均交并比(mIoU)。对于多对象基准,该框架展示了明确、用户可控的语义分割。
Summary / 总结
PANC is a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable and controllable object masks. By augmenting the token-token affinity graph with priors, PANC biases the spectral eigenspace towards partitions consistent with the annotations, while preserving global grouping enforced by dense self-supervised visual features. On standard benchmarks, PANC achieves state-of-the-art performance with 5 to 30 annotations per dataset, reporting 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mIoU on the CrackForest, CUB-200-2011, and HAM10000 datasets, respectively.
PANC 是一种弱监督光谱分割框架,使用少量注释的视觉标记来生成稳定且可控的对象掩码。通过在标记标记图中添加先验知识,PANC 将谱特征空间偏向与注释一致的分割,同时保留由密集自监督视觉特征强制执行的全局分组。该方法在每个数据集使用 5 到 30 个注释时实现了最先进的性能,并在同质性、细粒度和纹理受限领域展示了强大的结果,分别在 CrackForest (CFD)、CUB-200-2011 和 HAM10000 数据集上实现了 96.8%、78.0% 和 78.8% 的平均交并比 (mIoU)。
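The graph manipulation described in the PANC abstract can be illustrated with a small sketch. The abstract does not specify how priors are coupled to anchor nodes, so everything below is an assumption for illustration: the function name `anchored_ncut`, the two extra anchor nodes (one foreground, one background), the `prior_weight` value, and the TokenCut-style thresholded affinity are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def anchored_ncut(features, fg_idx, bg_idx, tau=0.2, eps=1e-5, prior_weight=10.0):
    """Bipartition tokens with a normalized cut biased by two anchor nodes."""
    n = features.shape[0]
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Thresholded affinity: strong edge if cosine similarity exceeds tau.
    w = np.full((n + 2, n + 2), eps)
    w[:n, :n] = np.where(unit @ unit.T > tau, 1.0, eps)
    # Assumed prior coupling: node n is a foreground anchor, node n + 1 a
    # background anchor; annotated tokens get a heavy edge to their anchor.
    for anchor, idx in ((n, fg_idx), (n + 1, bg_idx)):
        w[anchor, idx] = prior_weight
        w[idx, anchor] = prior_weight
    # Symmetric normalized Laplacian; its second eigenvector, rescaled by
    # D^{-1/2}, solves the relaxed normalized-cut problem.
    d_inv_sqrt = 1.0 / np.sqrt(w.sum(axis=1))
    lap = np.eye(n + 2) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)
    fiedler = d_inv_sqrt * vecs[:, 1]
    # Foreground = tokens on the same side of the cut as the foreground anchor.
    return np.sign(fiedler[:n]) == np.sign(fiedler[n])

# Toy features: tokens 0-3 form one group, tokens 4-7 another.
feats = np.array([[1, 0], [1, 0.05], [0.95, 0], [1, -0.05],
                  [0, 1], [0.05, 1], [0, 0.95], [-0.05, 1]], dtype=float)
mask = anchored_ncut(feats, fg_idx=[0], bg_idx=[4])
flipped = anchored_ncut(feats, fg_idx=[4], bg_idx=[0])
```

In this toy setup, swapping the single foreground and background annotation flips which group is returned as the object, which is the kind of user control the abstract refers to. In unanchored spectral bipartitioning, by contrast, the sign of the eigenvector is arbitrary.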
Prompt Reinjection: Alleviating Prompt Forgetting in Multimodal Diffusion Transformers
Authors: Yuxuan Yao, Yuxuan Chen, Hui Li, Kaihui Cheng, Qipeng Guo, Yuwei Sun, Zilong Dong, Jingdong Wang, Siyu Zhu
First: 2026-02-06T17:19:53+00:00 · Latest: 2026-02-06T17:19:53+00:00
Comments: 18 pages
Abstract
Multimodal Diffusion Transformers (MMDiTs) for text-to-image generation maintain separate text and image branches, with bidirectional information flow between text tokens and visual latents throughout denoising. In this setting, we observe a prompt forgetting phenomenon: the semantics of the prompt representation in the text branch are progressively forgotten as depth increases. We further verify this effect on three representative MMDiTs (SD3, SD3.5, and FLUX.1) by probing linguistic attributes of the representations over the layers in the text branch. Motivated by these findings, we introduce a training-free approach, prompt reinjection, which reinjects prompt representations from early layers into later layers to alleviate this forgetting. Experiments on GenEval, DPG, and T2I-CompBench++ show consistent gains in instruction-following capability, along with improvements on metrics capturing preference, aesthetics, and overall text-image generation quality.
中文标题/摘要
标题:提示重注入:多模态扩散变换器中的提示遗忘缓解
多模态扩散变换器(MMDiTs)在文本到图像生成中保持独立的文本和图像分支,在去噪过程中文本令牌和视觉潜在变量之间存在双向信息流。在这种设置中,我们观察到一种提示遗忘现象:随着深度增加,文本分支中提示表示的语义逐渐被遗忘。我们进一步通过在文本分支的各层中探测表示的语言属性,验证了这种效果。受这些发现的启发,我们提出了一种无需训练的方法——提示重注入,该方法将早期层中的提示表示重新注入到后续层中,以缓解这种遗忘现象。在GenEval、DPG和T2I-CompBench++上的实验显示,在指令遵循能力方面以及在偏好、美学和整体文本-图像生成质量方面均取得了一致的改进。
Summary / 总结
The study addresses the prompt forgetting issue in Multimodal Diffusion Transformers (MMDiTs) where the semantic information of the prompt representation in the text branch is gradually lost as the depth increases. To mitigate this, a training-free method called prompt reinjection is proposed, which reintroduces prompt representations from earlier layers into later layers. The method shows consistent improvements in instruction-following capability and enhances text-image generation quality on various benchmarks including GenEval, DPG, and T2I-CompBench++.
研究关注多模态扩散变换器(MMDiTs)中提示遗忘的问题,即文本分支中的提示表示的语义信息会随着深度增加而逐渐丢失。通过在各层中探测语言属性,作者在SD3、SD3.5和FLUX.1中验证了这一现象。为了解决这一问题,他们提出了一种无需训练的方法——提示重注入,该方法将早期层的提示表示重新注入到后续层,有效缓解了遗忘现象。实验结果表明,在GenEval、DPG和T2I-CompBench++上,该方法在指令遵循能力和偏好、美学等质量指标上均取得了持续的改进。
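The control flow of prompt reinjection can be sketched with a toy stack of layers. The abstract does not say how the early representation is merged into later layers, so the additive blend, the `alpha` weight, and the names `run_text_branch`, `source_layer`, and `target_layers` are assumptions made for illustration. The "layers" here simply halve the signal to mimic forgetting; they are not MMDiT blocks.

```python
import numpy as np

def run_text_branch(prompt_emb, layers, source_layer=None, target_layers=(), alpha=1.0):
    """Run a stack of layers, optionally re-adding a cached early output later on."""
    h = prompt_emb
    cached = None
    for i, layer in enumerate(layers):
        if cached is not None and i in target_layers:
            h = h + alpha * cached  # reinject the cached early representation
        h = layer(h)
        if i == source_layer:
            cached = h  # remember the output of the chosen early layer
    return h

# Toy "forgetting": every layer halves the prompt signal.
layers = [lambda h: 0.5 * h] * 6
prompt = np.ones(4)
baseline = run_text_branch(prompt, layers)
reinjected = run_text_branch(prompt, layers, source_layer=0, target_layers=(4, 5))
```

Because no parameters are updated, a scheme of this shape stays training-free, which matches the abstract's description.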
An Evaluation of Hybrid Annotation Workflows on High-Ambiguity Spatiotemporal Video Footage
Authors: Juan Gutiérrez, Victor Gutiérrez, Ángel Mora, Silvia Rodriguez, José Luis Blanco
First: 2025-10-20T16:10:11+00:00 · Latest: 2026-02-06T16:18:07+00:00
Abstract
Manual annotation remains the gold standard for high-quality, dense temporal video datasets, yet it is inherently time-consuming. Vision-language models can aid human annotators and expedite this process. We report on the impact of automatic pre-annotations from a tuned encoder on a human-in-the-loop labeling workflow for video footage. Quantitative analysis of a single-iteration test involving 18 volunteers shows that our workflow reduced annotation time by 35% for the majority (72%) of the participants. Beyond efficiency, we provide a rigorous framework for benchmarking AI-assisted workflows that quantifies trade-offs between algorithmic speed and the integrity of human verification.
中文标题/摘要
标题:混合标注流程在高歧义时空视频片段上的评估
手动标注仍然是高质量密集时间视频数据集的黄金标准,但过程本身是固有的耗时。视觉-语言模型可以辅助人类标注员并加快这一过程。我们报告了自动预标注对视频片段的人机协作标注流程的影响。一项涉及18名志愿者的单迭代测试研究表明,我们的流程使大多数(72%)参与者的数据标注时间减少了35%。除了提高效率,我们还提供了一个严格的框架来评估AI辅助流程,该框架量化了算法速度与人类验证完整性之间的权衡。
Summary / 总结
The study evaluates the impact of hybrid annotation workflows using automatic pre-annotations from a tuned encoder on a human-in-the-loop labeling process for high-ambiguity spatiotemporal video footage. The workflow reduced annotation time by 35% for 72% of the participants, demonstrating improved efficiency. Additionally, the research provides a framework to benchmark AI-assisted workflows, quantifying the trade-offs between algorithmic speed and the quality of human verification.
研究评估了使用调优编码器提供的自动预注释的混合注释工作流对人类在环标注时空视频片段的影响。结果显示,72%的参与者相比手动注释,注释时间减少了35%,突出了效率的提升。此外,研究还引入了一个框架来评估AI辅助工作流,平衡算法速度与人类验证的准确性。
POP: Online Structural Pruning Enables Efficient Inference of Large Foundation Models
Authors: Yi Chen, Wonjin Shin, Shuhong Liu, Tho Mai, Jeongmo Lee, Chuanbo Hua, Kun Wang, Jun Liu, Joo-Young Kim
First: 2026-02-06T16:07:42+00:00 · Latest: 2026-02-06T16:07:42+00:00
Abstract
Large foundation models (LFMs) achieve strong performance through scaling, yet current structural pruning methods derive fixed pruning decisions during inference, overlooking sparsity patterns that emerge in the autoregressive token generation. In this paper, we propose POP (Partition-guided Online Pruning), an efficient online structural pruning framework that enables context-conditioned dynamic pruning with minimal computational overhead. POP partitions model channels into retained, candidate, and pruned regions, where prefilling defines a coarse pruning partition, and the decoding stage generates a fine-grained mask within the candidate region, avoiding full-channel re-evaluation. The coarse pruning partition preserves consistently important weights, while the fine-grained masking provides context-conditioned variation during decoding. Moreover, POP is a lightweight, plug-and-play method that requires no preprocessing, including offline calibration, retraining, or learning predictors. Extensive evaluations across diverse LFMs, including large language models (LLMs), mixture-of-experts models (MoEs), and vision-language models (VLMs), demonstrate that POP consistently delivers higher accuracy than existing pruning approaches while incurring smaller computational overhead and minimizing inference latency.
中文标题/摘要
标题:POP:在线结构剪枝使大型基础模型高效推理成为可能
大型基础模型(LFMs)通过扩展规模实现了强大的性能,但当前的结构剪枝方法在推理过程中会做出固定的剪枝决策,忽视了自回归标记生成中出现的稀疏模式。在本文中,我们提出了POP(基于分区的在线剪枝),这是一种高效的在线结构剪枝框架,能够实现上下文条件下的动态剪枝,同时具有最小的计算开销。POP将模型通道划分为保留、候选和剪枝区域,预填充定义了一个粗略的剪枝分区,解码阶段在候选区域内生成一个精细的掩码,避免了全通道重新评估。粗略的剪枝分区保留了始终重要的权重,而精细的掩码则在解码过程中提供了上下文条件下的变化。此外,POP是一种轻量级、即插即用的方法,不需要预处理,包括离线校准、重新训练或学习预测器。在包括大型语言模型(LLMs)、混合专家模型(MoEs)和视觉语言模型(VLMs)在内的多种LFMs上的广泛评估表明,POP在计算开销更小、推理延迟更短的情况下,始终能提供比现有剪枝方法更高的准确性。
Summary / 总结
POP is an online structural pruning framework that dynamically prunes large foundation models during inference to reduce computational overhead while maintaining accuracy. It partitions model channels into retained, candidate, and pruned regions, using prefilling to define a coarse pruning partition and decoding to generate a fine-grained mask within the candidate region. Extensive evaluations show that POP outperforms existing pruning methods in accuracy and computational efficiency across large language models, mixture-of-experts models, and vision-language models.
POP 是一种在线结构剪枝框架,在推理过程中基于上下文动态剪枝大型基础模型,同时减少计算开销。它将模型通道划分为保留、候选和剪枝区域,使用预填充进行粗粒度剪枝,并在解码阶段进行细粒度掩码。POP 在各种模型类型(包括大语言模型、专家混合模型和视觉-语言模型)中表现出更高的准确率,同时减少计算开销和推理延迟。
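The two-stage partitioning described for POP can be sketched as follows. The importance score (mean absolute activation), the fractions, and the helper names `prefill_partition` and `decode_mask` are assumptions for illustration; the abstract does not state which criterion POP uses.

```python
import numpy as np

def prefill_partition(prefill_acts, keep_frac=0.25, prune_frac=0.25):
    """Split channels into retained / candidate / pruned using prefill statistics."""
    score = np.abs(prefill_acts).mean(axis=0)   # assumed importance proxy
    order = np.argsort(-score)                  # most important channel first
    n = score.size
    n_keep, n_prune = int(n * keep_frac), int(n * prune_frac)
    return order[:n_keep], order[n_keep:n - n_prune], order[n - n_prune:]

def decode_mask(token_act, retained, candidate, n_candidate_keep):
    """Per-token mask: retained channels plus the top candidates for this token."""
    mask = np.zeros(token_act.size, dtype=bool)
    mask[retained] = True
    # Only the candidate region is re-scored, not the full channel set.
    cand_score = np.abs(token_act[candidate])
    mask[candidate[np.argsort(-cand_score)[:n_candidate_keep]]] = True
    return mask

prefill = np.tile(np.array([8, 7, 6, 5, 4, 3, 2, 1], dtype=float), (3, 1))
retained, candidate, pruned = prefill_partition(prefill)
token = np.array([0, 0, 0.1, 0.9, 0.8, 0.2, 5, 5], dtype=float)
mask = decode_mask(token, retained, candidate, n_candidate_keep=2)
```

In this example channels 6 and 7 stay masked out even though the current token activates them strongly, because the coarse prefill partition already placed them in the pruned region. Only the candidate channels are re-evaluated at each decoding step.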
DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2026-02-06T15:25:39+00:00
Comments: This work has been submitted to the IEEE for possible publication
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Project website: https://darkeqa-benchmark.github.io/
中文标题/摘要
标题:DarkEQA:在低光室内环境中评估视觉语言模型的实体问答能力
视觉语言模型(VLMs)越来越多地被用作实体代理的核心推理模块。现有的基准测试在理想的、光线充足的条件下评估它们的能力,但全天候24/7运行需要在广泛的视觉退化条件下表现出色,包括夜间或黑暗环境中的低光条件——这一核心需求已被很大程度上忽视。为了解决这一未被充分探索的挑战,我们提出了DarkEQA,这是一个开源基准,用于在多级低光条件下评估与实体问答(EQA)相关的感知基本能力。DarkEQA通过在受控退化条件下从第一人称观察中进行问答评估,隔离了感知瓶颈,使可归因的鲁棒性分析成为可能。DarkEQA的一个关键设计特点是其物理保真度:视觉退化在线性RAW空间中建模,模拟基于物理的照明下降和传感器噪声,随后是ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强(LLIE)模型,展示了DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的操作限制。项目网站:https://darkeqa-benchmark.github.io/
Summary / 总结
DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) under low-light conditions, addressing the underexplored challenge of robust 24/7 operation. It isolates the perception bottleneck by degrading egocentric observations and evaluates a wide range of VLMs and Low-Light Image Enhancement (LLIE) models, revealing their limitations in such conditions. The benchmark models visual degradations in linear RAW space, simulating physics-based illumination drops and sensor noise, and demonstrates the need for improved robustness in low-light environments.
DarkEQA 是一个基准,旨在评估视觉-语言模型在低光条件下的表现,解决了24/7稳健运行的未充分探索挑战。它通过降级第一人称观察来隔离感知瓶颈,并使用物理上忠实的模型模拟低光环境。关键发现表明,当前的视觉-语言模型在低光条件下进行问题回答时存在局限性。
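A rough sketch of the kind of degradation pipeline the DarkEQA abstract describes (illumination drop and sensor noise applied in linear space, then a simple rendering step) is given below. It is not the benchmark's pipeline: the gamma-2.2 approximation of the sRGB curve, the Poisson-plus-Gaussian noise model, and the parameters `illum_scale`, `full_well`, and `read_noise` are assumptions chosen for illustration.

```python
import numpy as np

def simulate_low_light(srgb, illum_scale=0.1, full_well=1000.0, read_noise=2.0, seed=0):
    """Darken an image in linear space, add sensor noise, and re-render it."""
    rng = np.random.default_rng(seed)
    linear = np.clip(srgb, 0.0, 1.0) ** 2.2               # approximate inverse gamma
    electrons = linear * illum_scale * full_well           # illumination drop
    noisy = rng.poisson(electrons) + rng.normal(0.0, read_noise, electrons.shape)
    raw = np.clip(noisy / full_well, 0.0, 1.0)             # back to a [0, 1] signal
    return raw ** (1.0 / 2.2)                              # minimal ISP: gamma only

img = np.full((16, 16, 3), 0.8)
dark = simulate_low_light(img)
```

Applying the noise before the gamma step matters: shot noise scales with the photon count in linear space, so darkening an already rendered image would give a different and less physical noise distribution.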
Not All Layers Need Tuning: Selective Layer Restoration Recovers Diversity
Authors: Bowen Zhang, Meiyi Wang, Harold Soh
First: 2026-02-06T12:49:13+00:00 · Latest: 2026-02-06T12:49:13+00:00
Comments: 16 pages, 7 figures, 12 tables
Abstract
Post-training improves instruction-following and helpfulness of large language models (LLMs) but often reduces generation diversity, which leads to repetitive outputs in open-ended settings, a phenomenon known as mode collapse. Motivated by evidence that LLM layers play distinct functional roles, we hypothesize that mode collapse can be localized to specific layers and that restoring a carefully chosen range of layers to their pre-trained weights can recover diversity while maintaining high output quality. To validate this hypothesis and decide which layers to restore, we design a proxy task, Constrained Random Character (CRC), with an explicit validity set and a natural diversity objective. Results on CRC reveal a clear diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Based on these findings, we propose Selective Layer Restoration (SLR), a training-free method that restores selected layers in a post-trained model to their pre-trained weights, yielding a hybrid model with the same architecture and parameter count, incurring no additional inference cost. Across three different tasks (creative writing, open-ended question answering, and multi-step reasoning) and three different model families (Llama, Qwen, and Gemma), we find SLR can consistently and substantially improve output diversity while maintaining high output quality.
中文标题/摘要
标题:并非所有层都需要调优:选择性层恢复恢复多样性
后训练可以提高大型语言模型(LLM)的指令遵循能力和帮助性,但通常会减少生成多样性,导致在开放式设置中出现重复输出的现象,称为模式崩溃。受LLM层执行不同功能角色的证据启发,我们假设模式崩溃可以局限于特定层,并且通过恢复精心选择的层到预训练权重可以恢复多样性同时保持高质量输出。为了验证这一假设并决定恢复哪些层,我们设计了一个代理任务——约束随机字符(CRC),具有明确的有效性集和自然多样性目标。CRC上的结果揭示了恢复范围内的多样性-有效性权衡,并确定了增加多样性同时最小化质量损失的配置。基于这些发现,我们提出了选择性层恢复(SLR)方法,这是一种无需训练的方法,将后训练模型中选定的层恢复到预训练权重,生成具有相同架构和参数数量的混合模型,不增加推理成本。在三种不同任务(创造性写作、开放式问答和多步推理)和三种不同模型家族(Llama、Qwen和Gemma)上,我们发现SLR可以一致且显著地提高输出多样性同时保持高质量。
Summary / 总结
The research addresses mode collapse in large language models (LLMs) after post-training, which leads to repetitive outputs. The authors hypothesize that mode collapse is localized to specific layers and propose Selective Layer Restoration (SLR), a training-free method that restores selected layers to their pre-trained weights to recover diversity. Results on a proxy task (Constrained Random Character) reveal a diversity-validity trade-off across restoration ranges and identify configurations that increase diversity with minimal quality loss. Applied to three tasks and three model families, SLR consistently improves output diversity while maintaining high quality.
研究旨在解决大型语言模型(LLMs)在后训练后出现的模式塌陷问题,导致输出重复。通过设计一个名为约束随机字符(CRC)的代理任务,研究确定了负责模式塌陷的具体层,并提出选择性层恢复(SLR)方法,将这些层恢复到预训练状态。该方法在各种任务和模型家族中恢复了输出多样性,同时保持高质量,且不增加推理成本。
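The restoration step in SLR amounts to swapping a range of layers back to their pre-trained weights, which can be sketched with plain dictionaries standing in for model state dicts. The key pattern `layers.<index>.` and the function name `restore_layers` are hypothetical; real checkpoints use model-specific parameter names.

```python
import re

def restore_layers(post_trained, pre_trained, start, end):
    """Return a hybrid state dict with layers in [start, end) taken from pre_trained."""
    hybrid = dict(post_trained)
    for name, value in pre_trained.items():
        match = re.match(r"layers\.(\d+)\.", name)  # assumed key naming scheme
        if match and start <= int(match.group(1)) < end:
            hybrid[name] = value
    return hybrid

pre = {"embed.weight": "pre-embed",
       **{f"layers.{i}.weight": f"pre-{i}" for i in range(4)}}
post = {"embed.weight": "post-embed",
        **{f"layers.{i}.weight": f"post-{i}" for i in range(4)}}
hybrid = restore_layers(post, pre, start=0, end=2)
```

The hybrid has the same keys as the post-trained model, consistent with the abstract's point that architecture, parameter count, and inference cost are unchanged.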
Same Answer, Different Representations: Hidden instability in VLMs
Authors: Farooq Ahmad Wani, Alessandro Suglia, Rohit Saxena, Aryo Pradipta Gema, Wai-Chung Kwan, Fazl Barez, Maria Sofia Bucarelli, Fabrizio Silvestri, Pasquale Minervini
First: 2026-02-06T12:24:26+00:00 · Latest: 2026-02-06T12:24:26+00:00
Abstract
The robustness of Vision Language Models (VLMs) is commonly assessed through output-level invariance, implicitly assuming that stable predictions reflect stable multimodal processing. In this work, we argue that this assumption is insufficient. We introduce a representation-aware and frequency-aware evaluation framework that measures internal embedding drift, spectral sensitivity, and structural smoothness (spatial consistency of vision tokens), alongside standard label-based metrics. Applying this framework to modern VLMs across the SEEDBench, MMMU, and POPE datasets reveals three distinct failure modes. First, models frequently preserve predicted answers while undergoing substantial internal representation drift; for perturbations such as text overlays, this drift approaches the magnitude of inter-image variability, indicating that representations move to regions typically occupied by unrelated inputs despite unchanged outputs. Second, robustness does not improve with scale; larger models achieve higher accuracy but exhibit equal or greater sensitivity, consistent with sharper yet more fragile decision boundaries. Third, we find that perturbations affect tasks differently: they harm reasoning when they disrupt how models combine coarse and fine visual cues, but on the hallucination benchmarks, they can reduce false positives by making models generate more conservative answers.
中文标题/摘要
标题:相同答案,不同表示:VLMs 中隐藏的不稳定性
视觉语言模型(VLMs)的鲁棒性通常通过输出级不变性来评估,隐含地假设稳定的预测反映了稳定的多模态处理。本文中,我们认为这种假设是不足的。我们引入了一种基于表示和频率的评估框架,该框架衡量内部嵌入漂移、频谱敏感性和结构平滑性(视觉标记的空间一致性),同时使用标准标签基线指标。将该框架应用于 SEEDBench、MMMU 和 POPE 数据集中的现代 VLMs 表现出三种不同的失败模式。首先,模型在经历显著的内部表示漂移的同时经常保持预测答案;对于文本覆盖等扰动,这种漂移接近图像间变异性的量级,表明表示移动到通常由无关输入占据的区域,尽管输出未变。其次,鲁棒性并不随规模提高;更大规模的模型获得更高的准确率,但表现出相同或更大的敏感性,这与更尖锐但更脆弱的决策边界一致。第三,我们发现扰动对任务的影响不同:当它们破坏模型如何结合粗略和精细视觉线索时,它们损害了推理,但在幻觉基准测试中,它们可以通过使模型生成更保守的答案来减少假阳性。
Summary / 总结
This study challenges the assumption that stable predictions in VLMs reflect stable multimodal processing. It introduces a framework to evaluate internal embedding drift, spectral sensitivity, and structural smoothness alongside label-based metrics. The research reveals three failure modes: models preserve answers while undergoing significant internal representation changes, robustness does not improve with scale, and perturbations affect tasks differently, harming reasoning but reducing false positives in hallucination benchmarks.
这项工作质疑了VLMs中稳定预测反映稳定多模态处理的常见假设。它引入了一个新的评估框架,用于测量内部嵌入漂移、频谱敏感性和结构平滑性。研究揭示了三种失败模式:模型可以在内部表示发生显著变化的同时保持答案,模型规模的增加并不会提高鲁棒性,且不同的扰动对任务的影响不同,有时通过使模型生成更保守的答案来减少假阳性。
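The comparison between perturbation-induced drift and inter-image variability can be made concrete with a small sketch. The abstract does not give the paper's metric, so the use of cosine distance and the ratio computed by the hypothetical helper `relative_drift` are assumptions.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def relative_drift(clean, perturbed, unrelated):
    """Drift of the perturbed embedding relative to the mean inter-image distance."""
    drift = cosine_distance(clean, perturbed)
    inter_image = sum(cosine_distance(clean, u) for u in unrelated) / len(unrelated)
    return drift / inter_image

clean = [1.0, 0.0]
perturbed = [0.8, 0.6]                  # same image after a perturbation
unrelated = [[0.0, 1.0], [0.6, 0.8]]    # embeddings of other images
ratio = relative_drift(clean, perturbed, unrelated)
```

A ratio near 1 would mean the perturbed embedding has moved about as far as a typical unrelated image, which is the situation the abstract reports for perturbations such as text overlays, even when the predicted answer is unchanged.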
D-SCoRE: Document-Centric Segmentation and CoT Reasoning with Structured Export for QA-CoT Data Generation
Authors: Weibo Zhou, Lingbo Li, Shangsong Liang
First: 2025-08-02T10:45:05+00:00 · Latest: 2026-02-06T11:50:51+00:00
Abstract
The scarcity and high cost of high-quality domain-specific question-answering (QA) datasets limit supervised fine-tuning of large language models (LLMs). We introduce D-SCoRE, a training-free framework that leverages LLMs and prompt engineering to automatically generate diverse, rich QA datasets with Chain-of-Thought (CoT) from arbitrary textual sources. By integrating Document-centric processing, Segmentation, CoT Reasoning, and structured Export (along with multi-dimensional controls such as semantic role transformation, question type balancing, and counterfactual augmentation), D-SCoRE produces tailored QA pairs with enhanced diversity and relevance. LLMs fine-tuned on D-SCoRE-generated datasets outperform those trained on human-annotated QA data across most evaluated domains. Its efficiency and scalability enable rapid, high-performance domain-adaptive fine-tuning on consumer-grade hardware, generating over 1,100 high-quality QA pairs per GPU-hour end-to-end.
中文标题/摘要
标题:D-SCoRE: 文档中心化分割与CoT推理及结构化导出用于QA-CoT数据生成
高质量领域特定问答(QA)数据集的稀缺性和高成本限制了大型语言模型(LLMs)的监督微调。我们引入了**D-SCoRE**,一种无需训练的框架,利用LLMs和提示工程从任意文本源自动生成具有CoT(推理链)的多样化、丰富的QA数据集。通过整合文档中心化处理、分割、CoT推理和结构化导出——以及语义角色转换、问题类型平衡和反事实增强等多维度控制——D-SCoRE 生成了具有增强多样性和相关性的定制QA对。使用D-SCoRE生成的数据集微调的LLMs在大多数评估领域中优于使用人工标注QA数据训练的LLMs。其高效性和可扩展性使其能够在消费级硬件上快速、高性能地实现领域适应性微调,每GPU小时生成超过1,100个高质量的QA对。
Summary / 总结
D-SCoRE is a training-free framework that uses large language models and prompt engineering to automatically generate diverse QA datasets with Chain-of-Thought from arbitrary textual sources. By integrating document-centric processing, segmentation, CoT reasoning, and structured export, D-SCoRE produces QA pairs with enhanced diversity and relevance. Fine-tuned LLMs on D-SCoRE-generated datasets outperform those trained on human-annotated data across most domains, demonstrating its efficiency and scalability for rapid domain-adaptive fine-tuning.
D-SCoRE 是一个无需训练的框架,利用大型语言模型和提示工程从各种文本来源自动生成具有链式思考推理的 QA 数据集。通过集成文档中心处理、切分、链式思考推理和结构化导出,以及语义角色转换、问题类型平衡和反事实增强等控制,D-SCoRE 生成高质量的 QA 对。使用 D-SCoRE 生成的数据集微调的模型在大多数领域中优于使用人工标注数据微调的模型,展示了其高效性和可扩展性,能够在消费级硬件上快速进行高性能领域适应性微调,每 GPU 小时可生成超过 1,100 个高质量的 QA 对。
CauCLIP: Bridging the Sim-to-Real Gap in Surgical Video Understanding via Causality-Inspired Vision-Language Modeling
Authors: Yuxin He, An Li, Cheng Xue
First: 2026-02-06T11:23:17+00:00 · Latest: 2026-02-06T11:23:17+00:00
Abstract
Surgical phase recognition is a critical component for context-aware decision support in intelligent operating rooms, yet training robust models is hindered by limited annotated clinical videos and large domain gaps between synthetic and real surgical data. To address this, we propose CauCLIP, a causality-inspired vision-language framework that leverages CLIP to learn domain-invariant representations for surgical phase recognition without access to target domain data. Our approach integrates a frequency-based augmentation strategy to perturb domain-specific attributes while preserving semantic structures, and a causal suppression loss that mitigates non-causal biases and reinforces causal surgical features. These components are combined in a unified training framework that enables the model to focus on stable causal factors underlying surgical workflows. Experiments on the SurgVisDom hard adaptation benchmark demonstrate that our method substantially outperforms all competing approaches, highlighting the effectiveness of causality-guided vision-language models for domain-generalizable surgical video understanding.
中文标题/摘要
标题:CauCLIP: 通过因果启发的视觉语言建模在手术视频理解中弥合模拟到现实的差距
手术阶段识别是智能手术室情境感知决策支持的关键组成部分,但由于临床标注视频有限以及合成数据与真实手术数据之间的巨大领域差距,训练稳健模型受到阻碍。为了解决这一问题,我们提出了一种因果启发的视觉语言框架CauCLIP,该框架利用CLIP学习手术阶段识别的领域不变表示,而无需访问目标领域数据。我们的方法结合了基于频率的增强策略,以扰动领域特定属性同时保留语义结构,并引入因果抑制损失以减轻非因果偏差并强化因果手术特征。这些组件在统一的训练框架中结合,使模型能够关注手术工作流程下的稳定因果因素。在SurgVisDom硬适应基准测试上的实验表明,我们的方法显著优于所有竞争方法,突显了因果引导的视觉语言模型在领域泛化手术视频理解中的有效性。
Summary / 总结
The research aims to improve surgical phase recognition in intelligent operating rooms by addressing the challenges of limited annotated clinical videos and domain gaps between synthetic and real surgical data. CauCLIP, a causality-inspired vision-language framework, is proposed to learn domain-invariant representations without access to target domain data. It uses frequency-based augmentation to perturb domain-specific attributes and a causal suppression loss to mitigate non-causal biases. Experiments show that CauCLIP outperforms existing methods on the SurgVisDom benchmark, demonstrating the effectiveness of causality-guided models for domain-generalizable surgical video understanding.
研究旨在通过解决有限标注临床视频和合成与真实手术数据之间领域差距的问题,提高智能手术室中的手术阶段识别。提出的CauCLIP框架使用因果启发的视觉-语言模型,在无需目标领域数据的情况下学习领域不变的表示。它采用基于频率的增强策略来保留语义结构,并使用因果抑制损失来强化因果手术特征。实验表明,CauCLIP在SurgVisDom基准测试中显著优于现有方法,展示了因果引导模型在手术视频理解中的领域泛化能力。
SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs
Authors: Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti
First: 2026-02-06T10:05:25+00:00 · Latest: 2026-02-06T10:05:25+00:00
Abstract
Despite recent successes, test-time scaling (i.e., dynamically expanding the token budget during inference as needed) remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing the total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses "thinking with images" by 4.6 points on a challenging OOD task despite requiring a 200x lower token budget.
中文标题/摘要
标题:SPARC:分离感知和推理电路以实现视觉语言模型测试时的扩展
尽管取得了近期的成功,测试时扩展——即在推理过程中根据需要动态扩展令牌预算——对于视觉语言模型(VLMs)仍然脆弱:关于图像的无序思维链将感知与推理纠缠在一起,导致长而杂乱的上下文,其中细微的感知错误可能会导致完全错误的答案。此外,为了获得良好的性能,需要昂贵的手工设计奖励的强化学习。在这里,我们提出了SPARC(分离感知和推理电路),这是一种模块化框架,明确地将视觉感知与推理分离。受大脑顺序感觉认知处理的启发,SPARC 实现了一个两阶段流水线,其中模型首先进行显式的视觉搜索以定位与问题相关区域,然后基于这些区域进行推理以生成最终答案。这种分离使得可以独立地进行测试时扩展,并且可以异构地分配计算资源(例如,在分布转移时优先处理感知处理),支持选择性优化(例如,仅改进感知阶段以解决端到端性能瓶颈),并且可以通过在较低图像分辨率下运行全局搜索并在选定区域分配高分辨率处理来适应压缩的上下文,从而减少总的视觉令牌计数和计算量。在具有挑战性的视觉推理基准测试中,SPARC 超过了单一模块基线和强大的视觉定位方法。例如,SPARC 将 Qwen3VL-4B 在 $V^*$ VQA 基准测试中的准确性提高了 6.7 个百分点,并且在一项具有挑战性的 OOD 任务中,尽管需要 200 倍更低的令牌预算,但其表现超过了“用图像思考”。
Summary / 总结
SPARC is a modular framework that decouples visual perception from reasoning in vision-language models to improve test-time scaling. It uses a two-stage pipeline in which the model first localizes question-relevant regions and then conditions its reasoning on those regions. This separation enables asymmetric compute allocation and selective optimization. SPARC outperforms monolithic baselines and visual-grounding approaches, improving Qwen3VL-4B by 6.7 percentage points on the V* VQA benchmark and surpassing "thinking with images" by 4.6 points on an OOD task with a 200x lower token budget.
SPARC 是一个模块化框架,将视觉感知与推理分离,以提高 VLMs 的测试时扩展性。它采用两阶段流水线,首先进行视觉搜索以定位相关区域,然后基于这些区域进行推理。SPARC 在具有较低 token 预算的情况下,在具有挑战性的基准测试中显著超越了单一模型基线和强大的视觉定位方法。
Universal Anti-forensics Attack against Image Forgery Detection via Multi-modal Guidance
Authors: Haipeng Li, Rongxuan Peng, Anwei Luo, Shunquan Tan, Changsheng Chen, Anastasia Antsiferova
First: 2026-02-06T09:32:10+00:00 · Latest: 2026-02-06T09:32:10+00:00
Comments: 17 pages, 11 figures
Abstract
The rapid advancement of AI-Generated Content (AIGC) technologies poses significant challenges for authenticity assessment. However, existing evaluation protocols largely overlook anti-forensics attacks, failing to ensure the comprehensive robustness of state-of-the-art AIGC detectors in real-world applications. To bridge this gap, we propose ForgeryEraser, a framework designed to execute universal anti-forensics attacks without access to the target AIGC detectors. We reveal an adversarial vulnerability stemming from the systemic reliance on Vision-Language Models (VLMs) as shared backbones (e.g., CLIP), where downstream AIGC detectors inherit the feature space of these publicly accessible models. Instead of traditional logit-based optimization, we design a multi-modal guidance loss to drive forged image embeddings within the VLM feature space toward text-derived authentic anchors to erase forgery traces, while repelling them from forgery anchors. Extensive experiments demonstrate that ForgeryEraser causes substantial performance degradation to advanced AIGC detectors on both global synthesis and local editing benchmarks. Moreover, ForgeryEraser induces explainable forensic models to generate explanations consistent with authentic images for forged images. Our code will be made publicly available.
中文标题/摘要
标题:针对图像伪造检测的多模态引导通用反取证攻击
AI生成内容(AIGC)技术的迅速发展对真实性评估提出了重大挑战。然而,现有的评估协议很大程度上忽视了反取证攻击,未能确保最先进的AIGC检测器在实际应用中的全面鲁棒性。为弥补这一差距,我们提出了一种名为ForgeryEraser的框架,该框架能够在不访问目标AIGC检测器的情况下执行通用反取证攻击。我们揭示了一种源自系统性依赖视觉语言模型(VLMs)作为共享骨干(例如CLIP)的对抗性漏洞,下游AIGC检测器继承了这些公开可访问模型的特征空间。我们设计了一种多模态引导损失,而不是传统的logit优化,以驱动伪造图像嵌入向文本衍生的真实锚点靠拢,从而消除伪造痕迹,同时将它们排斥在伪造锚点之外。广泛的实验表明,ForgeryEraser在全局合成和局部编辑基准上对高级AIGC检测器造成了显著的性能下降。此外,ForgeryEraser促使可解释的法医模型为伪造图像生成与真实图像一致的解释。我们的代码将公开发布。
Summary / 总结
The paper addresses the challenge of ensuring the robustness of image forgery detection in the era of AI-generated content. It introduces ForgeryEraser, a framework that performs universal anti-forensics attacks by leveraging multi-modal guidance to manipulate image embeddings within the feature space of Vision-Language Models. The experiments show that ForgeryEraser significantly degrades the performance of advanced AIGC detectors and causes forensic models to generate explanations consistent with authentic images for forged images.
论文旨在确保AI生成内容(AIGC)检测器在面对反取证攻击时的鲁棒性。提出了ForgeryEraser框架,无需访问目标AIGC检测器即可执行通用反取证攻击。通过利用多模态引导损失,ForgeryEraser在视觉-语言模型(VLM)的特征空间内操纵伪造图像嵌入,以消除伪造痕迹。大量实验表明,ForgeryEraser显著降低了高级AIGC检测器的性能,并导致取证模型为伪造图像生成与真实图像一致的解释。
AdaptOVCD: Training-Free Open-Vocabulary Remote Sensing Change Detection via Adaptive Information Fusion
Authors: Mingyu Dou, Shi Qiu, Ming Hu, Yifan Chen, Huping Ye, Xiaohan Liao, Zhe Sun
First: 2026-02-06T09:30:23+00:00 · Latest: 2026-02-06T09:30:23+00:00
Abstract
Remote sensing change detection plays a pivotal role in domains such as environmental monitoring, urban planning, and disaster assessment. However, existing methods typically rely on predefined categories and large-scale pixel-level annotations, which limit their generalization and applicability in open-world scenarios. To address these limitations, this paper proposes AdaptOVCD, a training-free Open-Vocabulary Change Detection (OVCD) architecture based on dual-dimensional multi-level information fusion. The framework integrates multi-level information fusion across data, feature, and decision levels vertically while incorporating targeted adaptive designs horizontally, achieving deep synergy among heterogeneous pre-trained models to effectively mitigate error propagation. Specifically, (1) at the data level, Adaptive Radiometric Alignment (ARA) fuses radiometric statistics with original texture features and synergizes with SAM-HQ to achieve radiometrically consistent segmentation; (2) at the feature level, Adaptive Change Thresholding (ACT) combines global difference distributions with edge structure priors and leverages DINOv3 to achieve robust change detection; (3) at the decision level, Adaptive Confidence Filtering (ACF) integrates semantic confidence with spatial constraints and collaborates with DGTRS-CLIP to achieve high-confidence semantic identification. Comprehensive evaluations across nine scenarios demonstrate that AdaptOVCD detects arbitrary category changes in a zero-shot manner, significantly outperforming existing training-free methods. Meanwhile, it achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations and exhibits superior generalization capabilities. The code is available at https://github.com/Dmygithub/AdaptOVCD.
中文标题/摘要
标题:AdaptOVCD:基于自适应信息融合的无训练开放词汇遥感变化检测
遥感变化检测在环境监测、城市规划和灾害评估等领域发挥着关键作用。然而,现有方法通常依赖预定义的类别和大规模像素级注释,这限制了它们在开放世界场景中的泛化能力和适用性。为了解决这些限制,本文提出了一种基于双维度多级信息融合的无训练开放词汇变化检测(OVCD)架构——AdaptOVCD。该框架在垂直方向上整合了数据、特征和决策三个层面的多级信息融合,同时在水平方向上引入了针对性的自适应设计,实现了异构预训练模型之间的深度协同,有效缓解了错误传播。具体而言,(1) 在数据层面,自适应辐射度对齐(ARA)融合辐射度统计与原始纹理特征,并与SAM-HQ协同实现辐射度一致的分割;(2) 在特征层面,自适应变化阈值(ACT)结合全局差异分布与边缘结构先验,并利用DINOv3实现稳健的变化检测;(3) 在决策层面,自适应置信度筛选(ACF)整合语义置信度与空间约束,并与DGTRS-CLIP协作实现高置信度的语义识别。在九个场景中的全面评估表明,AdaptOVCD能够在零样本情况下检测任意类别变化,显著优于现有无训练方法。同时,在跨数据集评估中,它达到了84.89%的全监督性能上限,并展示了优越的泛化能力。代码可在https://github.com/Dmygithub/AdaptOVCD获取。
Summary / 总结
AdaptOVCD is a training-free open-vocabulary change detection method that uses adaptive information fusion across data, feature, and decision levels to detect arbitrary category changes in remote sensing images. It integrates Adaptive Radiometric Alignment, Adaptive Change Thresholding, and Adaptive Confidence Filtering to achieve radiometrically consistent segmentation, robust change detection, and high-confidence semantic identification, respectively. Comprehensive evaluations show that AdaptOVCD outperforms existing training-free methods and achieves 84.89% of the fully-supervised performance upper bound in cross-dataset evaluations, demonstrating strong generalization capabilities.
AdaptOVCD 是一种无需训练的开放词汇量变化检测方法,通过在数据、特征和决策层面进行自适应信息融合来检测遥感图像中的任意类别变化。它结合了自适应辐射度对齐、自适应变化阈值和自适应置信度筛选,分别实现辐射度一致分割、稳健变化检测和高置信度语义识别。综合评估表明,AdaptOVCD 在跨数据集评估中达到了完全监督性能上限的 84.89%,并展示了强大的泛化能力。
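As a rough illustration of what data-driven change thresholding can look like, the sketch below picks a cut-off from the global distribution of per-pixel difference scores. It swaps in Otsu's method as a generic stand-in; the abstract does not say how ACT computes its threshold or how it uses edge priors, so `otsu_threshold`, the bin count, and the toy scores are all assumptions for illustration.

```python
def otsu_threshold(values, bins=64):
    """Pick the cut that maximizes between-class variance of a 1-D histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo  # degenerate case: every score is identical
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    total = len(values)
    sum_all = sum(h * c for h, c in zip(hist, centers))

    best_t, best_var = lo, -1.0
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += hist[i] * centers[i]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (m0 - m1) ** 2
        if between > best_var:
            best_var, best_t = between, lo + (i + 1) * width
    return best_t


# Hypothetical per-pixel difference scores between two co-registered images.
diff_scores = [0.05, 0.10, 0.08, 0.12, 0.90, 0.85, 0.95, 0.07]
threshold = otsu_threshold(diff_scores)
change_mask = [score > threshold for score in diff_scores]
```

A threshold derived from the score distribution adapts to each image pair, which is the property a fixed global constant lacks.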
HyPER: Bridging Exploration and Exploitation for Scalable LLM Reasoning with Hypothesis Path Expansion and Reduction
Authors: Shengxuan Qiu, Haochen Huang, Shuzhang Zhong, Pengfei Zuo, Meng Li
First: 2026-02-06T09:27:54+00:00 · Latest: 2026-02-06T09:27:54+00:00
Abstract
Scaling test-time compute with multi-path chain-of-thought improves reasoning accuracy, but its effectiveness depends critically on the exploration-exploitation trade-off. Existing approaches address this trade-off in rigid ways: tree-structured search hard-codes exploration through brittle expansion rules that interfere with post-trained reasoning, while parallel reasoning over-explores redundant hypothesis paths and relies on weak answer selection. Motivated by the observation that the optimal balance is phase-dependent and that correct and incorrect reasoning paths often diverge only at late stages, we reformulate test-time scaling as a dynamic expand-reduce control problem over a pool of hypotheses. We propose HyPER, a training-free online control policy for multi-path decoding in mixture-of-experts models that reallocates computation under a fixed budget using lightweight path statistics. HyPER consists of an online controller that transitions from exploration to exploitation as the hypothesis pool evolves, a token-level refinement mechanism that enables efficient generation-time exploitation without full-path resampling, and a length- and confidence-aware aggregation strategy for reliable answer-time exploitation. Experiments on four mixture-of-experts language models across diverse reasoning benchmarks show that HyPER consistently achieves a superior accuracy-compute trade-off, improving accuracy by 8 to 10 percent while reducing token usage by 25 to 40 percent.
中文标题/摘要
标题:HyPER:通过假设路径扩展与缩减平衡探索与利用,实现可扩展的大模型推理
通过多路径链式思考扩展推理计算,可以提高推理准确性,但其效果取决于探索与利用之间的权衡。现有方法以僵硬的方式解决这一权衡:树状搜索通过脆弱的扩展规则硬编码探索,干扰后训练推理;并行推理过度探索冗余假设路径,依赖于较弱的答案选择。鉴于观察到最优平衡依赖于推理阶段,并且正确的和错误的推理路径通常仅在后期才分叉,我们将测试时的扩展视为假设池中动态扩展与缩减控制问题。我们提出了HyPER,一种无需训练的在线控制策略,用于混合专家模型中的多路径解码,在固定预算下使用轻量级路径统计重新分配计算。HyPER 包含一个在线控制器,随着假设池的变化从探索过渡到利用,一个在生成时利用而不需重新采样完整路径的标记级细化机制,以及一种长度和置信度感知的聚合策略,以实现可靠的答案时利用。在四个不同混合专家语言模型上的多种推理基准测试中,HyPER 一致地实现了更好的准确性和计算权衡,提高了8%到10%的准确性,同时减少了25%到40%的标记使用量。
Summary / 总结
The research aims to improve the accuracy-compute trade-off in large language model reasoning by addressing the exploration-exploitation trade-off. HyPER, a training-free online control policy, dynamically reallocates computation among hypothesis paths in mixture-of-experts models. Experiments show that HyPER enhances accuracy by 8 to 10 percent and reduces token usage by 25 to 40 percent across various reasoning benchmarks.
论文解决了大规模语言模型(LLM)多路径链式推理中探索与利用之间的平衡问题。它提出了HyPER,一种基于轻量路径统计的无训练在线控制策略,能够动态重新分配计算预算。HyPER随着假设池的变化从探索过渡到利用,并包含一个基于token的精炼机制以及长度和置信度感知的聚合策略。实验表明,HyPER在各种推理基准测试中将准确率提高了8到10个百分点,并将token使用量减少了25到40个百分点。
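To make the idea of length- and confidence-aware answer aggregation in the HyPER entry concrete, here is a minimal sketch of one possible scheme: each reasoning path votes for its answer with a weight equal to the exponential of its mean token log-probability, so confidence is normalized by path length. The weighting rule and the `aggregate` function are assumptions for illustration; the abstract does not give HyPER's actual aggregation formula.

```python
import math
from collections import defaultdict


def aggregate(paths):
    """Weighted vote over (answer, token_logprobs) pairs.

    The weight exp(mean log-prob) is a hypothetical choice: it rewards
    confident paths and does not penalize a path merely for being long.
    """
    scores = defaultdict(float)
    for answer, logprobs in paths:
        if not logprobs:
            continue  # skip empty paths rather than divide by zero
        scores[answer] += math.exp(sum(logprobs) / len(logprobs))
    return max(scores, key=scores.get)


# Two low-confidence paths agree on "B", one confident path says "A".
paths = [("A", [-0.05]), ("B", [-2.0]), ("B", [-2.5])]
winner = aggregate(paths)
```

Plain majority voting would pick "B" here; the confidence weighting lets a single high-confidence path outweigh two weak ones.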
FloorplanVLM: A Vision-Language Model for Floorplan Vectorization
Authors: Yuanqing Liu, Ziming Yang, Yulong Li, Yue Yang
First: 2026-02-06T08:57:52+00:00 · Latest: 2026-02-06T08:57:52+00:00
Abstract
Converting raster floorplans into engineering-grade vector graphics is challenging due to complex topology and strict geometric constraints. To address this, we present FloorplanVLM, a unified framework that reformulates floorplan vectorization as an image-conditioned sequence modeling task. Unlike pixel-based methods that rely on fragile heuristics or query-based transformers that generate fragmented rooms, our model directly outputs structured JSON sequences representing the global topology. This 'pixels-to-sequence' paradigm enables the precise and holistic constraint satisfaction of complex geometries, such as slanted walls and curved arcs. To support this data-hungry approach, we introduce a scalable data engine: we construct a large-scale dataset (Floorplan-2M) and a high-fidelity subset (Floorplan-HQ-300K) to balance geometric diversity and pixel-level precision. We then employ a progressive training strategy, using Supervised Fine-Tuning (SFT) for structural grounding and quality annealing, followed by Group Relative Policy Optimization (GRPO) for strict geometric alignment. To standardize evaluation on complex layouts, we establish and open-source FPBench-2K. Evaluated on this rigorous benchmark, FloorplanVLM demonstrates exceptional structural validity, achieving 92.52% external-wall IoU and robust generalization across non-Manhattan architectures.
中文标题/摘要
标题:FloorplanVLM:一种用于楼层平面图矢量化的视觉-语言模型
将栅格楼层平面图转换为工程级矢量图形具有挑战性,因为存在复杂的拓扑结构和严格的几何约束。为了解决这一问题,我们提出了FloorplanVLM,这是一种统一框架,将楼层平面图矢量化重新定义为基于图像的序列建模任务。与依赖脆弱启发式方法的基于像素的方法或生成碎片化房间的基于查询的Transformer不同,我们的模型直接输出表示全局拓扑结构的结构化JSON序列。这种“像素到序列”的范式使我们能够精确地满足复杂几何形状的约束,例如倾斜的墙壁和曲线弧。为了支持这种数据密集型方法,我们引入了一个可扩展的数据引擎:我们构建了一个大规模数据集(Floorplan-2M)和一个高保真子集(Floorplan-HQ-300K),以平衡几何多样性与像素级精度。然后,我们采用了一种渐进式训练策略,使用监督微调(SFT)进行结构定位,然后使用组相对策略优化(GRPO)进行严格的几何对齐。为了在复杂布局上标准化评估,我们建立了并开源了FPBench-2K。在这一严格的基准测试上评估,FloorplanVLM展示了卓越的结构有效性,实现了92.52%的外部墙IoU,并且在非曼哈顿结构中具有稳健的泛化能力。
Summary / 总结
FloorplanVLM is a vision-language model that converts raster floorplans into vector graphics by reformulating the task as an image-conditioned sequence modeling problem. It directly outputs structured JSON sequences, enabling precise constraint satisfaction for complex geometries. The model is trained using a progressive strategy and evaluated on a new benchmark, FPBench-2K, achieving 92.52% external-wall IoU and robust performance across non-Manhattan architectures.
FloorplanVLM 是一个统一框架,通过将任务重新表述为图像条件下的序列建模问题,将栅格化平面图转换为矢量图形。它直接输出结构化的 JSON 序列,能够精确满足复杂的几何约束。该模型使用渐进式训练策略进行训练,并在新的基准上进行评估,实现了 92.52% 的外部墙体 IoU,并且在非曼哈顿结构中具有很强的泛化能力。
LAB-Det: Language as a Domain-Invariant Bridge for Training-Free One-Shot Domain Generalization in Object Detection
Authors: Xu Zhang, Zhe Chen, Jing Zhang, Dacheng Tao
First: 2026-02-06T08:03:04+00:00 · Latest: 2026-02-06T08:03:04+00:00
Abstract
Foundation object detectors such as GLIP and Grounding DINO excel on general-domain data but often degrade in specialized and data-scarce settings like underwater imagery or industrial defects. Typical cross-domain few-shot approaches rely on fine-tuning scarce target data, incurring cost and overfitting risks. We instead ask: Can a frozen detector adapt with only one exemplar per class without training? To answer this, we introduce training-free one-shot domain generalization for object detection, where detectors must adapt to specialized domains with only one annotated exemplar per class and no weight updates. To tackle this task, we propose LAB-Det, which exploits Language As a domain-invariant Bridge. Instead of adapting visual features, we project each exemplar into a descriptive text that conditions and guides a frozen detector. This linguistic conditioning replaces gradient-based adaptation, enabling robust generalization in data-scarce domains. We evaluate on UODD (underwater) and NEU-DET (industrial defects), two widely adopted benchmarks for data-scarce detection, where object boundaries are often ambiguous, and LAB-Det achieves up to 5.4 mAP improvement over state-of-the-art fine-tuned baselines without updating a single parameter. These results establish linguistic adaptation as an efficient and interpretable alternative to fine-tuning in specialized detection settings.
中文标题/摘要
标题:LAB-Det:以语言作为领域不变桥梁的无需训练单样本目标检测领域泛化
基础目标检测器如GLIP和Grounding DINO在通用领域数据上表现出色,但在如水下图像或工业缺陷等专门化和数据稀缺的环境中往往会退化。典型的跨领域少量样本方法依赖于对稀缺目标数据进行微调,这会带来成本和过拟合的风险。相反,我们提出的问题是:一个冻结的检测器是否可以在没有训练的情况下,仅通过每个类别的一个示例进行适应?为了回答这个问题,我们引入了目标检测中的训练免费单次领域泛化,其中检测器必须仅使用每个类别的一个标注示例和不更新权重的情况下适应专门化的领域。为了应对这一任务,我们提出了LAB-Det,它利用语言作为领域不变的桥梁。我们不适应视觉特征,而是将每个示例投影到描述性文本中,该文本条件并指导一个冻结的检测器。这种语言条件替代了基于梯度的适应,使检测器在数据稀缺领域中具有鲁棒的泛化能力。我们在UODD(水下)和NEU-DET(工业缺陷)两个广泛采用的数据稀缺检测基准上进行了评估,其中目标边界往往模糊不清,LAB-Det在不更新任何参数的情况下,相对于最先进的微调基线,实现了高达5.4个mAP的改进。这些结果确立了语言适应在专门化检测设置中作为一种高效且可解释的替代微调的方法。
Summary / 总结
The research addresses the challenge of object detection in specialized and data-scarce domains where typical detectors perform poorly. LAB-Det, a training-free one-shot domain generalization method, is introduced to adapt detectors using only one exemplar per class without updating weights. By projecting exemplars into descriptive text and using this text to condition a frozen detector, LAB-Det achieves up to 5.4 mAP improvement over fine-tuned baselines on UODD and NEU-DET benchmarks without any parameter updates.
研究旨在解决基础目标检测器在专业化和数据稀缺领域中的性能下降问题。提出了LAB-Det,一种无需训练的一次性领域泛化方法,使检测器能够在每个类别只有一个示例的情况下进行适应。通过将示例投影到描述性文本中,并使用该文本来条件化一个冻结的检测器,LAB-Det在数据稀缺领域如水下图像和工业缺陷中实现了高达5.4 mAP的改进,而无需更新任何参数,证明了其在专业化检测设置中的有效性。
T$^3$-S2S: Training-free Triplet Tuning for Sketch to Scene Synthesis in Controllable Concept Art Generation
Authors: Zhenhong Sun, Yifu Wang, Yonhon Ng, Yongzhi Xu, Daoyi Dong, Hongdong Li, Pan Ji
First: 2024-12-18T04:01:32+00:00 · Latest: 2026-02-06T06:29:47+00:00
Comments: https://openreview.net/forum?id=lyn2BgKQ8F
Abstract
2D concept art generation for 3D scenes is a crucial yet challenging task in computer graphics, as creating natural intuitive environments still demands extensive manual effort in concept design. While generative AI has simplified 2D concept design via text-to-image synthesis, it struggles with complex multi-instance scenes and offers limited support for structured terrain layout. In this paper, we propose a Training-free Triplet Tuning for Sketch-to-Scene (T3-S2S) generation after reviewing the entire cross-attention mechanism. This scheme revitalizes the ControlNet model for detailed multi-instance generation via three key modules: Prompt Balance ensures keyword representation and minimizes the risk of missing critical instances; Characteristic Priority emphasizes sketch-based features by highlighting TopK indices in feature channels; and Dense Tuning refines contour details within instance-related regions of the attention map. Leveraging the controllability of T3-S2S, we also introduce a feature-sharing strategy with dual prompt sets to generate layer-aware isometric and terrain-view representations for the terrain layout. Experiments show that our sketch-to-scene workflow consistently produces multi-instance 2D scenes with details aligned with input prompts.
中文标题/摘要
标题:T$^3$-S2S: 无需训练的三重调谐以实现草图到场景合成的可控概念艺术生成
在计算机图形学中,2D概念艺术生成对于3D场景至关重要但极具挑战性,因为创建自然直观的环境仍然需要大量的手工努力进行概念设计。虽然生成式AI通过文本到图像合成简化了2D概念设计,但它在处理复杂多实例场景方面存在困难,并且对结构化地形布局的支持有限。在本文中,我们提出了一种无需训练的三重调谐以实现草图到场景(T3-S2S)生成方案,该方案通过回顾整个交叉注意力机制。该方案通过三个关键模块重新激活了ControlNet模型以实现详细的多实例生成:提示平衡确保关键词表示并最小化遗漏关键实例的风险;特征优先级通过突出特征通道中的TopK索引强调基于草图的特征;密集调谐在注意力图的相关实例区域中细化轮廓细节。利用T3-S2S的可控性,我们还引入了一种基于双提示集的特征共享策略,以生成地形布局的分层等轴视图和地形视图表示。实验表明,我们的草图到场景工作流始终能够生成与输入提示对齐的多实例2D场景。
Summary / 总结
This paper addresses the challenge of generating 2D concept art for 3D scenes, which requires significant manual effort. It proposes T$^3$-S2S, a training-free method that enhances the ControlNet model for detailed multi-instance generation. Key modules include Prompt Balance, Characteristic Priority, and Dense Tuning, which ensure detailed and aligned multi-instance scenes. Experiments demonstrate that T$^3$-S2S can generate multi-instance 2D scenes with precise alignment to input prompts.
论文旨在解决生成用于3D场景的详细2D概念艺术需要大量人工努力的问题。为此,作者提出了一种无需训练的T$^3$-S2S方法,用于草图到场景的合成。该方法使用三个关键模块:Prompt Balance、Characteristic Priority和Dense Tuning,以增强多实例生成的控制和细节。实验表明,T$^3$-S2S可以生成多实例2D场景,并且细节与输入提示一致。
MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
Authors: Wenjie Wang, Wei Wu, Ying Liu, Yuan Zhao, Xiaole Lv, Liang Diao, Zengjian Fan, Wenfeng Xie, Ziling Lin, De Shi, Lin Huang, Kaihe Xu, Hong Li
First: 2026-02-06T05:47:40+00:00 · Latest: 2026-02-06T05:47:40+00:00
Comments: 20 pages, 8 figures. Technical report
Abstract
Medical document OCR is challenging due to complex layouts, domain-specific terminology, and noisy annotations, while requiring strict field-level exact matching. Existing OCR systems and general-purpose vision-language models often fail to reliably parse such documents. We propose MeDocVL, a post-trained vision-language model for query-driven medical document parsing. Our framework combines Training-driven Label Refinement to construct high-quality supervision from noisy annotations, with a Noise-aware Hybrid Post-training strategy that integrates reinforcement learning and supervised fine-tuning to achieve robust and precise extraction. Experiments on medical invoice benchmarks show that MeDocVL consistently outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance under noisy supervision.
中文标题/摘要
标题:MeDocVL:一种用于医学文档理解和解析的视觉语言模型
医学文档OCR由于复杂的布局、领域特定的术语和嘈杂的注释而具有挑战性,同时需要严格的字段级精确匹配。现有的OCR系统和通用的视觉-语言模型往往无法可靠地解析此类文档。我们提出了一种名为MeDocVL的后训练视觉-语言模型,用于查询驱动的医学文档解析。我们的框架结合了训练驱动的标签精炼,从嘈杂的注释中构建高质量的监督,以及一种噪声感知的混合后训练策略,该策略结合了强化学习和监督微调,以实现稳健和精确的提取。在医学发票基准测试上的实验表明,MeDocVL在嘈杂监督下始终优于传统的OCR系统和强大的VLM基线,实现了最先进的性能。
Summary / 总结
The research aims to address the challenges of medical document OCR, including complex layouts, domain-specific terminology, and noisy annotations. MeDocVL, a post-trained vision-language model, is proposed for query-driven medical document parsing. It combines training-driven label refinement to improve supervision from noisy annotations and a noise-aware hybrid post-training strategy that integrates reinforcement learning and supervised fine-tuning. Experiments on medical invoice benchmarks demonstrate that MeDocVL outperforms conventional OCR systems and strong VLM baselines, achieving state-of-the-art performance even under noisy supervision.
研究旨在解决医疗文档OCR的挑战,包括复杂布局、领域特定术语和嘈杂的标注。作者提出了MeDocVL,一种用于查询驱动的医疗文档解析的后训练视觉语言模型。该模型使用Training-driven Label Refinement来改进从嘈杂标注中获得的监督,并结合强化学习和监督微调的Noise-aware Hybrid Post-training策略。实验表明,MeDocVL在嘈杂监督下优于传统OCR系统和强大的VLM基线,实现了最先进的性能。
POINTS-GUI-G: GUI-Grounding Journey
Authors: Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou
First: 2026-02-06T05:14:11+00:00 · Latest: 2026-02-06T05:14:11+00:00
Abstract
The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive digital workflows. As a foundational capability, GUI grounding is typically established as a prerequisite for end-to-end task execution. It enables models to precisely locate interface elements, such as text and icons, to perform accurate operations like clicking and typing. Unlike prior works that fine-tune models already possessing strong spatial awareness (e.g., Qwen3-VL), we aim to master the full technical pipeline by starting from a base model with minimal grounding ability, such as POINTS-1.5. We introduce POINTS-GUI-G-8B, which achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. Our model's success is driven by three key factors: (1) Refined Data Engineering, involving the unification of diverse open-source dataset formats alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards. While RL is traditionally used to bolster reasoning, we demonstrate that it significantly improves precision in the perception-intensive GUI grounding task. Furthermore, GUI grounding provides a natural advantage for RL, as rewards are easily verifiable and highly accurate.
中文标题/摘要
标题:POINTS-GUI-G: GUI-定位之旅
视觉语言模型的迅速发展催生了GUI代理,它们在自动化复杂任务(如在线购物和航班预订)方面具有巨大潜力,从而减轻了重复数字工作流程的负担。作为一项基础能力,GUI定位通常作为端到端任务执行的前提条件。它使模型能够精确定位界面元素,如文本和图标,以执行准确的操作,如点击和输入。与之前对已有强大空间意识的模型(如Qwen3-VL)进行微调的工作不同,我们旨在从一个基本模型开始,该模型的定位能力极低,例如POINTS-1.5。我们引入了POINTS-GUI-G-8B,该模型在ScreenSpot-Pro上的得分为59.9,在OSWorld-G上的得分为66.0,在ScreenSpot-v2上的得分为95.7,在UI-Vision上的得分为49.9。我们的模型的成功得益于三个关键因素:(1) 精细化的数据工程,涉及统一多种开源数据集格式以及复杂的增强、过滤和难度分级策略;(2) 改进的训练策略,包括持续微调视觉编码器以提高感知准确性,并在训练和推理之间保持分辨率一致性;以及(3) 可验证奖励的强化学习(RL)。虽然RL传统上用于增强推理,但我们证明它在感知密集型GUI定位任务中显著提高了精度。此外,GUI定位为RL提供了自然优势,因为奖励易于验证且非常准确。
Summary / 总结
This paper introduces POINTS-GUI-G-8B, a model that achieves state-of-the-art performance in GUI grounding with scores on various benchmarks. The motivation is to automate complex tasks through vision-language models, reducing repetitive digital workflows. The method involves refined data engineering, improved training strategies, and reinforcement learning with verifiable rewards. Key findings include scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision, with RL enhancing precision in perception-intensive tasks.
该论文介绍了POINTS-GUI-G-8B模型,在多种基准测试中取得了最先进的性能。动机是通过视觉-语言模型自动化复杂任务,减少重复的数字工作流。方法包括精细的数据工程、改进的训练策略以及带有可验证奖励的强化学习。关键发现包括在ScreenSpot-Pro上的得分为59.9,在OSWorld-G上的得分为66.0,在ScreenSpot-v2上的得分为95.7,在UI-Vision上的得分为49.9,强化学习在感知密集型GUI定位任务中显著提高了精度。
Probing Perceptual Constancy in Large Vision-Language Models
Authors: Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Lianyu Huang, Zelong Hong, Jiahui Ge, Qianli Ma, Hang He, Yifan Zhou, Lingzi Guo, Lantao Mei, Maijunxian Wang, Dezhi Luo, Hokin Deng
First: 2025-02-14T16:31:43+00:00 · Latest: 2026-02-06T05:00:31+00:00
Comments: Under Review
Abstract
Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explored such ability in current Vision Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.
中文标题/摘要
标题:探究大型视觉语言模型的知觉恒常性
知觉恒常性是指在感官输入发生变化(如距离、角度或光照变化)的情况下,维持对物体稳定感知的能力。这种能力对于在动态世界中进行视觉理解至关重要。在这里,我们探讨了当前视觉语言模型(VLMs)的这种能力。在本研究中,我们使用236项实验评估了155个VLM在三个领域中的表现:颜色恒常性、大小恒常性和形状恒常性。实验包括对经典认知任务的单张图像和视频改编,以及在野外条件下的新型任务。我们发现这些领域中VLM的表现存在显著差异,形状恒常性的表现与颜色和大小恒常性表现明显不同。
Summary / 总结
This study investigates the ability of Vision Language Models (VLMs) to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. Using 236 experiments across three domains (color, size, and shape constancy), the researchers evaluated 155 VLMs and found significant variability in performance, with shape constancy clearly dissociated from color and size constancy.
研究通过评估155个视觉语言模型在颜色、大小和形状恒常性三个领域的表现,使用236项实验(包括单张图像和视频任务)来考察视觉语言模型的恒常性能力。研究发现,视觉语言模型在这三个领域的表现存在显著差异,特别是在形状恒常性方面与颜色和大小恒常性表现不同。
Enhancing Features in Long-tailed Data Using Large Vision Model
Authors: Pengxiao Han, Changkun Ye, Jinguang Tong, Cuicui Jiang, Jie Hong, Li Fang, Xuesong Li
First: 2025-04-15T04:21:50+00:00 · Latest: 2026-02-06T03:58:09+00:00
Abstract
Language-based foundation models, such as large language models (LLMs) or large vision-language models (LVLMs), have been widely studied in long-tailed recognition. However, the need for linguistic data is not applicable to all practical tasks. In this study, we aim to explore using large vision models (LVMs) or visual foundation models (VFMs) to enhance long-tailed data features without any language information. Specifically, we extract features from the LVM and fuse them with features in the baseline network's map and latent space to obtain the augmented features. Moreover, we design several prototype-based losses in the latent space to further exploit the potential of the augmented features. In the experimental section, we validate our approach on two benchmark datasets: ImageNet-LT and iNaturalist2018.
中文标题/摘要
标题:使用大型视觉模型增强长尾数据特征
语言基础模型,如大型语言模型(LLMs)或大型视觉语言模型(LVLMs),在长尾识别方面得到了广泛研究。然而,并非所有实际任务都需要语言数据。在本研究中,我们旨在探索使用大型视觉模型(LVMs)或视觉基础模型(VFMs)来增强长尾数据特征,而不使用任何语言信息。具体而言,我们从LVM中提取特征,并将其与基线网络映射和潜在空间中的特征融合,以获得增强特征。此外,我们在潜在空间中设计了几种原型损失,以进一步利用增强特征的潜力。在实验部分,我们在两个基准数据集:ImageNet-LT和iNaturalist2018上验证了我们的方法。
Summary / 总结
This study aims to enhance features in long-tailed data using large vision models (LVMs) without relying on linguistic data. The approach involves extracting features from the LVM and fusing them with the baseline network's features, followed by designing prototype-based losses in the latent space to further exploit the augmented features. Experiments on ImageNet-LT and iNaturalist2018 demonstrate the effectiveness of this method in improving feature representation for long-tailed data.
本研究旨在利用大型视觉模型(LVMs)增强长尾数据的特征,而不依赖于语言数据。方法包括从LVM中提取特征并与基线网络的特征融合,然后在潜在空间中设计原型损失以进一步利用增强后的特征。实验在ImageNet-LT和iNaturalist2018数据集上验证了该方法在提高长尾数据特征表示方面的有效性。
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Authors: Chang Zou, Changlin Li, Yang Li, Patrol Li, Jianbing Wu, Xiao He, Songtao Liu, Zhao Zhong, Kailin Huang, Linfeng Zhang
First: 2026-02-05T08:45:08+00:00 · Latest: 2026-02-06T03:54:23+00:00
Comments: 17 pages, 7 figures; cvpr2026 submission
Abstract
While diffusion models have achieved great success in video generation, this progress comes with a rapidly escalating computational burden. Among existing acceleration methods, Feature Caching is popular because it is training-free and delivers considerable speedups, but it inevitably loses semantics and detail under further compression. Another widely adopted method, training-aware step distillation, though successful in image generation, also degrades drastically in few-step video generation. Furthermore, the quality loss becomes more severe when training-free feature caching is simply applied to step-distilled models, because their sampling steps are sparser. This paper introduces, for the first time, a distillation-compatible learnable feature caching mechanism. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. With these contributions, we push the acceleration boundary to 11.8× while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. The code will be made publicly available soon.
中文标题/摘要
标题:DisCa: 加速视频扩散变换器的蒸馏兼容可学习特征缓存
虽然扩散模型在视频生成领域取得了巨大成功,但这一进展伴随着计算负担的迅速增加。现有的加速方法中,特征缓存因其无需训练和显著的加速性能而广受欢迎,但进一步压缩时不可避免地会面临语义和细节的丢失。另一种广泛应用的方法,训练感知的步骤蒸馏,在图像生成中虽然取得了成功,但在视频生成中却面临严重的性能下降,尤其是仅将无需训练的特征缓存简单应用于步骤蒸馏模型时,质量损失更为严重,因为采样步骤更为稀疏。本文首次引入了蒸馏兼容的可学习特征缓存机制。我们采用轻量级的可学习神经预测器代替传统的无需训练启发式方法,能够更准确地捕捉高维特征演化过程。此外,我们探讨了高度压缩蒸馏在大规模视频模型中的挑战,并提出了一种保守的受限均值流方法,以实现更稳定和无损的蒸馏。通过这些努力,我们在保持生成质量的同时将加速边界进一步推至11.8×。大量实验表明了我们方法的有效性。代码将很快公开。
Summary / 总结
This paper addresses the computational challenges in video generation using diffusion models by introducing a novel distillation-compatible learnable feature caching mechanism. The method uses a lightweight learnable neural predictor to capture the high-dimensional feature evolution process more accurately, and proposes a conservative Restricted MeanFlow approach for more stable and lossless distillation. The experiments show that the proposed method can accelerate video generation by up to 11.8 times while maintaining quality.
该论文通过引入一种新型的蒸馏兼容可学习特征缓存机制,解决了视频生成中扩散模型的计算挑战。该方法使用轻量级的可学习神经预测器更准确地捕捉高维特征演化过程,并提出了一种保守的受限均值流方法以实现更稳定和无损的蒸馏。实验表明,所提出的方法可以在保持生成质量的同时将视频扩散变换器加速11.8倍。
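For readers unfamiliar with the training-free feature caching baseline that the DisCa entry contrasts itself with, the sketch below shows the basic reuse pattern: an expensive block is recomputed only every few denoising steps and the cached output is reused in between. The function `run_with_cache`, the fixed refresh interval, and the toy block are hypothetical; DisCa itself replaces this kind of heuristic with a learned predictor whose design is not described in the abstract.

```python
def run_with_cache(num_steps, interval, compute_block):
    """Recompute an expensive block every `interval` steps, else reuse the cache."""
    cache = None
    outputs, full_calls = [], 0
    for step in range(num_steps):
        if cache is None or step % interval == 0:
            cache = compute_block(step)  # full (expensive) forward pass
            full_calls += 1
        outputs.append(cache)  # skipped steps reuse the stale feature
    return outputs, full_calls


def toy_block(step):
    # Stand-in for a transformer block whose output drifts slowly over steps.
    return 1.0 / (1 + step)


outputs, full_calls = run_with_cache(num_steps=10, interval=4, compute_block=toy_block)
```

With 10 steps and an interval of 4, only steps 0, 4, and 8 run the full block. The staleness of the reused features grows with the interval, which is the quality loss the abstract attributes to aggressive caching.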
Adaptive Rank, Reduced Forgetting: Continual Learning with Dynamic Rank-Selective LoRA
Authors: Haodong Lu, Chongyang Zhao, Jason Xue, Lina Yao, Kristen Moore, Dong Gong
First: 2024-12-01T23:41:42+00:00 · Latest: 2026-02-06T03:07:46+00:00
Comments: Preprint
Abstract
Continual learning (CL) aims to accumulate knowledge from sequential tasks without catastrophic forgetting. Vision-language models such as CLIP, with strong generalization, are widely used for CL. Existing methods often adapt isolated PTM components, increasing inference complexity and limiting model improvement, or rely on replay, stored data, or assumptions, leading to high costs and limited applicability. To advance models as continual learners, we explore CL through natural and efficient PTM updates rather than complex task-specific additions. We study continual low-rank learning and analyze how LoRA ranks and placements affect learning and forgetting. A higher-rank LoRA improves task learning (plasticity) but increases forgetting, while a lower-rank LoRA enhances stability but limits adaptation. We observe a plasticity-stability balance tied to rank across parameters and tasks, with moderately small ranks maximizing CL benefits. Motivated by this, we propose Continual Dynamic Rank-Selective LoRA (CoDyRA), which continually updates PTMs with LoRA adapters of adaptively optimized ranks. The new-task objective drives learning, while sparsity-promoting regularization minimizes ranks to reduce interference and forgetting, achieving a balance tailored to each parameter and task. Although all parameters are updated, the minimized ranks keep the model close to its prior state while enabling effective new-task learning. CoDyRA performs efficient CL as a sequence of LoRA-based updates without storing past data or relying on assumptions, preserving the original model architecture and adding no inference overhead. Experiments show CoDyRA improves new representations while retaining old knowledge, achieving state-of-the-art results. Code is available at https://github.com/jeff024/codyra.
中文标题/摘要
标题:自适应秩,减少遗忘:动态秩选择LoRA的持续学习
持续学习(CL)旨在从顺序任务中积累知识而不发生灾难性遗忘。视觉-语言模型如CLIP,因其强大的泛化能力而广泛用于持续学习。现有方法通常适应孤立的预训练模型组件,增加推理复杂性并限制模型改进,或者依赖重放、存储数据或假设,导致高成本和有限的应用范围。为了使模型成为持续学习者,我们探索通过自然和高效的预训练模型更新来实现持续学习,而不是复杂的任务特定添加。我们研究持续低秩学习,并分析LoRA的排名和放置如何影响学习和遗忘。较高的LoRA秩提高了任务学习能力(可塑性),但增加了遗忘,而较低的LoRA秩增强了稳定性但限制了适应性。我们观察到可塑性和稳定性的平衡与参数和任务的秩相关,适度的小秩最大化持续学习的好处。受此启发,我们提出了持续动态秩选择LoRA(CoDyRA),该方法持续更新预训练模型,使用自适应优化秩的LoRA适配器。新的任务目标驱动学习,而稀疏性促进正则化最小化秩以减少干扰和遗忘,实现针对每个参数和任务量身定制的平衡。尽管所有参数都被更新,但最小化的秩使模型保持接近其初始状态,同时允许有效的新的任务学习。CoDyRA作为一系列基于LoRA的更新高效地实现持续学习,无需存储过去的数据或依赖假设,保持原始模型架构并增加零推理开销。实验表明,CoDyRA在保留旧知识的同时提高了新的表示能力,达到了最先进的结果。代码可在https://github.com/jeff024/codyra/ 获取。
Summary / 总结
The paper addresses the challenge of continual learning (CL) in vision-language models like CLIP, focusing on reducing catastrophic forgetting. It proposes Continual Dynamic Rank-Selective LoRA (CoDyRA), which adaptively optimizes the rank of LoRA adapters for each parameter and task, balancing plasticity and stability. Experiments demonstrate that CoDyRA effectively retains old knowledge while learning new tasks, achieving state-of-the-art results without storing past data or making assumptions about the data distribution.
论文针对视觉-语言模型如CLIP在持续学习(CL)中的挑战,特别是减少灾难性遗忘的问题,提出了持续动态秩选择性LoRA(CoDyRA)方法,该方法通过自适应优化LoRA适配器的秩来平衡可塑性和稳定性。CoDyRA以优化的秩更新参数,最小化干扰和遗忘,无需存储过去的数据或依赖假设,实验结果表明CoDyRA在保留旧知识的同时能够有效地学习新任务。
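The rank-selection idea in the CoDyRA entry can be pictured with a small sketch. It assumes, purely for illustration, that each rank-one component of a LoRA update carries a scalar importance weight and that an L1-style soft-thresholding step drives unneeded weights to exactly zero; the abstract only states that sparsity-promoting regularization minimizes ranks, so `soft_threshold`, `lora_delta`, and the numbers below are hypothetical.

```python
def soft_threshold(weights, lam):
    """Proximal step for an L1 penalty: shrink each weight toward zero by lam."""
    shrunk = []
    for w in weights:
        mag = abs(w) - lam
        shrunk.append(0.0 if mag <= 0 else (mag if w > 0 else -mag))
    return shrunk


def effective_rank(weights):
    """Number of rank-one components that survive the shrinkage."""
    return sum(1 for w in weights if w != 0.0)


def lora_delta(B, A, weights):
    """Delta W = sum_i weights[i] * outer(B[:, i], A[i, :]) with plain lists."""
    d_out, d_in = len(B), len(A[0])
    delta = [[0.0] * d_in for _ in range(d_out)]
    for i, w in enumerate(weights):
        if w == 0.0:
            continue  # pruned components contribute nothing
        for p in range(d_out):
            for q in range(d_in):
                delta[p][q] += w * B[p][i] * A[i][q]
    return delta


importance = [0.9, 0.05, -0.4, 0.02]          # one weight per LoRA rank
importance = soft_threshold(importance, 0.1)  # small weights become exactly 0
rank_after = effective_rank(importance)
```

Components whose weight reaches zero drop out of the update entirely, so the effective rank of each adapter can differ per parameter and per task while the merged model keeps its original architecture.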
Taming SAM3 in the Wild: A Concept Bank for Open-Vocabulary Segmentation
Authors: Gensheng Pei, Xiruo Jiang, Yazhou Yao, Xiangbo Shu, Fumin Shen, Byeungwoo Jeon
First: 2026-02-06T02:59:11+00:00 · Latest: 2026-02-06T02:59:11+00:00
Abstract
The recent introduction of SAM3 has revolutionized Open-Vocabulary Segmentation (OVS) through promptable concept segmentation, which grounds pixel predictions in flexible concept prompts. However, this reliance on pre-defined concepts makes the model vulnerable: when visual distributions shift (data drift) or conditional label distributions evolve (concept drift) in the target domain, the alignment between visual evidence and prompts breaks down. In this work, we present ConceptBank, a parameter-free calibration framework to restore this alignment on the fly. Instead of adhering to static prompts, we construct a dataset-specific concept bank from the target statistics. Our approach (i) anchors target-domain evidence via class-wise visual prototypes, (ii) mines representative supports to suppress outliers under data drift, and (iii) fuses candidate concepts to rectify concept drift. We demonstrate that ConceptBank effectively adapts SAM3 to distribution drifts, including challenging natural-scene and remote-sensing scenarios, establishing a new baseline for robustness and efficiency in OVS. Code and model are available at https://github.com/pgsmall/ConceptBank.
中文标题/摘要
标题:在野外驯服SAM3:一种开放词汇分割的概念银行
最近引入的SAM3通过可提示的概念分割彻底改变了开放词汇分割(OVS),这种技术将像素预测与灵活的概念提示联系起来。然而,这种对预定义概念的依赖使模型变得脆弱:当目标域中的视觉分布发生变化(数据漂移)或条件标签分布演变(概念漂移)时,视觉证据与提示之间的对齐就会失效。在本文中,我们提出了ConceptBank,这是一种无需参数的校准框架,可以在运行时恢复这种对齐。我们不依赖于静态提示,而是从目标统计数据中构建一个数据集特定的概念银行。我们的方法(i)通过类别级视觉原型锚定目标域的证据,(ii)挖掘代表性支持以在数据漂移下抑制异常值,(iii)融合候选概念以纠正概念漂移。我们证明ConceptBank能够有效使SAM3适应分布漂移,包括具有挑战性的自然场景和遥感场景,为OVS的鲁棒性和效率设立了新的基准。代码和模型可在https://github.com/pgsmall/ConceptBank/获得。
Summary / 总结
This paper addresses the vulnerability of SAM3 in Open-Vocabulary Segmentation (OVS) due to data and concept drifts. The authors propose ConceptBank, a parameter-free framework that constructs a dataset-specific concept bank to restore alignment between visual evidence and prompts. ConceptBank uses class-wise visual prototypes, mines representative supports to handle data drift, and fuses candidate concepts to address concept drift, demonstrating robust performance in challenging scenarios such as natural scenes and remote sensing.
本文通过引入无参数校准框架ConceptBank来解决SAM3在开放词汇分割(OVS)中的脆弱性,该脆弱性源于数据和概念漂移。ConceptBank从目标统计数据中构建一个特定的数据集概念库,以锚定目标域的证据、挖掘代表性支持以抑制异常值,并融合候选概念以纠正概念漂移。该方法在自然场景和遥感场景中的分布漂移中有效适应了SAM3,为OVS的鲁棒性和效率设定了新的基准。
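The class-wise visual prototypes mentioned in the ConceptBank entry can be sketched in a few lines. The version below assumes prototypes are the normalized mean of L2-normalized features per class and that matching uses cosine similarity; these are common choices rather than details confirmed by the abstract, and the feature vectors and class names are made up.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def class_prototypes(features, labels):
    """Average the normalized features of each class into one prototype."""
    groups = defaultdict(list)
    for feat, label in zip(features, labels):
        groups[label].append(l2_normalize(feat))
    prototypes = {}
    for label, feats in groups.items():
        dim = len(feats[0])
        mean_vec = [sum(f[d] for f in feats) / len(feats) for d in range(dim)]
        prototypes[label] = l2_normalize(mean_vec)
    return prototypes


def nearest_prototype(feature, prototypes):
    """Return the class whose prototype has the highest cosine similarity."""
    query = l2_normalize(feature)
    return max(prototypes, key=lambda c: sum(a * b for a, b in zip(query, prototypes[c])))


# Hypothetical target-domain features with (pseudo-)labels.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["road", "road", "water", "water"]
protos = class_prototypes(feats, labels)
pred = nearest_prototype([0.8, 0.2], protos)
```

Because the prototypes are built from target-domain statistics, they shift with the data, unlike a fixed text prompt.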
Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control
Authors: Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yuxuan Li, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi
First: 2025-10-16T07:38:21+00:00 · Latest: 2026-02-06T02:04:25+00:00
Abstract
Building agents that autonomously operate mobile devices has attracted increasing attention. While Vision-Language Models (VLMs) show promise, most existing approaches rely on direct state-to-action mappings, which lack structured reasoning and planning, and thus generalize poorly to novel tasks or unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control, featuring a high-level reasoning model and a low-level action model that are jointly optimized. For efficient training, we reformulate multi-step decision-making as a sequence of single-step subgoals and propose a foresight advantage function, which leverages execution feedback from the low-level model to guide high-level optimization. This design alleviates the path explosion issue encountered by Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables stable, critic-free joint training. Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark, significantly outperforming prior methods across three paradigms: prompt-based (AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot generalization on the ScreenSpot-v2 benchmark. On the more challenging AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones, showing strong adaptability in high-complexity mobile control scenarios.
中文标题/摘要
标题:Hi-Agent:移动设备控制的分层视觉语言代理
构建能够自主操作移动设备的代理引起了越来越多的关注。尽管视觉语言模型(VLMs)显示出潜力,但大多数现有方法依赖于直接的状态到动作映射,缺乏结构化的推理和规划,因此在新任务或未见过的UI布局上泛化能力较差。我们引入了Hi-Agent,这是一种用于移动控制的可训练分层视觉语言代理,具备高层推理模型和低层动作模型,并且两者联合优化。为了高效训练,我们将多步决策问题重新表述为一系列单步子目标,并提出了一种前瞻优势函数,该函数利用低层模型的执行反馈来指导高层优化。这种设计缓解了在长期任务中遇到的组相对策略优化(GRPO)路径爆炸问题,并使稳定、无批评家的联合训练成为可能。Hi-Agent在Android-in-the-Wild(AitW)基准测试中达到了新的最佳成功率87.9%,显著优于基于提示(AppAgent:17.7%)、监督(过滤后的BC:54.5%)和强化学习(DigiRL:71.9%)的先前方法。它还在ScreenSpot-v2基准测试中展示了竞争力的零样本泛化能力。在更具挑战性的AndroidWorld基准测试中,Hi-Agent也随着更大模型规模的有效扩展,展示了在高复杂度移动控制场景中的强大适应性。
Summary / 总结
The research aims to develop agents capable of autonomously controlling mobile devices, addressing the limitation that most existing Vision-Language Model (VLM) approaches rely on direct state-to-action mappings that lack structured reasoning and planning. Hi-Agent, a hierarchical vision-language agent, features a high-level reasoning model and a low-level action model that are jointly optimized. By reformulating multi-step decision-making as a sequence of single-step subgoals and using a foresight advantage function, Hi-Agent enables stable, critic-free joint training and achieves a new SOTA task success rate of 87.9% on the Android-in-the-Wild benchmark, outperforming prompt-based, supervised, and reinforcement learning-based methods.
研究旨在开发能够自主控制移动设备并具备结构化推理和规划能力的代理。引入了Hi-Agent,这是一种具有高层推理模型和低层动作模型的分层视觉-语言代理,这些模型是联合优化的。通过将多步决策制定重新表述为一系列单步子目标,并使用前瞻性优势函数,Hi-Agent解决了路径爆炸问题,并实现了稳定的无批评联合训练。该代理在Android-in-the-Wild基准测试中实现了87.9%的任务成功率,超越了之前的多种方法,并展示了强大的零样本泛化能力。
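As background for the "critic-free" training mentioned above, the following is a minimal sketch of the standard group-relative advantage used by GRPO, where each sampled rollout is scored against the mean and standard deviation of its own group instead of a learned critic. It does not implement Hi-Agent's foresight advantage function, whose exact form is not given in the abstract; the function name and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    rewards: scalar rewards for several rollouts of the same task (hypothetical data).
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one subgoal: two succeeded (reward 1.0), two failed (reward 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, with no value network involved.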
Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings
Authors: Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard
Venue: ICLR 2026
First: 2026-02-05T21:56:26+00:00 · Latest: 2026-02-05T21:56:26+00:00
Comments: Published as a conference paper at ICLR 2026
Abstract
Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.
中文标题/摘要
标题:跨模态冗余与视觉-语言嵌入的空间几何
视觉-语言模型(VLMs)在图像和文本对齐方面取得了显著成功,但它们共享嵌入空间的几何结构仍然知之甚少。为了探究这种几何结构,我们从等能性假设出发,该假设利用了跨模态冗余:真正共享的概念应该在不同模态中表现出相同的平均能量。我们通过一种对齐稀疏自编码器(SAE)来实现这一假设,该自编码器在训练过程中鼓励能量一致性,同时保持重建。我们发现这种归纳偏置改变了SAE的解决方案,而不会损害重建,从而为我们提供了一种用于几何分析的表示。在具有已知真实值的受控数据上进行的合理性检查表明,当等能性成立时,对齐会改善,而当它不成立时,对齐保持中立。将该框架应用于基础VLMs,揭示了清晰的结构及其实际后果:(i)稀疏双模态原子承载了全部跨模态对齐信号;(ii)单模态原子作为模态特定的偏差,完全解释了模态差距;(iii)移除单模态原子会消除差距而不损害性能;(iv)将向量算术限制在双模态子空间内产生同分布编辑并提高检索效果。这些发现表明,正确的归纳偏置不仅可以保持模型的准确性,还可以使潜在的几何结构变得可解释和可操作。
Summary / 总结
The research aims to understand the geometry of the shared embedding space in vision-language models (VLMs) by leveraging the Iso-Energy Assumption, which exploits cross-modal redundancy. The study uses an Aligned Sparse Autoencoder (SAE) to encourage energy consistency during training while preserving reconstruction. Key findings include that sparse bimodal atoms carry the entire cross-modal alignment signal, unimodal atoms act as modality-specific biases that fully explain the modality gap, and removing unimodal atoms collapses the gap without harming performance. Additionally, restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improves retrieval.
研究旨在通过利用跨模态冗余的等能性假设来理解视觉-语言模型(VLMs)共享嵌入空间的几何结构。使用对齐的稀疏自编码器(SAE)来鼓励训练过程中的能量一致性同时保持重构。研究发现,这种归纳偏置提供了可用于几何分析的表示,揭示了稀疏双模态原子承载了全部跨模态对齐信号,单模态原子作为模态特定偏差,移除单模态原子可以消除模态差距而不损害性能。此外,将向量算术限制在双模态子空间内可以改善检索并产生同分布编辑。
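To make the Iso-Energy idea concrete, here is a small sketch of how one might compare the average energy of a single SAE atom across modalities. The abstract defines neither "energy" nor a rule for labeling atoms, so the mean squared activation, the relative-imbalance test, and the `rel_tol` threshold below are all assumptions made for illustration.

```python
def mean_energy(activations):
    """Average squared activation of one atom over a batch (assumed energy definition)."""
    return sum(a * a for a in activations) / len(activations)

def classify_atom(image_acts, text_acts, rel_tol=0.25):
    """Label an atom by how evenly its energy is split between modalities.

    rel_tol is a hypothetical threshold on the relative energy imbalance.
    """
    e_img = mean_energy(image_acts)
    e_txt = mean_energy(text_acts)
    total = e_img + e_txt
    if total == 0.0:
        return "dead"  # the atom never fires on either modality
    imbalance = abs(e_img - e_txt) / total
    return "bimodal" if imbalance <= rel_tol else "unimodal"

# Hypothetical activations of two atoms on image and text batches.
print(classify_atom([1.0, 0.8, 1.2], [0.9, 1.1, 1.0]))
print(classify_atom([2.0, 2.0, 2.0], [0.0, 0.1, 0.0]))
```

An atom with similar energy on both modalities is treated as shared, while one that fires almost exclusively on images is treated as modality-specific.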
DeDPO: Debiased Direct Preference Optimization for Diffusion Models
Authors: Khiem Pham, Quang Nguyen, Tung Nguyen, Jingsen Zhu, Michele Santacatterina, Dimitris Metaxas, Ramin Zabih
First: 2026-02-05T21:11:00+00:00 · Latest: 2026-02-05T21:11:00+00:00
Abstract
Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.
中文标题/摘要
标题:DeDPO:去偏置直接偏好优化在扩散模型中的应用
直接偏好优化(DPO)已成为扩散模型的主要对齐方法,促进了无需显式奖励建模的离策训练。然而,其对大规模高质量人类偏好标签的依赖造成了严重的成本和可扩展性瓶颈。为克服这一问题,我们提出了一种半监督框架,该框架通过低成本合成AI反馈对有限的人类数据进行扩充,并使用大量未标注的配对数据。我们的论文引入了去偏置DPO(DeDPO),该方法独特地将因果推断中的去偏置估计技术整合到DPO目标中。通过明确识别并纠正合成标注者固有的系统偏差和噪声,DeDPO 确保了从不完美的反馈源中获得稳健的学习,包括自我训练和视觉-语言模型(VLMs)。实验表明,DeDPO 对合成标注方法的变体具有鲁棒性,其性能与完全基于人类标注数据训练的模型相当,甚至有时超过其理论上限。这确立了DeDPO作为使用廉价合成监督进行人类-AI对齐的可扩展解决方案的地位。
Summary / 总结
DeDPO is a semi-supervised framework that addresses the limitations of Direct Preference Optimization (DPO) by integrating a debiased estimation technique from causal inference. It uses limited human preference labels augmented with synthetic AI feedback to train diffusion models. Experiments show that DeDPO is robust to variations in synthetic labeling methods and achieves performance comparable to or better than models trained on fully human-labeled data, making it a scalable solution for human-AI alignment with inexpensive synthetic supervision.
DeDPO 是一个半监督框架,通过将因果推理中的去偏差估计技术集成到直接偏好优化(DPO)目标中,解决了 DPO 对大规模高质量人工偏好标签的依赖问题。它使用有限的人工偏好标签和合成 AI 反馈进行模型训练。实验表明,DeDPO 对合成标签方法的变化具有鲁棒性,并且其性能与或优于使用完全人工标注数据训练的模型,使其成为使用廉价合成监督进行人工-AI 对齐的可扩展解决方案。
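The abstract does not spell out the debiased estimator, so the sketch below shows only one common debiasing pattern, stated as an assumption and not as DeDPO's actual objective: average the DPO loss under synthetic labels, then add a correction measured on the small human-labeled subset where both label types are available. The `margin` input, the data tuples, and the function names are hypothetical.

```python
import math
from statistics import fmean

def dpo_loss(margin, label=1, beta=0.1):
    """Standard DPO form: -log sigmoid(beta * label * margin).

    margin: implicit-reward gap between the two samples of a pair (hypothetical input;
            for diffusion models it would be derived from denoising errors).
    label: +1 if the first sample is preferred, -1 otherwise.
    """
    return math.log1p(math.exp(-beta * label * margin))

def debiased_loss(unlabeled, labeled, beta=0.1):
    """Synthetic-label loss plus a bias correction from human-labeled pairs (assumed form).

    unlabeled: iterable of (margin, synthetic_label)
    labeled:   iterable of (margin, human_label, synthetic_label)
    """
    synthetic_term = fmean(dpo_loss(m, y_syn, beta) for m, y_syn in unlabeled)
    correction = fmean(
        dpo_loss(m, y_hum, beta) - dpo_loss(m, y_syn, beta)
        for m, y_hum, y_syn in labeled
    )
    return synthetic_term + correction

unlabeled_pairs = [(2.0, 1), (-1.0, -1)]
labeled_pairs = [(1.0, 1, -1)]  # the synthetic annotator disagrees with the human here
print(debiased_loss(unlabeled_pairs, labeled_pairs))
```

When the synthetic annotator agrees with the human labels, the correction term vanishes and the estimate reduces to the plain synthetic-label loss.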
PhenoLIP: Integrating Phenotype Ontology Knowledge into Medical Vision-Language Pretraining
Authors: Cheng Liang, Chaoyi Wu, Weike Zhao, Ya Zhang, Yanfeng Wang, Weidi Xie
First: 2026-02-05T20:44:07+00:00 · Latest: 2026-02-05T20:44:07+00:00
Abstract
Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
中文标题/摘要
标题:PhenoLIP:将表型本体知识整合到医疗视觉-语言预训练中
大规模CLIP-like视觉-语言模型(VLMs)的近期进展极大地推动了医学图像分析。然而,现有的大多数医学VLMs仍然依赖粗略的图像-文本对比目标,未能捕捉到在明确定义的医学表型本体中编码的系统视觉知识。为解决这一差距,我们构建了PhenoKG,这是首个大规模的、以表型为中心的多模态知识图谱,包含超过52万个高质量的图像-文本对,链接到超过3,000个表型。基于PhenoKG,我们提出PhenoLIP,这是一种新颖的预训练框架,通过两阶段过程显式地将结构化的表型知识整合到医学VLMs中。我们首先从文本本体数据中学习知识增强的表型嵌入空间,然后通过教师引导的知识蒸馏目标将这种结构化的知识蒸馏到多模态预训练中。为了支持评估,我们进一步引入了PhenoBench,这是一个由专家验证的基准,用于表型识别,包含超过7,800个图像-描述对,覆盖超过1,000个表型。广泛的实验表明,PhenoLIP 在表型分类准确性上优于之前的最先进的基线,比BiomedCLIP 提高了8.85%,比BIOMEDICA 在跨模态检索上提高了15.03%,突显了将表型为中心的先验整合到医学VLMs中以实现结构化和可解释的医学图像理解的价值。
Summary / 总结
This paper addresses the limitation of existing medical vision-language models in capturing systematic visual knowledge from medical phenotype ontologies. It introduces PhenoKG, a large-scale multimodal knowledge graph, and PhenoLIP, a pretraining framework that integrates structured phenotype knowledge. Experimental results show that PhenoLIP outperforms previous state-of-the-art models in phenotype classification and cross-modal retrieval, highlighting the importance of incorporating phenotype-centric knowledge into medical VLMs for better medical image understanding.
研究旨在通过将表型本体知识集成到视觉-语言模型中来提升医学图像分析。方法包括构建PhenoKG大规模多模态知识图谱,并提出PhenoLIP,这是一种将结构化表型知识整合到医学视觉-语言模型中的预训练框架。实验结果表明,PhenoLIP在表型分类准确性和跨模态检索方面分别比BiomedCLIP和BIOMEDICA提高了8.85%和15.03%。
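The abstract mentions a teacher-guided knowledge distillation objective without giving its form. As a generic illustration only, the sketch below uses the classic temperature-softened KL divergence between teacher and student distributions over phenotypes; the logits, the temperature, and the choice of KL are assumptions and may differ from what PhenoLIP actually optimizes.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_kl(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions (assumed objective)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical similarity logits of one image against three phenotype embeddings.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 2.0, 1.0]
print(distillation_kl(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.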
CORE: Context-Robust Remasking for Diffusion Language Models
Authors: Kevin Zhai, Sabbir Mollah, Zhenyi Wang, Mubarak Shah
First: 2026-02-04T00:12:30+00:00 · Latest: 2026-02-05T20:29:29+00:00
Abstract
Standard decoding in Masked Diffusion Models (MDMs) is hindered by context rigidity: tokens are retained based on transient high confidence, often ignoring that early predictions lack full context. This creates cascade effects where initial inconsistencies misguide the remaining generation. Existing revision strategies attempt to mitigate this by relying on static confidence scores, but these signals are inherently myopic; inconsistent tokens can appear confident to the model itself. We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. We formalize revision as a robust optimization objective over context shifts and efficiently approximate this objective to prioritize unstable tokens for revision. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
中文标题/摘要
标题:CORE:面向上下文的稳健重遮盖以提升扩散语言模型
在掩码扩散模型(MDMs)中,标准解码受限于上下文刚性:基于短暂的高置信度保留令牌,往往忽视早期预测缺乏完整上下文的情况。这导致初始不一致性的累积效应,误导后续生成。现有修订策略试图通过依赖静态置信分数来缓解这一问题,但这些信号本质上是短视的;不一致的令牌可能在模型自身看来显得自信。我们提出了面向上下文的稳健重遮盖(CORE),一种无需训练的推理时修订框架。CORE 不依赖静态令牌概率,而是通过探测目标遮盖上下文扰动对令牌的敏感性来识别上下文脆弱的令牌。我们将修订形式化为上下文转换下的鲁棒优化目标,并高效地近似此目标以优先修订不稳定的令牌。在LLaDA-8B-Base上,CORE 在推理和代码基准测试中提供了持续改进,超越了计算匹配的基线,并将MBPP提高了高达9.2个百分点。
Summary / 总结
The research addresses the issue of context rigidity in Masked Diffusion Models (MDMs) where early predictions can misguide the subsequent generation process. CORE, a training-free framework, identifies context-brittle tokens by probing their sensitivity to masked-context perturbations and prioritizes these tokens for revision. On LLaDA-8B-Base, CORE improves performance across reasoning and code benchmarks, outperforming compute-matched baselines and enhancing MBPP by up to 9.2 percentage points.
论文针对Masked Diffusion Models (MDMs)中的上下文僵化问题,即早期预测可能会误导整个生成过程。它提出了Context-Robust Remasking (CORE)框架,通过探查敏感于目标遮蔽上下文扰动的上下文脆弱标记来识别这些标记。CORE然后优先对这些标记进行修订以提高整体生成的一致性。在LLaDA-8B-Base上,CORE在推理和代码基准测试中表现出一致的改进,超越了计算匹配的基线,并将MBPP得分提高了最多9.2个百分点。
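The core idea, ranking tokens by how much their confidence drops when parts of the context are masked, can be sketched with a toy scorer. This is an illustration under stated assumptions and not CORE's actual procedure: `toy_confidence` stands in for a masked diffusion model, the perturbation set masks one other position at a time, and the worst-case drop is computed by brute force, whereas the paper describes an efficient approximation whose details are not in the abstract.

```python
def brittleness(confidence_fn, tokens, position, perturbations):
    """Worst-case confidence drop for one token under masked-context perturbations."""
    base = confidence_fn(tokens, position, frozenset())
    worst = min(
        (confidence_fn(tokens, position, frozenset(p)) for p in perturbations),
        default=base,
    )
    return base - worst

def select_for_remasking(confidence_fn, tokens, k):
    """Return the k positions whose confidence is most sensitive to context masking."""
    n = len(tokens)
    scores = {}
    for i in range(n):
        # Assumed perturbation set: mask each other position on its own.
        perturbations = [{j} for j in range(n) if j != i]
        scores[i] = brittleness(confidence_fn, tokens, i, perturbations)
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy model (hypothetical): position 2 is only confident while position 1 is visible.
DEPENDS_ON = {2: 1}

def toy_confidence(tokens, position, masked):
    if DEPENDS_ON.get(position) in masked:
        return 0.3
    return 0.9

tokens = ["def", "add", "(", "a"]
print(select_for_remasking(toy_confidence, tokens, k=1))
```

All four tokens share the same static confidence of 0.9 here, so a static-confidence rule could not tell them apart, while the perturbation probe singles out the context-dependent position.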
Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing
Authors: Xinyi Lin, Yuyang Zhang, Yuanhang Gan, Juntao Chen, Hao Shen, Yichun He, Lijun Li, Ze Yuan, Shuang Wang, Chaohao Wang, Rui Zhang, Na Li, Jia Liu
First: 2025-11-03T21:12:48+00:00 · Latest: 2026-02-05T20:23:12+00:00
Abstract
Scientific experimentation and manufacturing rely on prolonged protocol development and complex, multi-step implementation, which require continuous human expertise for precise execution and decision-making, limiting interpretability and scalability. Here, we introduce human-artificial intelligence (AI) co-embodied intelligence, a new form of physical AI that unites human researchers, agentic AI, and wearable hardware. In this paradigm, humans provide precise execution, while agentic AI contributes contextual reasoning, adaptive planning, and analysis. The wearable interface continuously captures experimentation and manufacturing, facilitating seamless communication between humans and AI. We instantiate this paradigm in a microfabrication cleanroom, leading to the agentic-physical experimentation (APEX) system, which understands fabrication procedures with 51% higher accuracy than state-of-the-art multimodal large language models/vision language models (LLMs/VLMs), detects and corrects fabrication errors in real time, and transfers procedural expertise to novice users. Critically, the APEX system enables the co-development of fabrication protocols in cleanrooms, overcoming the incompatibility of elastomeric materials in standard microfabrication processes and enabling previously unattainable fabrication outcomes, as demonstrated by the wafer-scale realization of a brain-level soft neural probe capable of single-unit-resolution neural recording. These results establish human-AI co-embodied intelligence, which extends agentic reasoning beyond computation into the physical domain, transforming scientific experimentation and manufacturing into autonomous, traceable, interpretable and scalable processes.
中文标题/摘要
标题:人类-人工智能共融智能在科学研究与制造中的应用
科学研究与制造依赖于长时间的协议开发和复杂的多步骤实施,这需要持续的人类专业知识以实现精确执行和决策,限制了可解释性和可扩展性。在此,我们介绍人类-人工智能共融智能,这是一种新的物理人工智能形式,结合了人类研究人员、自主人工智能和可穿戴硬件。在此范式中,人类提供精确执行,而自主人工智能则贡献上下文推理、自适应规划和分析。可穿戴界面持续捕捉实验和制造过程,促进人类与人工智能之间的无缝通信。我们在此微制造洁净室中实例化此范式,导致了自主物理实验(APEX)系统,该系统在准确度方面比最先进的多模态大型语言模型/视觉语言模型(LLMs/VLMs)高出51%,能够实时检测和纠正制造错误,并将程序性专业知识传授给新手用户。关键的是,APEX系统使洁净室中的制造协议共同开发成为可能,克服了标准微制造过程中弹性材料的不兼容性,实现了前所未有的制造成果,如晶圆级实现的脑级软神经探针,能够实现单单元分辨率的神经记录。这些结果确立了人类-人工智能共融智能,将自主推理扩展到物理领域,将科学研究与制造转变为自主、可追溯、可解释和可扩展的过程。
Summary / 总结
This research aims to enhance scientific experimentation and manufacturing by integrating human expertise with AI, leading to a co-embodied intelligence system. The method involves using agentic AI for contextual reasoning and adaptive planning, while humans provide precise execution. Key findings include a 51% higher accuracy in understanding fabrication procedures compared to state-of-the-art models, real-time error detection and correction, and the successful fabrication of a wafer-scale brain-level soft neural probe, demonstrating the system's capability to extend AI reasoning into the physical domain and improve scalability and interpretability.
该研究引入了结合人类专业知识与智能代理AI及穿戴式硬件的人机共融智能,用于科学研究与制造。系统名为APEX,相比最先进的模型提高了51%的制造精度,能够实时检测并纠正错误,并将操作知识传授给新手。APEX实现了晶圆级软神经探针的制造,展示了其将科学研究过程转变为自主、可解释和可扩展系统的潜力。
M3: High-fidelity Text-to-Image Generation via Multi-Modal, Multi-Agent and Multi-Round Visual Reasoning
Authors: Bangji Yang, Ruihan Guo, Jiajun Fan, Chaoran Cheng, Ge Liu
First: 2026-02-05T20:10:27+00:00 · Latest: 2026-02-05T20:10:27+00:00
Abstract
Generative models have achieved impressive fidelity in text-to-image synthesis, yet struggle with complex compositional prompts involving multiple constraints. We introduce M3 (Multi-Modal, Multi-Agent, Multi-Round), a training-free framework that systematically resolves these failures through iterative inference-time refinement. M3 orchestrates off-the-shelf foundation models in a robust multi-agent loop: a Planner decomposes prompts into verifiable checklists, while specialized Checker, Refiner, and Editor agents surgically correct constraints one at a time, with a Verifier ensuring monotonic improvement. Applied to open-source models, M3 achieves remarkable results on the challenging OneIG-EN benchmark, with our Qwen-Image+M3 surpassing commercial flagship systems including Imagen4 (0.515) and Seedream 3.0 (0.530), reaching state-of-the-art performance (0.532 overall). This demonstrates that intelligent multi-agent reasoning can elevate open-source models beyond proprietary alternatives. M3 also substantially improves GenEval compositional metrics, effectively doubling spatial reasoning performance on hardened test sets. As a plug-and-play module compatible with any pre-trained T2I model, M3 establishes a new paradigm for compositional generation without costly retraining.
中文标题/摘要
标题:M3:通过多模态、多代理和多轮视觉推理实现高保真文本到图像生成
生成模型在文本到图像合成中实现了令人印象深刻的保真度,但在处理涉及多个约束的复杂组合提示方面却存在困难。我们引入了M3(多模态、多代理、多轮次),这是一种无需训练的框架,通过迭代推理时的逐步改进系统地解决了这些问题。M3 在一个稳健的多代理循环中协调现成的基础模型:规划者将提示分解为可验证的清单,而专门的检查员、修正器和编辑器代理则依次纠正每个约束,验证者确保逐步改进。应用于开源模型,M3 在具有挑战性的 OneIG-EN 基准测试中取得了显著成果,我们的 Qwen-Image+M3 超过了包括 Imagen4(0.515)和 Seedream 3.0(0.530)在内的商用旗舰系统,达到最先进的性能(总体 0.532)。这表明智能多代理推理可以将开源模型提升到专有替代品之上。M3 还大幅提高了 GenEval 的组合指标,有效将硬化测试集的空间推理性能翻倍。作为一个与任何预训练的 T2I 模型兼容的即插即用模块,M3 为组合生成确立了一个无需昂贵重新训练的新范式。
Summary / 总结
The research aims to address the limitations of generative models in handling complex compositional prompts with multiple constraints. M3, a training-free framework, employs a multi-agent system with a Planner, Checker, Refiner, and Editor to iteratively refine images based on prompt decomposition. The system achieves superior results on the OneIG-EN benchmark, surpassing commercial systems like Imagen4 and Seedream 3.0, and significantly improving compositional metrics in GenEval tests.
研究旨在解决生成模型在处理包含多个约束的复杂组合提示时的局限性。M3 是一个无需训练的框架,使用由规划者、检查者、修正者和编辑者组成的多智能体系统,基于提示逐步细化图像。该系统在 OneIG-EN 基准测试中取得了优于商业系统的如 Imagen4 和 Seedream 3.0 的结果,并显著提高了 GenEval 的组合指标。
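The control flow of the multi-agent loop described above can be sketched with stub agents. In M3 each role is played by an off-the-shelf foundation model; here every agent is a hypothetical toy function, the "image" is just the set of constraints it already satisfies, and the Refiner role is folded into the editing step for brevity.

```python
def plan(prompt):
    """Planner stub: one checklist item per comma-separated clause."""
    return [clause.strip() for clause in prompt.split(",") if clause.strip()]

def check(image, checklist):
    """Checker stub: list the checklist items the current image fails."""
    return [item for item in checklist if item not in image]

def edit(image, item):
    """Editor stub: propose a candidate image that addresses a single item."""
    return image | {item}

def score(image, checklist):
    """Number of checklist items the image satisfies."""
    return sum(item in image for item in checklist)

def refine(prompt, image, max_rounds=5):
    """Fix one failing constraint per round; accept an edit only if the score does not drop."""
    checklist = plan(prompt)
    for _ in range(max_rounds):
        failing = check(image, checklist)
        if not failing:
            break
        candidate = edit(image, failing[0])
        # Verifier step: reject edits that would lower the checklist score.
        if score(candidate, checklist) >= score(image, checklist):
            image = candidate
    return image

prompt = "a red cube, a blue sphere, cube left of sphere"
initial = frozenset({"a red cube"})
print(sorted(refine(prompt, initial)))
```

Because the verifier only accepts candidates that score at least as well as the current image, the checklist score never decreases across rounds, which is the monotonic-improvement property the abstract attributes to the Verifier.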