Training-Free Multimodal Large Language Model Orchestration
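Orchestrates multimodal large language models without any training, coordinating multiple capabilities zero-shot and lowering the barrier to deployment.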
arXiv:2508.10016v4 Announce Type: replace Abstract: Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse…
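Breaking the multimodal data alignment bottleneck: the paper proposes training MLLMs using only paired modalities, markedly improving cross-domain scalability.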
arXiv:2605.21059v1 Announce Type: cross Abstract: Despite the impressive results achieved by multimodal large language models (MLLMs), their training …
Brings image tokenization closer to text semantics, proposing a new method to improve fusion in multimodal large language models.
arXiv:2605.17954v1 Announce Type: cross Abstract: Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a …
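A new benchmark for evaluating operation-centric chain-of-thought reasoning in multimodal large language models, emphasizing grounding and verifiability.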
arXiv:2605.19559v1 Announce Type: new Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egoce…
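ECG-R1: a protocol-guided, modality-agnostic multimodal large language model that makes electrocardiogram interpretation more reliable, a new advance in medical AI.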
arXiv:2602.04279v2 Announce Type: replace Abstract: Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet exi…
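Proposes the new concept of "reasoning portability", pointing a direction for continual learning of multimodal large language models in the era of reinforcement learning.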
arXiv:2605.18903v1 Announce Type: new Abstract: Vision-Language Models in Continual Learning (VLM-CL) aim to continuously adapt to new multimodal task…
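Proposes the Vision Inference Former method to address visual consistency in multimodal large language models, offering a new paradigm for vision-language fusion.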
arXiv:2605.18160v1 Announce Type: new Abstract: In recent years, multimodal large language models (MLLMs) have achieved remarkable progress, primarily…
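The study finds that the audio understanding of video MLLMs actually relies on visual cues, exposing a hallucination problem and calling multimodal faithfulness into question.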
arXiv:2605.16403v1 Announce Type: new Abstract: Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in vide…
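A survey of recent work on continual learning for vision-language models, examining the challenges of evolving multimodal large language models from a perspective that goes beyond forgetting.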
arXiv:2508.04227v2 Announce Type: replace-cross Abstract: Vision-language models (VLMs) and the recent surge of Multimodal Large Language Models (MLLM…
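Multimodal large language models combined with multilingual OCR and prompt-guided chain-of-thought reasoning improve understanding of text in images.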
arXiv:2605.16409v1 Announce Type: cross Abstract: Optical character recognition (OCR) and multilingual text understanding remain major failure modes o…
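CG-MLLM, a new multimodal large language model, performs 3D content captioning and high-resolution generation, overcoming the bottleneck in fine-grained geometric modeling.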
arXiv:2601.21798v2 Announce Type: replace Abstract: Large Language Models (LLMs) have revolutionized text generation and multimodal perception, but thei…
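Multimodal large language models overcome the context bottleneck to accurately segment agricultural satellite imagery of the Global South, addressing data scarcity and domain alignment.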
arXiv:2605.16179v1 Announce Type: new Abstract: Agricultural landscape segmentation in the Global South is challenging as it is characterized by fragm…
Unpacking non-verbal social reasoning in multi-person videos: the GRASP dataset teaches AI to tell who is interacting with whom.
arXiv:2605.15764v1 Announce Type: cross Abstract: Understanding social interactions requires reasoning over subtle non-verbal cues, yet current multim…
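ASRU, a new unlearning method for multimodal large language models, combines activation steering with reinforcement learning to improve generation quality after unlearning and better match practical needs.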
arXiv:2605.15687v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretr…
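The first dataset and evaluation benchmark focused on small-object spatial understanding in indoor videos, targeting the weakness of multimodal large language models in perceiving small objects.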
arXiv:2604.08991v2 Announce Type: replace-cross Abstract: Small object-centric spatial understanding in indoor videos remains a significant challenge …
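Selective prediction meets visual evidence scoring: SIEVES triples the coverage of multimodal large language models on zero-shot out-of-domain benchmarks, needs no model-internal signals, and works directly with closed-source models, offering a practical route to reliable deployment.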
arXiv:2604.25855v2 Announce Type: replace-cross Abstract: Multimodal large language models (MLLMs) achieve ever-stronger performance on visual-languag…