When Vision Speaks for Sound
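The study finds that the audio understanding of video MLLMs actually relies on visual cues, exposing a hallucination problem in these models and calling into question whether they are genuinely multimodal.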
arXiv:2605.16403v1 Announce Type: new Abstract: Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in vide…