ProbeLLM: Automating Principled Diagnosis of LLM Failures
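A new framework that automates the diagnosis of large-model failures, using a principled approach to locate the root causes of LLM errors.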
arXiv:2602.12966v2 Announce Type: replace Abstract: Understanding how and why large language models (LLMs) fail is becoming a central challenge as mod…
Testing the "probabilistic intuition" of LLMs with dice rolls, revealing reliability shortfalls of large models on simple random tasks.
arXiv:2606.07515v1 Announce Type: cross Abstract: We investigate the probabilistic reasoning capabilities of large language models through a controlle…
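What can be done when entity binding goes wrong during speech LLM reasoning? This paper systematically diagnoses the problem and proposes a chain-of-thought intervention.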
arXiv:2606.04474v1 Announce Type: new Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We rev…
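Claude performs a task perfectly when it executes it directly, yet the code it generates for the same task repeatedly fails, revealing a behavioral gap between "doing" and "teaching" in AI models.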
I asked Claude to do some heavy work, and it was done perfectly. When I asked Claude to write a Python script to do the same task with proper prompts …