Tell HN: Gemini 3.5 Flash breaks in stupid ways
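Google's lightweight AI model: fast responses, strong at text generation and reasoning. When using it for scoring, avoid complex rubrics to guard against central-tendency bias.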
I thought I was going crazy, trying to use Gemini 3.5 Flash to rate some answers, but it kept giving 7 instead of 10 for correct answers. Apparently o…
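Can handwritten math be graded automatically too? Large vision models take AI in education a step further, in an empirical study from AIED 2026.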
arXiv:2605.19043v1 Announce Type: cross Abstract: Automated grading systems have enabled scalable assessment for many response types, but handwritten …
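Proposes a step-wise scoring reward mechanism that improves supervision of intermediate steps in LLM reasoning, moving past the traditional limitation of rewarding only the final answer.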
arXiv:2605.17291v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large lan…
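A unified API for text-parameter optimization: a single LLM system rivals specialized tools across domains and searches efficiently without customization.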
arXiv:2605.19633v1 Announce Type: cross Abstract: Can a single LLM-based optimization system match specialized tools across fundamentally different do…
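Paste a GitHub repo and instantly get an AI usage efficiency score. No sign-up required; suited to engineering teams running a self-check.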
Article URL: https://costlens.dev/score Comments URL: https://news.ycombinator.com/item?id=48194736 Points: 2 # Comments: 0
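Study finds that multimodal large models are prone to central-tendency bias in clinical ordinal scoring, which affects evaluation accuracy.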
arXiv:2605.16386v1 Announce Type: new Abstract: Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical …
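Node.js raises the bar for HackerOne reports: new researchers must contact the security team via Slack, in response to a surge in low-quality reports.
Proposes a memory-augmented rubric refinement system that improves rubric-based reinforcement learning.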
arXiv:2605.18592v1 Announce Type: new Abstract: Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubri…
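Nintendo's new Switch 2 title Yoshi and the Mysterious Book launches May 21. IGN gives it only a 6: frequent creative highlights, but not enough depth.
IT之家 (ITHome), May 19: Nintendo's first-party game Yoshi and the Mysterious Book (《耀西与不可思议的图鉴》) launches on May 21. The review embargo has lifted, and the game holds an 81 on Metacritic (64 outlets). IGN gave the game a 6, saying its creative…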
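The new L-PACT framework shows that evidence for alignment between language models and the brain is insufficient, and that prediction scores are not a reliable indicator.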
arXiv:2605.14025v1 Announce Type: cross Abstract: Brain-language model comparisons often interpret neural prediction scores as evidence that model rep…
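Explores the gap between subjective experience and media reviews, examining the causes from the perspectives of vendors, media, and the community, and prompting a deeper look at how products are evaluated.
Why did you give such a low score to a game I found fun? Why did you rave in your launch review about a product that later ran into problems in use?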
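A new tool for AI agent trust scoring: AgentRisk launches embeddable badges that display trust scores across six dimensions, boosting user trust and helping agents stand out.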
Introducing AgentRisk Trust Badges for AI Agents 2026-05-16 · 4 min read If you've ever published a bot or tool agent on an agent platform, you know t…
AI is driving grade "inflation" at universities. Who benefits, and who is harmed?
Article URL: https://www.axios.com/2026/05/16/ai-grade-inflation-college-classes Comments URL: https://news.ycombinator.com/item?id=48161402 Points: 5…