Niu Ge's Picks (牛哥精选) · This Week

1

🤖 AI Tools Hacker News Ask 2026-05-22

Tell HN: Gemini 3.5 Flash breaks in stupid ways

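Google's lightweight AI model is fast and good at text generation and reasoning, but when using it for scoring, avoid complex rubrics to prevent central tendency bias.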

I thought I was going crazy, trying to use Gemini 3.5 Flash to rate some answers, but it kept giving 7 instead of 10 for correct answers. Apparently o…

AI models, text generation, scoring, chat, Google

2

🤖 AI·LLMs arXiv AI 2026-05-21

Automated Grading of Handwritten Mathematics Using Vision-Capable LLMs

Can handwritten math be graded automatically? Vision-capable LLMs move AI in education a step forward, in an empirical study from AIED 2026.

arXiv:2605.19043v1 Announce Type: cross Abstract: Automated grading systems have enabled scalable assessment for many response types, but handwritten …

handwritten math, automated grading, multimodal LLMs, edtech

3

📝 Deep Tech arXiv Machine Learning 2026-05-20

Step-wise Rubric Rewards for LLM Reasoning

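Proposes a step-wise rubric reward mechanism that improves supervision of intermediate steps in LLM reasoning, moving beyond the traditional limitation of rewarding only the final answer.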

arXiv:2605.17291v1 Announce Type: new Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is widely used to improve reasoning in large lan…

LLM reasoning, reinforcement learning, step-wise rewards, RLVR, research paper

4

🤖 AI·LLMs arXiv Machine Learning 2026-05-20

optimize_anything: A Universal API for Optimizing any Text Parameter

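A unified API for optimizing text parameters: a single LLM-based system rivals specialized tools across domains and searches efficiently without customization.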

arXiv:2605.19633v1 Announce Type: cross Abstract: Can a single LLM-based optimization system match specialized tools across fundamentally different do…

LLM, text optimization, universal API, cross-domain, scoring function

5

🔓 Open Source Hacker News Show 2026-05-20

Show HN: AI Efficiency Score – paste any GitHub repo, get a score in seconds

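Paste a GitHub repo and get an AI usage efficiency score in seconds. No sign-up required; suited to engineering teams running a self-check.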

Article URL: https://costlens.dev/score Comments URL: https://news.ycombinator.com/item?id=48194736 Points: 2 # Comments: 0

GitHub, AI efficiency score, repo analysis, no sign-up

6

🤖 AI·LLMs arXiv Computer Vision 2026-05-20

Auditing Multimodal LLM Raters: Central Tendency Bias in Clinical Ordinal Scoring

The study finds that multimodal LLMs are prone to central tendency bias in clinical ordinal scoring, which reduces assessment accuracy.

arXiv:2605.16386v1 Announce Type: new Abstract: Multimodal large language models (LLMs) are increasingly explored as automated evaluators in clinical …

multimodal LLMs, clinical scoring, central tendency bias, CDT, ordinal scoring

7

🎨 Design & Creativity Node.js Blog 2026-05-20

New HackerOne Signal Requirement for Vulnerability Reports

Node.js raises the bar for HackerOne reports: new researchers must contact the security team via Slack, a response to a surge in low-quality reports.

Node.js, HackerOne, vulnerability reports, security policy, Signal score

8

📝 Deep Tech arXiv Machine Learning 2026-05-20

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

Proposes a memory-augmented rubric improvement system that improves the effectiveness of rubric-based reinforcement learning.

arXiv:2605.18592v1 Announce Type: new Abstract: Rubric-based reward shaping is an effective method for fine-tuning LLMs via RL, where structured rubri…

reinforcement learning, rubric improvement, memory augmentation, system design, academic paper

9

🚀 Product Watch IT之家 2026-05-20

Nintendo Switch 2 game "Yoshi and the Mysterious Book" launches May 21, gets a 6 from IGN for "lacking creative depth"

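Nintendo's new Switch 2 title "Yoshi and the Mysterious Book" launches May 21. IGN scores it just 6: frequent creative highlights, but not enough depth.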

IT之家, May 19: Nintendo's first-party game "Yoshi and the Mysterious Book" (《耀西与不可思议的图鉴》) launches on May 21. The review embargo has lifted, and the game holds a Metacritic score of 81 (64 outlets). IGN gave the game a 6, arguing that its creative…

Nintendo, games, Yoshi and the Mysterious Book, launch

10

📝 Deep Tech arXiv AI 2026-05-19

Do Language Models Align with Brains? Prediction Scores Are Not Enough

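A new framework, L-PACT, shows that the evidence for alignment between language models and brains is insufficient; prediction scores are not a reliable indicator.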

arXiv:2605.14025v1 Announce Type: cross Abstract: Brain-language model comparisons often interpret neural prediction scores as evidence that model rep…

language models, brain alignment, L-PACT, neuroscience, prediction scores

11

🚀 Product Watch 少数派 2026-05-19

Why do media reviews give low scores to things I think are good?

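Explores the gap between subjective experience and media reviews, examining the causes from the perspectives of manufacturers, media outlets, and the community, and prompting a closer look at how products are evaluated.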

Why do you give such a low score to a game I find fun? Why did your launch review lavish praise on a product that later ran into problems in everyday use?

subjective experience, media reviews, product evaluation, tech reviews, game reviews

12

🤖 AI·LLMs Dev.to 2026-05-19

Introducing AgentRisk Trust Badges for AI Agents

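A new tool for AI agent trust scoring: AgentRisk launches embeddable badges that display trust scores across six dimensions, aimed at raising user trust and helping agents stand out.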

2026-05-16: If you've ever published a bot or tool agent on an agent platform, you know t…

AI agents, trust scoring, trust badges, certification, security

13

🚀 Product Watch Hacker News AI 2026-05-19

AI Sends A Grades into Overdrive

AI is driving grade inflation at colleges. Who benefits, and who is harmed?

Article URL: https://www.axios.com/2026/05/16/ai-grade-inflation-college-classes Comments URL: https://news.ycombinator.com/item?id=48161402 Points: 5…

AI, higher education, grade inflation, grading systems, academic integrity
