Design and Report Benchmarks for Knowledge Work
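Design and reporting benchmarks for knowledge work, offering a new dimension for quantitatively evaluating how AI systems perform in real office settings.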
arXiv:2605.23262v1 Announce Type: new Abstract: The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding…
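AI evaluation cannot rely on benchmarks alone; a more reliable evaluation framework should be built on a theory of system capabilities.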
arXiv:2509.19590v2 Announce Type: replace-cross Abstract: Evaluations of generative models are now ubiquitous, and their outcomes critically shape pub…
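Using socially aligned synthetic data to bring AI evaluation closer to real-world social settings, improving model sensitivity and trustworthiness.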
arXiv:2605.14381v2 Announce Type: replace Abstract: Recent advancements in generative AI facilitate large-scale synthetic data generation for model ev…
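A business-perspective look at how AI evals become a key tool for improving performance and reducing risk, in an in-depth explainer from OpenAI.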
Learn how evals help businesses define, measure, and improve AI performance—reducing risk, boosting productivity, and driving strategic advantage.
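The first benchmark to evaluate AI on building complete web applications from scratch, covering 100 specifications and targeting the shortcomings of existing evaluations.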
arXiv:2603.04601v3 Announce Type: replace-cross Abstract: Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks…
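OpenAI releases PaperBench, which evaluates AI's ability to replicate frontier AI research and tests agents on the full pipeline from paper to code implementation.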
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.
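The first benchmark dedicated to evaluating LLMs' ability to generate SVG Pokémon, showing how the major models actually rank at graphics generation.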
Article URL: https://svg-bench.fenx.work/ Comments URL: https://news.ycombinator.com/item?id=48138312