One-sentence summary
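V2-0628 delivers large gains on math and reasoning benchmarks, and its Arena-Hard win rate against GPT-4-0314 rises from 41.6% to 68.3% (roughly 1.6x), making it the most significant capability jump of the V2 era.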
Detailed description
deepseek-chat was upgraded to DeepSeek-V2-0628. HumanEval rose from 79.88% to 84.76%, MATH from 55.02% to 71.02%, and BBH from 78.56% to 83.40%. The Arena-Hard win rate against GPT-4-0314 increased from 41.6% to 68.3%, and role-playing capabilities were significantly enhanced.
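To put the reported scores side by side, the short sketch below computes the absolute gain (in percentage points) and the after/before ratio for each benchmark. The figures are the ones quoted in this note; the names `SCORES`, `gains`, and `report` are illustrative only and do not come from any DeepSeek tooling.

```python
# Scores (percent) before and after the DeepSeek-V2-0628 update, as quoted in this note.
SCORES = {
    "HumanEval": (79.88, 84.76),
    "MATH": (55.02, 71.02),
    "BBH": (78.56, 83.40),
    "Arena-Hard vs GPT-4-0314": (41.6, 68.3),
}


def gains(before, after):
    """Return (absolute gain in percentage points, after/before ratio)."""
    return after - before, after / before


def report(scores):
    """Build one human-readable line per benchmark."""
    lines = []
    for name, (before, after) in scores.items():
        delta, ratio = gains(before, after)
        lines.append(f"{name}: +{delta:.2f} points, {ratio:.2f}x")
    return lines


if __name__ == "__main__":
    print("\n".join(report(SCORES)))
```

On these numbers the Arena-Hard win rate gains about 26.7 points, a ratio of roughly 1.64, while MATH shows the largest gain among the three accuracy benchmarks at 16 points.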
Original excerpt
HumanEval Pass@1 79.88% -> 84.76%, MATH ACC@1 55.02% -> 71.02%, BBH 78.56% -> 83.40%. In the Arena-Hard evaluation, the win rate against GPT-4-0314 increased from 41.6% to 68.3%.