Large language models (LLM)

Tools: R, Python

Author: Tony Duan

Published: March 18, 2025

library(tidyverse)
library(openxlsx)
library(readxl)
# Read the workbook (expected in the working directory) and preview the first rows
data001 <- read_excel("AI model.xlsx")
head(data001)

LLM performance

Math

AIME

https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination

MATH-500

Coding

Codeforces

LiveCodeBench

English

MMLU

Measuring Massive Multitask Language Understanding (MMLU)

https://en.wikipedia.org/wiki/MMLU

Science

GPQA-Diamond

Graduate-Level Google-Proof Q&A

Description: The full GPQA dataset consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry; GPQA-Diamond is its 198-question, highest-quality subset. The questions are intentionally designed to be high-quality and extremely difficult.

Expert Accuracy: Experts who hold or are pursuing PhDs in the corresponding domains reach only 65% accuracy on these questions (74% when excluding clear mistakes the experts identified in retrospect).

Google-Proof: The questions are “Google-proof”: highly skilled non-expert validators reach only 34% accuracy, despite spending over 30 minutes searching for answers with unrestricted access to the web.

AI Systems Difficulty: In the original GPQA paper, state-of-the-art AI systems also struggled: the authors' strongest GPT-4 based baseline achieved only 39% accuracy on this dataset.
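
To put the accuracy figures above in context, it can help to compare them with random guessing. The following is a minimal Python sketch, not part of the original analysis. It assumes each question has four answer options (so chance is 25%), and the helper names `accuracy` and `margin_over_chance` are hypothetical, introduced only for illustration. The percentages are the ones quoted in the description above.

```python
# Accuracy figures quoted in the GPQA description above (percent correct).
REPORTED_ACCURACY = {
    "domain experts": 65.0,
    "domain experts, excluding clear mistakes": 74.0,
    "skilled non-experts with web access": 34.0,
    "GPT-4 based baseline": 39.0,
}

# Assumption: four answer options per question, so random guessing scores 25%.
N_OPTIONS = 4
CHANCE_ACCURACY = 100.0 / N_OPTIONS


def accuracy(predictions, answers):
    """Percent of predictions that exactly match the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def margin_over_chance(acc, chance=CHANCE_ACCURACY):
    """Percentage points by which an accuracy exceeds random guessing."""
    return acc - chance


if __name__ == "__main__":
    # Print each reported figure next to its margin over the chance level.
    for group, acc in REPORTED_ACCURACY.items():
        margin = margin_over_chance(acc)
        print(f"{group}: {acc:.0f}% ({margin:+.0f} points vs. chance)")
```

Under the four-option assumption, the 34% non-expert score sits only 9 points above chance and the 39% GPT-4 baseline 14 points above it, which is what makes the benchmark hard. The same `accuracy` helper could score any multiple-choice benchmark given a list of predicted and correct option letters.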

SimpleBench

https://simple-bench.com/

Compare online

https://lmarena.ai/