A Benchmark of SQL Translation for Evaluating Large Language Models
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.34 | 0.10 | 0.73 |
| 2 | gemini-2.5-flash | 0.40 | 0.10 | 0.71 |
| 3 | gpt-3.5-turbo | 0.43 | 0.12 | 0.64 |
| 4 | deepseek-coder-6.7b-instruct | 0.25 | 0.06 | 0.39 |
| 5 | codellama-7b-instruct | 0.24 | 0.06 | 0.23 |
| 6 | sqlcoder-7b | 0.13 | 0.01 | 0.16 |
| 7 | deepseek-r1-8b-llama-distill-q8_0 | 0.24 | 0.01 | 0.15 |
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.70 | 0.45 | 0.56 |
| 2 | gpt-3.5-turbo | 0.57 | 0.09 | 0.56 |
| 3 | gemini-2.5-flash | 0.59 | 0.46 | 0.53 |
| 4 | deepseek-coder-6.7b-instruct | 0.27 | 0.07 | 0.26 |
| 5 | deepseek-r1-8b-llama-distill-q8_0 | 0.29 | 0.09 | 0.17 |
| 6 | codellama-7b-instruct | 0.21 | 0.04 | 0.14 |
| 7 | sqlcoder-7b | 0.16 | 0.01 | 0.08 |
Evaluation Metrics
Dialect Matching (DM): Measures how accurately the model translates dialect-specific SQL features from the source dialect to the target dialect.
Exact Matching (EM): Measures the percentage of predictions where the generated SQL statement exactly matches the ground truth statement.
Execution Accuracy (EX): Measures the proportion of cases where the execution results of the predicted and ground-truth SQL statements are identical.
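To make the EM and EX definitions concrete, here is a minimal sketch of how these two metrics can be computed. This is an illustrative implementation, not the benchmark's actual scorer: the whitespace/case normalization in `exact_match` and the order-insensitive row comparison in `execution_match` are assumptions, and the in-memory SQLite database is only a stand-in for the benchmark's target dialect.

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    # EM (assumed normalization): compare SQL strings after collapsing
    # whitespace, lowercasing, and dropping a trailing semicolon.
    norm = lambda s: " ".join(s.strip().rstrip(";").split()).lower()
    return norm(pred) == norm(gold)

def execution_match(pred: str, gold: str, conn: sqlite3.Connection) -> bool:
    # EX: the prediction counts as correct if executing it yields the
    # same multiset of rows as the ground-truth query (order ignored).
    try:
        p = sorted(map(tuple, conn.execute(pred).fetchall()))
        g = sorted(map(tuple, conn.execute(gold).fetchall()))
        return p == g
    except sqlite3.Error:
        # A query that fails to execute scores 0 on EX.
        return False

# Toy example on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])

print(exact_match("SELECT x FROM t", "select x from t;"))  # True
print(execution_match("SELECT x FROM t ORDER BY x DESC",
                      "SELECT x FROM t", conn))            # True
```

A corpus-level EM or EX score is then simply the fraction of benchmark examples for which the corresponding predicate returns `True`.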
🤗 Acknowledgement
Thanks to EvalPlus for sharing the leaderboard template. In addition to JavaBench leaderboards, it is recommended to comprehensively assess LLM coding ability through a diverse set of benchmarks and leaderboards, such as: