A Benchmark of SQL Translation for Evaluating Large Language Models
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-3.5-turbo | 0.43 | 0.12 | 0.64 |
| 2 | deepseek-coder-6.7b-instruct | 0.25 | 0.06 | 0.39 |
| 3 | codellama-7b-instruct | 0.24 | 0.06 | 0.23 |
| 4 | sqlcoder-7b | 0.13 | 0.01 | 0.16 |
| 5 | deepseek-r1-8b-llama-distill-q8_0 | 0.24 | 0.01 | 0.15 |
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-3.5-turbo | 0.57 | 0.09 | 0.56 |
| 2 | deepseek-coder-6.7b-instruct | 0.27 | 0.07 | 0.26 |
| 3 | deepseek-r1-8b-llama-distill-q8_0 | 0.29 | 0.09 | 0.17 |
| 4 | codellama-7b-instruct | 0.21 | 0.04 | 0.14 |
| 5 | sqlcoder-7b | 0.16 | 0.01 | 0.08 |
Evaluation Metrics
- Dialect Matching (DM): Measures how accurately the model translates dialect-specific SQL features from the source dialect to the target dialect.
- Exact Matching (EM): Measures the percentage of predictions where the generated SQL statement exactly matches the ground-truth statement.
- Execution Accuracy (EX): Measures the proportion of cases where the execution results of the predicted and ground-truth SQL statements are identical (see the scoring sketch below).
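
To make EM and EX concrete, here is a minimal scoring sketch for a single predicted/ground-truth pair. It is not the benchmark's official harness: the SQLite backend, the whitespace/lowercase normalization used for EM, and the order-insensitive row comparison used for EX are illustrative assumptions and may differ from the actual evaluation setup.

```python
"""Hedged sketch: per-example EM and EX scoring (assumptions noted above)."""
import sqlite3


def exact_match(pred: str, gold: str) -> bool:
    # EM: compare SQL strings after trivial normalization (whitespace, case,
    # trailing semicolon). The real benchmark's normalization may differ.
    def norm(sql: str) -> str:
        return " ".join(sql.strip().rstrip(";").split()).lower()
    return norm(pred) == norm(gold)


def execution_match(pred: str, gold: str, conn: sqlite3.Connection) -> bool:
    # EX: execute both statements and compare the returned result sets.
    # Rows are compared as multisets (order-insensitive) in this sketch.
    try:
        pred_rows = conn.execute(pred).fetchall()
        gold_rows = conn.execute(gold).fetchall()
    except sqlite3.Error:
        return False  # a prediction that fails to execute scores 0
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))


if __name__ == "__main__":
    # Tiny in-memory database purely for demonstration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])

    gold = "SELECT name FROM t WHERE id = 1"
    pred = "select name from t where id = 1"
    print(exact_match(pred, gold))            # True after normalization
    print(execution_match(pred, gold, conn))  # True: identical result sets
```

The table scores are the averages of these per-example checks over the whole test set; DM additionally requires checking dialect-specific constructs in the translated statement, which is not shown here.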
🤗 Acknowledgement
Thanks to EvalPlus for sharing the leaderboard template. In addition to the leaderboards above, it is recommended to assess LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: