A Benchmark of SQL Translation for Evaluating Large Language Models
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.34 | 0.10 | 0.73 |
| 2 | gemini-2.5-flash | 0.40 | 0.10 | 0.71 |
| 3 | gpt-3.5-turbo | 0.43 | 0.12 | 0.64 |
| 4 | deepseek-coder-6.7b-instruct | 0.25 | 0.06 | 0.39 |
| 5 | codellama-7b-instruct | 0.24 | 0.06 | 0.23 |
| 6 | sqlcoder-7b | 0.13 | 0.01 | 0.16 |
| 7 | deepseek-r1-8b-llama-distill-q8_0 | 0.24 | 0.01 | 0.15 |
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.70 | 0.45 | 0.56 |
| 2 | gpt-3.5-turbo | 0.57 | 0.09 | 0.56 |
| 3 | gemini-2.5-flash | 0.59 | 0.46 | 0.53 |
| 4 | deepseek-coder-6.7b-instruct | 0.27 | 0.07 | 0.26 |
| 5 | deepseek-r1-8b-llama-distill-q8_0 | 0.29 | 0.09 | 0.17 |
| 6 | codellama-7b-instruct | 0.21 | 0.04 | 0.14 |
| 7 | sqlcoder-7b | 0.16 | 0.01 | 0.08 |
Evaluation Metrics
Dialect Matching (DM): Measures how accurately the model translates dialect-specific SQL features from the source dialect to the target dialect.
Exact Matching (EM): Measures the percentage of predictions where the generated SQL statement exactly matches the ground truth statement.
Execution Accuracy (EX): Measures the proportion of cases where the execution results of the predicted and ground-truth SQL statements are identical.
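To make the EM and EX definitions concrete, here is a minimal sketch of how these two metrics can be computed. This is an illustrative implementation, not the benchmark's actual scorer: the whitespace/case normalization in `exact_match` and the order-insensitive row comparison in `execution_match` are assumptions, and the in-memory SQLite database is only a stand-in for the benchmark's target dialect.

```python
import sqlite3

def exact_match(pred: str, gold: str) -> bool:
    # EM (assumed normalization): compare SQL strings after collapsing
    # whitespace, lowercasing, and dropping a trailing semicolon.
    norm = lambda s: " ".join(s.strip().rstrip(";").split()).lower()
    return norm(pred) == norm(gold)

def execution_match(pred: str, gold: str, conn: sqlite3.Connection) -> bool:
    # EX: the prediction counts as correct if executing it yields the
    # same multiset of rows as the ground-truth query (order ignored).
    try:
        p = sorted(map(tuple, conn.execute(pred).fetchall()))
        g = sorted(map(tuple, conn.execute(gold).fetchall()))
        return p == g
    except sqlite3.Error:
        # A query that fails to execute scores 0 on EX.
        return False

# Toy example on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])

print(exact_match("SELECT x FROM t", "select x from t;"))  # True
print(execution_match("SELECT x FROM t ORDER BY x DESC",
                      "SELECT x FROM t", conn))            # True
```

A corpus-level EM or EX score is then simply the fraction of benchmark examples for which the corresponding predicate returns `True`.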
🤗 Acknowledgement
Thanks to EvalPlus for sharing the leaderboard template. In addition to JavaBench leaderboards, it is recommended to comprehensively assess LLM coding ability through a diverse set of benchmarks and leaderboards, such as: