☕️ DLBench Leaderboard ☕️

A Benchmark of SQL Translation for Evaluating Large Language Models

| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.34 | 0.10 | 0.73 |
| 2 | gemini-2.5-flash | 0.40 | 0.10 | 0.71 |
| 3 | gpt-3.5-turbo | 0.43 | 0.12 | 0.64 |
| 4 | deepseek-coder-6.7b-instruct | 0.25 | 0.06 | 0.39 |
| 5 | codellama-7b-instruct | 0.24 | 0.06 | 0.23 |
| 6 | sqlcoder-7b | 0.13 | 0.01 | 0.16 |
| 7 | deepseek-r1-8b-llama-distill-q8_0 | 0.24 | 0.01 | 0.15 |
| # | Model | DM | EM | EX |
|---|-------|------|------|------|
| 1 | gpt-4o | 0.70 | 0.45 | 0.56 |
| 2 | gpt-3.5-turbo | 0.57 | 0.09 | 0.56 |
| 3 | gemini-2.5-flash | 0.59 | 0.46 | 0.53 |
| 4 | deepseek-coder-6.7b-instruct | 0.27 | 0.07 | 0.26 |
| 5 | deepseek-r1-8b-llama-distill-q8_0 | 0.29 | 0.09 | 0.17 |
| 6 | codellama-7b-instruct | 0.21 | 0.04 | 0.14 |
| 7 | sqlcoder-7b | 0.16 | 0.01 | 0.08 |

Evaluation Metrics
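
In text-to-SQL and SQL-translation benchmarks, EM usually denotes exact match against the reference query, and EX denotes execution accuracy (the predicted and reference queries return the same result set). Assuming DLBench follows these common conventions (this is an assumption; the benchmark's exact definitions and the meaning of DM may differ), a minimal sketch of the two metrics could look like:

```python
# Hedged sketch: common definitions of exact-match (EM) and execution
# accuracy (EX) in SQL benchmarks. DLBench's actual metrics may differ.
import sqlite3


def exact_match(pred: str, gold: str) -> bool:
    """Naive EM: compare whitespace-normalized, case-folded SQL strings."""
    norm = lambda s: " ".join(s.strip().rstrip(";").lower().split())
    return norm(pred) == norm(gold)


def execution_match(pred: str, gold: str, conn: sqlite3.Connection) -> bool:
    """EX: queries are equivalent if they yield the same (unordered) rows."""
    try:
        pred_rows = sorted(conn.execute(pred).fetchall())
        gold_rows = sorted(conn.execute(gold).fetchall())
    except sqlite3.Error:
        # A prediction that fails to execute scores 0 on EX.
        return False
    return pred_rows == gold_rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
    # Same query up to formatting -> EM holds.
    print(exact_match("SELECT id FROM t", "select id  from t;"))  # True
    # Different SQL text, same result set -> EX holds, EM does not.
    print(execution_match("SELECT name FROM t WHERE id = 1",
                          "SELECT name FROM t WHERE id < 2", conn))  # True
```

Aggregated over a benchmark, each metric is typically reported as the fraction of test cases on which the check passes, which matches the 0–1 scores in the tables above.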

🤗 Acknowledgement

Thanks to EvalPlus for sharing the leaderboard template. Beyond the DLBench leaderboards, it is recommended to assess LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: