☕︎ DLBench Leaderboard ☕️

A Benchmark of SQL Translation for Evaluating Large Language Models

#  Model                              DM    EM    EX
1  gpt-3.5-turbo                      0.43  0.12  0.64
2  deepseek-coder-6.7b-instruct       0.25  0.06  0.39
3  codellama-7b-instruct              0.24  0.06  0.23
4  sqlcoder-7b                        0.13  0.01  0.16
5  deepseek-r1-8b-llama-distill-q8_0  0.24  0.01  0.15

#  Model                              DM    EM    EX
1  gpt-3.5-turbo                      0.57  0.09  0.56
2  deepseek-coder-6.7b-instruct       0.27  0.07  0.26
3  deepseek-r1-8b-llama-distill-q8_0  0.29  0.09  0.17
4  codellama-7b-instruct              0.21  0.04  0.14
5  sqlcoder-7b                        0.16  0.01  0.08

Evaluation Metrics

🤗 Acknowledgement

Thanks to EvalPlus for sharing the leaderboard template. In addition to the DLBench leaderboard, we recommend building a comprehensive picture of LLM coding ability through a diverse set of benchmarks and leaderboards, such as: