Software, Vol. 4, Pages 17: Investigating Reproducibility Challenges in LLM Bugfixing on the HumanEvalFix Benchmark
Software doi: 10.3390/software4030017
Authors: Balázs Szalontai, Balázs Márton, Balázs Pintér, Tibor Gregorics
Benchmark results for large language models often show inconsistencies across studies. This paper investigates the challenges of reproducing such results for automatic bugfixing with LLMs on the HumanEvalFix benchmark. To determine the cause of the differing results in the literature, we attempted to reproduce a subset of them by evaluating 12 models from the DeepSeekCoder, CodeGemma, CodeLlama, and WizardCoder families, across different sizes and tunings. A total of 35 unique results were reported for these models across studies, of which we successfully reproduced 12. We identified several factors that influenced the results. Base models can be confused with their instruction-tuned variants, which makes their reported results appear better than expected. An incorrect prompt template, an insufficient generation length, or 4-bit quantization can each decrease benchmark performance. Using sampling instead of greedy decoding can increase the variance of the results, especially at higher temperature values. We found that floating-point precision and 8-bit quantization have less influence on benchmark results.
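The abstract does not specify the authors' evaluation harness, so the following is only a hedged illustration of how the factors it names (prompt template, generation length, quantization, precision, and greedy versus sampled decoding) are typically configured when evaluating such models with the Hugging Face transformers library. The model name, prompt, and all parameter values below are assumptions for illustration, not the paper's setup.

```python
# Illustrative sketch only: evaluation knobs that can shift HumanEvalFix-style scores.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantization choice: the paper reports that 4-bit quantization can lower scores,
# while 8-bit quantization and precision changes matter less.
quant_config = BitsAndBytesConfig(load_in_4bit=True)  # or load_in_8bit=True
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,        # floating-point precision setting
    quantization_config=quant_config,
    device_map="auto",
)

# Prompt template: instruct-tuned models expect an instruction-style prompt;
# using the wrong template (or the base model by mistake) changes results.
prompt = "Fix the bug in the following function:\n..."  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: deterministic, low run-to-run variance.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=512)

# Sampling: variance grows with temperature, which can change reported scores.
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, top_p=0.95,
                         max_new_tokens=512)

print(tokenizer.decode(greedy[0], skip_special_tokens=True))
```

In this sketch, `max_new_tokens` stands in for the generation-length setting the abstract mentions; too small a value truncates the model's fix and lowers the benchmark score.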