Investigating Code Hallucinations in LLMs via Execution-based Verification (2024)

Yuchen Tian¹*, Weixiang Yan²*, Qian Yang³,⁴, Xuandong Zhao⁵, Qian Chen⁶, Wen Wang⁶, Ziyang Luo⁷, Lei Ma¹, Dawn Song⁵
¹The University of Tokyo  ²UC Santa Barbara  ³Mila-Québec AI Institute  ⁴Université de Montréal
⁵UC Berkeley  ⁶Alibaba Group  ⁷Hong Kong Baptist University
yuchentovo@gmail.com, weixiangyan@ucsb.edu, qian.yang@mila.quebec,
xuandongzhao@berkeley.edu, {tanqing.cq,w.wang}@alibaba-inc.com,
cszyluo@comp.hkbu.edu.hk, ma.lei@acm.org, dawnsong@cs.berkeley.edu
*Equal contribution. Corresponding to: weixiangyan@ucsb.edu

Abstract

Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community’s understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucinations based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.

CodeHalu: Investigating Code Hallucinations in LLMs via
Execution-based Verification




1 Introduction

[Figure 1: The four main categories and eight subcategories of code hallucinations.]

Deep neural networks often generate erroneous information that contradicts the original content, cannot be verified, or conflicts with real-world knowledge. This phenomenon, commonly known as model hallucination, attracts widespread attention in the fields of natural language processing and multimodal learning (Ji et al., 2023; Zhang et al., 2023; Liu et al., 2024), with the community actively exploring methods to mitigate hallucinations (Peng et al., 2023; Elaraby et al., 2023; Liu et al., 2023). However, the issue of model hallucination in the code generation domain remains unexplored.

Conducting a thorough and dedicated study on code hallucinations is crucial for improving the quality of code generated by LLMs. Firstly, the purpose of code is to solve problems, and its value is realized only when the code executes successfully and passes tests (Chen et al., 2021; Austin et al., 2021; Yan et al., 2023). This necessitates that the generated code not only maintain strict logic and precision but also undergo execution verification to confirm its correctness. Therefore, the practical use and verification of code differ significantly from natural language (NL) texts, meaning we cannot directly apply the definitions and methods used for NL hallucinations to code. Secondly, code snippets containing hallucinations may trigger runtime errors or exhibit functional defects, which hinder the reliable deployment of LLMs in automated software development scenarios. Lastly, by exploring and verifying code hallucinations in a targeted manner, we can effectively uncover their causes and contribute to improving the architecture and training methods of LLMs.

To fill this gap, we define the concept of code hallucination in LLMs based on the unique purpose and function of code. Code hallucinations refer to the phenomenon where code generated by LLMs is syntactically correct or even semantically plausible but ultimately cannot execute as expected or fails to meet specified requirements. (We test 16 LLMs using 105,958 code samples. The experimental results demonstrate that only 9 models occasionally exhibit syntactic errors in the generated code, with an exceptionally low average error rate of 0.0020. These findings support our initial hypothesis that the code generated by LLMs is generally syntactically correct and even semantically plausible or appropriate. Detailed statistical data and experimental results are shown in Appendix A.) This phenomenon typically arises from various factors, such as errors or outdated information in the training data, an inadequate grasp of the syntax rules and programming paradigms of the programming languages, and limitations in the logical processing capabilities of the models. In contrast to previous methods that passively explore hallucinations in NLP through a Q&A framework or by prompting LLMs to generate hallucinated answers (Lin et al., 2021; Cheng et al., 2023), we employ an active strategy to detect hallucinations during the code generation process of LLMs. This approach is crucial because the ultimate goal of generated code is to execute correctly and fulfill specific tasks.

To detect and quantify hallucinations in LLMs during code generation, we develop a dynamic detection algorithm named CodeHalu. This algorithm employs a statistical induction method based on execution validation to identify specific patterns that frequently occur in code generated by multiple LLMs, such as error types, syntax interruptions, or unexpected execution results. When a pattern consistently appears across multiple LLMs, it is recognized as a common code hallucination. Based on the CodeHalu algorithm (Algorithm 1), we employ an execution-based validation approach for hallucination detection, combined with a two-stage heuristic identification method. By conducting statistical quantification on 17 mainstream LLMs, we categorize code hallucinations into four major categories: Mapping, Naming, Resource, and Logic Hallucinations. These categories are further divided into eight subcategories, as illustrated in Figure 1. We analyze the cross-task occurrence rates of the eight categories of code hallucinations across the 17 LLMs. The low average rate of 2.04% confirms the independence and validity of our classification.

To effectively measure and compare code hallucinations across different LLMs, we introduce an evaluation benchmark named CodeHaluEval, which is based on the incidence rate of hallucinations. It follows a structured Validation-Identification-Construction process, as shown in Figure 4, to detect and evaluate code hallucinations in LLMs. The benchmark is closely tied to real-world programming scenarios and checks that generated code correctly achieves the expected functionality. CodeHaluEval covers the eight types of code hallucinations illustrated in Figure 1, spanning 699 distinct tasks and 8,883 samples. Additionally, we systematically evaluate 17 mainstream LLMs to reveal the distribution and behavior patterns of their code hallucinations. We also analyze the potential causes of various code hallucinations, providing detailed insights for further improving the code generation capabilities of LLMs. Our contributions can be summarized as follows:

  • Code Hallucination: We introduce the concept of code hallucination in LLMs and propose an execution-based verification method to define code hallucination, addressing a gap in the research on hallucination within the code generation domain.

  • CodeHalu Algorithm: We develop a dynamic detection algorithm, CodeHalu, to identify and quantify the types of hallucinations that occur in LLMs during code generation. We categorize code hallucinations into four main categories based on a two-stage heuristic approach, discussing their theoretical implications and potential causes.

  • CodeHaluEval Benchmark: We propose the CodeHaluEval benchmark to systematically evaluate 17 popular LLMs, revealing the distribution and patterns of code hallucinations across these models, and providing insights for developing more robust and reliable LLMs.

2 Related Work

Hallucination

In the field of NLP, hallucination is initially defined as the phenomenon where the text generated by a model is fluent and natural but either lacks substantive meaning or is inconsistent with the provided source content (Ji et al., 2023). Recently, Zhang et al. (2023) standardize the definition of NL hallucinations in LLMs into three categories: input-conflicting hallucinations, where the content generated by LLMs diverges from the user’s input; context-conflicting hallucinations, in which the generated content contradicts previously generated content; and fact-conflicting hallucinations, where the generated content conflicts with established world knowledge. These hallucinations are attributed to various factors, such as poor-quality data samples in the training dataset or the use of sampling algorithms with high uncertainty.

In the multimodal domain, Zhai et al. (2023) classify types of hallucinations in image-to-text scenarios, such as image captioning and visual question answering. They define three main types of hallucinations: object existence hallucinations, object attribute hallucinations, and object relationship hallucinations. In text-to-image scenarios, such as image generation, hallucinations refer to the creation of factually incorrect details by the image generation model in response to the given text input. Huang et al. (2024) introduce VHTest, which evaluates hallucinations across eight dimensions in images, including the existence, shape, color, orientation, OCR, size, position, and counting of visual objects. In text-to-video scenarios, such as video generation, Chu et al. (2024) define three types of hallucinations: prompt consistency hallucinations, static hallucinations, and dynamic hallucinations. Detailed definitions and descriptions of each type of hallucination are provided in Appendix A. Although the issue of hallucinations receives extensive attention in the NLP and multimodal domains, it remains unexplored in the code domain. Therefore, we propose CodeHalu to systematically define, identify, classify, and quantify code hallucinations in LLMs.

Existing Coding Benchmarks

In recent years, numerous studies focus on evaluating the capability of LLMs to handle various programming tasks. Among these, the HumanEval benchmark (Chen et al., 2021) includes 164 Python programming problems, each with an average of 6.7 unit tests. The MBPP benchmark (Austin et al., 2021) contains 974 Python programming tasks. Compared to these, the APPS benchmark (Hendrycks et al., 2021) presents more challenging programming questions, with each problem averaging 293.2 words in length. CodeScope (Yan et al., 2023) covers 43 programming languages and eight coding tasks to comprehensively evaluate LLMs in code understanding and generation. MMCode (Li et al., 2024) is designed to evaluate the programming capability of code models in multimodal scenarios. SWE-bench (Jimenez et al., 2023) evaluates the capability of LLMs to edit code repositories to solve problems with a level of complexity similar to that faced by human programmers in real-world programming situations. Overall, existing code benchmarks focus on evaluating the performance of LLMs on various programming tasks. However, there is still a lack of effective methods to detect and quantify potential hallucinations that may occur in code generation. Therefore, we propose CodeHaluEval to detect and quantify code hallucinations in LLMs.

3 Code Hallucination

As a tool, code aims to achieve specific objectives through correct execution. This inherent characteristic motivates our use of an execution-based verification method to explore and identify code hallucinations. In this section, we define the concept of code hallucination and distinguish it from code errors, clarifying the relationship and differences between these two phenomena.

Definition 1 (Code Hallucinations).

Code hallucinations refer to the code generated by large language models that is syntactically correct or even semantically plausible, but ultimately cannot execute as expected or fails to meet specified requirements.

Definition 2 (Code Errors).

Code errors refer to issues in a program that cause it to stop executing.

Remark 3 (Code Hallucinations vs. Code Errors).

In multiple domains, existing work (Ji et al., 2023; Zhang et al., 2023; Huang et al., 2024; Zhai et al., 2023; Chu et al., 2024) often equates errors with hallucinations, or considers errors as a specific subset of hallucinations. We follow this perspective and regard code errors as a specific subset of code hallucinations. In other words, errors manifest as a form of hallucination, but not all hallucinations can be adequately expressed through errors. Figure 2 illustrates the distinction between code errors and code hallucinations. The code on the left exhibits a typical code error due to the use of an undefined variable “N”, resulting in a NameError. On the right, the code repeatedly calls the same function due to a logical collapse during generation, eventually exceeding the maximum token limit and leading to a SyntaxError. However, the underlying issue is a latent logical hallucination, rather than the observed syntactic error.
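The distinction can be reproduced with a minimal, hypothetical pair of snippets in the spirit of Figure 2 (they are illustrative and not drawn from our benchmark): the first fails with a NameError because it references an undefined variable, while the second only surfaces a SyntaxError after truncation, even though the root cause is a logic breakdown.

```python
# Hypothetical snippets illustrating the distinction drawn in Figure 2
# (illustrative only, not samples from CodeHaluEval).

error_snippet = """
def count_items(nums):
    total = 0
    for i in range(N):   # `N` is never defined -> NameError at runtime
        total += nums[i]
    return total

print(count_items([1, 2, 3]))
"""

# Logic breakdown: generation collapsed into repeated calls and was cut off at
# the token limit, so the observed symptom is a SyntaxError, but the underlying
# issue is a latent logical hallucination.
breakdown_snippet = """
def solve(x):
    return helper(helper(helper(helper(helper(helper(x
"""

for name, src in [("code error", error_snippet), ("logic breakdown", breakdown_snippet)]:
    try:
        exec(compile(src, "<generated>", "exec"))
    except Exception as exc:
        print(f"{name}: {type(exc).__name__}: {exc}")
```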

Overall, although there is some slight overlap between code hallucinations and code errors, their meanings, research objects, and scopes differ significantly. Code hallucinations focus on why the model produces hallucinations, while code errors focus on what grammatical rules the code violates. Code errors form a proper subset of code hallucinations, while code hallucinations encompass a broader range of potential logical and functional issues, representing a finer-grained and more comprehensive evaluation of the overall quality and functionality of the code.

[Figure 2: A code error (NameError from an undefined variable) versus a code hallucination (a logic breakdown that surfaces as a SyntaxError).]

4 CodeHalu Algorithm

Algorithm 1: CodeHalu

1:  Input: code generation dataset α, language models π
2:  Output: HaluTypes ξ
3:  Let ξ ← empty list
4:  for α_i, where i ∈ {1, …, k} do
5:      for π_j, where j ∈ {1, …, m} do
6:          GC_j^{α_i} ← π_j(GI_j, Q)
7:          if GC_j^{α_i} is stuttering, infinite enumeration, or gibberish then
8:              ξ ← ξ ∪ State(GC_j^{α_i})
9:          else
10:             for t_n, where n ∈ {1, …, N} do
11:                 if Execute(GC_j^{α_i}(t_n)) terminates successfully then
12:                     ER_j^{α_i}(t_n) ← Execute(GC_j^{α_i}(t_n))
13:                     if ER_j^{α_i}(t_n) ≠ op_{t_n} then
14:                         ξ ← ξ ∪ State(GC_j^{α_i})
15:                 else
16:                     ξ ← ξ ∪ State(GC_j^{α_i})
17: Aggregate and count frequencies of unique State(GC_j^{α_i}) in ξ

In this section, we introduce a dynamic detection algorithm called CodeHalu, which detects and quantifies hallucinations in LLMs during code generation. CodeHalu operates on the assumption ASS: if a specific pattern frequently appears in the code generated by multiple LLMs, it is considered a common code hallucination. These patterns include error types, syntax interruptions, logical collapse, or unexpected execution results.

Consider a dataset α containing k samples, where each sample α_i consists of a problem description Q and a series of test cases t_1, t_2, …, t_n. Each test case t_n includes an input ip and the corresponding expected output op. Notably, following previous work (Li et al., 2023b; Yan et al., 2023), we integrate resource (time and memory) constraints into the code generation instructions. As shown in Algorithm 1, we use a model π_j to generate a code solution GC_j^{α_i} for each sample α_i based on the code generation instruction GI_j and the problem description Q. If GC_j^{α_i} exhibits any degenerate state, such as stuttering, infinite loops, or gibberish, we include it in ξ.

To test the potential hallucinations of GC_j^{α_i} at a fine-grained level, we execute all test cases of sample α_i one by one to verify whether the code runs successfully and meets the expected functionality. We record the actual execution result ER_j^{α_i}(t_n) of the code under each test case t_n and test each sample α_i across more than 15 models π to obtain statistically grounded inductive results. If the code execution fails or does not meet the expected results, we record it in ξ.

Finally, we merge the identical states State(GC_j^{α_i}) detected by CodeHalu and calculate their occurrence frequencies. According to assumption ASS, ξ can be represented as [(ξ_1, P_1), …, (ξ_o, P_o)], where ξ_o denotes the o-th type of code hallucination and P_o indicates its frequency. Code hallucinations are thus characterized from four perspectives: errors, syntax, logic, and execution results. Additionally, CodeHalu is language-agnostic and can be adapted to various programming scenarios depending on the programming language.
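For concreteness, a compact sketch of this execution-based detection loop is shown below. The helper functions, data layout, and state labels are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the CodeHalu detection loop (Algorithm 1); helper
# names and data structures are assumptions, not the released implementation.
from collections import Counter
import subprocess

def is_degenerate(code: str) -> bool:
    """Cheap proxy for stuttering, infinite enumeration, or gibberish."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return len(lines) > 20 and len(set(lines)) < len(lines) // 10

def execute(code: str, test_input: str, timeout: float = 5.0):
    """Run a candidate program on one test case; return (status, stdout)."""
    try:
        proc = subprocess.run(
            ["python", "-c", code], input=test_input, text=True,
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "Timeout", ""
    if proc.returncode != 0:
        last = proc.stderr.strip().splitlines()[-1] if proc.stderr.strip() else "RuntimeError"
        return last.split(":")[0], ""          # e.g. "NameError", "RecursionError"
    return "OK", proc.stdout.strip()

def codehalu(dataset, models, generate_code):
    """Collect hallucination states across samples and models, then count them."""
    states = []
    for sample in dataset:                     # sample: {"question", "test_cases"}
        for model in models:
            code = generate_code(model, sample["question"])
            if is_degenerate(code):
                states.append("logic_breakdown")
                continue
            for case in sample["test_cases"]:  # case: {"input", "expected_output"}
                status, output = execute(code, case["input"])
                if status != "OK":
                    states.append(status)      # runtime error type observed
                elif output != case["expected_output"].strip():
                    states.append("wrong_output")
    return Counter(states)                     # frequency of each candidate state
```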

5 Code Hallucinations Classification

In this section, we analyze the hallucination states detected by the CodeHalu algorithm, classify and define four main types of hallucinations, and discuss the rationale behind the classification method.

According to the TIOBE Index (https://www.tiobe.com/tiobe-index/), a metric of programming language popularity, we primarily investigate code hallucinations in Python. By applying the CodeHalu algorithm to the complex APPS dataset (Hendrycks et al., 2021) and 17 widely-used LLMs, we identify and validate 18 types of hallucination states that violate human expectations during code generation, including inconsistent code context, ambiguous logic and data flow, and conflicting intentions, among others. Using the two-stage heuristic classification method introduced in Remark 8, we categorize code hallucinations into four main types based on the nature and origin of these phenomena: mapping hallucinations, naming hallucinations, resource hallucinations, and logic hallucinations, as illustrated in Figure 1.

Definition 4 (Mapping Hallucinations).

Mapping Hallucinations refer to the ambiguity and confusion that occur in LLMs’ perception and mapping of data types, values, and structures during data operations. This phenomenon is further divided into two sub-categories: data compliance hallucinations and structure access hallucinations.

Data compliance hallucinations occur when LLMs have a vague understanding of the data types and parameter values of the objects being manipulated, resulting in generated code that attempts to perform type-mismatched or rule-violating operations.

Structure access hallucinations occur when LLMs misinterpret the data structures of the objects being manipulated, leading to generated code that attempts to access non-existent array indices or dictionary keys.
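For illustration, two minimal, hypothetical Python snippets of the kind this category covers are shown below (they are not samples from CodeHaluEval):

```python
# Minimal, hypothetical examples of mapping hallucinations (illustrative only).

# Data compliance hallucination: an operation is applied to a value whose type
# does not support it.
try:
    n = "5"              # the model treats the string as if it were an int
    print(n + 1)         # TypeError: can only concatenate str (not "int") to str
except TypeError as exc:
    print("data compliance:", exc)

# Structure access hallucination: the code assumes a key or index that the
# data structure does not contain.
try:
    config = {"name": "task1"}
    print(config["timeout"])   # KeyError: 'timeout'
except KeyError as exc:
    print("structure access: missing key", exc)
```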

Definition 5 (Naming Hallucinations).

Naming Hallucinations refer to the memory-related issues and factual inaccuracies exhibited by LLMs when handling the naming, scope, and existence of variables, attributes, and modules. This phenomenon is further divided into two subcategories: identity hallucinations and external source hallucinations.

Identity hallucinations occur when LLMs possess biased memories or lack sufficient understanding of the context, leading to generated code that references undefined variables, accesses non-existent object properties, or uses unassigned variables in local scopes.

External source hallucinations occur when LLMs exhibit significant memory-related issues or obvious conflicts with facts concerning external knowledge sources, resulting in generated code that attempts to import non-existent modules or fails to correctly load modules from other paths.
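Two minimal, hypothetical snippets of the kind this category covers (illustrative only; the package name below is fabricated on purpose):

```python
# Minimal, hypothetical examples of naming hallucinations (illustrative only).

# Identity hallucination: the generated code references a variable that was
# never defined in the current scope.
try:
    print(total_count)       # NameError: name 'total_count' is not defined
except NameError as exc:
    print("identity:", exc)

# External source hallucination: the generated code imports a module that does
# not exist (a fabricated library name).
try:
    import fastmathlib       # hypothetical, non-existent package
except ImportError as exc:
    print("external source:", exc)
```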

Definition 6 (Resource Hallucinations).

Resource Hallucinations occur when LLMs lack an adequate perception and prediction of resource consumption and control flow of the generated code during execution. This phenomenon is further divided into physical constraint hallucinations and computational boundary hallucinations.

Physical constraint hallucinations arise when LLMs underestimate resource consumption during data processing operations, causing code failure due to exceeding memory capacity, stack depth, or other physical constraints.

Computational boundary hallucinations occur when LLMs blur recognition of numerical calculation limits and iteration endpoints during data processing operations, causing code failure due to numerical overflow or improper iteration control.
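Two minimal, hypothetical snippets of the kind this category covers (illustrative only):

```python
# Minimal, hypothetical examples of resource hallucinations (illustrative only).
import math

# Physical constraint hallucination: unbounded recursion exceeds the
# interpreter's stack depth.
def factorial(n):
    return n * factorial(n - 1)      # missing base case -> RecursionError

try:
    factorial(10)
except RecursionError as exc:
    print("physical constraint:", exc)

# Computational boundary hallucination: a numerical calculation exceeds the
# representable range.
try:
    print(math.exp(1000))            # OverflowError: math range error
except OverflowError as exc:
    print("computational boundary:", exc)
```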

Definition 7 (Logic Hallucinations).

Logic Hallucinations refer to discrepancies between the expected results and the actual outcomes after executing the code generated by LLMs, or to generated outputs with low semantic density or even complete chaos. This phenomenon is further divided into logic deviation and logic breakdown.

Logic deviation occurs when LLMs generate code that lacks sufficient logical consideration or contradicts the intended instructions. While this hallucination may not cause errors during execution, logical deviations or confusion result in outcomes that fail to meet the expected results.

Logic breakdown occurs when LLMs struggle to interpret or maintain a continuous understanding of context during code generation. This indicates that the models may lose direction while generating code, making it difficult to maintain strict consistency of contextual information.
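Two minimal, hypothetical snippets of the kind this category covers (illustrative only):

```python
# Minimal, hypothetical examples of logic hallucinations (illustrative only).

# Logic deviation: the code runs without error but computes the wrong result,
# here an off-by-one error when summing the first n positive integers.
def sum_first_n(n):
    return sum(range(n))             # should be range(1, n + 1)

print(sum_first_n(5), "expected", 15)   # prints 10 instead of 15

# Logic breakdown: generation degenerates into repetition with little semantic
# content, e.g. the same statement emitted over and over until the token limit.
result = 0
result = result + 1
result = result + 1
result = result + 1
result = result + 1   # ...the pattern keeps repeating
print("logic breakdown example result:", result)
```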

Remark 8 (Discussion of Rationality).

To ensure the rationality and effectiveness of our code hallucination classification method, we conduct in-depth analyses.

Firstly, we extensively reference classification methods for hallucinations in the fields of NLP and multimodal research (Zhang et al., 2023; Ji et al., 2023; Huang et al., 2024; Zhai et al., 2023; Chu et al., 2024), as well as methods for classifying code errors and vulnerabilities in software engineering (Pan et al., 2023; Wang et al., 2024; Huang et al., 2023). We adopt a two-stage heuristic classification strategy. Initially, team members independently review failed code cases and develop preliminary classification frameworks; then, we reach a consensus through collaborative discussions. This widely used approach ensures the adaptability and accuracy of our framework, enabling a systematic understanding of code hallucinations in LLMs.

Secondly, we analyze the cross-task occurrence rates of each model across eight categories of code hallucinations, detailed in Appendix Table 3. The results show that the average cross-task occurrence rate for these categories is only 2.04%, confirming the independence and rationality of our classification. For the Gemma-7B model, which exhibits the most severe hallucinations in Table 2, only 1.07% of task samples show cross-task hallucinations, as illustrated in Figure 3.

Lastly, we conduct an empirical investigation of our classification results and design a questionnaire to evaluate the rationality of our method, detailed in Appendix Figure 8. The survey receives 23 responses; after excluding seven respondents with fewer than three years of programming experience, we analyze 16 valid responses. The survey results indicate a rationality rating of 91.08% for our classification method, further supporting its validity.

[Figure 3: Cross-task hallucination occurrence for Gemma-7B.]
[Figure 4: The Validation-Identification-Construction process used to build CodeHaluEval.]

6 Cause Analysis of Code Hallucinations

In this section, we explore the potential causes of various hallucinations generated by LLMs, aiming to provide valuable insights for optimizing training data, training methods, model architecture, and alignment strategies.

Mapping hallucinations stem from the model’s misunderstanding of data types and structures. This phenomenon arises due to several factors: (1) The model generates code based on tokens, lacking insight into higher-level structures such as statements and functions (Yang et al., 2021); (2) When dealing with long-distance data dependencies, especially within complex code blocks, the model fails to continuously track the structure and state of variables, overly relying on local information while neglecting the importance of the overall context (Zhang et al., 2024); (3) The model does not explicitly perform type checking and structure matching during code generation, lacking static checking and error correction mechanisms.

Naming hallucinations reflect the limitations of models in tracking information and utilizing external knowledge. This issue arises from several factors: (1) Token-based feature representation makes it difficult to accurately model long-distance dependencies, leading to model misjudgments regarding variable scope, lifecycle, and visibility (Xu et al., 2020); (2) The code generation process lacks consistency checks for identifiers and does not perform global tracking of variable definitions and usage; (3) Knowledge of external libraries is not effectively and timely integrated into the model’s knowledge system, making it difficult for the model to accurately understand the names, functions, and invocation methods of libraries (Jesse et al., 2023).

Resource hallucinations highlight the model’s lack of deep understanding of code execution mechanisms and physical constraints. These issues arise from several factors: (1) The training data lacks information related to resource consumption and performance optimization, making it difficult for the model to learn about complexity analysis and resource evaluation; (2) As the model generates code based on probabilities, it lacks a module for calculating and estimating the resource consumption of the generated code, making it unable to simulate the real-world operating environment and resource limits; (3) During the model training process, the focus is on the correctness of the code’s functionality, often overlooking its complexity and resource constraints in actual execution environments.

Logic hallucinations reveal the model’s deficiencies in semantic understanding and reasoning about code. This issue arises due to several factors: (1) The model relies on pattern matching and statistical rules to generate code, lacking a fundamental understanding of symbolic systems and rigorous verification of program logic; (2) The training data is often not rigorously verified for accuracy and may contain code with very similar functions. Since models sometimes imitate and memorize previous examples (Yan and Li, 2022), this can result in the model directly replicating similar logic in the code or even learning incorrect logic from the outset; (3) When the model generates code, repetition at the line level has a self-reinforcing effect, causing the model to become increasingly confident in the code it generates, which may lead to a stuttering phenomenon (Xu et al., 2022).

7 The CodeHaluEval Benchmark

We construct the CodeHaluEval benchmark, a unified evaluation method for comparing the types and frequencies of hallucinations in code generation across different LLMs. We build CodeHaluEval on the APPS test set, following a structured Validation-Identification-Construction process, as shown in Figure 4.

Table 1: Statistics of the CodeHaluEval benchmark.

Category  | #Tasks | #Samples | Sub-Category           | #Tasks | #Samples
Mapping   | 262    | 2,288    | Data Compliance        | 110    | 941
          |        |          | Structure Access       | 152    | 1,347
Naming    | 157    | 1,853    | Identity               | 115    | 1,323
          |        |          | External Source        | 42     | 530
Resource  | 107    | 1,130    | Physical Constraint    | 47     | 491
          |        |          | Computational Boundary | 60     | 639
Logic     | 173    | 3,612    | Logic Deviation        | 119    | 2,443
          |        |          | Logic Breakdown        | 54     | 1,169

Table 2: Hallucination rates (%, lower is better) of 17 LLMs on CodeHaluEval. DC = Data Compliance, SA = Structure Access, ID = Identity, ES = External Source, PC = Physical Constraint, CB = Computational Boundary, LD = Logic Deviation, LB = Logic Breakdown.

Model               | Mapping: DC / SA / Avg. | Naming: ID / ES / Avg. | Resource: PC / CB / Avg. | Logic: LD / LB / Avg.  | Average
GPT-4               | 32.31 / 10.02 / 19.19   | 27.74 / 0.57 / 19.97   | 0.20 / 3.76 / 2.21       | 85.76 / 0.51 / 58.17   | 33.04
LLaMA-3-8B          | 46.87 / 23.46 / 33.09   | 12.09 / 0.00 / 8.63    | 15.48 / 12.99 / 14.07    | 78.39 / 0.00 / 53.02   | 33.67
DeepSeek Coder-6.7B | 24.23 / 25.61 / 25.04   | 15.80 / 0.00 / 11.28   | 17.52 / 17.21 / 17.35    | 99.06 / 0.17 / 67.05   | 38.28
GPT-3.5             | 20.19 / 22.05 / 21.28   | 30.54 / 0.00 / 21.80   | 18.53 / 6.42 / 11.68     | 99.88 / 0.00 / 67.55   | 38.98
Claude-3-haiku      | 38.68 / 29.25 / 33.13   | 9.07 / 0.75 / 6.69     | 37.07 / 20.81 / 27.88    | 100.00 / 0.17 / 67.69  | 41.00
ChatGLM-3-6B        | 36.13 / 44.91 / 41.30   | 50.87 / 0.00 / 36.32   | 24.85 / 2.35 / 12.12     | 88.99 / 0.00 / 60.19   | 44.23
Ernie-3.5           | 48.14 / 36.90 / 41.52   | 30.31 / 0.38 / 21.75   | 18.13 / 11.89 / 14.60    | 98.98 / 0.00 / 66.94   | 44.31
Qwen-turbo          | 49.63 / 48.33 / 48.86   | 29.48 / 2.08 / 21.64   | 7.94 / 2.82 / 5.04       | 98.08 / 0.17 / 66.39   | 44.74
MagicCoder-7B       | 50.27 / 26.58 / 36.32   | 17.69 / 0.00 / 12.63   | 21.18 / 28.33 / 25.22    | 100.00 / 16.25 / 72.90 | 44.84
Code LLaMA-7B       | 65.04 / 42.17 / 51.57   | 31.07 / 0.00 / 22.18   | 18.53 / 6.26 / 11.59     | 94.76 / 9.41 / 67.14   | 46.68
StarCoder-16B       | 48.14 / 38.83 / 42.66   | 60.70 / 9.25 / 45.98   | 28.92 / 11.11 / 18.85    | 95.09 / 0.77 / 64.56   | 49.23
LLaMA-2-7B          | 51.22 / 32.29 / 40.08   | 78.46 / 71.13 / 76.36  | 14.87 / 0.00 / 6.46      | 81.05 / 0.34 / 54.93   | 49.41
Gemini-1.0          | 34.11 / 53.53 / 45.54   | 45.88 / 0.00 / 32.76   | 24.44 / 16.12 / 19.73    | 98.65 / 10.35 / 70.07  | 49.57
Mistral-7B          | 45.48 / 36.53 / 40.21   | 59.18 / 15.85 / 46.79  | 27.49 / 10.80 / 18.05    | 99.35 / 0.17 / 67.25   | 49.76
WizardCoder-7B      | 26.57 / 31.40 / 29.41   | 31.29 / 0.00 / 22.34   | 33.20 / 9.39 / 19.73     | 93.90 / 72.37 / 86.93  | 50.10
CodeGeeX-2-6B       | 47.61 / 27.99 / 36.06   | 45.05 / 0.00 / 32.16   | 36.66 / 23.47 / 29.20    | 89.60 / 99.66 / 92.86  | 57.47
Gemma-7B            | 55.26 / 41.05 / 46.90   | 51.85 / 0.00 / 37.02   | 14.46 / 14.55 / 14.51    | 97.18 / 100.00 / 98.09 | 61.53

In the validation phase, we use the CodeHalu algorithm to identify multiple types of hallucinations HaluTypes ξ, represented as [(ξ_1, P_1), …, (ξ_i, P_i)]. In the identification phase, we annotate the 3k most common hallucinations and their frequencies in each sample α_i, represented as [(ξ_1, P_1), …, (ξ_{3k}, P_{3k})]. In the construction phase, we sort all samples in descending order based on the frequency P_i of each hallucination type ξ_i. If the hallucination frequency in a sample α_i exceeds the threshold k, we include this sample in the corresponding hallucination type set of the CodeHaluEval benchmark. When selecting the threshold k, we consider both the minimum number of samples required to detect code hallucination effects in the CodeHaluEval benchmark and the inference costs associated with evaluating various LLMs. The efficacy analysis and the selection process for the threshold k are detailed in Appendix A. Through this method, we establish the CodeHaluEval benchmark, with detailed statistics shown in Table 1.
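A minimal sketch of the construction phase is shown below, assuming a simple annotation layout in which each sample records how many models exhibit each hallucination type; the data layout and helper names are assumptions, and the threshold check follows the k = 5 rule discussed in Appendix A.

```python
# Illustrative sketch of the construction phase (assumed data layout, not the
# released pipeline): a sample joins a hallucination category when at least
# k = 5 evaluated models exhibit that hallucination on it.
from collections import defaultdict

def build_benchmark(annotations, k=5):
    """annotations: {sample_id: {hallucination_type: num_models_exhibiting_it}}"""
    benchmark = defaultdict(list)
    for sample_id, counts in annotations.items():
        for halu_type, freq in sorted(counts.items(), key=lambda kv: -kv[1]):
            if freq >= k:
                benchmark[halu_type].append(sample_id)
    return benchmark

# Toy example:
toy = {"task_001": {"wrong_output": 7, "NameError": 2},
       "task_002": {"TypeError": 5}}
print(dict(build_benchmark(toy)))
# -> {'wrong_output': ['task_001'], 'TypeError': ['task_002']}
```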

8 Experiments

Models. To comprehensively analyze the different hallucinations of various competitive LLMs on CodeHaluEval, we evaluate 12 general LLMs, including GPT-4 (OpenAI, 2023), GPT-3.5 (OpenAI, 2023), Gemini-Pro-1.0 (Gemini, 2023), Claude-3-haiku (Anthropic, 2024), LLaMA-2 & 3 (Touvron et al., 2023), Vicuna (Chiang et al., 2023), Qwen-turbo (Bai et al., 2023), ChatGLM3-6B (Du et al., 2021), Ernie-3.5 (Baidu, 2023), Mistral-7B (Jiang et al., 2023), and Gemma (Team et al., 2024). We also evaluate six coding LLMs, including Code LLaMA (Roziere et al., 2023), DeepSeek Coder (Guo et al., 2024), CodeGeeX-2 (Zheng et al., 2023), StarCoder-2 (Li et al., 2023a), MagicCoder-7B (Wei et al., 2023), and WizardCoder-7B (Luo et al., 2023). The experimental evaluation is conducted using API calls or 8 NVIDIA A6000 GPUs.

Metrics. Given the limited exploration of code hallucinations, no dedicated metrics currently exist for evaluating them in LLMs. To address this gap, we propose an evaluation metric called Hallucination Rate (HR). Specifically, HR is defined as the percentage of hallucination samples detected in the test set among all samples, with the formula HR = (1/N) Σ_{i=1}^{N} S(i, K), where S(i, K) is an indicator function: if the i-th sample satisfies the hallucination condition, then S(i, K) = 1; otherwise, S(i, K) = 0. Ideally, a lower HR indicates a lower likelihood of hallucinations during code generation by the LLM, thus demonstrating greater robustness and reliability. To our knowledge, HR is the first metric that accurately reflects the hallucination phenomenon in LLMs during code generation tasks through actual execution tests.
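The metric is straightforward to compute; a minimal implementation, reporting HR as a percentage as in Table 2, might look as follows.

```python
# Minimal implementation of the Hallucination Rate (HR) metric, reported as a
# percentage of samples whose generated code triggers the targeted hallucination.
def hallucination_rate(indicators):
    """indicators: list of 0/1 flags S(i, K), one per evaluated sample."""
    return 100.0 * sum(indicators) / len(indicators)

# e.g. 3 hallucinated samples out of 8 -> HR = 37.5 (%)
print(hallucination_rate([1, 0, 0, 1, 0, 1, 0, 0]))
```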

Result & Analysis

[Figure 5: Visualization of the experimental results across hallucination categories.]

The experimental results are presented in Table 2 and Figure 5, with case studies of various models across different hallucination categories detailed in Appendix A.

Mapping hallucination: GPT-4 and GPT-3.5 consistently identify and follow rules related to data types, values, and structures, demonstrating strong context sensitivity.

Naming hallucination: Claude-3 reliably remembers and references entity names from the context and external knowledge bases. In contrast, LLaMA-2 exhibits significant memory bias when processing external knowledge and occasionally fabricates information.

Resource hallucination: GPT-4, Qwen, and LLaMA-2 effectively account for actual resource constraints when generating code, showing an understanding of computational boundaries and limitations, which leads them to produce code with lower complexity.

Logical hallucination: Although all models face challenges in maintaining logical coherence, LLaMA-3 and GPT-4 perform relatively well in reducing repetition. Most models rarely generate code with stuttering or infinite loops, but such issues are more common in Gemma, CodeGeeX-2, and WizardCoder, indicating a tendency to lose semantic and logical consistency during code generation.

Overall, GPT-4 and LLaMA-3 perform well across all hallucination categories, displaying stability and robustness in various scenarios. Logical hallucinations remain the most prevalent issue across all models, while naming and resource hallucinations are relatively less common. The performance of different models varies significantly across hallucination types, likely due to differences in their training data, methods, and architectures. The average hallucination rate ranges from approximately 20% to 60%.

We view mitigating code hallucination as future work. Based on a detailed analysis of experimental results and generated cases, we provide insights into strategies for mitigating code hallucinations in LLMs. In terms of training data, improving the quality and increasing the diversity of data sources enhances the model’s generalization ability. In terms of training methods, employing alignment strategies based on compilation and execution verification, as well as setting multiple objectives during training, enables the model to better understand the data flow and control flow of code. In terms of model architecture, introducing a static code verification module provides real-time feedback on verification results, thereby enhancing the model’s robustness. Additionally, incorporating a code graph module allows the model to construct and utilize graph structure information when generating code, deepening its understanding of patterns and logical relationships in the generated code.

9 Conclusion

We introduce the concept of code hallucination and propose an execution-based verification method to classify code hallucinations. We develop the dynamic detection algorithm, CodeHalu, which categorizes code hallucinations into four main types and eight subtypes, providing a comprehensive understanding of the various challenges faced by LLMs in code generation. Additionally, we establish the CodeHaluEval benchmark and evaluate 17 widely-used LLMs, revealing significant differences in their hallucination patterns during code generation, and providing detailed insights for further improving the code generation capabilities of LLMs. Overall, we lay the theoretical foundation for understanding the hallucination phenomenon of LLMs in code generation, and provide a complete set of tools for detecting and evaluating code hallucinations.

10 Limitations

Python is our focus for exploring code hallucination, as it is the most widely used programming language according to the TIOBE Index. Furthermore, many existing studies, such as HumanEval and MBPP benchmarks, concentrate on Python. Thus, we do not extend our investigation to other languages.

CodeHalu focuses on ensuring the correctness of generated code to meet the needs of developers and users. In contrast, identifying and preventing security risks is a higher-level concern, effectively addressed through sandbox environments.

We focus on code hallucination specifically within the code generation task, excluding other programming tasks such as code translation and code repair. This is because code generation is currently the most widely studied task in the community. It is important to emphasize that our hallucination detection and evaluation methods can be easily adapted to other tasks, which we regard as future work.

References

  • Anthropic (2024) Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program synthesis with large language models. CoRR, abs/2108.07732.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Baidu (2023) Baidu. 2023. Introducing Ernie 3.5: Baidu's knowledge-enhanced foundation model takes a giant leap forward.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  • Cheng et al. (2023) Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in Chinese large language models. arXiv preprint arXiv:2310.03368.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
  • Chu et al. (2024) Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, and Kui Ren. 2024. Sora Detector: A unified hallucination detection for large text-to-video models. CoRR, abs/2405.04180.
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. CoRR, abs/2309.11495.
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360.
  • Elaraby et al. (2023) Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, and Shizhu Liu. 2023. Halo: Estimation and reduction of hallucinations in open-source weak large language models. CoRR, abs/2308.11764.
  • Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
  • Gemini (2023) Gemini. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.
  • Huang et al. (2023) Kai Huang, Xiangxin Meng, Jian Zhang, Yang Liu, Wenjie Wang, Shuhao Li, and Yuqing Zhang. 2023. An empirical study on fine-tuning large language models of code for automated program repair. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1162–1174.
  • Huang et al. (2024) Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual hallucinations of multi-modal large language models. CoRR, abs/2402.14683.
  • Jesse et al. (2023) K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan. 2023. Large language models and simple, stupid bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pages 563–575, Los Alamitos, CA, USA. IEEE Computer Society.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Jimenez et al. (2023) Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
  • Li et al. (2024) Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, and Jing Ma. 2024. MMCode: Evaluating multi-modal code large language models with visually rich programming problems. Preprint, arXiv:2404.09486.
  • Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161.
  • Li et al. (2023b) Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023b. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.
  • Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  • Liu et al. (2023) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2023. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565.
  • Liu et al. (2024) Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253.
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
  • Pan et al. (2023) Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. Understanding the effectiveness of large language models in code translation. arXiv preprint arXiv:2308.03109.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2024) Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where do large language models fail when generating code? arXiv preprint arXiv:2406.08731.
  • Wei et al. (2023) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120.
  • Xu et al. (2020) Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, and Jingyi Zhang. 2020. Learning source phrase representations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 386–396, Online. Association for Computational Linguistics.
  • Xu et al. (2022) Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095.
  • Yan and Li (2022) Weixiang Yan and Yuanchun Li. 2022. WhyGen: Explaining ML-powered code generation by referring to training examples. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 237–241.
  • Yan et al. (2023) Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, et al. 2023. CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation. arXiv preprint arXiv:2311.08588.
  • Yang et al. (2021) Chen Yang, Yan Liu, and Changqing Yin. 2021. Recent advances in intelligent source code generation: A survey on natural language based studies. Entropy, 23(9).
  • Zhai et al. (2023) Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. 2023. HallE-Switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779.
  • Zhang et al. (2024) Kechi Zhang, Ge Li, Huangzhao Zhang, and Zhi Jin. 2024. HiRoPE: Length extrapolation for code models. arXiv preprint arXiv:2403.19115.
  • Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
  • Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568.

Appendix A

SyntaxError Analysis

[Figure 6: SyntaxError analysis.]

Related Work

In the field of NLP, hallucination is initially defined as the phenomenon where the text generated by the model is fluent and natural but lacks actual meaning or is inconsistent with the provided source content (Ji et al., 2023). Recently, Zhang et al. (2023) standardize the definition of NL hallucinations in LLMs into three categories: input-conflicting hallucinations, where the content generated by LLMs diverges from the user’s input; context-conflicting hallucinations, in which the generated content contradicts previously generated content; and fact-conflicting hallucinations, where the generated content conflicts with established world knowledge. These hallucinations may be attributed to various factors, such as poor-quality data samples in the training dataset or the use of sampling algorithms with high uncertainty. To mitigate the impact of these hallucinations, recent studies primarily focus on the data preparation phase (Gardent et al., 2017), the training phase (Elaraby et al., 2023), and the inference phase (Dhuliawala et al., 2023) to alleviate the hallucination problem in LLMs.

In the multimodal domain, Zhai et al. (2023) classify types of hallucinations in image-to-text scenarios, such as image captioning and visual question answering. They define three main types of hallucinations: object existence hallucinations, where an object mentioned in the description does not exist in the corresponding image; object attribute hallucinations, where the description of the object's attributes does not match the actual attributes in the image; and object relationship hallucinations, where the relationships between objects are incorrectly described. In text-to-video scenarios, such as video generation, Chu et al. (2024) define three types of hallucinations: prompt consistency hallucinations, where the text description does not match the visual output; static hallucinations, which refer to errors in the spatial relationships, physical properties, and semantic consistency of objects and scenes within a single frame; and dynamic hallucinations, which indicate inconsistencies or abnormalities in the motion and behavior of objects or entities across frames.

CodeHaluEval Power Analysis

To ensure our study has sufficient statistical power to detect hallucinations commonly triggered by different LLMs during code generation tasks, we conduct a power analysis to determine the minimum sample size required. The power analysis is based on the following parameters, which are standard in social and behavioral science research:

  • Power Level: We set the power level at 0.8 to ensure adequate statistical strength for detecting actual effects.

  • Significance Level (Alpha): We set the significance level at 0.05 to balance the study’s sensitivity with the false positive rate.

  • Effect Size: We select a Cohen’s d value of 0.2, a widely accepted standard for small effect sizes, to ensure reliable detection of even relatively minor effects.

Using the statsmodels library, we determine that a minimum of 394 code task samples is required to ensure the statistical reliability of our research findings.
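
As a reference, the following minimal sketch shows how such a sample size can be computed with the statsmodels power module. It assumes an independent two-sample t-test design, which is a common default for a Cohen's d effect size; the exact test family used for the original calculation is not specified in the text, so this is an illustrative reproduction rather than the definitive procedure.

from math import ceil
from statsmodels.stats.power import TTestIndPower

# Solve for the number of samples needed to detect a small effect
# (Cohen's d = 0.2) at alpha = 0.05 with power = 0.8.
analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.2, alpha=0.05,
                                  power=0.8, alternative='two-sided')
print(ceil(n_required))  # rounds up to 394

Rounding the solver's output up to the next integer yields the 394 samples reported above.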

In selecting the $k$-value (defined as the minimum number of models that must exhibit the same hallucination for a task to be classified as prone to triggering hallucinations), we perform the following analysis:

As illustrated in Figure 7, we summarize the number of samples included in CodeHaluEval for different $k$-values. At $k=5$, 699 tasks are identified by at least five models as potentially triggering hallucinations, which far exceeds the minimum sample size of 394 derived from the power analysis. This ensures the reliability and robustness of our statistical analysis.

If the $k$-value were set higher than 5, the number of tasks identified as likely to trigger hallucinations would fall below 394, leading to insufficient statistical power and increasing the risk of unstable results.

In conclusion, we choose $k=5$ as the threshold: it meets the requirements for statistical power while increasing sensitivity to true hallucination scenarios and maintaining a low misidentification rate. This methodological choice provides strong statistical support for our benchmark and ensures the reliability and practicality of our results.

[Figure 7: Number of samples included in CodeHaluEval for different $k$-values.]

Classification of Code Hallucinations

Model                 #2 (↓)   #3 (↓)   Avg. (↓)
LLaMA-2-7B              0.72     0.00      0.36
Gemma-7B                2.15     0.00      1.07
ChatGLM-3-6B            2.15     0.14      1.14
GPT-4                   2.86     0.00      1.43
GPT-3.5                 3.00     0.14      1.57
Claude-3-haiku          3.15     0.00      1.57
DeepSeek Coder-6.7B     3.29     0.00      1.65
MagicCoder-7B           3.29     0.14      1.72
WizardCoder-7B          3.58     0.00      1.79
Gemini-1.0              3.86     0.29      2.07
Code LLaMA-7B           4.43     0.00      2.22
Mistral-7B              4.43     0.14      2.29
Qwen-turbo              4.86     0.00      2.43
StarCoder-16B           5.01     0.29      2.65
Ernie-3.5               5.29     0.43      2.86
CodeGeeX-2-6B          11.44     0.14      5.79
Mean                    3.97     0.11      2.04

Data compliance hallucinations occur when LLMs have a vague understanding of the data types and parameter values of the objects being manipulated, resulting in generated code that attempts to perform type-mismatched or rule-violating operations. This hallucination typically manifests as TypeError, ValueError, or ZeroDivisionError exceptions being thrown during code execution, reflecting the model’s unexpected behavior in data type validation, parameter value handling, and arithmetic operations during code generation.
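
To make the category concrete, the following minimal Python snippet (our own illustration, not drawn from CodeHaluEval) shows how each of the associated exceptions can arise:

# Illustrative only: operations that violate type or value rules.
try:
    "3" + 4                # TypeError: mixing str and int in addition
except TypeError as e:
    print("TypeError:", e)

try:
    int("forty-two")       # ValueError: well-typed argument with an invalid value
except ValueError as e:
    print("ValueError:", e)

try:
    1 / 0                  # ZeroDivisionError: arithmetic with an invalid divisor
except ZeroDivisionError as e:
    print("ZeroDivisionError:", e)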

Structure access hallucinations occur when LLMs misinterpret the data structures of the objects being manipulated, leading to generated code that attempts to access non-existent array indices or dictionary keys. This hallucination typically manifests as IndexError or KeyError exceptions being thrown during code execution, reflecting the model’s lack of clarity in understanding the data structures within the given context.
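
A minimal illustrative snippet (not taken from the benchmark) showing both exception types:

# Illustrative only: accessing structure elements that do not exist.
items = [1, 2, 3]
try:
    items[5]               # IndexError: index outside the list bounds
except IndexError as e:
    print("IndexError:", e)

config = {"mode": "fast"}
try:
    config["depth"]        # KeyError: key that was never inserted
except KeyError as e:
    print("KeyError:", e)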

Identity hallucinations occur when LLMs possess biased memories or lack sufficient understanding of the context. This leads to generated code that references undefined variables, accesses non-existent object properties, or uses unassigned variables in local scopes. This hallucination typically manifests as NameError, AttributeError, or UnboundLocalError exceptions being thrown during code execution, reflecting the model’s difficulty in managing long-distance dependencies and accurately tracking variable definitions and scopes.
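
The following self-contained sketch (illustrative only; the identifiers are hypothetical) triggers each of the three exception types:

# Illustrative only: references to identifiers the context never defines.
try:
    print(total)           # NameError: 'total' is never defined
except NameError as e:
    print("NameError:", e)

try:
    "abc".push("d")        # AttributeError: str objects have no 'push' attribute
except AttributeError as e:
    print("AttributeError:", e)

def bump():
    count += 1             # assignment makes 'count' local, but it is read before assignment
    return count

try:
    bump()
except UnboundLocalError as e:
    print("UnboundLocalError:", e)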

External source hallucinations occur when LLMs exhibit significant memory-related issues or obvious conflicts with facts concerning external knowledge sources, resulting in generated code that attempts to import non-existent modules or fails to correctly load modules from other paths. This hallucination typically manifests as ImportError or ModuleNotFoundError exceptions being thrown during code execution, reflecting the model’s lack of deep understanding and accurate memory of the representation and organization of external knowledge sources.
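
For illustration, both exceptions can be reproduced as follows (the module name nonexistent_package is hypothetical; the math example mirrors the case study in this appendix):

# Illustrative only: importing modules or symbols that do not exist.
try:
    import nonexistent_package          # ModuleNotFoundError: hypothetical module name
except ModuleNotFoundError as e:
    print("ModuleNotFoundError:", e)

try:
    from math import modulo             # ImportError: math provides no 'modulo' symbol
except ImportError as e:
    print("ImportError:", e)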

Physical constraint hallucinations occur when LLMs underestimate resource consumption during data processing operations, causing code failure during execution due to exceeding memory capacity, stack depth, or other physical constraints. This hallucination typically manifests as RecursionError or MemoryError exceptions being thrown during code execution, reflecting the model’s inadequate understanding of the global operating conditions.
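
An illustrative sketch (not from the benchmark); the MemoryError case is only described in a comment, since deliberately exhausting memory is unsafe to run:

# Illustrative only: exceeding interpreter resource limits.
def recurse(n):
    return recurse(n + 1)               # unbounded recursion exhausts the call stack

try:
    recurse(0)
except RecursionError as e:
    print("RecursionError:", e)

# A MemoryError arises analogously when an allocation (e.g., a list sized by a
# huge N) exceeds available memory; it is not triggered here for safety.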

Computational boundary hallucinations occur when LLMs blur recognition of numerical calculation limits and iteration endpoints during data processing operations, causing code failure due to numerical overflow or improper iteration control. This hallucination typically manifests as OverflowError or StopIteration exceptions being thrown during code execution, reflecting the model’s erroneous understanding of numerical boundary conditions and insufficient grasp of the inherent logic of iterators, loops, and control flows.
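
A minimal illustration (our own example, not from the benchmark) of both boundary exceptions:

# Illustrative only: crossing numerical or iteration boundaries.
import math

try:
    math.exp(1000)                      # OverflowError: result exceeds the float range
except OverflowError as e:
    print("OverflowError:", e)

it = iter([1])
next(it)                                # consumes the only element
try:
    next(it)                            # StopIteration: iterator is already exhausted
except StopIteration:
    print("StopIteration: iterator exhausted")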

Logic deviation occurs when LLMs generate code that lacks sufficient logical consideration or contradicts the intended instructions. This hallucination usually does not cause errors during code execution; however, due to logical deviations or confusion, the program’s outcomes fail to meet the expected results. This issue reflects the model’s insufficient handling of logic and inadequate understanding of algorithms during the code generation process.

Logic breakdown occurs when LLMs struggle to interpret or maintain a continuous understanding of context during code generation. This indicates that the models may lose direction while generating code, making it difficult to maintain strict consistency of contextual information. Consequently, the models might generate stuttering, infinite enumeration, or gibberish code that often lacks structural coherence and logical consistency.

Case Study

Table 4-A provides case studies of different models across various types of code hallucinations.

NL-Description
Salem gave you $n$ sticks with integer positive lengths $a_1, a_2, \ldots, a_n$. For every stick, you can change its length to any other positive integer length (that is, either shrink or stretch it). The cost of changing the stick's length from $a$ to $b$ is $|a-b|$, where $|x|$ means the absolute value of $x$. A stick length $a_i$ is called almost good for some integer $t$ if $|a_i - t| \leq 1$. Salem asks you to change the lengths of some sticks (possibly all or none), such that all sticks' lengths are almost good for some positive integer $t$ and the total cost of changing is minimum possible. The value of $t$ is not fixed in advance and you can choose it as any positive integer. As an answer, print the value of $t$ and the minimum cost. If there are multiple optimal choices for $t$, print any of them. Input: The first line contains a single integer $n$ ($1 \leq n \leq 1000$), the number of sticks. The second line contains $n$ integers $a_i$ ($1 \leq a_i \leq 100$), the lengths of the sticks. Output: Print the value of $t$ and the minimum possible cost. If there are multiple optimal choices for $t$, print any of them.

Generated Code
n = int(input())
a = list(map(int, input().split()))
min_cost = float('inf')
t = 0
for i in range(1, 101):
    cost = sum(abs(x - i) - (abs(x - i) > 1) for x in a)
    if cost < min_cost:
        min_cost = cost
        t = i
print(t, min_cost)

Input: 3 \n 10 1 4
Expected Output: 3 7
Executed Result: Logic deviation: no exceptions were thrown at runtime, but logical confusion caused the output to not meet expectations.

Reference Code
n = int(input())
arr = list(map(int, input().split()))
arr.sort()
a = []
for t in range(1, 101):
    tot = 0
    for item in arr:
        if (abs(item - t) >= 1):
            tot += abs(item - t) - 1
    a.append((tot, t))
a.sort()
print(a[0][1], a[0][0])

Input: 3 \n 10 1 4
Expected Output: 3 7
Executed Result: 3 7
NL-Description
A permutation of length $n$ is an array consisting of $n$ distinct integers from $1$ to $n$ in arbitrary order. For example, $[2,3,1,5,4]$ is a permutation, but $[1,2,2]$ is not a permutation ($2$ appears twice in the array) and $[1,3,4]$ is also not a permutation ($n=3$ but there is $4$ in the array). Consider a permutation $p$ of length $n$; we build a graph of size $n$ using it as follows: For every $1 \leq i \leq n$, find the largest $j$ such that $1 \leq j < i$ and $p_j > p_i$, and add an undirected edge between node $i$ and node $j$. For every $1 \leq i \leq n$, find the smallest $j$ such that $i < j \leq n$ and $p_j > p_i$, and add an undirected edge between node $i$ and node $j$. In cases where no such $j$ exists, we make no edges. Also, note that we make edges between the corresponding indices, not the values at those indices. For clarity, consider as an example $n=4$ and $p=[3,1,4,2]$; here, the edges of the graph are $(1,3),(2,1),(2,3),(4,3)$. A permutation $p$ is cyclic if the graph built using $p$ has at least one simple cycle. Given $n$, find the number of cyclic permutations of length $n$. Since the number may be very large, output it modulo $10^9+7$. Please refer to the Notes section for the formal definition of a simple cycle. Input: The first and only line contains a single integer $n$ ($3 \leq n \leq 10^6$). Output: Output a single integer $0 \leq x < 10^9+7$, the number of cyclic permutations of length $n$ modulo $10^9+7$.

Generated Code
def count_cyclic_permutations(n):
    MOD = 10**9 + 7
    powers = [1]
    powers.extend([(p * i) % MOD for i in range(1, n+1)])
    factorial = [1]
    for i in range(1, n+1):
        factorial.append((factorial[i-1] * i) % MOD)
    inverse_factorial = [1]
    inv = powers[MOD-2]
    for i in range(1, n+1):
        inverse_factorial.append((inverse_factorial[i-1] * inv) % MOD)
    result = 0
    for i in range(1, n+1):
        result = (result + (powers[n-1] - powers[n-i-1]) * factorial[i-1] * inverse_factorial[i-1]) % MOD
    return result

# Read input
n = int(input())
# Calculate and print the result
print(count_cyclic_permutations(n))

Input: 4 \n
Expected Output: 16
Executed Result: Identity hallucination: use of undefined variables.

Reference Code
n = int(input())
MOD = 10 ** 9 + 7
ans1 = 1
ans2 = 1
for i in range(1, n+1):
    ans1 *= i
    ans1 %= MOD
for i in range(n-1):
    ans2 *= 2
    ans2 %= MOD
print((ans1 - ans2) % MOD)

Input: 4 \n
Expected Output: 16
Executed Result: 16
NL-Description
Ksenia has ordinary pan scales and several weights of an equal mass. Ksenia has already put some weights on the scales, while other weights are untouched. Ksenia is now wondering whether it is possible to put all the remaining weights on the scales so that the scales are in equilibrium. The scales are in equilibrium if the total sum of weights on the left pan is equal to the total sum of weights on the right pan. Input: The first line has a non-empty sequence of characters describing the scales. In this sequence, an uppercase English letter indicates a weight, and the symbol "|" indicates the delimiter (the character occurs in the sequence exactly once). All weights recorded in the sequence before the delimiter are initially on the left pan of the scale. All weights recorded in the sequence after the delimiter are initially on the right pan of the scale. The second line contains a non-empty sequence of uppercase English letters. Each letter indicates a weight which is not used yet. It is guaranteed that all the English letters in the input data are different and that the input does not contain any extra characters. Output: If you cannot put all the weights on the scales so that the scales are in equilibrium, print the string "Impossible". Otherwise, print the description of the resulting scales, copying the format of the input. If there are multiple answers, print any of them.

Generated Code
def put_weights_on_scales(scales, weights):
    left_weights = []
    right_weights = []
    for weight in weights:
        if weight not in left_weights:
            left_weights.append(weight)
        if weight not in right_weights:
            right_weights.append(weight)
    if sum(left_weights) == sum(right_weights):
        return scales + "|" + "".join(right_weights)
    else:
        return "Impossible"

# Example usage
scales = input().strip()
weights = input().strip()
print(put_weights_on_scales(scales, weights))

Input: AC|T\nL\n
Expected Output: AC|TL\n
Executed Result: Data compliance hallucination: chaotic data types.

Reference Code
scales = input().split('|')
weights = list(input())
balance = len(scales[0]) - len(scales[1])
while weights:
    if len(scales[0]) < len(scales[1]):
        scales[0] += weights[0]
        weights.pop(0)
    else:
        scales[1] += weights[0]
        weights.pop(0)
    balance = len(scales[0]) - len(scales[1])
if balance == 0:
    print('|'.join(scales))
else:
    print('Impossible')

Input: AC|T\nL\n
Expected Output: AC|TL\n
Executed Result: AC|TL\n
NL-Description
Polycarp is a music editor at the radio station. He received a playlist for tomorrow that can be represented as a sequence $a_1, a_2, \ldots, a_n$, where $a_i$ is the band which performs the $i$-th song. Polycarp likes bands with the numbers from $1$ to $m$, but he doesn't really like others. We define $b_j$ as the number of songs the group $j$ is going to perform tomorrow. Polycarp wants to change the playlist in such a way that the minimum among the numbers $b_1, b_2, \ldots, b_m$ will be as large as possible. Find this maximum possible value of the minimum among the $b_j$ ($1 \leq j \leq m$), and the minimum number of changes in the playlist Polycarp needs to make to achieve it. One change in the playlist is a replacement of the performer of the $i$-th song with any other group. Input: The first line of the input contains two integers $n$ and $m$ ($1 \leq m \leq n \leq 2000$). The second line contains $n$ integers $a_1, a_2, \ldots, a_n$ ($1 \leq a_i \leq 10^9$), where $a_i$ is the performer of the $i$-th song. Output: In the first line print two integers: the maximum possible value of the minimum among the $b_j$ ($1 \leq j \leq m$), where $b_j$ is the number of songs in the changed playlist performed by the $j$-th band, and the minimum number of changes in the playlist Polycarp needs to make. In the second line print the changed playlist. If there are multiple answers, print any of them.

Generated Code
def calculate_probability(n, balls):
    balls.sort()
    total_ways = n * (n - 1) * (n - 2) // 6
    favorable_ways = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if balls[i] < balls[j] and balls[i] < balls[k] and balls[j] > balls[k]:
                    favorable_ways += 1
    return favorable_ways / total_ways

n = int(input())
balls = list(map(int, input().split()))
result = calculate_probability(n, balls)
print("{:.10f}".format(result))

Input: 4 4\n1000000000 100 7 1000000000
Expected Output: 1 4\n1 2 3 4
Executed Result: Structure access hallucination: the accessed index exceeds the range of the list.

Reference Code
from collections import Counter

n, m = [int(x) for x in input().split()]
a = [int(x) for x in input().split()]
tgt = n // m
b = Counter(a)
rd = sum(b[x] for x in b if x > m)
r = 0
for i in range(1, m + 1):
    while rd and b[i] < tgt:
        for j in range(n):
            if a[j] > m:
                b[a[j]] -= 1
                b[i] += 1
                a[j] = i
                rd -= 1
                r += 1
                break
    while b[i] < tgt:
        for j in range(n):
            if b[a[j]] > tgt:
                b[a[j]] -= 1
                b[i] += 1
                a[j] = i
                r += 1
                break
print(tgt, r)
print(" ".join(str(x) for x in a))

Input: 4 4\n1000000000 100 7 1000000000
Expected Output: 1 4\n1 2 3 4
Executed Result: 1 4\n1 2 3 4
NL-Description
Salem gave you $n$ sticks with integer positive lengths $a_1, a_2, \ldots, a_n$. For every stick, you can change its length to any other positive integer length (that is, either shrink or stretch it). The cost of changing the stick's length from $a$ to $b$ is $|a-b|$, where $|x|$ means the absolute value of $x$. A stick length $a_i$ is called almost good for some integer $t$ if $|a_i - t| \leq 1$. Salem asks you to change the lengths of some sticks (possibly all or none), such that all sticks' lengths are almost good for some positive integer $t$ and the total cost of changing is minimum possible. The value of $t$ is not fixed in advance and you can choose it as any positive integer. As an answer, print the value of $t$ and the minimum cost. If there are multiple optimal choices for $t$, print any of them. Input: The first line contains a single integer $n$ ($1 \leq n \leq 1000$), the number of sticks. The second line contains $n$ integers $a_i$ ($1 \leq a_i \leq 100$), the lengths of the sticks. Output: Print the value of $t$ and the minimum possible cost. If there are multiple optimal choices for $t$, print any of them.

Generated Code
from math import modulo

def number_of_games(a, b, k, t):
    games = (2 * k + 1) ** t
    return games % (10 ** 9 + 7)

print(number_of_games(1, 2, 2, 1))
print(number_of_games(1, 1, 1, 2))
print(number_of_games(2, 12, 3, 1))

Input: 1 2 2 1
Expected Output: 6
Executed Result: External source hallucination: the referenced component does not exist in the module.

Reference Code
def c(n, k):
    if k > n:
        return 0
    a = b = 1
    for i in range(n - k + 1, n + 1):
        a *= i
    for i in range(1, k + 1):
        b *= i
    return a // b

a, b, k, t = map(int, input().split())
n, m, s = 2 * k + 1, 2 * t, 2 * k * t + b - a
ans, mod = 0, 1000000007
for i in range(m + 1):
    ans = (ans + [1, -1][i & 1] * c(m, i) * c(m + s - n * i, m)) % mod
print((pow(n, m, mod) - ans) % mod)

Input: 1 2 2 1
Expected Output: 6
Executed Result: 6
NL-Description
Reziba has many magic gems. Each magic gem can be split into $M$ normal gems. The amount of space each magic (and normal) gem takes is $1$ unit. A normal gem cannot be split. Reziba wants to choose a set of magic gems and split some of them, so the total space occupied by the resulting set of gems is $N$ units. If a magic gem is chosen and split, it takes $M$ units of space (since it is split into $M$ gems); if a magic gem is not split, it takes $1$ unit. How many different configurations of the resulting set of gems can Reziba have, such that the total amount of space taken is $N$ units? Print the answer modulo $1000000007$ ($10^9+7$). Two configurations are considered different if the number of magic gems Reziba takes to form them differs, or the indices of gems Reziba has to split differ. Input: The input contains a single line consisting of $2$ integers $N$ and $M$ ($1 \leq N \leq 10^{18}$, $2 \leq M \leq 100$). Output: Print one integer, the total number of configurations of the resulting set of gems, given that the total amount of space taken is $N$ units. Print the answer modulo $1000000007$ ($10^9+7$).

Generated Code
def count_configurations(N, M):
    MOD = 1000000007
    dp = [0] * (N + 1)
    dp[0] = 1
    for i in range(1, N + 1):
        for j in range(1, min(i, M) + 1):
            dp[i] = (dp[i] + dp[i - j]) % MOD
    return dp[N]

N, M = map(int, input().split())
print(count_configurations(N, M))

Input: 1000000000000000000 2
Expected Output: 680057396
Executed Result: Physical constraint hallucination: underestimating the resources required to execute the code.

Reference Code
import sys

MOD = 10**9 + 7

def polymod(P, Q):
    assert(Q[-1] == 1)
    n = len(Q)
    while len(P) >= n:
        p = P[-1]
        for i in range(n):
            P[-i-1] -= p*Q[-i-1]
        assert(P[-1] == 0)
        P.pop()
    return P

def polyprod(P, Q):
    W = [0]*(len(P) + len(Q) - 1)
    for i in range(len(P)):
        for j in range(len(Q)):
            W[i+j] += P[i]*Q[j]
    return [w % MOD for w in W]

def power(A, B, m, mult):
    if m == 0:
        return B
    while m > 1:
        if m % 2 == 1:
            B = mult(A, B)
        A = mult(A, A)
        m //= 2
    return mult(A, B)

def calc_nth_term(init, linear_coeff, n):
    def mult(A, B):
        return polymod(polyprod(A, B), linear_coeff)
    ans = power([0, 1], [1], n, mult)
    return sum(ans[i]*init[i] for i in range(len(ans))) % MOD

n, m = map(int, input().split())
linear_rec = [0]*(m+1)
linear_rec[0] = linear_rec[m-1] = -1
linear_rec[m] = 1
print(calc_nth_term([1]*m, linear_rec, n))

Input: 1000000000000000000 2
Expected Output: 680057396
Executed Result: 680057396
NL-Description
Given are two strings $s$ and $t$ consisting of lowercase English letters. Determine if there exists an integer $i$ satisfying the following condition, and find the minimum such $i$ if it exists. Let $s'$ be the concatenation of $10^{100}$ copies of $s$. $t$ is a subsequence of the string $s'_1 s'_2 \ldots s'_i$ (the first $i$ characters in $s'$).

Notes

A subsequence of a string $a$ is a string obtained by deleting zero or more characters from $a$ and concatenating the remaining characters without changing the relative order. For example, the subsequences of contest include net, c, and contest.

Constraints

$1 \leq |s| \leq 10^5$, $1 \leq |t| \leq 10^5$; $s$ and $t$ consist of lowercase English letters.

Input

Input is given from Standard Input in the following format:
s
t

Output

If there exists an integer $i$ satisfying the condition, print the minimum such $i$; otherwise, print $-1$.

Generated Code
import sys

def solve(s, t):
    s = s * (10 ** 100)
    i = 0
    j = 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            j += 1
        i += 1
    if j == len(t):
        return i - len(t)
    else:
        return -1

s = input()
t = input()
print(solve(s, t))

Input: zv\nz
Expected Output: 1
Executed Result: Computational boundary hallucination: uses an integer that exceeds the system's index size limit.

Reference Code
import bisect

def solve(S, T):
    dic = {}
    for i, s in enumerate(S, 1):
        if s not in dic:
            dic[s] = []
        dic[s].append(i)
    prov = 0
    turn = 0
    for t in T:
        if t not in dic:
            return -1
        x = bisect.bisect(dic[t], prov)
        if x == len(dic[t]):
            turn += 1
            prov = dic[t][0]
        else:
            prov = dic[t][x]
    return turn * len(S) + prov

S = input()
T = input()
print(solve(S, T))

Input: zv\nz
Expected Output: 1
Executed Result: 1
