Investigating Code Hallucinations in LLMs via Execution-based Verification (2024)

Yuchen Tian¹*, Weixiang Yan²*, Qian Yang³,⁴, Xuandong Zhao⁵, Qian Chen⁶, Wen Wang⁶, Ziyang Luo⁷, Lei Ma¹, Dawn Song⁵
¹The University of Tokyo  ²UC Santa Barbara  ³Mila-Québec AI Institute  ⁴Université de Montréal
⁵UC Berkeley  ⁶Alibaba Group  ⁷Hong Kong Baptist University
yuchentovo@gmail.com, weixiangyan@ucsb.edu, qian.yang@mila.quebec,
xuandongzhao@berkeley.edu, {tanqing.cq,w.wang}@alibaba-inc.com,
cszyluo@comp.hkbu.edu.hk, ma.lei@acm.org, dawnsong@cs.berkeley.edu
*Equal contribution. Corresponding to: weixiangyan@ucsb.edu

Abstract

Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community’s understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucinations based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.

CodeHalu: Investigating Code Hallucinations in LLMs via
Execution-based Verification




1 Introduction

[Figure 1: The four main categories and eight subcategories of code hallucinations.]

Deep neural networks often generate erroneous information that contradicts the original content, cannot be verified, or conflicts with real-world knowledge. This phenomenon, commonly known as model hallucination, attracts widespread attention in the fields of natural language processing and multimodal learning (Ji et al., 2023; Zhang et al., 2023; Liu et al., 2024), with the community actively exploring methods to mitigate hallucinations (Peng et al., 2023; Elaraby et al., 2023; Liu et al., 2023). However, the issue of model hallucination in the code generation domain remains unexplored.

Conducting a thorough and dedicated study on code hallucinations is crucial for improving the quality of code generated by LLMs. Firstly, the purpose of code is to solve problems, and its value is realized only when the code executes successfully and passes tests (Chen et al., 2021; Austin et al., 2021; Yan et al., 2023). This necessitates that the generated code not only maintain strict logic and precision but also undergo execution verification to confirm its correctness. Therefore, the practical use and verification of code differ significantly from natural language (NL) texts, meaning we cannot directly apply the definitions and methods used for NL hallucinations to code. Secondly, code snippets containing hallucinations may trigger runtime errors or exhibit functional defects, which hinder the reliable deployment of LLMs in automated software development scenarios. Lastly, by exploring and verifying code hallucinations in a targeted manner, we can effectively uncover their causes and contribute to improving the architecture and training methods of LLMs.

To fill this gap, we define the concept of code hallucination in LLMs based on the unique purpose and function of code. Code hallucinations refer to the phenomenon where code generated by LLMs is syntactically correct or even semantically plausible but ultimately cannot execute as expected or fails to meet specified requirements. (We test 16 LLMs using 105,958 code samples. The experimental results demonstrate that only 9 models occasionally exhibit syntactic errors in the generated code, with an exceptionally low average error rate of 0.0020. These findings support our initial hypothesis that the code generated by LLMs is generally syntactically correct and even semantically plausible or appropriate. Detailed statistical data and experimental results are shown in Appendix A.) This phenomenon typically arises from various factors, such as errors or outdated information in the training data, an inadequate grasp of the syntax rules and programming paradigms of the programming languages, and limitations in the logical processing capabilities of the models. In contrast to previous methods that passively explore hallucinations in NLP through a Q&A framework or by prompting LLMs to generate hallucinated answers (Lin et al., 2021; Cheng et al., 2023), we employ an active strategy to detect hallucinations during the code generation process of LLMs. This approach is crucial because the ultimate goal of generated code is to execute correctly and fulfill specific tasks.

To detect and quantify hallucinations in LLMs during code generation, we develop a dynamic detection algorithm named CodeHalu. This algorithm employs a statistical induction method based on execution validation to identify specific patterns that frequently occur in code generated by multiple LLMs, such as error types, syntax interruptions, or unexpected execution results. When a pattern consistently appears across multiple LLMs, it is recognized as a common code hallucination. Based on the CodeHalu algorithm (Algorithm 1), we employ an execution-based validation approach for hallucination detection, combined with a two-stage heuristic identification method. By conducting statistical quantification on 17 mainstream LLMs, we categorize code hallucinations into four major categories: Mapping, Naming, Resource, and Logic Hallucinations. These categories are further divided into eight subcategories, as illustrated in Figure 1. We analyze the cross-task occurrence rates of the eight categories of code hallucinations across the 17 LLMs. The low average rate of 2.04% confirms the independence and validity of our classification.

To effectively measure and compare code hallucinations across different LLMs, we introduce an evaluation benchmark named CodeHaluEval, which is based on the incidence rate of hallucinations. It follows a structured Validation-Identification-Construction process, as shown in Figure 4, to detect and evaluate code hallucinations in LLMs. The benchmark is closely tied to real-world programming scenarios and checks that generated code correctly achieves the expected functionality. CodeHaluEval covers the eight types of code hallucinations illustrated in Figure 1, spanning 699 distinct tasks and 8,883 samples. Additionally, we systematically evaluate 17 mainstream LLMs to reveal the distribution and behavior patterns of their code hallucinations. We also analyze the potential causes of various code hallucinations, providing detailed insights for further improving the code generation capabilities of LLMs. Our contributions can be summarized as follows:

  • Code Hallucination: We introduce the concept of code hallucination in LLMs and propose an execution-based verification method to define code hallucination, addressing a gap in the research on hallucination within the code generation domain.

  • CodeHalu Algorithm: We develop a dynamic detection algorithm, CodeHalu, to identify and quantify the types of hallucinations that occur in LLMs during code generation. We categorize code hallucinations into four main categories based on a two-stage heuristic approach, discussing their theoretical implications and potential causes.

  • CodeHaluEval Benchmark: We propose the CodeHaluEval benchmark to systematically evaluate 17 popular LLMs, revealing the distribution and patterns of code hallucinations across these models, and providing insights for developing more robust and reliable LLMs.

2 Related Work

Hallucination

In the field of NLP, hallucination is initially defined as the phenomenon where the text generated by a model is fluent and natural but either lacks substantive meaning or is inconsistent with the provided source content (Ji et al., 2023). Recently, Zhang et al. (2023) standardize the definition of NL hallucinations in LLMs into three categories: input-conflicting hallucinations, where the content generated by LLMs diverges from the user’s input; context-conflicting hallucinations, in which the generated content contradicts previously generated content; and fact-conflicting hallucinations, where the generated content conflicts with established world knowledge. These hallucinations are attributed to various factors, such as poor-quality data samples in the training dataset or the use of sampling algorithms with high uncertainty.

In the multimodal domain, Zhai et al. (2023) classify types of hallucinations in image-to-text scenarios, such as image captioning and visual question answering. They define three main types of hallucinations: object existence hallucinations, object attribute hallucinations, and object relationship hallucinations. In text-to-image scenarios, such as image generation, hallucinations refer to the creation of factually incorrect details by the image generation model in response to the given text input. Huang et al. (2024) introduce VHTest, which evaluates hallucinations across eight dimensions in images, including the existence, shape, color, orientation, OCR, size, position, and counting of visual objects. In text-to-video scenarios, such as video generation, Chu et al. (2024) define three types of hallucinations: prompt consistency hallucinations, static hallucinations, and dynamic hallucinations. Detailed definitions and descriptions of each type of hallucination are provided in Appendix A. Although the issue of hallucinations receives extensive attention in the NLP and multimodal domains, it remains unexplored in the code domain. Therefore, we propose CodeHalu to systematically define, identify, classify, and quantify code hallucinations in LLMs.

Existing Coding Benchmarks

In recent years, numerous studies focus on evaluating the capability of LLMs to handle various programming tasks. Among these, the HumanEval benchmark (Chen et al., 2021) includes 164 Python programming problems, each with an average of 6.7 unit tests. The MBPP benchmark (Austin et al., 2021) contains 974 Python programming tasks. Compared to these, the APPS benchmark (Hendrycks et al., 2021) presents more challenging programming questions, with each problem averaging 293.2 words in length. CodeScope (Yan et al., 2023) covers 43 programming languages and eight coding tasks to comprehensively evaluate LLMs in code understanding and generation. MMCode (Li et al., 2024) is designed to evaluate the programming capability of code models in multimodal scenarios. SWE-bench (Jimenez et al., 2023) evaluates the capability of LLMs to edit code repositories to solve problems with a level of complexity similar to that faced by human programmers in real-world programming situations. Overall, existing code benchmarks focus on evaluating the performance of LLMs on various programming tasks. However, there is still a lack of effective methods to detect and quantify potential hallucinations that may occur in code generation. Therefore, we propose CodeHaluEval to detect and quantify code hallucinations in LLMs.

3 Code Hallucination

As a tool, code aims to achieve specific objectives through correct execution. This inherent characteristic motivates our use of an execution-based verification method to explore and identify code hallucinations. In this section, we define the concept of code hallucination and distinguish it from code errors, clarifying the relationship and differences between these two phenomena.

Definition 1 (Code Hallucinations).

Code hallucinations refer to the code generated by large language models that is syntactically correct or even semantically plausible, but ultimately cannot execute as expected or fails to meet specified requirements.

Definition 2 (Code Errors).

Code errors refer to issues in a program that cause it to stop executing.

Remark 3 (Code Hallucinations vs. Code Errors).

In multiple domains, existing work (Ji et al., 2023; Zhang et al., 2023; Huang et al., 2024; Zhai et al., 2023; Chu et al., 2024) often equates errors with hallucinations, or considers errors as a specific subset of hallucinations. We follow this perspective and regard code errors as a specific subset of code hallucinations. In other words, errors manifest as a form of hallucination, but not all hallucinations can be adequately expressed through errors. Figure 2 illustrates the distinction between code errors and code hallucinations. The code on the left exhibits a typical code error due to the use of an undefined variable “N”, resulting in a NameError. On the right, the code repeatedly calls the same function due to a logical collapse during generation, eventually exceeding the maximum token limit and leading to a SyntaxError. However, the underlying issue is a latent logical hallucination, rather than the observed syntactic error.
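The distinction can be reproduced with a minimal, hypothetical pair of snippets in the spirit of Figure 2 (they are illustrative and not drawn from our benchmark): the first fails with a NameError because it references an undefined variable, while the second only surfaces a SyntaxError after truncation, even though the root cause is a logic breakdown.

```python
# Hypothetical snippets illustrating the distinction drawn in Figure 2
# (illustrative only, not samples from CodeHaluEval).

error_snippet = """
def count_items(nums):
    total = 0
    for i in range(N):   # `N` is never defined -> NameError at runtime
        total += nums[i]
    return total

print(count_items([1, 2, 3]))
"""

# Logic breakdown: generation collapsed into repeated calls and was cut off at
# the token limit, so the observed symptom is a SyntaxError, but the underlying
# issue is a latent logical hallucination.
breakdown_snippet = """
def solve(x):
    return helper(helper(helper(helper(helper(helper(x
"""

for name, src in [("code error", error_snippet), ("logic breakdown", breakdown_snippet)]:
    try:
        exec(compile(src, "<generated>", "exec"))
    except Exception as exc:
        print(f"{name}: {type(exc).__name__}: {exc}")
```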

Overall, although there is some slight overlap between code hallucinations and code errors, their meanings, research objects, and scopes differ significantly. Code hallucinations focus on why the model produces hallucinations, while code errors focus on what grammatical rules the code violates. Code errors form a proper subset of code hallucinations, while code hallucinations encompass a broader range of potential logical and functional issues, representing a finer-grained and more comprehensive evaluation of the overall quality and functionality of the code.

[Figure 2: A code error (NameError from an undefined variable) versus a code hallucination (a logic breakdown that surfaces as a SyntaxError).]

4 CodeHalu Algorithm

Algorithm 1: CodeHalu

1:  Input: code generation dataset α, language models π
2:  Output: HaluTypes ξ
3:  Let ξ ← empty list
4:  for α_i, where i ∈ {1, …, k} do
5:      for π_j, where j ∈ {1, …, m} do
6:          GC_j^{α_i} ← π_j(GI_j, Q)
7:          if GC_j^{α_i} is stuttering, infinite enumeration, or gibberish then
8:              ξ ← ξ ∪ State(GC_j^{α_i})
9:          else
10:             for t_n, where n ∈ {1, …, N} do
11:                 if Execute(GC_j^{α_i}(t_n)) terminates successfully then
12:                     ER_j^{α_i}(t_n) ← Execute(GC_j^{α_i}(t_n))
13:                     if ER_j^{α_i}(t_n) ≠ op_{t_n} then
14:                         ξ ← ξ ∪ State(GC_j^{α_i})
15:                 else
16:                     ξ ← ξ ∪ State(GC_j^{α_i})
17: Aggregate and count frequencies of unique State(GC_j^{α_i}) in ξ

In this section, we introduce a dynamic detection algorithm called CodeHalu, which detects and quantifies hallucinations in LLMs during code generation. CodeHalu operates on the assumption ASS: if a specific pattern frequently appears in the code generated by multiple LLMs, it is considered a common code hallucination. These patterns include error types, syntax interruptions, logical collapse, or unexpected execution results.

Consider a dataset α containing k samples, where each sample α_i consists of a problem description Q and a series of test cases t_1, t_2, …, t_n. Each test case t_n includes an input ip and the corresponding expected output op. Notably, following previous work (Li et al., 2023b; Yan et al., 2023), we integrate resource (time and memory) constraints into the code generation instructions. As shown in Algorithm 1, we use a model π_j to generate a code solution GC_j^{α_i} for each sample α_i based on the code generation instruction GI_j and the problem description Q. If GC_j^{α_i} exhibits any degenerate state, such as stuttering, infinite loops, or gibberish, we include it in ξ.

To test the potential hallucinations of GC_j^{α_i} at a fine-grained level, we execute all test cases of sample α_i one by one to verify whether the code runs successfully and meets the expected functionality. We record the actual execution result ER_j^{α_i}(t_n) of the code under each test case t_n and test each sample α_i across more than 15 models π to obtain statistically grounded inductive results. If the code execution fails or does not meet the expected results, we record it in ξ.

Finally, we merge the identical states State(GC_j^{α_i}) detected by CodeHalu and calculate their occurrence frequencies. According to assumption ASS, ξ can be represented as [(ξ_1, P_1), …, (ξ_o, P_o)], where ξ_o denotes the o-th type of code hallucination and P_o indicates its frequency. Code hallucinations are thus characterized from four perspectives: errors, syntax, logic, and execution results. Additionally, CodeHalu is language-agnostic and can be adapted to various programming scenarios depending on the programming language.
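For concreteness, a compact sketch of this execution-based detection loop is shown below. The helper functions, data layout, and state labels are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the CodeHalu detection loop (Algorithm 1); helper
# names and data structures are assumptions, not the released implementation.
from collections import Counter
import subprocess

def is_degenerate(code: str) -> bool:
    """Cheap proxy for stuttering, infinite enumeration, or gibberish."""
    lines = [ln for ln in code.splitlines() if ln.strip()]
    return len(lines) > 20 and len(set(lines)) < len(lines) // 10

def execute(code: str, test_input: str, timeout: float = 5.0):
    """Run a candidate program on one test case; return (status, stdout)."""
    try:
        proc = subprocess.run(
            ["python", "-c", code], input=test_input, text=True,
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "Timeout", ""
    if proc.returncode != 0:
        last = proc.stderr.strip().splitlines()[-1] if proc.stderr.strip() else "RuntimeError"
        return last.split(":")[0], ""          # e.g. "NameError", "RecursionError"
    return "OK", proc.stdout.strip()

def codehalu(dataset, models, generate_code):
    """Collect hallucination states across samples and models, then count them."""
    states = []
    for sample in dataset:                     # sample: {"question", "test_cases"}
        for model in models:
            code = generate_code(model, sample["question"])
            if is_degenerate(code):
                states.append("logic_breakdown")
                continue
            for case in sample["test_cases"]:  # case: {"input", "expected_output"}
                status, output = execute(code, case["input"])
                if status != "OK":
                    states.append(status)      # runtime error type observed
                elif output != case["expected_output"].strip():
                    states.append("wrong_output")
    return Counter(states)                     # frequency of each candidate state
```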

5 Code Hallucinations Classification

In this section, we analyze the hallucination states detected by the CodeHalu algorithm, classify and define four main types of hallucinations, and discuss the rationale behind the classification method.

According to the TIOBE Index (https://www.tiobe.com/tiobe-index/), a metric of programming language popularity, we primarily investigate code hallucinations in Python. By applying the CodeHalu algorithm to the complex APPS dataset (Hendrycks et al., 2021) and 17 widely-used LLMs, we identify and validate 18 types of hallucination states that violate human expectations during code generation, including inconsistent code context, ambiguous logic and data flow, and conflicting intentions, among others. Using the two-stage heuristic classification method introduced in Remark 8, we categorize code hallucinations into four main types based on the nature and origin of these phenomena: mapping hallucinations, naming hallucinations, resource hallucinations, and logic hallucinations, as illustrated in Figure 1.

Definition 4 (Mapping Hallucinations).

Mapping Hallucinations refer to the ambiguity and confusion that occur in LLMs’ perception and mapping of data types, values, and structures during data operations. This phenomenon is further divided into two sub-categories: data compliance hallucinations and structure access hallucinations.

Data compliance hallucinations occur when LLMs have a vague understanding of the data types and parameter values of the objects being manipulated, resulting in generated code that attempts to perform type-mismatched or rule-violating operations.

Structure access hallucinations occur when LLMs misinterpret the data structures of the objects being manipulated, leading to generated code that attempts to access non-existent array indices or dictionary keys.
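For illustration, two minimal, hypothetical Python snippets of the kind this category covers are shown below (they are not samples from CodeHaluEval):

```python
# Minimal, hypothetical examples of mapping hallucinations (illustrative only).

# Data compliance hallucination: an operation is applied to a value whose type
# does not support it.
try:
    n = "5"              # the model treats the string as if it were an int
    print(n + 1)         # TypeError: can only concatenate str (not "int") to str
except TypeError as exc:
    print("data compliance:", exc)

# Structure access hallucination: the code assumes a key or index that the
# data structure does not contain.
try:
    config = {"name": "task1"}
    print(config["timeout"])   # KeyError: 'timeout'
except KeyError as exc:
    print("structure access: missing key", exc)
```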

Definition 5 (Naming Hallucinations).

Naming Hallucinations refer to the memory-related issues and factual inaccuracies exhibited by LLMs when handling the naming, scope, and existence of variables, attributes, and modules. This phenomenon is further divided into two subcategories: identity hallucinations and external source hallucinations.

Identity hallucinations occur when LLMs possess biased memories or lack sufficient understanding of the context, leading to generated code that references undefined variables, accesses non-existent object properties, or uses unassigned variables in local scopes.

External source hallucinations occur when LLMs exhibit significant memory-related issues or obvious conflicts with facts concerning external knowledge sources, resulting in generated code that attempts to import non-existent modules or fails to correctly load modules from other paths.
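Two minimal, hypothetical snippets of the kind this category covers (illustrative only; the package name below is fabricated on purpose):

```python
# Minimal, hypothetical examples of naming hallucinations (illustrative only).

# Identity hallucination: the generated code references a variable that was
# never defined in the current scope.
try:
    print(total_count)       # NameError: name 'total_count' is not defined
except NameError as exc:
    print("identity:", exc)

# External source hallucination: the generated code imports a module that does
# not exist (a fabricated library name).
try:
    import fastmathlib       # hypothetical, non-existent package
except ImportError as exc:
    print("external source:", exc)
```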

Definition 6 (Resource Hallucinations).

Resource Hallucinations occur when LLMs lack an adequate perception and prediction of resource consumption and control flow of the generated code during execution. This phenomenon is further divided into physical constraint hallucinations and computational boundary hallucinations.

Physical constraint hallucinations arise when LLMs underestimate resource consumption during data processing operations, causing code failure due to exceeding memory capacity, stack depth, or other physical constraints.

Computational boundary hallucinations occur when LLMs blur recognition of numerical calculation limits and iteration endpoints during data processing operations, causing code failure due to numerical overflow or improper iteration control.
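Two minimal, hypothetical snippets of the kind this category covers (illustrative only):

```python
# Minimal, hypothetical examples of resource hallucinations (illustrative only).
import math

# Physical constraint hallucination: unbounded recursion exceeds the
# interpreter's stack depth.
def factorial(n):
    return n * factorial(n - 1)      # missing base case -> RecursionError

try:
    factorial(10)
except RecursionError as exc:
    print("physical constraint:", exc)

# Computational boundary hallucination: a numerical calculation exceeds the
# representable range.
try:
    print(math.exp(1000))            # OverflowError: math range error
except OverflowError as exc:
    print("computational boundary:", exc)
```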

Definition 7 (Logic Hallucinations).

Logic Hallucinations refer to discrepancies between the expected results and the actual outcomes after executing the code generated by LLMs, or to generated outputs with low semantic density or even complete chaos. This phenomenon is further divided into logic deviation and logic breakdown.

Logic deviation occurs when LLMs generate code that lacks sufficient logical consideration or contradicts the intended instructions. While this hallucination may not cause errors during execution, logical deviations or confusion result in outcomes that fail to meet the expected results.

Logic breakdown occurs when LLMs struggle to interpret or maintain a continuous understanding of context during code generation. This indicates that the models may lose direction while generating code, making it difficult to maintain strict consistency of contextual information.
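Two minimal, hypothetical snippets of the kind this category covers (illustrative only):

```python
# Minimal, hypothetical examples of logic hallucinations (illustrative only).

# Logic deviation: the code runs without error but computes the wrong result,
# here an off-by-one error when summing the first n positive integers.
def sum_first_n(n):
    return sum(range(n))             # should be range(1, n + 1)

print(sum_first_n(5), "expected", 15)   # prints 10 instead of 15

# Logic breakdown: generation degenerates into repetition with little semantic
# content, e.g. the same statement emitted over and over until the token limit.
result = 0
result = result + 1
result = result + 1
result = result + 1
result = result + 1   # ...the pattern keeps repeating
print("logic breakdown example result:", result)
```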

Remark 8 (Discussion of Rationality).

To ensure the rationality and effectiveness of our code hallucination classification method, we conduct in-depth analyses.

Firstly, we extensively reference classification methods for hallucinations in the fields of NLP and multimodal research (Zhang et al., 2023; Ji et al., 2023; Huang et al., 2024; Zhai et al., 2023; Chu et al., 2024), as well as methods for classifying code errors and vulnerabilities in software engineering (Pan et al., 2023; Wang et al., 2024; Huang et al., 2023). We adopt a two-stage heuristic classification strategy. Initially, team members independently review failed code cases and develop preliminary classification frameworks; then, we reach a consensus through collaborative discussions. This widely used approach ensures the adaptability and accuracy of our framework, enabling a systematic understanding of code hallucinations in LLMs.

Secondly, we analyze the cross-task occurrence rates of each model across eight categories of code hallucinations, detailed in Appendix Table 3. The results show that the average cross-task occurrence rate for these categories is only 2.04%, confirming the independence and rationality of our classification. For the Gemma-7B model, which exhibits the most severe hallucinations in Table 2, only 1.07% of task samples show cross-task hallucinations, as illustrated in Figure 3.

Lastly, we conduct an empirical investigation of our classification results and design a questionnaire to evaluate the rationality of our method, detailed in Appendix Figure 8. The survey receives 23 responses; after excluding seven respondents with fewer than three years of programming experience, we analyze 16 valid responses. The survey results indicate a rationality rating of 91.08% for our classification method, further supporting its validity.

[Figure 3: Cross-task hallucination occurrence for Gemma-7B.]
[Figure 4: The Validation-Identification-Construction process used to build CodeHaluEval.]

6 Cause Analysis of Code Hallucinations

In this section, we explore the potential causes of various hallucinations generated by LLMs, aiming to provide valuable insights for optimizing training data, training methods, model architecture, and alignment strategies.

Mapping hallucinations stem from the model’s misunderstanding of data types and structures. This phenomenon arises due to several factors: (1) The model generates code based on tokens, lacking insight into higher-level structures such as statements and functions (Yang et al., 2021); (2) When dealing with long-distance data dependencies, especially within complex code blocks, the model fails to continuously track the structure and state of variables, overly relying on local information while neglecting the importance of the overall context (Zhang et al., 2024); (3) The model does not explicitly perform type checking and structure matching during code generation, lacking static checking and error correction mechanisms.

Naming hallucinations reflect the limitations of models in tracking information and utilizing external knowledge. This issue arises from several factors: (1) Token-based feature representation makes it difficult to accurately model long-distance dependencies, leading to model misjudgments regarding variable scope, lifecycle, and visibility (Xu et al., 2020); (2) The code generation process lacks consistency checks for identifiers and does not perform global tracking of variable definitions and usage; (3) Knowledge of external libraries is not effectively and timely integrated into the model’s knowledge system, making it difficult for the model to accurately understand the names, functions, and invocation methods of libraries (Jesse et al., 2023).

Resource hallucinations highlight the model’s lack of deep understanding of code execution mechanisms and physical constraints. These issues arise from several factors: (1) The training data lacks information related to resource consumption and performance optimization, making it difficult for the model to learn about complexity analysis and resource evaluation; (2) As the model generates code based on probabilities, it lacks a module for calculating and estimating the resource consumption of the generated code, making it unable to simulate the real-world operating environment and resource limits; (3) During the model training process, the focus is on the correctness of the code’s functionality, often overlooking its complexity and resource constraints in actual execution environments.

Logic hallucinations reveal the model’s deficiencies in semantic understanding and reasoning about code. This issue arises due to several factors: (1) The model relies on pattern matching and statistical rules to generate code, lacking a fundamental understanding of symbolic systems and rigorous verification of program logic; (2) The training data is often not rigorously verified for accuracy and may contain code with very similar functions. Since models sometimes imitate and memorize previous examples (Yan and Li, 2022), this can result in the model directly replicating similar logic in the code or even learning incorrect logic from the outset; (3) When the model generates code, repetition at the line level has a self-reinforcing effect, causing the model to become increasingly confident in the code it generates, which may lead to a stuttering phenomenon (Xu et al., 2022).

7 The CodeHaluEval Benchmark

We construct the CodeHaluEval benchmark, a unified evaluation method for comparing the types and frequencies of hallucinations in code generation across different LLMs. We build CodeHaluEval on the APPS test set, following a structured Validation-Identification-Construction process, as shown in Figure 4.

Table 1: Statistics of the CodeHaluEval benchmark.

Category  | #Tasks | #Samples | Sub-Category           | #Tasks | #Samples
Mapping   | 262    | 2,288    | Data Compliance        | 110    | 941
          |        |          | Structure Access       | 152    | 1,347
Naming    | 157    | 1,853    | Identity               | 115    | 1,323
          |        |          | External Source        | 42     | 530
Resource  | 107    | 1,130    | Physical Constraint    | 47     | 491
          |        |          | Computational Boundary | 60     | 639
Logic     | 173    | 3,612    | Logic Deviation        | 119    | 2,443
          |        |          | Logic Breakdown        | 54     | 1,169

Table 2: Hallucination rates (%, lower is better) of 17 LLMs on CodeHaluEval. DC = Data Compliance, SA = Structure Access, ID = Identity, ES = External Source, PC = Physical Constraint, CB = Computational Boundary, LD = Logic Deviation, LB = Logic Breakdown.

Model               | Mapping: DC / SA / Avg. | Naming: ID / ES / Avg. | Resource: PC / CB / Avg. | Logic: LD / LB / Avg.  | Average
GPT-4               | 32.31 / 10.02 / 19.19   | 27.74 / 0.57 / 19.97   | 0.20 / 3.76 / 2.21       | 85.76 / 0.51 / 58.17   | 33.04
LLaMA-3-8B          | 46.87 / 23.46 / 33.09   | 12.09 / 0.00 / 8.63    | 15.48 / 12.99 / 14.07    | 78.39 / 0.00 / 53.02   | 33.67
DeepSeek Coder-6.7B | 24.23 / 25.61 / 25.04   | 15.80 / 0.00 / 11.28   | 17.52 / 17.21 / 17.35    | 99.06 / 0.17 / 67.05   | 38.28
GPT-3.5             | 20.19 / 22.05 / 21.28   | 30.54 / 0.00 / 21.80   | 18.53 / 6.42 / 11.68     | 99.88 / 0.00 / 67.55   | 38.98
Claude-3-haiku      | 38.68 / 29.25 / 33.13   | 9.07 / 0.75 / 6.69     | 37.07 / 20.81 / 27.88    | 100.00 / 0.17 / 67.69  | 41.00
ChatGLM-3-6B        | 36.13 / 44.91 / 41.30   | 50.87 / 0.00 / 36.32   | 24.85 / 2.35 / 12.12     | 88.99 / 0.00 / 60.19   | 44.23
Ernie-3.5           | 48.14 / 36.90 / 41.52   | 30.31 / 0.38 / 21.75   | 18.13 / 11.89 / 14.60    | 98.98 / 0.00 / 66.94   | 44.31
Qwen-turbo          | 49.63 / 48.33 / 48.86   | 29.48 / 2.08 / 21.64   | 7.94 / 2.82 / 5.04       | 98.08 / 0.17 / 66.39   | 44.74
MagicCoder-7B       | 50.27 / 26.58 / 36.32   | 17.69 / 0.00 / 12.63   | 21.18 / 28.33 / 25.22    | 100.00 / 16.25 / 72.90 | 44.84
Code LLaMA-7B       | 65.04 / 42.17 / 51.57   | 31.07 / 0.00 / 22.18   | 18.53 / 6.26 / 11.59     | 94.76 / 9.41 / 67.14   | 46.68
StarCoder-16B       | 48.14 / 38.83 / 42.66   | 60.70 / 9.25 / 45.98   | 28.92 / 11.11 / 18.85    | 95.09 / 0.77 / 64.56   | 49.23
LLaMA-2-7B          | 51.22 / 32.29 / 40.08   | 78.46 / 71.13 / 76.36  | 14.87 / 0.00 / 6.46      | 81.05 / 0.34 / 54.93   | 49.41
Gemini-1.0          | 34.11 / 53.53 / 45.54   | 45.88 / 0.00 / 32.76   | 24.44 / 16.12 / 19.73    | 98.65 / 10.35 / 70.07  | 49.57
Mistral-7B          | 45.48 / 36.53 / 40.21   | 59.18 / 15.85 / 46.79  | 27.49 / 10.80 / 18.05    | 99.35 / 0.17 / 67.25   | 49.76
WizardCoder-7B      | 26.57 / 31.40 / 29.41   | 31.29 / 0.00 / 22.34   | 33.20 / 9.39 / 19.73     | 93.90 / 72.37 / 86.93  | 50.10
CodeGeeX-2-6B       | 47.61 / 27.99 / 36.06   | 45.05 / 0.00 / 32.16   | 36.66 / 23.47 / 29.20    | 89.60 / 99.66 / 92.86  | 57.47
Gemma-7B            | 55.26 / 41.05 / 46.90   | 51.85 / 0.00 / 37.02   | 14.46 / 14.55 / 14.51    | 97.18 / 100.00 / 98.09 | 61.53

In the validation phase, we use the CodeHalu algorithm to identify multiple types of hallucinations HaluTypes ξ, represented as [(ξ_1, P_1), …, (ξ_i, P_i)]. In the identification phase, we annotate the 3k most common hallucinations and their frequencies in each sample α_i, represented as [(ξ_1, P_1), …, (ξ_{3k}, P_{3k})]. In the construction phase, we sort all samples in descending order based on the frequency P_i of each hallucination type ξ_i. If the hallucination frequency in a sample α_i exceeds the threshold k, we include this sample in the corresponding hallucination type set of the CodeHaluEval benchmark. When selecting the threshold k, we consider both the minimum number of samples required to detect code hallucination effects in the CodeHaluEval benchmark and the inference costs associated with evaluating various LLMs. The efficacy analysis and the selection process for the threshold k are detailed in Appendix A. Through this method, we establish the CodeHaluEval benchmark, with detailed statistics shown in Table 1.
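A minimal sketch of the construction phase is shown below, assuming a simple annotation layout in which each sample records how many models exhibit each hallucination type; the data layout and helper names are assumptions, and the threshold check follows the k = 5 rule discussed in Appendix A.

```python
# Illustrative sketch of the construction phase (assumed data layout, not the
# released pipeline): a sample joins a hallucination category when at least
# k = 5 evaluated models exhibit that hallucination on it.
from collections import defaultdict

def build_benchmark(annotations, k=5):
    """annotations: {sample_id: {hallucination_type: num_models_exhibiting_it}}"""
    benchmark = defaultdict(list)
    for sample_id, counts in annotations.items():
        for halu_type, freq in sorted(counts.items(), key=lambda kv: -kv[1]):
            if freq >= k:
                benchmark[halu_type].append(sample_id)
    return benchmark

# Toy example:
toy = {"task_001": {"wrong_output": 7, "NameError": 2},
       "task_002": {"TypeError": 5}}
print(dict(build_benchmark(toy)))
# -> {'wrong_output': ['task_001'], 'TypeError': ['task_002']}
```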

8 Experiments

Models. To comprehensively analyze the different hallucinations of various competitive LLMs on CodeHaluEval, we evaluate 12 general LLMs, including GPT-4 (OpenAI, 2023), GPT-3.5 (OpenAI, 2023), Gemini-Pro-1.0 (Gemini, 2023), Claude-3-haiku (Anthropic, 2024), LLaMA-2 & 3 (Touvron et al., 2023), Vicuna (Chiang et al., 2023), Qwen-turbo (Bai et al., 2023), ChatGLM3-6B (Du et al., 2021), Ernie-3.5 (Baidu, 2023), Mistral-7B (Jiang et al., 2023), and Gemma (Team et al., 2024). We also evaluate six coding LLMs, including Code LLaMA (Roziere et al., 2023), DeepSeek Coder (Guo et al., 2024), CodeGeeX-2 (Zheng et al., 2023), StarCoder-2 (Li et al., 2023a), MagicCoder-7B (Wei et al., 2023), and WizardCoder-7B (Luo et al., 2023). The experimental evaluation is conducted using API calls or 8 NVIDIA A6000 GPUs.

Metrics. Given the limited exploration of code hallucinations, no dedicated metrics currently exist for evaluating them in LLMs. To address this gap, we propose an evaluation metric called Hallucination Rate (HR). Specifically, HR is defined as the percentage of hallucination samples detected in the test set among all samples, with the formula HR = (1/N) Σ_{i=1}^{N} S(i, K), where S(i, K) is an indicator function: if the i-th sample satisfies the hallucination condition, then S(i, K) = 1; otherwise, S(i, K) = 0. Ideally, a lower HR indicates a lower likelihood of hallucinations during code generation by the LLM, thus demonstrating greater robustness and reliability. To our knowledge, HR is the first metric that accurately reflects the hallucination phenomenon in LLMs during code generation tasks through actual execution tests.
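The metric is straightforward to compute; a minimal implementation, reporting HR as a percentage as in Table 2, might look as follows.

```python
# Minimal implementation of the Hallucination Rate (HR) metric, reported as a
# percentage of samples whose generated code triggers the targeted hallucination.
def hallucination_rate(indicators):
    """indicators: list of 0/1 flags S(i, K), one per evaluated sample."""
    return 100.0 * sum(indicators) / len(indicators)

# e.g. 3 hallucinated samples out of 8 -> HR = 37.5 (%)
print(hallucination_rate([1, 0, 0, 1, 0, 1, 0, 0]))
```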

Result & Analysis

[Figure 5: Visualization of the experimental results across hallucination categories.]

The experimental results are presented in Table 2 and Figure 5, with case studies of various models across different hallucination categories detailed in Appendix A.

Mapping hallucination: GPT-4 and GPT-3.5 consistently identify and follow rules related to data types, values, and structures, demonstrating strong context sensitivity.

Naming hallucination: Claude-3 reliably remembers and references entity names from the context and external knowledge bases. In contrast, LLaMA-2 exhibits significant memory bias when processing external knowledge and occasionally fabricates information.

Resource hallucination: GPT-4, Qwen, and LLaMA-2 effectively account for actual resource constraints when generating code, showing an understanding of computational boundaries and limitations, which leads them to produce code with lower complexity.

Logical hallucination: Although all models face challenges in maintaining logical coherence, LLaMA-3 and GPT-4 perform relatively well in reducing repetition. Most models rarely generate code with stuttering or infinite loops, but such issues are more common in Gemma, CodeGeeX-2, and WizardCoder, indicating a tendency to lose semantic and logical consistency during code generation.

Overall, GPT-4 and LLaMA-3 perform well across all hallucination categories, displaying stability and robustness in various scenarios. Logical hallucinations remain the most prevalent issue across all models, while naming and resource hallucinations are relatively less common. The performance of different models varies significantly across hallucination types, likely due to differences in their training data, methods, and architectures. The average hallucination rate ranges from approximately 20% to 60%.

We view mitigating code hallucination as future work. Based on a detailed analysis of experimental results and generated cases, we provide insights into strategies for mitigating code hallucinations in LLMs. In terms of training data, improving the quality and increasing the diversity of data sources enhances the model’s generalization ability. In terms of training methods, employing alignment strategies based on compilation and execution verification, as well as setting multiple objectives during training, enables the model to better understand the data flow and control flow of code. In terms of model architecture, introducing a static code verification module provides real-time feedback on verification results, thereby enhancing the model’s robustness. Additionally, incorporating a code graph module allows the model to construct and utilize graph structure information when generating code, deepening its understanding of patterns and logical relationships in the generated code.

9 Conclusion

We introduce the concept of code hallucination and propose an execution-based verification method to classify code hallucinations. We develop the dynamic detection algorithm, CodeHalu, which categorizes code hallucinations into four main types and eight subtypes, providing a comprehensive understanding of the various challenges faced by LLMs in code generation. Additionally, we establish the CodeHaluEval benchmark and evaluate 17 widely-used LLMs, revealing significant differences in their hallucination patterns during code generation, and providing detailed insights for further improving the code generation capabilities of LLMs. Overall, we lay the theoretical foundation for understanding the hallucination phenomenon of LLMs in code generation, and provide a complete set of tools for detecting and evaluating code hallucinations.

10 Limitations

Python is our focus for exploring code hallucination, as it is the most widely used programming language according to the TIOBE Index. Furthermore, many existing studies, such as HumanEval and MBPP benchmarks, concentrate on Python. Thus, we do not extend our investigation to other languages.

CodeHalu focuses on ensuring the correctness of generated code to meet the needs of developers and users. In contrast, identifying and preventing security risks is a higher-level concern, effectively addressed through sandbox environments.

We focus on code hallucination specifically within the code generation task, excluding other programming tasks such as code translation and code repair. This is because code generation is currently the most widely studied task in the community. It is important to emphasize that our hallucination detection and evaluation methods can be easily adapted to other tasks, which we regard as future work.

References

  • Anthropic (2024) Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program synthesis with large language models. CoRR, abs/2108.07732.
  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Baidu (2023) Baidu. 2023. Introducing Ernie 3.5: Baidu's knowledge-enhanced foundation model takes a giant leap forward.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374.
  • Cheng et al. (2023) Qinyuan Cheng, Tianxiang Sun, Wenwei Zhang, Siyin Wang, Xiangyang Liu, Mozhi Zhang, Junliang He, Mianqiu Huang, Zhangyue Yin, Kai Chen, et al. 2023. Evaluating hallucinations in Chinese large language models. arXiv preprint arXiv:2310.03368.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023).
  • Chu et al. (2024) Zhixuan Chu, Lei Zhang, Yichen Sun, Siqiao Xue, Zhibo Wang, Zhan Qin, and Kui Ren. 2024. Sora Detector: A unified hallucination detection for large text-to-video models. CoRR, abs/2405.04180.
  • Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. CoRR, abs/2309.11495.
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. GLM: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360.
  • Elaraby et al. (2023) Mohamed Elaraby, Mengyin Lu, Jacob Dunn, Xueying Zhang, Yu Wang, and Shizhu Liu. 2023. Halo: Estimation and reduction of hallucinations in open-source weak large language models. CoRR, abs/2308.11764.
  • Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.
  • Gemini (2023) Gemini. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  • Guo et al. (2024) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, et al. 2024. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196.
  • Hendrycks et al. (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with APPS. arXiv preprint arXiv:2105.09938.
  • Huang et al. (2023) Kai Huang, Xiangxin Meng, Jian Zhang, Yang Liu, Wenjie Wang, Shuhao Li, and Yuqing Zhang. 2023. An empirical study on fine-tuning large language models of code for automated program repair. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1162–1174.
  • Huang et al. (2024) Wen Huang, Hongbin Liu, Minxin Guo, and Neil Zhenqiang Gong. 2024. Visual hallucinations of multi-modal large language models. CoRR, abs/2402.14683.
  • Jesse et al. (2023) K. Jesse, T. Ahmed, P. T. Devanbu, and E. Morgan. 2023. Large language models and simple, stupid bugs. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pages 563–575, Los Alamitos, CA, USA. IEEE Computer Society.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
  • Jimenez et al. (2023) Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
  • Li et al. (2024) Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, and Jing Ma. 2024. MMCode: Evaluating multi-modal code large language models with visually rich programming problems. Preprint, arXiv:2404.09486.
  • Li et al. (2023a) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023a. StarCoder: May the source be with you! arXiv preprint arXiv:2305.06161.
  • Li et al. (2023b) Rongao Li, Jie Fu, Bo-Wen Zhang, Tao Huang, Zhihong Sun, Chen Lyu, Guang Liu, Zhi Jin, and Ge Li. 2023b. TACO: Topics in algorithmic code generation dataset. arXiv preprint arXiv:2312.14852.
  • Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
  • Liu et al. (2023) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. 2023. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565.
  • Liu et al. (2024) Hanchao Liu, Wenyuan Xue, Yifei Chen, Dapeng Chen, Xiutian Zhao, Ke Wang, Liping Hou, Rongjun Li, and Wei Peng. 2024. A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253.
  • Luo et al. (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering code large language models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
  • Pan et al. (2023) Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023. Understanding the effectiveness of large language models in code translation. arXiv preprint arXiv:2308.03109.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. 2024. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  • Wang et al. (2024) Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 2024. Where do large language models fail when generating code? arXiv preprint arXiv:2406.08731.
  • Wei et al. (2023) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120.
  • Xu et al. (2020) Hongfei Xu, Josef van Genabith, Deyi Xiong, Qiuhui Liu, and Jingyi Zhang. 2020. Learning source phrase representations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 386–396, Online. Association for Computational Linguistics.
  • Xu et al. (2022) Jin Xu, Xiaojiang Liu, Jianhao Yan, Deng Cai, Huayang Li, and Jian Li. 2022. Learning to break the loop: Analyzing and mitigating repetitions for neural text generation. Advances in Neural Information Processing Systems, 35:3082–3095.
  • Yan and Li (2022) Weixiang Yan and Yuanchun Li. 2022. WhyGen: Explaining ML-powered code generation by referring to training examples. In Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings, pages 237–241.
  • Yan et al. (2023) Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Shuiguang Deng, et al. 2023. CodeScope: An execution-based multilingual multitask multidimensional benchmark for evaluating LLMs on code understanding and generation. arXiv preprint arXiv:2311.08588.
  • Yang et al. (2021) Chen Yang, Yan Liu, and Changqing Yin. 2021. Recent advances in intelligent source code generation: A survey on natural language based studies. Entropy, 23(9).
  • Zhai et al. (2023) Bohan Zhai, Shijia Yang, Xiangchen Zhao, Chenfeng Xu, Sheng Shen, Dongdi Zhao, Kurt Keutzer, Manling Li, Tan Yan, and Xiangjun Fan. 2023. HallE-Switch: Rethinking and controlling object existence hallucinations in large vision language models for detailed caption. arXiv preprint arXiv:2310.01779.
  • Zhang et al. (2024) Kechi Zhang, Ge Li, Huangzhao Zhang, and Zhi Jin. 2024. HiRoPE: Length extrapolation for code models. arXiv preprint arXiv:2403.19115.
  • Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. 2023. Siren's song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
  • Zheng et al. (2023) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al. 2023. CodeGeeX: A pre-trained model for code generation with multilingual evaluations on HumanEval-X. arXiv preprint arXiv:2303.17568.

Appendix A

SyntaxError Analysis

[Figure 6: SyntaxError analysis.]

Related Work

In the field of NLP, hallucination is initially defined as the phenomenon where the text generated by the model is fluent and natural but lacks actual meaning or is inconsistent with the provided source content (Ji et al., 2023). Recently, Zhang et al. (2023) standardize the definition of NL hallucinations in LLMs into three categories: input-conflicting hallucinations, where the content generated by LLMs diverges from the user’s input; context-conflicting hallucinations, in which the generated content contradicts previously generated content; and fact-conflicting hallucinations, where the generated content conflicts with established world knowledge. These hallucinations may be attributed to various factors, such as poor-quality data samples in the training dataset or the use of sampling algorithms with high uncertainty. To mitigate the impact of these hallucinations, recent studies primarily focus on the data preparation phase (Gardent et al., 2017), the training phase (Elaraby et al., 2023), and the inference phase (Dhuliawala et al., 2023) to alleviate the hallucination problem in LLMs.

In the multimodal domain, Zhai et al. (2023) classify types of hallucinations in image-to-text scenarios, such as image captioning and visual question answering. They define three main types of hallucinations: object existence hallucinations, where an object mentioned in the description does not exist in the corresponding image; object attribute hallucinations, where the description of the object's attributes does not match the actual attributes in the image; and object relationship hallucinations, where the relationships between objects are incorrectly described. In text-to-video scenarios, such as video generation, Chu et al. (2024) define three types of hallucinations: prompt consistency hallucinations, where the text description does not match the visual output; static hallucinations, which refer to errors in the spatial relationships, physical properties, and semantic consistency of objects and scenes within a single frame; and dynamic hallucinations, which indicate inconsistencies or abnormalities in the motion and behavior of objects or entities across frames.

CodeHaluEval Power Analysis

To ensure our study has sufficient statistical power to detect hallucinations commonly triggered by different LLMs during code generation tasks, we conduct a power analysis to determine the minimum sample size required. The power analysis is based on the following parameters, which are standard in social and behavioral science research:

  • Power Level: We set the power level at 0.8 to ensure adequate statistical strength for detecting actual effects.

  • Significance Level (Alpha): We set the significance level at 0.05 to balance the study’s sensitivity with the false positive rate.

  • Effect Size: We select a Cohen’s d value of 0.2, a widely accepted standard for small effect sizes, to ensure reliable detection of even relatively minor effects.

Using the statsmodels library, we determine that a minimum of 394 code task samples is required to ensure the statistical reliability of our research findings.
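
As a reference, the following minimal sketch shows how such a sample size can be computed with the statsmodels power module. It assumes an independent two-sample t-test design, which is a common default for a Cohen's d effect size; the exact test family used for the original calculation is not specified in the text, so this is an illustrative reproduction rather than the definitive procedure.

from math import ceil
from statsmodels.stats.power import TTestIndPower

# Solve for the number of samples needed to detect a small effect
# (Cohen's d = 0.2) at alpha = 0.05 with power = 0.8.
analysis = TTestIndPower()
n_required = analysis.solve_power(effect_size=0.2, alpha=0.05,
                                  power=0.8, alternative='two-sided')
print(ceil(n_required))  # rounds up to 394

Rounding the solver's output up to the next integer yields the 394 samples reported above.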

In selecting the $k$-value (defined as the minimum number of models that must exhibit the same hallucination for a task to be classified as prone to triggering hallucinations), we perform the following analysis:

As illustrated in Figure 7, we summarize the number of samples included in CodeHaluEval for different $k$-values. At $k=5$, 699 tasks are identified by at least five models as potentially triggering hallucinations, which far exceeds the minimum sample size of 394 derived from the power analysis. This ensures the reliability and robustness of our statistical analysis.

If the $k$-value were set higher than 5, the number of tasks identified as likely to trigger hallucinations would fall below 394, leading to insufficient statistical power and increasing the risk of unstable results.

In conclusion, we choose $k=5$ as the threshold: it meets the requirements for statistical power while increasing sensitivity to true hallucination scenarios and maintaining a low misidentification rate. This methodological choice provides strong statistical support for our benchmark and ensures the reliability and practicality of our results.

[Figure 7: Number of samples included in CodeHaluEval for different $k$-values.]

Classification of Code Hallucinations

Model                 #2 (↓)   #3 (↓)   Avg. (↓)
LLaMA-2-7B              0.72     0.00      0.36
Gemma-7B                2.15     0.00      1.07
ChatGLM-3-6B            2.15     0.14      1.14
GPT-4                   2.86     0.00      1.43
GPT-3.5                 3.00     0.14      1.57
Claude-3-haiku          3.15     0.00      1.57
DeepSeek Coder-6.7B     3.29     0.00      1.65
MagicCoder-7B           3.29     0.14      1.72
WizardCoder-7B          3.58     0.00      1.79
Gemini-1.0              3.86     0.29      2.07
Code LLaMA-7B           4.43     0.00      2.22
Mistral-7B              4.43     0.14      2.29
Qwen-turbo              4.86     0.00      2.43
StarCoder-16B           5.01     0.29      2.65
Ernie-3.5               5.29     0.43      2.86
CodeGeeX-2-6B          11.44     0.14      5.79
Mean                    3.97     0.11      2.04

Data compliance hallucinations occur when LLMs have a vague understanding of the data types and parameter values of the objects being manipulated, resulting in generated code that attempts to perform type-mismatched or rule-violating operations. This hallucination typically manifests as TypeError, ValueError, or ZeroDivisionError exceptions being thrown during code execution, reflecting the model’s unexpected behavior in data type validation, parameter value handling, and arithmetic operations during code generation.
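
To make the category concrete, the following minimal Python snippet (our own illustration, not drawn from CodeHaluEval) shows how each of the associated exceptions can arise:

# Illustrative only: operations that violate type or value rules.
try:
    "3" + 4                # TypeError: mixing str and int in addition
except TypeError as e:
    print("TypeError:", e)

try:
    int("forty-two")       # ValueError: well-typed argument with an invalid value
except ValueError as e:
    print("ValueError:", e)

try:
    1 / 0                  # ZeroDivisionError: arithmetic with an invalid divisor
except ZeroDivisionError as e:
    print("ZeroDivisionError:", e)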

Structure access hallucinations occur when LLMs misinterpret the data structures of the objects being manipulated, leading to generated code that attempts to access non-existent array indices or dictionary keys. This hallucination typically manifests as IndexError or KeyError exceptions being thrown during code execution, reflecting the model’s lack of clarity in understanding the data structures within the given context.
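
A minimal illustrative snippet (not taken from the benchmark) showing both exception types:

# Illustrative only: accessing structure elements that do not exist.
items = [1, 2, 3]
try:
    items[5]               # IndexError: index outside the list bounds
except IndexError as e:
    print("IndexError:", e)

config = {"mode": "fast"}
try:
    config["depth"]        # KeyError: key that was never inserted
except KeyError as e:
    print("KeyError:", e)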

Identity hallucinations occur when LLMs possess biased memories or lack sufficient understanding of the context. This leads to generated code that references undefined variables, accesses non-existent object properties, or uses unassigned variables in local scopes. This hallucination typically manifests as NameError, AttributeError, or UnboundLocalError exceptions being thrown during code execution, reflecting the model’s difficulty in managing long-distance dependencies and accurately tracking variable definitions and scopes.
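
The following self-contained sketch (illustrative only; the identifiers are hypothetical) triggers each of the three exception types:

# Illustrative only: references to identifiers the context never defines.
try:
    print(total)           # NameError: 'total' is never defined
except NameError as e:
    print("NameError:", e)

try:
    "abc".push("d")        # AttributeError: str objects have no 'push' attribute
except AttributeError as e:
    print("AttributeError:", e)

def bump():
    count += 1             # assignment makes 'count' local, but it is read before assignment
    return count

try:
    bump()
except UnboundLocalError as e:
    print("UnboundLocalError:", e)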

External source hallucinations occur when LLMs exhibit significant memory-related issues or obvious conflicts with facts concerning external knowledge sources, resulting in generated code that attempts to import non-existent modules or fails to correctly load modules from other paths. This hallucination typically manifests as ImportError or ModuleNotFoundError exceptions being thrown during code execution, reflecting the model’s lack of deep understanding and accurate memory of the representation and organization of external knowledge sources.
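
For illustration, both exceptions can be reproduced as follows (the module name nonexistent_package is hypothetical; the math example mirrors the case study in this appendix):

# Illustrative only: importing modules or symbols that do not exist.
try:
    import nonexistent_package          # ModuleNotFoundError: hypothetical module name
except ModuleNotFoundError as e:
    print("ModuleNotFoundError:", e)

try:
    from math import modulo             # ImportError: math provides no 'modulo' symbol
except ImportError as e:
    print("ImportError:", e)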

Physical constraint hallucinations occur when LLMs underestimate resource consumption during data processing operations, causing code failure during execution due to exceeding memory capacity, stack depth, or other physical constraints. This hallucination typically manifests as RecursionError or MemoryError exceptions being thrown during code execution, reflecting the model’s inadequate understanding of the global operating conditions.
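
An illustrative sketch (not from the benchmark); the MemoryError case is only described in a comment, since deliberately exhausting memory is unsafe to run:

# Illustrative only: exceeding interpreter resource limits.
def recurse(n):
    return recurse(n + 1)               # unbounded recursion exhausts the call stack

try:
    recurse(0)
except RecursionError as e:
    print("RecursionError:", e)

# A MemoryError arises analogously when an allocation (e.g., a list sized by a
# huge N) exceeds available memory; it is not triggered here for safety.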

Computational boundary hallucinations occur when LLMs blur recognition of numerical calculation limits and iteration endpoints during data processing operations, causing code failure due to numerical overflow or improper iteration control. This hallucination typically manifests as OverflowError or StopIteration exceptions being thrown during code execution, reflecting the model’s erroneous understanding of numerical boundary conditions and insufficient grasp of the inherent logic of iterators, loops, and control flows.
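
A minimal illustration (our own example, not from the benchmark) of both boundary exceptions:

# Illustrative only: crossing numerical or iteration boundaries.
import math

try:
    math.exp(1000)                      # OverflowError: result exceeds the float range
except OverflowError as e:
    print("OverflowError:", e)

it = iter([1])
next(it)                                # consumes the only element
try:
    next(it)                            # StopIteration: iterator is already exhausted
except StopIteration:
    print("StopIteration: iterator exhausted")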

Logic deviation occurs when LLMs generate code that lacks sufficient logical consideration or contradicts the intended instructions. This hallucination usually does not cause errors during code execution; however, due to logical deviations or confusion, the program’s outcomes fail to meet the expected results. This issue reflects the model’s insufficient handling of logic and inadequate understanding of algorithms during the code generation process.

Logic breakdown occurs when LLMs struggle to interpret or maintain a continuous understanding of context during code generation. This indicates that the models may lose direction while generating code, making it difficult to maintain strict consistency of contextual information. Consequently, the models might generate stuttering, infinite enumeration, or gibberish code that often lacks structural coherence and logical consistency.

Case Study

Table 4-A provides case studies of different models across various types of code hallucinations.

NL-Description
Salem gave you $n$ sticks with integer positive lengths $a_1, a_2, \ldots, a_n$. For every stick, you can change its length to any other positive integer length (that is, either shrink or stretch it). The cost of changing the stick's length from $a$ to $b$ is $|a-b|$, where $|x|$ means the absolute value of $x$. A stick length $a_i$ is called almost good for some integer $t$ if $|a_i - t| \leq 1$. Salem asks you to change the lengths of some sticks (possibly all or none), such that all sticks' lengths are almost good for some positive integer $t$ and the total cost of changing is minimum possible. The value of $t$ is not fixed in advance and you can choose it as any positive integer. As an answer, print the value of $t$ and the minimum cost. If there are multiple optimal choices for $t$, print any of them. Input: The first line contains a single integer $n$ ($1 \leq n \leq 1000$), the number of sticks. The second line contains $n$ integers $a_i$ ($1 \leq a_i \leq 100$), the lengths of the sticks. Output: Print the value of $t$ and the minimum possible cost. If there are multiple optimal choices for $t$, print any of them.

Generated Code
n = int(input())
a = list(map(int, input().split()))
min_cost = float('inf')
t = 0
for i in range(1, 101):
    cost = sum(abs(x - i) - (abs(x - i) > 1) for x in a)
    if cost < min_cost:
        min_cost = cost
        t = i
print(t, min_cost)

Input: 3 \n 10 1 4
Expected Output: 3 7
Executed Result: Logic deviation: no exceptions were thrown at runtime, but logical confusion caused the output to not meet expectations.

Reference Code
n = int(input())
arr = list(map(int, input().split()))
arr.sort()
a = []
for t in range(1, 101):
    tot = 0
    for item in arr:
        if (abs(item - t) >= 1):
            tot += abs(item - t) - 1
    a.append((tot, t))
a.sort()
print(a[0][1], a[0][0])

Input: 3 \n 10 1 4
Expected Output: 3 7
Executed Result: 3 7
NL-Description
A permutation of length $n$ is an array consisting of $n$ distinct integers from $1$ to $n$ in arbitrary order. For example, $[2,3,1,5,4]$ is a permutation, but $[1,2,2]$ is not a permutation ($2$ appears twice in the array) and $[1,3,4]$ is also not a permutation ($n=3$ but there is $4$ in the array). Consider a permutation $p$ of length $n$; we build a graph of size $n$ using it as follows: For every $1 \leq i \leq n$, find the largest $j$ such that $1 \leq j < i$ and $p_j > p_i$, and add an undirected edge between node $i$ and node $j$. For every $1 \leq i \leq n$, find the smallest $j$ such that $i < j \leq n$ and $p_j > p_i$, and add an undirected edge between node $i$ and node $j$. In cases where no such $j$ exists, we make no edges. Also, note that we make edges between the corresponding indices, not the values at those indices. For clarity, consider as an example $n=4$ and $p=[3,1,4,2]$; here, the edges of the graph are $(1,3),(2,1),(2,3),(4,3)$. A permutation $p$ is cyclic if the graph built using $p$ has at least one simple cycle. Given $n$, find the number of cyclic permutations of length $n$. Since the number may be very large, output it modulo $10^9+7$. Please refer to the Notes section for the formal definition of a simple cycle. Input: The first and only line contains a single integer $n$ ($3 \leq n \leq 10^6$). Output: Output a single integer $0 \leq x < 10^9+7$, the number of cyclic permutations of length $n$ modulo $10^9+7$.

Generated Code
def count_cyclic_permutations(n):
    MOD = 10**9 + 7
    powers = [1]
    powers.extend([(p * i) % MOD for i in range(1, n+1)])
    factorial = [1]
    for i in range(1, n+1):
        factorial.append((factorial[i-1] * i) % MOD)
    inverse_factorial = [1]
    inv = powers[MOD-2]
    for i in range(1, n+1):
        inverse_factorial.append((inverse_factorial[i-1] * inv) % MOD)
    result = 0
    for i in range(1, n+1):
        result = (result + (powers[n-1] - powers[n-i-1]) * factorial[i-1] * inverse_factorial[i-1]) % MOD
    return result

# Read input
n = int(input())
# Calculate and print the result
print(count_cyclic_permutations(n))

Input: 4 \n
Expected Output: 16
Executed Result: Identity hallucination: use of undefined variables.

Reference Code
n = int(input())
MOD = 10 ** 9 + 7
ans1 = 1
ans2 = 1
for i in range(1, n+1):
    ans1 *= i
    ans1 %= MOD
for i in range(n-1):
    ans2 *= 2
    ans2 %= MOD
print((ans1 - ans2) % MOD)

Input: 4 \n
Expected Output: 16
Executed Result: 16
NL-Description
Ksenia has ordinary pan scales and several weights of an equal mass. Ksenia has already put some weights on the scales, while other weights are untouched. Ksenia is now wondering whether it is possible to put all the remaining weights on the scales so that the scales are in equilibrium. The scales are in equilibrium if the total sum of weights on the left pan is equal to the total sum of weights on the right pan. Input: The first line has a non-empty sequence of characters describing the scales. In this sequence, an uppercase English letter indicates a weight, and the symbol "|" indicates the delimiter (the character occurs in the sequence exactly once). All weights recorded in the sequence before the delimiter are initially on the left pan of the scale. All weights recorded in the sequence after the delimiter are initially on the right pan of the scale. The second line contains a non-empty sequence of uppercase English letters. Each letter indicates a weight which is not used yet. It is guaranteed that all the English letters in the input data are different and that the input does not contain any extra characters. Output: If you cannot put all the weights on the scales so that the scales are in equilibrium, print the string "Impossible". Otherwise, print the description of the resulting scales, copying the format of the input. If there are multiple answers, print any of them.

Generated Code
def put_weights_on_scales(scales, weights):
    left_weights = []
    right_weights = []
    for weight in weights:
        if weight not in left_weights:
            left_weights.append(weight)
        if weight not in right_weights:
            right_weights.append(weight)
    if sum(left_weights) == sum(right_weights):
        return scales + "|" + "".join(right_weights)
    else:
        return "Impossible"

# Example usage
scales = input().strip()
weights = input().strip()
print(put_weights_on_scales(scales, weights))

Input: AC|T\nL\n
Expected Output: AC|TL\n
Executed Result: Data compliance hallucination: chaotic data types.

Reference Code
scales = input().split('|')
weights = list(input())
balance = len(scales[0]) - len(scales[1])
while weights:
    if len(scales[0]) < len(scales[1]):
        scales[0] += weights[0]
        weights.pop(0)
    else:
        scales[1] += weights[0]
        weights.pop(0)
    balance = len(scales[0]) - len(scales[1])
if balance == 0:
    print('|'.join(scales))
else:
    print('Impossible')

Input: AC|T\nL\n
Expected Output: AC|TL\n
Executed Result: AC|TL\n
NL-Description
Polycarp is a music editor at the radio station. He received a playlist for tomorrow that can be represented as a sequence $a_1, a_2, \ldots, a_n$, where $a_i$ is the band which performs the $i$-th song. Polycarp likes bands with the numbers from $1$ to $m$, but he doesn't really like others. We define $b_j$ as the number of songs the group $j$ is going to perform tomorrow. Polycarp wants to change the playlist in such a way that the minimum among the numbers $b_1, b_2, \ldots, b_m$ will be as large as possible. Find this maximum possible value of the minimum among the $b_j$ ($1 \leq j \leq m$), and the minimum number of changes in the playlist Polycarp needs to make to achieve it. One change in the playlist is a replacement of the performer of the $i$-th song with any other group. Input: The first line of the input contains two integers $n$ and $m$ ($1 \leq m \leq n \leq 2000$). The second line contains $n$ integers $a_1, a_2, \ldots, a_n$ ($1 \leq a_i \leq 10^9$), where $a_i$ is the performer of the $i$-th song. Output: In the first line print two integers: the maximum possible value of the minimum among the $b_j$ ($1 \leq j \leq m$), where $b_j$ is the number of songs in the changed playlist performed by the $j$-th band, and the minimum number of changes in the playlist Polycarp needs to make. In the second line print the changed playlist. If there are multiple answers, print any of them.

Generated Code
def calculate_probability(n, balls):
    balls.sort()
    total_ways = n * (n - 1) * (n - 2) // 6
    favorable_ways = 0
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if balls[i] < balls[j] and balls[i] < balls[k] and balls[j] > balls[k]:
                    favorable_ways += 1
    return favorable_ways / total_ways

n = int(input())
balls = list(map(int, input().split()))
result = calculate_probability(n, balls)
print("{:.10f}".format(result))

Input: 4 4\n1000000000 100 7 1000000000
Expected Output: 1 4\n1 2 3 4
Executed Result: Structure access hallucination: the accessed index exceeds the range of the list.

Reference Code
from collections import Counter

n, m = [int(x) for x in input().split()]
a = [int(x) for x in input().split()]
tgt = n // m
b = Counter(a)
rd = sum(b[x] for x in b if x > m)
r = 0
for i in range(1, m + 1):
    while rd and b[i] < tgt:
        for j in range(n):
            if a[j] > m:
                b[a[j]] -= 1
                b[i] += 1
                a[j] = i
                rd -= 1
                r += 1
                break
    while b[i] < tgt:
        for j in range(n):
            if b[a[j]] > tgt:
                b[a[j]] -= 1
                b[i] += 1
                a[j] = i
                r += 1
                break
print(tgt, r)
print(" ".join(str(x) for x in a))

Input: 4 4\n1000000000 100 7 1000000000
Expected Output: 1 4\n1 2 3 4
Executed Result: 1 4\n1 2 3 4
NL-Description
Salem gave you $n$ sticks with integer positive lengths $a_1, a_2, \ldots, a_n$. For every stick, you can change its length to any other positive integer length (that is, either shrink or stretch it). The cost of changing the stick's length from $a$ to $b$ is $|a-b|$, where $|x|$ means the absolute value of $x$. A stick length $a_i$ is called almost good for some integer $t$ if $|a_i - t| \leq 1$. Salem asks you to change the lengths of some sticks (possibly all or none), such that all sticks' lengths are almost good for some positive integer $t$ and the total cost of changing is minimum possible. The value of $t$ is not fixed in advance and you can choose it as any positive integer. As an answer, print the value of $t$ and the minimum cost. If there are multiple optimal choices for $t$, print any of them. Input: The first line contains a single integer $n$ ($1 \leq n \leq 1000$), the number of sticks. The second line contains $n$ integers $a_i$ ($1 \leq a_i \leq 100$), the lengths of the sticks. Output: Print the value of $t$ and the minimum possible cost. If there are multiple optimal choices for $t$, print any of them.

Generated Code
from math import modulo

def number_of_games(a, b, k, t):
    games = (2 * k + 1) ** t
    return games % (10 ** 9 + 7)

print(number_of_games(1, 2, 2, 1))
print(number_of_games(1, 1, 1, 2))
print(number_of_games(2, 12, 3, 1))

Input: 1 2 2 1
Expected Output: 6
Executed Result: External source hallucination: the referenced component does not exist in the module.

Reference Code
def c(n, k):
    if k > n:
        return 0
    a = b = 1
    for i in range(n - k + 1, n + 1):
        a *= i
    for i in range(1, k + 1):
        b *= i
    return a // b

a, b, k, t = map(int, input().split())
n, m, s = 2 * k + 1, 2 * t, 2 * k * t + b - a
ans, mod = 0, 1000000007
for i in range(m + 1):
    ans = (ans + [1, -1][i & 1] * c(m, i) * c(m + s - n * i, m)) % mod
print((pow(n, m, mod) - ans) % mod)

Input: 1 2 2 1
Expected Output: 6
Executed Result: 6
NL-Description
Reziba has many magic gems. Each magic gem can be split into $M$ normal gems. The amount of space each magic (and normal) gem takes is $1$ unit. A normal gem cannot be split. Reziba wants to choose a set of magic gems and split some of them, so the total space occupied by the resulting set of gems is $N$ units. If a magic gem is chosen and split, it takes $M$ units of space (since it is split into $M$ gems); if a magic gem is not split, it takes $1$ unit. How many different configurations of the resulting set of gems can Reziba have, such that the total amount of space taken is $N$ units? Print the answer modulo $1000000007$ ($10^9+7$). Two configurations are considered different if the number of magic gems Reziba takes to form them differs, or the indices of gems Reziba has to split differ. Input: The input contains a single line consisting of $2$ integers $N$ and $M$ ($1 \leq N \leq 10^{18}$, $2 \leq M \leq 100$). Output: Print one integer, the total number of configurations of the resulting set of gems, given that the total amount of space taken is $N$ units. Print the answer modulo $1000000007$ ($10^9+7$).

Generated Code
def count_configurations(N, M):
    MOD = 1000000007
    dp = [0] * (N + 1)
    dp[0] = 1
    for i in range(1, N + 1):
        for j in range(1, min(i, M) + 1):
            dp[i] = (dp[i] + dp[i - j]) % MOD
    return dp[N]

N, M = map(int, input().split())
print(count_configurations(N, M))

Input: 1000000000000000000 2
Expected Output: 680057396
Executed Result: Physical constraint hallucination: underestimating the resources required to execute the code.

Reference Code
import sys

MOD = 10**9 + 7

def polymod(P, Q):
    assert(Q[-1] == 1)
    n = len(Q)
    while len(P) >= n:
        p = P[-1]
        for i in range(n):
            P[-i-1] -= p*Q[-i-1]
        assert(P[-1] == 0)
        P.pop()
    return P

def polyprod(P, Q):
    W = [0]*(len(P) + len(Q) - 1)
    for i in range(len(P)):
        for j in range(len(Q)):
            W[i+j] += P[i]*Q[j]
    return [w % MOD for w in W]

def power(A, B, m, mult):
    if m == 0:
        return B
    while m > 1:
        if m % 2 == 1:
            B = mult(A, B)
        A = mult(A, A)
        m //= 2
    return mult(A, B)

def calc_nth_term(init, linear_coeff, n):
    def mult(A, B):
        return polymod(polyprod(A, B), linear_coeff)
    ans = power([0, 1], [1], n, mult)
    return sum(ans[i]*init[i] for i in range(len(ans))) % MOD

n, m = map(int, input().split())
linear_rec = [0]*(m+1)
linear_rec[0] = linear_rec[m-1] = -1
linear_rec[m] = 1
print(calc_nth_term([1]*m, linear_rec, n))

Input: 1000000000000000000 2
Expected Output: 680057396
Executed Result: 680057396
NL-Description
Given are two strings $s$ and $t$ consisting of lowercase English letters. Determine if there exists an integer $i$ satisfying the following condition, and find the minimum such $i$ if it exists. Let $s'$ be the concatenation of $10^{100}$ copies of $s$. $t$ is a subsequence of the string $s'_1 s'_2 \ldots s'_i$ (the first $i$ characters in $s'$).

Notes

A subsequence of a string $a$ is a string obtained by deleting zero or more characters from $a$ and concatenating the remaining characters without changing the relative order. For example, the subsequences of contest include net, c, and contest.

Constraints

$1 \leq |s| \leq 10^5$, $1 \leq |t| \leq 10^5$; $s$ and $t$ consist of lowercase English letters.

Input

Input is given from Standard Input in the following format:
s
t

Output

If there exists an integer $i$ satisfying the condition, print the minimum such $i$; otherwise, print $-1$.

Generated Code
import sys

def solve(s, t):
    s = s * (10 ** 100)
    i = 0
    j = 0
    while i < len(s) and j < len(t):
        if s[i] == t[j]:
            j += 1
        i += 1
    if j == len(t):
        return i - len(t)
    else:
        return -1

s = input()
t = input()
print(solve(s, t))

Input: zv\nz
Expected Output: 1
Executed Result: Computational boundary hallucination: uses an integer that exceeds the system's index size limit.

Reference Code
import bisect

def solve(S, T):
    dic = {}
    for i, s in enumerate(S, 1):
        if s not in dic:
            dic[s] = []
        dic[s].append(i)
    prov = 0
    turn = 0
    for t in T:
        if t not in dic:
            return -1
        x = bisect.bisect(dic[t], prov)
        if x == len(dic[t]):
            turn += 1
            prov = dic[t][0]
        else:
            prov = dic[t][x]
    return turn * len(S) + prov

S = input()
T = input()
print(solve(S, T))

Input: zv\nz
Expected Output: 1
Executed Result: 1
