Benchmarking Language Model Creativity:
A Case Study on Code Generation

Yining Lu  Dixuan Wang  Tianjian Li  Dongwei Jiang  Daniel Khashabi
Johns Hopkins University
Abstract

As LLMs become increasingly prevalent, it is natural to ask how “creative” these models can be. In cognitive science, creativity is understood to comprise at least two key characteristics: convergent thinking (purposefulness in achieving a given goal) and divergent thinking (adaptability to new environments or constraints) [44].

In this work, we introduce a framework for quantifying LLM creativity that incorporates both characteristics. The framework has two components: (1) Denial Prompting, which pushes LLMs to come up with more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling them to adopt new strategies; and (2) the NeoGauge metric, which examines both convergent and divergent thinking in the creative responses generated by LLMs.

We apply the proposed framework to Codeforces problems, a natural data source for collecting human coding solutions. We quantify NeoGauge for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NeoCoder dataset for reproducing our results on future models. (Footnote 1: Our code and dataset are available at https://github.com/JHU-CLSP/NeoCoder.)

Creative achievement depended not only on the number of alternatives but also the generation of a single high-quality alternative.

  Mark Runco, Critical Creative Process [44]

1 Introduction

LLMs have shown increasing performance across a wide variety of natural language tasks [13, 26, 33, 36, 39, 40, 42, 58]. However, the degree to which LLMs possess and utilize creativity in solving tasks remains unclear. An automatic method for evaluating LLM creativity could help developers better understand the emergence of model behaviors, serve as a design objective, and pave the path to solving complex real-world problems.

However, despite the importance of machine creativity in establishing artificial general intelligence [9], only a few works have touched upon it [5, 10, 11, 16, 23, 52, 63] because of two major challenges: (1) eliciting diverse and creative generations is difficult [8, 56, 60], and (2) there are no reliable and comprehensive quantitative measurements of LLM creativity. Below, we explicitly illustrate how we address these two challenges.

Figure 1: An overview of how Denial Prompting encourages creative solutions. A solution space is a collection of all possible solutions at a certain state. A, B indicate atomic techniques (e.g., for-loops, if-else, etc.) used in the solution.

LLM generations are often repetitive and regurgitate training data [24, 31, 51, 56, 61], making it hard to elicit creative generations. However, we argue that an effective creativity evaluation should be based on the spectrum of maximally creative responses attainable from LLMs. Therefore, we introduce Denial Prompting (§3.1), a prompting method that iteratively “denies” one of the basic tools, techniques, or strategies used in the previous solution (e.g., A: for-loops and B: if-else in Fig. 1), thereby pushing the LLM to think out of the box and elicit creative generations to the fullest extent.

Another challenge in creativity evaluation is building a reliable and comprehensive quantitative measurement. We propose that such an evaluation should be state-aware (adaptive to different contexts) and human-grounded (comparing LLM solutions with historical human solutions). According to many cognitive studies, human creativity is viewed as taking place in the interaction with a person, environment, or another model [2, 14, 15, 18, 19, 25]. Similarly, the essence of LLM creativity is captured in its interaction with the current state (state-aware) and with past human knowledge (human-grounded). This understanding implies that creativity evaluation should be dynamic, with an individual’s creativity varying across contexts. For example, in Fig. 1, a solution at state $t=0$ would probably not be judged to be as creative as one at state $t=2$, even if they solve the same problem, because the latter is more likely to use novel techniques that humans have rarely thought of, such as C: Recursion, to adapt to increasingly challenging constraints.

To address the second challenge, we propose the NeoGauge score (§4), which involves (1) verifying that the generated solution is correct and follows the given constraints from Denial Prompting (convergent thinking), and (2) checking whether the solution is novel enough that its techniques have never previously been used in human solutions (divergent thinking). This aligns well with the argument made by Runco [44] that creative achievement depends on both the number of alternative solutions and the generation of high-quality alternatives. By considering both convergent [34, 46, 47, 48] and divergent [22, 25, 53] creative thinking, NeoGauge not only offers a state-aware evaluation at state T but also grounds the evaluation in human knowledge by comparing the generated solutions with historical human solutions.

In our experiments, we apply Denial Prompting to Codeforces (https://codeforces.com/problemset), a challenging Text-to-Code task where model solutions can be automatically verified and numerous human solutions exist. Specifically, we retrieve the 199 latest problems from Codeforces along with 30 human solutions per problem that have successfully passed unit tests. We then run Denial Prompting on these problems to obtain our dataset, NeoCoder, which consists of the original questions with sequences of temporally relevant and increasingly difficult constraints. Examples from NeoCoder are provided in Table 6. We benchmark a broad range of LLMs on NeoCoder and calculate their NeoGauge scores. Additionally, we evaluate four reasoning strategies (MCTS [59], self-correction [45], planning [29], and sampling [12]) on our dataset to study the correlation between augmented machine intelligence and creativity. In summary, our contributions are twofold:

  • We introduce Denial Prompting to elicit creative generations from LLMs and the NeoGauge metric to evaluate LLM creativity in line with the two proposed protocols (state-aware and human-grounded).

  • We release a creativity benchmark NeoCoder and provide a thorough analysis of creativity on SOTA language models and reasoning strategies.

2 Background and Related Works

We discuss the existing works on machine creativity evaluation. Then, we explain the concepts of divergent and convergent creativity in cognitive science which our evaluation incorporates.

1 Machine Creativity Evaluation.

Despite extensive studies on human creativity in psychology and cognitive science [3, 20, 22, 35, 44, 50, 53], LLM creativity has received little attention. Existing works on LLM creativity [16, 52, 64] hardly tackle the two challenges above: (1) creative generations are difficult to elicit, and (2) existing evaluation metrics are neither reliable nor comprehensive.

Tian et al. [52] release a challenging real-world problem dataset to push LLM to think out-of-the-box, but they do not provide an automatic creativity evaluation method built upon their dataset. Additionally, their problems are constructed from a single constraint. In contrast, our Denial Prompting is formulated for multiple iterations of constraint detection and problem refinement, making the generations more creative and providing more states for creativity evaluation. Zhu et al. [64] and Xu et al. [56] design protocols to dynamically generate challenging problems with controllable constraints. However, they mainly focus on accuracy evaluation rather than creativity.

Chakrabarty et al. [10], DeLorenzo et al. [16], and Zhao et al. [63] introduce automatic evaluation pipelines to quantify the four subcomponents of creativity proposed in the Torrance Tests of Creative Thinking [53]: fluency, flexibility, originality, and elaboration. However, the test was originally designed to study human divergent creative thinking (§2), and it is unclear whether it applies to machine creativity. Additionally, their evaluation methods hardly meet the two protocols proposed above.

2 Divergent Creative Thinking.

Divergent thinking is a cognitive process that involves exploring a multitude of potential applications for a given set of tools [25]. It typically occurs spontaneously and randomly, leading to numerous possible solutions. Extensive research [3, 22] has been conducted to study divergent creativity, including popular psychometric approaches such as the Unusual Uses Test [22] and Torrance Tests of Creative Thinking [53]. These are designed to let examinees think of as many uses for a (common or unusual) object as possible. The underlying idea of stimulating creative solutions from constrained and unusual settings is also adopted in our Denial Prompting.

Divergent thinking can also be viewed through the lens of $\mathcal{P}$-creativity (Psychological) and $\mathcal{H}$-creativity (Historical) defined by Boden et al. [7]. A valuable idea is $\mathcal{P}$-creative if the person in whose mind it arises could not have come up with it before. Furthermore, a valuable idea is $\mathcal{H}$-creative if it is $\mathcal{P}$-creative and no one else in human history has ever had it before. $\mathcal{P}$-creativity is already built into Denial Prompting: at each state, imposing a new constraint asks the LLM to come up with a brand-new solution that it has not produced before. Therefore, we mainly consider $\mathcal{H}$-creativity in our NeoGauge score, where we compare the model-generated solution with a set of collected human solutions to examine whether it has ever been proposed in human history (i.e., the ratio of the region outside the human solution space in Fig. 1). (Footnote 3: In our experiment, we collect 30 human-annotated solutions for each problem to approximate the historical human solution space, which we consider sufficient given the high overlap among human solutions.) This makes NeoGauge human-grounded and lets it reflect novelty relative to history.

3 Convergent Creative Thinking.

Since the turn of the twenty-first century, more researchers have come to accept the proposition that creative thought involves not merely the generation of many alternative solutions (divergent thinking) but also the identification of new feasible solutions [6, 44]. They frame this problem-solving process as convergent creative thinking and examine how understanding human cognition and convergent thinking might be used to account for creative thought [20, 35, 50]. Several well-known cognitive approaches that study the mental representations and processes underlying convergent creative thinking [34] involve asking examinees to predict future states from past states using incomplete information [46, 47], or to solve problems as though counterfactual premises were true [48, 49]. All these tests share certain characteristics, such as always having a single best answer and asking examinees to think in unconventional ways. In our work, besides computing $\mathcal{H}$-creativity to evaluate divergent thinking, we also measure convergent creativity by verifying the feasibility of the generated solution: whether it is correct and follows the given constraints. Our NeoGauge metric thereby delivers a more comprehensive evaluation of machine creativity.

3 Constructing the NeoCoder Dataset

We present Denial Prompting to stimulate creative responses from LLMs.

3.1 Denial Prompting: Eliciting Creative Generations from LLMs

Figure 2: Example of Denial Prompting (Algorithm 1) for NeoCoder construction. The question comes from our NeoCoder dataset with ID 1898A.
Algorithm 1 Denial Prompting

Input: problem $x$, augmentation model $\mathbf{P}_{\text{LM}}$, max iterations $T$

Output: constraint list $\mathcal{C}_T$

1: for $t=1$ to $T$ do
2:    # Response generation
3:    $y_t \sim \mathbf{P}_{\text{LM}}(x \oplus \tau_1 \oplus \cdots \oplus \tau_{t-1})$
4:    # Technique detection
5:    $\mathcal{T}_t \sim \mathbf{P}_{\text{LM}}(y_t)$
6:    $\tau_t \sim \mathcal{T}_t \setminus \mathcal{C}_{t-1}$
7:    $\mathcal{C}_t = \{\tau_1, \tau_2, \cdots, \tau_t\}$
8: end for

Our purpose is to construct a pipeline that iteratively imposes constraints on previous solutions (e.g., disallowing the use of hashmaps) to force more creative solutions. The setup is as follows: we use a highly capable augmentation model $\mathbf{P}_{\text{LM}}$ (e.g., GPT-4) to generate a solution and identify the “technique(s)” used in it, and then update the problem by imposing a detected technique as a constraint. We repeat this process $t$ times to obtain $t$ consecutive problems with increasingly hard constraints. Fig. 2 shows an example with $t=2$. As shown in Algorithm 1, given a reasoning problem $x$ and an initially empty constraint list $\mathcal{C}_0=\{\}$, we first let the augmentation model $\mathbf{P}_{\text{LM}}$ generate an initial solution $y_1\sim\mathbf{P}_{\text{LM}}(x)$ via a default problem-solving prompt and the conversation history. We then use the same augmentation model, with a technique detection prompt, to detect the atomic techniques (e.g., recursion, for-loops, hashmaps, etc.) $\mathcal{T}_1=\{\tau^1,\tau^2,\cdots,\tau^i\}$ used in $y_1$ to solve $x$. Next, one technique is randomly sampled, $\tau_1\sim\mathcal{T}_1\setminus\mathcal{C}_0$, where excluding $\mathcal{C}_0$ ensures that it has never been used as a constraint before. Finally, we update the problem $x$ to $x\oplus\tau_1$ by forbidding the use of the detected technique $\tau_1$, and update the constraint list $\mathcal{C}_0$ to $\mathcal{C}_1=\{\tau_1\}$. This completes the first iteration of Denial Prompting. We repeat the process to progressively obtain the overall constraint list $\mathcal{C}_t=\{\tau_1,\tau_2,\cdots,\tau_t\}$. The prompts for Denial Prompting (including technique detection; used across all experiments) are in Appendix C.

During Denial Prompting, we use a single conversation thread of $\mathbf{P}_{\text{LM}}$ to infer $y_t$. Namely, we sample $y_t\sim\mathbf{P}_{\text{LM}}(x\oplus\tau_1\oplus\cdots\oplus\tau_{t-1})$ (line 2 in Algorithm 1) for every $t$ within one thread, so that the model can use the trace of previous interactions (including problem statements, constraints, and LLM solutions from each iteration). In practice, we observed that incorporating prior interactions in the context improved model generations. Conversely, when detecting solution techniques $\mathcal{T}_t\sim\mathbf{P}_{\text{LM}}(y_t)$ (line 4 in Algorithm 1), we disregard the context from previous conversation rounds to focus the responses solely on the most recent round.
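The loop in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the released implementation: `generate_solution`, `detect_techniques`, the `DENY_PREFIX` wording, and the toy stubs are all hypothetical stand-ins for the real prompts to the augmentation model. The sketch keeps one shared conversation history for response generation, passes only the latest solution to technique detection, and leaves the constraint list unchanged when no new technique can be denied.

```python
import random
from typing import Callable, List, Set, Tuple

Message = Tuple[str, str]  # (role, content) in one conversation thread
DENY_PREFIX = "Solve the problem again without using: "  # hypothetical prompt wording


def denial_prompting(
    problem: str,
    generate_solution: Callable[[List[Message]], str],
    detect_techniques: Callable[[str], Set[str]],
    max_iterations: int,
    rng: random.Random,
) -> List[str]:
    """Build the constraint list C_T by repeatedly denying one detected technique."""
    constraints: List[str] = []                   # C_0 = {}
    history: List[Message] = [("user", problem)]  # single thread shared by all rounds
    for _ in range(max_iterations):
        # Response generation: the model sees the problem, earlier solutions, and all denials.
        solution = generate_solution(history)
        history.append(("assistant", solution))
        # Technique detection: context-free, only the latest solution is passed in.
        techniques = detect_techniques(solution)
        candidates = sorted(techniques - set(constraints))  # techniques not yet denied
        if not candidates:
            continue  # no new technique to deny, so the constraint list stays as it is
        tau = rng.choice(candidates)
        constraints.append(tau)
        history.append(("user", DENY_PREFIX + tau))
    return constraints


# Toy stand-ins for the model calls (assumptions, for illustration only).
STRATEGIES = [["for loop", "if-else"], ["while loop", "if-else"], ["recursion"]]


def toy_generate(history: List[Message]) -> str:
    """Return the first toy strategy that avoids every denied technique."""
    banned = {content[len(DENY_PREFIX):] for role, content in history
              if role == "user" and content.startswith(DENY_PREFIX)}
    for strategy in STRATEGIES:
        if banned.isdisjoint(strategy):
            return ", ".join(strategy)
    return ", ".join(STRATEGIES[-1])  # nothing left: fall back to the last strategy


def toy_detect(solution: str) -> Set[str]:
    """Toy detector: the 'solution' text is just a comma-separated technique list."""
    return set(solution.split(", "))


constraints = denial_prompting(
    "Sum the digits of n.", toy_generate, toy_detect,
    max_iterations=3, rng=random.Random(0),
)
print(constraints)
```

Passing an explicit `random.Random` instance is a design choice made here so that the sampled constraint sequence is reproducible; the paper does not specify how sampling is seeded.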

3.2 NeoCoder Dataset to Support Benchmarking LLM Creativity

Challenge problems.

To construct our creativity benchmark, we compile the $n=199$ latest Codeforces problems. We chose problems with a difficulty of 800 (the easiest level) because, in our preliminary experiments, we observed near-random performance on more challenging problems when using well-known open-source models. Furthermore, we selected recent problems to prevent memorization during pre-training [27].

Human solutions.

For each problem, we extract $m=30$ correct human solutions (59K human solutions in total). We use the human solutions to measure the $\mathcal{H}$-creativity of LLM responses.

Human annotated test examples.

We also retrieve all test examples provided with each problem (4.5 test examples per problem on average, 2.2K test examples in total). We then perform manual fixes to address any parsing or formatting issues in the collected test examples and ensure that they follow a standardized input-output format. We use these test examples to measure $\mathcal{P}$-creativity or the functional correctness of LLM responses.

State (# of constraints) | 0   | 1   | 2   | 3   | 4   | 5
# of problems            | 199 | 199 | 198 | 194 | 176 | 97
Table 1: Number of instances at each state.
Table 2: Proportion of top 5 most common atomic techniques found in the constraint list at each state and human solutions.

Augmentation with Denial Prompting.

We use GPT-4 [37] as the augmentation model $\mathbf{P}_{\text{LM}}$. (Footnote 4: We use gpt-4-1106-preview across all experiments, accessed from Dec 2023 through April 2024.) We feed the retrieved problems to Denial Prompting with a maximum of $T=5$ iterations to obtain our dataset, NeoCoder. The dataset consists of pairs $(x,\mathcal{C}_t=\{\tau_1,\tau_2,\ldots,\tau_t\})$, where $x$ represents a problem (programming challenge) and $\mathcal{C}_t$ represents the constraints that must be adhered to when solving it. This implies that a single programming problem may be associated with various sets of constraints, forming different pairs accordingly.

Statistics for NeoCoder.

Table 1 shows the number of problems $x$ for each number of associated constraints $|\mathcal{C}_t|$. Note that the number of problems decreases as the number of constraints grows. This occurs because Denial Prompting may reach a point where it fails to generate any new constraint after a certain number of iterations (i.e., $\mathcal{T}_t\setminus\mathcal{C}_{t-1}=\varnothing$ in Algorithm 1). In such a case, we let $\tau_t=\varnothing$ and jump to the next iteration $t+1$ without updating the constraint list, i.e., $\mathcal{C}_t=\mathcal{C}_{t-1}$.

We also compare the distribution of the top 5 most common techniques discovered by Denial Prompting with the human distribution (Table 2). Without any constraints, models tend to use easy and common techniques (e.g., for-loops), as humans do. However, as more constraints are imposed, less common but more sophisticated techniques are employed.

4 State-Aware and Human-Grounded Evaluation of Machine Creativity

Given the NeoCoder dataset, we introduce NeoGauge, our metric of creativity for a given model $\mathbf{G}_{\text{LM}}$. (Footnote 5: We use distinct notations for the testing language model, $\mathbf{G}_{\text{LM}}$, and the augmentation language model, $\mathbf{P}_{\text{LM}}$, to highlight their different roles.) NeoCoder at state $t$ ($t\leq T$) is $\mathcal{D}_t=\{(x^i,\mathcal{C}_t^i=\{\tau^i_1,\tau^i_2,\cdots,\tau^i_t\})\mid i=1,2,\cdots,n\}$, where $i$ is the problem index. To evaluate the creativity of the testing model $\mathbf{G}_{\text{LM}}$ at state $t$, we feed $\mathcal{D}_t$ to $\mathbf{G}_{\text{LM}}$ to obtain its predictions:

$$\mathcal{Y}_t=\{y^i_t\sim\mathbf{G}_{\text{LM}}(x^i\oplus\mathcal{C}^i_t)\mid|\mathcal{C}_t^i|=t,\ \forall(x^i,\mathcal{C}_t^i)\in\mathcal{D}_t\}, \tag{1}$$

where $|\mathcal{C}_t^i|$ denotes the cardinality of the constraint set. This condition ensures that at any given state $t$, the questions we evaluate always have $t$ distinct constraints. Below, we present how we compute convergent and divergent creativity and introduce the NeoGauge metric that unifies them.
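The cardinality condition in Eq. 1 amounts to a simple filter over the dataset, which the following sketch illustrates. The record layout (`problem_id`, `constraints`) and the toy entries are assumptions made for illustration; the released dataset may be organized differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Instance:
    problem_id: str                # hypothetical identifier for problem x^i
    constraints: Tuple[str, ...]   # constraint list for this instance, in order of denial


def instances_at_state(dataset: List[Instance], t: int) -> List[Instance]:
    """Keep only instances whose constraint set has exactly t distinct constraints."""
    return [ex for ex in dataset if len(set(ex.constraints)) == t]


# Toy data: problem "B" ran out of new techniques after one denial,
# so it contributes no instance at state 2.
toy_dataset = [
    Instance("A", ()),
    Instance("A", ("for loop",)),
    Instance("A", ("for loop", "if-else")),
    Instance("B", ()),
    Instance("B", ("while loop",)),
]

state_sizes: Dict[int, int] = {
    t: len(instances_at_state(toy_dataset, t)) for t in range(3)
}
print(state_sizes)
```

In this toy data the number of instances shrinks at the highest state, which is the same pattern reported in Table 1.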

Figure 3: Example of NeoGauge computation. The question comes from our NeoCoder dataset with ID 1829B, and the testing model $\mathbf{G}_{\text{LM}}$ here is GPT-4. For each state, we compute NeoGauge (Eq. 4) as the probability of the LM generating correct solutions that meet the given constraints (convergent creativity, defined in Eq. 2) and also exhibit $\mathcal{H}$-creativity (divergent creativity, defined in Eq. 3). However, none of the three solutions shown is considered “creative”: convergent solutions may lack divergent creativity (e.g., state $t=0$), while hallucinated LM responses may achieve high $\mathcal{H}$-creativity but often lack correctness and constraint following (e.g., state $t=1$). Truly creative work should therefore not only be innovative but also appropriately solve the problem.

Convergent creativity involves problem-solving and constraint following.

To evaluate $\mathbf{G}_{\text{LM}}$’s convergent thinking ability, we examine two characteristics of the generated solutions: whether they are correct and whether they follow the given constraints. Therefore, given $\mathcal{Y}_t$ from Eq. 1, we define its convergent creativity as follows:

$$\textbf{convergent}(\mathbf{G}_{\text{LM}},t)=\frac{1}{|\mathcal{Y}_t|}\sum_{y^i_t\in\mathcal{Y}_t}\mathbbm{1}^{\mathcal{T}^i_t\cap\mathcal{C}^i_t=\varnothing}\times\mathbbm{1}^{\text{Correct}(y_t^i)},\quad\text{where}\;\mathcal{T}^i_t\sim\mathbf{P}_{\text{LM}}(y^i_t). \tag{2}$$

$\mathbbm{1}^{\text{Correct}(y^i_t)}$ is an indicator of program correctness: it is 1 if the generated solution passes all test examples and 0 otherwise. We use the augmentation model $\mathbf{P}_{\text{LM}}$ to detect all atomic techniques $\mathcal{T}^i_t$ used in solution $y^i_t$ and compare them with the given constraint list $\mathcal{C}^i_t$ to check whether the solution follows the given constraints. Among the examples in Fig. 3, only the solution generated at $t=0$ (which does not involve any constraint) exhibits convergent creativity.
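A minimal sketch of Eq. 2 is given below. It assumes that technique detection and test execution have already been run, so each solution arrives as a record of detected techniques, forbidden techniques, and per-test pass flags. The `JudgedSolution` type and the toy values are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class JudgedSolution:
    techniques: FrozenSet[str]      # techniques detected in the model solution
    constraints: FrozenSet[str]     # techniques the solution must avoid
    test_results: Tuple[bool, ...]  # one pass/fail flag per test example


def convergent(judged: List[JudgedSolution]) -> float:
    """Fraction of solutions that follow all constraints and pass all tests."""
    if not judged:
        raise ValueError("need at least one solution")
    hits = 0
    for sol in judged:
        # First indicator: no detected technique appears in the constraint set.
        follows_constraints = sol.techniques.isdisjoint(sol.constraints)
        # Second indicator: every test example passes.
        is_correct = all(sol.test_results)
        hits += int(follows_constraints and is_correct)
    return hits / len(judged)


toy_solutions = [
    # Avoids the denied technique and passes every test: counted.
    JudgedSolution(frozenset({"recursion"}), frozenset({"for loop"}), (True, True)),
    # Passes the tests but uses a denied technique: not counted.
    JudgedSolution(frozenset({"for loop"}), frozenset({"for loop"}), (True, True)),
    # Avoids the denied technique but fails a test: not counted.
    JudgedSolution(frozenset({"recursion"}), frozenset({"for loop"}), (True, False)),
]

convergent_score = convergent(toy_solutions)
print(convergent_score)
```

Because the two indicators are multiplied in Eq. 2, a solution contributes only when both conditions hold, which is why the sketch combines them with a logical `and`.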

Divergent creativity requires comparison to historical human solutions.

As discussed earlier in §2, the primary focus of our evaluation is on $\mathcal{H}$-creativity, which requires comparing model solutions with historical human solutions. Consider a finite set of $m$ correct human-written solutions for problem $x^i$, denoted $\mathcal{H}^i$. Rather than directly comparing solutions using sentence-level similarity scores, as done in a few prior works such as DeLorenzo et al. [16], we break the comparison down to the level of atomic techniques, which is more interpretable and generalizes across varying solutions. Our divergent creativity score is defined as:

$$\begin{aligned}
\textbf{divergent}(\mathbf{G}_{\text{LM}},t)&=\frac{1}{|\mathcal{Y}_t|}\sum_{y^i_t\in\mathcal{Y}_t}\frac{|\mathcal{T}^i_t\setminus\widehat{\mathcal{T}}^i|}{|\mathcal{T}^i_t|},\\
\text{where}\;\mathcal{T}^i_t&\sim\mathbf{P}_{\text{LM}}(y^i_t),\;\widehat{\mathcal{T}}^i=\bigcup_{j=1}^{m}\widehat{\mathcal{T}}^i_j\sim\mathbf{P}_{\text{LM}}(\hat{y}^i_j),\;\hat{y}^i_j\in\mathcal{H}^i.
\end{aligned} \tag{3}$$

Here we first find all atomic techniques $\widehat{\mathcal{T}}^{i}$ used by the $m$ human solutions and the atomic techniques $\mathcal{T}^{i}_{t}$ used in the model solution at state $t$. We then compute the $\mathcal{H}$-creativity as the ratio of techniques used by $\mathbf{G}_{\text{LM}}$ that have never been used in the human solution set. For example, as shown in Fig.3 at state $t=1$, among the three techniques identified within the generated solution, only recursion has never been used by humans, resulting in a ratio of $\frac{1}{3}$. Finally, we average the ratios across problems to obtain the final $\mathcal{H}$-creativity at state $t$.
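As a minimal sketch of Eq.3, assume the atomic techniques have already been extracted into Python sets. The function name `divergent_score`, the handling of empty technique sets, and the two non-recursion technique names in the example are illustrative assumptions, not taken from the released code.

```python
def divergent_score(model_techniques, human_techniques):
    """Average fraction of each solution's techniques unseen in human solutions.

    model_techniques: list of sets, one per problem (T_t^i in Eq. 3).
    human_techniques: list of sets, one per problem, each the union of the
        techniques over all human solutions to that problem (T-hat^i).
    """
    ratios = []
    for used, seen_by_humans in zip(model_techniques, human_techniques):
        if not used:
            # Assumption: Eq. 3 is undefined for an empty technique set,
            # so such a solution is scored 0 here.
            ratios.append(0.0)
            continue
        novel = used - seen_by_humans  # set difference, as in Eq. 3
        ratios.append(len(novel) / len(used))
    return sum(ratios) / len(ratios) if ratios else 0.0


# One problem, three detected techniques, only recursion is new to humans.
model = [{"recursion", "if statement", "sorting"}]
human = [{"if statement", "sorting", "for loop"}]
score = divergent_score(model, human)
print(score)  # one novel technique out of three
```

With a single problem the average equals that problem's ratio, which mirrors the one-in-three example described for Fig.3.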

NeoGauge unifies convergent and divergent creativity.

Given the above definitions, NeoGauge of $\mathbf{G}_{\text{LM}}$ at state $t$ can be formalized as:

$$
\begin{aligned}
\textbf{NeoGauge@t} &= \frac{1}{|\mathcal{Y}_{t}|} \sum_{y^{i}_{t} \in \mathcal{Y}_{t}} \underbrace{\mathbbm{1}^{\mathcal{T}^{i}_{t} \cap \mathcal{C}^{i}_{t} = \varnothing}\, \mathbbm{1}^{\text{Correct}(y_{t}^{i})}}_{\text{Convergent Creativity}} \times \underbrace{\frac{|\mathcal{T}^{i}_{t} \setminus \widehat{\mathcal{T}}^{i}|}{|\mathcal{T}^{i}_{t}|}}_{\text{Divergent Creativity}}, \\
\text{where}\; \mathcal{Y}_{t} &= \{ y^{i}_{t} \sim \mathbf{G}_{\text{LM}}(x^{i} \oplus \mathcal{C}^{i}_{t}) \mid |\mathcal{C}_{t}^{i}| = t,\; \forall (x^{i}, \mathcal{C}_{t}^{i}) \in \mathcal{D}_{t} \}, \\
\mathcal{T}^{i}_{t} &\sim \mathbf{P}_{\text{LM}}(y^{i}_{t}),\; \widehat{\mathcal{T}}^{i} = \bigcup_{j=1}^{m} \widehat{\mathcal{T}}^{i}_{j} \sim \mathbf{P}_{\text{LM}}(\hat{y}^{i}_{j}),\; \hat{y}^{i}_{j} \in \mathcal{H}^{i}.
\end{aligned}
\tag{4}
$$
| Metric | Description | Definition | Place of Use |
|---|---|---|---|
| $\textbf{convergent}(\mathbf{G}_{\text{LM}}, t)$ | Convergent creativity of $\mathbf{G}_{\text{LM}}$ at state $t$ | Eq.2 | Tbl.4, Fig.6, 8 |
| $\textbf{divergent}(\mathbf{G}_{\text{LM}}, t)$ | Divergent creativity of $\mathbf{G}_{\text{LM}}$ at state $t$ | Eq.3 | Tbl.4, Fig.6, 8 |
| NeoGauge@t | Creativity evaluation of $\mathbf{G}_{\text{LM}}$ at state $t$ | Eq.4 | Tbl.4, Fig.4 |
| pass@1 [12] | Probability that the first sample passes the unit tests | $\underset{\text{problems}}{\mathbbm{E}}\big[1-\frac{n-c}{n}\big]$ | Tbl.4 |
| constraint following | Average ratio of following the constraints at state $t$ | $\underset{\text{problems}}{\mathbbm{E}}[\mathbbm{1}^{\tau_{t} \cap \mathcal{C}_{t} = \varnothing}]$ | Tbl.4 |
| convergent(human, $t$) | Convergent creativity of humans at state $t$ | Eq.5 | Fig.6 |
| divergent(human) | Divergent creativity of humans at state $0$ | Eq.6 | Fig.6 |
Table 3: Description of various metrics used across experiments.
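The NeoGauge@t definition in Eq.4 can be sketched in a few lines. This is an illustration under stated assumptions rather than the released implementation: the record layout and the key names (`techniques`, `constraints`, `correct`, `human_techniques`) are hypothetical, and a solution with no detected technique is scored 0.

```python
def neogauge_at_t(records):
    """Average, over problems at state t, of convergent x divergent (Eq. 4).

    records: list of dicts, one per problem, with hypothetical keys
      'techniques'       set of techniques detected in the model solution
      'constraints'      set of techniques banned at this state
      'correct'          bool, whether the solution passes the tests
      'human_techniques' set of techniques seen in any human solution
    """
    total = 0.0
    for r in records:
        used = r["techniques"]
        follows = not (used & r["constraints"])  # no banned technique used
        convergent = 1.0 if (follows and r["correct"]) else 0.0
        if used:
            divergent = len(used - r["human_techniques"]) / len(used)
        else:
            divergent = 0.0  # assumption: empty technique set scores 0
        total += convergent * divergent
    return total / len(records) if records else 0.0


records = [
    {   # follows the constraint, correct, one of two techniques is novel
        "techniques": {"recursion", "sorting"},
        "constraints": {"for loop"},
        "correct": True,
        "human_techniques": {"sorting", "for loop"},
    },
    {   # uses the banned technique, so the convergent indicator is 0
        "techniques": {"for loop"},
        "constraints": {"for loop"},
        "correct": True,
        "human_techniques": {"for loop"},
    },
]
value = neogauge_at_t(records)
print(value)
```

Because the two factors are multiplied per problem before averaging, a solution contributes only if it is correct, follows the constraints, and uses at least one technique absent from the human set.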

5 Experiments and Results

We benchmark language models (§5.2) and evaluate reasoning strategies (§5.3) for creativity.

5.1 Experimental Setup

Models.

We use GPT-4 as the augmentation model $\mathbf{P}_{\text{LM}}$. We benchmark the creativity performance of the following testing models $\mathbf{G}_{\text{LM}}$: GPT-4 [37], GPT-3.5 [38], Claude 3 Sonnet (Claude-3) [4], Llama3-70B [1], Llama2-70B [54], CodeLlama-34B-Python (CodeLlama-34B) [43], CodeGemma-7B [21], and Mistral-7B [28]. We access all non-proprietary models through Huggingface Transformers [55].

Metrics.

Beyond the three proposed metrics for evaluating convergent, divergent, and overall creativity, we also compute pass@1 [12] and the constraint following ratio for further comparison in Table 4. NeoGauge@t can be read as the joint probability of $\mathbf{G}_{\text{LM}}$ being both convergently and divergently creative at state $t$. Therefore, we also report the cumulative NeoGauge across states in Fig.4, which indicates the model’s maximum creativity performance boundary. Additionally, we compute human convergent and divergent creativity in Fig.6 to compare LLM and human creativity performance (details in Appendix A). We summarize all metrics used in Table 3.

5.2 Benchmarking Language Model Creativity

A number of psychological investigators have studied the link between creativity and intelligence [25], agreeing on two key points: (1) creative individuals tend to have higher intelligence [41], and (2) people with extremely high intelligence are not necessarily extremely creative [17]. We re-examine the two findings on LLMs and ask: Are larger LLMs more creative? Do extremely large models of equal size exhibit comparable creativity? Our investigation is based on the widely accepted hypothesis that language model size correlates positively with intelligence [30, 32, 62].

Figure 4: NeoGauge (left) and cumulative NeoGauge (right) across states.

GPT-4 is the most creative LLM thus far.

We visualize NeoGauge and cumulative NeoGauge in Fig.4. GPT-4 has the highest NeoGauge at almost every state $t$. While other models (e.g., Claude-3 and Llama3-70B) have a NeoGauge@0 score close to GPT-4’s, their NeoGauge quickly decreases to 0 within the next 2 states. According to cumulative NeoGauge, GPT-4 also has the highest creativity performance boundary, followed by Claude-3 and Llama3-70B, greatly outperforming smaller models such as GPT-3.5 and Llama2-70B. These observations could potentially answer the above two questions: larger LLMs are generally more creative, but extremely large LLMs do not necessarily exhibit extremely creative performance. In Fig.5, we provide example outputs from each model to show their different creative abilities.

Figure 5: Example model outputs for question 1895B at state $t=5$. Full questions and constraints can be found in Table 6. It is evident that different models have different convergent and divergent creative performances. Specifically, CodeGemma-7B and Mistral-7B fail to generate parsable solutions, and Llama2-70B is seeking more hints from its users.
Figure 6: A comparison of LLM and human creativity. //// denotes the performance difference of convergent creativity, and \\\\ denotes the difference of divergent creativity. Current LLMs still hardly demonstrate human-like creativity.

LLM creativity is still far behind humans.

Fig.6 compares LLM and human creativity. LLMs show slightly better performance in divergent creativity than humans. However, human divergent creativity, as computed from Eq.6, is a constant and only reflects the initial divergent creativity at state $t=0$. The performance gap between them at higher states remains undetermined. Additionally, both human and LLM convergent creativity decline drastically as state $t$ increases, which matches our expectation that there is a trade-off between solution quality and novelty. When humans or LLMs are stress-tested to look for more creative solutions, they are very likely to make mistakes and may copy previous solutions during the process. However, LLM convergent creative performance is much worse than that of humans. Future work should focus on narrowing the gap (//// area) between human and LLM convergent creativity.

| State $t$ | pass@1 | Constraint Following | Convergent Creative | Divergent Creative | NeoGauge |
|---|---|---|---|---|---|
| 0 | 16.1 | 100.0 | 16.2 | 4.5 | 1.0 |
| 1 | 11.6 | 75.4 | 8.1 | 11.9 | 1.4 |
| 2 | 7.1 | 46.0 | 3.6 | 11.5 | 0.9 |
| 3 | 5.2 | 33.0 | 1.6 | 12.4 | 0.5 |
| 4 | 2.3 | 26.1 | 0.0 | 13.2 | 0.0 |
| 5 | 2.1 | 14.4 | 0.0 | 15.3 | 0.0 |
Table 4: GPT-4 creativity evaluation results (in %). Convergent and divergent creativity move in opposite directions, so it is crucial to consider both in evaluation.

In-depth analysis of creativity evaluation.

We provide evaluation results for GPT-4 in Table 4. As the state increases (more hard constraints are imposed), the quality of solutions declines in terms of both correctness and constraint following. Even though the model may still generate new alternative solutions at state 5 (divergent(GPT-4, $5$) $=15.3$), they fail the convergent evaluation (convergent(GPT-4, $5$) $=0$). Therefore, at state 5, GPT-4 shows 0 creativity (NeoGauge@5 $=0$). Additionally, unlike the convergent score, which typically decreases as $t$ increases, the divergent score of GPT-4 generally rises, from 4.5 at $t=0$ to 15.3 at $t=5$. This observation empirically supports the key assumption of Denial Prompting that LLMs tend to seek more creative solutions when facing an unconventional environment characterized by unusual hard constraints.
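Since Eq.4 averages the per-problem product of the convergent indicator and the divergent ratio, NeoGauge need not equal the product of the two averaged columns. The short check below, a sketch that uses only the GPT-4 numbers reported in Table 4 (the variable names are illustrative), makes the distinction concrete.

```python
# Table 4 values for GPT-4, in percent: state -> (convergent, divergent, NeoGauge).
table4 = {
    0: (16.2, 4.5, 1.0),
    1: (8.1, 11.9, 1.4),
    2: (3.6, 11.5, 0.9),
    3: (1.6, 12.4, 0.5),
    4: (0.0, 13.2, 0.0),
    5: (0.0, 15.3, 0.0),
}

# Product of the two averaged scores, converted back to a percentage.
products = {
    state: conv * div / 100.0
    for state, (conv, div, _neo) in table4.items()
}

for state, (_conv, _div, neo) in table4.items():
    print(
        f"t={state}: product of averages = {products[state]:.2f}%, "
        f"reported NeoGauge = {neo:.1f}%"
    )
```

For example, at $t=1$ the product of the averages is about 0.96%, whereas the reported NeoGauge is 1.4%; once the convergent score reaches 0, both quantities vanish.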

5.3 Evaluating Reasoning Strategies for Creativity

We evaluate four reasoning strategies on our NeoCoder dataset to further study the correlation between augmented machine intelligence and creativity: do such intelligence-enhancing techniques also improve creative thinking? We implement the following four works, which are specifically designed for programming tasks and use different reasoning strategies:

  • MCTS: Zhang et al. [59] propose a novel decoding method that uses Monte-Carlo tree search (MCTS) to generate better programs using the pass rate as reward.

  • Self-Correction: Shinn et al. [45] use verbal feedback from a reflection agent to reinforce the performance of an actor agent in code generation.

  • Planning: Jiang et al. [29] design a planning module to let LLM plan out concise solution steps from the intent, followed by an implementation module to generate code step by step.

  • Sampling: Chen et al. [12] generate $k$ samples and compute the probability that at least one of the $k$ generated code samples for a problem passes the unit tests. For creativity evaluation, we generate $k=5$ samples for each problem and report the NeoGauge from the sample that has the highest convergent and divergent creativity, $\mathbbm{1}^{\mathcal{T}^{i}_{t} \cap \mathcal{C}^{i}_{t} = \varnothing} \times \mathbbm{1}^{\text{Correct}(y_{t}^{i})} \times \frac{|\mathcal{T}^{i}_{t} \setminus \widehat{\mathcal{T}}^{i}|}{|\mathcal{T}^{i}_{t}|}$ in Eq.4, among the $k=5$ samples.

Note that these methods are originally applicable to different kinds of models. Considering the computation complexity and the cost, we re-evaluate MCTS on the open-source language model (CodeGemma-7B [21]) and re-evaluate others on the proprietary model (GPT-3.5 [38]).
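For the Sampling strategy above, the following sketch shows the unbiased pass@k estimator of Chen et al. [12], which reduces to the pass@1 expression in Table 3 when $k=1$, together with one plausible reading of the best-of-$k$ selection: keep the largest per-sample NeoGauge term. The helper name `best_of_k` and the example scores are assumptions made for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: n samples drawn, c of them pass the tests."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def best_of_k(sample_terms):
    """Per-problem score under sampling (assumed reading of the text):
    the largest convergent x divergent term among the k samples."""
    return max(sample_terms)


# With k = 1 the estimator is simply the fraction of passing samples, c / n.
p1 = pass_at_k(n=5, c=1, k=1)
p5 = pass_at_k(n=5, c=1, k=5)

# Five samples for one problem; only one is both valid and partly novel.
best = best_of_k([0.0, 0.0, 0.5, 0.0, 0.0])
print(p1, p5, best)
```

Taking the maximum per problem rewards a model if any one of its $k$ samples is creative, which is why this strategy can only match or raise the single-sample score on the same set of samples.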

Figure 7: Creativity performance difference before and after applying reasoning strategies. A larger difference value indicates that the strategy improves the testing model’s creativity. Detailed numeric changes are provided in Table 5.

Most reasoning strategies fail to improve divergent thinking.

According to Fig.7, all reasoning strategies except sampling help to improve the model’s convergent creative thinking at multiple states, as they are fundamentally designed to improve accuracy. Conversely, only MCTS successfully enhances divergent creativity, because it rolls out numerous paths during expansion. Strategies like self-correction, planning, and sampling, which operate on a single trial or path, fail to explore divergent solutions.

There is a tradeoff between divergent and convergent creativity.

Noticeably, while MCTS consistently enhances divergent creative thinking across all states, its improvement on NeoGauge is minimal and becomes 0 after $t=2$. This suggests that divergent solutions generated by MCTS may not truly augment creativity, potentially due to incorrectness or failure to follow the given constraints. This also implies that MCTS might prioritize divergent thinking over convergent thinking. On the other hand, self-correction and planning sacrifice divergent thinking to improve convergent thinking: their divergent creativity difference even becomes negative at certain states (e.g., Divergent Diff $\approx -1.2$ at $t=0,3$ for planning). None of the four reasoning strategies has been able to simultaneously improve both convergent and divergent creativity, resulting in limited improvement of NeoGauge. Thus, our findings indicate that these intelligence-augmenting methods do not provide much benefit to LLM creativity. We leave the discovery of specialized strategies that better enhance LLM creative performance and NeoGauge to future work.

6 Conclusion

We propose two key protocols for evaluating language model creativity and introduce the Denial Prompting framework and NeoGauge metric to provide a one-stop creativity evaluation that follows the protocols. Our NeoGauge stands out for its comprehensive evaluation of both convergent and divergent creativity, aligning with concepts from human creativity studies. To facilitate future research, we release our NeoCoder dataset and shed light on the limitations of current reasoning strategies in improving LLM creativity.

Limitations

Application scope.

While NeoGauge offers a general-purpose framework for evaluation of LLM creativity, our study is restricted to Text-to-Code, as it requires a historical human solution set. For most tasks in the literature, collecting a comprehensive set of distinct human responses is nontrivial.

Data leakage concern.

Our proposed dataset NeoCoder is built using the latest Codeforces problems. Despite their recency, future LLMs might be exposed to these problems during pre-training. To alleviate such risks, future work can focus on more difficult problems or evaluate NeoGauge at higher states, in addition to incorporating a newer batch of problems.

Acknowledgements

This work is in part supported by ONR grant N00014-241-2089, and generous gifts from Amazon and the Allen Institute for AI. We also greatly appreciate the help of the students at CLSP.

References

Supplemental Material

Appendix Contents
Appendix A Additional Details of Experimental Setup
Appendix B Additional Details of Experimental Results
Appendix C Prompts for Denial Prompting and Benchmarking

Appendix A Experiment Setup

A.1 Human Creativity Evaluation

We compute human convergent creativity as follows:

$$
\textbf{convergent}(\text{human}, t) = \frac{1}{m|\mathcal{Y}_{t}|} \sum_{\substack{\iota \in \{i \,\mid\, |\mathcal{C}_{t}^{i}| = t, \\ i = 1, 2, \cdots, n\}}} \sum_{j=1}^{m} \mathbbm{1}^{\widehat{\mathcal{T}}^{\iota}_{j} \cap \mathcal{C}^{\iota}_{t} = \varnothing}, \quad \text{where}\; \widehat{\mathcal{T}}^{\iota}_{j} \sim \mathbf{P}_{\text{LM}}(\hat{y}^{\iota}_{j}),\; \hat{y}^{\iota}_{j} \in \mathcal{H}^{\iota}. \tag{5}
$$

Because the collected historical human solutions $\hat{y}^{\iota}_{j}$ are always correct, human convergent creativity evaluation focuses on the constraint following ratio: we examine whether the atomic techniques $\widehat{\mathcal{T}}_{j}^{\iota}$ used by each human solution follow the given constraints $\mathcal{C}^{\iota}_{t}$ at state $t$. We use the same idea as Eq.3 to compute human divergent creativity:

$$
\begin{aligned}
\textbf{divergent}(\text{human}) &= \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{|\widehat{\mathcal{T}}_{j}^{i} \setminus \widehat{\mathcal{L}}^{i}_{j}|}{|\widehat{\mathcal{T}}_{j}^{i}|}, \\
\text{where}\; \widehat{\mathcal{T}}_{j}^{i} \sim \mathbf{P}_{\text{LM}}(\hat{y}^{i}_{j}),\; \widehat{\mathcal{L}}^{i}_{j} &= \bigcup_{k=1, k \neq j}^{m} \widehat{\mathcal{T}}^{i}_{k} \sim \mathbf{P}_{\text{LM}}(\hat{y}^{i}_{k}),\; \hat{y}^{i}_{j},\; \hat{y}^{i}_{k} \in \mathcal{H}^{i}.
\end{aligned}
\tag{6}
$$

Given a total of $n$ problems, each with $m$ human solutions, we compute the average ratio of new techniques used by a single human solution $\hat{y}_{j}^{i}$ (the $j^{\text{th}}$ human solution for the $i^{\text{th}}$ problem) that the remaining human solutions $\{\hat{y}_{k}^{i} \mid k \neq j,\; k = 1, 2, \cdots, m\}$ have never used. The only difference from Eq.3 is that, because it is not practical to ask human coding experts to perform Denial Prompting on hundreds of programming problems, human divergent creativity in Eq.6 is state-insensitive and is equivalent to divergent(human, $t=0$).
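The two human baselines in Eq.5 and Eq.6 can be sketched as follows, assuming technique sets have already been extracted for every human solution. The function names and the toy technique sets are illustrative assumptions; the sketch also assumes every problem has the same number $m$ of human solutions, so that dividing by the number of (problem, solution) pairs matches the normalizers in the equations.

```python
def human_convergent(problems):
    """Eq. 5 sketch: share of human solutions that avoid the banned techniques.

    problems: list of (human_sets, constraints) pairs for the problems that
    have a constraint set at state t; human_sets is a list of technique sets.
    """
    followed, count = 0, 0
    for human_sets, constraints in problems:
        for techniques in human_sets:
            followed += int(not (techniques & constraints))
            count += 1
    return followed / count if count else 0.0


def human_divergent(problems):
    """Eq. 6 sketch: leave-one-out novelty of each human solution.

    problems: list over problems of lists of technique sets.
    """
    ratios = []
    for human_sets in problems:
        for j, techniques in enumerate(human_sets):
            # Union of techniques from all other solutions (L-hat_j^i).
            others = set().union(
                *(s for k, s in enumerate(human_sets) if k != j)
            )
            if techniques:
                ratios.append(len(techniques - others) / len(techniques))
            else:
                ratios.append(0.0)  # assumption: empty set scores 0
    return sum(ratios) / len(ratios) if ratios else 0.0


solutions = [{"for loop", "sorting"}, {"for loop", "recursion"}, {"for loop"}]
div = human_divergent([solutions])
conv_banned_loop = human_convergent([(solutions, {"for loop"})])
conv_banned_rec = human_convergent([(solutions, {"recursion"})])
print(div, conv_banned_loop, conv_banned_rec)
```

In the toy data every solution uses a for loop, so banning it drives the human convergent score to 0, while banning recursion leaves two of the three solutions valid.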

Appendix B Experiment Results

Figure 8: Stacked results of convergent (Eq.2) and divergent (Eq.3) creativity evaluation across states.
| Strategy | State | $\Delta$Convergent (old→new) | $\Delta$Divergent (old→new) | $\Delta$NeoGauge (old→new) |
|---|---|---|---|---|
| MCTS | 0 | 10.60(1.52→12.12) | 7.79(5.18→12.97) | 0.82(0.00→0.82) |
| | 1 | 3.03(0.00→3.03) | 8.62(6.31→14.93) | 0.08(0.00→0.08) |
| | 2 | 1.02(0.00→1.02) | 8.61(5.55→14.16) | 0.10(0.00→0.10) |
| | 3 | 0.52(0.00→0.52) | 9.11(5.35→14.46) | 0.00(0.00→0.00) |
| | 4 | 0.00(0.00→0.00) | 8.93(4.82→13.75) | 0.00(0.00→0.00) |
| | 5 | 0.00(0.00→0.00) | 9.69(4.23→13.92) | 0.00(0.00→0.00) |
| Self-Correction | 0 | 12.12(2.53→14.65) | -0.37(5.41→5.04) | 0.29(0.25→0.54) |
| | 1 | 2.53(0.51→3.03) | 0.26(4.56→4.82) | 0.29(0.00→0.29) |
| | 2 | 1.02(0.00→1.02) | 0.24(3.79→4.03) | 0.00(0.00→0.00) |
| | 3 | 0.00(0.00→0.00) | -1.62(6.04→4.42) | 0.00(0.00→0.00) |
| | 4 | 0.00(0.00→0.00) | 1.77(3.70→5.47) | 0.00(0.00→0.00) |
| | 5 | 0.00(0.00→0.00) | 0.05(4.93→4.98) | 0.00(0.00→0.00) |
| Planning | 0 | 7.07(2.53→9.60) | -1.16(5.41→4.25) | 0.25(0.25→0.50) |
| | 1 | 2.53(0.50→3.03) | 1.28(4.56→5.84) | 0.25(0.00→0.25) |
| | 2 | 1.02(0.00→1.02) | 2.07(3.78→5.85) | 0.00(0.00→0.00) |
| | 3 | 0.52(0.00→0.52) | -1.24(6.04→4.80) | 0.00(0.00→0.00) |
| | 4 | 0.57(0.00→0.57) | 2.67(3.70→6.37) | 0.00(0.00→0.00) |
| | 5 | 0.00(0.00→0.00) | 1.94(4.93→6.87) | 0.00(0.00→0.00) |
| Sampling | 0 | -0.50(2.52→2.02) | -0.38(5.41→5.03) | 0.00(0.25→0.25) |
| | 1 | -0.51(0.51→0.00) | -0.10(4.56→4.46) | 0.00(0.00→0.00) |
| | 2 | 0.00(0.00→0.00) | 1.05(3.78→4.83) | 0.00(0.00→0.00) |
| | 3 | 0.52(0.00→0.52) | -1.69(6.04→4.35) | 0.00(0.00→0.00) |
| | 4 | 0.00(0.00→0.00) | 0.76(3.70→4.46) | 0.00(0.00→0.00) |
| | 5 | 0.00(0.00→0.00) | -1.86(4.93→3.07) | 0.00(0.00→0.00) |
Table 5: Creativity difference before and after applying reasoning strategies.
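Each Table 5 cell packs three numbers in the form `delta(old→new)`, where the difference is the score after applying the strategy minus the score before. The helper below is a small sketch for reading such cells and checking that the three numbers agree up to rounding; the names `parse_cell` and `is_consistent` and the tolerance are assumptions made for illustration.

```python
import re

# Matches cells such as '10.60(1.52→12.12)' or '-0.37(5.41→5.04)'.
CELL = re.compile(r"(-?\d+\.\d+)\((-?\d+\.\d+)→(-?\d+\.\d+)\)")


def parse_cell(cell):
    """Split a Table 5 cell into (delta, old, new) as floats."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    delta, old, new = (float(group) for group in match.groups())
    return delta, old, new


def is_consistent(cell, tol=0.015):
    """Check that delta equals new - old, allowing for two-decimal rounding."""
    delta, old, new = parse_cell(cell)
    return abs((new - old) - delta) <= tol


mcts_convergent = parse_cell("10.60(1.52→12.12)")
print(mcts_convergent)
print(is_consistent("-0.37(5.41→5.04)"))
```

A tolerance slightly above 0.01 is used because the old and new values are themselves rounded, so their difference can be off by one unit in the last digit (for example `2.53(0.51→3.03)`).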

It is crucial to consider both convergent and divergent thinking in creativity evaluation.

We plot the stacked convergent and divergent creativity evaluation results in Fig.8. Among all models, GPT-4 generally exhibits the best performance on both convergent and divergent creative thinking across states, followed by Claude-3 and Llama3-70B. Notably, Llama3-70B even outperforms GPT-4 on convergent creative thinking when $t=0$ (convergent(GPT-4, $0$) $=16.16$ $<$ convergent(Llama3-70B, $0$) $=19.19$). We hypothesize that the latest Llama3 models are pre-trained on Codeforces problems and human solutions, so they have superior performance when there is no external constraint ($t=0$). However, as $t$ increases, its convergent performance drops drastically. Moreover, divergent creative thinking never goes to 0 at any state and is sometimes even equally distributed on smaller models (e.g., CodeGemma-7B and Mistral-7B). Together with independent findings from Xu et al. [57], this observation indicates that LLMs with insufficient reasoning capabilities tend to make up new solutions regardless of quality when facing unusual problems. This, in turn, demonstrates the importance of the claim we made in Section 1 that creative thinking involves not merely the generation of many diverse alternatives but also the verification of new valid alternatives.

Appendix C Prompts for Denial Prompting and Benchmarking

We apply the same problem-solving prompt in both Denial Prompting and the benchmarking process.

Problem-Solving Prompt for Codeforces: You are a Python code generator, only return the import and python function. Input will be an very detailed description of task, output will be the code. The input will be from command line, and the output will be printed to the console as well. Your result will be solely a function named solve(), and do not call this function in your code. Make sure the code is free of bug and can pass the test cases provided. You can use any library you want. The test cases are provided in the code. Do not call the solve() function in your code.
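The prompt above asks for a single solve() function that reads from standard input and prints to the console, without calling it. A minimal sketch of how such a response could be checked against one test case is given below. The helper names (run_solve, passes) and the whitespace-insensitive comparison are assumptions made for illustration, not the evaluation harness used in this work, and model-generated code should only be executed inside a sandbox.

```python
import contextlib
import io
import sys


def run_solve(generated_code: str, stdin_text: str) -> str:
    """Run model-generated code that defines solve() and return what it prints.

    Hypothetical helper: exec() runs arbitrary code, so use a sandbox in practice.
    """
    namespace: dict = {}
    buffer = io.StringIO()
    old_stdin = sys.stdin
    sys.stdin = io.StringIO(stdin_text)  # feed the test input as standard input
    try:
        with contextlib.redirect_stdout(buffer):
            exec(generated_code, namespace)  # defines solve() but does not call it
            namespace["solve"]()             # the harness makes the call
    finally:
        sys.stdin = old_stdin                # always restore the real stdin
    return buffer.getvalue()


def passes(generated_code: str, stdin_text: str, expected: str) -> bool:
    """Compare outputs token by token, ignoring whitespace differences (an assumption)."""
    return run_solve(generated_code, stdin_text).split() == expected.split()


# Toy stand-in for a model response (not a Codeforces solution).
EXAMPLE_RESPONSE = """
def solve():
    n = int(input())
    print(n * 2)
"""
```

For instance, passes(EXAMPLE_RESPONSE, "21\n", "42") evaluates the toy response on a single input/output pair.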
Technique Detection Prompt: You are a code reviewer. Detect all the programming techniques from the input and return a list of programming techniques. Only select the techniques from this list: [’if statement’, ’for loop’, ’while loop’, ’break statement’, ’continue statement’, ’pass statement’, ’match statement’, ’recursion’, ’stack’, ’queue’, ’tuple’, ’set’, ’dictionary’, ’linked list’, ’tree’, ’graph’, ’graph traversal’, ’two pointers’, ’sliding window’, ’matrix operation’, ’hashmap’, ’depth first search’, ’width first search’, ’back tracking’, ’dived & conquer’, ’Kadanes algorithm’, ’binary search’, ’heap’, ’dynamic programming’, ’greedy algorithm’, ’misc’, ’minimax’, ’topological sort’, ’sorting’, ’graph traversal’]
Your output should look like this:
- technique 1
- technique 2
- technique 3
- ...
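The bulleted output format requested above is easy to parse mechanically. The sketch below shows one possible way to turn such a response into a list of technique names. The function name parse_techniques, the lowercasing, the de-duplication, and the optional allow-list filter are assumptions for illustration and are not taken from the released code.

```python
from typing import List, Optional, Set


def parse_techniques(response: str, allowed: Optional[Set[str]] = None) -> List[str]:
    """Extract technique names from a '- technique' bulleted response.

    Hypothetical helper: names are lowercased and de-duplicated in order of
    first appearance; non-bullet lines and the '...' placeholder are skipped.
    """
    techniques: List[str] = []
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if not line.startswith("-"):
            continue  # ignore any prose surrounding the bullet list
        name = line.lstrip("-").strip().lower()
        if not name or name == "...":
            continue  # skip empty bullets and the placeholder bullet
        if allowed is not None and name not in allowed:
            continue  # drop names outside the allow-list, if one is given
        if name not in techniques:
            techniques.append(name)
    return techniques
```

Passing the technique list from the prompt as the allowed set would discard any label the model invents outside that list.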
State | Constraint | Problem Statement
0 | N/A | B. Points and Minimum Distance You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
1 | for loop | B. Points and Minimum Distance Programming constraints: DO NOT use the following techniques - for loop You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
2 | for loop, if statement | B. Points and Minimum Distance Programming constraints: DO NOT use the following techniques - if statement - for loop You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
3 | for loop, if statement, while loop | B. Points and Minimum Distance Programming constraints: DO NOT use the following techniques - while loop - if statement - for loop You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
4 | for loop, if statement, while loop, sorting | B. Points and Minimum Distance Programming constraints: DO NOT use the following techniques - sorting - while loop - if statement - for loop You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
5 | for loop, if statement, while loop, sorting, tuple | B. Points and Minimum Distance Programming constraints: DO NOT use the following techniques - tuple - sorting - while loop - if statement - for loop You are given a sequence of integers a of length 2n. You have to split these 2n integers into n pairs; each pair will represent the coordinates of a point on a plane. Each number from the sequence a should become the x or y coordinate of exactly one point. Note that some points can be equal. …
Table 6: An example from the NeoCoder dataset with problem ID 1895B and state $t=5$.
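The pattern in Table 6 is that each state adds one denied technique and lists the accumulated constraints, most recent first, between the problem title and its body. A minimal sketch of this construction follows. The function names and the exact line breaks are assumptions, since the table is shown flattened, and the sketch is not claimed to match the released NeoCoder construction code.

```python
from typing import List

# Constraint header wording copied from Table 6.
HEADER = "Programming constraints: DO NOT use the following techniques"


def constrained_statement(title: str, body: str, constraints: List[str]) -> str:
    """Build the problem statement for one state (hypothetical helper).

    `constraints` is in the order the techniques were denied; they are listed
    newest first, mirroring the order shown in Table 6.
    """
    if not constraints:
        return f"{title}\n{body}"  # state 0: the original, unconstrained problem
    bullets = "\n".join(f"- {c}" for c in reversed(constraints))
    return f"{title}\n{HEADER}\n{bullets}\n{body}"


def build_states(title: str, body: str, denied: List[str]) -> List[str]:
    """Return the statements for states 0..len(denied), one per prefix of `denied`."""
    return [constrained_statement(title, body, denied[:t]) for t in range(len(denied) + 1)]


# Denied techniques in the order they appear in Table 6.
states = build_states(
    "B. Points and Minimum Distance",
    "You are given a sequence of integers a of length 2n. ...",
    ["for loop", "if statement", "while loop", "sorting", "tuple"],
)
```

Here states[0] is the unconstrained problem and states[5] carries all five constraints, corresponding to the last row of the table.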