DeLLMa: A Framework for Decision Making Under Uncertainty with Large Language Models

Ollie Liu, Deqing Fu^∗, Dani Yogatama, Willie Neiswanger
Department of Computer Science, University of Southern California
[email protected], {deqingfu, yogatama, neiswang}@usc.edu
^∗Equal contribution.

Abstract

The potential of large language models (LLMs) as decision support tools is increasingly being explored in fields such as business, engineering, and medicine, which often face challenging tasks of decision-making under uncertainty. In this paper, we show that directly prompting LLMs on these types of decision-making problems can yield poor results, especially as the problem complexity increases. To aid in these tasks, we propose DeLLMa (Decision-making Large Language Model assistant), a framework designed to enhance decision-making accuracy in uncertain environments. DeLLMa involves a multi-step scaffolding procedure, drawing upon principles from decision theory and utility theory, to provide a rational and human-auditable decision-making process. We validate our framework on multiple realistic decision-making environments, demonstrating that DeLLMa can consistently enhance the decision-making performance of leading language models, and achieve up to a 40% increase in accuracy over competing methods.

1 Introduction

Large language models (LLMs) are rapidly gaining traction across many domains due to their potential for automating and enhancing a broad spectrum of tasks [7, 9]. One important potential use is in decision making under uncertainty, i.e., providing guidance on what action to take, given some set of possibilities, properly factoring in user goals and uncertainty about the world. The ability to make good decisions under uncertainty holds broad relevance across high-stakes tasks in fields such as business, marketing, medicine, aeronautics, and logistics [19, 31]—and its value is not limited to organization or company-level decisions but extends to aiding individuals in making informed choices as well. The ability of LLMs to analyze large quantities of data makes them potentially well-suited for sophisticated decision support tools, and ensuring that these models give accurate, context-aware recommendations could significantly augment human decision-making capabilities.

However, optimal decision making under uncertainty is often challenging. For humans, there exist frameworks from decision theory and utility theory (developed in fields such as economics, statistics, and philosophy) to provide a structured approach for more-optimal decision-making [42, 26, 6]. Research has consistently demonstrated that without these frameworks, human decision-making can often be highly irrational, swayed by biases and incomplete information [4]. Similarly, making optimal decisions with LLMs faces its own set of challenges. Issues include the tendency to fixate on specific explanations or information without adequately balancing all evidence, and inability to effectively handle uncertainty, manage biases, or align with a user’s goals and utilities [14, 5]. Our paper presents experiments that exemplify these issues.

Furthermore, beyond merely making rational decisions, it is crucial to understand why an LLM made a particular decision. This understanding aids in building trust in the decision, assessing its optimality, and improving any components that may lead to suboptimal outcomes. The ability to explain decisions and verify decision-making quality—which we refer to as human auditability—is essential for the practical application of LLMs to aid decision making in many real problems [40].

In this paper, our goal is to develop a framework that enables LLMs to make better decisions under uncertainty. Our aim is not only to enhance decision-making accuracy but also to allow human users to understand the rationale behind each decision and evaluate its optimality. Drawing inspiration from prior work on multi-step scaffolding like Chain-of-Thought (CoT) [44] and Tree-of-Thoughts (ToT) [48], we provide a scaffold for LLMs, which follows the framework of classical decision theory, originally designed for rational decision making under uncertainty by humans. Our approach involves three key steps: first, identify and forecast pertinent unknown variables given in-context information; second, elicit a utility function that aligns with the user’s goals; and finally, use this utility function to identify the decision that maximizes expected utility. We call our proposed framework DeLLMa, short for Decision-making Large Language Model assistant.

We demonstrate DeLLMa on realistic decision-making scenarios in agriculture and finance, and compare it against existing strategies for LLM decision-making, including zero-shot (direct) prompting, self-consistency [43], and CoT approaches. Our findings indicate that DeLLMa significantly enhances decision-making accuracy, with improvements of up to 40% in prediction accuracy in both scenarios, particularly as the complexity and number of potential actions increase. Additionally, we find that DeLLMa consistently enhances performance across a variety of leading language models, and that its structure allows us to understand the rationale behind each decision. In full, our contributions are:

• We introduce DeLLMa, a framework for human-auditable LLM-based decision making under uncertainty, employing a multi-step scaffolding procedure based on classical decision theory.

• We detail and implement a specific version of this framework, tailored to address a subset of decision problems, which is compatible with current LLMs.

• On realistic decision-making environments, we show that DeLLMa gives up to a 40% improvement in decision-making accuracy over competing methods, and yields consistent improvements when deployed across multiple leading LLMs.

2 Related Work

Decision Making with LLMs.

Initiated by Chain-of-Thought (CoT) prompting [44], a number of recent works have developed methods to decompose multi-step reasoning problems into modular sub-problems. For example, Tree-of-Thought (ToT) prompting [48] generalizes CoT with a tree-search procedure to optimize a reasoning path subject to external feedback. Subsequent works aim to improve ToT with better search algorithms, self-induced feedback, and tool usage [17, 32, 50, 49]. However, these works do not focus on decision-making under uncertainty, which we demonstrate is challenging even for carefully-crafted chains that emulate classical decision-making procedures.

Another line of work leverages LLMs for optimizing blackbox functions [47, 30, 38]. These settings involve methods that make a substantial number of low-cost decisions (which do not incur a high price for suboptimality). Instead, we focus on single-step expensive decisions, particularly in the presence of uncertainty, with a focus on the optimality of decisions. Additionally, a number of applied domains that involve decision making have recently started to explore LLM-based methods, such as in supply chain optimization [23], medicine and health [5], and automated driving [28].

Uncertainty in LLMs.

LLMs, without proper calibration, can be overly confident in their responses [39]. Such overconfidence makes them unlikely to make reliable decisions under uncertainty. Prior work has aimed to solve this issue; one line of research involves asking LLMs for their own confidence, with or without additional finetuning [18, 24, 29, 10, 41, 46]. Many recent advances [25, 13, 11, 45] adopt a Bayesian inference framework to quantify and reason with uncertainties in LLMs (see [3] for a detailed survey). Other works have shown that tool usage [35], retrieval augmentation [16], and model ensembling [37] can improve calibration and forecasting capabilities of LLMs. Our framework can take advantage of these advances in LLM-based forecasting for improved decision making.

3 Methods

Preliminaries.

Suppose that a decision maker needs to make a choice between a set of options to achieve some goal—i.e., has a decision problem. We begin by formalizing such a decision problem, and afterwards describe how we approach decision making with LLMs. There are three main components to the decision problems that we will describe: actions, states, and utilities.

First, the actions are the possible options that a decision maker wishes to choose between. We use $\mathcal{A}$ to denote the space of actions, and $a\in\mathcal{A}$ for a single action. Second, the set of unknown states of nature is denoted $\Theta$. In our formulation, we define a state $\theta\in\Theta$ to be any latent variable whose true value is unknown, yet affects outcomes relevant to the decision maker’s goals. To perform optimal decision making, one must act while accounting for uncertainty over these unknown states.

The third component involves the decision maker’s preferences for different possible outcomes. We formalize our framework for decision making under uncertainty using utility theory, which can be viewed as “modeling the preferences of an agent as a real-valued function over uncertain outcomes” [20, 36, 15]. A key element of decision theory is the utility function (in some formulations, this is instead given in terms of a loss function $L$ ). The utility function, denoted $U:\Theta\times\mathcal{A}\rightarrow\mathbb{R}$ , assigns a scalar value to any state and action $(\theta,a)\in\Theta\times\mathcal{A}$ . Intuitively, a higher utility means that the state-action pair yields a more-preferable outcome for the user.

The goal of the decision maker will be to choose a final decision $a^{*}\in\mathcal{A}$ , which yields the highest possible utility, while accounting for uncertainty in the unknown states $\theta$ .

Figure 1: Given a decision problem as a prompt, DeLLMa (decision-making LLM assistant) computes and maximizes the expected utility to carry out decision making under uncertainty. We illustrate the key steps of DeLLMa on decision-making tasks in agriculture planning (top) and finance (bottom).

Decision Making with LLMs: Setup and Current Approaches.

We first describe the setting in which we intend our framework to operate. Suppose a human wishes to use this LLM assistant to help make a decision. They begin by describing a decision problem via a user prompt $\mathcal{P}$ . We formalize a user prompt as a triplet $\mathcal{P}=(\mathcal{G},\mathcal{A},\mathcal{C})$ , which includes: a natural language description of the user’s goal $\mathcal{G}$ , a list of $n$ actions $\mathcal{A}=(a_{1},\ldots,a_{n})$ , and a passage of contextual information $\mathcal{C}$ , which might be, e.g., pages from a report, or a text-based representation of historical data.

Referring to the agriculture planning decision problem in Figure 1 as a running example, the goal $\mathcal{G}$ is for a farmer to maximize their revenue in the forthcoming year; the action set $\mathcal{A}$ lists the possible produce the farmer is considering planting (e.g., apples, avocados, pears); and the context $\mathcal{C}$ consists of historical summaries of agricultural yields or information about the climate around the farm.

It is tempting to delegate such decision-making to LLMs with direct prompting. However, we observe that responses from conventional approaches, such as Self-Consistency and CoT, do not adequately balance available evidence, handle uncertain information, or align with user preferences; we show in Section 4 that these methods perform poorly, especially with an increasing number of actions.

DeLLMa: Decision-Making LLM Assistant.

To help encourage improved decisions under uncertainty, we propose a framework that guides an LLM to follow the scaffolding of classical decision theory. By restricting LLMs to this scaffold we can also explicitly see components of the decision-making process—e.g., predictions of unknown states and utility function values—which provides human-auditability, allowing a user to identify why a given decision was made by the model.

In our initial formalization of this framework, we restrict ourselves to a slightly curtailed class of problems, and thus make a few simplifying assumptions. For example, we have assumed above that there is a discrete, enumerable set of $n$ possible actions, i.e., $\mathcal{A}=(a_{1},\ldots,a_{n})$. We will also assume there is a discrete set of $m$ possible states, $\Theta=(\theta_{1},\ldots,\theta_{m})$, though $m$ may be quite large.

Our framework will use an LLM to produce a belief distribution over the unknown states, given the input context $\mathcal{C}$ . We view this as a posterior belief distribution over the states, which we denote by $\pi(\theta\mid\mathcal{C})$ . Implicitly, we are assuming that the LLM implies a prior belief distribution $\pi(\theta)$ , given only the model weights or training data.

Our framework will also elicit a utility function, based in part on the description of the user’s goals $\mathcal{G}\in\mathcal{P}$ . This utility function assigns a scalar value to any state-action pair $(\theta,a)$ . We denote this utility function as $U(\theta,a)$ . Given these, the expected utility under our LLM of taking an action $a$ , given some additional context $\mathcal{C}$ , can be written

$$U_{\mathcal{C}}(a)=\mathbb{E}_{\pi(\theta\mid\mathcal{C})}\left[U(\theta,a)\right]=\sum_{\theta\in\Theta}\pi(\theta\mid\mathcal{C})\,U(\theta,a). \tag{1}$$

Then, following the expected utility principle for rational decision making [27, 31], we select the Bayes-optimal decision $a^{*}$ , which maximizes the expected utility, and can be written

$$a^{*}=\operatorname{arg\,max}_{a\in\mathcal{A}}U_{\mathcal{C}}(a). \tag{2}$$
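As a concrete illustration of Equations (1) and (2), the following minimal Python sketch computes the expected utility of each action over a small discrete state space and returns the maximizer. The states, belief probabilities, and utility values are toy numbers chosen purely for illustration; they are not taken from our experiments.

```python
# Toy illustration of Eqs. (1)-(2); all numbers below are hypothetical.
belief = {"drought": 0.2, "normal": 0.5, "wet": 0.3}  # pi(theta | C)

# U(theta, a): utility of each action under each state.
utility = {
    "apple": {"drought": 1.0, "normal": 4.0, "wet": 3.0},
    "avocado": {"drought": 0.0, "normal": 5.0, "wet": 6.0},
}


def expected_utility(action):
    """U_C(a) = sum over theta of pi(theta | C) * U(theta, a)."""
    return sum(p * utility[action][theta] for theta, p in belief.items())


def best_action(actions):
    """a* = argmax over a of U_C(a)."""
    return max(actions, key=expected_utility)


scores = {a: expected_utility(a) for a in utility}
a_star = best_action(list(utility))
```

With these toy numbers, the expected utilities are 3.1 for apple and 4.3 for avocado, so the selected action is avocado.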

We call our framework DeLLMa, short for Decision-making Large Language Model assistant. DeLLMa carries out this sequence of four steps—state enumeration, state forecasting, utility elicitation, and expected utility maximization. A full description of DeLLMa is shown in the box below. In the following sections we give details on our specific implementation of each of these four steps.


3.1 State Enumeration

Our goal is to present a simple implementation for each step, which performs well empirically, as an initial demonstration of the DeLLMa framework; however, each component in the framework could be extended or made more sophisticated in the future. We first describe the strategy that we adopt for enumerating a space of relevant latent states $\Theta=(\theta_{1},\ldots,\theta_{m})$ . Given $\mathcal{P}$ as context, we prompt an LLM to identify $k$ latent factors that are predicted to influence the user’s goal $\mathcal{G}$ (see §C.2 for details on this prompt). Each latent factor is a string (a word or phrase), which can be viewed as describing a dimension of our state space $\Theta$ . We denote these $k$ latent factors as $(f_{1},\ldots,f_{k})$ .

For each latent factor, we prompt our LLM to generate $\ell$ plausible values of the latent factor (empirically we find that it is sufficient to set $\ell$ to be a small number, such as 3). For a latent factor $f_{j}$ , we denote its plausible values as $\tilde{f}_{j}^{1:\ell}$ . Each of these plausible values is also a string (a word or phrase). This process discretizes the state space, where each of the $k$ dimensions has $\ell$ bins. We find that this strategy, while simple, yields a straightforward method for forecasting states (described in §3.2), which performs well in practice.

A single state $\theta_{j}$ in this state space consists of one plausible value from each of the $k$ latent factors, which we can denote by $\theta_{j}=\theta_{j}^{1:k}\in\Theta$ . In total, this produces a discretized state space of size $|\Theta|=m=\ell^{k}$ . While this state space is too large to enumerate explicitly, we develop a procedure to forecast probabilities for these states in a scalable manner.
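To make the size of the discretized state space concrete, the sketch below enumerates all states for a small example with $k=3$ latent factors and $\ell=3$ plausible values each. The factor names and values are hypothetical placeholders; in DeLLMa they would be produced by prompting the LLM.

```python
from itertools import product

# Hypothetical latent factors and plausible values (k = 3, ell = 3).
factors = {
    "climate": ["drought", "normal", "wet"],
    "demand": ["falling", "stable", "rising"],
    "labor cost": ["low", "medium", "high"],
}


def enumerate_states(factors):
    """Yield every state: one plausible value per latent factor."""
    names = list(factors)
    for values in product(*(factors[name] for name in names)):
        yield dict(zip(names, values))


states = list(enumerate_states(factors))
num_states = len(states)  # equals ell ** k
```

Here the enumeration yields $3^{3}=27$ states. With $k=10$ factors the same grid would already contain $3^{10}=59049$ states, which is why the following steps rely on sampling rather than explicit enumeration.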

3.2 State Forecasting

In the next step of DeLLMa, we form a probabilistic forecast of the unknown states, given information contained in the context $\mathcal{C}\in\mathcal{P}$. We must do this in a way that allows us to compute expected utilities, given the size of the sample space. Our strategy will be to define a joint distribution over the state space, which we can then sample from to form a Monte Carlo estimate of the expected utility. Note that there exist multiple recent LLM forecasting methods that could be used in this step (described in §2). Here, we propose and implement a less-complex strategy as an effective proof of concept. However, in the future, this step might be replaced with more advanced forecasting methods, potentially leveraging search and retrieval of information.

For each of the $k$ latent factors, and each of their $\ell$ possible values $\tilde{f}_{j}^{1:\ell}$ , we prompt our LLM to assign a verbalized probability score $\in$ $\{$ very likely, likely, somewhat likely, somewhat unlikely, unlikely, very unlikely $\}$ . In total, we must assign $k\times\ell$ scores. We provide all prompts for this probability score procedure in Section C.3. We then define a dictionary $\mathcal{V}$ that maps each verbalized probability score to a numerical value. Similar strategies converting from verbalized to numeric scores have been used with success in prior work [46, 41]. After normalization, this yields a distribution over the state space $\Theta$ , assuming independence between the $k$ latent factors, which we posit for computational simplicity. We sample states from this joint distribution by iterating through each of the latent factors, sampling according to its approximate marginal probability, and concatenating the samples.

The full procedure is shown in Algorithm 1. Here, the $\operatorname{Normalize}$ function simply scales the weights instantiated in the marginal distribution to a well-defined probability mass function (PMF). We consider the sampled states to be from an LLM-defined proposal distribution $\pi^{\text{LLM}}(\theta\mid\mathcal{C})$ , returned as output from Algorithm 1, which approximates the posterior belief distribution $\pi(\theta\mid\mathcal{C})$ .

Algorithm 1 StateForecast

Input: LLM $\mathcal{M}$, user prompt $\mathcal{P}=(\mathcal{G},\mathcal{A},\mathcal{C})$, plausibility score mapping $\mathcal{V}$, latent factors $\{f_{1},\cdots,f_{k}\}$, and plausible values $\{\tilde{f}_{1}^{1:\ell},\cdots,\tilde{f}_{k}^{1:\ell}\}$
for $i=1$ to $k$ do
    $\pi_{i}(\cdot\mid\mathcal{C})\leftarrow\{\}$
    # Get verbalized probability scores
    $[v_{1},\cdots,v_{\ell}]\leftarrow\mathcal{M}(\mathcal{P},f_{i},\tilde{f}_{i}^{1:\ell})$
    for $j=1$ to $\ell$ do
        $\pi_{i}(\tilde{f}_{i}^{j}\mid\mathcal{C})\leftarrow\mathcal{V}[v_{j}]$
    end for
    $\pi_{i}(\cdot\mid\mathcal{C})\leftarrow\operatorname{Normalize}(\pi_{i}(\cdot\mid\mathcal{C}))$
end for
return $\pi^{\text{LLM}}(f_{1},\cdots,f_{k}\mid\mathcal{C})\coloneqq\prod_{i=1}^{k}\pi_{i}(\cdot\mid\mathcal{C})$
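The StateForecast procedure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the numeric values in the mapping `V` are illustrative (the text above does not fix them), and the verbalized scores are supplied directly as a hypothetical dictionary in place of an actual LLM call.

```python
import random

# Assumed mapping V from verbalized scores to weights (illustrative values).
V = {
    "very likely": 6, "likely": 5, "somewhat likely": 4,
    "somewhat unlikely": 3, "unlikely": 2, "very unlikely": 1,
}


def normalize(weights):
    """Scale nonnegative weights so that they sum to one."""
    total = sum(weights.values())
    return {value: w / total for value, w in weights.items()}


def state_forecast(verbal_scores):
    """Build one normalized marginal per latent factor.

    verbal_scores maps factor -> {plausible value -> verbalized score};
    in DeLLMa these scores would come from prompting the LLM.
    """
    return {
        factor: normalize({val: V[score] for val, score in scores.items()})
        for factor, scores in verbal_scores.items()
    }


def sample_state(marginals, rng):
    """Sample each factor independently and combine into one state."""
    return {
        factor: rng.choices(list(pmf), weights=list(pmf.values()), k=1)[0]
        for factor, pmf in marginals.items()
    }


# Hypothetical LLM output for two latent factors.
verbal_scores = {
    "climate": {"drought": "unlikely", "normal": "likely",
                "wet": "somewhat unlikely"},
    "demand": {"falling": "very unlikely", "stable": "somewhat likely",
               "rising": "likely"},
}
marginals = state_forecast(verbal_scores)
state = sample_state(marginals, random.Random(0))
```

Because the factors are treated as independent, the joint distribution never needs to be materialized: drawing one value per factor from its marginal is equivalent to drawing a state from the product distribution.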

Algorithm 2 UtilityElicitation

Input: LLM $\mathcal{M}$, user prompt $\mathcal{P}=(\mathcal{G},\mathcal{A},\mathcal{C})$, proposal distribution $\pi^{\text{LLM}}(\theta\mid\mathcal{C})$, sample size $s$, minibatch size $b$, and overlap proportion $q$
# Sample fixed states for each action
$S_{A}\leftarrow\mathcal{A}\times\{\theta_{i}\mid\theta_{i}\sim\pi^{\text{LLM}},1\leq i\leq\lfloor s/|\mathcal{A}|\rfloor\}$
$S_{A}\leftarrow\operatorname{Shuffle}(S_{A})$
$\Omega\leftarrow\{\}$
# Pairwise comparisons
for $i=1$ to $s$ with step $\lfloor b\times(1-q)\rfloor$ do
    # Rank the minibatch
    $\mathcal{R}\leftarrow\mathcal{M}\left(\mathcal{P},(\theta_{i},a_{i}),\cdots,(\theta_{i+b},a_{i+b})\right)$
    # Format into comparison & update
    $\Omega\leftarrow\Omega\cup\operatorname{FormatRank}(\mathcal{R})$
end for
return $U(\cdot,\cdot)\coloneqq\operatorname{BradleyTerry}(\Omega)\in\mathbb{R}^{s}$

3.3 Utility Function Elicitation

Next, we need a method to elicit (which is to say: construct) a utility function $U:\Theta\times\mathcal{A}\rightarrow\mathbb{R}$, which maps a state-action pair to a real value. An accurate utility function, which balances the preferences of a human user with respect to the goal that they describe, is difficult to define directly in a general-purpose manner. There is a long history of work on utility elicitation methods [12], which aim to construct a utility function from, e.g., pairwise preference data. Here, we combine these methods with large language models to automatically elicit a utility function.

We conduct the following procedure. We first sample states from the forecast state distribution $\pi^{\text{LLM}}(\theta\mid\mathcal{C})$, and from these form a set of state-action pairs. We then group these pairs into minibatches, and prompt our LLM to rank the elements of each minibatch, given the stated user’s goal $\mathcal{G}\in\mathcal{P}$. This LLM-based ranking of items, where each item consists of an action and a particular instantiation of states, is a procedure that can be broadly applied, and LLMs have a history of being successfully used for similar comparisons in prior works [21, 33]. Based on these minibatch rankings, we are able to extract pairwise preferences, which we can use in classic utility elicitation algorithms.

We show this procedure in Algorithm 2 and discuss two implementations of the $\operatorname{FormatRank}$ step: $\operatorname{Rank2Pairs}$ and $\operatorname{One-vs-All}$. Denoting $(\theta,a)_{(i)}$ as the $i$-th preferred state-action pair of the minibatch, $\operatorname{Rank2Pairs}$ converts a ranking of decreasing preference $\mathcal{R}=\left((\theta,a)_{(1)},\cdots,(\theta,a)_{(b)}\right)$ to a list of pairwise comparisons by adding $(\theta,a)_{(i)}\succ(\theta,a)_{(j)}$ whenever $i<j$. In contrast, $\operatorname{One-vs-All}$ assumes that the LLM is indifferent towards all but the top-ranked state-action pair, i.e., $\left\{(\theta,a)_{(1)}\succ(\theta,a)_{(i)}\mid\forall\,2\leq i\leq b\right\}$. This implementation may be desirable when accurately comparing certain suboptimal state-action pairs is challenging. We then make use of these preferences as training data for a Bradley-Terry model [8] to elicit an approximate utility function $U:\Theta\times\mathcal{A}\rightarrow\mathbb{R}$ with respect to the sampled state-action pairs. Finally, we find that two additional ingredients help improve the accuracy and computational efficiency of utility elicitation: batching and variance reduction.
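The two $\operatorname{FormatRank}$ variants and the Bradley-Terry fit can be sketched as follows. This is a minimal sketch rather than a definitive implementation: the fitting routine (gradient ascent on an L2-regularized log-likelihood) and its hyperparameters are assumptions made for illustration, and state-action pairs are represented by hypothetical string labels.

```python
import math


def rank2pairs(ranking):
    """All (winner, loser) pairs implied by a ranking of decreasing preference."""
    return [(ranking[i], ranking[j])
            for i in range(len(ranking))
            for j in range(i + 1, len(ranking))]


def one_vs_all(ranking):
    """Only the top-ranked item is compared against every other item."""
    return [(ranking[0], other) for other in ranking[1:]]


def bradley_terry(pairs, steps=2000, lr=0.05, l2=0.01):
    """Fit utilities u so that P(w beats l) = sigmoid(u[w] - u[l]).

    Plain gradient ascent on the L2-regularized log-likelihood; the
    regularizer keeps utilities finite for items that never lose.
    """
    items = sorted({x for pair in pairs for x in pair})
    u = {x: 0.0 for x in items}
    for _ in range(steps):
        grad = {x: -l2 * u[x] for x in items}
        for w, l in pairs:
            # 1 - sigmoid(u_w - u_l): derivative of the pair's log-likelihood
            p_loss = 1.0 / (1.0 + math.exp(u[w] - u[l]))
            grad[w] += p_loss
            grad[l] -= p_loss
        for x in items:
            u[x] += lr * grad[x]
    return u


# Hypothetical ranking of four state-action pairs, most preferred first.
ranking = ["s1/avocado", "s1/apple", "s2/avocado", "s2/apple"]
utilities = bradley_terry(rank2pairs(ranking))
top1_utilities = bradley_terry(one_vs_all(ranking))
```

For a minibatch of size $b$, $\operatorname{Rank2Pairs}$ produces $b(b-1)/2$ comparisons whereas $\operatorname{One-vs-All}$ produces only $b-1$, so the former gives the Bradley-Terry model more signal whenever the lower part of the ranking is reliable.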

Batching.

We implement a batched inference procedure that slices state-action samples $S_{A}=\{(\theta,a)\}$ into overlapping minibatches for ranking. We ensure that $q$ % of samples are shared between two consecutive minibatches drawn from $S_{A}$ , where $q$ is a hyperparameter that modulates a minibatch’s degree of exposure to the preference of the previous minibatch. Larger $q$ results in finer-grained preference at the cost of more queries. Effects of $q$ are ablated in Figure 3.
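The slicing into overlapping minibatches can be sketched as below, using the step size $\lfloor b\times(1-q)\rfloor$ from Algorithm 2. The function name, the zero-based indexing, and the stopping rule at the end of the sample list are illustrative assumptions.

```python
import math


def overlapping_minibatches(samples, b, q):
    """Slice samples into minibatches of size b, where consecutive
    minibatches share a fraction q of their elements (0 <= q < 1)."""
    step = max(1, math.floor(b * (1 - q)))
    batches = []
    for start in range(0, len(samples), step):
        batches.append(samples[start:start + b])
        if start + b >= len(samples):
            break  # the last slice already reaches the end
    return batches


batches = overlapping_minibatches(list(range(10)), b=4, q=0.5)
no_overlap = overlapping_minibatches(list(range(10)), b=4, q=0.0)
```

With 10 samples and $b=4$, an overlap of $q=0.5$ yields four minibatches, while $q=0$ yields three; this illustrates how a larger overlap increases the number of ranking queries.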

Variance Reduction.

Directly sampling $|S_{A}|$ state values from the proposal distribution $\pi^{\text{LLM}}$ may lead to high-variance estimates of utilities. We instead sample $|S_{A}|/|\mathcal{A}|$ independent state values from $\pi^{\text{LLM}}$ , create $|\mathcal{A}|$ duplicates, and pair them with each action $a$ (see Figure 6 in Appendix A).
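A minimal sketch of this pairing scheme is given below. The sampler is a hypothetical stand-in for drawing from $\pi^{\text{LLM}}$, and the function and argument names are illustrative.

```python
import random


def paired_state_action_samples(actions, sample_state, s, rng):
    """Draw floor(s / |A|) states once and pair each with every action,
    so that all actions are scored on the same sampled states."""
    n_states = s // len(actions)
    states = [sample_state(rng) for _ in range(n_states)]
    return [(theta, a) for theta in states for a in actions]


# Hypothetical sampler standing in for the forecast distribution pi^LLM.
def toy_sampler(rng):
    return rng.choice(["drought", "normal", "wet"])


pairs = paired_state_action_samples(
    ["apple", "pear"], toy_sampler, s=8, rng=random.Random(0)
)
```

Since every action is evaluated on an identical set of states, differences in estimated utility between actions are not confounded by differences in which states happened to be sampled.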

3.4 Expected Utility Maximization

In the final step of DeLLMa, we compute the expected utility for each action, and then return the action that maximizes the expected utility. In particular, for each action, we compute a Monte Carlo estimate of the expected utility using state-action samples (drawn from the state forecast distribution $\pi^{\text{LLM}}(\theta\mid\mathcal{C})$ ), as well as the elicited utility function. Note that these calculations are all performed analytically (not via an LLM). We can then approximate the expected utility $U_{\mathcal{C}}(a)$ as

$$U_{\mathcal{C}}(a)=\mathbb{E}_{\pi(\theta\mid\mathcal{C})}\left[U(\theta,a)\right]\approx\frac{1}{|S|}\sum_{\theta\in S}U(\theta,a), \tag{3}$$

given a set of state samples $S\subseteq\Theta$ drawn from our LLM-defined state forecast distribution, which is an approximation of the LLM’s posterior belief distribution about states given context $\mathcal{C}$, i.e., $S\overset{\textit{i.i.d.}}{\sim}\pi^{\text{LLM}}(\theta\mid\mathcal{C})\approx\pi(\theta\mid\mathcal{C})$. After computing the expected utility $U_{\mathcal{C}}(a)$ for each action, DeLLMa returns the final decision: $a^{*}=\operatorname{arg\,max}_{a\in\mathcal{A}}U_{\mathcal{C}}(a)$.
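The Monte Carlo estimate in Equation (3) and the final maximization reduce to a per-action average, as in the following sketch. The utility values are hypothetical stand-ins for the output of the utility elicitation step.

```python
from collections import defaultdict


def monte_carlo_decision(utilities):
    """utilities maps (state, action) samples to elicited utility values.

    Returns the per-action sample mean (Eq. 3) and the maximizing action.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for (_, action), u in utilities.items():
        totals[action] += u
        counts[action] += 1
    expected = {a: totals[a] / counts[a] for a in totals}
    return expected, max(expected, key=expected.get)


# Hypothetical elicited utilities for two sampled states and two actions.
utilities = {
    ("s1", "apple"): 0.4, ("s1", "avocado"): 1.2,
    ("s2", "apple"): 0.6, ("s2", "avocado"): 0.2,
}
expected, a_star = monte_carlo_decision(utilities)
```

In this toy example the estimates are 0.5 for apple and 0.7 for avocado, so avocado is returned.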

4 Experiments

We evaluate the performance of DeLLMa on two decision making under uncertainty environments sourced from distinct domains: agricultural planning (Agriculture) and finance investing (Stocks). Both involve sizable degrees of uncertainty from diverse sources, and are representative of different data modalities (natural language and tabular) involved in decision making.

We propose DeLLMa variants designed to assess algorithmic improvements outlined in Section 3.

• DeLLMa-Pairs is the method using all techniques in §3.3 and $\operatorname{Rank2Pairs}$ for utility elicitation.

• DeLLMa-Top1 is identical to DeLLMa-Pairs, but replaces $\operatorname{Rank2Pairs}$ with $\operatorname{One-vs-All}$.

• DeLLMa-Naive is a base version of DeLLMa-Pairs, where we sample multiple states per action and construct pairwise comparisons from a single batch (i.e., no batching and no variance reduction).

For DeLLMa-Pairs and Top1, we allocate a per action sample size of 64 and a minibatch size of 32. We set the overlap proportion $q$ to 25% for the Agriculture dataset and 50% for the Stocks dataset due to budget constraints. For DeLLMa-Naive, we fix a total sample size of 50, which is the maximal size we observe our LLM to yield a plausible ranking without obvious hallucinations. For our default LLM in experiments (except for comparisons across different LLMs), we use GPT-4 [1] (checkpoint gpt4-1106-preview).

We compare DeLLMa against three baselines—zero-shot, self-consistency, and Chain-of-Thought:

• Zero-Shot. Only the goal $\mathcal{G}$, the action space $\mathcal{A}$, and the context $\mathcal{C}$ are provided. We adopt a greedy decoding process by setting temperature = 0. An example prompt for each task is provided in Sections C.2 and D.1.

• Self-Consistency (SC) [43]. We use the exact same prompt as in zero-shot, but with temperature = 0.5 to generate a diverse set of $K$ responses. We take the majority vote of the $K$ responses and record it as the final prediction. Our preliminary study finds that LLMs tend to be very confident in their decisions, and even with a much higher temperature than 0.5, all $K$ independent runs often provide quite consistent decisions. We thus set $K=5$ to balance cost and performance.

• Chain-of-Thought (CoT) [44]. For decision-making tasks, there is no standard CoT pipeline. Inspired by workflows from decision theory, we create a prompting chain consisting of three steps: (1) ask for unknown factors that impact the decision; (2) given these, ask for their likelihood of occurrence; (3) then ask for a final decision. Such a mechanism looks very similar to the DeLLMa pipeline (see §3) but only consists of prompting. Example CoT prompts are provided in Section D.1.
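The majority-vote aggregation used by the self-consistency baseline can be sketched as follows. The callable `sample_decision` is a hypothetical wrapper around a single stochastic LLM call; here it is replaced by a stub that replays fixed responses.

```python
from collections import Counter


def self_consistency(sample_decision, K=5):
    """Query the model K times and return the most frequent decision.

    sample_decision is a hypothetical zero-argument callable wrapping one
    stochastic LLM call (temperature > 0) that returns an action name.
    """
    votes = Counter(sample_decision() for _ in range(K))
    return votes.most_common(1)[0][0]


# Stand-in for an LLM: replays a fixed list of responses.
responses = iter(["apple", "pear", "apple", "apple", "pear"])
decision = self_consistency(lambda: next(responses), K=5)
```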

Evaluation Metrics.

For both datasets, our action spaces consist of a set of items, and we evaluate the performance of both DeLLMa and baseline methods by comparing the accuracy of their prediction from this set against the ground-truth optimal action (i.e., the action that maximizes ground-truth utility). We also report normalized utility—i.e., the ground-truth utility of the action chosen by a given method, normalized by the optimal ground-truth utility—in Appendix B.
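Both metrics are simple to compute once ground-truth utilities are available; a minimal sketch follows. The utility values are hypothetical, and the normalized utility as written assumes that the optimal ground-truth utility is positive.

```python
def accuracy(predictions, optimal_actions):
    """Fraction of instances where the chosen action is the optimal one."""
    hits = sum(p == o for p, o in zip(predictions, optimal_actions))
    return hits / len(predictions)


def normalized_utility(chosen, true_utility):
    """Ground-truth utility of the chosen action over the best achievable.

    Assumes the best achievable utility is positive.
    """
    return true_utility[chosen] / max(true_utility.values())


# Hypothetical ground-truth utilities for one instance.
true_utility = {"apple": 80.0, "avocado": 100.0, "pear": 50.0}
acc = accuracy(["avocado", "apple"], ["avocado", "pear"])
norm = normalized_utility("apple", true_utility)
```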

We defer more involved decision-making under uncertainty problems, such as constructing a weighted combination of actions (i.e., a portfolio), to future work.

4.1 Agriculture

Data Acquisition.

We collect bi-annual reports published by the United States Department of Agriculture (USDA) that provide analysis of supply-and-demand conditions in the U.S. fruit markets (www.ers.usda.gov/publications/pub-details/?pubid=107539). To emulate real-life farming timelines, we use the report published in September 2021 as context for planning the forthcoming agricultural year. We additionally supplement these natural language contexts with USDA-issued price and yield statistics in California (www.nass.usda.gov/Quick_Stats/Ag_Overview/stateOverview.php?state=CALIFORNIA).

We define the utility of planting a fruit as its price $\times$ yield reported in the forthcoming year. We identify 7 fruits (apple, avocado, grape, grapefruit, lemon, peach, and pear) that are both studied in the September 2021 report, and endowed with these statistics in 2021 and 2022. We create decision making problems by enumerating all possible combinations of available fruits, resulting in 120 instances. For each decision-making instance, we use related sections of the USDA report and current-year price and yield statistics as context. See Section C.1 and Section C.2 for additional details on preprocessing, and Section C.4 for DeLLMa prompts.
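The count of 120 instances is consistent with taking every subset of the 7 fruits that contains at least two items, since $2^{7}-1-7=120$. The sketch below enumerates the instances under that assumption; the minimum subset size of two is inferred from the count rather than stated explicitly above.

```python
from itertools import combinations

fruits = ["apple", "avocado", "grape", "grapefruit", "lemon", "peach", "pear"]

# Assumption: an instance is any subset containing at least two fruits.
instances = [
    subset
    for size in range(2, len(fruits) + 1)
    for subset in combinations(fruits, size)
]
num_instances = len(instances)
```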

Main Result.

Our numerical results are reported in Figure 2. We observe that all DeLLMa strategies consistently outperform baseline comparison methods, especially for larger action sets. We plot decision accuracy here, but include equivalent plots showing the utility of each strategy in Appendix B. Among DeLLMa variants, the performance of Pairs and Top1 is consistent across the board; both are significantly better than Naive. This observation shows the benefits of algorithmic modifications outlined in §3.3. We give more detailed analyses in the ablation study below. We also show a comparison of DeLLMa deployed on multiple leading LLMs (GPT-4 [1], Claude-3 [2], Gemini 1.5 [34]), with consistent performance improvements across multiple model families. We refer readers to Appendix C and supplementary material for details on data, prompts, and responses.

Failure Modes of Baseline Methods.

One surprising observation is that our baseline methods often underperform random guessing in the case of larger action sets. A shared failure mode is that these methods elicit decisions that echo the sentiments presented in context, and they lack the ability to reason about what-if scenarios that lead to utility changes. Our experiments on SC and CoT indicate that neither sampling reasoning paths, prompting the model to imagine the alternatives, nor augmenting an LLM with a posterior belief distribution can fundamentally enhance the model’s ability to reason with uncertainty. By conditioning on sampled states, DeLLMa can avoid this pitfall while leveraging in-context learning to decide the preferred state-action pair. See Section C.4 for discussions on failure cases of baseline methods that are addressed by DeLLMa.

In addition to performance improvements, our structured approach is endowed with human auditability. In Figure 3 (right), we show an abbreviated decision network, with actions, states, sampled latent factors, and derived utilities constructed from outputs of a DeLLMa agent. This modular approach to decision making can facilitate transparency and trust of LLMs in high-stakes scenarios.

Ablation Study.

Referring back to Figure 2, a potential explanation for the performance difference between DeLLMa-Naive and Pairs/Top1 is the discrepancy in sample size: for Naive, we allocate a fixed sample size (50) for all decision-making instances, and we allocate a fixed per action sample size (64) for Pairs and Top1. We eliminate this confounder in our ablation study, by noting that with per action sample size 8 (middle subfigure in Figure 3), DeLLMa-Pairs/Top1 achieve comparable performance to Naive, despite only receiving 16-48 samples in total for each problem instance.

In Figure 3 (left), we observe linear performance trends when scaling up overlap percentage and sample size. Intuitively, both higher overlap percentages (i.e., more exposure between minibatches) and larger sample sizes lead to the construction of finer-grained pairwise comparisons and thus higher-quality approximate utilities. Furthermore, DeLLMa-Pairs consistently outperforms Top1, implying that a nontrivial portion of the pairwise comparisons are meaningful. However, we note that the number of required API queries scales linearly with both parameters, and users are advised to choose parameter values that balance performance and cost. We report statistics for API queries and prompt lengths for our methods in Table 1 of Appendix B.

Finally, we also perform a study to evaluate the quality of utility elicitation in DeLLMa. We curate a large set of state-action samples; human annotators (the paper authors) are then shown pairs of these state-action samples and asked to annotate a preference, based on the decision prompt $\mathcal{P}$ . We then compute the agreement rate between the pairwise annotations given by DeLLMa and those given by the annotators. Although this step is noisy, we see a strong agreement between LLM and human annotations, as shown in Figure 4, across multiple LLMs. Further, we find that this agreement rate (77.5%, details in Appendix E) is on par with the agreement rate between pairs of human annotators.

4.2 Stocks

Making decisions involving financial investing requires handling a variety of uncertainties; however, it is fundamentally different from the agriculture decision problems. The most evident difference is the input format: contexts for agriculture rely more on textual information (summarizations from USDA reports), whereas contexts for stocks involve tabular data (historical prices per share). Stocks decision problems are well suited to testing LLMs’ capability in reasoning on dynamic tabular data.

Data Acquisition.

Similar to the setup in §4.1, the action space $\mathcal{A}$ consists of individual stocks, and we curate a total of 120 decision problem instances of varying sizes. In our experiments, we choose popular stocks whose symbols are AMD, DIS, GME, GOOGL, META, NVDA and SPY. Unlike the agriculture data, where the context $\mathcal{C}$ is collected from USDA reports, we collect historical stock prices as the context for this problem. As illustrated in Figure 1, each stock is presented with 24 monthly historical prices. To prevent possible data leakage and to encourage LLMs to use their common-sense knowledge in making decisions, when using gpt4-1106-preview as the LLM checkpoint, historical prices from December 2021 to November 2023 are provided as the context $\mathcal{C}$. These historical monthly prices are collected manually by the authors via Yahoo Finance (finance.yahoo.com).

The goal of the LLM agent is to choose which stock to buy on December 1, 2023 and sell on the last trading day of that month (December 29, 2023) so that the return is maximized. Detailed prompts are presented in Section D.1. We note that we only consider the simplest setting: choosing one stock from a set of options $\{a_{1},\cdots,a_{n}\}$.
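
As a minimal sketch of this objective, the snippet below computes the one-month fractional return of each candidate stock and selects the maximizer. The function names, ticker labels, and prices are hypothetical placeholders chosen for illustration; they are not values from our dataset.

```python
def simple_return(buy_price, sell_price):
    """Fractional return from buying at buy_price and selling at sell_price."""
    return (sell_price - buy_price) / buy_price


def best_stock(prices):
    """Pick the symbol with the highest return.

    prices maps a symbol to a (buy_price, sell_price) tuple.
    Returns the chosen symbol and the per-symbol returns.
    """
    returns = {sym: simple_return(buy, sell) for sym, (buy, sell) in prices.items()}
    return max(returns, key=returns.get), returns


# Hypothetical prices, for illustration only.
example_prices = {"STOCK_A": (100.0, 110.0), "STOCK_B": (50.0, 52.0)}
choice, returns = best_stock(example_prices)
```

Under this reading, a prediction counts as correct when the chosen stock matches the ground-truth maximizer for the instance.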

Main Results.

Similarly, we compare three variants of DeLLMa with the three popular baselines. As shown in Figure 5, most of the observations are consistent with those in the agriculture experiments: DeLLMa’s variants outperform the baseline candidates. Unlike the agriculture setting, here DeLLMa-Naive barely improves over the baselines. We hypothesize that this is due to the high volatility of stocks data, where an inefficient sample size without variance reduction produces highly volatile predictors as well. This also validates the need for the design of the DeLLMa-Pairs and DeLLMa-Top1 methods.

Additionally, DeLLMa-Top1 performs better than DeLLMa-Pairs on stocks data. This is surprising at first glance, since DeLLMa-Pairs has more observations in terms of pairwise preferences and rankings of sampled action-state pairs. However, upon further consideration, we hypothesize a potential explanation: LLM hallucination is still an unsolved problem. For high-volatility data like stocks, LLMs can easily hallucinate internal rankings when asked to perform difficult tasks (such as ranking a batch of options for which a ground-truth ranking may not exist). On the other hand, the model may still have high confidence in its prediction of the top choice in the batch. In such scenarios, using the hallucinated rankings may only add noise that hinders model performance.

5 Conclusion

We propose DeLLMa, a framework designed to harness LLMs for decision making under uncertainty in high-stakes settings. We discuss a structured approach in which we provide a scaffold for LLMs that follows the principles of classical decision theory, and then develop a feasible implementation with LLMs. Through experiments on real datasets, we highlight a systematic failure of popular prompting strategies when applied to this type of decision-making task, and demonstrate the benefits of our approach in addressing these issues. The modularity of our framework opens up many possibilities, most notably auditability and the use of LLMs for a broader spectrum of probabilistic reasoning tasks.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Anthropic [2024] AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
Baan et al. [2023] Joris Baan, Nico Daheim, Evgenia Ilia, Dennis Ulmer, Haau-Sing Li, Raquel Fernández, Barbara Plank, Rico Sennrich, Chrysoula Zerva, and Wilker Aziz. Uncertainty in natural language generation: From theory to applications. arXiv preprint arXiv:2307.15703, 2023.
Bazerman and Moore [2012] Max H Bazerman and Don A Moore. Judgment in managerial decision making. John Wiley & Sons, 2012.
Benary et al. [2023] Manuela Benary, Xing David Wang, Max Schmidt, Dominik Soll, Georg Hilfenhaus, Mani Nassir, Christian Sigler, Maren Knödler, Ulrich Keller, Dieter Beule, et al. Leveraging large language models for decision support in personalized oncology. JAMA Network Open, 6(11):e2343689–e2343689, 2023.
Berger [2013] James O Berger. Statistical decision theory and Bayesian analysis. Springer Science & Business Media, 2013.
Bommasani et al. [2021] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Bubeck et al. [2023] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
Chen and Mueller [2023] Jiuhai Chen and Jonas W. Mueller. Quantifying uncertainty in answers from any language model via intrinsic and extrinsic confidence assessment. ArXiv, abs/2308.16175, 2023. URL https://api.semanticscholar.org/CorpusID:261339369.
Falck et al. [2024] Fabian Falck, Ziyu Wang, and Christopher C Holmes. Are large language models bayesian? a martingale perspective on in-context learning. In ICLR 2024 Workshop on Secure and Trustworthy Large Language Models, 2024.
Farquhar [1984] Peter H Farquhar. State of the art—utility assessment methods. Management science, 30(11):1283–1300, 1984.
Feng et al. [2024] Yu Feng, Ben Zhou, Weidong Lin, and Dan Roth. Bird: A trustworthy bayesian inference framework for large language models. arXiv preprint arXiv:2404.12494, 2024.
Ferrara [2023] Emilio Ferrara. Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738, 2023.
Fishburn [1968] Peter C Fishburn. Utility theory. Management science, 14(5):335–378, 1968.
Halawi et al. [2024] Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt. Approaching human-level forecasting with language models. arXiv preprint arXiv:2402.18563, 2024.
Hao et al. [2023] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
Kadavath et al. [2022] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, and Jared Kaplan. Language models (mostly) know what they know, 2022.
Kochenderfer [2015] Mykel J Kochenderfer. Decision making under uncertainty: theory and application. MIT press, 2015.
Kochenderfer et al. [2022] Mykel J Kochenderfer, Tim A Wheeler, and Kyle H Wray. Algorithms for decision making. MIT press, 2022.
Lee et al. [2024] Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF: Scaling reinforcement learning from human feedback with ai feedback, 2024. URL https://openreview.net/forum?id=AAxIs3D2ZZ.
Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
Li et al. [2023] Beibin Li, Konstantina Mellou, Bo Zhang, Jeevan Pathuri, and Ishai Menache. Large language models for supply chain optimization, 2023.
Lin et al. [2022] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words, 2022.
Lin et al. [2023] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models, 2023.
Luce and Raiffa [1989] R Duncan Luce and Howard Raiffa. Games and decisions: Introduction and critical survey. Courier Corporation, 1989.
Machina [1987] Mark J Machina. Choice under uncertainty: Problems solved and unsolved. Journal of Economic Perspectives, 1(1):121–154, 1987.
Mao et al. [2023] Jiageng Mao, Junjie Ye, Yuxi Qian, Marco Pavone, and Yue Wang. A language agent for autonomous driving, 2023.
Mielke et al. [2022] Sabrina J. Mielke, Arthur Szlam, Emily Dinan, and Y-Lan Boureau. Reducing conversational agents’ overconfidence through linguistic calibration, 2022.
Nie et al. [2023] Allen Nie, Ching-An Cheng, Andrey Kolobov, and Adith Swaminathan. Importance of directional feedback for llm-based optimizers. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
Peterson [2017] Martin Peterson. An introduction to decision theory. Cambridge University Press, 2017.
Qin et al. [2023a] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023a.
Qin et al. [2023b] Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, Xuanhui Wang, and Michael Bendersky. Large language models are effective text rankers with pairwise ranking prompting, 2023b.
Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
Ren et al. [2023] Allen Z. Ren, Anushri Dixit, Alexandra Bodrova, Sumeet Singh, Stephen Tu, Noah Brown, Peng Xu, Leila Takayama, F. Xia, Jacob Varley, Zhenjia Xu, Dorsa Sadigh, Andy Zeng, and Anirudha Majumdar. Robots that ask for help: Uncertainty alignment for large language model planners. ArXiv, abs/2307.01928, 2023. URL https://api.semanticscholar.org/CorpusID:259342058.
Schoemaker [1982] Paul JH Schoemaker. The expected utility model: Its variants, purposes, evidence and limitations. Journal of economic literature, pages 529–563, 1982.
Schoenegger et al. [2024] Philipp Schoenegger, Indre Tuminauskaite, Peter S Park, and Philip E Tetlock. Wisdom of the silicon crowd: Llm ensemble prediction capabilities match human crowd accuracy. arXiv preprint arXiv:2402.19379, 2024.
Shinn et al. [2023] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Si et al. [2022] Chenglei Si, Chen Zhao, Sewon Min, and Jordan Boyd-Graber. Re-examining calibration: The case of question answering. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2814–2829, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.204. URL https://aclanthology.org/2022.findings-emnlp.204.
Thirunavukarasu et al. [2023] Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. Large language models in medicine. Nature medicine, 29(8):1930–1940, 2023.
Tian et al. [2023] Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D Manning. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975, 2023.
Von Neumann and Morgenstern [1944] John Von Neumann and Oskar Morgenstern. Theory of games and economic behavior. 1944.
Wang et al. [2022] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Wong et al. [2023] Lionel Wong, Gabriel Grand, Alexander K Lew, Noah D Goodman, Vikash K Mansinghka, Jacob Andreas, and Joshua B Tenenbaum. From word models to world models: Translating from natural language to the probabilistic language of thought. arXiv preprint arXiv:2306.12672, 2023.
Xiong et al. [2023] Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms. arXiv preprint arXiv:2306.13063, 2023.
Yang et al. [2023] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
Yao et al. [2023] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
Ye et al. [2023] Yining Ye, Xin Cong, Yujia Qin, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Large language model as autonomous decision maker. arXiv preprint arXiv:2308.12519, 2023.
Zhuang et al. [2023] Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with a* search. arXiv preprint arXiv:2310.13227, 2023.

Appendix

Appendix A Utility Elicitation Details

In Figure 6, we show an illustration of the overlapped batching and variance reduction strategies that DeLLMa uses in its utility elicitation procedure (described in detail in Section 3.3).
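
To make the idea of overlapped batching concrete, the sketch below splits a list of sampled state-action pairs into minibatches in which consecutive batches share a fixed fraction of items. This is one plausible construction written under our own assumptions (sliding-window batches, overlap expressed as a fraction of the batch size); the function name and parameters are hypothetical, and the exact procedure is the one described in Section 3.3 and Figure 6.

```python
def overlapped_batches(items, batch_size, overlap):
    """Split items into minibatches where neighbours share some items.

    overlap is the fraction of a batch (0 <= overlap < 1) that is repeated
    at the start of the next batch. This is an illustrative construction,
    not necessarily the exact one used by DeLLMa.
    """
    shared = int(batch_size * overlap)
    step = batch_size - shared
    if step <= 0:
        raise ValueError("overlap must leave at least one new item per batch")
    batches = []
    start = 0
    while start < len(items):
        batches.append(items[start:start + batch_size])
        if start + batch_size >= len(items):
            break  # the last batch already reaches the end of the list
        start += step
    return batches


# Ten samples, batches of four, 25% overlap: neighbours share one item.
example_batches = overlapped_batches(list(range(10)), batch_size=4, overlap=0.25)
```

Shared items give the ranking step a common reference point across minibatches, which is the intuition behind the "more exposure between minibatches" remark in Section 4.1.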

Appendix B Additional Results

In Figures 7 and 8 we show the normalized utility (i.e., ground-truth utility of the chosen action, normalized by the maximum ground-truth utility) of each method. These results can be contrasted against the accuracy of each method’s prediction of the optimal action in Figures 2 and 5.
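
The normalization described above can be sketched as follows. The function name and the example utilities are hypothetical, and the sketch assumes the maximum ground-truth utility is positive so that the ratio is well defined.

```python
def normalized_utility(ground_truth, chosen_action):
    """Ground-truth utility of the chosen action divided by the best achievable.

    ground_truth maps each action to its ground-truth utility.
    Assumes the maximum utility is strictly positive.
    """
    best = max(ground_truth.values())
    if best <= 0:
        raise ValueError("normalization assumes a positive maximum utility")
    return ground_truth[chosen_action] / best


# Hypothetical utilities, for illustration only.
example_utilities = {"a1": 2.0, "a2": 4.0}
score_suboptimal = normalized_utility(example_utilities, "a1")
score_optimal = normalized_utility(example_utilities, "a2")
```

A value of 1 means the method picked the optimal action, while lower values indicate how much utility was lost by a suboptimal choice.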

Aside from performance improvements, we also compare the cost, measured in terms of prompt length and API calls, in Table 1. For the Agriculture dataset, DeLLMa-Naive is comparable to SC in terms of prompt length and significantly outperforms the latter, while requiring only 20% of the API calls. DeLLMa-Pairs/Top1 can push the performance further, but incur a higher cost, especially for large sample sizes.

| Metric | Zero-Shot | CoT | SC | DeLLMa-Naive | DeLLMa-Pairs (16) | DeLLMa-Pairs (64) |
| --- | --- | --- | --- | --- | --- | --- |
| API Calls | 1 | 3 | 5 | 1 | 3 | 10 |
| Word Counts | 693.71 | 2724.2 | 3468.55 | 3681.94 | 7254.46 | 28895.46 |

Table 1: Number of GPT-4 API calls and word counts per decision-making instance (with action set size 4 for the Agriculture dataset) across all methods discussed in Section 4.1. For DeLLMa-Naive, we set the total sample size to 50. For DeLLMa-Pairs, we fix the overlap percentage to 25% and vary the per action sample size from 16 to 64. DeLLMa-Top1 has the same statistics as DeLLMa-Pairs since they only differ in post processing.

Appendix C Additional Details for Agriculture

C.1 Dataset Curation

To reduce context length, we first extract executive and per-fruit summaries of the report, reducing the context length from 8,721 words to fewer than 700 words. These summaries are constructed on a per-model basis: each LLM (GPT-4, Claude-3, and Gemini 1.5) is prompted to generate its own summaries. We proofread these summaries to ensure that they do not contain any factual errors. For each decision-making instance, we use the executive summary, the summaries relevant to the fruits in consideration, and their current-year price and yield statistics as context. We provide the concrete summaries and statistics in our supplementary material, and refer readers to Section C.2 for the prompt used to summarize the report.

C.2 Prompt Used by Agriculture

Summary Prompt

In Section C.2 we present the prompt used for summarizing the context in our Agriculture experiments. This prompt is shared across all models tested.

Zero-Shot Prompt

In Section C.2 we present an example prompt for zero-shot experiments consisting of $\mathcal{P}=(\mathcal{G},\mathcal{C},\mathcal{A})$ , with GPT-4 summaries generated from Section C.2 as context. We fix the prompt format for all LLMs and select the corresponding summaries for each LLM.

Self-Consistency Prompt

We adopt the same zero-shot prompt as the one shown in Section C.2, but take a majority vote from 5 reasoning paths as the final prediction.
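
The aggregation step can be sketched as a simple majority vote over the sampled answers. The function name and the example answers are hypothetical; ties are resolved in favor of the answer that appears first, which is an assumption of this sketch rather than a documented choice.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent prediction among the sampled reasoning paths.

    Ties go to the prediction seen first (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    # Counter preserves insertion order, and max returns the first maximum.
    return max(counts, key=counts.get)


# Five hypothetical answers, one per sampled reasoning path.
example_votes = ["apple", "grapefruit", "apple", "apple", "grapefruit"]
final_prediction = majority_vote(example_votes)
```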

Chain-of-Thought Prompt

We design a multi-prompt CoT procedure that closely emulates our DeLLMa agent, consisting of three steps: (1) ask for unknown factors that impact the decision; (2) given these, ask for their likelihood of occurrence; (3) then ask for a final decision. This procedure differs from DeLLMa agents in that it does not use an external program for utility elicitation and expected utility maximization, but delegates each step of the process to an LLM. An example prompt can be found in Section C.2.

DeLLMa Prompt

Similar to the CoT Prompt, DeLLMa is a multi-prompt procedure: (1) ask for unknown factors that impact the decision; (2) given these, ask for their likelihood of occurrence (i.e., a belief distribution). Instead of directly deciding on an action, DeLLMa samples state-action pairs (via an external program) from this belief distribution, and leverages an LLM to elicit a utility function $U:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ from pairwise comparisons. From here, we use another external program to search for an action that maximizes the expected utility.
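
The final step, expected utility maximization, can be sketched as a Monte Carlo average over sampled state-action pairs. The snippet assumes that a utility value has already been elicited for every sampled pair; the function names and the numbers are hypothetical and serve only to illustrate the computation.

```python
from collections import defaultdict


def expected_utilities(samples):
    """Average the elicited utility over the sampled states of each action.

    samples is an iterable of (action, utility) tuples, one per sampled
    state-action pair. The mean is a Monte Carlo estimate of E[U(s, a)].
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for action, utility in samples:
        totals[action] += utility
        counts[action] += 1
    return {action: totals[action] / counts[action] for action in totals}


def best_action(samples):
    """Return the action with the highest estimated expected utility."""
    estimates = expected_utilities(samples)
    return max(estimates, key=estimates.get)


# Hypothetical elicited utilities, for illustration only.
example_samples = [
    ("apple", 0.2), ("apple", 0.6),
    ("grapefruit", 0.9), ("grapefruit", 0.1), ("grapefruit", 0.8),
]
decision = best_action(example_samples)
```

Because the averaging happens in ordinary code rather than inside the LLM, each intermediate quantity can be inspected, which is the auditability property discussed in Section 4.1.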

In Sections C.2 and C.3 we present the prompt for state enumeration and state forecasting (i.e., steps (1) and (2)) for the Agriculture dataset. Our implementation combines these two modules, but they should be separated in the future as our framework continues to progress. For example, state enumeration may leverage retrieval-augmented generation [22] for generating high-quality unknown factors, and state forecasting may resort to tool usage to derive calibrated belief distributions from quantitative analysis.

In Section C.4, we provide a DeLLMa prompt to generate ranked comparisons from sampled state-action pairs. These state-action pairs are sampled from the belief distribution via an external program, written in Python. We provide all our implementations in the supplementary material.
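
A minimal sketch of such a sampling program is shown below. It assumes the belief distribution factorizes into independent categorical distributions, one per latent factor; this factorization, the function names, and the example factors and probabilities are all assumptions made for illustration and are not taken from our implementation.

```python
import random


def sample_state(belief, rng):
    """Draw one value per latent factor from its categorical distribution.

    belief maps a factor name to a dict of {value: probability}.
    Factors are sampled independently (an assumption of this sketch).
    """
    return {
        factor: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for factor, dist in belief.items()
    }


def sample_state_action_pairs(belief, actions, per_action, seed=0):
    """Pair every action with per_action independently sampled states."""
    rng = random.Random(seed)
    return [
        (sample_state(belief, rng), action)
        for action in actions
        for _ in range(per_action)
    ]


# Hypothetical latent factors and probabilities, for illustration only.
example_belief = {
    "weather": {"favorable": 0.6, "drought": 0.4},
    "demand": {"rising": 0.5, "stable": 0.3, "falling": 0.2},
}
example_pairs = sample_state_action_pairs(
    example_belief, ["apple", "grapefruit"], per_action=3
)
```

The resulting list is what would be formatted into the ranking prompt; the per-action sample size plays the role of the parameter varied in the ablation study.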

C.3 Agriculture Dataset Responses

Summaries

See Section C.4 for formatted GPT-4 summaries from the USDA agricultural report and statistics. Claude-3 and Gemini summaries follow the exact same format, but differ in LLM generated contents.

Zero-Shot Responses

See Section C.4 for sample GPT-4 responses to zero-shot prompts.

Self-Consistency Responses

SC responses follow the exact same format as those in Section C.4.

Chain-of-Thought Responses

See Section C.4 for a curtailed version of GPT-4 CoT response.

DeLLMa Responses

See Section C.4 for a curtailed version of GPT-4 DeLLMa response.

C.4 Failure Cases on the Agriculture Dataset

In Section 4.1, we postulate that a failure mode of our baseline approaches is that they lack the ability to perform probabilistic reasoning, and instead follow the sentiment presented in context. Here, we qualitatively test this hypothesis by showcasing prompts and responses from our Zero-Shot experiments. We present instances in Section C.4 for which baseline methods fail to make the optimal prediction, but which are solved by DeLLMa agents.

Within each figure, we first present $\mathcal{P}$ , followed by the (incorrect) decision and explanation generated from the Zero-Shot baseline, and then present the top-ranked state action pair generated by DeLLMa along with its explanation.

In the case of deciding between apple and grapefruit in Section C.4, GPT-4 presumes that a high price for grapefruit is sustainable due to the presence of extreme weather conditions in the previous year. With CoT, the model reasons with some counterfactual scenarios, such as more suitable weather for the forthcoming year, but not in a systematic way. With DeLLMa, we provide a list of counterfactual scenarios as context, and leverage the model’s ability for situational thinking and ranking to elicit an informative preference. We make similar observations for the other instances in Section C.4.

Appendix D Additional Details for Stocks

D.1 Prompt Used by Stocks

Zero-Shot Prompt

Similar to the agriculture setup, we first present a sample zero-shot prompt in Section D.1. The zero-shot prompt consists of three main parts (see Figure 1 for visual illustrations): (1) an enumeration of the action space (here AMD or GME), (2) a provided context (here, historical stock prices from December 2021 to November 2023), and (3) the goal of the user (choose only one stock to maximize their profit via investing in stocks).

Self-Consistency Prompt

SC prompts are identical to zero-shot prompts.

Chain-of-Thought Prompt

Similarly, we exemplify a prompt for CoT. Notably, we break the chain into three parts in Section D.1: (1) enumerating unknown factors, (2) enumerating $\pi^{\text{LLM}}(\cdot\mid\mathcal{C})$, and (3) making the final decision given the previous responses and context.

DeLLMa Prompt

Finally, we showcase the DeLLMa ranking prompts where the LLM is asked to provide comprehensive rankings of given state-action pairs.

D.2 Stock Dataset Responses

Zero-Shot Responses

See Section D.2 for sample GPT-4 responses to zero-shot prompts.

Self-Consistency Responses

SC responses follow the exact same format as those in Section D.2.

Chain-of-Thought Responses

See Section D.2 for a curtailed version of GPT-4 CoT response.

DeLLMa Responses

See Section D.2 for a curtailed version of GPT-4 DeLLMa response.

Appendix E Details on Annotation Pipeline

For each pair of state-action values, we ask human evaluators (paper authors) to annotate which state-action tuple is preferred over the other, and compare our annotations against preferences elicited by LLMs. To reduce bias in this procedure, we present these pairs in shuffled order, i.e., it is not always true that $(s,a)_{1}\succ(s,a)_{2}$ if $\{(s,a)_{1},(s,a)_{2}\}$ is presented to the annotators. For each pair, we ask the annotators to label it as 1 (state-action tuple 1 is preferred), 2 (state-action tuple 2 is preferred), or 0 (uncertain). We then evaluate annotator-LLM agreement after annotation is concluded.

For inter-annotator agreement, we present to two annotators a fixed list of state-action pairs (also in shuffled order), while holding out LLM preferences. We then evaluate annotator agreement with exact match.
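
The exact-match agreement rate can be sketched as the fraction of pairs on which two label sequences coincide. The function name and the example labels are hypothetical. The sketch counts an "uncertain" label (0) like any other label, which is an assumption on our part about how such cases are scored.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label sequences match exactly.

    Labels follow the scheme above: 1, 2, or 0 (uncertain). A 0 only
    counts as agreement if both sides output 0 (an assumption of this sketch).
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels, for illustration only.
annotator_labels = [1, 2, 0, 1]
llm_labels = [1, 2, 1, 1]
example_rate = agreement_rate(annotator_labels, llm_labels)
```

The same function applies to both comparisons described here: annotator versus LLM, and annotator versus annotator.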

Appendix F License for Existing Assets

Large Language Models

DeLLMa is built on top of frontier-class LLMs such as GPT-4 [1], Claude-3 [2], and Gemini 1.5 [34]. We are not aware of their licenses, but confirm that we abide by their respective terms of service (ToS) for the usage of these models (GPT-4 ToS: https://chatgpt.com/c/fe00ee45-e34c-4b64-812d-295abaedd20f, Claude-3 ToS: https://www-cdn.anthropic.com/files/4zrzovbb/website/e2d538c84610b7cc8cb1c640767fa4ba73f30190.pdf, and Gemini 1.5 ToS: https://ai.google.dev/gemini-api/terms).

Agriculture Dataset

The Agriculture dataset is curated from statistics and reports published by the U.S. Department of Agriculture. These assets are freely accessible under the Freedom of Information Act. We confirm that we abide by the USDA ToS (https://www.usda.gov/policies-and-links).

Stock Dataset

The Stock dataset is curated with the Yahoo Finance API (finance.yahoo.com). We confirm that we abide by the Yahoo Developer API ToS (https://legal.yahoo.com/us/en/yahoo/terms/product-atos/apiforydn/index.html).

Appendix G Societal Impact and Limitations

Rational decision making under uncertainty has been extensively studied across many disciplines, but remains elusive even for humans. Our work bears significant societal consequences, as we live in a world where humans increasingly rely on intelligent assistants for diverse problems. Unfortunately, existing foundation models and LLMs are not yet capable of acting optimally under uncertainty, which may cause harm if the general public delegates decision making to these models. These risks are exacerbated when intelligent assistants operate as a black box without adequate transparency. DeLLMa serves as a first step towards an approach that builds trust for humans to rely on these systems to make important decisions under uncertainty.

While our approach appreciably improves the performance of LLMs for decision making under uncertainty, deploying a prototype system in the wild without additional guardrails could incur significant financial losses in the case of a suboptimal decision. We expect users to conduct extensive field tests of the feasibility of such systems, or to leverage a hybrid approach that integrates analyst judgements for optimal decision making.

Finally, a limitation of our approach stems from constrained state and action spaces. Our framework currently operates on a small number of discrete states and actions, due to the context window size of state-of-the-art LLMs. This drawback may limit the applicability of our system in more complex use cases, or in scenarios that require sequential decision making.