New Desiderata for Direct Preference Optimization

Xiangkun Hu      Tong He      David Wipf
Amazon Web Services
{xiangkhu,htong,daviwipf}@amazon.com
Abstract

Large language models in the past have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, directly fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) and followed by several notable descendants. Although effective in certain real-world settings, we introduce new evaluation criteria that serve to highlight unresolved shortcomings in the ability of existing DPO methods to interpolate between a pre-trained reference model and empirical measures of human preferences, as well as unavoidable trade-offs in how low- and high-quality responses are regularized and constraints are handled. Our insights then motivate an alternative DPO-like loss that provably mitigates these limitations. Empirical results serve to corroborate notable aspects of our analyses. (Footnote 1: A preliminary version of this work appears in the ICML 2024 Workshop on Models of Human Feedback for AI Alignment: https://openreview.net/pdf?id=Fgf0iAOb22.)

1 Introduction

Although pre-trained large language models (LLMs) often display remarkable capabilities [9, 11, 26, 44], it is well-established that they are prone to responding in ways that may be at odds with human preferences for rational discourse [5, 17]. To this end, after an initial supervised fine-tuning phase that produces a reference model or policy $\pi_{\text{ref}}(y|x)$, it is now commonplace to apply reinforcement learning with human feedback (RLHF) to further refine the LLM responses $y$ to input prompts $x$ [46, 38, 4, 27]. This multi-step process involves first learning a reward model that reflects human inclinations culled from labeled preference data, and then subsequently training a new policy that balances reward maximization with proximity to $\pi_{\text{ref}}(y|x)$.

Because RLHF introduces additional complexity, computational overhead, and entry points for instability, clever reparameterization techniques have recently been proposed that sidestep the need for separately learning a reward model altogether. Instead, increased alignment with human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO) [33] followed by several notable descendants and generalizations [3, 40, 41, 45]. These alternatives dramatically economize model development; however, with recency comes the potential that the consequences of less obvious properties of DPO-based objectives may still be under-explored. It is along these lines that our attention herein lies, with the end goal of quantifying and steering model behavior in transparently beneficial directions.

After introducing basic concepts and the details of existing preference optimization models in Section 2, the remainder of the paper devoted to our technical contributions can be distilled as follows:

  • We introduce new evaluation desiderata that comport with intuition regarding how a preference model ideally should behave, and yet (somewhat surprisingly) are provably not satisfied by a broad class of existing DPO-based approaches. In particular, we show that because of uniform regularization effects, the minimizers of commonly-used preference optimization objectives like DPO are at times unable to preserve performance in regions where the reference model is strong while simultaneously improving upon the reference model elsewhere (Section 3.1). Moreover, we also elucidate limitations in the ability to interpolate between ideal endpoints as model trade-off parameters are varied (Section 3.2).

  • We prove that once inevitable learning constraints are introduced (explicitly or implicitly, e.g., early-stopping, weight decay, etc.), the core reparameterizations that underpin certain DPO models no longer strictly hold (Section 3.3). This motivates alternative justifications based solely on properties of the final loss functions involved (Appendices C and D).

  • Based on the above, we introduce a new preference optimization loss called $\ell_{\text{TYPO}}$ that, by design, satisfies our evaluation desiderata while avoiding any dependency on constraint-dependent reparameterizations (Section 4). Properties of this loss relative to its precursors are also corroborated using Monte-Carlo simulations (Section 5 and Appendix A.2).

2 Background

We adopt $x \sim \mathcal{D}_x$ to denote an input prompt $x$ drawn from some distribution $\mathcal{D}_x$. From here, conditioned on such prompts we may then generate responses $y$ using a pre-trained reference language model/policy $\pi_{\text{ref}}(y|x)$. Moreover, given a pair of such responses $y_1 \neq y_2$, we adopt the binary indicator variable $z = \mathbb{I}[y_1 \succ y_2 | y_1, y_2, x]$ to convey that $y_1$ is preferred over $y_2$ by a human evaluator when $z = 1$, or else $z = 0$ if instead $y_2 \succ y_1$. Given a population of such evaluators, we express the ground-truth human preference distribution as $p^*(z | y_1, y_2, x) = p^*(y_1 \succ y_2 | y_1, y_2, x)$. And finally, we define a set of human labeled tuples drawn from a training distribution $\mathcal{D}_{\text{tr}}$ as

$$\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}} ~~\equiv~~ \{z, y_1, y_2, x\} \sim \mathcal{D}_{\text{tr}} ~~\equiv~~ z \sim p^*(z | y_1, y_2, x),~ \{y_1, y_2\} \sim \pi_{\text{ref}}(y|x),~ x \sim \mathcal{D}_x, \tag{1}$$

where $y_w \succ y_l$ (subscripts here stand for 'win' and 'lose'). (Footnote 2: We generally assume that $y_1 \neq y_2$; however, the $y_1 = y_2$ case can nonetheless be handled by simply assigning $p^*(z | y, y, x) = 1/2$, inclusion of which does not affect the analysis that follows. In particular, such cases merely introduce an irrelevant constant into the human preference loss functions under consideration.) In other words, each training tuple is generated by drawing $x$ from $\mathcal{D}_x$, drawing $y_1 \neq y_2$ from the reference policy $\pi_{\text{ref}}$, and finally obtaining $z$ from human labelers that operate according to $p^*$. Note that, following convention in prior work and for ease of presentation, we will often abbreviate the preference distribution notation as $p^*(y_1 \succ y_2 | y_1, y_2, x) \equiv p^*(y_1 \succ y_2 | x)$ when the context is sufficiently clear.
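The sampling recipe behind (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the prompt distribution is taken to be uniform over a finite list, the reference policy is a small lookup table, and `toy_p_star`, `QUALITY`, and `sample_preference_tuple` are hypothetical names that do not appear in the paper. The resampling loop also assumes each prompt has at least two responses with positive probability.

```python
import random

def sample_preference_tuple(prompts, pi_ref, p_star, rng):
    """Draw one (y_w, y_l, x) tuple following the recipe in (1)."""
    # x ~ D_x: here D_x is assumed uniform over a finite prompt list.
    x = rng.choice(prompts)
    responses = list(pi_ref[x].keys())
    weights = [pi_ref[x][y] for y in responses]
    # {y1, y2} ~ pi_ref(y|x), resampling y2 until it differs from y1.
    y1 = rng.choices(responses, weights=weights, k=1)[0]
    y2 = y1
    while y2 == y1:
        y2 = rng.choices(responses, weights=weights, k=1)[0]
    # z ~ p*(z | y1, y2, x): z = 1 means y1 is preferred over y2.
    z = 1 if rng.random() < p_star(y1, y2, x) else 0
    # Reorder so the preferred response comes first.
    return (y1, y2, x) if z == 1 else (y2, y1, x)

# Toy setup: every value below is an illustrative assumption.
PROMPTS = ["q1", "q2"]
PI_REF = {
    "q1": {"a": 0.5, "b": 0.3, "c": 0.2},
    "q2": {"a": 0.1, "b": 0.6, "c": 0.3},
}
QUALITY = {"a": 2.0, "b": 1.0, "c": 0.0}  # hypothetical latent quality

def toy_p_star(y1, y2, x):
    # Deterministic toy labeler: always prefers the higher-quality response.
    return 1.0 if QUALITY[y1] > QUALITY[y2] else 0.0

rng = random.Random(0)
dataset = [sample_preference_tuple(PROMPTS, PI_REF, toy_p_star, rng)
           for _ in range(200)]
```

A stochastic labeler would simply return a probability strictly between 0 and 1 from `toy_p_star`; the deterministic choice here only makes the resulting tuples easy to inspect.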

2.1 Reinforcement Learning with Human Feedback (RLHF)

Reward Function Estimation:

Given two candidate responses $y_1 \neq y_2$ sampled using prompt $x$, the Bradley-Terry (BT) model [8] for human preferences stipulates that

$$p^*(y_1 \succ y_2 | x) = \frac{\exp[r^*(y_1, x)]}{\exp[r^*(y_1, x)] + \exp[r^*(y_2, x)]} = \sigma\big[r^*(y_1, x) - r^*(y_2, x)\big], \tag{2}$$

where $r^*(y, x)$ is a so-called latent reward model and $\sigma$ is the logistic function. Because $r^*(y, x)$ is unobservable, it is not possible to directly compute $p^*(y_1 \succ y_2 | x)$; however, we can train an approximation $p_\phi(y_1 \succ y_2 | x)$ (equivalent to $p_\phi(y_1 \succ y_2 | y_1, y_2, x)$ as before) defined by a parameterized proxy reward $r_\phi(y, x)$. Specifically, we can minimize the loss

$$\ell_{\text{BT}}(r_\phi) := \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}}\Big[-\log p_\phi(y_w \succ y_l | x)\Big] = \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}}\Big[-\log \sigma\big[r_\phi(y_w, x) - r_\phi(y_l, x)\big]\Big]. \tag{3}$$

The optimized reward $\hat{r}_\phi(y, x) := \arg\min_{r_\phi} \ell_{\text{BT}}(r_\phi) \approx r^*(y, x)$ can then be applied to fine-tuning the pre-trained reference model $\pi_{\text{ref}}(y|x)$ as described next.
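As a concrete reading of (3), the following sketch evaluates an empirical BT loss over a handful of labeled tuples. It assumes the reward is available as a plain Python callable; `bt_loss`, `log_sigmoid`, and the toy `scores` table are illustrative names and values, not part of the paper.

```python
import math

def log_sigmoid(t):
    """Numerically stable log of the logistic function sigma(t)."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def bt_loss(reward, data):
    """Empirical version of (3): mean of -log sigma(r(y_w, x) - r(y_l, x))."""
    total = 0.0
    for y_w, y_l, x in data:
        total -= log_sigmoid(reward(y_w, x) - reward(y_l, x))
    return total / len(data)

# Toy labeled tuples (y_w, y_l, x); values are illustrative assumptions.
data = [("a", "b", "q1"), ("a", "c", "q1"), ("b", "c", "q2")]
scores = {"a": 2.0, "b": 1.0, "c": 0.0}

# A constant reward cannot separate winners from losers, so every term is log 2.
flat_loss = bt_loss(lambda y, x: 0.0, data)
# A reward that ranks winners above losers attains a smaller loss.
good_loss = bt_loss(lambda y, x: scores[y], data)
```

Only reward differences enter the loss, so adding any prompt-dependent constant to the reward leaves `bt_loss` unchanged.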

RL Fine-Tuning with Estimated Reward Function:

The goal here is to improve upon a given $\pi_{\text{ref}}(y|x)$ using a separate trainable model $\pi_\theta(y|x)$, the high-level desiderata being: (i) maximizing the previously-estimated reward function $\hat{r}_\phi(y, x)$ when following $\pi_\theta(y|x)$, while (ii) minimizing some measure of distance between $\pi_\theta(y|x)$ and $\pi_{\text{ref}}(y|x)$ to avoid overfitting merely to preference rewards. These objectives typically materialize through the minimization of

$$\ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, \hat{r}_\phi, \lambda\right) ~~:=~~ \mathbb{E}_{y \sim \pi_\theta(y|x),\, x \sim \mathcal{D}_x}\Big[-\hat{r}_\phi(y, x)\Big] + \lambda~ \mathbb{E}_{x \sim \mathcal{D}_x}\Big[\mathbb{KL}\big[\pi_\theta(y|x) \,||\, \pi_{\text{ref}}(y|x)\big]\Big], \tag{4}$$

where $\lambda > 0$ is a trade-off parameter. Although it is not differentiable, starting from an initialization such as $\pi_\theta = \pi_{\text{ref}}$, the loss $\ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, \hat{r}_\phi, \lambda\right)$ can be optimized over $\pi_\theta$ using various forms of RL [37, 34].
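To make the role of the trade-off parameter in (4) tangible, here is a minimal sketch that evaluates the loss exactly for a single prompt with a finite response set, so that both expectations reduce to sums. The function name `rlhf_loss` and all numbers are assumptions chosen for illustration only.

```python
import math

def rlhf_loss(pi_theta, pi_ref, reward, lam):
    """Loss (4) for one prompt with a finite response set."""
    # First term: negative expected reward under pi_theta.
    expected_reward = sum(p * reward[y] for y, p in pi_theta.items())
    # Second term: KL[pi_theta || pi_ref], skipping zero-probability responses.
    kl = sum(p * math.log(p / pi_ref[y])
             for y, p in pi_theta.items() if p > 0.0)
    return -expected_reward + lam * kl

pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}
reward = {"a": 0.0, "b": 1.0, "c": 2.0}
greedy = {"a": 0.0, "b": 0.0, "c": 1.0}  # all mass on the top-reward response

# Compare staying at pi_ref against the greedy policy for several lambdas.
comparison = {
    lam: (rlhf_loss(pi_ref, pi_ref, reward, lam),
          rlhf_loss(greedy, pi_ref, reward, lam))
    for lam in (0.1, 1.0, 10.0)
}
```

In this toy instance the greedy policy attains the lower loss when `lam` is small, whereas remaining at `pi_ref` is better when `lam` is large, which mirrors the balance between desiderata (i) and (ii).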

2.2 Direct Preference Optimization (DPO)

Consider now the reward-dependent RLHF loss $\ell_{\text{RLHF}}$ from (4) defined w.r.t. an arbitrary reward function $r(y, x)$. DPO [33] is based on the observation that, provided $\pi_\theta$ is sufficiently flexible such that we may treat it as an arbitrary function for optimization purposes (Footnote 3: This is a key assumption with non-trivial consequences; Section 3.3 will explore this issue in further detail.), the minimum of $\ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r, \lambda\right)$ w.r.t. $\pi_\theta$ can be directly computed as

$$\pi_r(y|x) ~:=~ \arg\min_{\pi_\theta} \ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r, \lambda\right) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left[\frac{1}{\lambda} r(y, x)\right], \tag{5}$$

where $Z(x) := \sum_y \pi_{\text{ref}}(y|x) \exp\left[\frac{1}{\lambda} r(y, x)\right]$ is the partition function ensuring that $\pi_r(y|x)$ forms a proper distribution [31, 32]. From here, assuming $\pi_{\text{ref}}(y|x) > 0$, we can rearrange (5) to equivalently establish that

$$r(y, x) = \lambda \log \frac{\pi_r(y|x)}{\pi_{\text{ref}}(y|x)} + \lambda \log Z(x). \tag{6}$$
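The relationship between (5) and (6) can be checked numerically on a small example. The sketch below, for a single prompt with a finite response set, builds the closed-form policy from a reward and then recovers that reward from the policy. The helper names `optimal_policy` and `implied_reward` and the toy values are assumptions introduced purely for illustration.

```python
import math

def optimal_policy(pi_ref, reward, lam):
    """Closed-form minimizer (5) together with its partition function Z."""
    unnorm = {y: pi_ref[y] * math.exp(reward[y] / lam) for y in pi_ref}
    z = sum(unnorm.values())
    return {y: u / z for y, u in unnorm.items()}, z

def implied_reward(pi_r, pi_ref, z, lam):
    """Rearrangement (6): reward expressed through the policy ratio and Z."""
    return {y: lam * math.log(pi_r[y] / pi_ref[y]) + lam * math.log(z)
            for y in pi_ref}

pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}   # requires pi_ref(y|x) > 0
reward = {"a": 0.0, "b": 1.0, "c": 2.0}
lam = 0.5

pi_r, z = optimal_policy(pi_ref, reward, lam)
recovered = implied_reward(pi_r, pi_ref, z, lam)
```

Here `recovered` matches `reward` up to floating-point error, and `pi_r` shifts probability mass toward the highest-reward response relative to `pi_ref`.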

Because thus far $r$ has remained unspecified, it naturally follows that these policy/reward relationships hold even for the ground-truth reward $r^*$ and the associated optimal policy $\pi^{**}(y|x) := \arg\min_{\pi_\theta} \ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r^*, \lambda\right)$. Hence instead of approximating $r^*(y, x)$ with $r_\phi(y, x)$ as in (2), we may equivalently approximate $\pi^{**}(y|x)$ with some $\pi_\theta(y|x)$ leading to the DPO loss

$$\ell_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda) ~:=~ \ell_{\text{BT}}\left(\lambda \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}\right) = \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}}\left[-\log \sigma\left(\lambda \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \lambda \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right], \tag{7}$$

noting that the partition function $Z(x)$ conveniently cancels out and can be excluded from further consideration. It is now possible to directly optimize (7) over $\pi_\theta$ using SGD without the need for any challenging RLHF procedure. The basic intuition here is that the parameterized policy $\pi_\theta$ induces an implicit reward $\lambda \log\left[\pi_\theta(y|x)\, \pi_{\text{ref}}^{-1}(y|x)\right]$ that is being optimized via the original BT preference model. Moreover this equivalence is exact assuming data distributed as in (1).
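A minimal sketch of how (7) might be evaluated on a batch is given below. It assumes that sequence-level log-probabilities under $\pi_\theta$ and $\pi_{\text{ref}}$ have already been computed for each winning and losing response; the tuple layout, the function name `dpo_loss`, and the numbers are hypothetical. A practical implementation would use a tensor library with automatic differentiation rather than plain Python floats.

```python
import math

def log_sigmoid(t):
    """Numerically stable log of the logistic function sigma(t)."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(batch, lam):
    """Empirical version of (7).

    Each batch item holds log-probabilities in the (assumed) order
    (logp_theta_w, logp_ref_w, logp_theta_l, logp_ref_l).
    """
    total = 0.0
    for lp_w, lr_w, lp_l, lr_l in batch:
        # Difference of implicit rewards lam * log(pi_theta / pi_ref).
        margin = lam * ((lp_w - lr_w) - (lp_l - lr_l))
        total -= log_sigmoid(margin)
    return total / len(batch)

# At initialization pi_theta = pi_ref, so every margin is zero.
batch_at_ref = [(-3.0, -3.0, -4.0, -4.0), (-2.5, -2.5, -2.0, -2.0)]
# A policy that raises winners and lowers losers relative to pi_ref.
batch_improved = [(-2.0, -3.0, -5.0, -4.0), (-2.0, -2.5, -3.0, -2.0)]

loss_at_ref = dpo_loss(batch_at_ref, lam=0.1)
loss_improved = dpo_loss(batch_improved, lam=0.1)
```

Because only the difference of log-ratios enters each term, any prompt-dependent constant such as $\lambda \log Z(x)$ added to both implicit rewards leaves the value unchanged, which is the cancellation noted above.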

2.3 Identity Preference Optimization (IPO)

Similar to DPO, the identity preference optimization (IPO) formulation [3] avoids both a 2-step learning process and cumbersome, potentially unstable RL training. To accomplish this, IPO is predicated on minimizing the original RLHF loss from (4) but with an alternative reward function. Specifically, the motivating IPO objective is to minimize $\ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r_{\text{IPO}}, \lambda\right)$, where

$$r_{\text{IPO}}(y, x) := \mathbb{E}_{y' \sim \pi_{\text{ref}}(y|x)}\big[p^*(y \succ y' | x, y, y')\big], \tag{8}$$

over $\pi_\theta$. (Footnote 4: Note that in principle the distribution used to draw samples $y'$ in defining $r_{\text{IPO}}$ need not be set to $\pi_{\text{ref}}$; however, in practice $\pi_{\text{ref}}$ is a typical choice, which we adopt throughout for simplicity.) Because of the special structure of this particular reward function, it turns out that it is possible to minimize $\ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r_{\text{IPO}}, \lambda\right)$ over $\pi_\theta$ without RL. In brief, this is accomplished by first noting that, for any pair of responses $y_1 \neq y_2$, the optimal IPO policy, denoted $\pi_{\text{IPO}}$, evaluated at these responses can be computed as a function of the reward $r_{\text{IPO}}$ using (5). Combining $y_1$ and $y_2$ dependent terms, after a few algebraic manipulations this then leads to the equivalence relation

$$\log\left[\frac{\pi_{\text{IPO}}(y_1|x)\,\pi_{\text{ref}}(y_2|x)}{\pi_{\text{IPO}}(y_2|x)\,\pi_{\text{ref}}(y_1|x)}\right] = \frac{1}{\lambda}\big[r_{\text{IPO}}(y_1,x) - r_{\text{IPO}}(y_2,x)\big]. \tag{9}$$
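The algebra behind (9) can be sketched as follows. This is only an outline under the assumption that (5) has the usual closed form of a KL-regularized optimum; the normalizer $Z(x)$ is a symbol introduced here for illustration and may be written differently in (5).

```latex
\begin{align*}
\pi_{\text{IPO}}(y|x)
  &= \frac{1}{Z(x)}\,\pi_{\text{ref}}(y|x)\,
     \exp\!\Big[\tfrac{1}{\lambda}\, r_{\text{IPO}}(y,x)\Big]
  && \text{(assumed form of (5))} \\
\log\frac{\pi_{\text{IPO}}(y_i|x)}{\pi_{\text{ref}}(y_i|x)}
  &= \tfrac{1}{\lambda}\, r_{\text{IPO}}(y_i,x) - \log Z(x),
  \qquad i \in \{1,2\} \\
\log\frac{\pi_{\text{IPO}}(y_1|x)}{\pi_{\text{ref}}(y_1|x)}
  - \log\frac{\pi_{\text{IPO}}(y_2|x)}{\pi_{\text{ref}}(y_2|x)}
  &= \tfrac{1}{\lambda}\big[r_{\text{IPO}}(y_1,x) - r_{\text{IPO}}(y_2,x)\big]
\end{align*}
```

The prompt-dependent normalizer cancels in the difference, which is why only policy ratios and reward differences remain in (9).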

However, unlike DPO, where an analogous expression is inverted to create an implicit reward for integration within the BT model, IPO instead attempts to approximate this equivalence relation by replacing the unknown $\pi_{\text{IPO}}(y|x)$ with some $\pi_\theta(y|x)$. Although technically $r_{\text{IPO}}$ is also unknown, given samples $\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}$, it is nicely shown in [3] that

$$\begin{aligned}
\ell_{\text{IPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda) &:= \mathbb{E}_{\{y_1,y_2\} \sim \pi_{\text{ref}}(y|x),\, x \sim \mathcal{D}}\left[\left(\log\left[\frac{\pi_\theta(y_1|x)\,\pi_{\text{ref}}(y_2|x)}{\pi_\theta(y_2|x)\,\pi_{\text{ref}}(y_1|x)}\right] - \frac{1}{\lambda}\big[r_{\text{IPO}}(y_1,x) - r_{\text{IPO}}(y_2,x)\big]\right)^2\right] \\
&= \mathbb{E}_{\{y_w,y_l,x\} \sim \mathcal{D}_{\text{tr}}}\left[\left(\log\left[\frac{\pi_\theta(y_w|x)\,\pi_{\text{ref}}(y_l|x)}{\pi_\theta(y_l|x)\,\pi_{\text{ref}}(y_w|x)}\right] - \frac{1}{2\lambda}\right)^2\right]
\end{aligned} \tag{10}$$

provided $\mathcal{D}_{\text{tr}}$ follows from (1). Note that this closed-form consistency is a direct consequence of how $r_{\text{IPO}}$ is defined in (8) and will not generally hold for other choices of the reward function. Regardless, it is straightforward to minimize $\ell_{\text{IPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$ in its present form via SGD as with DPO.
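As a minimal sketch of how the second line of (10) might be estimated on a batch, the NumPy function below takes per-example sequence log-probabilities under the trainable and reference policies. The function name `ipo_loss` and its argument names are hypothetical, and in actual training the same expression would be written in an autodiff framework so that SGD can be applied.

```python
import numpy as np

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, lam):
    """Batch estimate of the second line of (10).

    logp_w / logp_l: log pi_theta(y_w|x) and log pi_theta(y_l|x).
    ref_logp_w / ref_logp_l: the same quantities under pi_ref.
    lam: the trade-off parameter lambda (must be positive).
    """
    logp_w = np.asarray(logp_w, dtype=float)
    logp_l = np.asarray(logp_l, dtype=float)
    ref_logp_w = np.asarray(ref_logp_w, dtype=float)
    ref_logp_l = np.asarray(ref_logp_l, dtype=float)
    # log[ pi_theta(y_w|x) pi_ref(y_l|x) / (pi_theta(y_l|x) pi_ref(y_w|x)) ]
    log_ratio = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Squared deviation from the target margin 1 / (2 * lambda)
    return float(np.mean((log_ratio - 1.0 / (2.0 * lam)) ** 2))

# When pi_theta equals pi_ref the log-ratio is zero, so the loss is (1/(2*lam))^2.
loss_at_ref = ipo_loss([-3.0, -2.0], [-4.0, -1.0], [-3.0, -2.0], [-4.0, -1.0], lam=0.5)
# A policy whose log-ratio equals 1/(2*lam) attains zero loss.
loss_at_target = ipo_loss([-2.0], [-4.0], [-3.0], [-4.0], lam=0.5)
```

Here `loss_at_ref` evaluates the loss at the initialization $\pi_\theta = \pi_{\text{ref}}$, while `loss_at_target` shows that the squared penalty vanishes once the log-ratio reaches $1/(2\lambda)$.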

2.4 Flexible Quasi-Convex Generalizations

From the expressions above, it is clear that both DPO and IPO reduce to functions of $\log\left[\frac{\pi_\theta(y_w|x)\,\pi_{\text{ref}}(y_l|x)}{\pi_\theta(y_l|x)\,\pi_{\text{ref}}(y_w|x)}\right]$ and a tunable hyperparameter $\lambda$. As such, it is natural to consider extensions to broader choices in the form

$$\ell_{\text{QPO}}(\pi_\theta, \pi_{\text{ref}}, \psi, \mu, \lambda) := \mathbb{E}_{\{y_w,y_l,x\} \sim \mathcal{D}_{\text{tr}}}\; \psi\left(\mu\left[\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}\right] - \mu\left[\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right],\, \lambda\right), \tag{11}$$

where $\mu : \mathbb{R}^+ \rightarrow \mathbb{R}$ is a monotonically increasing function (which generalizes the logarithm), and the function $\psi : \mathbb{R} \times \mathbb{R}^+ \rightarrow \mathbb{R}$ influences the overall loss shape. We stipulate that $\psi$ is a differentiable quasi-convex function [20]; hence the chosen loss notation $\ell_{\text{QPO}}$ for quasi-convex preference optimization. By definition of quasi-convexity, $\psi$ monotonically increases to the right or left away from the minimum.

These specifications cover DPO and IPO as representative special cases, and include essentially all reasonable choices for a loss within this family, e.g., it is nonsensical to include multi-modal losses. The generalized preference optimization (GPO) [40] and $f$-DPO [41] frameworks are also special cases of QPO as defined herein. With GPO, $\mu$ is a logarithm and $\psi$ is chosen as an arbitrary convex function (such as used by SLiC [45]). Meanwhile, $f$-DPO involves $\psi(\cdot, \lambda) = -\log \sigma[\lambda(\cdot)]$, analogous to DPO but with $\mu = f'$, where $f'$ denotes the derivative of an $f$-divergence [36]; given that $f$ must be convex, its derivative will necessarily be monotonically increasing. In this way, the RLHF objective from (4) is still optimized via $f$-DPO, but with an $f$-divergence replacing the KL term.
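To make the role of $\psi$ and $\mu$ in (11) concrete, the following sketch treats both as plain Python callables. The names `qpo_loss`, `psi_dpo`, and `psi_ipo` are hypothetical; the DPO shape is taken to be $\psi(u, \lambda) = -\log \sigma(\lambda u)$ as stated above for $f$-DPO, and the IPO shape is read off from (10).

```python
import numpy as np

def qpo_loss(ratio_w, ratio_l, psi, mu, lam):
    """Batch estimate of (11).

    ratio_w: pi_theta(y_w|x) / pi_ref(y_w|x) for each example (positive).
    ratio_l: pi_theta(y_l|x) / pi_ref(y_l|x) for each example (positive).
    psi: callable psi(u, lam) giving the loss shape.
    mu: monotonically increasing callable applied to each ratio.
    """
    ratio_w = np.asarray(ratio_w, dtype=float)
    ratio_l = np.asarray(ratio_l, dtype=float)
    return float(np.mean(psi(mu(ratio_w) - mu(ratio_l), lam)))

def psi_dpo(u, lam):
    # -log sigmoid(lam * u) = log(1 + exp(-lam * u)), evaluated stably
    return np.logaddexp(0.0, -lam * u)

def psi_ipo(u, lam):
    # Squared distance of the margin u from 1 / (2 * lam), as in (10)
    return (u - 1.0 / (2.0 * lam)) ** 2

# With mu = log, both special cases act on the same log-ratio difference.
ones = [1.0, 1.0]
dpo_at_ref = qpo_loss(ones, ones, psi_dpo, np.log, lam=0.1)
ipo_at_ref = qpo_loss(ones, ones, psi_ipo, np.log, lam=0.5)
```

Swapping `np.log` for another increasing function would correspond to changing $\mu$, while swapping `psi_dpo` for any other quasi-convex shape changes $\psi$; the structure of the loss is otherwise identical.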

While overall quite general, we will nonetheless later demonstrate that any loss in the form of (11) will unavoidably be saddled with certain limitations. See also Appendix B for additional context w.r.t. very recent and/or concurrent DPO enhancements that lie outside the scope of our present work.

3 Comparative Analysis of Existing Approaches

We now turn to comparative analysis of existing approaches, which all have ties relating back to the BT preference model. Throughout this section we say that a policy $\pi^*$ is BT-optimal at prompt $x$ if $p^*(y_1 \succ y_2 | x)$ implies that $\pi^*(y_1|x) > \pi^*(y_2|x)$ for all response pairs $\{y_1, y_2\}$ with nonzero probability (as determined by the reference policy generating the preference data). Appendix F.1 introduces how $\pi^*$ can be formed.

3.1 Selective Preservation of Optimal Policies

Consider the following plausible scenario, variations of which are likely to occur (at least in varying degrees) with real-world data. Suppose the support of prompts generated by $\mathcal{D}_x$ partitions as $d_x^{\text{good}} \cup d_x^{\text{bad}}$, with $d_x^{\text{good}} \cap d_x^{\text{bad}} = \emptyset$. Furthermore, assume we have access to a reference policy $\pi_{\text{ref}}$ such that $\pi_{\text{ref}} = \pi^*$ for $x \in d_x^{\text{good}}$ and $\text{dist}[\pi_{\text{ref}}, \pi^*] \gg 0$ for $x \in d_x^{\text{bad}}$, where $\text{dist}[\cdot, \cdot]$ is an arbitrary distance measure. In other words, when evaluated w.r.t. a policy $\pi^*$ that proportionally reflects human preferences, $\pi_{\text{ref}}$ performs ideally on a subset of prompts but not on others.

This dichotomy provides a useful lens for examining certain loss function properties. In particular, we would like any policy that minimizes a candidate loss to preserve $\pi_{\text{ref}}$ for prompts $x \in d_x^{\text{good}}$, while pushing away from $\pi_{\text{ref}}$ towards $\pi^*$ for prompts $x \in d_x^{\text{bad}}$. However, because of uniform regularization effects intrinsic to the QPO loss, it is not actually possible to achieve even this modest objective.

Theorem 1

(Informal version)  Given the prompt partitioning, reference policy, and optimal policy described above, define $\hat{\pi}_\theta^{\text{QPO}} := \arg\min_{\pi_\theta} \ell_{\text{QPO}}(\pi_\theta, \pi_{\text{ref}}, \psi, \mu, \lambda)$ for any fixed selection of $(\psi, \mu, \lambda)$. Then under relatively mild assumptions on the labeled responses in $\mathcal{D}_{\text{tr}}$, if $\text{dist}[\hat{\pi}_\theta^{\text{QPO}}, \pi^*] < \text{dist}[\pi_{\text{ref}}, \pi^*]$ for $x \in d_x^{\text{bad}}$, then $\text{dist}[\hat{\pi}_\theta^{\text{QPO}}, \pi^*] > 0$ for $x \in d_x^{\text{good}}$.

The proof and formal version are provided in Appendix E.1, while Figure 1 (left) below provides an illustration. The somewhat unexpected implication here is that if we minimize any possible QPO loss in the form of (11) and improve the policy quality in areas where $\pi_{\text{ref}}$ performs poorly w.r.t. $\pi^*$, then it must also be the case that performance becomes worse in areas where $\pi_{\text{ref}}$ was originally optimal. This phenomenon represents an unavoidable trade-off when we restrict ourselves to using a QPO loss, of which DPO and IPO (as well as GPO and $f$-DPO) are special cases inheriting the same limitation. The core issue here is that QPO losses unselectively apply the same regularization, starting from the same initialization point, to both good and bad cases relative to $\pi^*$.

3.2 Interpolation Capabilities

As the underlying goal shared by all approaches is to balance proximity to a reference policy $\pi_{\text{ref}}$ with respect for the human preference model $p^*$, a non-negative trade-off parameter $\lambda \in [a, b]$ that allows for interpolating between these competing objectives is inevitable, where $a \in \mathbb{R}$ and $b \in \mathbb{R}$ are lower and upper bounds respectively. [Footnote 5: Depending on the method, if $a = 0$ or $b = \infty$ we may replace the $\lambda$ range with an open set.] In this section we examine more closely the nature of loss function minimizers as $\lambda$ is varied, zooming in on their behavior in the limit as $\lambda \rightarrow a$ and $\lambda \rightarrow b$. To this end, we first introduce the following definitions:

Definition 1

We say that an arbitrary preference optimization loss $\ell(\pi_\theta, \pi_{\text{ref}}, \lambda)$ satisfies the strong interpolation criteria (SIC) if the following conditions hold:

  1. $\lim_{\lambda \rightarrow a} \arg\min_{\pi_\theta} \ell(\pi_\theta, \pi_{\text{ref}}, \lambda) = \pi^*$;

  2. $\lim_{\lambda \rightarrow b} \arg\min_{\pi_\theta} \ell(\pi_\theta, \pi_{\text{ref}}, \lambda) = \pi_{\text{ref}}$;

  3. For all other $\lambda \in (a, b)$, the optimal policy interpolates between the above two extremes.

Definition 2

For any prompt $x$ and response $y$ define [Footnote 6: See Appendix F.1 for the derivation of the right-hand equality in (12).]

$$\pi^\delta(y|x) := \arg\max_{\pi_\theta} \mathbb{E}_{y \sim \pi_\theta(y|x)}\big[r^*(y,x)\big] = \begin{cases} 1 & \text{if } y = \arg\max_{y'} \pi^*(y'|x) \\ 0 & \text{otherwise.} \end{cases} \tag{12}$$

In this way, $\pi^\delta(y|x)$ assigns probability one to the mode of $\pi^*(y|x)$, i.e., akin to a delta function with no generation diversity. We then say that a loss $\ell(\pi_\theta, \pi_{\text{ref}}, \lambda)$ satisfies the weak interpolation criteria (WIC) analogously to the SIC, only for the lower bound we instead require $\lim_{\lambda \rightarrow a} \arg\min_{\pi_\theta} \ell(\pi_\theta, \pi_{\text{ref}}, \lambda) = \pi^\delta$.

In summary, the only difference between these interpolation criteria is their limiting behavior w.r.t. the lower bounding $\lambda$; for the SIC we approach the BT-optimal policy, while for the WIC we approach a degenerate policy with all probability mass restricted to the mode of the BT-optimal policy. We remark that both the SIC and WIC cannot be simultaneously satisfied unless $\pi^*$ itself is a degenerate delta function. We now explore how these distinctions are reflected in the behavior of DPO and IPO loss minimizers, with Figure 1 (middle) illustrating the basic concepts.

Proposition 1

Assume preference data distributed according to $\mathcal{D}_{\text{tr}}$ from (1), and that $p^*(y_1 \succ y_2 | x) \in (0, 1)$ for all responses with $\pi_{\text{ref}}(y|x) > 0$. Then the DPO loss from (7) satisfies the WIC (but not the SIC).
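The two WIC limits can be visualized on a toy problem. The sketch below assumes that the unconstrained minimizer has the KL-regularized form $\pi(y|x) \propto \pi_{\text{ref}}(y|x) \exp[r^*(y,x)/\lambda]$, which is consistent with the structure of (9), and that $a = 0$ and $b = \infty$. The three-response reference policy and reward values are hypothetical numbers chosen only for illustration.

```python
import numpy as np

def kl_regularized_policy(pi_ref, reward, lam):
    """Return pi(y) proportional to pi_ref(y) * exp(reward(y) / lam)."""
    logits = np.log(np.asarray(pi_ref, dtype=float)) + np.asarray(reward, dtype=float) / lam
    logits -= logits.max()          # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

pi_ref = np.array([0.5, 0.3, 0.2])   # hypothetical reference policy (full support)
reward = np.array([0.0, 1.0, 2.0])   # hypothetical r*(y, x) for three responses

# By (12), pi^delta places all mass on the reward-maximizing response.
pi_delta = np.zeros_like(pi_ref)
pi_delta[np.argmax(reward)] = 1.0

pi_small_lam = kl_regularized_policy(pi_ref, reward, lam=0.01)  # lambda near lower bound
pi_mid_lam = kl_regularized_policy(pi_ref, reward, lam=1.0)     # intermediate lambda
pi_large_lam = kl_regularized_policy(pi_ref, reward, lam=1e6)   # lambda near upper bound
```

In this toy setting `pi_small_lam` is numerically indistinguishable from `pi_delta` and `pi_large_lam` from `pi_ref`, while `pi_mid_lam` lies in between. Note that the small-$\lambda$ limit is the degenerate $\pi^\delta$ rather than a diverse BT-optimal $\pi^*$, which is the gap between the WIC and the SIC.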

In terms of practical applicability of this result, there exists one important caveat: the empirical distribution of a finite set of labeled preference data need not actually satisfy the conditions of Proposition 1. For example, suppose for each prompt $x \in \mathcal{D}_x$ we collect only two responses $\{y_1, y_2\}$ along with a single preference label $z$, which together produce the tuple $\{y_w, y_l, x\}$. In this scenario, which reflects certain publicly available human preference datasets [4, 18], the empirical distribution of preferences will be $p^*(y_w \succ y_l | x) = 1 \notin (0, 1)$ for all $x \in \mathcal{D}_x$. Notably, Proposition 1 will not hold, and in particular, it can be easily shown that minimizers of any valid $f$-DPO loss will be completely independent of $\pi_{\text{ref}}$ for all $\lambda \in (0, \infty)$; in other words, no interpolation occurs at all; see Appendix F.2 for the derivation. A similar observation specific to DPO (but not $f$-DPO) can be found in [1]. The fact that DPO-based solutions may still reflect $\pi_{\text{ref}}$ in practice, and more so as $\lambda$ increases, relates to implicit constraints and subtle regularization effects as discussed further in Section 3.3 and Appendix C.

Proposition 2

Assume preference data distributed according to $\mathcal{D}_{\text{tr}}$ from (1). Then the IPO loss from (10) satisfies the WIC (but not the SIC).

Comparing Proposition 2 with Proposition 1, we observe that IPO maintains its ability to interpolate under broader conditions than DPO, particularly in the empirical sampling regime involving binary probability values. That being said, neither DPO nor IPO satisfies the SIC, which motivates consideration of alternative losses that do, at least if our priority is to actually achieve the SIC (which of course may depend on the application scenario). For this purpose, it turns out that selections beyond the family of QPO objectives (which includes DPO, $f$-DPO, and IPO) are necessary per the following:
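The contrast between DPO and IPO in the binary-label regime can be illustrated with a one-dimensional toy computation. With a single labeled pair per prompt, each loss depends on the policy only through the scalar log-ratio $u = \log\left[\frac{\pi_\theta(y_w|x)\,\pi_{\text{ref}}(y_l|x)}{\pi_\theta(y_l|x)\,\pi_{\text{ref}}(y_w|x)}\right]$. The sketch below scans a hypothetical grid of $u$ values, taking the DPO shape to be $-\log \sigma(\lambda u)$; it is not the derivation from Appendix F.2.

```python
import numpy as np

lam = 0.5
u = np.linspace(-5.0, 20.0, 2501)        # candidate log-ratio values, step 0.01

dpo = np.logaddexp(0.0, -lam * u)        # -log sigmoid(lam * u)
ipo = (u - 1.0 / (2.0 * lam)) ** 2       # per-pair IPO loss from (10)

# DPO keeps decreasing as u grows, so on any grid its minimum sits at the edge.
dpo_strictly_decreasing = bool(np.all(np.diff(dpo) < 0))
u_dpo = float(u[np.argmin(dpo)])

# IPO has a finite minimizer at u = 1 / (2 * lam).
u_ipo = float(u[np.argmin(ipo)])
```

Because the DPO term has no finite minimizer in $u$, the winning response is pushed arbitrarily far above the losing one whatever $\pi_{\text{ref}}$ assigns, whereas the IPO minimizer `u_ipo` stays at the bounded offset $1/(2\lambda)$ from the reference log-ratio.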

Theorem 2

Assume preference data distributed according to $\mathcal{D}_{\text{tr}}$ from (1). Then no possible QPO loss from (11) will satisfy the SIC.

Section 4 will consider objectives outside of the QPO family which circumvent this limitation.

3.3 Impact of Optimization Constraints

Originally in [33], and later supported by follow-up analysis [3], it has been shown that minimizing the DPO loss $\ell_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$ is effectively the same as minimizing the RLHF loss $\ell_{\text{RLHF}}(\pi_\theta, \pi_{\text{ref}}, r^*, \lambda)$ with optimal reward model $r^*$. But there is a pivotal assumption underlying this association which previous analysis has not rigorously accounted for. Specifically, the key equalities that facilitate the DPO and IPO reparameterizations, namely (6) and (9) (and the analogue for $f$-DPO), are all predicated on the solution of an unconstrained optimization problem over an arbitrary policy $\pi_\theta$.

However, when actually training models in real-world settings, constraints will always exist, whether implicitly or explicitly. Such constraints stem from any number of factors including the model architecture/capacity limitations, early stopping, weight decay, drop-out regularization, machine precision, and so on. Hence in reality we are never exactly minimizing some preference loss $\ell(\pi_\theta, \pi_{\text{ref}}, \lambda)$ over any possible $\pi_\theta$ (as assumed by DPO, IPO, and $f$-DPO derivations). Instead, we must consider properties of the constrained problem $\min_{\pi_\theta \in \mathcal{S}_\pi} \ell(\pi_\theta, \pi_{\text{ref}}, \lambda)$, where $\mathcal{S}_\pi$ is a constraint set. For example, if we restrict training to a single epoch with a fixed learning rate, then $\mathcal{S}_\pi$ can be viewed as the set of all points reachable within a limited number of SGD updates.

Theorem 3

Let $\mathcal{S}_\pi$ denote a constraint set on the learnable policy $\pi_\theta$. Then we can have that

$$\arg\min_{\pi_\theta \in \mathcal{S}_\pi} \ell_{\text{RLHF}}\left(\pi_\theta, \pi_{\text{ref}}, r^*, \lambda\right) ~\neq~ \arg\min_{\pi_\theta \in \mathcal{S}_\pi} \ell_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda). \tag{13}$$

As can be observed by the proof in Appendix E.5, the difference between the two is akin to the difference between applying a constraint to a trainable policy with respect to either the forward or backward KL divergence, which are generally quite distinct [7]; see also Figure 1(right). There are several important consequences of this result worth considering:

  • As discussed in Section 3.2, the DPO-based losses can have degenerate unconstrained minimizers that completely ignore $\pi_{\text{ref}}$ on certain real-world datasets; therefore counter-measures like early stopping are imposed that effectively introduce an $\mathcal{S}_\pi$ that dramatically alters the estimated policy. But in doing so, the inequality from (13) is introduced and so we can no longer say that DPO provides an optimal implicit reward for the original RLHF problem, i.e., the original connection is now ambiguous.

  • As such, the value of DPO in practice (and indeed it often does work well) cannot be unreservedly attributed to its original affiliation with an optimal RLHF solution, and instead, should be evaluated based on properties of $\min_{\pi_\theta \in \mathcal{S}_\pi} \ell_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$. See Appendix C for one step in this direction.

  • To further illustrate the above points, in Appendix D we rederive the DPO loss from scratch based solely on a Gaussian estimation perspective that is completely unrelated to RLHF. But of course we do not actually believe that binary human preference data are really Gaussian. Instead, this exercise serves to highlight that what matters are properties of the underlying loss when deployed in practice, not necessarily the assumptions made in deriving the loss in the first place.

  • Other losses based on unconstrained RLHF-based reparameterizations in the $f$-DPO and IPO families may be similarly influenced by the inevitable introduction of policy constraints.
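As a small numerical illustration of why forward and backward KL divergences are generally distinct, the following sketch evaluates both directions on a pair of three-outcome distributions. The probability vectors and the helper name `kl` are illustrative assumptions and are not taken from our experiments; the sketch only shows the asymmetry, not the constrained minimizers from Theorem 3.

```python
import numpy as np

def kl(p, q):
    """KL[p || q] for strictly positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions (assumed values, chosen only for this example).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.3, 0.5])

forward_kl = kl(p, q)  # KL[p || q]
reverse_kl = kl(q, p)  # KL[q || p]
print(f"KL[p||q] = {forward_kl:.4f}, KL[q||p] = {reverse_kl:.4f}")
```

Because the two directions weight mismatches differently, minimizing one or the other over a restricted policy class will typically select different policies.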

Figure 1: Desiderata visualizations, including added context w.r.t. our proposed TYPO approach.

4 New Objectives for Human Preference Optimization

Motivated by the analysis in Section 3 and illustrated in Figure 1, we next examine alternative objective functions adhering to the following desiderata:

  1. Preservation: Capable of selectively preserving an optimal policy in ideal regimes, while simultaneously improving the policy in regions of poor performance (from Section 3.1);

  2. Interpolation: Smoothly interpolates between the BT-optimal policy and the reference policy, i.e., it achieves the SIC (from Section 3.2);

  3. Constraints: Independent of any derivation or required equivalence/reparameterization that no longer holds upon the introduction of constraints (from Section 3.3).

We label our new objective $\ell_{\text{TYPO}}$ to highlight the potential ability to “tame your preference optimization” (and “lower typos”) by explicitly targeting these desiderata.

4.1 TYPO Objective Function

Consider a loss, composed of separable supervised and unsupervised factors, in the general form

$$\begin{aligned} \ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda) ~:=~ & \ell_{\text{sup}}(\pi_\theta) ~+~ \lambda\, \ell_{\text{unsup}}(\pi_\theta, \pi_{\text{ref}}) ~= \\ & \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{tr}}\Big[ d_{\text{sup}}\big[\pi_\theta(y_w|x), \pi_\theta(y_l|x)\big] \Big] ~+~ \lambda\, \mathbb{E}_{y \sim \pi_{\text{ref}}(y|x),\, x \sim \mathcal{D}_x}\Big[ d_{\text{unsup}}\big[\pi_\theta(y|x), \pi_{\text{ref}}(y|x)\big] \Big], \end{aligned} \tag{14}$$

where $d_{\text{sup}}$ serves as a supervised penalty over labeled training tuples $(x, y_w, y_l)$ while $d_{\text{unsup}}$ represents an additional regularization term independent of labeled preferences. We remark that objectives in the form of (14) are natural candidates for SGD given that all sampling is independent of $\theta$, unlike the typical regularized loss adopted by RLHF, which requires samples from $\pi_\theta(y|x)$.

Supervised Term:

After first defining

$$p_\theta(z|y_1, y_2, x) := \begin{cases} \dfrac{\pi_\theta(y_1|x)}{\pi_\theta(y_1|x) + \pi_\theta(y_2|x)} & \text{if } z = 1 \\[2ex] \dfrac{\pi_\theta(y_2|x)}{\pi_\theta(y_1|x) + \pi_\theta(y_2|x)} & \text{if } z = 0 \end{cases} \tag{15}$$

we then consider the supervised term

$$\begin{aligned} \ell_{\text{sup}}(\pi_\theta) &= \mathbb{E}_{\{y_1, y_2\} \sim \pi_{\text{ref}}(y|x),\, x \sim \mathcal{D}_x}\Big[ \mathbb{KL}\big[ p^*(z|y_1, y_2, x) \,||\, p_\theta(z|y_1, y_2, x) \big] \Big] \\ &\equiv \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{tr}}\left[ \log\left( 1 + \frac{\pi_\theta(y_l|x)}{\pi_\theta(y_w|x)} \right) \right]. \end{aligned} \tag{16}$$

Please see Appendix F.3 for the derivation of this equivalence. Importantly, because the KL divergence is minimized iff $p^*(z|y_1, y_2, x) = p_\theta(z|y_1, y_2, x)$, unlike an arbitrary reward, the optimal solution to $\ell_{\text{sup}}(\pi_\theta)$ will necessarily recover the BT-optimal distribution as will be analyzed below.
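The supervised term can be sketched in a few lines once per-response log-probabilities are available. The snippet below is a minimal illustration, assuming the inputs are sequence-level log-probabilities $\log \pi_\theta(y|x)$; the function names `pairwise_pref_prob` and `typo_sup_loss` and the toy numbers are hypothetical and not part of any released implementation. It uses the identity $\log(1 + \pi_\theta(y_l|x)/\pi_\theta(y_w|x)) = -\log p_\theta(z=1|y_w, y_l, x)$, which follows directly from (15).

```python
import numpy as np

def pairwise_pref_prob(logp_1, logp_2):
    """p_theta(z=1 | y1, y2, x) = pi(y1|x) / (pi(y1|x) + pi(y2|x)).

    Computed from log-probabilities as sigmoid(logp_1 - logp_2),
    written via logaddexp for numerical stability.
    """
    return np.exp(-np.logaddexp(0.0, logp_2 - logp_1))

def typo_sup_loss(logp_w, logp_l):
    """Batch mean of log(1 + pi(y_l|x) / pi(y_w|x)), cf. the second line of (16)."""
    return float(np.mean(np.logaddexp(0.0, logp_l - logp_w)))

# Toy batch of two labeled pairs (assumed values, for illustration only).
logp_w = np.log(np.array([0.6, 0.3]))  # log pi_theta(y_w | x)
logp_l = np.log(np.array([0.3, 0.1]))  # log pi_theta(y_l | x)

loss = typo_sup_loss(logp_w, logp_l)
prob_first = pairwise_pref_prob(logp_w[0], logp_l[0])
print(loss, prob_first)
```

Note that, unlike DPO, this term involves no reference-policy ratios; only differences of $\log \pi_\theta$ enter.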

Unsupervised Term:

For the unsupervised term in (14) we simply adopt

$$\ell_{\text{unsup}}(\pi_\theta, \pi_{\text{ref}}) = \mathbb{E}_{x \sim \mathcal{D}_x}\Big[ \mathbb{KL}\big[ \pi_{\text{ref}}(y|x) \,||\, \pi_\theta(y|x) \big] \Big] \equiv -\mathbb{E}_{y \sim \pi_{\text{ref}}(y|x),\, x \sim \mathcal{D}_x}\Big[ \log \pi_\theta(y|x) \Big], \tag{17}$$

ignoring terms independent of $\pi_\theta$. Like (16), this expression also does not require sampling from $\pi_\theta$. That being said, (17) can exploit out-of-preference data (meaning unlabeled responses), and prior work [25] has argued for the merits of using such data in broader RLHF contexts. (It may also be reasonable to consider switching $\ell_{\text{unsup}}(\pi_\theta, \pi_{\text{ref}})$ to a reverse-KL term and optimize with REINFORCE per general observations from [1]; however, we do not pursue this direction further here.)
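A minimal sketch of the unsupervised term and of the combined objective from (14) is given below, under the assumption that log-probabilities $\log \pi_\theta(y|x)$ have already been computed for responses sampled from $\pi_{\text{ref}}$. The function names and toy values are hypothetical. The final lines check, on an assumed three-outcome example, that the forward KL and the cross-entropy differ only by the entropy of $\pi_{\text{ref}}$, which is the term independent of $\pi_\theta$ that (17) drops.

```python
import numpy as np

def typo_unsup_loss(logp_theta_on_ref_samples):
    """Monte Carlo estimate of -E_{y ~ pi_ref}[log pi_theta(y|x)], cf. (17)."""
    return float(-np.mean(logp_theta_on_ref_samples))

def typo_loss(logp_w, logp_l, logp_theta_on_ref_samples, lam):
    """Supervised term from (16) plus lam times the unsupervised term from (17)."""
    sup = float(np.mean(np.logaddexp(0.0, logp_l - logp_w)))
    return sup + lam * typo_unsup_loss(logp_theta_on_ref_samples)

# Toy inputs (assumed values, for illustration only).
logp_w = np.log(np.array([0.6, 0.3]))
logp_l = np.log(np.array([0.3, 0.1]))
logp_ref_samples = np.log(np.array([0.5, 0.5]))
total = typo_loss(logp_w, logp_l, logp_ref_samples, lam=0.1)

# Exact discrete check: KL[pi_ref || pi_theta] = cross-entropy - entropy(pi_ref).
pi_ref = np.array([0.5, 0.3, 0.2])
pi_theta = np.array([0.6, 0.3, 0.1])
kl_fwd = np.sum(pi_ref * np.log(pi_ref / pi_theta))
cross_entropy = -np.sum(pi_ref * np.log(pi_theta))
entropy_ref = -np.sum(pi_ref * np.log(pi_ref))
print(total, kl_fwd, cross_entropy - entropy_ref)
```

Since both expectations are taken over fixed data (labeled pairs and reference samples), standard minibatch SGD applies without any sampling from the policy being trained.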

4.2 $\ell_{\text{TYPO}}$ Properties

Notable attributes of $\ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$ w.r.t. the three desiderata from above are as follows:

Proposition 3

Under the same setup as Theorem 1, let $\hat{\pi}_\theta^{\text{TYPO}} := \arg\min_{\pi_\theta} \ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$, instantiated using (16) and (17). Then $\hat{\pi}_\theta^{\text{TYPO}} = \pi^*$ for all $x \in d_x^{good}$ including in cases where $\text{dist}[\hat{\pi}_\theta^{\text{TYPO}},~\pi^*] < \text{dist}[\pi_{\text{ref}},~\pi^*]$ for $x \in d_x^{bad}$.

Per this result, minimizers of $\ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$ are capable of preserving $\pi_{\text{ref}}$ in regions $d_x^{good}$ where performance is strong relative to $\pi^*$, while concurrently improving performance in other areas where it is not. Figure 1(left) visualizes this unique TYPO capability.

Proposition 4

The loss $\ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$, when instantiated using (16) and (17), satisfies the SIC.

Figure 1(middle) contrasts this property with the WIC achieved by prior methods. We also remark that none of the derivations used to motivate $\ell_{\text{TYPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda)$ rely on unconstrained optimization to form a reparameterized objective function as with DPO, $f$-DPO, and IPO. As such, the inevitable introduction of such constraints in practice does not compromise the TYPO origin story. In other words, since TYPO is not based on any implicit association with RLHF in the first place, adding constraints that might otherwise compromise such an association poses no issue.

5 Empirical Validation

Although more of an analysis-driven contribution, our core insights from Sections 3 and 4 can nonetheless benefit from empirical corroboration. To this end, we first present a series of experiments adapted from [3] to highlight aspects of TYPO behavior vis-à-vis our proposed desiderata. As the most relevant published points of reference, we contrast with DPO, IPO, and $f$-DPO; for the latter we choose the Jensen–Shannon divergence, which next to the reverse-KL implicitly assumed by DPO, performed well in prior experiments [41]. Later we test using the Anthropic Helpfulness and Harmlessness (HH) real-world preference dataset [4, 18]. For space considerations, some experiment details, including hyperparameters and training setups, are deferred to Appendix A.

Interpolation Tests:

As in [3] we consider the bandit setting with a discrete space of three responses/actions $\mathcal{Y} = \{y_a, y_b, y_c\}$ and create a dataset of labeled pairs as $\big\{\{y_a, y_b\}, \{y_b, y_c\}, \{y_a, y_c\}\big\}$, i.e., a total ordering consistent with the BT model. Preferences are assigned via $p(y_1 \succ y_2)$ computed using (55) with $\pi^*(y_a) = 0.6$, $\pi^*(y_b) = 0.3$, and $\pi^*(y_c) = 0.1$.
Furthermore, again following [3] we form our trainable policy as $\pi_\theta(y_i) = \text{softmax}[\theta_i]$ with $\theta \in \mathbb{R}^3$ optimized using Adam over each different preference loss. Results using a small $\lambda = 10^{-5}$ are shown in Figure 2, where we observe that TYPO closely converges to the BT-optimal solution, while DPO and IPO converge to $\pi^\delta$ (the mode of $\pi^*$) consistent with Propositions 1 (DPO), 2 (IPO), and 4 (TYPO), as well as Theorem 2 which applies to $f$-DPO. Additional interpolation results traversing different $\lambda$ towards the upper limit are presented in Appendix A.2.
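The TYPO side of this bandit experiment can be approximated with a short self-contained sketch. Several simplifications are assumed here and differ from the actual setup: plain gradient descent replaces Adam, the expectation over preference labels is taken exactly rather than from sampled pairs, the reference policy is set to uniform (a hypothetical choice), and the BT preference probability is taken to be $p^*(y_1 \succ y_2) = \pi^*(y_1) / (\pi^*(y_1) + \pi^*(y_2))$, which we assume matches (55). The learning rate and step count are likewise arbitrary.

```python
import numpy as np

pi_star = np.array([0.6, 0.3, 0.1])  # BT-optimal policy over (y_a, y_b, y_c)
pi_ref = np.full(3, 1.0 / 3.0)       # hypothetical uniform reference policy
pairs = [(0, 1), (1, 2), (0, 2)]     # {y_a, y_b}, {y_b, y_c}, {y_a, y_c}
lam = 1e-5                           # small lambda, as in the text

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def typo_grad(theta):
    """Exact gradient of the expected TYPO loss w.r.t. the softmax logits."""
    grad = np.zeros_like(theta)
    for i, j in pairs:
        # Assumed BT preference probability that y_i beats y_j.
        p_star = pi_star[i] / (pi_star[i] + pi_star[j])
        # Model probability from (15); with softmax logits it is a sigmoid.
        p_theta = 1.0 / (1.0 + np.exp(-(theta[i] - theta[j])))
        grad[i] += (p_theta - p_star) / len(pairs)
        grad[j] -= (p_theta - p_star) / len(pairs)
    # Gradient of -sum_y pi_ref(y) log softmax(theta)_y, i.e. the term (17).
    grad += lam * (softmax(theta) - pi_ref)
    return grad

theta = np.zeros(3)
for _ in range(5000):
    theta -= 1.0 * typo_grad(theta)  # plain gradient descent, step size 1.0

pi_hat = softmax(theta)
print(np.round(pi_hat, 4))
```

With $\lambda$ this small, the learned policy in this sketch lands very close to $\pi^*$ rather than collapsing onto its mode, in line with the interpolation behavior described above.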

Figure 2: Support for Sections 3.2 and 4.2 interpolation analysis. Dashed lines represent BT-optimal preference probabilities $\pi^*$, while solid lines are model learning curves for $\lambda = 10^{-5}$ (small). Only TYPO converges to $\pi^*$; the others converge to $\pi^\delta$.

Preservation Tests:

We next modify the setting from above to include two input prompts $\{x_g, x_b\}$ chosen such that $x_g \in d_x^{good}$ and $x_b \in d_x^{bad}$ sampled with equal probability. We then specify the corresponding response space $\mathcal{Y}(x_g) = \{y_{ga}, y_{gb}, y_{gc}\}$; $\mathcal{Y}(x_b) = \{y_{ba}, y_{bb}, y_{bc}\}$ and prompt-dependent probabilities (see Appendix A.1).
For the reference policy we set $\pi_{\text{ref}}(y|x_g) = \pi^*(y|x_g)$ and $\pi_{\text{ref}}(y|x_b) \neq \pi^*(y|x_b)$. We generate pair-wise preference data as before, only now with prompt-dependent responses. Results shown in Figure 3(left & middle) are in direct accordance with Theorem 1 and Proposition 3, whereby TYPO is the only approach that preserves a strong policy with prompt $x_g \in d_x^{good}$ while at the same time improving performance relative to $\pi_{\text{ref}}$ for $x_b \in d_x^{bad}$ over all $\lambda$.

Figure 3: Preservation tests varying $\lambda$ (left and middle plots); unlike TYPO, existing approaches are unable to both retain negligible error on the good cases while improving performance (over the dashed line representing the reference model) on the bad cases. Constraint test varying $\alpha$ and plotting $\text{dist}[\hat{\pi}_\theta^{\text{DPO}},~\hat{\pi}_\theta^{\text{RLHF}}]$ (right plot); DPO is no longer equivalent to RLHF with an optimal reward once an additional constraint/regularization factor is introduced.

Constraint Tests:

We probe the extent to which learning constraints can interfere with the equivalence between DPO and RLHF implemented with an optimal reward function. To this end, we adopted the same data generation setup as in the interpolation experiments from above. We then train policies to separately minimize the right- and left-hand sides of (13), but with one key modification: we added an identical penalty function $\alpha \|\pi_\theta\|_2^2$ to both models to instantiate weight decay (a typical form of constraint used in practice), where $\alpha \geq 0$ is a tunable hyperparameter. Figure 3(right) plots the distance ($y$-axis) between learned policies from RLHF and DPO as $\alpha$ is varied. Consistent with the original DPO derivations and analysis from [33], we observe negligible error when $\alpha = 0$ given that unconstrained DPO is explicitly designed to mimic RLHF with an optimal reward $r^*$. However, in accordance with our Theorem 3, as $\alpha > 0$ increases, the distance between RLHF and DPO grows considerably, and their relationship is no longer clear-cut.

Figure 4: Real-world example.

Testing on Anthropic HH Preference Data:

Finally, to explore TYPO capabilities in a real-world scenario, we train a Pythia 2.8B model [6] on the Anthropic Helpfulness and Harmlessness (HH) preference dataset [4, 18] as previously used in [33]. Following their settings, we first execute supervised fine-tuning (SFT) on the Pythia model using $y_w$ values as the target response. We then use this SFT model as $\pi_{\text{ref}}$ for training DPO, IPO and TYPO. Given that alignment results (our focus) from [41] already show that reverse KL (i.e., DPO) works best among $f$-divergences, we do not compare with other $f$-DPO selections here. We use GPT-4 to evaluate the win rate of the generated responses from each model against the chosen $y_w$ on the test set for single turn dialogues. We emphasize that our comparisons cover both helpfulness and harmlessness (see Appendix A.3), whereas the original DPO paper [33] only tests the former.

6 Conclusions

In this work we have proposed multiple desiderata that existing methodology for human preference optimization does not satisfy and yet our proposed TYPO approach does.

References

  • [1] Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Ahmet Üstün, and Sara Hooker. Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms. arXiv preprint arXiv:2402.14740, 2024.
  • [2] Afra Amini, Tim Vieira, and Ryan Cotterell. Direct preference optimization with an offset. arXiv preprint arXiv:2402.10571, 2024.
  • [3] Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pages 4447–4455. PMLR, 2024.
  • [4] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • [5] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022.
  • [6] Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023.
  • [7] C.M. Bishop. Pattern recognition and machine learning. Springer, New York, 2006.
  • [8] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • [9] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with gpt-4, 2023.
  • [10] Emmanuel J Candes, Michael B Wakin, and Stephen P Boyd. Enhancing sparsity by reweighted l1 minimization. Journal of Fourier analysis and applications, 14:877–905, 2008.
  • [11] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024.
  • [12] Rick Chartrand and Wotao Yin. Iteratively reweighted algorithms for compressive sensing. International Conference on Acoustics, Speech, and Signal Processing, 2008.
  • [13] Yichen Chen, Dongdong Ge, Mengdi Wang, Zizhuo Wang, Yinyu Ye, and Hao Yin. Strong NP-hardness for sparse optimization with concave penalty functions. In International Conference on Machine Learning, 2017.
  • [14] Bin Dai, Chen Zhu, Baining Guo, and David Wipf. Compressing neural networks using the variational information bottleneck. In International Conference on Machine Learning, pages 1135–1144. PMLR, 2018.
  • [15] Jianqing Fan and Runze Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • [16] Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. Towards analyzing and understanding the limitations of dpo: A theoretical perspective. arXiv preprint arXiv:2404.04626, 2024.
  • [17] Isabel O Gallegos, Ryan A Rossi, Joe Barrow, Md Mehrab Tanjim, Sungchul Kim, Franck Dernoncourt, Tong Yu, Ruiyi Zhang, and Nesreen K Ahmed. Bias and fairness in large language models: A survey. arXiv preprint arXiv:2309.00770, 2023.
  • [18] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
  • [19] Alexey Gorbatovski, Boris Shaposhnikov, Alexey Malakhov, Nikita Surnachev, Yaroslav Aksenov, Ian Maksimov, Nikita Balagansky, and Daniil Gavrilov. Learn your reference model for real good alignment. arXiv preprint arXiv:2404.09656, 2024.
  • [20] Harvey Greenberg and William Pierskalla. A review of quasi-convex functions. Operations research, 19(7):1553–1570, 1971.
  • [21] Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024.
  • [22] Shawn Im and Yixuan Li. Understanding the learning dynamics of alignment with human feedback. arXiv preprint arXiv:2403.18742, 2024.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
  • [25] Ziniu Li, Tian Xu, and Yang Yu. Policy optimization in rlhf: The impact of out-of-preference data. arXiv preprint arXiv:2312.10584v2, 2024.
  • [26] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, 
Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. 
Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2024.
  • [27] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • [28] Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024.
  • [29] Jason Palmer. Relative convexity. UC San Diego Technical Report, 2003.
  • [30] Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159, 2024.
  • [31] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • [32] Jan Peters and Stefan Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning, pages 745–750, 2007.
  • [33] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • [34] Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, and Yejin Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. arXiv preprint arXiv:2210.01241, 2022.
  • [35] B.D. Rao, K. Engan, S. F. Cotter, J. Palmer, and K. Kreutz-Delgado. Subset selection in noise based on diversity measure minimization. IEEE Trans. Signal Processing, 51(3):760–770, March 2003.
  • [36] Paul Rubenstein, Olivier Bousquet, Josip Djolonga, Carlos Riquelme, and Ilya O Tolstikhin. Practical and consistent estimation of f-divergences. Advances in Neural Information Processing Systems, 32, 2019.
  • [37] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [38] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback, 2020. URL https://arxiv. org/abs, 2009.
  • [39] Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, and Aviral Kumar. Preference fine-tuning of llms should leverage suboptimal, on-policy data. arXiv preprint arXiv:2404.14367, 2024.
  • [40] Yunhao Tang, Zhaohan Daniel Guo, Zeyu Zheng, Daniele Calandriello, Rémi Munos, Mark Rowland, Pierre Harvey Richemond, Michal Valko, Bernardo Ávila Pires, and Bilal Piot. Generalized preference optimization: A unified approach to offline alignment. arXiv preprint arXiv:2402.05749, 2024.
  • [41] Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. International Conference on Learning Representations, 2024.
  • [42] David Wipf and Srikantan Nagarajan. Iterative reweighted $\ell_1$ and $\ell_2$ methods for finding sparse solutions. Journal of Selected Topics in Signal Processing (Special Issue on Compressive Sensing), 4(2), 2010.
  • [43] David Wipf and Haichao Zhang. Revisiting Bayesian blind deconvolution. Journal of Machine Learning Research (JMLR), 2014.
  • [44] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models. arXiv preprint arXiv:2303.18223, 2023.
  • [45] Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J Liu. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425, 2023.
  • [46] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix A Additional Experimental Details and Results

This section describes experiment details/settings and additional results.

A.1 Details of the Tests with Synthetic Data

  • For the tests of interpolation, preservation and constraints, we train the models with the Adam optimizer [23] and clip gradients to a maximum norm of 10. We run these tests on a single A10 GPU. Unless otherwise mentioned, we use a batch size of 1.

  • For the interpolation tests, we use a batch size of 20 and choose $\pi_{\text{ref}}(y_a) = 0.4$, $\pi_{\text{ref}}(y_b) = 0.4$, and $\pi_{\text{ref}}(y_c) = 0.2$. We use a learning rate of $10^{-3}$ for DPO, IPO and $f$-DPO and $5 \times 10^{-4}$ for TYPO; we train DPO, IPO and TYPO for 1,000 epochs and $f$-DPO for 3,000 epochs as it converges more slowly.

  • For the preservation test, we choose

    $$
    \begin{aligned}
    \mathcal{Y}(x_g) &= \{y_{ga}, y_{gb}, y_{gc}\}; \quad \mathcal{Y}(x_b) = \{y_{ba}, y_{bb}, y_{bc}\} \\
    \pi^*(y_{ga}|x_g) &= 0.6; \quad \pi^*(y_{gb}|x_g) = 0.3; \quad \pi^*(y_{gc}|x_g) = 0.1; \\
    \pi^*(y_{ba}|x_b) &= 0.4; \quad \pi^*(y_{bb}|x_b) = 0.2; \quad \pi^*(y_{bc}|x_b) = 0.4.
    \end{aligned}
    \tag{18}
    $$

    For the reference model we select $\pi_{\text{ref}}(y_{ba}|x_b) = 0.6$, $\pi_{\text{ref}}(y_{bb}|x_b) = 0.2$ and $\pi_{\text{ref}}(y_{bc}|x_b) = 0.2$. We randomly sample examples for good and bad prompts respectively. The model parameters are $\theta \in \mathbb{R}^{2 \times 3}$ and we set the values of $x_g$ and $x_b$ to the vectors $[1, 0]$ and $[0, 1]$, respectively.

  • In the constraint test, we use the same setting and data as the interpolation test. We use $\beta = 0.1$ for both RLHF and DPO and train them for 100 epochs for all the values of $\alpha$.
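The preservation setup above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the policy is a softmax over the logits $x^\top \theta$ (so that, with one-hot prompts, each row of $\theta$ holds the logits for one prompt) and that preference labels are drawn from a Bradley-Terry model on $\pi^*$. Neither choice is spelled out in this appendix, and the names `policy` and `sample_preference_pair` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy(theta, x):
    """pi_theta(.|x) = softmax(x @ theta), theta of shape 2x3 (assumed parameterization)."""
    logits = [sum(x[i] * theta[i][j] for i in range(len(x)))
              for j in range(len(theta[0]))]
    return softmax(logits)

# Prompt encodings and BT-optimal policies taken from the text, eq. (18).
X_GOOD, X_BAD = [1.0, 0.0], [0.0, 1.0]
PI_STAR = {"good": [0.6, 0.3, 0.1], "bad": [0.4, 0.2, 0.4]}
PI_REF_BAD = [0.6, 0.2, 0.2]

def sample_preference_pair(pi_star, rng):
    """Pick two distinct responses, then label the winner with a BT model (assumption)."""
    a, b = rng.sample(range(len(pi_star)), 2)
    p_a_wins = pi_star[a] / (pi_star[a] + pi_star[b])
    return (a, b) if rng.random() < p_a_wins else (b, a)

# Rows of log-probabilities reproduce a target distribution for each prompt.
# Row 0 uses pi* for the good prompt purely as an illustration; row 1 is the
# stated reference distribution for the bad prompt.
theta = [[math.log(p) for p in PI_STAR["good"]],
         [math.log(p) for p in PI_REF_BAD]]

print(policy(theta, X_BAD))
print(sample_preference_pair(PI_STAR["bad"], random.Random(0)))
```

Because the prompts are one-hot, updating the row for one prompt leaves the distribution for the other prompt untouched, which is convenient for checking whether a loss preserves the policy on one prompt while improving it on the other.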

A.2 Additional Results with Synthetic Data

We conduct additional experiments for the interpolation test by varying $\lambda$ from very small to very large values as shown in Figure 5 and Figure 6.

Figure 5: Converged probability distributions of $\pi_\theta(y)$ for DPO, IPO, $f$-DPO and TYPO with large $\lambda$. All methods stabilize around $\pi_{\text{ref}}$ as expected.

Figure 6: Interpolation of converged probability distributions $\pi_\theta(y)$ for DPO, IPO and TYPO across varying $\lambda$. As $\lambda$ becomes small, only TYPO converges to the BT-optimal policy $\pi^*$. The others converge to the mode of the optimal policy, consistent with expectations. Meanwhile, as $\lambda$ grows, all methods converge to $\pi_{\text{ref}}$.

A.3 Details of Experiments on Anthropic HH Dataset

We train the SFT model for 2 epochs and all other models for 1 epoch, with a learning rate of $10^{-6}$ and a batch size of 40. We set $\beta = 0.1$ for DPO, $\tau = 0.1$ for IPO and $\lambda = 0.05$ for TYPO. We evaluate the win rate on the single turn dialogues in the test set with GPT-4, using a modified version of the prompt from the DPO paper [33] that also covers harmlessness examples, as shown in Figure 7. All experiments are conducted on an 8×A100 40G GPU instance.

For the training of TYPO, we first sample responses from the reference model, i.e., the SFT model, for the unsupervised term. We use vLLM [24] to randomly sample responses for prompts from the Anthropic HH dataset, setting temperature=1, top_k=60, top_p=0.8, max_tokens=256 and repetition_penalty=1.1. During training, we use one sampled response per prompt in the unsupervised term.
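As a rough illustration of this sampling step, the sketch below passes the stated decoding settings to vLLM's `SamplingParams`. The function name `sample_reference_responses`, the `model_path` argument and the choice of `n=1` (one response per prompt, as described above) are assumptions for illustration; the actual pipeline used for the experiments is not given in the text.

```python
# Decoding settings as listed in the text.
SAMPLING_KWARGS = dict(
    temperature=1.0,
    top_k=60,
    top_p=0.8,
    max_tokens=256,
    repetition_penalty=1.1,
)

def sample_reference_responses(model_path, prompts):
    """Sample one response per prompt from the SFT (reference) model.

    `model_path` is a hypothetical path to the SFT checkpoint. vLLM is
    imported lazily so the settings above can be inspected without it.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(n=1, **SAMPLING_KWARGS)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput holds a list of completions; n=1 gives exactly one.
    return [out.outputs[0].text for out in outputs]
```

A call such as `sample_reference_responses("path/to/sft-model", prompts)` would then return one sampled string per prompt for use in the unsupervised term.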

Figure 7: Prompt used to evaluate the win rate of the generated responses against the chosen responses for single turn dialogues on the test set of the Anthropic HH dataset.

Appendix B Extended Related Work

There has been a flurry of interesting recent work on DPO-related topics, with numerous papers appearing on arXiv not long before the NeurIPS deadline. In this section we call attention to several notable examples that propose modifications of the original DPO paradigm, or else provide relevant analysis of its properties. We believe these efforts to be complementary to our contribution, as well as to the existing DPO-like extensions by others discussed in the main body of the paper.

Algorithmic Enhancements to DPO:

There exist multiple DPO extensions that involve supplementing the original loss from (7) with additional penalty factors targeting potential failure modes. For example, based on the observation that DPO may exhibit a decrease in accuracy when applied to preference data with small edit distances between responses, the Smaug framework [28] augments the DPO loss with an additional factor designed to maintain high log-likelihoods in such cases. Meanwhile, sensitivity to response lengths is investigated in [30], where, as a countermeasure, the DPO loss is supplemented with a penalty on length differences between winning and losing responses. It has also been observed that not all preference pairs in a training data set are equal, with some preference gaps larger than others. As a mitigation strategy for this discrepancy, the ODPO approach [2] introduces a preference offset term during model training. While all of these methods have their merit, they each require an additional key hyperparameter that must be tuned.

Somewhat differently, the ORPO algorithm [21] proposes an alternative to DPO that combines an odds ratio-based penalty with a conventional negative log-likelihood SFT (i.e., supervised fine-tuning) loss. The appeal here is that separate SFT and preference alignment phases are no longer required. Another deviation from DPO is proposed in [19], whereby the reference policy itself is no longer fixed, but iteratively updated during training.

Analysis of DPO:

Topics addressed by recent work include analysis of DPO learning dynamics [22], the impact of out-of-preference data on estimation errors [25], and the disproportionate rates with which the DPO loss gradients favor reducing the probability of dispreferred responses relative to increasing the probability of desired responses [16]. Broader consideration of preference optimization spanning various DPO-based and RLHF-based approaches is presented in [39].

Appendix C DPO Loss Induces Noise Adaptive Regularization

Using several straightforward algebraic manipulations, the DPO loss from (7) can be rewritten as

$$
\begin{aligned}
\ell_{\text{DPO}}(\pi_\theta, \pi_{\text{ref}}, \lambda) &= \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}}\left[-\log \sigma\left(\lambda \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \lambda \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right] \\
&\equiv \mathbb{E}_{\{y_w, y_l, x\} \sim \mathcal{D}_{\text{tr}}}\left[\log\left(\left[\frac{\pi_{\text{ref}}(y_l|x)}{\pi_{\text{ref}}(y_w|x)}\right]^{\lambda} + \left[\frac{\pi_\theta(y_l|x)}{\pi_\theta(y_w|x)}\right]^{\lambda}\right)\right],
\end{aligned}
\tag{19}
$$

excluding constants independent of $\pi_\theta$. This expression represents an expectation over a regularization factor of the form $\log(\gamma + u)$, where $\gamma$, corresponding to $\left[\frac{\pi_{\text{ref}}(y_l|x)}{\pi_{\text{ref}}(y_w|x)}\right]^{\lambda}$, is fixed, and $u$, corresponding to $\left[\frac{\pi_\theta(y_l|x)}{\pi_\theta(y_w|x)}\right]^{\lambda}$, is the variable of interest to be optimized. We will now examine several notable properties of $\log(\gamma + u)$ that serve to elucidate underappreciated DPO regularization characteristics. For this purpose, we first introduce the following definition from [29]:

Definition 3

Let $f$ be a strictly increasing differentiable function on the interval $[a, b]$. Then the differentiable function $g$ is concave relative to $f$ on $[a, b]$ iff

$$g(u_2) \leq g(u_1) + \frac{g'(u_1)}{f'(u_1)}\left[f(u_2) - f(u_1)\right], \tag{20}$$

where $g'$ and $f'$ denote the respective derivatives.
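Both ingredients introduced so far can be checked numerically. The Python sketch below first compares the two forms of the per-pair loss in (19), which should differ only by the constant $\log \gamma$ since $-\log \sigma(z) = \log(1 + e^{-z})$ and $e^{-z} = u / \gamma$. It then tests inequality (20) on a grid for the pair $g(u) = \log(\gamma_1 + u)$, $f(u) = \log(\gamma_2 + u)$. The function names, the probability values and the grid are illustrative assumptions, not taken from the paper.

```python
import math

def dpo_pair_loss(pi_w, pi_l, ref_w, ref_l, lam):
    """-log sigma(lam*log(pi_w/ref_w) - lam*log(pi_l/ref_l)) for one pair."""
    z = lam * (math.log(pi_w / ref_w) - math.log(pi_l / ref_l))
    return math.log1p(math.exp(-z))

def log_penalty(pi_w, pi_l, ref_w, ref_l, lam):
    """Return log(gamma + u) and gamma, with gamma and u as defined after (19)."""
    gamma = (ref_l / ref_w) ** lam
    u = (pi_l / pi_w) ** lam
    return math.log(gamma + u), gamma

def relative_concavity_holds(gamma_g, gamma_f, grid, tol=1e-12):
    """Test inequality (20) for g(u)=log(gamma_g+u), f(u)=log(gamma_f+u) on all grid pairs."""
    for u1 in grid:
        slope = (gamma_f + u1) / (gamma_g + u1)  # g'(u1) / f'(u1)
        for u2 in grid:
            lhs = math.log(gamma_g + u2)
            rhs = math.log(gamma_g + u1) + slope * (
                math.log(gamma_f + u2) - math.log(gamma_f + u1))
            if lhs > rhs + tol:
                return False
    return True

# Arbitrary illustrative probabilities for a single preference pair.
penalty, gamma = log_penalty(0.5, 0.2, 0.3, 0.4, lam=0.7)
gap = dpo_pair_loss(0.5, 0.2, 0.3, 0.4, lam=0.7) - (penalty - math.log(gamma))
print(abs(gap) < 1e-12)

grid = [0.05 * k for k in range(101)]  # u in [0, 5]
print(relative_concavity_holds(0.1, 1.0, grid))  # smaller gamma plays the role of g
print(relative_concavity_holds(1.0, 0.1, grid))  # roles swapped
```

On this grid the inequality is expected to hold when the smaller $\gamma$ is assigned to $g$ and to fail when the roles are swapped. This matches the fact that $g \circ f^{-1}(t) = \log(e^t + \gamma_1 - \gamma_2)$ has second derivative proportional to $\gamma_1 - \gamma_2$.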

Intuitively, this definition indicates that if $g$ is concave relative to $f$, it has greater curvature at any evaluation point $u$ after normalizing (via an affine transformation of $f$ or $g$) such that $g(u) = f(u)$ and $g'(u) = f'(u)$. Equipped with this definition, we then point out the following observations linking DPO with prior work on robust estimation in the presence of noise:

  • $\log(\gamma + u)$ is a concave non-decreasing function of $u \in [0, \infty)$, which represents a well-known characteristic of sparsity-favoring penalty factors commonly used in robust estimation [12, 13, 15, 35]. (Footnote 7: Most prior work involves parameters that can be negative, which can be accommodated by simply replacing $u$ with $|u|$.) Such penalties introduce a steep gradient around zero, but then flatten away from zero to avoid incurring significant additional loss (as would occur, for example, with a common quadratic loss).

  • For any $\gamma_1 < \gamma_2$, $\log(\gamma_1 + u)$ is concave relative to $\log(\gamma_2 + u)$ per Definition 3. Figure 8 illustrates this phenomenon by contrasting with two extremes producing the convex $\ell_1$ norm and the non-convex $\ell_0$ norm.

  • Prior work [10, 42] has investigated general optimization problems of the form

    $$\min_{\{u_i\}\in\mathcal{S}_u}\sum_i\log(\gamma+|u_i|), \tag{21}$$

    sometimes generalized to $\min_{\{u_i\}\in\mathcal{S}_u}\sum_i f(|u_i|,\gamma)$ over a concave, non-decreasing function $f$ of $|u_i|$, where $\mathcal{S}_u$ is some constraint set. (Footnote 8: In some applications the constraint set may be replaced by an additional regularization factor, and there is often an equivalency between the two.) Moreover, $\gamma$ reflects a noise parameter or an analogous measure of uncertainty, with relative concavity dictated by $\gamma$ as above. In these contexts, it has been argued that adjusting the curvature of the regularization factor based on noise levels can provide additional robustness to bad local minima and high noise regimes [10, 14, 43]. The basic intuition here is that when noise is high, a more convex shape is preferable, while when the noise is low, a more concave alternative may be appropriate.

  • Regarding DPO, it is natural to treat $\left[\frac{\pi_{\text{ref}}(y_l|x)}{\pi_{\text{ref}}(y_w|x)}\right]^{\lambda}$ as an analogous noise factor, given that whenever this ratio is large, it implies that our reference policy is poor. Hence, once we introduce a constraint $\mathcal{S}_\pi$ on $\pi_\theta$ (as will always occur in practice; see Section 3.3), solving

    $$\min_{\pi_\theta\in\mathcal{S}_\pi}\ell_{\text{DPO}}(\pi_\theta,\pi_{\text{ref}},\lambda) \tag{22}$$

    can be viewed as a special case of (21), involving a robust regularization factor with noise-adaptive curvature.
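The relative concavity claim for $\log(\gamma_1+u)$ versus $\log(\gamma_2+u)$ can be checked numerically. The sketch below assumes that Definition 3 takes the usual first-order form $g(u_2)\leq g(u_1)+\frac{g'(u_1)}{f'(u_1)}\left[f(u_2)-f(u_1)\right]$; this form is our assumption, since only the tail of the definition is restated here. The function names are illustrative and do not come from any released code.

```python
import math


def normalized_curvature(u, gamma):
    """Return -g''(u)/g'(u) for g(u) = log(gamma + u).

    Here g'(u) = 1/(gamma + u) and g''(u) = -1/(gamma + u)^2,
    so the ratio simplifies to 1/(gamma + u).
    """
    return 1.0 / (gamma + u)


def first_order_gap(u1, u2, gamma_small, gamma_large):
    """Slack in the assumed first-order relative-concavity inequality.

    With g(u) = log(gamma_small + u) and f(u) = log(gamma_large + u),
    this returns g(u1) + g'(u1)/f'(u1) * [f(u2) - f(u1)] - g(u2),
    which should be non-negative if g is concave relative to f.
    """
    g1 = math.log(gamma_small + u1)
    g2 = math.log(gamma_small + u2)
    f1 = math.log(gamma_large + u1)
    f2 = math.log(gamma_large + u2)
    slope = (gamma_large + u1) / (gamma_small + u1)  # g'(u1) / f'(u1)
    return g1 + slope * (f2 - f1) - g2


if __name__ == "__main__":
    grid = [0.25 * k for k in range(0, 21)]  # u in [0, 5]
    worst = min(first_order_gap(a, b, 0.1, 1.0) for a in grid for b in grid)
    print("smallest slack on grid:", worst)
```

Under this assumed form, a smaller $\gamma$ gives a larger normalized curvature $1/(\gamma+u)$ at every $u\geq 0$, which matches the intuition that the penalty becomes more concave as the noise factor shrinks.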

Figure 8: Visualization of different penalty factors associated with the DPO loss. When $\gamma\rightarrow 0$, $\log(\gamma+|u|)\rightarrow\log|u|=\lim_{p\rightarrow 0}\frac{1}{p}\left[|u|^{p}-1\right]\propto\mathbb{I}[u\neq 0]$, mimicking an $\ell_0$ norm (red curve) w.r.t. relative concavity (if $u\geq 0$, as with DPO, the absolute value can be removed, but we nonetheless include the general case here). In contrast, $\lim_{\gamma\rightarrow\infty}\gamma\log(\gamma+|u|)=|u|$ (after subtracting the $u$-independent term $\gamma\log\gamma$), reflecting the relative concavity of the convex $\ell_1$ norm (green curve). Note that in both limiting cases, affine transformations do not impact relative concavity. For a fixed $\gamma$ value, the relative concavity of $\log(\gamma+|u|)$ lies within these two extremes.

Appendix D DPO from a Naive Gaussian Estimation Perspective

Any preference probability given by the BT model in (2) can be equivalently re-expressed as

$$p^{*}(y_1\succ y_2|x)=\mu\left[\frac{\pi^{*}(y_2|x)}{\pi^{*}(y_1|x)}\right], \tag{23}$$

where $\pi^{*}(y|x)$ is a conditional probability of $y$ given $x$ (i.e., the BT-optimal policy introduced in Section 3) and $\mu:\mathbb{R}\rightarrow[0,1]$ is a monotonically increasing function. While we may optionally choose $\mu$ to exactly reproduce the BT model, it is of course reasonable to consider other monotonically increasing choices to explore the additional generality of (23) (and indeed we will exploit one such alternative choice below).

Given a trainable policy $\pi_\theta$, we can always minimize the negative log-likelihood $-\log\mu\left[\frac{\pi_\theta(y_2|x)}{\pi_\theta(y_1|x)}\right]$ averaged over preference samples $\{y_w,y_l,x\}\sim\mathcal{D}_{tr}$ to approximate $p^{*}(y_1\succ y_2|x)$; however, this procedure would be completely independent of any regularization effects of a reference policy $\pi_{\text{ref}}$. We now examine how to introduce the reference policy by relying only on a simple Gaussian model with trainable variances, rather than any association with RLHF or implicit reward modeling. The end result is an independent re-derivation of DPO using basic Gaussian assumptions.

For convenience, we first define the functions $\xi_\theta$ and $\xi_{\text{ref}}$ as

$$\xi_\theta(y_1,y_2,x):=\mu\left[\frac{\pi_\theta(y_2|x)}{\pi_\theta(y_1|x)}\right],\qquad \xi_{\text{ref}}(y_1,y_2,x):=\mu\left[\frac{\pi_{\text{ref}}(y_2|x)}{\pi_{\text{ref}}(y_1|x)}\right]. \tag{24}$$

Now suppose we assume the naive joint distribution given by

$$p\left(\begin{bmatrix}\xi_\theta(y_1,y_2,x)\\ \xi_{\text{ref}}(y_1,y_2,x)\end{bmatrix}\right)=\mathcal{N}\left(\left.\begin{bmatrix}\xi_\theta(y_1,y_2,x)\\ \xi_{\text{ref}}(y_1,y_2,x)\end{bmatrix}\,\right|\,0,\ \gamma(y_1,y_2,x)I\right), \tag{25}$$

where $\mathcal{N}(\cdot|0,\Sigma)$ denotes a 2D, zero-mean Gaussian with covariance $\Sigma\in\mathbb{R}^{2\times 2}$, and $\gamma(y_1,y_2,x)\in\mathbb{R}^{+}$ is a variance parameter that depends on the tuple $\{y_1,y_2,x\}$. Since each $\gamma(y_1,y_2,x)$ is unknown, we can group them together with $\pi_\theta$ and estimate all unknowns jointly. In the context of labeled human preference data drawn from $\mathcal{D}_{tr}$, this involves solving

$$\min_{\pi_\theta\in\mathcal{S}_\pi,\ \{\gamma(y_w,y_l,x)>0\}}\left\{\mathbb{E}_{\{y_w,y_l,x\}\sim\mathcal{D}_{tr}}-\log\mathcal{N}\left(\left.\begin{bmatrix}\xi_\theta(y_w,y_l,x)\\ \xi_{\text{ref}}(y_w,y_l,x)\end{bmatrix}\,\right|\,0,\ \gamma(y_w,y_l,x)I\right)\right\}, \tag{26}$$

where $I$ is a $2\times 2$ identity matrix and $\mathcal{S}_\pi$ is any constraint set on $\pi_\theta$ as introduced in Section 3.3. The intuition here is that, although $\gamma(y_w,y_l,x)$ is unknown, sharing this parameter across both $\xi_\theta$ and $\xi_{\text{ref}}$ and estimating jointly will induce a reference policy-dependent regularization effect.

And indeed, this simple Gaussian model exactly reproduces DPO. More concretely, the stated equivalence follows from the fact that, for an arbitrary vector $v\in\mathbb{R}^{2}$, we have that

$$\arg\min_{\gamma>0}-\log\mathcal{N}(v|0,\gamma I)\equiv\arg\min_{\gamma>0}\left[\frac{v^{\top}v}{\gamma}+\log|\gamma I|\right]=\frac{1}{2}v^{\top}v. \tag{27}$$

And therefore, we have

$$\min_{\gamma>0}-\log\mathcal{N}(v|0,\gamma I)~\equiv~\log(v^{\top}v) \tag{28}$$
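The two steps (27) and (28) can be sanity-checked numerically. The following is a minimal sketch, not part of the original derivation: it evaluates the isotropic Gaussian negative log-likelihood directly, plugs in the closed-form variance $\gamma=v^{\top}v/d$ (which equals $\frac{1}{2}v^{\top}v$ when $d=2$), and compares the resulting profiled loss against $\log(v^{\top}v)$. All function names are illustrative.

```python
import math


def neg_log_gaussian(v, gamma):
    """-log N(v | 0, gamma * I) for a d-dimensional vector v."""
    d = len(v)
    sq_norm = sum(x * x for x in v)
    return 0.5 * (sq_norm / gamma + d * math.log(gamma) + d * math.log(2.0 * math.pi))


def optimal_gamma(v):
    """Closed-form minimizer over gamma > 0; equals v^T v / 2 when d = 2."""
    return sum(x * x for x in v) / len(v)


def profiled_loss(v):
    """Negative log-likelihood after minimizing out gamma."""
    return neg_log_gaussian(v, optimal_gamma(v))


if __name__ == "__main__":
    for v in ([0.3, 1.2], [2.0, 0.5]):
        sq_norm = sum(x * x for x in v)
        # For d = 2 this difference is the same constant for every v.
        print(v, profiled_loss(v) - math.log(sq_norm))
```

For $d=2$ the printed difference is the same for every $v$, which is the sense in which (28) holds once irrelevant constants are excluded.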

excluding irrelevant constants. Returning to (26), if we first optimize over $\gamma(y_w,y_l,x)$ for each tuple, we obtain the loss factor

$$\log\left[\xi_{\text{ref}}(y_w,y_l,x)^{2}+\xi_\theta(y_w,y_l,x)^{2}\right]~=~\log\left[\mu\left[\frac{\pi_\theta(y_l|x)}{\pi_\theta(y_w|x)}\right]^{2}+\mu\left[\frac{\pi_{\text{ref}}(y_l|x)}{\pi_{\text{ref}}(y_w|x)}\right]^{2}\right]. \tag{29}$$

From here, by choosing $\mu(\cdot)=(\cdot)^{\frac{\lambda}{2}}$ we can rewrite (29) as

$$\begin{aligned}
\log\left[\frac{\pi_\theta(y_l|x)^{\lambda}}{\pi_\theta(y_w|x)^{\lambda}}+\frac{\pi_{\text{ref}}(y_l|x)^{\lambda}}{\pi_{\text{ref}}(y_w|x)^{\lambda}}\right]
&= \log\left[1+\left(\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)^{\lambda}\left(\frac{\pi_{\text{ref}}(y_w|x)}{\pi_\theta(y_w|x)}\right)^{\lambda}\right]+C \\
&= -\log\sigma\left(\lambda\log\frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}-\lambda\log\frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)+C,
\end{aligned} \tag{30}$$

ignoring the irrelevant constant $C$, which is independent of $\pi_\theta$. Hence we have recovered the DPO loss for each tuple $\{y_w,y_l,x\}$, and once the requisite expectation is reintroduced, we exactly recover the full DPO loss from (7).
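As a quick check on this algebra, the sketch below evaluates the profiled Gaussian loss from (29) with $\mu(\cdot)=(\cdot)^{\lambda/2}$ and the per-tuple DPO loss on a single illustrative tuple. The probability values are made-up numbers, and the function names are ours. Factoring the reference term out of the logarithm suggests that the constant is $C=\lambda\log\frac{\pi_{\text{ref}}(y_l|x)}{\pi_{\text{ref}}(y_w|x)}$, which is what the code assumes.

```python
import math


def gaussian_profile_loss(pt_w, pt_l, pr_w, pr_l, lam):
    """log[xi_theta^2 + xi_ref^2] with mu(r) = r^(lam/2), so mu(r)^2 = r^lam."""
    return math.log((pt_l / pt_w) ** lam + (pr_l / pr_w) ** lam)


def dpo_loss(pt_w, pt_l, pr_w, pr_l, lam):
    """Per-tuple DPO loss, using -log sigmoid(z) = log(1 + exp(-z))."""
    z = lam * math.log(pt_w / pr_w) - lam * math.log(pt_l / pr_l)
    return math.log1p(math.exp(-z))


def constant_c(pr_w, pr_l, lam):
    """Policy-independent offset between the two losses (assumed form of C)."""
    return lam * math.log(pr_l / pr_w)


if __name__ == "__main__":
    # Hypothetical response probabilities for one tuple {y_w, y_l, x}.
    pt_w, pt_l, pr_w, pr_l, lam = 0.30, 0.10, 0.20, 0.25, 0.5
    lhs = gaussian_profile_loss(pt_w, pt_l, pr_w, pr_l, lam)
    rhs = dpo_loss(pt_w, pt_l, pr_w, pr_l, lam) + constant_c(pr_w, pr_l, lam)
    print(lhs, rhs)
```

Because the offset depends only on $\pi_{\text{ref}}$, the two losses share the same gradients with respect to $\pi_\theta$, which is all that matters for training.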

Appendix E Technical Proofs

E.1 Proof of Theorem 1

Definition 4

We define labeled human preference data $\bar{\mathcal{D}}_{tr}$ as some $\mathcal{D}_{tr}$, as introduced via (1), satisfying the following additional properties:

  1. The prompts drawn from $\bar{\mathcal{D}}_{tr}$ are split between two disjoint support partitions $d_x^{good}$ and $d_x^{bad}$, i.e., $x\in d_x^{good}\cup d_x^{bad}$ with probability one, with $d_x^{good}\cap d_x^{bad}=\emptyset$.

  2. For each prompt $x\in d_x^{good}\cup d_x^{bad}$ within $\bar{\mathcal{D}}_{tr}$, the preference distribution filling out $\bar{\mathcal{D}}_{tr}$ maintains support over a single (prompt-dependent) response pair $\{y_1,y_2\}$.

  3. Pair-wise preferences are dictated by a ground-truth BT model satisfying $p^{*}(y_1\succ y_2|x)\in(0,1)$ for all $x\in d_x^{good}\cup d_x^{bad}$.

Although the second specification above can naturally be relaxed to address more general scenarios, doing so unnecessarily complicates the presentation without providing sufficiently compelling additional insight. Additionally, for convenience below we adopt $\text{dist}[\cdot,\cdot]$ to indicate an arbitrary distance measure.

Theorem 1

(Restated formal version)   Assume preference data $\bar{\mathcal{D}}_{tr}$ that satisfies Definition 4. Furthermore, assume a reference policy $\pi_{\text{ref}}$ such that $\pi_{\text{ref}}=\pi^{*}$ for $x\in d_x^{good}$ and $\text{dist}[\pi_{\text{ref}},\pi^{*}]>0$ for $x\in d_x^{bad}$, where $\pi^{*}$ is a BT-optimal policy. It follows that for any selection of $(\psi,\mu,\lambda)$, if

$$\text{dist}[\hat{\pi}_\theta^{\text{QPO}},\pi^{*}]~<~\text{dist}[\pi_{\text{ref}},\pi^{*}]\quad\text{for}\quad x\in d_x^{bad}, \tag{31}$$

then

$$\text{dist}[\hat{\pi}_\theta^{\text{QPO}},\pi^{*}]~>~0\quad\text{for}\quad x\in d_x^{good}, \tag{32}$$

where $\hat{\pi}_\theta^{\text{QPO}}:=\arg\min_{\pi_\theta}\ell_{\text{QPO}}(\pi_\theta,\pi_{\text{ref}},\psi,\mu,\lambda)$.

The proof proceeds as follows. With some abuse/imprecision of notation, we first define

$$u(y_1,y_2,x):=\mu\left[\frac{\pi_\theta(y_1|x)}{\pi_{\text{ref}}(y_1|x)}\right]-\mu\left[\frac{\pi_\theta(y_2|x)}{\pi_{\text{ref}}(y_2|x)}\right]. \tag{33}$$

Next, per the assumptions of the theorem statement and Definition 4, we have that the QPO loss decouples as

$$\begin{aligned}
\ell_{\text{QPO}}(\pi_{\theta},\pi_{\text{ref}},\psi,\mu,\lambda)
&= \mathbb{E}_{\{y_w,y_l,x\}\sim\bar{\mathcal{D}}_{tr}}~\psi\left(\mu\left[\frac{\pi_{\theta}(y_w|x)}{\pi_{\text{ref}}(y_w|x)}\right] - \mu\left[\frac{\pi_{\theta}(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right],\lambda\right) \\
&= \mathbb{E}_{x\sim\mathcal{D}_x}\Big(p^*(y_1\succ y_2|x)\,\psi\big[u(y_1,y_2,x),\lambda\big] + p^*(y_2\succ y_1|x)\,\psi\big[u(y_2,y_1,x),\lambda\big]\Big) \\
&= \mathbb{E}_{x\sim d_x^{good}}\Big[p^*(y_1\succ y_2|x)\,\psi[u(y_1,y_2,x),\lambda] + p^*(y_2\succ y_1|x)\,\psi[-u(y_1,y_2,x),\lambda]\Big] \\
&\quad + \mathbb{E}_{x\sim d_x^{bad}}\Big[p^*(y_1\succ y_2|x)\,\psi[u(y_1,y_2,x),\lambda] + p^*(y_2\succ y_1|x)\,\psi[-u(y_1,y_2,x),\lambda]\Big].
\end{aligned} \tag{34}$$

Now consider a single prompt $x^{bad}$ drawn from $d_x^{bad}$. In order to reduce $\text{dist}[\pi_{\text{ref}},~\pi^*]$, it must be the case that $\pi_{\theta}(y|x^{bad}) \neq \pi_{\text{ref}}(y|x^{bad})$, which then implies that $u(y_1,y_2,x^{bad}) \neq 0$. To achieve this, $(\psi,\mu,\lambda)$ must be chosen such that

$$\arg\min_{u(y_1,y_2,x^{bad})}\Big[p^*(y_1\succ y_2|x^{bad})\,\psi[u(y_1,y_2,x^{bad}),\lambda] + p^*(y_2\succ y_1|x^{bad})\,\psi[-u(y_1,y_2,x^{bad}),\lambda]\Big] \neq 0. \tag{35}$$

However, to simultaneously maintain $\pi_{\theta}(y|x^{good}) = \pi_{\text{ref}}(y|x^{good}) = \pi^*(y|x^{good})$ for some prompt $x^{good}$ drawn from $d_x^{good}$, it must also be true, for the same fixed $(\psi,\mu,\lambda)$ tuple, that

$$\arg\min_{u(y_1,y_2,x^{good})}\Big[p^*(y_1\succ y_2|x^{good})\,\psi[u(y_1,y_2,x^{good}),\lambda] + p^*(y_2\succ y_1|x^{good})\,\psi[-u(y_1,y_2,x^{good}),\lambda]\Big] = 0. \tag{36}$$

But this is a contradiction, as the respective arguments that minimize (35) and (36) will be identical. Hence if (35) is true then $\text{dist}[\hat{\pi}_{\theta}^{\text{QPO}},~\pi^*] > 0$ for $x \in d_x^{good}$. $\blacksquare$
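The contradiction can be checked numerically. The sketch below is only illustrative: it assumes the DPO-style choice $\psi(u,\lambda) = -\log\sigma(\lambda u)$ and assumes that the good and bad prompts share the same preference probability $p^*$; the function names (`psi_dpo`, `per_prompt_objective`, `argmin_u`) are hypothetical and not part of the original analysis. Because the bracketed objective in (35) and (36) depends on the prompt only through $p^*$, its minimizer is one fixed number per $(p^*,\lambda)$ pair, and it is nonzero whenever $p^* \neq 1/2$.

```python
import math

def psi_dpo(u: float, lam: float) -> float:
    """Assumed example penalty: psi(u, lam) = -log sigmoid(lam * u), computed stably."""
    z = lam * u
    # -log sigmoid(z) = log(1 + exp(-z)); branch to avoid overflow in exp
    return math.log1p(math.exp(-z)) if z >= 0 else -z + math.log1p(math.exp(z))

def per_prompt_objective(u: float, p: float, lam: float) -> float:
    """Bracketed term of (35)/(36): p * psi(u) + (1 - p) * psi(-u)."""
    return p * psi_dpo(u, lam) + (1.0 - p) * psi_dpo(-u, lam)

def argmin_u(p: float, lam: float, lo: float = -50.0, hi: float = 50.0) -> float:
    """Ternary search for the minimizer of the (convex) per-prompt objective."""
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if per_prompt_objective(m1, p, lam) < per_prompt_objective(m2, p, lam):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

# The prompt never enters the computation, only p and lam do.
u_star = argmin_u(p=0.7, lam=1.0)   # nonzero: moves away from pi_ref
u_tie = argmin_u(p=0.5, lam=1.0)    # zero: stays at pi_ref
print(u_star, u_tie)
```

For this particular $\psi$ the minimizer has the closed form $u = \lambda^{-1}\log[p^*/(1-p^*)]$, so a prompt cannot be held at $u = 0$ while another prompt with the same $p^* \neq 1/2$ is moved.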

E.2 Proof of Proposition 1

DPO lower limit:

Given our assumption that $0 < p^*(y_1 \succ y_2|x) < 1$, it follows that an optimal finite reward $r^*(y,x) \in (-\infty,\infty)$ exists. Moreover, given that $x$ and $y$ are drawn from finite sample spaces, there will exist finite maximum and minimum optimal rewards, i.e., $r^*(y,x) \in (-B,B)$ for some $B < \infty$. Furthermore,

$$\lim_{\lambda\rightarrow 0}\arg\min_{\pi_{\theta}}\ell_{\text{RLHF}}\left(\pi_{\theta},\pi_{\text{ref}},r^*,\lambda\right) = \arg\max_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\theta}(y|x)}\big[r^*(y,x)\big] = \pi^{\delta}(y|x). \tag{37}$$

Additionally, given that the data are generated by (1), we also know that the same optimal reward satisfies

$$r^* = \arg\min_{r_{\phi}}\ell_{\text{BT}}\left(r_{\phi}\right), \tag{38}$$

which is independent of $\pi_{\text{ref}}$. However, without constraints on $\pi_{\theta}$, there also exists a bijection between policy and reward such that

$$\lambda\log\left[\arg\min_{\pi_{\theta}}\ell_{\text{BT}}\left(\lambda\log\frac{\pi_{\theta}(y|x)}{\pi_{\text{ref}}(y|x)}\right)\right] - \lambda\log\pi_{\text{ref}}(y|x) ~=~ r^*. \tag{39}$$

Hence the DPO reparameterization produces the policy given by (5) with $r = r^*$. From this point we then observe that
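The policy-reward correspondence in (39) can be illustrated on a toy example. The sketch below assumes that the policy in (5) has the exponentially tilted form $\pi(y|x) \propto \pi_{\text{ref}}(y|x)\exp[r(y,x)/\lambda]$ that appears in (40); the reward values, reference probabilities, and function names are hypothetical. It recovers the reward from the policy up to an offset that does not depend on $y$, which the BT loss cannot distinguish since it only involves reward differences.

```python
import math

def policy_from_reward(rewards, pi_ref, lam):
    """pi(y) proportional to pi_ref(y) * exp(r(y) / lam), normalized over y."""
    logits = [math.log(p) + r / lam for p, r in zip(pi_ref, rewards)]
    m = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

def implicit_reward(pi, pi_ref, lam):
    """lam * log(pi(y) / pi_ref(y)): the reward implied by a policy."""
    return [lam * math.log(p / q) for p, q in zip(pi, pi_ref)]

rewards = [1.5, -0.3, 0.4]   # illustrative values for one prompt
pi_ref = [0.2, 0.5, 0.3]
lam = 0.7

pi = policy_from_reward(rewards, pi_ref, lam)
r_hat = implicit_reward(pi, pi_ref, lam)
# Every entry of `offsets` is the same y-independent constant.
offsets = [r - rh for r, rh in zip(rewards, r_hat)]
print(offsets)
```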

$$\lim_{\lambda\rightarrow 0}\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left[\frac{1}{\lambda}r^*(y,x)\right] = \pi^{\delta}(y|x), \tag{40}$$

noting that for any $\alpha > \beta > 0$ we have $\exp\left[\frac{\alpha}{\lambda}\right]/\exp\left[\frac{\beta}{\lambda}\right] = \exp\left[\frac{(\alpha-\beta)}{\lambda}\right] \rightarrow \infty$ as $\lambda \rightarrow 0$. Hence we have fulfilled the requirements of the lower limit.
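A small numerical sketch of the limit in (40) follows. It assumes that $\pi^{\delta}$ denotes a policy concentrated on the highest-reward response and that this response is unique; the reward values and the helper name `tilted_policy` are illustrative only. As $\lambda$ shrinks, the mass moves onto the maximizer even though $\pi_{\text{ref}}$ assigns it the smallest probability.

```python
import math

def tilted_policy(rewards, pi_ref, lam):
    """pi(y) = pi_ref(y) * exp(r(y) / lam) / Z, evaluated in log space."""
    logits = [math.log(p) + r / lam for p, r in zip(pi_ref, rewards)]
    m = max(logits)                      # subtract the max so exp never overflows
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

rewards = [0.2, 1.0, 0.9]    # response 1 has the largest reward
pi_ref = [0.6, 0.1, 0.3]     # but the smallest reference probability

for lam in (1.0, 0.1, 0.01, 0.001):
    print(lam, [round(p, 4) for p in tilted_policy(rewards, pi_ref, lam)])
```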

DPO upper limit:

The upper limit follows trivially from the fact that for any bounded reward

$$\lim_{\lambda\rightarrow\infty}\frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left[\frac{1}{\lambda}r(y,x)\right] = \frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp[0] = \pi_{\text{ref}}. \tag{41}$$
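The last equality in (41) also requires the normalizer to converge to one. A brief sketch of this step, assuming $Z(x)$ is the usual normalizing constant $\sum_{y}\pi_{\text{ref}}(y|x)\exp\left[\frac{1}{\lambda}r(y,x)\right]$ (its definition is not restated in this appendix) and using the bound $|r(y,x)| < B$ together with the finite sample space:

```latex
\begin{aligned}
e^{-B/\lambda} \;\le\; \exp\left[\tfrac{1}{\lambda} r(y,x)\right] \;&\le\; e^{B/\lambda},
\qquad e^{\pm B/\lambda} \xrightarrow[\lambda\rightarrow\infty]{} 1, \\
Z(x) = \sum_{y}\pi_{\text{ref}}(y|x)\exp\left[\tfrac{1}{\lambda} r(y,x)\right]
\;&\xrightarrow[\lambda\rightarrow\infty]{}\; \sum_{y}\pi_{\text{ref}}(y|x) = 1 .
\end{aligned}
```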

$\blacksquare$

E.3 Proof of Proposition 2

Establishing the upper and lower limiting values for IPO follows a similar pattern to the proof of Proposition 1. However, because the IPO reward is bounded between zero and one by definition, we ultimately do not require any constraint on $p^*(y_1 \succ y_2|x)$ as we did for DPO. $\blacksquare$

E.4 Proof of Theorem 2

We first define

$$\hat{\rho} := \arg\min_{\rho}\mathbb{E}_{\{y_w,y_l,x\}\sim\bar{\mathcal{D}}_{tr}}~\psi\Big[\rho(y_w,y_l,x,\pi_{\theta},\pi_{\text{ref}}),\lambda\Big]. \tag{42}$$

Now suppose that for a given tuple $\{y_w,y_l,x\}$ we observe

$$\hat{\rho}(y_w,y_l,x,\pi_{\theta},\pi_{\text{ref}}) = \log\left[\frac{\hat{\pi}_{\theta}(y_w|x)\,\pi_{\text{ref}}(y_l|x)}{\hat{\pi}_{\theta}(y_l|x)\,\pi_{\text{ref}}(y_w|x)}\right] = B(\lambda) \tag{43}$$

for some optimal $\hat{\pi}_{\theta}$ and fixed $\lambda \in (0,\infty)$, where $B(\lambda) \in (0,\infty)$ is a finite value dependent on $\lambda$ through the definition of $\psi$. Therefore, we have that

$$\frac{\hat{\pi}_{\theta}(y_w|x)}{\hat{\pi}_{\theta}(y_l|x)} = \exp\left(B(\lambda) + \log\left[\frac{\pi_{\text{ref}}(y_w|x)}{\pi_{\text{ref}}(y_l|x)}\right]\right). \tag{44}$$

Obviously this ratio will depend on $\pi_{\text{ref}}$ for any fixed $B(\lambda)$. To satisfy the SIC though, in the limit $\lambda \rightarrow 0$ the optimized policy $\hat{\pi}_{\theta}$ must be independent of $\pi_{\text{ref}}$ and converge to $\pi^*$. However, the only way for $\hat{\pi}_{\theta}$ to be independent of $\pi_{\text{ref}}$ is if $\lim_{\lambda\rightarrow 0}B(\lambda) = \pm\infty$. But if so, only the WIC is achievable, not the SIC. $\blacksquare$
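The dependence on $\pi_{\text{ref}}$ in (44) can be made concrete with a short sketch. It adds one simplifying assumption not made in the proof, namely that the policy has support only on $y_w$ and $y_l$, so that the ratio in (44) determines $\hat{\pi}_{\theta}(y_w|x)$; the helper name and the numerical values are hypothetical. For moderate $B$ the two reference policies give clearly different answers, while for large $B$ both collapse onto $y_w$.

```python
import math

def two_response_policy(b: float, ref_w: float, ref_l: float) -> float:
    """Return pi_hat(y_w | x) implied by (44) when only y_w and y_l have support."""
    log_ratio = b + math.log(ref_w / ref_l)
    # pi_w / pi_l = exp(log_ratio) with pi_w + pi_l = 1  =>  pi_w = sigmoid(log_ratio)
    return 1.0 / (1.0 + math.exp(-log_ratio))

for b in (1.0, 5.0, 30.0):
    p_a = two_response_policy(b, ref_w=0.9, ref_l=0.1)   # reference favors y_w
    p_b = two_response_policy(b, ref_w=0.2, ref_l=0.8)   # reference favors y_l
    print(b, round(p_a, 6), round(p_b, 6))
```

The reference policy stops mattering only once the winning response absorbs essentially all of the probability mass, which matches the observation in the proof that removing the dependence requires $B(\lambda)$ to diverge.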

E.5 Proof of Theorem 3

Our strategy here is to construct a simplified situation whereby we can pinpoint emergent differences between RLHF and DPO losses in the presence of policy constraints. To this end, we assume the following:

  • For all $x \sim \mathcal{D}_x$, there exist two unique responses $y_1$ and $y_2$ with equal probability of 1/2 under $\pi_{\text{ref}}$;

  • Preference data $\{y_w,y_l,x\} \sim \mathcal{D}_{tr}$ are sampled according to (1);

  • The loss trade-off parameter satisfies $\lambda = 1$; and

  • $p^*(y_1 \succ y_2|x) \in (0,1)$ for all $\{y_1,y_2\} \sim \pi_{\text{ref}}(y|x)$ and $x \in \mathcal{D}_x$.

RLHF loss processing:

When evaluated with the optimal reward model $r^*$, we have that

$$\begin{aligned}
\ell_{\text{RLHF}}\left(\pi_{\theta},\pi_{\text{ref}},r^*,\lambda\right)
&= \mathbb{E}_{y\sim\pi_{\theta}(y|x),\,x\sim\mathcal{D}_x}\Big[-r^*(y,x)\Big] + \lambda~\mathbb{E}_{x\sim\mathcal{D}_x}\Big[\mathbb{KL}\big[\pi_{\theta}(y|x)\,\|\,\pi_{\text{ref}}(y|x)\big]\Big] \\
&\equiv \mathbb{E}_{x\sim\mathcal{D}_x}\Big[\mathbb{KL}\big[\pi_{\theta}(y|x)\,\|\,\pi^{**}(y|x)\big]\Big],
\end{aligned} \tag{45}$$

where

$$\pi^{**}(y|x) ~:=~ \frac{1}{Z(x)}\pi_{\text{ref}}(y|x)\exp\left[\frac{1}{\lambda}r^*(y,x)\right]. \tag{46}$$

This stems directly from the analysis in [31, 32]. However, because we are assuming $\lambda = 1$ and $\pi_{\text{ref}}(y|x)$ is constant for any given $x$, it follows that

$$\pi^{**}(y|x) = \frac{\exp\left[r^*(y,x)\right]}{\sum_{y}\exp\left[r^*(y,x)\right]}, \tag{47}$$

where the denominator is independent of $y$. Since the BT-optimal solution $\pi^*$ satisfies

$$\frac{\pi^*(y_1|x)}{\pi^*(y_1|x)+\pi^*(y_2|x)} = p^*(y_1\succ y_2|x) = \frac{\exp\left[r^*(y_1,x)\right]}{\exp\left[r^*(y_1,x)\right]+\exp\left[r^*(y_2,x)\right]}, \tag{48}$$

we may conclude that $\pi^{**} = \pi^*$, and therefore

$$\ell_{\text{RLHF}}\left(\pi_{\theta},\pi_{\text{ref}},r^*,\lambda\right) ~=~ \mathbb{E}_{x\sim\mathcal{D}_x}\Big[\mathbb{KL}\big[\pi_{\theta}(y|x)\,\|\,\pi^*(y|x)\big]\Big] \tag{49}$$

under the stated conditions.

DPO loss processing:

When $\lambda=1$ and $\pi_{\tiny\mbox{ref}}(y|x)$ is constant, we have that

\begin{align}
\ell_{\tiny\mbox{DPO}}(\pi_{\theta},\pi_{\tiny\mbox{ref}},\lambda) &= \mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\left[-\log\sigma\left(\lambda\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\tiny\mbox{ref}}(y_{w}|x)}-\lambda\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\tiny\mbox{ref}}(y_{l}|x)}\right)\right] \tag{50} \\
&= \mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\left[\log\left(\frac{\pi_{\theta}(y_{w}|x)+\pi_{\theta}(y_{l}|x)}{\pi_{\theta}(y_{w}|x)}\right)\right]. \notag
\end{align}

Next, given the additional data generation assumptions, it follows that $\pi_{\theta}(y_{w}|x)+\pi_{\theta}(y_{l}|x)=1$, and so the DPO loss can be further modified as

\begin{align}
\ell_{\tiny\mbox{DPO}}(\pi_{\theta},\pi_{\tiny\mbox{ref}},\lambda) &= \mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\left[\log\left(\frac{1}{\pi_{\theta}(y_{w}|x)}\right)\right] \tag{51} \\
&= \mathbb{E}_{x\sim\mathcal{D}_{x}}\left[p^{*}(z=1|y_{1},y_{2},x)\log\left(\frac{1}{\pi_{\theta}(y_{1}|x)}\right) + p^{*}(z=0|y_{1},y_{2},x)\log\left(\frac{1}{\pi_{\theta}(y_{2}|x)}\right)\right] \notag \\
&= \mathbb{E}_{x\sim\mathcal{D}_{x}}\left[\pi^{*}(y_{1}|x)\log\left(\frac{1}{\pi_{\theta}(y_{1}|x)}\right) + \pi^{*}(y_{2}|x)\log\left(\frac{1}{\pi_{\theta}(y_{2}|x)}\right)\right] \notag \\
&= \mathbb{E}_{x\sim\mathcal{D}_{x}}\left[\pi^{*}(y_{1}|x)\log\left(\frac{\pi^{*}(y_{1}|x)}{\pi_{\theta}(y_{1}|x)}\right) + \pi^{*}(y_{2}|x)\log\left(\frac{\pi^{*}(y_{2}|x)}{\pi_{\theta}(y_{2}|x)}\right)\right] + C \notag \\
&\equiv \mathbb{E}_{x\sim\mathcal{D}_{x}}\Big[\mathbb{KL}\big[\pi^{*}(y|x)\,||\,\pi_{\theta}(y|x)\big]\Big], \notag
\end{align}

where $C$ is an irrelevant constant. Note that in progressing from the first to the second equality, we can ignore cases where sampled responses satisfy $y_{1}=y_{2}$, since these contribute only another irrelevant constant to the loss. Along with our stated response data assumptions, this allows us to remove the expectation over $\{y_{1},y_{2}\}$ without loss of generality.
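The reduction in (50) and (51) can be checked numerically in a two-response toy setting. The following is a minimal sketch, assuming binary responses, $\lambda=1$, and a uniform (hence constant) reference policy; the function names and the specific probabilities are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dpo_pair_loss(pi_w, pi_l, ref_w, ref_l, lam=1.0):
    """Per-pair DPO loss: -log sigma(lam*log(pi_w/ref_w) - lam*log(pi_l/ref_l))."""
    margin = lam * math.log(pi_w / ref_w) - lam * math.log(pi_l / ref_l)
    return -math.log(sigmoid(margin))

def expected_dpo_loss(pi_theta, pi_star, pi_ref):
    """Expected loss when y1 is labeled the winner with probability pi_star[0]."""
    loss_y1_wins = dpo_pair_loss(pi_theta[0], pi_theta[1], pi_ref[0], pi_ref[1])
    loss_y2_wins = dpo_pair_loss(pi_theta[1], pi_theta[0], pi_ref[1], pi_ref[0])
    return pi_star[0] * loss_y1_wins + pi_star[1] * loss_y2_wins

def forward_kl(p, q):
    """KL[p || q] for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

pi_star = [0.7, 0.3]   # hypothetical BT-optimal policy over two responses
pi_ref = [0.5, 0.5]    # constant reference policy

for q1 in (0.2, 0.5, 0.7, 0.9):
    pi_theta = [q1, 1.0 - q1]
    loss = expected_dpo_loss(pi_theta, pi_star, pi_ref)
    kl = forward_kl(pi_star, pi_theta)
    # The gap loss - kl should not depend on pi_theta (it plays the role of C).
    print(f"q1={q1:.1f}  loss={loss:.6f}  KL={kl:.6f}  gap={loss - kl:.6f}")
```

Under these assumptions the gap between the expected DPO loss and $\mathbb{KL}[\pi^{*}||\pi_{\theta}]$ equals the entropy of $\pi^{*}$ for every choice of $\pi_{\theta}$, which is the constant $C$ absorbed in the derivation.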

Final step:

From (49) and (51) we observe that the only difference between the RLHF and DPO losses under the given conditions is whether a forward or backward KL is used. And of course without any constraints, the minimizing solutions are equivalent as expected, consistent with the analysis from [33], i.e.,

\begin{equation}
\arg\min_{\pi_{\theta}}\ell_{\tiny\mbox{RLHF}}\left(\pi_{\theta},\pi_{\tiny\mbox{ref}},r^{*},\lambda\right) ~=~ \arg\min_{\pi_{\theta}}\ell_{\tiny\mbox{DPO}}(\pi_{\theta},\pi_{\tiny\mbox{ref}},\lambda). \tag{52}
\end{equation}

Critically though, this equivalence of minimizers need not hold once constraints are introduced, as the forward KL will favor mode covering while the backward KL will push mode following [7]. $\blacksquare$
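The contrast between the two KL directions under constraints can be illustrated with a small numerical sketch. The setup below is an assumption made purely for illustration: a bimodal target on a one-dimensional grid stands in for $\pi^{*}$, and the constraint is that the fitted distribution must be a single discretized Gaussian chosen by grid search. None of these choices come from the paper.

```python
import math

GRID = [i * 0.1 for i in range(-60, 61)]  # support points from -6.0 to 6.0

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def gaussian_on_grid(mean, std):
    """Discretized Gaussian, renormalized over GRID."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2) for x in GRID])

def kl(p, q):
    """KL[p || q] on GRID; infinite if q has no mass where p does."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total

# Bimodal target with equal-weight modes at -3 and +3 (plays the role of pi*).
target = normalize([a + b for a, b in zip(gaussian_on_grid(-3.0, 0.5),
                                          gaussian_on_grid(3.0, 0.5))])

# Constrained model family: one Gaussian, mean in [-4, 4], std in [0.25, 4.0].
candidates = [(m * 0.5, s * 0.25) for m in range(-8, 9) for s in range(1, 17)]

def best_fit(divergence):
    return min(candidates, key=lambda ms: divergence(gaussian_on_grid(*ms)))

forward_fit = best_fit(lambda q: kl(target, q))  # KL[target || q], as in (51)
reverse_fit = best_fit(lambda q: kl(q, target))  # KL[q || target], as in (49)

print("forward KL fit (mean, std):", forward_fit)
print("reverse KL fit (mean, std):", reverse_fit)
```

In this toy setting the forward-KL fit spreads over both modes with a large standard deviation, while the reverse-KL fit locks onto a single mode, so the two constrained minimizers differ even though both divergences vanish at the same unconstrained solution.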

E.6 Proof of Propositions 3 and 4

These results both follow directly from the original design of $\ell_{\tiny\mbox{TYPO}}(\pi_{\theta},\pi_{\tiny\mbox{ref}},\lambda)$. Regarding Proposition 3, given that $\pi_{\tiny\mbox{ref}}=\pi^{*}$ for all $x\in d_{x}^{good}$, then for the unsupervised term we have

\begin{equation}
\arg\min_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\in d_{x}^{good}}\Big[\mathbb{KL}\big[\pi_{\tiny\mbox{ref}}(y|x)\,||\,\pi_{\theta}(y|x)\big]\Big] ~~=~~ \pi^{*}. \tag{53}
\end{equation}

And for the supervised term we have

\begin{equation}
\arg\min_{\pi_{\theta}}\mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[\mathbb{KL}\big[p^{*}(z|y_{1},y_{2},x)\,||\,p_{\theta}(z|y_{1},y_{2},x)\big]\Big] ~~=~~ \pi^{*}. \tag{54}
\end{equation}

Hence overall, for any $x\in d_{x}^{good}$, $\pi_{\theta}=\pi^{*}$ will be optimal for any $\lambda$, as this selection independently optimizes the constituent terms. Moreover, this optimality is independent of optimization over $x\in d_{x}^{bad}$, which retains the flexibility to achieve solutions with $\mbox{dist}[\hat{\pi}_{\theta}^{\tiny\mbox{TYPO}},~\pi^{*}]<\mbox{dist}[\pi_{\tiny\mbox{ref}},~\pi^{*}]$. From this Proposition 3 immediately follows.

Additionally, Proposition 4 follows from the same basic line of reasoning. For completeness, we note that when $\lambda\rightarrow 0$, only the supervised term will be minimized (which recovers the BT-optimal policy as above), while when $\lambda\rightarrow\infty$, the unsupervised term will dominate the optimization (which transparently produces $\pi_{\tiny\mbox{ref}}$). $\blacksquare$
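The limiting behavior in $\lambda$ can be illustrated with a deliberately simplified stand-in. The sketch below assumes a loss of the form (supervised term) $+~\lambda~\times$ (unsupervised term) restricted to two responses, with the supervised term replaced by $\mathbb{KL}[\pi^{*}||\pi_{\theta}]$, which is what the preference KL reduces to (up to a positive scale factor) in the binary setting. This additive form, the function names, and the numbers are assumptions for illustration and should not be read as the exact TYPO objective.

```python
import math

def kl(p, q):
    """KL[p || q] for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def toy_loss(q1, pi_star, pi_ref, lam):
    """Assumed toy objective: KL[pi* || q] + lam * KL[pi_ref || q]."""
    q = [q1, 1.0 - q1]
    return kl(pi_star, q) + lam * kl(pi_ref, q)

def toy_minimizer(pi_star, pi_ref, lam, steps=9999):
    """Grid search over q1 in (0, 1) for the minimizer of the toy objective."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda q1: toy_loss(q1, pi_star, pi_ref, lam))

pi_star = [0.8, 0.2]  # hypothetical BT-optimal policy
pi_ref = [0.4, 0.6]   # hypothetical reference policy

for lam in (0.0, 0.1, 1.0, 10.0, 1000.0):
    print(f"lambda={lam:>7}: q1 = {toy_minimizer(pi_star, pi_ref, lam):.4f}")
```

For this toy objective the minimizer is the mixture $(\pi^{*}+\lambda\pi_{\tiny\mbox{ref}})/(1+\lambda)$, so it moves from $\pi^{*}$ at $\lambda=0$ toward $\pi_{\tiny\mbox{ref}}$ as $\lambda$ grows, mirroring the two limits described above.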

Appendix F Other Derivations

F.1 Derivation of (12)

Note that

\begin{align}
p^{*}(y_{1}\succ y_{2}|x) &= \frac{\exp[r^{*}(y_{1},x)]}{\exp[r^{*}(y_{1},x)]+\exp[r^{*}(y_{2},x)]} = \frac{\frac{\exp[r^{*}(y_{1},x)]}{Z(x)}}{\frac{\exp[r^{*}(y_{1},x)]}{Z(x)}+\frac{\exp[r^{*}(y_{2},x)]}{Z(x)}} \tag{55} \\
&= \frac{\pi^{*}(y_{1}|x)}{\pi^{*}(y_{1}|x)+\pi^{*}(y_{2}|x)}, \notag
\end{align}

where $\pi^{*}(y|x)=\frac{\exp[r^{*}(y,x)]}{Z(x)}$ and $Z(x):=\sum_{y}\exp[r^{*}(y,x)]$. The policy $\pi^{*}$ so-defined is necessarily BT-optimal by construction. From here then we have

\begin{align}
\arg\max_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\theta}(y|x)}\big[r^{*}(y,x)\big] &= \arg\max_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\theta}(y|x)}\big[r^{*}(y,x)\big] \tag{58} \\
&= \arg\max_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\theta}(y|x)}\left[\frac{\exp[r^{*}(y,x)]}{Z(x)}\right] \notag \\
&= \arg\max_{\pi_{\theta}}\mathbb{E}_{y\sim\pi_{\theta}(y|x)}\big[\pi^{*}(y|x)\big] \notag \\
&= \left\{\begin{array}{cc} 1 & \mbox{if}~~y=\arg\max_{y^{\prime}}\pi^{*}(y^{\prime}|x) \\ 0 & \mbox{otherwise} \end{array}\right., \notag
\end{align}

which is the definition of $\pi^{\delta}$. $\blacksquare$
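Both steps of this derivation can be sanity-checked on a small discrete example. The sketch below is illustrative only: the four reward values are hypothetical, and the helper names are not from the paper. It checks that the normalizer $Z(x)$ cancels in the pairwise preference probability, and that a point mass on the highest-reward response attains at least the expected reward of $\pi^{*}$.

```python
import math

def softmax_policy(rewards):
    """pi*(y|x) = exp[r*(y,x)] / Z(x), with Z(x) summing over all responses."""
    shift = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - shift) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

def bt_from_rewards(r1, r2):
    """Bradley-Terry preference probability written in terms of rewards."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

def bt_from_policy(p1, p2):
    """The same probability written in terms of policy values, as in (55)."""
    return p1 / (p1 + p2)

def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))

rewards = [1.0, -0.5, 2.0, 0.3]  # hypothetical r*(y, x) for four responses
pi_star = softmax_policy(rewards)

# Point-mass policy on the response with the largest pi* (equivalently, reward).
best = max(range(len(rewards)), key=lambda i: pi_star[i])
pi_delta = [1.0 if i == best else 0.0 for i in range(len(rewards))]

print("BT via rewards:", bt_from_rewards(rewards[0], rewards[2]))
print("BT via policy: ", bt_from_policy(pi_star[0], pi_star[2]))
print("E[r] under pi*:     ", expected_reward(pi_star, rewards))
print("E[r] under pi_delta:", expected_reward(pi_delta, rewards))
```

Because the exponential is monotone, the response maximizing $\pi^{*}(y|x)$ is also the one maximizing $r^{*}(y,x)$, which is why the point-mass policy is the unregularized reward maximizer here.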

F.2 Additional $f$-DPO Analysis

$f$-DPO represents a novel generalization of DPO, but there remain certain aspects worth considering.

Minima that ignore the reference policy:

Consider general $f$-DPO losses as described in Section 2.4, which as special cases of QPO are expressible in the form

\begin{align}
&\ell_{\tiny\mbox{QPO}}(\pi_{\theta},\pi_{\tiny\mbox{ref}},-\log\sigma[\lambda(\cdot)],f^{\prime},\lambda) = \tag{59} \\
&\qquad \mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}~-\log\sigma\left(\lambda f^{\prime}\left[\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\tiny\mbox{ref}}(y_{w}|x)}\right]-\lambda f^{\prime}\left[\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\tiny\mbox{ref}}(y_{l}|x)}\right]\right). \notag
\end{align}

In addition to the requirements on $f$ to form an $f$-divergence, to produce a valid $f$-DPO loss per Theorem 1 from [41] it must be that $f^{\prime}$ is invertible with $0\notin\mbox{domain of }f^{\prime}$. Therefore the domain of $f$ will be $(0,\infty)$ and $f^{\prime}(u)\rightarrow-\infty$ as $u\rightarrow 0$ because of convexity. But if this is the case, upon inspection of (59) we observe that when $\pi_{\theta}(y_{l}|x)\rightarrow 0$, then for any fixed $\pi_{\theta}(y_{w}|x)>0$ the input argument to the logistic function $\sigma(\cdot)=\frac{1}{1+\exp[-(\cdot)]}$ will diverge to $+\infty$, pushing the output to one and subsequently minimizing the corresponding negative-log factor. And so the global optimum can be achieved independent of the value of $\pi_{\tiny\mbox{ref}}$. $\blacksquare$
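This degenerate minimum can be observed numerically. The sketch below assumes the reverse-KL generator $f(u)=u\log u$, so that $f^{\prime}(u)=\log u+1$, which recovers standard DPO; this choice, the function names, and the probability values are illustrative assumptions rather than anything prescribed by the paper.

```python
import math

def f_prime_reverse_kl(u):
    """For f(u) = u log u, f'(u) = log u + 1, which tends to -inf as u -> 0."""
    return math.log(u) + 1.0

def neg_log_sigmoid(t):
    """-log sigma(t) = log(1 + exp(-t)), evaluated in a numerically stable way."""
    if t >= 0:
        return math.log1p(math.exp(-t))
    return -t + math.log1p(math.exp(t))

def f_dpo_pair_loss(pi_w, pi_l, ref_w, ref_l, lam=1.0, f_prime=f_prime_reverse_kl):
    """Per-pair loss from (59): -log sigma(lam*f'(pi_w/ref_w) - lam*f'(pi_l/ref_l))."""
    margin = lam * f_prime(pi_w / ref_w) - lam * f_prime(pi_l / ref_l)
    return neg_log_sigmoid(margin)

# Three very different hypothetical reference policies (ref_w, ref_l).
references = [(0.9, 0.1), (0.1, 0.9), (0.5, 0.5)]

for ref_w, ref_l in references:
    for pi_l in (1e-2, 1e-6, 1e-12):
        loss = f_dpo_pair_loss(pi_w=0.5, pi_l=pi_l, ref_w=ref_w, ref_l=ref_l)
        print(f"ref=({ref_w}, {ref_l})  pi_l={pi_l:.0e}  loss={loss:.3e}")
```

For each reference policy tried, the per-pair loss shrinks toward zero as $\pi_{\theta}(y_{l}|x)$ is driven to zero with $\pi_{\theta}(y_{w}|x)$ held fixed, consistent with the observation that the infimum does not depend on $\pi_{\tiny\mbox{ref}}$.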

F.3 Derivation of (16)

\begin{align}
d_{\tiny\mbox{sup}}(\pi_{\theta},\pi_{\tiny\mbox{ref}}) &= \mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[\mathbb{KL}\big[p^{*}(z|y_{1},y_{2},x)\,||\,p_{\theta}(z|y_{1},y_{2},x)\big]\Big] \tag{60} \\
&= -\mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[\mathbb{E}_{z\sim p^{*}(z|y_{1},y_{2},x)}\log p_{\theta}(z|y_{1},y_{2},x)\Big] ~+~ C \notag \\
&\equiv -\mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[p^{*}(z=1|y_{1},y_{2},x)\log p_{\theta}(z=1|y_{1},y_{2},x)\Big] \notag \\
&\quad~ -\mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[p^{*}(z=0|y_{1},y_{2},x)\log p_{\theta}(z=0|y_{1},y_{2},x)\Big] \notag \\
&= -\mathbb{E}_{\{y_{1},y_{2}\}\sim\pi_{\tiny\mbox{ref}}(y|x),\,x\sim\mathcal{D}_{x}}\Big[p^{*}(z=1|y_{1},y_{2},x)\log p_{\theta}(z=1|y_{1},y_{2},x) \notag \\
&\hspace{102pt} +~~p^{*}(z=1|y_{2},y_{1},x)\log p_{\theta}(z=1|y_{2},y_{1},x)\Big] \notag \\
&= -\mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\Big[\log p_{\theta}(z=1|y_{w},y_{l},x)\Big] \notag \\
&= -\mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\left[\log\left(\frac{\pi_{\theta}(y_{w}|x)}{\pi_{\theta}(y_{w}|x)+\pi_{\theta}(y_{l}|x)}\right)\right] \notag \\
&= \mathbb{E}_{\{y_{w},y_{l},x\}\sim\mathcal{D}_{\tiny\mbox{tr}}}\left[\log\left(1+\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\theta}(y_{w}|x)}\right)\right], \notag
\end{align}

where $C$ is a constant independent of $\theta$. Additionally, the third-to-last equality stems from the definition of how tuples $\{y_w,y_l,x\}$ are sampled. In particular, for a given pair $\{y_1,y_2\}$, by definition a proportion $p^*(z=1|y_1,y_2,x)$ of the time $y_w=y_1$, while a proportion $p^*(z=0|y_1,y_2,x)=p^*(z=1|y_2,y_1,x)$ of the time $y_w=y_2$. Hence

$$
\begin{aligned}
&p^*(z=1|y_1,y_2,x)\log p_\theta(z=1|y_1,y_2,x)~+~p^*(z=1|y_2,y_1,x)\log p_\theta(z=1|y_2,y_1,x) \\
&\qquad\equiv~\log p_\theta(z=1|y_w,y_l,x)
\end{aligned}
\tag{61}
$$

when the latter is averaged over the preference distribution. $\blacksquare$
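As an informal numerical check of the last two equalities above and of (61), the following minimal Python sketch may be helpful. It is not part of the original analysis: the function names `p_theta` and `pairwise_loss`, the log-probabilities, and the preference probability `p_star` are all illustrative assumptions for a single prompt and a single response pair.

```python
import math
import random


def p_theta(logp_a, logp_b):
    """p_theta(z=1 | y_a, y_b, x) = pi(y_a|x) / (pi(y_a|x) + pi(y_b|x)), from log-probs."""
    return 1.0 / (1.0 + math.exp(logp_b - logp_a))


def pairwise_loss(logp_w, logp_l):
    """log(1 + pi(y_l|x) / pi(y_w|x)), evaluated as a numerically stable softplus."""
    d = logp_l - logp_w
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))


# Hypothetical values (assumptions, not taken from the paper).
logp_y1, logp_y2 = -2.0, -3.0  # log pi_theta(y1|x), log pi_theta(y2|x)
p_star = 0.7                   # p*(z=1 | y1, y2, x)

# Last two equalities: -log(pi_w / (pi_w + pi_l)) equals log(1 + pi_l / pi_w).
lhs = -math.log(p_theta(logp_y1, logp_y2))
rhs = pairwise_loss(logp_y1, logp_y2)

# Left-hand side of (61): preference-weighted sum over both orderings.
exact = (p_star * math.log(p_theta(logp_y1, logp_y2))
         + (1.0 - p_star) * math.log(p_theta(logp_y2, logp_y1)))

# Right-hand side of (61): average of log p_theta(z=1 | y_w, y_l, x) when the
# winner y_w is y1 with probability p_star and y2 otherwise.
rng = random.Random(0)
n = 100_000
total = 0.0
for _ in range(n):
    if rng.random() < p_star:
        w, l = logp_y1, logp_y2
    else:
        w, l = logp_y2, logp_y1
    total += math.log(p_theta(w, l))
mc = total / n

print(abs(lhs - rhs), exact, mc)
```

Under these assumptions, the first printed value should be at the level of floating-point round-off, and the Monte Carlo average should agree with the weighted sum up to sampling error. Writing the loss as a softplus of the log-probability difference avoids forming the ratio $\pi_\theta(y_l|x)/\pi_\theta(y_w|x)$ explicitly, which can overflow for long sequences.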

Appendix G Limitations

Because this is primarily an analysis-driven contribution, our experiments on real-world data are limited to Figure 4. Moreover, there are promising possibilities raised by pairing our contribution with prior work in new ways that have not yet been explored. One example is the potential use of REINFORCE in conjunction with modifications to the proposed $\ell_{\tiny\mbox{TYPO}}$ loss.

Appendix H Broader Impacts

Aligning the output of LLMs with human preferences has obvious, well-documented benefits. However, there remains the risk that tools designed to improve LLM responses could be repurposed for nefarious aims. For example, preference data labels could be modified so that models trained with preference losses such as ours intentionally produce toxic content.