Length Desensitization in Direct Preference Optimization

Wei Liu1,*, Yang Bai2,*, Chengcheng Han2, Rongxiang Weng2, Jun Xu2,
Xuezhi Cao2, Jingang Wang2, Xunliang Cai2
1Shanghai Jiao Tong University   2Meituan Inc.
[email protected], [email protected]
*Equal contribution. Work done during an internship at Meituan.
Abstract

Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO’s optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We use two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conduct in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with real human preferences.

“Brevity is the Soul of Wit.”

—William Shakespeare

1 Introduction

Human preference alignment is crucial to enable large language models (LLMs) to be helpful, honest, and harmless. Among the various methods to achieve effective alignment (Dai et al., 2024; Yuan et al., 2024a), Direct Preference Optimization (DPO) has emerged as a promising technique (Rafailov et al., 2024), giving rise to numerous derivative algorithms (Hong et al., 2024; Chen et al., 2024b; Ethayarajh et al., 2024). DPO eliminates the reliance on online Reward Models (RMs) by reparameterizing the reward function in Reinforcement Learning from Human Feedback (RLHF), thereby implementing a simple and stable offline preference learning paradigm. Among the dimensions of human language preferences, detailedness is one of the most straightforward categories that current alignment algorithms can effortlessly capture, as longer texts tend to be richer in content. However, it has been demonstrated that DPO is susceptible to an over-optimization issue in this particular preference dimension (Xu et al., 2024). As shown in Fig.1, this overemphasis results in models that produce excessively verbose responses, which can compromise their instruction-following and reasoning capabilities (Ding et al., 2023; Yuan et al., 2024b).

Figure 1: Average response length of multi-iteration DPO Models (Chen et al., 2024d) on AlpacaEval 2.

The phenomenon of verbose responses caused by DPO is often attributed to the presence of length bias in the training data (Park et al., 2024; Singhal et al., 2023). This bias arises from an inherent length preference in offline RMs (Wang et al., 2024; Chen et al., 2024c), which results in most preferred responses (chosen) being significantly longer than the dispreferred ones (rejected). Based on this assumption, Yuan et al. (2024b) proposed LIFT-DPO to mitigate the length bias in the training data through a prompt enhancement strategy. Recently, more researchers have questioned the efficacy of the DPO algorithm itself. Park et al. (2024) introduce a regularization term in the optimization objective to adjust the weight of the gradient according to the length difference between the preference pairs. Meng et al. (2024) propose a reference-model-free method, SimPO, which uses the length-averaged likelihood to eliminate the effect of data length. Lu et al. (2024) introduce a down-sampling approach on KL divergence to eliminate the length reliance of DPO. Although Lu et al. (2024) have conducted a statistical analysis of the implicit rewards during the DPO process and found that the rewards might be overestimated or underestimated due to length, the theoretical explanation of why DPO encounters this issue remains inadequately explored. Moreover, experimental results demonstrate that these methods either fail to achieve significant length control or compromise performance to some extent.

In this paper, we attribute the verbosity problem to the length sensitivity of DPO. Specifically, the partial derivatives of the optimization objective of DPO with respect to the chosen and rejected responses are inversely proportional to their respective likelihoods (Feng et al., 2024a). Since the likelihood, which is calculated as the product of the conditional probabilities of each token, decreases rapidly with increasing sequence length, longer chosen or rejected responses are disproportionately favored in the optimization process. Moreover, the length disparity between the chosen and rejected responses will substantially skew the optimization objective, ultimately biasing the direction of optimization. While decreasing the likelihood of any rejected response may not affect response length, the primary cause of the verbosity problem is the model’s tendency to increase the likelihood of longer chosen responses.

To address this issue, we propose an offline optimization algorithm for Length Desensitization of DPO, termed LD-DPO. In this approach, we decompose the likelihood of the longer response in a preference pair into the product of the likelihood of the public-length portion and the likelihood of the excessive portion. The excessive portion is further broken down into human-like preference (implicit and intrinsic) and verbosity preference (due to excess length). LD-DPO aims to mitigate the verbosity preference caused by excessively long responses, thereby smoothing the relationship between the likelihood and response length. This adjustment reduces the influence of length on the optimization direction in DPO, effectively achieving length desensitization.

We employ two settings (Base and Instruct) of Llama2-13B (Touvron et al., 2023), Llama3-8B (AI@Meta, 2024), and Qwen2-7B (Yang et al., 2024) for experimental validation on MT-Bench (Zheng et al., 2024) and AlpacaEval 2 (Dubois et al., 2024). The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. Moreover, experiments on MT-Bench and ProofWriter (Tafjord et al., 2021) show that LD-DPO significantly improves models’ reasoning performance. An interesting phenomenon is also observed: the length sensitivity during DPO training exhibits a negative correlation with the underlying model capability. We then define $\gamma$ as the length sensitivity coefficient and conduct a detailed analysis of the DPO length sensitivity across models of varying capabilities; we believe $\gamma$ is instructive for the entire preference optimization process. Our contributions are summarized as follows:

  • To the best of our knowledge, we are the first to propose that DPO exhibits length sensitivity and provide theoretical validation for this phenomenon.

  • We propose LD-DPO, a length-desensitization preference optimization algorithm that mitigates length sensitivity by decoupling length preference from the reward.

  • We experimentally verify that LD-DPO enables the model to achieve superior results with more concise responses, reducing response length by 10-40% compared to DPO.

2 Preliminaries

In this section, we outline the standard pipeline of Reinforcement Learning From Human Feedback (RLHF) (Bai et al., 2022; Ziegler et al., 2019) and the Direct Preference Optimization (DPO) algorithm (Rafailov et al., 2024), both of which are essential for the analysis of the length sensitivity of DPO and the motivation of our method.

2.1 Reinforcement Learning From Human Feedback

The standard pipeline of RLHF aligns LLMs with human preferences in three stages:
Supervised Fine-tuning (SFT) stage: In this stage, labeled data is used to fine-tune the pre-trained model so that it acquires a basic ability to follow commands and carry on a fluent conversation, yielding the model $\pi^{SFT}(y|x)$.
Reward Model (RM) Training stage: In the second stage, $\pi^{SFT}(y|x)$ is prompted with $x$ to generate pairs of responses $(y_1, y_2) \sim \pi^{SFT}(y|x)$, which are then labeled by human annotators as a preferred answer $y_w$ and a dispreferred answer $y_l$, denoted as $y_w > y_l$. To predict these preferences, previous works employ the Bradley-Terry (BT) RM (Bradley & Terry, 1952), which essentially constructs a pairwise contrast:

$$\mathcal{L}_{RM} = -\log \frac{\exp(r_{\phi}(x, y_w))}{\exp(r_{\phi}(x, y_w)) + \exp(r_{\phi}(x, y_l))}. \tag{1}$$

Reinforcement Learning (RL) stage: In the final stage, the reward function is used to provide feedback to the language model. The optimization objective is formulated as:

$$\max_{\pi_{\theta}} \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(y|x)} [r_{\phi}(x, y)] - \beta\, \mathbb{D}_{KL}[\pi_{\theta}(y|x) \,\|\, \pi_{ref}(y|x)], \tag{2}$$

where $\beta$ is a parameter controlling the deviation from the reference model $\pi_{ref}$, namely the initial SFT model $\pi^{SFT}(y|x)$. In practice, the language model $\pi_{\theta}$ is also initialized to $\pi^{SFT}(y|x)$. This objective is optimized using a general-purpose RL algorithm, such as PPO (Wu et al., 2023).
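The pairwise contrast in Eq.1 can be illustrated with a minimal Python sketch. The function name `bt_loss` and the scalar reward inputs are assumptions made for illustration; in practice the rewards would come from a trained reward model $r_{\phi}$.

```python
import math

def bt_loss(r_w: float, r_l: float) -> float:
    """Eq. 1 for a single pair: -log(exp(r_w) / (exp(r_w) + exp(r_l))).

    r_w and r_l are hypothetical scalar rewards for the preferred and
    dispreferred responses.
    """
    # The expression equals softplus(r_l - r_w) = log(1 + exp(r_l - r_w)),
    # evaluated here in a numerically stable form.
    d = r_l - r_w
    return max(d, 0.0) + math.log1p(math.exp(-abs(d)))

# The loss shrinks as the preferred response is scored higher than the other.
loss_good_ranking = bt_loss(r_w=2.0, r_l=0.0)
loss_bad_ranking = bt_loss(r_w=0.0, r_l=2.0)
```

When both rewards are equal, the loss is $\log 2$, which corresponds to the reward model assigning probability 0.5 to either ordering.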

2.2 Direct Preference Optimization

Direct Preference Optimization (DPO) is one of the most popular offline preference optimization methods. Starting from the same objective as Eq.2, it reparameterizes the reward function $r$ using a closed-form expression with the optimal policy:

$$r(x, y) = \beta \log \frac{\pi_{\theta}(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x), \tag{3}$$

where $Z(x) = \sum_{y} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function. Apart from the $Z(x)$ term, the reward in Eq.3 depends only on $\pi_{ref}$ and $\pi_{\theta}$, so no additional training of the RM is required. Incorporating Eq.3 into the BT ranking objective, $p(y_w > y_l | x) = \sigma(r(x, y_w) - r(x, y_l))$, the $\beta \log Z(x)$ terms cancel in the reward difference, and the optimization objective becomes:

$$\mathcal{L}_{DPO}(\pi_{\theta}; \pi_{ref}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\theta}(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_{\theta}(y_l|x)}{\pi_{ref}(y_l|x)} \right) \right]. \tag{4}$$

DPO replaces the reward model (RM) with an implicit reward, offering enhanced stability and ease of training compared to traditional reinforcement learning methods such as PPO. Several related works have validated the effectiveness of this paradigm.
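As a minimal sketch of Eq.4 for a single preference pair, the loss can be computed directly from sequence-level log-likelihoods. The function names, the argument names, and the default `beta=0.1` are assumptions for illustration and are not taken from the paper's implementation.

```python
import math

def log_sigmoid(z: float) -> float:
    # log(sigma(z)) = -softplus(-z), written in a numerically stable form.
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Eq. 4 for one pair, given log pi(y|x) under the policy and the reference."""
    # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# At initialization the policy equals the reference model, so the margin is 0.
initial_loss = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen log-likelihood relative to the reference lowers the loss.
improved_loss = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Working in log space avoids underflow, since sequence likelihoods are products of many per-token probabilities.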

3 Methodology

In this section, we first conduct a theoretical analysis of the optimization objective of DPO and verify that differences in data length significantly affect the optimization direction during the training process, demonstrating that DPO is length-sensitive. We then derive our LD-DPO algorithm, which addresses the length sensitivity problem by reparameterizing the likelihood, thereby preventing the generation of verbose responses and aligning the model more closely with human-like preferences.

3.1 Length Sensitivity of DPO

According to the optimization objective of DPO in Eq.4, the purpose of DPO is to make the likelihood of the human-preferred response $y_w$ given $x$ greater than that of the human-dispreferred response $y_l$, denoted as $\pi_{\theta}(y_w|x) > \pi_{\theta}(y_l|x)$. Additionally, $\pi_{\theta}$ serves as the actor model, while $\pi_{ref}$ is introduced to prevent the model from deviating from the reference model. Both models are typically products of the Supervised Fine-Tuning (SFT) phase.

Following Feng et al. (2024a), we provide the theoretical derivation of the optimization objective for DPO. Denoting $\mathcal{X}_1 = \pi_{\theta}(y_w|x)$, $\mathcal{X}_2 = \pi_{\theta}(y_l|x)$, $\mathcal{K}_1 = \pi_{ref}(y_w|x)$, and $\mathcal{K}_2 = \pi_{ref}(y_l|x)$, the loss function of DPO can be written as:

$$\mathcal{L}_{DPO}(\mathcal{X}_1; \mathcal{X}_2) = -\log \left( \frac{(\mathcal{K}_2 \mathcal{X}_1)^{\beta}}{(\mathcal{K}_2 \mathcal{X}_1)^{\beta} + (\mathcal{K}_1 \mathcal{X}_2)^{\beta}} \right). \tag{5}$$
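As a brief worked step (a sketch added for clarity, using only the symbols defined above), Eq.5 follows from Eq.4 for a single preference pair by expanding the sigmoid $\sigma(z) = e^{z}/(1+e^{z})$ in terms of the four likelihoods:

```latex
\begin{aligned}
z &= \beta\log\frac{\mathcal{X}_1}{\mathcal{K}_1} - \beta\log\frac{\mathcal{X}_2}{\mathcal{K}_2}
   = \log\frac{(\mathcal{K}_2\mathcal{X}_1)^{\beta}}{(\mathcal{K}_1\mathcal{X}_2)^{\beta}}, \\
\sigma(z) &= \frac{e^{z}}{1+e^{z}}
   = \frac{(\mathcal{K}_2\mathcal{X}_1)^{\beta}}{(\mathcal{K}_2\mathcal{X}_1)^{\beta} + (\mathcal{K}_1\mathcal{X}_2)^{\beta}}, \\
\mathcal{L}_{DPO}(\mathcal{X}_1;\mathcal{X}_2) &= -\log\sigma(z).
\end{aligned}
```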

We calculate the partial derivatives of $\mathcal{L}_{DPO}$ with respect to $\mathcal{X}_1$ and $\mathcal{X}_2$:

$$\left\{\begin{array}{l}
\dfrac{\partial \mathcal{L}_{DPO}(\mathcal{X}_1; \mathcal{X}_2)}{\partial \mathcal{X}_1} = -\dfrac{\beta (\mathcal{K}_1 \mathcal{X}_2)^{\beta}}{\mathcal{X}_1 \left( (\mathcal{K}_2 \mathcal{X}_1)^{\beta} + (\mathcal{K}_1 \mathcal{X}_2)^{\beta} \right)}, \\
\dfrac{\partial \mathcal{L}_{DPO}(\mathcal{X}_1; \mathcal{X}_2)}{\partial \mathcal{X}_2} = \dfrac{\beta \mathcal{K}_1^{\beta} \mathcal{X}_2^{\beta - 1}}{(\mathcal{K}_2 \mathcal{X}_1)^{\beta} + (\mathcal{K}_1 \mathcal{X}_2)^{\beta}},
\end{array}\right. \tag{6}$$

which leads to the following result:

$$\left\lvert \frac{\partial \mathcal{L}_{DPO}(\mathcal{X}_1; \mathcal{X}_2)}{\partial \mathcal{X}_1} \Big/ \frac{\partial \mathcal{L}_{DPO}(\mathcal{X}_1; \mathcal{X}_2)}{\partial \mathcal{X}_2} \right\rvert = \frac{\mathcal{X}_2}{\mathcal{X}_1} = \frac{\pi_{\theta}(y_l|x)}{\pi_{\theta}(y_w|x)}. \tag{7}$$

Therefore, the partial derivatives of the optimization objective with respect to $\pi_{\theta}(y_w|x)$ and $\pi_{\theta}(y_l|x)$ are inversely proportional to their respective values. The derivation of Eq.7 and a detailed analysis of the absolute magnitude of the gradient are provided in Appendix A.1. When $\pi_{\theta}(y_w|x)$ is less than $\pi_{\theta}(y_l|x)$, DPO tends to increase the likelihood of generating the human-preferred response $y_w$. Conversely, when $\pi_{\theta}(y_w|x)$ is greater than $\pi_{\theta}(y_l|x)$, DPO tends to avoid generating the human-dispreferred response $y_l$. Based on this conclusion, we analyze the length sensitivity of DPO as follows.
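Before turning to that analysis, the ratio in Eq.7 can be checked numerically. The sketch below evaluates the loss of Eq.5 and estimates both partial derivatives with central finite differences. The function names and the likelihood values are illustrative assumptions, not quantities from the paper.

```python
import math

def dpo_loss_x(x1, x2, k1, k2, beta):
    """Eq. 5: DPO loss for one pair as a function of the four likelihoods."""
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return -math.log(a / (a + b))

def grad_ratio(x1, x2, k1, k2, beta, rel_step=1e-6):
    """|dL/dX1 / dL/dX2| estimated with central finite differences."""
    h1, h2 = rel_step * x1, rel_step * x2
    d1 = (dpo_loss_x(x1 + h1, x2, k1, k2, beta)
          - dpo_loss_x(x1 - h1, x2, k1, k2, beta)) / (2 * h1)
    d2 = (dpo_loss_x(x1, x2 + h2, k1, k2, beta)
          - dpo_loss_x(x1, x2 - h2, k1, k2, beta)) / (2 * h2)
    return abs(d1 / d2)

# Illustrative likelihood values (assumed for this sketch).
x1, x2, k1, k2, beta = 1e-3, 4e-3, 2e-3, 3e-3, 0.1
ratio = grad_ratio(x1, x2, k1, k2, beta)
# Eq. 7 predicts ratio = x2 / x1, independent of k1, k2, and beta.
```

With these values the chosen response is four times less likely than the rejected one, so the gradient magnitude on the chosen likelihood is about four times larger, as Eq.7 predicts.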

Figure 2: Comparison of the relationship between the length of preference data and $\pi_{\theta}(y_l|x)/\pi_{\theta}(y_w|x)$ under both DPO and LD-DPO. Panels: (a) DPO ($\alpha = 1$); (b) $\alpha = 0.5$; (c) $\alpha = 0$. Measured on Llama3-8B-Instruct with the UltraFeedback dataset (Cui et al., 2023); the heatmap values represent $\log \pi_{\theta}(y_l|x) - \log \pi_{\theta}(y_w|x)$.

In the DPO process, the likelihood $\pi_{\theta}(y|x)$ of a sequence-level output is obtained by cumulatively multiplying the conditional probability of each token, $p(y_i|x, y_{<i})$, as shown in Eq.8.

$$\pi_{\theta}(y|x) = \prod_{i=1}^{\mathrm{len}(y)} p(y_i | x, y_{<i}). \tag{8}$$

As the conditional probability of the current token from the policy, $p(y_i|x, y_{<i})$, lies within the range $[0, 1]$, $\pi_{\theta}(y|x)$ becomes smaller as the sentence $y$ consists of more tokens. According to Eq.7, if $\mathrm{len}(y_w) > \mathrm{len}(y_l)$, then it is highly likely that $\pi_{\theta}(y_w|x) < \pi_{\theta}(y_l|x)$, so the language model tends to generate the longer response $y_w$ after DPO. Conversely, if $\mathrm{len}(y_w) < \mathrm{len}(y_l)$, DPO prevents the output of the longer answer $y_l$, but this does not imply that the shorter answer $y_w$ will be preferred. Verbosity therefore arises.
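The effect of length on the sequence likelihood in Eq.8 can be made concrete with a small sketch. It assumes, purely for illustration, that every token has the same conditional probability of 0.9 and compares a 50-token response with a 200-token response. The function name and the numbers are hypothetical.

```python
import math

def sequence_log_likelihood(token_probs):
    """log pi(y|x) = sum_i log p(y_i | x, y_<i), i.e. the log of Eq. 8."""
    return sum(math.log(p) for p in token_probs)

# Assumed per-token probability; real models produce varying values.
short_response = [0.9] * 50
long_response = [0.9] * 200

# Difference in log-likelihood between the short and the long response.
gap = sequence_log_likelihood(short_response) - sequence_log_likelihood(long_response)
# exp(gap) is the likelihood ratio pi(short) / pi(long), which enters Eq. 7.
likelihood_ratio = math.exp(gap)
```

Even with identical per-token quality, the 150 extra tokens make the longer response several orders of magnitude less likely, which is the imbalance that Eq.7 converts into a gradient imbalance.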

As shown in Fig.2(a) and based on the above analysis, DPO is more sensitive to data pairs with large differences in length. Therefore, it tends to guide the model to prioritize length preferences in the data, ignoring other human-like preferences that are more important.

3.2 Derivation of LD-DPO

Based on the analysis conducted in the preceding section, it is evident that the length sensitivity of DPO primarily originates from the substantial influence text length exerts on the likelihood $\pi_{\theta}(y|x)$. This influence consequently biases the optimization process towards favoring data with a length advantage. To address this issue, Length-Desensitization DPO (LD-DPO) is employed to diminish the impact of length on the likelihood. This adjustment allows the optimization process to focus more on the substantive content of the text, thereby better aligning with human preference.

For a pair of preference data $(y_w, y_l)$ with lengths $(l_w, l_l)$, we denote $l_p = \min(l_w, l_l)$ as the public length. Then the likelihood of a response in DPO can be rewritten as:

$$\pi_\theta(y|x) = \prod_{i=1}^{l_p} p(y_i|x,y_{<i}) \prod_{i=l_p+1}^{l} p(y_i|x,y_{<i}), \tag{9}$$

where $l$ is the length of $y$. The second term contains extensive length information, which directly decreases the reward and further biases the optimization objective. This bias contributes to the overall length sensitivity of DPO.
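In log space, the factorization in Eq.9 is simply a split of the per-token log-probabilities at the public length. A minimal sketch follows; the token log-probabilities and the helper name `split_log_likelihood` are hypothetical and serve only as an illustration.

```python
def split_log_likelihood(token_logps, public_len):
    """Split log pi_theta(y|x) into the public-length part and the excess part (Eq. 9 in log space)."""
    public_part = sum(token_logps[:public_len])
    excess_part = sum(token_logps[public_len:])
    return public_part, excess_part

# Hypothetical per-token log-probabilities for a chosen/rejected pair.
chosen_logps = [-0.2, -0.1, -0.3, -0.2, -0.4, -0.1]
rejected_logps = [-0.3, -0.2, -0.5]

public_len = min(len(chosen_logps), len(rejected_logps))  # l_p

for name, logps in [("chosen", chosen_logps), ("rejected", rejected_logps)]:
    public_part, excess_part = split_log_likelihood(logps, public_len)
    print(name, round(public_part, 3), round(excess_part, 3))
```

Only the longer response of the pair has a non-empty excess part; for the shorter one the second product in Eq.9 is empty and equals 1.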

In LD-DPO, our objective is to attenuate the sensitivity of DPO by eliminating the verbosity preferences induced by the excessively long portions, while concurrently maintaining the human-like preferences, which include a certain degree of length preference. Initially, as demonstrated in Eq.10, we disassociate the verbosity preferences from the likelihood of the over-long portion (second term in Eq.9) by introducing a hyperparameter $\alpha \in [0,1]$.

$$\prod_{i=l_p+1}^{l} \underbrace{p^{\alpha}(y_i|x,y_{<i})}_{\text{human-like preferences}} \underbrace{p^{1-\alpha}(y_i|x,y_{<i})}_{\text{verbosity preference}}. \tag{10}$$

We then diminish the length sensitivity of DPO by removing the verbosity preference from $\pi_\theta(y|x)$, obtaining the modified likelihood $\hat{\pi}_\theta(y|x)$ in LD-DPO:

$$\hat{\pi}_\theta(y|x) = \prod_{i=1}^{l_p} p(y_i|x,y_{<i}) \prod_{i=l_p+1}^{l} p^{\alpha}(y_i|x,y_{<i}). \tag{11}$$

When $\alpha=1$, $\hat{\pi}_\theta(y|x) = \pi_\theta(y|x)$, which is consistent with vanilla DPO. Conversely, when $\alpha=0$, the likelihood of the over-length part is equal to $1$, meaning that only the public-length part is considered. Ultimately, we reformulate $\hat{\pi}_\theta(y|x)$ in Eq.12 to present it in a more elegant form, with the detailed derivation provided in Appendix A.2.

$$\hat{\pi}_\theta(y|x) = \prod_{i=1}^{l} p^{\alpha}(y_i|x,y_{<i}) \prod_{i=1}^{l_p} p^{1-\alpha}(y_i|x,y_{<i}). \tag{12}$$
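The step from Eq.11 to Eq.12 can be sketched in a few lines. This is a reconstruction from the two equations, not a copy of Appendix A.2. Writing $p_i = p(y_i|x,y_{<i})$ as shorthand (a notation introduced here only for brevity) and using $p_i = p_i^{\alpha}\,p_i^{1-\alpha}$ on the public-length part:

```latex
\begin{aligned}
\hat{\pi}_\theta(y|x)
  &= \prod_{i=1}^{l_p} p_i \prod_{i=l_p+1}^{l} p_i^{\alpha} \\
  &= \prod_{i=1}^{l_p} p_i^{\alpha}\, p_i^{1-\alpha} \prod_{i=l_p+1}^{l} p_i^{\alpha} \\
  &= \prod_{i=1}^{l} p_i^{\alpha} \prod_{i=1}^{l_p} p_i^{1-\alpha}.
\end{aligned}
```

In log space this reads $\log \hat{\pi}_\theta(y|x) = \alpha \sum_{i=1}^{l} \log p_i + (1-\alpha) \sum_{i=1}^{l_p} \log p_i$, which is the form most convenient for implementation.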

It is observable that $\hat{\pi}_\theta(y|x)$ is constituted by the complete sequence and the public-length component of the preference data pair. The proportion between these two components can be modulated by adjusting the hyperparameter $\alpha$. In scenarios where the length sensitivity during the DPO training process is relatively pronounced, a smaller $\alpha$ should be chosen in order to decouple the verbosity preference. Conversely, a larger $\alpha$ should be selected to avert the loss of genuine human preferences.
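A minimal sketch of the modified likelihood in log space is given below. It assumes that per-token log-probabilities are already available as a plain Python list; the function names and the numbers are hypothetical. The second function evaluates the Eq.11 form so that the two expressions can be compared.

```python
import math

def ld_log_likelihood(token_logps, public_len, alpha):
    """Log of the modified likelihood in Eq. 12.

    alpha * (sum over all tokens) + (1 - alpha) * (sum over the first public_len tokens).
    """
    full = sum(token_logps)
    public = sum(token_logps[:public_len])
    return alpha * full + (1.0 - alpha) * public

def ld_log_likelihood_eq11(token_logps, public_len, alpha):
    """Same quantity written as in Eq. 11: excess tokens are down-weighted by alpha."""
    public = sum(token_logps[:public_len])
    excess = sum(token_logps[public_len:])
    return public + alpha * excess

# Hypothetical token log-probabilities; the response has 6 tokens and l_p = 3.
logps = [-0.2, -0.1, -0.3, -0.2, -0.4, -0.1]
l_p = 3

for alpha in (0.0, 0.5, 1.0):
    a = ld_log_likelihood(logps, l_p, alpha)
    b = ld_log_likelihood_eq11(logps, l_p, alpha)
    assert math.isclose(a, b)  # Eq. 11 and Eq. 12 agree
    print(alpha, round(a, 3))
```

With `alpha=1.0` the function returns the ordinary sequence log-likelihood, and with `alpha=0.0` it returns the log-likelihood of the public-length prefix only.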

As shown in Fig.2, compared to DPO (Fig.2(a)), LD-DPO (Fig.2(b), 2(c)) can smooth $\pi_\theta(y_l|x)/\pi_\theta(y_w|x)$, and this effect is markedly amplified as $\alpha$ diminishes. Based on the prior analysis and Eq.7, it is evident that the optimization direction of LD-DPO is less affected by the length disparity within the preference data pairs. This indicates that, relative to DPO, LD-DPO has achieved a measure of length desensitization.
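To make the effect on training concrete, the sketch below plugs the modified likelihood of Eq.12 into the standard DPO loss for a single preference pair. It is an illustration under stated assumptions, not the authors' implementation: it assumes the same modification is applied to the reference model, and all function names and token log-probabilities are hypothetical.

```python
import math

def ld_logp(token_logps, public_len, alpha):
    """Log of the LD-DPO likelihood (Eq. 12) from per-token log-probabilities."""
    return alpha * sum(token_logps) + (1.0 - alpha) * sum(token_logps[:public_len])

def ld_dpo_loss(policy_w, policy_l, ref_w, ref_l, alpha, beta=0.1):
    """Per-pair loss: the DPO loss with every sequence likelihood replaced by Eq. 12.

    Assumption: the reference model's likelihood is modified in the same way.
    Each argument is a list of per-token log-probabilities.
    """
    l_p = min(len(policy_w), len(policy_l))  # public length
    logratio_w = ld_logp(policy_w, l_p, alpha) - ld_logp(ref_w, l_p, alpha)
    logratio_l = ld_logp(policy_l, l_p, alpha) - ld_logp(ref_l, l_p, alpha)
    margin = beta * (logratio_w - logratio_l)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical per-token log-probabilities; the chosen response is longer.
policy_w = [-0.2, -0.1, -0.3, -0.2, -0.4]
policy_l = [-0.5, -0.6]
ref_w = [-0.3, -0.2, -0.3, -0.3, -0.5]
ref_l = [-0.4, -0.5]

for alpha in (1.0, 0.5, 0.0):
    print(alpha, round(ld_dpo_loss(policy_w, policy_l, ref_w, ref_l, alpha), 4))
```

At `alpha=1.0` the function reduces to the vanilla DPO loss; smaller values shrink the contribution of the tokens beyond the public length.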

4 Experimental Setup

We follow the experimental setup of SimPO (Meng et al., 2024) to objectively demonstrate the validity of our method.

Models and training settings. We perform preference optimization using three families of models: Llama2-13B (Touvron et al., 2023), Llama3-8B (AI@Meta, 2024) and Qwen2-7B (Yang et al., 2024) under two setups: Base and Instruct/Chat.

For the Base setup, we train a base language model on the UltraChat-200k dataset (Ding et al., 2023) to obtain an SFT model, which possesses a basic capability for conversation. For the Instruct setup, we select their corresponding instruct models (i.e., Llama2-13B-Chat, Llama3-8B-Instruct, and Qwen2-7B-Instruct) as initial models. These models are more powerful and robust compared to the base models. Both setups ensure a high level of transparency as the models and training data are open source.

In the preference optimization phase, we utilize UltraFeedback (Cui et al., 2023) as the human preference dataset. This dataset consists of 60,000 high-quality data pairs $(x, y_w, y_l)$ designed to align with human conversational preferences and emphasize helpfulness.

Evaluation benchmarks. We primarily evaluate our models using two of the most popular open-ended evaluation benchmarks: MT-Bench (Zheng et al., 2024) and AlpacaEval 2 (Dubois et al., 2024). These benchmarks assess the model's versatile conversational capabilities across a wide range of queries and have been widely adopted by the community.

MT-Bench comprises 80 questions spanning 8 categories, whereas AlpacaEval 2 encompasses 805 questions derived from 5 datasets. We present the results in accordance with the evaluation protocol designated for each benchmark. For MT-Bench, we present the average score and provide a detailed breakdown of the scores for each capability item in Appendix B.2. In the case of AlpacaEval 2, we report the length-controlled (LC) win rate against GPT-4-1106-preview, a metric specifically engineered to be resistant to model verbosity. All our evaluations use GPT-4-turbo-0409 as the judge model. Furthermore, we calculate the average response length on each benchmark to compare the effects of different methods on response length.

Baselines. We compare LD-DPO with five other offline preference optimization techniques. Among these, DPO serves as our most crucial comparison. R-DPO revises DPO by incorporating a length regularization term to control the response length. SamPO avoids reward overestimation or underestimation due to length by down-sampling the KL divergence. WPO simulates the on-policy learning process by adding weights to the optimization objective of DPO. SimPO introduces an optimization objective that does not rely on a reference model, and mitigates the impact of data length by utilizing average likelihood.
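For contrast with the likelihood-based reward of DPO, the average-likelihood idea behind SimPO can be sketched as follows. The formula follows the SimPO objective as it is commonly stated (average per-token log-probability scaled by β, with a target margin γ); the function names and toy inputs are hypothetical, and the default values mirror the SimPO settings used in this paper.

```python
import math

def simpo_reward(token_logps, beta=2.0):
    """Length-normalized reward: beta times the average per-token log-probability."""
    return beta * sum(token_logps) / len(token_logps)

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=1.0):
    """-log(sigmoid(r_w - r_l - gamma)) with length-normalized rewards and no reference model."""
    margin = simpo_reward(chosen_logps, beta) - simpo_reward(rejected_logps, beta) - gamma
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Two hypothetical responses with the same per-token quality but different lengths.
short = [-0.2] * 10
long = [-0.2] * 100

print(round(simpo_reward(short), 3), round(simpo_reward(long), 3))  # equal rewards
print(round(simpo_loss(long, short), 4))
```

Because the reward is an average rather than a sum, two responses with the same per-token log-probability receive the same reward regardless of length.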

| Phase | LR | BS | Epoch | LS | WP |
|---|---|---|---|---|---|
| SFT | 2e-5 | 128 | 3 | cosine | 10% |
| PO | 5e-7 | 32 | 1 | cosine | 10% |
Table 1: General training hyperparameter settings for the SFT phase and the preference optimization (PO) phase, including Learning Rate (LR), Batch Size (BS), Epoch, Learning rate Schedule (LS), and Warmup Phase (WP).

General Training Hyperparameters. The training hyperparameters are shown in Table.1. Additionally, to ensure the performance of the offline preference optimization algorithms, we tune the method-specific hyperparameters for all methods. In general, we set $\beta=0.1$ for DPO, R-DPO, SamPO and LD-DPO. Specifically, we set $\beta=2.0$ and $\gamma=1.0$ for SimPO and $\alpha=0.05$ for R-DPO, and for LD-DPO we sweep $\alpha \in \{0.1, 0.2, \ldots, 1.0\}$ to explore its effect on generation length and model performance. Finally, all preference optimization training was conducted on 16 A100-80G GPUs.

5 Experimental Results

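Llama2-Base (13B) and Llama2-Chat (13B)

| Method | Base: MT-Bench Score | Base: MT-Bench Avg. Len | Base: AlpacaEval 2 LC (%) | Base: AlpacaEval 2 Avg. Len | Chat: MT-Bench Score | Chat: MT-Bench Avg. Len | Chat: AlpacaEval 2 LC (%) | Chat: AlpacaEval 2 Avg. Len |
|---|---|---|---|---|---|---|---|---|
| SFT | 5.51 | 170 | 6.56 | 220 | 6.35 | 326 | 23.46 | 452 |
| DPO | 5.67 | 191 | 6.70 | 266 | 6.33 | 368 | 25.52 | 487 |
| R-DPO | 5.45 | 150 | 7.64 | 198 | 6.32 | 346 | 26.27 | 461 |
| SimPO | 5.45 | 180 | 7.31 | 246 | 6.40 | 351 | 26.38 | 471 |
| WPO | 5.76 | 185 | 9.65 | 244 | 6.40 | 401 | 26.81 | 486 |
| SamPO | 5.78 | 183 | 8.80 | 259 | 6.21 | 390 | 26.09 | 484 |
| LD-DPO | 5.83 | 154 | 10.37 | 208 | 6.55 | 329 | 28.20 | 449 |

Llama3-Base (8B) and Llama3-Instruct (8B)

| Method | Base: MT-Bench Score | Base: MT-Bench Avg. Len | Base: AlpacaEval 2 LC (%) | Base: AlpacaEval 2 Avg. Len | Instruct: MT-Bench Score | Instruct: MT-Bench Avg. Len | Instruct: AlpacaEval 2 LC (%) | Instruct: AlpacaEval 2 Avg. Len |
|---|---|---|---|---|---|---|---|---|
| SFT | 6.08 | 156 | 8.40 | 167 | 7.36 | 255 | 38.28 | 326 |
| DPO | 6.38 | 178 | 12.58 | 235 | 7.61 | 323 | 40.21 | 393 |
| R-DPO | 6.18 | 137 | 12.15 | 155 | 7.54 | 248 | 41.07 | 318 |
| SimPO | 6.24 | 142 | 9.96 | 194 | 7.36 | 266 | 39.14 | 374 |
| WPO | 6.42 | 179 | 12.99 | 226 | 7.60 | 320 | 39.77 | 386 |
| SamPO | 6.12 | 162 | 14.62 | 200 | 7.50 | 294 | 40.77 | 368 |
| LD-DPO | 6.45 | 153 | 16.82 | 144 | 7.74 | 247 | 44.00 | 308 |

Qwen2-Base (7B) and Qwen2-Instruct (7B)

| Method | Base: MT-Bench Score | Base: MT-Bench Avg. Len | Base: AlpacaEval 2 LC (%) | Base: AlpacaEval 2 Avg. Len | Instruct: MT-Bench Score | Instruct: MT-Bench Avg. Len | Instruct: AlpacaEval 2 LC (%) | Instruct: AlpacaEval 2 Avg. Len |
|---|---|---|---|---|---|---|---|---|
| SFT | 6.30 | 160 | 7.62 | 173 | 7.95 | 359 | 34.09 | 373 |
| DPO | 6.73 | 181 | 10.20 | 204 | 7.79 | 321 | 35.63 | 437 |
| R-DPO | 6.16 | 137 | 8.79 | 168 | 7.94 | 314 | 38.85 | 365 |
| SimPO | 6.61 | 154 | 12.08 | 181 | 7.88 | 352 | 35.10 | 430 |
| WPO | 6.71 | 167 | 11.02 | 193 | 7.72 | 361 | 37.53 | 433 |
| SamPO | 6.79 | 180 | 10.89 | 187 | 7.78 | 343 | 37.05 | 399 |
| LD-DPO | 6.80 | 163 | 12.14 | 155 | 8.03 | 303 | 40.88 | 356 |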
Table 2: MT-Bench and AlpacaEval 2 results under six model settings. LC (%) denotes the length-controlled win rate against the baseline model (GPT-4-1106-preview), which mitigates the length preference of the judge model (GPT-4-turbo-0409) compared to the raw win rate. Avg. Len denotes the average length of the model's answers. For the Base settings, we train SFT models on the UltraChat dataset. For the Instruct settings, we use off-the-shelf models as the SFT model.

In this section, we present the main results of our experiments, demonstrating that LD-DPO achieves state-of-the-art (SOTA) performance on both MT-Bench and AlpacaEval 2 for all six settings through effective length control. Building on these results, we further analyze the sensitivity of different models to data length. Additionally, our findings show that our method significantly enhances the model's reasoning ability. Finally, we conduct ablation studies and hyperparameter analysis.

5.1 Main Results

As shown in Table.2, LD-DPO exhibits significant improvements in both MT-Bench and AlpacaEval 2 compared to all other baselines. In addition, the average response length is reduced by 7.8% to 37.9% relative to DPO, suggesting higher quality and more concise model outputs after LD-DPO.
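As a worked example of how such length reductions are computed, the sketch below evaluates the relative reduction for three Instruct settings using the AlpacaEval 2 average lengths from Table 2. The helper name `relative_reduction` is hypothetical.

```python
def relative_reduction(baseline_len, new_len):
    """Percentage reduction of new_len relative to baseline_len."""
    return 100.0 * (baseline_len - new_len) / baseline_len

# AlpacaEval 2 average lengths (DPO, LD-DPO) taken from Table 2.
alpaca_lengths = {
    "Llama2-Chat (13B)": (487, 449),
    "Llama3-Instruct (8B)": (393, 308),
    "Qwen2-Instruct (7B)": (437, 356),
}

for setting, (dpo_len, ld_len) in alpaca_lengths.items():
    print(f"{setting}: {relative_reduction(dpo_len, ld_len):.1f}% shorter than DPO")
```

For instance, the Llama2-Chat (13B) row, 487 versus 449, corresponds to the 7.8% figure quoted above.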

In the Base setting, we observe that the overall model performance is suboptimal, with responses tending to be shorter. This phenomenon may be attributed to the model’s performance not being fully realized during the SFT phase. Conversely, in the Instruct setting, the model demonstrates greater competence and generates much longer responses than the base model, due to extensive SFT and RLHF conducted by their publishers. However, in both settings, it is clear that DPO consistently encourages the model to produce more verbose outputs, with experiments showing an increase ranging from 10% to 40%, while LD-DPO can significantly alleviate this issue. This verbosity can potentially impair model performance, as we will illustrate with several case studies in Appendix C.

As illustrated in Table.2, we examine the average response length and LC-winrate (which may differ from publicly reported results because a different judge model is used) of the models on AlpacaEval 2 under six different settings. Our method achieves state-of-the-art (SOTA) performance in LC-winrate across all settings. When comparing the three length control methods (R-DPO, SimPO, and SamPO), we find that R-DPO demonstrates superior length control under the Base setting, but its overall performance is suboptimal in terms of LC-winrate. SimPO does not show consistent performance across different settings, likely due to the absence of a reference model, which impacts its stability. SamPO's performance fluctuates less compared to DPO.

5.2 Comparative Analysis of Varied Model

Figure 3: Exploring the distribution of reward differences $r(x,y_w) - r(x,y_l)$ in different settings: (a) Llama2-13B-Chat (SOTA: $\alpha=0.3$); (b) Llama3-8B-Instruct (SOTA: $\alpha=0.5$). In each subplot, the left image represents data where the chosen response is longer, and the right image represents data where the rejected response is longer. SOTA indicates the $\alpha$ setting at which LD-DPO performs optimally. The images depict the true distribution on the UltraFeedback dataset during training.

The models selected for our experiments vary in their capabilities and, consequently, in their length sensitivity during the DPO process. Their performance based on the choice of the hyperparameter $\alpha$ is as follows:

In the Instruct setting, Llama2-13B achieves optimal performance at $\alpha=0.3$, Llama3-8B at $\alpha=0.5$, and Qwen2-7B at $\alpha=0.6$. In the Base setting, the optimal $\alpha$ values for these models are approximately 0.1 to 0.2 lower compared to the Instruct setting.

Because Llama3-8B and Qwen2-7B perform similarly, we restrict the subsequent in-depth analysis to the Llama2-13B and Llama3-8B models to explore the performance differences between the two. This selection is representative of varying model capacities.

In Fig.3, we plot the distribution of reward differences during preference optimization for Llama2-13B-Chat (Fig.3(a)) and Llama3-8B-Instruct (Fig.3(b)), respectively. The data is split according to the length relationship between the chosen and rejected responses. We analyze the training process of the LLMs under different hyperparameter settings:

Under the DPO setting, we can clearly observe that the distribution of reward differences is influenced by the length of the data. When the chosen response is longer, the reward difference is almost always less than 0. Based on our previous theoretical analysis, we can infer that it is easier to optimize in the direction of the chosen response under these conditions, leading to lengthy outputs. Conversely, when the rejected response is longer, the situation is reversed.

Under the $\alpha=0$ setting, where only the public-length portions of $y_w$ and $y_l$ are considered, we find that the reward difference is greater than 0 for a larger proportion of the data. This indicates that the models prefer outputting $y_w$, which suggests that both types of models possess sizable base capabilities, with Llama3-8B-Instruct being stronger than Llama2-13B-Chat. Additionally, compared to the DPO setting, the average reward difference of Llama2-13B-Chat increases (decreases) by 19.66 (14.01) and that of Llama3-8B-Instruct increases (decreases) by 18.37 (13.68), indicating that the former is more significantly affected by the data length.

Under the SOTA setting, the fact that LD-DPO cannot achieve optimal results in the extreme case of $\alpha=0$ suggests that longer responses are necessary. This is because additional text can convey more human-like preferences. Furthermore, compared to Llama2-13B-Chat, Llama3-8B-Instruct is more powerful and can capture more human-like preferences from the text. This capability can mitigate the negative effects of response length, indicating that setting $\alpha$ to an extreme value may not be appropriate.
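The grouping used in this analysis can be sketched as follows. The snippet assumes that the reward is the DPO implicit reward $\beta \log(\pi_\theta(y|x)/\pi_{ref}(y|x))$ computed from sequence log-likelihoods; the exact quantity plotted in Fig.3 may be computed differently, and the records below are hypothetical, not data from the experiments.

```python
from statistics import mean

def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """DPO implicit reward: beta * log(pi_theta(y|x) / pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)

def mean_reward_diff_by_length(pairs, beta=0.1):
    """Average r(x, y_w) - r(x, y_l), grouped by which response is longer."""
    groups = {"chosen_longer": [], "rejected_longer": []}
    for pair in pairs:
        if pair["len_w"] == pair["len_l"]:
            continue  # equal-length pairs belong to neither group
        diff = (implicit_reward(pair["policy_w"], pair["ref_w"], beta)
                - implicit_reward(pair["policy_l"], pair["ref_l"], beta))
        key = "chosen_longer" if pair["len_w"] > pair["len_l"] else "rejected_longer"
        groups[key].append(diff)
    return {k: mean(v) for k, v in groups.items() if v}

# Hypothetical sequence log-likelihoods and lengths, for illustration only.
pairs = [
    {"policy_w": -40.0, "ref_w": -42.0, "policy_l": -20.0, "ref_l": -19.0,
     "len_w": 120, "len_l": 60},
    {"policy_w": -15.0, "ref_w": -15.5, "policy_l": -50.0, "ref_l": -48.0,
     "len_w": 40, "len_l": 150},
]

print(mean_reward_diff_by_length(pairs))
```

Comparing the two group means across training runs with different $\alpha$ gives the kind of shift discussed in the paragraphs above.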

From the above analysis, we know that $\alpha$ is actually the result of a compromise between desensitizing DPO to length, depending on model capabilities, and preventing the loss of human-like preferences. In addition, we know that different models have different length sensitivities in the DPO process. We define $\gamma$ as the length sensitivity coefficient, where $\gamma = 1 - \alpha_s$ and $\alpha_s$ represents the $\alpha$ that yields the best LD-DPO results. A smaller value of $\gamma$ in more capable models indicates that such models are more likely to capture genuine human preferences rather than being influenced by text length. For example, the length sensitivity coefficient of Llama3-8B-Instruct is 0.5, whereas that of Llama2-13B-Chat is 0.7. This suggests that the latter is more sensitive to length during DPO.
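The coefficient follows directly from the best-performing $\alpha$ of each model. A small sketch using the Instruct-setting values reported in this section (the dictionary and function names are hypothetical):

```python
# Best-performing alpha per Instruct model, as reported in this section.
best_alpha = {
    "Llama2-13B-Chat": 0.3,
    "Llama3-8B-Instruct": 0.5,
    "Qwen2-7B-Instruct": 0.6,
}

def length_sensitivity(alpha_s):
    """Length sensitivity coefficient: gamma = 1 - alpha_s."""
    return 1.0 - alpha_s

gammas = {model: round(length_sensitivity(a), 2) for model, a in best_alpha.items()}
most_sensitive = max(gammas, key=gammas.get)

print(gammas)
print("most length-sensitive:", most_sensitive)
```

Under this definition, Llama2-13B-Chat has the largest coefficient of the three models.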

5.3 Improvement of Reasoning Capability

In further analysis of the MT-Bench, we found that the reasoning ability of the model significantly improved after applying LD-DPO, compared to both the SFT and DPO models. To further validate this conclusion, we conducted experiments on ProofWriter using Llama3-8B-Instruct. ProofWriter is a specialized dataset designed to evaluate the reasoning capabilities of large language models. It comprises a broad range of problems, from direct reasoning tasks to those requiring more than five steps, and distinguishes between open-world assumptions (OWA) and closed-world assumptions (CWA), resulting in a total of 14 data subsets.

Figure 4: Performance of various methods on 14 subsets of ProofWriter dataset for the Llama3-8B-Instruct setting.

The results are shown in Fig.4, where the scatter plot indicates the scores for each data subset. After applying LD-DPO, the model shows an overall improvement across the 14 data subsets. Compared to Llama3-8B-Instruct, the average score increases from 55.19 to 58.72, outperforming the five baseline preference optimization algorithms, including DPO, which indicates a significant improvement in reasoning. The detailed MT-Bench results in Appendix B.2 also support this point, and a case study of the reasoning problems is provided in Appendix C.

In fact, verbose responses negatively impact the reasoning abilities of language models, particularly smaller models. Unlike Chain of Thought (CoT), which is considered an excellent reasoning paradigm (Feng et al., 2024b), overly lengthy responses tend to include incorrect derivations or meaningless descriptions, which interfere with the model’s ability to make the next step in the reasoning process or to reach the correct conclusion. Our approach enables the model to learn a concise CoT style while preventing overly lengthy answers, thereby improving the model’s reasoning capabilities.

5.4 Ablation Study

To verify the effect of the relative lengths of $y_w$ (chosen) and $y_l$ (rejected) in the training data on LD-DPO, we constructed ablation experiments: the length decoupling strategy, as shown in Eq.12, was applied to $y_w$ and $y_l$ separately or simultaneously.

As shown in Table.3, the performance of the LLMs on MT-Bench and AlpacaEval 2, as well as the length control effect, is inferior to that of LD-DPO under both the chosen-only and rejected-only setups. This suggests that length decoupling is necessary for both the chosen and the rejected responses. In more detail:

  • If $y_w$ is longer: According to the previous analysis, DPO is more inclined to optimize in the direction of $y_w$ under the effect of length bias, which results in redundant output. At this point, length decoupling for $y_w$ can alleviate this tendency.

  • If $y_l$ is longer: DPO tends to block the output of $y_l$. In this case, decoupling the length of $y_l$ may redirect the optimization towards the shorter $y_w$, thus reducing redundancy and improving the quality of the model's answers.

Therefore, our length desensitization strategy can encourage the model to identify and leverage more human-real preference gaps in the data samples, rather than optimizing the training process based on superficial length differences.
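The three ablation variants can be expressed as a switch on which response receives the decoupled likelihood of Eq.12. The sketch below is illustrative only; the `mode` argument, the function names and the token log-probabilities are hypothetical.

```python
def ld_logp(token_logps, public_len, alpha):
    """Log of the LD-DPO likelihood (Eq. 12); alpha = 1 gives the plain log-likelihood."""
    return alpha * sum(token_logps) + (1.0 - alpha) * sum(token_logps[:public_len])

def pair_log_likelihoods(chosen_logps, rejected_logps, alpha, mode="both"):
    """Return (log-likelihood of y_w, log-likelihood of y_l) under an ablation mode.

    mode: "none" (vanilla DPO), "chosen", "rejected", or "both" (LD-DPO).
    """
    if mode not in {"none", "chosen", "rejected", "both"}:
        raise ValueError(f"unknown mode: {mode}")
    l_p = min(len(chosen_logps), len(rejected_logps))  # public length
    alpha_w = alpha if mode in {"chosen", "both"} else 1.0
    alpha_l = alpha if mode in {"rejected", "both"} else 1.0
    return ld_logp(chosen_logps, l_p, alpha_w), ld_logp(rejected_logps, l_p, alpha_l)

# Hypothetical pair in which the chosen response is longer.
chosen = [-0.2, -0.1, -0.3, -0.2, -0.4]
rejected = [-0.5, -0.6]

for mode in ("none", "chosen", "rejected", "both"):
    logp_w, logp_l = pair_log_likelihoods(chosen, rejected, alpha=0.5, mode=mode)
    print(mode, round(logp_w, 3), round(logp_l, 3))
```

Within a single pair only the longer response has tokens beyond the public length, so the chosen-only variant changes pairs where $y_w$ is longer and the rejected-only variant changes pairs where $y_l$ is longer; applying both covers the whole dataset.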

| Method | MT-Bench Score | MT-Bench Avg. Token | AlpacaEval 2 LC (%) | AlpacaEval 2 Avg. Token |
|---|---|---|---|---|
| SFT | 7.36 | 255 | 38.28 | 326 |
| DPO | 7.61 | 323 | 40.21 | 393 |
| chosen | 7.67 | 302 (↓21) | 43.39 | 352 (↓41) |
| rejected | 7.62 | 283 (↓40) | 42.23 | 351 (↓42) |
| LD-DPO | 7.74 | 247 (↓76) | 44.00 | 308 (↓85) |

Table 3: Ablation study on Llama3-8B-Instruct: chosen denotes length decoupling (Eq.12) applied only to $y_w$, rejected denotes length decoupling applied only to $y_l$, and (↓ tokens) denotes the Avg. Token drop compared to DPO.

5.5 Hyperparameter Analysis

In this subsection, we verify the effect of the hyperparameter $\alpha$ on the performance of LD-DPO. We present the experimental results of Llama3-8B-Instruct. In all experiments, we set $\beta=0.1$, though we encourage researchers to explore the effects of different $\alpha$ selected at various $\beta$ settings.

When $\alpha=1$, LD-DPO is equivalent to DPO. As $\alpha$ gradually decreases, the degree of length decoupling increases. At this point, as shown in Fig.5, the sensitivity of the training process to length begins to decline, resulting in a subsequent reduction in the average output length.

Figure 5: Hyperparameter analysis of $\alpha$ with Llama3-8B-Instruct on AlpacaEval 2. Left: mean response length (tokens). Right: LC-winrate on AlpacaEval 2.

We find that the decrease in average response length is pronounced as the parameter $\alpha$ decreases from 1 to 0.5. However, this tendency becomes less significant as $\alpha$ further decreases from 0.5 to 0. This phenomenon can likely be attributed to the fact that verbosity preference decoupling is effectively complete when $\alpha$ reaches 0.5.

In terms of LC-winrate, the trend initially increases and then decreases with the value of $\alpha$. This suggests that when $\alpha$ is too large, the model performance is constrained by length sensitivity. Conversely, when $\alpha$ is too small, excessive length decoupling leads to a loss of human-like preferences in the text, thereby reducing the optimization effectiveness.

6 Related Works

6.1 Offline Preference Optimization

In practice, traditional RLHF paradigms are more complex in terms of coding and hyperparameter tuning, requiring four models simultaneously, which makes them more resource-intensive and difficult to train stably. Because it lacks an online reward model, DPO requires preference datasets to be constructed in advance, and many works have proposed different data construction strategies to enable the model to better learn human preferences (An et al., 2023; Gallego, 2024; Khaki et al., 2024). Meanwhile, another research direction is to improve the preference optimization objective, including the necessity of the reference model, the selection of the reward fitting function, and the adjustment of the update weights. This direction has produced a variety of offline optimization strategies, including ORPO (Hong et al., 2024), KTO (Ethayarajh et al., 2024), NCA (Chen et al., 2024b), IPO (Azar et al., 2024), and WPO (Zhou et al., 2024).

6.2 Length Control Strategy

Recent research has shown that DPO may lead to biased results, such as models producing lengthy outputs, which affects the model's instruction-following and reasoning abilities. To address this problem, Park et al. (2024) proposed R-DPO, which discourages the model from producing excessively long answers by introducing a regularization term, an intuitive scheme followed by several algorithms (Zhou et al., 2024; Chen et al., 2024a). Meng et al. (2024) proposed an optimization strategy that does not rely on the reference model, called SimPO, which uses length-normalized rewards to prevent the model from generating excessively long but low-quality answers. Lu et al. (2024) argued that the phenomenon of lengthy outputs in DPO is due to the overestimation or underestimation of implicit rewards caused by the length of the training data. Based on this, they proposed SamPO, which mitigates the length bias by down-sampling the KL divergence to ensure that implicit rewards are not affected by length.

7 Discussion

7.1 Conclusion

In this work, we show for the first time that the optimization process of DPO is length-sensitive and provide a theoretical proof. Based on this, we design a length-desensitization algorithm based on DPO: LD-DPO, which achieves length desensitization by reparameterizing the likelihood to decouple verbosity preferences from complete information while preserving human-like preferences. Through extensive experimental analysis, LD-DPO consistently outperforms existing algorithms in various training settings, achieving performance improvements, especially in reasoning ability, with a 10-40% reduction in output length. This suggests that previous optimization algorithms have overemphasized length at the expense of quality, validating the value of our work. Furthermore, we perform a comparative analysis of length sensitivity across models with different capabilities, which may provide new insights into the preference optimization process.

7.2 Limitations and Future Works

First, despite the empirical success and intuitive motivation of LD-DPO, the length sensitivity coefficient $\gamma$ for different models requires manual and experimental exploration. Future work could investigate methods to determine its optimal value automatically. Second, although length preference is among the human preferences most readily captured by models, we have not yet examined the decoupling of other preferences, such as format preference and morphology preference, during the training process. There is significant value in decoupling all of these preferences, and we will continue to explore along this path.

References

  • AI@Meta (2024) AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
  • An et al. (2023) Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, and Hyun Oh Song. Direct preference-based policy optimization without reward modeling. Advances in Neural Information Processing Systems, 36:70247–70266, 2023.
  • Azar et al. (2024) Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp.  4447–4455. PMLR, 2024.
  • Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Beeching et al. (2023) Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard (2023-2024). https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard, 2023.
  • Bradley & Terry (1952) Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Chen et al. (2024a) Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, and Min Lin. Bootstrapping language models with dpo implicit rewards. arXiv preprint arXiv:2406.09760, 2024a.
  • Chen et al. (2024b) Huayu Chen, Guande He, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit rewards. arXiv preprint arXiv:2402.05369, 2024b.
  • Chen et al. (2024c) Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. Odin: Disentangled reward mitigates hacking in rlhf. In Forty-first International Conference on Machine Learning, 2024c.
  • Chen et al. (2024d) Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models. In Forty-first International Conference on Machine Learning, 2024d.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018. URL http://arxiv.org/abs/1803.05457.
  • Cui et al. (2023) Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377, 2023.
  • Dai et al. (2024) Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, 2024.
  • Ding et al. (2023) Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://openreview.net/forum?id=oEsYs3WRc3.
  • Dubois et al. (2024) Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.
  • Ethayarajh et al. (2024) Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306, 2024.
  • Feng et al. (2024a) Duanyu Feng, Bowen Qin, Chen Huang, Zheng Zhang, and Wenqiang Lei. Towards analyzing and understanding the limitations of dpo: A theoretical perspective. arXiv preprint arXiv:2404.04626, 2024a.
  • Feng et al. (2024b) Guhao Feng, Bohang Zhang, Yuntian Gu, Haotian Ye, Di He, and Liwei Wang. Towards revealing the mystery behind chain of thought: a theoretical perspective. Advances in Neural Information Processing Systems, 36, 2024b.
  • Gallego (2024) Víctor Gallego. Refined direct preference optimization with synthetic data for behavioral alignment of llms. arXiv preprint arXiv:2402.08005, 2024.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=d7KBjmI3GmQ.
  • Hong et al. (2024) Jiwoo Hong, Noah Lee, and James Thorne. Orpo: Monolithic preference optimization without reference model. arXiv preprint arXiv:2403.07691, 2024.
  • Khaki et al. (2024) Saeed Khaki, JinJin Li, Lan Ma, Liu Yang, and Prathap Ramachandra. Rs-dpo: A hybrid rejection sampling and direct preference optimization method for alignment of large language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pp.  1665–1680, 2024.
  • Lu et al. (2024) Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, and Xing Sun. Eliminating biased length reliance of direct preference optimization via down-sampled kl divergence. arXiv preprint arXiv:2406.10957, 2024.
  • Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. arXiv preprint arXiv:2405.14734, 2024.
  • Park et al. (2024) Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. arXiv preprint arXiv:2403.19159, 2024.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  • Singhal et al. (2023) Prasann Singhal, Tanya Goyal, Jiacheng Xu, and Greg Durrett. A long way to go: Investigating length correlations in rlhf. arXiv preprint arXiv:2310.03716, 2023.
  • Suzgun et al. (2023) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging big-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, pp.  13003–13051, 2023.
  • Tafjord et al. (2021) Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proofwriter: Generating implications, proofs, and abductive statements over natural language. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp.  3621–3634, 2021.
  • Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Jill Burstein, Christy Doran, and Thamar Solorio (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT, Minneapolis, MN, USA, 2019.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. (2024) Zhilin Wang, Yi Dong, Jiaqi Zeng, Virginia Adams, Makesh Narsimhan Sreedhar, Daniel Egert, Olivier Delalleau, Jane Scowcroft, Neel Kant, Aidan Swope, et al. Helpsteer: Multi-attribute helpfulness dataset for steerlm. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp.  3371–3384, 2024.
  • Wu et al. (2023) Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran, and Jiantao Jiao. Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
  • Xu et al. (2024) Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. In Forty-first International Conference on Machine Learning, 2024.
  • Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
  • Yuan et al. (2024a) Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback. Advances in Neural Information Processing Systems, 36, 2024a.
  • Yuan et al. (2024b) Weizhe Yuan, Ilia Kulikov, Ping Yu, Kyunghyun Cho, Sainbayar Sukhbaatar, Jason Weston, and Jing Xu. Following length constraints in instructions. arXiv preprint arXiv:2406.17744, 2024b.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Anna Korhonen, David R. Traum, and Lluís Màrquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL, Florence, Italy, 2019.
  • Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
  • Zhou et al. (2024) Wenxuan Zhou, Ravi Agrawal, Shujian Zhang, Sathish Reddy Indurthi, Sanqiang Zhao, Kaiqiang Song, Silei Xu, and Chenguang Zhu. Wpo: Enhancing rlhf with weighted preference optimization. arXiv preprint arXiv:2406.11827, 2024.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Appendix A Mathematical Derivations

A.1 Derivation of Optimization Direction in DPO

$$
\mathcal{L}_{DPO}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{(x,y_{w},y_{l})\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_{\theta}(y_{w}|x)}{\pi_{ref}(y_{w}|x)}-\beta\log\frac{\pi_{\theta}(y_{l}|x)}{\pi_{ref}(y_{l}|x)}\right)\right] \tag{13}
$$
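For a single preference pair, Eq.13 can be evaluated directly from sequence log-probabilities. The following is a minimal sketch under stated assumptions: the function name `dpo_loss`, its argument names, and the numeric inputs are illustrative and not taken from the authors' code, and the expectation over the dataset is omitted.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss (Eq.13) from sequence log-probabilities.

    logp_w, logp_l:         log pi_theta(y_w|x), log pi_theta(y_l|x)
    ref_logp_w, ref_logp_l: the same quantities under pi_ref
    """
    # Argument of the sigmoid: beta * (chosen log-ratio - rejected log-ratio)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) = softplus(-margin), computed without overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for one (x, y_w, y_l) triple
print(dpo_loss(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0))
```

Working in log space is a design choice here: sequence probabilities are products of many per-token probabilities and underflow quickly, whereas their logarithms remain well scaled.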
Figure 6: (a) The optimization objective of DPO, $\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})$. (b) The partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{1}$. (c) The partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{2}$. Here we denote $\pi_{\theta}(y_{w}|x)$ by $\mathcal{X}_{1}$ and $\pi_{\theta}(y_{l}|x)$ by $\mathcal{X}_{2}$.

The optimization objective of DPO is given in Eq.13 and plotted in Fig.6(a). We define $\mathcal{X}_{1}=\pi_{\theta}(y_{w}|x)$, $\mathcal{X}_{2}=\pi_{\theta}(y_{l}|x)$ and $\mathcal{K}_{1}=\pi_{ref}(y_{w}|x)$, $\mathcal{K}_{2}=\pi_{ref}(y_{l}|x)$. Since the reference model is frozen, $\mathcal{K}_{1}$ and $\mathcal{K}_{2}$ can be regarded as constants, and the optimization process of DPO depends only on $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$. We can rewrite the loss of DPO as:

$$
\begin{aligned}
\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2}) =& -\log\sigma\left(\beta\log\frac{\mathcal{X}_{1}}{\mathcal{K}_{1}}-\beta\log\frac{\mathcal{X}_{2}}{\mathcal{K}_{2}}\right)\\
=& -\log\frac{1}{1+\exp\{-\beta\log(\mathcal{K}_{2}\mathcal{X}_{1}/\mathcal{K}_{1}\mathcal{X}_{2})\}}\\
=& -\log\frac{1}{1+(\mathcal{K}_{1}\mathcal{X}_{2}/\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}\\
=& -\log\frac{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}.
\end{aligned} \tag{14}
$$
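The chain of equalities in Eq.14 can be checked numerically. The sketch below compares the first line (sigmoid form) with the last line (closed form); the function names and the probability values are illustrative assumptions chosen only to exercise the identity.

```python
import math

def dpo_loss_sigmoid(x1, x2, k1, k2, beta):
    """First line of Eq.14: -log sigma(beta*log(x1/k1) - beta*log(x2/k2))."""
    z = beta * math.log(x1 / k1) - beta * math.log(x2 / k2)
    return -math.log(1.0 / (1.0 + math.exp(-z)))

def dpo_loss_closed_form(x1, x2, k1, k2, beta):
    """Last line of Eq.14: -log[(k2*x1)^beta / ((k2*x1)^beta + (k1*x2)^beta)]."""
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return -math.log(a / (a + b))

# Assumed values: x1, x2 play the role of pi_theta(y_w|x), pi_theta(y_l|x);
# k1, k2 play the role of pi_ref(y_w|x), pi_ref(y_l|x).
x1, x2, k1, k2, beta = 0.30, 0.20, 0.25, 0.22, 0.1

print(dpo_loss_sigmoid(x1, x2, k1, k2, beta))
print(dpo_loss_closed_form(x1, x2, k1, k2, beta))
```

The two printed values should agree up to floating-point rounding.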

As shown in Eq.15, the partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{1}$ is negative, so minimizing the loss increases the probability of the chosen response. The function is plotted in Fig.6(b).

$$
\begin{aligned}
\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}} =& -\frac{\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]-(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\frac{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}\\
=& -\frac{\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}-(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}](\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}\\
=& -\frac{\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}](\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}\\
=& -\frac{\beta(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}{\mathcal{X}_{1}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]}
\end{aligned} \tag{15}
$$

As shown in Eq.16, the partial derivative of $\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{2}$ is positive, so minimizing the loss decreases the probability of the rejected response. The function is plotted in Fig.6(c).

$$
\begin{aligned}
\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}} =& -\frac{-(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\frac{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}\\
=& \frac{\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}}{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}
\end{aligned} \tag{16}
$$
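The closed-form derivatives in Eq.15 and Eq.16 can be sanity-checked against central finite differences of the loss in Eq.14. This is a sketch under assumed inputs: the function names, the step size `h`, and the probability values are illustrative and not part of the paper.

```python
import math

def dpo_loss(x1, x2, k1, k2, beta):
    """Closed form of the DPO loss in probability space (last line of Eq.14)."""
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return -math.log(a / (a + b))

def grad_chosen(x1, x2, k1, k2, beta):
    """Final line of Eq.15: derivative with respect to X1 (chosen)."""
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return -beta * b / (x1 * (a + b))

def grad_rejected(x1, x2, k1, k2, beta):
    """Final line of Eq.16: derivative with respect to X2 (rejected)."""
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return beta * k1 ** beta * x2 ** (beta - 1) / (a + b)

def central_difference(f, x, h=1e-6):
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Illustrative values (assumptions, not taken from the paper)
x1, x2, k1, k2, beta = 0.30, 0.20, 0.25, 0.22, 0.1

num_g1 = central_difference(lambda v: dpo_loss(v, x2, k1, k2, beta), x1)
num_g2 = central_difference(lambda v: dpo_loss(x1, v, k1, k2, beta), x2)

print("dL/dX1:", grad_chosen(x1, x2, k1, k2, beta), "numeric:", num_g1)
print("dL/dX2:", grad_rejected(x1, x2, k1, k2, beta), "numeric:", num_g2)
```

The analytic and numeric values should match closely, with a negative derivative for the chosen response and a positive one for the rejected response.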

From Eq.15 and Eq.16 we obtain Eq.17: the ratio between the magnitude of the DPO loss gradient in the preferred direction and that in the dispreferred direction equals the inverse ratio of the corresponding prediction probabilities.

$$
\left\lvert\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}/\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}\right\rvert=\frac{\mathcal{X}_{2}}{\mathcal{X}_{1}}=\frac{\pi_{\theta}(y_{l}|x)}{\pi_{\theta}(y_{w}|x)}.
\tag{17}
$$
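As a quick numerical sanity check of Eq.16 and Eq.17, the following Python sketch differentiates the loss by central finite differences. It assumes the per-pair loss can be written as $-\log\frac{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}{(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}}$, which is the form whose derivative appears in Eq.16. The function names and all numeric values are illustrative and are not taken from the paper.

```python
import math

def dpo_loss(x1, x2, k1, k2, beta):
    """Per-pair DPO loss in the likelihood variables of this appendix.

    Assumed form: -log( (k2*x1)^beta / ((k2*x1)^beta + (k1*x2)^beta) ).
    """
    a = (k2 * x1) ** beta
    b = (k1 * x2) ** beta
    return -math.log(a / (a + b))

def grad_x2_closed_form(x1, x2, k1, k2, beta):
    """Right-hand side of Eq.16."""
    denom = (k2 * x1) ** beta + (k1 * x2) ** beta
    return beta * k1 ** beta * x2 ** (beta - 1) / denom

def central_diff(f, x, h):
    """Symmetric finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical likelihood values, chosen only for illustration.
x1, x2, k1, k2, beta = 0.30, 0.20, 0.25, 0.15, 0.1
h = 1e-6

# Numerical partial derivatives of the loss w.r.t. X1 and X2.
g1 = central_diff(lambda v: dpo_loss(v, x2, k1, k2, beta), x1, h)
g2 = central_diff(lambda v: dpo_loss(x1, v, k1, k2, beta), x2, h)

# Eq.17 states that this ratio equals x2 / x1.
ratio = abs(g1 / g2)
print(g2, grad_x2_closed_form(x1, x2, k1, k2, beta))
print(ratio, x2 / x1)
```

Under these assumptions `g2` agrees with the closed form of Eq.16 and `ratio` agrees with `x2 / x1` up to finite-difference error. `g1` is negative and `g2` is positive, so the loss falls as $\mathcal{X}_{1}$ grows and rises as $\mathcal{X}_{2}$ grows.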

Next, we analyze how the partial gradients $\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}$ and $\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}$ vary with $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$. We denote $\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}$ as $\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})$ and $\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}$ as $\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})$. In addition, $\mathcal{X}_{1},\mathcal{X}_{2},\mathcal{K}_{1},\mathcal{K}_{2}$ all represent likelihoods, so each takes values in the range $(0,1)$, and $\beta$ is a hyperparameter that lies in $(0,1)$ in DPO. The partial derivatives of $\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})$ and $\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{1}$ and $\mathcal{X}_{2}$ are derived in Eq.18 through Eq.21, and their plots are shown in Fig.7.

Figure 7: (a) The partial derivative of $\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{1}$; (b) the partial derivative of $\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{2}$; (c) the partial derivative of $\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{1}$; (d) the partial derivative of $\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})$ with respect to $\mathcal{X}_{2}$.
$$
\begin{aligned}
\frac{\partial\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}
=& \frac{\beta(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}\left[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}+\mathcal{X}_{1}\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}\right]}{\mathcal{X}_{1}^{2}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\\
=& \frac{\beta(\mathcal{K}_{1}\mathcal{X}_{2})^{2\beta}+\beta(\beta+1)(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}{\mathcal{X}_{1}^{2}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}>0
\end{aligned}
\tag{18}
$$
$$
\begin{aligned}
\frac{\partial\mathcal{G}_{1}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}
=& -\frac{\beta^{2}\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}\mathcal{X}_{1}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]-\beta(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{1}\mathcal{X}_{2}^{\beta-1}}{\mathcal{X}_{1}^{2}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\\
=& -\frac{\beta^{2}\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}\mathcal{X}_{1}(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}}{\mathcal{X}_{1}^{2}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}<0
\end{aligned}
\tag{19}
$$
$$
\begin{aligned}
\frac{\partial\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}
=& -\frac{\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}\beta\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\\
=& -\frac{\beta^{2}\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}<0
\end{aligned}
\tag{20}
$$
$$
\begin{aligned}
\frac{\partial\mathcal{G}_{2}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}
=& \frac{\beta(\beta-1)\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-2}[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]-\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}\beta\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-1}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}\\
=& \frac{\beta(\beta-1)\mathcal{K}_{1}^{\beta}\mathcal{X}_{2}^{\beta-2}\mathcal{K}_{2}^{\beta}\mathcal{X}_{1}^{\beta}-\beta\mathcal{K}_{1}^{2\beta}\mathcal{X}_{2}^{2\beta-2}}{[(\mathcal{K}_{2}\mathcal{X}_{1})^{\beta}+(\mathcal{K}_{1}\mathcal{X}_{2})^{\beta}]^{2}}<0
\end{aligned}
\tag{21}
$$

Based on the trends of the functions above, we can draw the following conclusions. When $\mathcal{X}_{1}$ decreases, both $\left\lvert\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}\right\rvert$ and $\left\lvert\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}\right\rvert$ increase. When $\mathcal{X}_{2}$ decreases, $\left\lvert\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{1}}\right\rvert$ decreases and $\left\lvert\frac{\partial\mathcal{L}_{DPO}(\mathcal{X}_{1};\mathcal{X}_{2})}{\partial\mathcal{X}_{2}}\right\rvert$ increases. According to the analysis in Subsection 2, when the training data is longer, the parameters are updated at a faster rate, thereby exacerbating the length sensitivity of DPO.
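These four trends can also be spot-checked numerically. The sketch below is only an illustration on a small grid, not a proof. It uses the closed form of Eq.16 for $\mathcal{G}_{2}$ and assumes $\mathcal{G}_{1}=-(\mathcal{X}_{2}/\mathcal{X}_{1})\,\mathcal{G}_{2}$, that is, the relation in Eq.17 combined with the assumption that the loss decreases as $\mathcal{X}_{1}$ increases. The function names, grid values, and shrink factor are hypothetical.

```python
def g2(x1, x2, k1, k2, beta):
    """Gradient of the loss w.r.t. X2: the closed form of Eq.16 (positive)."""
    denom = (k2 * x1) ** beta + (k1 * x2) ** beta
    return beta * k1 ** beta * x2 ** (beta - 1) / denom

def g1(x1, x2, k1, k2, beta):
    """Gradient w.r.t. X1, assumed negative with |G1| = (X2 / X1) * G2 (Eq.17)."""
    return -(x2 / x1) * g2(x1, x2, k1, k2, beta)

def trends_hold(x1, x2, k1, k2, beta, shrink=0.9):
    """Check the stated trends after shrinking X1 or X2 by the factor `shrink`."""
    base1 = abs(g1(x1, x2, k1, k2, beta))
    base2 = abs(g2(x1, x2, k1, k2, beta))
    # Smaller X1: both gradient magnitudes should grow.
    x1_ok = (abs(g1(x1 * shrink, x2, k1, k2, beta)) > base1
             and abs(g2(x1 * shrink, x2, k1, k2, beta)) > base2)
    # Smaller X2: |G1| should shrink while |G2| should grow.
    x2_ok = (abs(g1(x1, x2 * shrink, k1, k2, beta)) < base1
             and abs(g2(x1, x2 * shrink, k1, k2, beta)) > base2)
    return x1_ok and x2_ok

# Hypothetical grid of likelihood values in (0, 1) and beta values in (0, 1).
grid = [0.05, 0.2, 0.5, 0.9]
all_ok = all(
    trends_hold(x1, x2, k1, k2, beta)
    for x1 in grid for x2 in grid
    for k1 in grid for k2 in grid
    for beta in (0.1, 0.5, 0.9)
)
print(all_ok)
```

The check compares gradient magnitudes before and after a finite decrease of one variable, which is the monotonic behaviour implied by the signs in Eq.18 through Eq.21.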

A.2 Derivation of the Modified Likelihood in LD-DPO

$$
\begin{aligned}
\hat{\pi}_{\theta}(y|x)
=& \prod_{i=1}^{l_{p}}p(y_{i}|x,y_{<i})\prod_{i=l_{p}+1}^{l}p^{\alpha}(y_{i}|x,y_{<i})\\
=& \prod_{i=1}^{l_{p}}p^{\alpha}(y_{i}|x,y_{<i})\,p^{1-\alpha}(y_{i}|x,y_{<i})\prod_{i=l_{p}+1}^{l}p^{\alpha}(y_{i}|x,y_{<i})\\
=& \prod_{i=1}^{l_{p}}p^{\alpha}(y_{i}|x,y_{<i})\prod_{i=l_{p}+1}^{l}p^{\alpha}(y_{i}|x,y_{<i})\prod_{i=1}^{l_{p}}p^{1-\alpha}(y_{i}|x,y_{<i})\\
=& \prod_{i=1}^{l}p^{\alpha}(y_{i}|x,y_{<i})\prod_{i=1}^{l_{p}}p^{1-\alpha}(y_{i}|x,y_{<i})
\end{aligned}
\tag{22}
$$
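The equality between the first and last lines of Eq.22 is easy to confirm in log space. The following Python sketch is a minimal illustration: the helper names and the per-token probabilities are hypothetical, and $l_{p}$ is treated simply as the cut-off index that appears in Eq.22.

```python
import math

def ld_log_likelihood_split(token_probs, l_p, alpha):
    """First line of Eq.22 in log space: the first l_p tokens are kept as-is,
    the remaining tokens are raised to the power alpha."""
    head = sum(math.log(p) for p in token_probs[:l_p])
    tail = sum(math.log(p) for p in token_probs[l_p:])
    return head + alpha * tail

def ld_log_likelihood_factored(token_probs, l_p, alpha):
    """Last line of Eq.22 in log space: alpha times the full log-likelihood
    plus (1 - alpha) times the log-likelihood of the first l_p tokens."""
    full = sum(math.log(p) for p in token_probs)
    head = sum(math.log(p) for p in token_probs[:l_p])
    return alpha * full + (1 - alpha) * head

# Hypothetical per-token probabilities p(y_i | x, y_<i) for a 6-token response.
probs = [0.9, 0.6, 0.8, 0.5, 0.7, 0.4]
l_p, alpha = 4, 0.5

a = ld_log_likelihood_split(probs, l_p, alpha)
b = ld_log_likelihood_factored(probs, l_p, alpha)
print(a, b)
```

The two values coincide up to floating-point rounding. Setting `alpha = 1` recovers the ordinary log-likelihood of the full response, while `alpha = 0` keeps only the first `l_p` tokens.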
Method BBH WinoGrande CSQA ARC MMLU HellaSwag ProofWriter Average
Llama2-13B-Base
SFT 37.27 55.25 66.25 74.40 50.79 39.00 48.69 53.09
DPO 37.55 54.75 66.00 74.68 50.62 38.75 47.89 52.89
R-DPO 37.14 53.75 65.75 73.68 50.59 43.50 48.19 53.23
SimPO 32.47 52.00 65.00 74.04 50.42 40.25 47.92 51.73
WPO 36.83 55.25 66.25 74.22 50.52 38.00 47.67 52.68
SamPO 36.85 54.25 65.50 74.58 50.31 39.50 48.50 52.78
LD-DPO 37.63 54.75 66.50 74.76 50.93 44.25 48.83 53.95
Llama2-13B-Chat
SFT 44.88 48.75 65.75 74.26 55.39 50.25 47.08 55.19
DPO 45.00 49.00 66.25 73.97 55.50 50.75 47.72 55.46
R-DPO 44.41 51.00 65.25 73.65 54.78 52.25 47.42 55.54
SimPO 44.85 50.00 64.25 72.04 54.74 49.50 47.50 54.70
WPO 44.63 49.50 65.50 74.08 55.19 49.75 47.56 55.17
SamPO 44.77 49.00 65.25 73.90 55.45 51.00 47.94 55.33
LD-DPO 44.83 52.00 65.75 73.97 55.35 54.00 47.89 56.23
Llama3-8B-Base
SFT 45.05 54.75 72.75 84.79 61.70 41.50 52.75 59.04
DPO 45.73 55.50 71.75 84.72 62.41 45.25 48.89 59.18
R-DPO 45.27 53.75 71.50 84.68 62.52 48.00 50.61 59.47
SimPO 45.18 55.25 70.00 84.54 62.14 45.50 51.56 59.18
WPO 45.43 54.00 72.25 84.68 62.38 45.25 50.50 59.21
SamPO 45.05 55.00 72.25 84.58 62.57 46.25 49.61 59.33
LD-DPO 45.72 55.25 73.00 84.68 62.30 49.75 51.39 60.30
Llama3-8B-Instruct
SFT 61.56 60.00 76.50 88.43 69.27 63.00 55.19 67.71
DPO 60.80 59.00 76.25 87.68 67.50 63.25 56.14 67.23
R-DPO 60.03 55.75 76.25 87.15 66.54 63.00 57.19 66.56
SimPO 58.55 56.00 76.00 86.90 66.77 63.00 53.69 65.84
WPO 61.09 59.00 76.75 88.00 68.95 63.00 56.47 67.61
SamPO 61.16 57.75 76.75 87.72 67.66 62.50 57.42 67.28
LD-DPO 61.75 58.75 77.50 87.83 68.42 64.50 58.72 68.21
Qwen2-7B-Base
SFT 49.51 65.25 79.00 89.68 69.55 57.25 54.50 66.40
DPO 50.08 64.75 79.25 89.90 68.92 58.75 51.64 66.18
R-DPO 48.94 64.75 78.50 89.36 68.91 59.25 51.58 65.91
SimPO 47.94 64.00 79.00 89.57 68.77 58.25 53.42 65.85
WPO 49.66 64.75 79.25 89.75 68.72 57.75 54.06 66.28
SamPO 49.62 65.00 79.50 89.97 68.83 58.75 52.97 66.38
LD-DPO 49.79 65.25 79.50 90.43 69.05 60.50 54.61 67.02
Qwen2-7B-Instruct
SFT 57.56 68.50 80.50 90.79 71.15 71.75 58.14 71.11
DPO 57.85 68.00 80.25 90.89 71.41 71.00 59.11 71.22
R-DPO 56.26 68.50 80.75 90.61 70.74 70.75 58.31 70.85
SimPO 58.10 68.25 80.75 90.65 71.11 74.00 58.03 71.64
WPO 58.15 68.75 80.75 90.72 71.10 71.75 57.97 71.31
SamPO 58.33 68.00 80.25 90.79 71.41 71.00 59.58 71.34
LD-DPO 59.18 68.75 81.75 91.07 71.16 71.25 59.69 71.84
Table 4: Results on downstream tasks on the Huggingface OpenLLM Leaderboard.

Appendix B Additional Experimental Results

B.1 Results on Downstream Tasks

We evaluate the performance of LD-DPO and all baselines on various tasks from the OpenLLM leaderboard (Beeching et al., 2023), including MMLU (Hendrycks et al., 2021), ARC (Clark et al., 2018), BBH (Suzgun et al., 2023), CommonsenseQA (Talmor et al., 2019), WinoGrande (Sakaguchi et al., 2021), HellaSwag (Zellers et al., 2019) and ProofWriter (Tafjord et al., 2021). The results are shown in Table 4, from which we observe that:

  • LD-DPO outperforms SFT, DPO, and all other baselines on the average score across all benchmarks in all settings.

  • Preference optimization, whose goal is to align LLMs with human preference, may not significantly improve the performance on all downstream tasks.

  • All preference optimization methods perform comparably to the SFT model on MMLU and CSQA, with only slight fluctuations, showing that knowledge is maintained during the preference optimization stage.

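  • Compared to SFT, DPO, and the other baselines, LD-DPO attains the highest HellaSwag score in five of the six settings (all except Qwen2-7B-Instruct, where SimPO leads) and the highest ProofWriter score in four of the six, suggesting that LD-DPO can enhance the reasoning capability of LLMs.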
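The Average column of Table 4 can be sanity-checked by recomputing it from the per-benchmark scores. The sketch below assumes the column is the unweighted mean of the seven benchmarks, which the text does not state explicitly. The rows are copied from the Llama2-13B-Base block, and the helper name `max_abs_error` is illustrative.

```python
from statistics import mean

# Column order as in Table 4.
BENCHMARKS = ["BBH", "WinoGrande", "CSQA", "ARC", "MMLU", "HellaSwag", "ProofWriter"]

# (per-benchmark scores, reported average), from the Llama2-13B-Base block.
ROWS = {
    "SFT":    ([37.27, 55.25, 66.25, 74.40, 50.79, 39.00, 48.69], 53.09),
    "DPO":    ([37.55, 54.75, 66.00, 74.68, 50.62, 38.75, 47.89], 52.89),
    "LD-DPO": ([37.63, 54.75, 66.50, 74.76, 50.93, 44.25, 48.83], 53.95),
}


def max_abs_error(rows):
    """Largest gap between the recomputed mean and the reported average."""
    return max(abs(mean(scores) - reported) for scores, reported in rows.values())


# Assumption under test: Average = unweighted mean of the seven benchmarks.
error = max_abs_error(ROWS)
assert error < 0.006  # consistent with rounding to two decimals
```

For these three rows, the recomputed means match the reported averages to within rounding, which is consistent with the unweighted-mean assumption.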

Method Writing Roleplay Reasoning Math Coding Extraction STEM Humanities Average
Llama2-13B-Base
SFT 8.15 6.20 4.35 1.65 2.95 6.75 5.90 8.15 5.51
DPO 7.55 7.05 4.95 1.70 3.00 6.55 6.30 8.30 5.67
R-DPO 8.05 5.90 3.80 2.10 2.90 6.35 6.10 8.40 5.45
SimPO 7.95 6.30 4.60 1.70 3.05 6.60 5.75 8.05 5.50
WPO 7.75 6.70 4.60 1.80 2.85 7.55 6.30 8.55 5.76
SamPO 7.85 6.55 4.75 1.35 3.05 8.05 6.40 8.25 5.78
LD-DPO 7.80 6.50 4.75 1.60 3.65 7.40 6.55 8.45 5.83
Llama2-13B-Chat
SFT 8.45 7.20 5.35 3.30 2.75 7.30 7.50 9.00 6.35
DPO 7.90 7.50 5.40 2.95 3.15 7.05 7.30 9.40 6.33
R-DPO 8.75 7.05 5.65 3.05 3.00 7.25 6.80 9.05 6.32
SimPO 8.90 7.25 5.55 3.05 3.50 6.75 7.15 9.05 6.40
WPO 8.50 6.65 5.60 2.60 3.05 7.95 7.60 9.25 6.40
SamPO 8.20 7.05 5.45 2.70 3.00 7.85 6.55 8.90 6.21
LD-DPO 8.60 7.30 6.05 3.20 3.25 7.20 7.65 9.15 6.55
Llama3-8B-Base
SFT 7.80 6.25 3.90 3.05 4.25 8.40 6.55 8.50 6.08
DPO 7.95 6.80 4.05 3.20 4.45 8.75 7.25 8.65 6.38
R-DPO 8.30 6.30 4.00 2.65 4.10 8.30 7.45 8.35 6.18
SimPO 8.10 6.30 4.40 3.05 4.35 8.35 7.15 8.20 6.24
WPO 8.45 6.60 4.10 3.45 4.60 8.50 7.10 8.50 6.42
SamPO 8.15 6.40 4.40 2.75 4.30 8.50 6.40 8.20 6.12
LD-DPO 8.15 7.40 4.45 3.15 4.40 8.55 7.15 8.35 6.45
Llama3-8B-Instruct
SFT 8.95 8.45 4.60 5.05 5.35 9.10 8.10 9.30 7.36
DPO 9.05 8.55 5.55 5.30 5.45 9.00 8.70 9.35 7.61
R-DPO 8.85 8.00 5.75 5.50 6.15 8.75 7.70 9.65 7.54
SimPO 9.05 7.40 5.45 5.30 5.75 8.60 7.90 9.45 7.36
WPO 8.20 8.55 5.50 5.20 6.20 9.25 8.50 9.40 7.60
SamPO 9.20 8.65 5.30 3.55 6.15 9.25 8.45 9.50 7.50
LD-DPO 8.95 8.35 5.90 5.35 6.75 8.70 8.55 9.40 7.74
Qwen2-7B-Base
SFT 7.45 6.35 4.65 6.05 4.45 6.85 6.50 8.15 6.30
DPO 7.65 7.30 4.70 6.95 5.00 7.35 6.65 8.25 6.73
R-DPO 7.40 6.30 4.50 6.30 3.95 6.70 6.15 8.00 6.16
SimPO 7.75 6.60 4.70 5.95 5.45 6.60 7.60 8.30 6.61
WPO 7.85 6.75 4.85 6.75 5.10 6.75 7.25 8.45 6.71
SamPO 7.95 7.35 4.85 6.70 4.50 7.15 7.20 8.65 6.79
LD-DPO 8.20 6.85 5.05 7.45 4.85 6.70 6.80 8.50 6.80
Qwen2-7B-Instruct
SFT 9.10 8.95 6.30 6.35 6.45 8.05 9.05 9.40 7.95
DPO 9.15 9.05 6.45 6.40 4.85 7.90 9.05 9.45 7.79
R-DPO 8.80 8.50 6.80 6.25 6.10 8.60 8.90 9.60 7.94
SimPO 8.85 8.90 6.20 6.55 6.00 8.00 9.05 9.50 7.88
WPO 9.00 8.80 6.80 6.40 4.90 8.10 8.20 9.55 7.72
SamPO 8.90 8.80 6.50 6.15 5.70 8.45 8.35 9.40 7.78
LD-DPO 8.90 8.55 7.25 6.25 5.90 8.75 9.05 9.55 8.03
Table 5: Scores for each capability item on the MT-Bench.

B.2 MT-Bench Result

We present the detailed MT-Bench results in Table 5. MT-Bench comprises 80 questions designed to evaluate a model's proficiency across 8 dimensions: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. To assess the model's capability in multi-round interactions and strengthen the validity of the results, each question in MT-Bench involves two rounds of Q&A. The score reported for each dimension is the average over these two rounds.
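The aggregation described above can be sketched as follows. This is a hedged illustration rather than the MT-Bench tooling: the record layout `(category, question_id, turn, score)`, the function names, and the sample scores are all hypothetical. It assumes each question has one score per round, in which case pooling all scores in a category gives the same result as averaging the two round-level means.

```python
from collections import defaultdict
from statistics import mean


def category_scores(records):
    """Average every judged score within each category, pooling both rounds."""
    by_category = defaultdict(list)
    for category, _question_id, _turn, score in records:
        by_category[category].append(score)
    return {category: mean(scores) for category, scores in by_category.items()}


def overall_score(per_category):
    """Unweighted mean over categories (assumed to be the Average column)."""
    return mean(list(per_category.values()))


# Hypothetical judged scores: (category, question_id, turn, score).
records = [
    ("reasoning", 1, 1, 6.0), ("reasoning", 1, 2, 4.0),
    ("reasoning", 2, 1, 8.0), ("reasoning", 2, 2, 6.0),
    ("math", 3, 1, 3.0), ("math", 3, 2, 1.0),
]

per_category = category_scores(records)
overall = overall_score(per_category)
```

Averaging per category first and then across categories weights every dimension equally, regardless of how many questions each one contains.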

Overall, LD-DPO achieves the highest average score in all six settings, ahead of SFT, DPO, and all other baselines. While its scores on the other dimensions fluctuate slightly, LD-DPO attains the highest reasoning score in five of the six settings (all except Llama2-13B-Base, where DPO leads), suggesting that LD-DPO can enhance the reasoning capability of LLMs.
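A column-level reading of Table 5 can be checked by extracting the top-scoring method per setting. The sketch below does this for the Reasoning column, with values copied from Table 5. The helper `top_methods` is illustrative and returns all methods tied for the maximum.

```python
# Method order as in Table 5.
METHODS = ["SFT", "DPO", "R-DPO", "SimPO", "WPO", "SamPO", "LD-DPO"]

# Reasoning column of Table 5, one list per setting.
REASONING = {
    "Llama2-13B-Base":    [4.35, 4.95, 3.80, 4.60, 4.60, 4.75, 4.75],
    "Llama2-13B-Chat":    [5.35, 5.40, 5.65, 5.55, 5.60, 5.45, 6.05],
    "Llama3-8B-Base":     [3.90, 4.05, 4.00, 4.40, 4.10, 4.40, 4.45],
    "Llama3-8B-Instruct": [4.60, 5.55, 5.75, 5.45, 5.50, 5.30, 5.90],
    "Qwen2-7B-Base":      [4.65, 4.70, 4.50, 4.70, 4.85, 4.85, 5.05],
    "Qwen2-7B-Instruct":  [6.30, 6.45, 6.80, 6.20, 6.80, 6.50, 7.25],
}


def top_methods(scores):
    """Return every method that attains the maximum score (handles ties)."""
    best = max(scores)
    return [m for m, s in zip(METHODS, scores) if s == best]


leaders = {setting: top_methods(scores) for setting, scores in REASONING.items()}
ld_dpo_wins = sum(1 for tops in leaders.values() if tops == ["LD-DPO"])

for setting, tops in leaders.items():
    print(f"{setting}: {', '.join(tops)}")
```

On these values, LD-DPO is the sole top scorer in five settings, and DPO leads in Llama2-13B-Base.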

Appendix C Case Studies

In Fig. 8, we present two examples from AlpacaEval 2 in which LD-DPO generates comparably accurate outputs with fewer tokens than vanilla DPO, showing that LD-DPO can produce more concise outputs without sacrificing quality. Longer outputs are not necessarily more accurate or informative; they may instead contain redundant or unimportant information that makes the response verbose. We believe that concise and clear outputs also reflect the expressive capability of LLMs.

We also present two examples from ProofWriter in Fig. 9, in which LD-DPO generates more concise and direct chain-of-thought (CoT) reasoning. In these cases, DPO produces complex and verbose CoTs that ultimately lead to incorrect answers, whereas LD-DPO produces clearer, more direct CoTs that lead to correct answers. These examples suggest that overly lengthy CoTs can reduce reasoning accuracy, and that LD-DPO can improve the reasoning capability of LLMs by producing more concise CoTs.

Figure 8: Comparing generations of AlpacaEval 2 prompts from Llama3-8B trained based on DPO and LD-DPO.
Figure 9: Comparing generations of ProofWriter prompts from Llama3-8B trained based on DPO and LD-DPO.