High Quality Related Search Query Suggestions Using Deep Reinforcement Learning
Praveen Kumar Bodigutla
[email protected]
LinkedIn
Sunnyvale, California, USA
sentence. To the best of our knowledge, this is the first time a combination of long-term session-based user-feedback, unnatural sentence penalty and syntactic relatedness reward signals are jointly optimized to improve query suggestions’ quality.

The remainder of this paper is structured as follows: Section 2 describes our proposed deep reinforcement learning approach. Section 3.2 presents the experimental setup and discusses empirical results. Section 4 concludes.

Figure 1: Deep Reinforcement Learning for text-generation using REINFORCE algorithm. Agent is an encoder-decoder Seq2Seq attention model. $q_{i\{1:T\}}^{bk}$ is the input query, for batch-index $b \in B$ and Monte-Carlo sample index $k \in K$. Generated words ($y_t^{1:B,1:K}$) concatenated with attention context ($c_{t+1}$) is passed as input to the decoder’s $t+1$ time-step. For each generated sequence ($y_{1:T}^{bk}$) policy update is done using RL reward ($R_{1:T}^{bk}$), which is calculated at the end of each generated sequence.
2 Approach for Improving Query Suggestions
We fine-tune a weakly supervised Sequence-to-Sequence (Seq2Seq) Neural Machine Translation (NMT) model ($Seq2Seq_{NMT}$) to initialize the query generation policy. The process then consists of two steps: 1) learn a context-aware, weakly supervised naturalness estimator; and 2) fine-tune the pre-trained supervised $Seq2Seq_{NMT}$ model using the REINFORCE [31] algorithm. The future-reward is composed of user-feedback in a search session ($U^{+}$), syntactic similarity ($ROUGE$) and an unnaturalness penalty ($-\eta * (1 - D_\phi)$) of the generated query given the co-occurring previous query.

2.1 Weakly Supervised Pre-Training
Variants of mono-lingual supervised $Seq2Seq_{NMT}$ models are used in industry applications for query suggestions [14]. In the pre-training step, we train the supervised $Seq2Seq_{NMT}$ model using co-occurring consecutive query pairs in search sessions. A search session [7] is a stream of queries entered by a user in a 5-min¹ time window. $N-1$ consecutive query pairs ($q_i$, $q_{i+1}$) are extracted from a search session consisting of a sequence of $N$ queries ($q_1, q_2, ..., q_N$). Consecutive queries could be unrelated, or semantically and/or syntactically related. Our model is weakly supervised as we use all query pairs and do not filter them using sparse click data, costly human evaluations or weak association rules. Weak supervision allows the training process to scale and minimizes selection bias, and we conjecture that it improves model generalization too. For example, unlike [14], we do not apply syntactic similarity heuristics to filter query pairs, as queries could be semantically related yet syntactically dissimilar (e.g., “artificial intelligence” and “machine learning”).

The $Seq2Seq_{NMT}$ encoder-decoder framework consists of a BiLSTM [12] encoder that encodes a batch (batch size $B$) of input queries ($q_i^{1:B}$) and an LSTM [13] decoder that generates a batched sequence of words $y = (y_1^{1:B}, ..., y_T^{1:B})$, where $T$ is the sequence length. During training, we use teacher forcing [32], i.e., we use the co-occurring query ($q_{i+1}^{1:B}$) as input to the decoder. The context attention vector is obtained from the alignment model [3]. Categorical cross-entropy loss is minimized during training and the hyper-parameters of the model are tuned (see Section 3.2).

¹ Based on guidance from internal search team’s proprietary analysis.

2.2 Fine-tuning using Deep Reinforcement Learning
This section describes the reward estimation and Deep Reinforcement Learning (DRL) model training steps used to fine-tune and improve the policy obtained via the pre-trained supervised model.

2.2.1 Deep Reinforcement Learning Model. Parameters of the DRL agent $G_\theta$ are initialized with the pre-trained $Seq2Seq_{NMT}$ model (Section 2.1). The initial policy is fine-tuned using the REINFORCE policy-gradient algorithm (Figure 1). The $K$ complete sentences ($y_{1:T}^{b,1:K}$) generated per query ($q_i^b$) constitute the action space at time-step $T$, where $b \in B$ is the index in a mini-batch of $B$ queries. To mitigate exposure bias, generated words ($y_{t-1}^{1:B,1:K}$) from the previous time-step are passed as input to the next time-step $t$ of the decoder. The future-reward ($R_{D_\phi}(y_{1:T}^{bk})$), computed at the end of each generated sample, is back-propagated to the encoder-decoder model. Given the start state ($S_0^b$), comprising the input query ($q_i^b$) and the <START> token $y_0^b$, the objective of the agent is to generate related-search query suggestions ($y_{1:T}^b$) that maximize the objective:

$$J(\theta) = \mathbb{E}[R_{D_\phi}(y_{1:T}^{b}) \mid S_0^b, \theta] \quad (1)$$
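As a concrete illustration of the training data described in Section 2.1, the extraction of $N-1$ consecutive query pairs from a session of $N$ queries can be sketched as follows. This is a minimal sketch, not the authors' pipeline: the function name `consecutive_query_pairs` is hypothetical, and a session is assumed to be already available as an ordered list of query strings (the 5-minute sessionization step is not shown).

```python
from typing import List, Tuple


def consecutive_query_pairs(session: List[str]) -> List[Tuple[str, str]]:
    """Return the N-1 consecutive (q_i, q_{i+1}) pairs of an ordered session.

    No filtering is applied, mirroring the weak supervision described in
    Section 2.1: every consecutive pair is kept, related or not.
    """
    return list(zip(session, session[1:]))


# Hypothetical session of N = 3 queries, yielding N - 1 = 2 training pairs.
example_session = ["google", "ai jobs", "machine learning jobs"]
example_pairs = consecutive_query_pairs(example_session)
print(example_pairs)
```

Each pair would then serve as one (encoder input, decoder target) example for the supervised model.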
High Quality Related Search Query Suggestions using Deep Reinforcement Learning Marble-KDD ’21, August 16, 2021, Singapore
The per-sample reward is:

$$R_{D_\phi}(y_{1:T}^{bk}) = U^{+} + (1 - U^{+}) * \left( ROUGE_{q_i^b, y_{1:T}^{bk}} - \eta * (1 - D_\phi(y_{1:T}^{bk})) \right) \quad (2)$$

The MC approximation of the gradient, using the likelihood ratio trick, is:

$$\nabla_\theta J(\theta) \approx \frac{1}{K} * \sum_{k \in K} \left[ R_{D_\phi}(y_{1:T}^{bk}) * \nabla_\theta \log G_\theta(y_{1:T}^{bk} \mid S_0^b) \right], \quad y_{1:T}^{bk} \in MC^{G_\beta}(y_{1:T}^{bk}) \quad (3)$$

Unlike SeqGAN [33] DRL model training, where the action-value at each intermediate time-step $t$ is evaluated by generating $K$ samples ($y_{t+1:T}^{b,1:K}$), we perform the MC policy roll-out from the start state alone. $K$ queries ($y_{1:T}^{b,1:K}$) are generated using the roll-out policy $G_\beta$ for each input query $q_i^b$. This modification reduces the computation cost by a factor of $O(T)$ ($O(KT^2) \rightarrow O(KT)$) per input query. $G_\beta$ is initialized with $G_\theta$ and is periodically updated during training, using a configurable schedule (see Algorithm 1). At each time-step $t$, the state $S_t^{bk}$ comprises the input query and the tokens produced so far ($\{q_i^b, y_{1:t-1}^{bk}\}$), and the action is the next token $y_t^{bk}$ to be selected from the stochastic policy $G_\theta(y_t^{bk} \mid S_t^{bk})$.

Details of the constituents of the future-reward ($U^{+}$, $ROUGE_{q_i^b, y_{1:T}^{bk}}$ and $D_\phi(y_{1:T}^{bk})$) are in the next section (Section 2.2.2). Since the expectation $\mathbb{E}[.]$ can be approximated by sampling methods, we then update the generator’s parameters, with $\alpha$ as the learning rate, as:

$$\theta \leftarrow \theta + \alpha * \nabla_\theta J(\theta) \quad (4)$$

Algorithm 1
Require: Generator policy $G_\theta$; roll-out policy $G_\beta$; naturalness-estimator $D_\phi$; query pair in a search session $q_{i\{1:T\}}^{1:B}$, $q_{\{i+1\}\{1:T\}}^{1:B}$; batch size: $B$;
2. Initialize $G_\theta$ with fine-tuned supervised model.
3. $\beta \leftarrow \theta$
4. Train contextual-naturalness-estimator $D_\phi$ using negative examples generated from $G_\theta$
5. repeat
6.   for n steps
7.     foreach $b \in B$
8.       Generate "K" sequences $y_{1:T}^{b,1:K}$ using configured sampling-strategy

session after the user enters $q_{i+1}$ is considered. In our work we maximize session-based user-feedback, as we are interested in maximizing user engagement across search sessions. For a generated query $y_{1:T}^{bk}$, session-based user feedback ($U^{+}$) is “1” if a positive downstream user action is observed in the remainder of the search session and “0” otherwise.
• Relatedness of generated query to source query ($ROUGE_{q_i^b, y_{1:T}^{bk}}$): Despite increasing the percentage of associated positive user actions by considering the user’s feedback across a search session, the label sparsity problem is not completely mitigated. In the search query logs, when there is no positive downstream user action associated with a generated query $y_{1:T}^{bk}$, we estimate the reward using a syntactic similarity measure. Reformulated queries are syntactically and semantically similar [18]. We compute the syntactic relatedness of the generated query ($y_{1:T}^{bk}$) with the source query ($q_i^b$) using the ROUGE-1 [22] score.
• Naturalness probability of generated query ($D_\phi(y_{1:T}^{bk})$): Users enter either natural language search queries [5] (e.g., “jobs requiring databases expertise in the bay area”) or they just enter keywords (e.g., “bay area jobs database”). In the context of related query suggestions, we define a “natural” query as one which a real user is likely to enter. We train a contextual-naturalness-estimation model (see Section 2.2.3) to predict the naturalness probability $D_\phi(y_{1:T}^{bk})$ of a generated query, given the previous query entered by the user as context. “AI jobs” is an example of a natural query after the user searched for “Google”, even though [...] unlikely to be entered by a real user. In our DRL reward formulation, we add the penalty term $-\eta * (1 - D_\phi(y_{1:T}^{bk}))$ to the syntactic-relatedness ($ROUGE_{q_i^b, y_{1:T}^{bk}}$) score to discourage generation of unnatural queries. Coefficient $\eta$ is the configurable penalty weight.
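The reward in Equation (2) and the Monte-Carlo estimate in Equation (3) can be illustrated with a small, framework-free sketch. Several points are assumptions rather than details given in the paper: the ROUGE-1 variant used here is the unigram F1 (the text only says "ROUGE-1 score"), whitespace tokenization is assumed, the function names are hypothetical, and the numeric values of `eta`, the naturalness probability and the log-probabilities are made up for the example.

```python
import math
from collections import Counter
from typing import Sequence


def rouge1_f1(source: str, generated: str) -> float:
    """Unigram-overlap F1 between source and generated query (assumed variant)."""
    src = Counter(source.lower().split())
    gen = Counter(generated.lower().split())
    overlap = sum((src & gen).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(src.values())
    return 2 * precision * recall / (precision + recall)


def future_reward(u_plus: int, rouge: float, naturalness: float, eta: float) -> float:
    """Equation (2): R = U+ + (1 - U+) * (ROUGE - eta * (1 - D_phi)).

    When a positive downstream action exists (U+ = 1) the reward is 1;
    otherwise it falls back to ROUGE minus the unnaturalness penalty.
    """
    return u_plus + (1 - u_plus) * (rouge - eta * (1.0 - naturalness))


def reinforce_surrogate_loss(rewards: Sequence[float], log_probs: Sequence[float]) -> float:
    """Reward-weighted negative log-likelihood averaged over K samples.

    Its gradient w.r.t. the policy parameters is the negative of the
    Monte-Carlo gradient estimate in Equation (3).
    """
    return -sum(r * lp for r, lp in zip(rewards, log_probs)) / len(rewards)


# Hypothetical example: no user feedback, so the reward uses ROUGE and D_phi.
rouge = rouge1_f1("bay area jobs database", "bay area jobs")
reward = future_reward(u_plus=0, rouge=rouge, naturalness=0.8, eta=0.1)
loss = reinforce_surrogate_loss([1.0, reward], [math.log(0.5), math.log(0.25)])
print(rouge, reward, loss)
```

In an actual training loop the log-probabilities would come from the decoder and the loss would be differentiated by the deep learning framework; the plain-Python version only shows how the three reward signals combine.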
searching for $q_1$. Without revealing the source of $q_2$, we asked annotators to identify if the query is “natural” (defined in Section 2.2.2). On average, 58% of model-generated queries and 74% of real-user queries were identified as natural. The Inter Annotator Agreement (IAA), measured using Fleiss-Kappa [25], was poor (0.04) when the users evaluated model-generated sentences. In comparison, when they evaluated queries entered by real users, IAA was better (0.34) between the three annotators’ ratings and it ranged from fair (0.22) to moderate (0.52) agreement between each pair of annotators. Higher IAA and a higher percentage of queries identified as “natural” imply that real-user queries are more natural and distinguishable than queries sampled from the pre-trained $Seq2Seq_{NMT}$ model.

3 Experiment Setup and Results
This section describes the experimental setup to train and evaluate the naturalness-estimator, supervised and DRL query generation models.

3.1 Data
From user search-query logs, we randomly sampled 0.61 million (90% train), 34k (5% valid) and 34k (5% test) query pairs to train the supervised $Seq2Seq_{NMT}$ and DRL models. The dataset used to train the naturalness-estimator model is 5x the aforementioned amount (see Section 2.2.3). The maximum length of a query is 8 words and the mean length is ~2 words. Vocabulary size is 32k, and out-of-vocabulary words in the validation and test sets are replaced with the “<UNK>” unknown token.

3.2 Experimental Setup
We implemented all models in TensorFlow [1] and tuned the parameters using Ray[Tune] [21] on a Kubernetes [17] distributed cluster. As described in Section 2, the query suggestion policy is initialized with the fine-tuned $Seq2Seq_{NMT}$ model. $Seq2Seq_{NMT}$ model parameters are updated using the Adam [16] optimizer and categorical cross-entropy loss is minimized during training. During inference, six⁴ queries are generated per input query ($q_i$) using beam-search [9] decoding. Negative examples to train the two-layered BiLSTM contextual-naturalness-estimator are obtained from the pre-trained $Seq2Seq_{NMT}$ model. At inference, the naturalness probability ($D_\phi(y_{1:T}^{bk})$) is obtained from the output of a fully-connected layer with the last time-step’s hidden state as its input.

The initial policy is fine-tuned using the REINFORCE policy-gradient algorithm, using the future-reward described in Section 2.2.2. During training, the $K$ samples for MC roll-out are generated using beam-search or from the categorical distribution of inferred word probabilities at each time-step (see Figure 1). DRL model training stability is monitored using reward-weighted Negative Log Likelihood convergence performance, with $\frac{1}{KB} \sum_{k \in K, b \in B} [-R_{D_\phi}(y_{1:T}^{bk}) * \log(G_\theta(y_{1:T}^{bk}))]$ as the computed loss at each model training step.

We use the SGD optimizer [15] to update the weights of the agent (Equation 4). Appendix Figure 2 shows the convergence performance of the DRL model for different values of unnaturalness penalty ($\eta$), number of MC samples ($K$) and choice of sampling strategy. The complete set of hyper-parameters we tuned is in Appendix Table 2. The best combination of hyper-parameters is chosen based on performance on the validation set (see Appendix Table 3).

⁴ In production environment six related queries are suggested for each user query.

3.3 Evaluation Metrics
The binary “natural/unnatural” class prediction performance of the contextual naturalness estimator is evaluated using the F1⁵ score and Accuracy⁶ metrics. We use the mean of the following metrics, calculated on the test set, to evaluate the relevance, engagement, accuracy and diversity of generated queries.
• Sessions with positive user-action ($Sessions^{+}@6$): Long-term binary engagement metric indicating if recommended queries lead to a successful session. Its value is “1” if any of the six generated queries belongs to a search session in the test data with an associated downstream positive user action (Section 2.2.2).
• Unique@6: Diversity metric indicating the number of unique sentences among the six query suggestions made per query $q_i^{test}$. Queries containing the unknown word token (“<UNK>”) are filtered out, as only high-quality suggestions are presented to the end user.
• Precision@6: Measures relevance with respect to the query a user would enter next. It is “1” if ($q_{i+1}^{test}$) is in the set of six query suggestions made for ($q_i^{test}$) and “0” otherwise.
• Word-repetitions per sentence ($Repetitions_S$): Fraction of word repetitions per generated query ($S$). Unwanted word repetitions lead to lower quality.
• Prior Sentence Probability ($P_S$): $P_S = \sum_{w_i \in S} \log(p_{w_i})$ measures the prior sentence probability. $p_{w_i}$ is the prior word probability defined in Section 2.2.3. Lower sentence probability indicates higher diversity, as generated queries contain less frequent words.

⁵ F1 = 2 * Precision * Recall / (Precision + Recall)
⁶ Categorical accuracy: calculates how often predictions match one-hot labels.

3.4 Results
The contextual-naturalness-estimator achieved 90% accuracy and 80% F1 performance on the test set. Table 1 shows the performance of the supervised ($Seq2Seq_{NMT}$) and proposed DRL models on the five metrics mentioned in the previous section. $DRL_{beam}$ and $DRL_{sampling}$ use the “beam-search” and “sampling from categorical distribution” MC sampling strategies respectively (see Section 3.2). In order to assess the impact of applying heuristics to filter and improve the quality of suggestions provided by supervised models, we analyzed the performance of $Seq2Seq_{NMT}^{-}$, which is the $Seq2Seq_{NMT}$ model with post-processing filters to remove suggestions with repeated words.

Model\Metric | $Sessions^{+}@6$ | Unique@6 | Precision@6 | $Repetitions_S$ | $P_S$
$Seq2Seq_{NMT}$ | 0.1108 ± 0.002 | 5.8244 ± 0.0045 | 0.0456 ± 0.0025 | 2.21% ± 0.04% | -6.4442 ± 0.0149
$Seq2Seq_{NMT}^{-}$ | 0.1101 ± 0.001 | 5.5595 ± 0.0075 | 0.0456 ± 0.0025 | 0.00% ± 0.00%† | -6.4875 ± 0.0151†
$DRL_{beam}$ | 0.1155 ± 0.002† | 5.9606 ± 0.0023† | 0.0468 ± 0.0025 | 1.10% ± 0.03%† | -6.4897 ± 0.0140†
$DRL_{sampling}$ | 0.1149 ± 0.002† | 5.9956 ± 0.0007† | 0.0467 ± 0.0024 | 0.40% ± 0.02%† | -6.3932 ± 0.0141

Table 1: Mean performance of supervised $Seq2Seq_{NMT}$ and $DRL$ models across all query pairs in test data. Cells show the mean and 95% confidence interval calculated using t-distribution. Best mean is in bold. † indicates statistically significant improvement over baseline $Seq2Seq_{NMT}$.

On the offline test data set, in comparison to the baseline $Seq2Seq_{NMT}$ model, $Seq2Seq_{NMT}^{-}$ removed query suggestions with repeated words completely. However, the heuristics-based model performed poorly in terms of diversity (4.5% relative drop in mean Unique@6) and average number of successful sessions (0.6% relative drop in mean $Sessions^{+}@6$). On the other hand, both versions of our proposed DRL models outperformed the baseline model on all metrics.
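The per-query evaluation metrics of Section 3.3 can be sketched in a few lines of Python. This is an illustrative reading, not the authors' evaluation code. The function names are hypothetical. Unique@6 is computed as a count (the Table 1 values lie just under 6). The repetition fraction is assumed to be the share of tokens that repeat an earlier token. The prior word probabilities `word_prob` are assumed to be supplied externally, since their definition in Section 2.2.3 is not available here.

```python
import math
from typing import Dict, List

UNK = "<UNK>"


def unique_at_k(suggestions: List[str]) -> int:
    """Number of distinct suggestions after dropping those containing <UNK>."""
    return len({s for s in suggestions if UNK not in s.split()})


def precision_at_k(suggestions: List[str], next_query: str) -> int:
    """1 if the query the user actually entered next is among the suggestions."""
    return int(next_query in set(suggestions))


def repetition_fraction(query: str) -> float:
    """Share of tokens that repeat an earlier token in the same query (assumed)."""
    tokens = query.split()
    if not tokens:
        return 0.0
    return (len(tokens) - len(set(tokens))) / len(tokens)


def prior_sentence_log_prob(query: str, word_prob: Dict[str, float]) -> float:
    """P_S = sum of log prior word probabilities over the words of the query.

    Assumes every token of the query has an entry in word_prob.
    """
    return sum(math.log(word_prob[w]) for w in query.split())


# Hypothetical set of six suggestions for one test query.
six = ["ai jobs", "ai jobs", "ml engineer", "<UNK> jobs", "data scientist", "nlp jobs"]
print(unique_at_k(six), precision_at_k(six, "ml engineer"))
print(repetition_fraction("jobs jobs bay"))
print(prior_sentence_log_prob("ai jobs", {"ai": 0.5, "jobs": 0.25}))
```

The reported numbers in Table 1 would correspond to the mean of such per-query values over the test set.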
DRL variants achieved significant relative improvement in terms of user engagement (mean $Sessions^{+}@6$) up to 4.2% (0.1108 → 0.1155), query suggestions’ diversity (mean Unique@6) up to 3% (5.8244 → 5.9956), sentence-level diversity (mean Prior Sentence Probability) up to 0.7% (−6.4442 → −6.4897) and reduction in errors per sentence up to 82% (2.21% → 0.40%). The non-significant improvement in relevance (mean Precision@6) is not surprising, as the supervised $Seq2Seq_{NMT}$ model is also trained with consecutive query pairs.

4 Conclusions
In this paper, we proposed a Deep Reinforcement Learning (DRL) framework to improve the quality of related-search query suggestions. Using long-term user-feedback, syntactic relatedness and estimated unnaturalness penalty as reward signals, we fine-tuned the supervised text-generation policy at scale with the REINFORCE policy-gradient algorithm. We showed significant improvement in recommendation diversity (3%), query correctness (82%) and user engagement (4.2%) over industry baselines. For future work, we plan to include semantic relatedness as a reward. Since the proposed DRL framework is agnostic to the choice of encoder-decoder architecture, we plan to fine-tune different state-of-the-art language models using our proposed DRL framework.

Acknowledgments
Thanks to Cong Gu, Ankit Goyal and LinkedIn Big Data team for their help in setting up DRL experiments on Kubernetes. Thanks to Souvik Ghosh and RL Foundations team for your valuable feedback.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://tensorflow.org/ Software available from tensorflow.org.
[2] Himan Abdollahpouri. 2019. Popularity Bias in Ranking and Recommendation. https://doi.org/10.1145/3306618.3314309
[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2015).
[4] Praveen Kumar Bodigutla, Longshaokan Wang, Kate Ridgeway, Joshua Levy, Swanand Joshi, Alborz Geramifard, and Spyros Matsoukas. 2019. Domain-Independent turn-level Dialogue Quality Evaluation via User Satisfaction Estimation. arXiv:cs.LG/1908.07064
[5] F. Borges, Georgios Balikas, Marc Brette, Guillaume Kempf, A. Srikantan, Matthieu Landos, Darya Brazouskaya, and Qianqian Shi. 2020. Query Understanding for Natural Language Enterprise Search. ArXiv abs/2012.06238 (2020).
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:cs.CL/2005.14165
[7] Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-Aware Query Suggestion by Mining Click-through and Session Data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08). Association for Computing Machinery, New York, NY, USA, 875–883. https://doi.org/10.1145/1401890.1401995
[8] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. In Proceedings of the Fourth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE’08). AAAI Press, 216–217.
[9] Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. CoRR abs/1702.01806 (2017). arXiv:1702.01806 http://arxiv.org/abs/1702.01806
[10] Thushan Ganegedara. 2020. Is the race over for Seq2Seq models? https://towardsdatascience.com/is-the-race-over-for-seq2seq-models-adef2b24841c
[11] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS’14). MIT Press, Cambridge, MA, USA, 2672–2680.
[12] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with Deep Bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. 273–278. https://doi.org/10.1109/ASRU.2013.6707742
[13] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[14] Michaeel Kazi, Weiwei Guo, Huiji Gao, and Bo Long. 2020. Incorporating User Feedback into Sequence to Sequence Model Training. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, Mathieu d’Aquin, Stefan Dietze, Claudia Hauff, Edward Curry, and Philippe Cudré-Mauroux (Eds.). ACM, 2557–2564. https://doi.org/10.1145/3340531.3412714
[15] J. Kiefer and J. Wolfowitz. 1952. Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics 23, 3 (1952), 462–466. http://www.jstor.org/stable/2236690
[16] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2015).
[17] kubernetes.io. 2020. Cluster Architecture. https://kubernetes.io/docs/concepts/architecture/
[18] Seanie Lee, Dong Bok Lee, and Sung Ju Hwang. 2021. Contrastive Learning with Adversarial Perturbations for Conditional Text Generation. arXiv:cs.CL/2012.07280
[19] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. 2016. Deep Reinforcement Learning for Dialogue Generation. CoRR abs/1606.01541 (2016). arXiv:1606.01541 http://arxiv.org/abs/1606.01541
[20] Ruirui Li, Liangda Li, Xian Wu, Yunhong Zhou, and Wei Wang. 2019. Click Feedback-Aware Query Recommendation Using Adversarial Examples. In The World Wide Web Conference (WWW ’19). Association for Computing Machinery, New York, NY, USA, 2978–2984. https://doi.org/10.1145/3308558.3313412
[21] Richard Liaw, Eric Liang, Robert Nishihara, Philipp Moritz, Joseph E Gonzalez, and Ion Stoica. 2018. Tune: A Research Platform for Distributed Model Selection and Training. arXiv preprint arXiv:1807.05118 (2018).
[22] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out. Association for Computational Linguistics, Barcelona, Spain, 74–81. https://www.aclweb.org/anthology/W04-1013
[23] Rodrigo Nogueira and Kyunghyun Cho. 2017. Task-Oriented Query Reformulation with Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Copenhagen, Denmark, 574–583. https://doi.org/10.18653/v1/D17-1061
[24] Florian Schmidt. 2019. Generalization in Generation: A closer look at Exposure Bias. In Proceedings of the 3rd Workshop on Neural Generation and Translation. Association for Computational Linguistics, Hong Kong, 157–167. https://doi.org/10.18653/v1/D19-5616
[25] Hubert J. A. Schouten. 1986. Nominal scale agreement among observers. Psychometrika 51, 3 (1986), 453–466. https://doi.org/10.1007/BF02294066
[26] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
[27] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. 2020. Learning to summarize from human feedback. CoRR abs/2009.01325 (2020). arXiv:2009.01325 https://arxiv.org/abs/2009.01325
[28] Thanh Tin Tang, Nick Craswell, David Hawking, Kathy Griffiths, and Helen Christensen. 2006. Quality and relevance of domain-specific search: A case study in mental health. Information Retrieval 9, 2 (2006), 207–225. https://doi.org/10.1007/s10791-006-7150-5
[29] Xuanhui Wang, Michael Bendersky, Donald Metzler, and Marc Najork. 2016. Learning to Rank with Selection Bias in Personal Search. In Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 115–124.
[30] Xiao Wang, Craig Macdonald, and Iadh Ounis. 2020. Deep Reinforced Query Reformulation for Information Retrieval. arXiv:cs.IR/2007.07987
[31] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3 (1992), 229–256. https://doi.org/10.1007/BF00992696
[32] Ronald J. Williams and David Zipser. 1989. A Learning Algorithm for Continually
Running Fully Recurrent Neural Networks.
[33] Lantao Yu, W. Zhang, J. Wang, and Y. Yu. 2017. SeqGAN: Sequence Generative
Adversarial Nets with Policy Gradient. In AAAI.
A Appendices
Figure 2: Reward weighted Negative log-likelihood convergence performance w.r.t the training epochs of 𝐷𝑅𝐿𝑠𝑎𝑚𝑝𝑙𝑖𝑛𝑔 (top figure) and 𝐷𝑅𝐿𝑏𝑒𝑎𝑚
(bottom figure) models, for different values of learning-rate. Number of Monte Carlo (MC) samples generated per query during training and unnatural
suggestion penalty are from best parameter combination for each MC sampling-method (see Appendix Table 3).
Table 2: Hyper parameter value ranges we used for training query generation and contextual-naturalness-estimation models. Once hyper-params of
Supervised-𝑆𝑒𝑞 2𝑆𝑒𝑞 𝑁 𝑀𝑇 are fine-tuned, same model architecture is used for training supervised natural-estimator (encoder) model and DRL (encoder-
decoder) agent. Hence batch-size, sequence length, number of rnn layers and hidden vector length remain consistent across all three models.
Table 3: Optimal hyper-parameter values for query generation and naturalness estimation models with criteria applied on validation-set performance
to choose them.