Justifying Recommendations Using Distantly-Labeled Reviews and Fine-Grained Aspects
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natural Language Processing, pages 188–197,
Hong Kong, China, November 3–7, 2019. c 2019 Association for Computational Linguistics
where user interactions are highly sparse. On the other hand, generating diverse responses is essential in personalized content generation scenarios such as justification generation. Instead of always predicting the most popular reasons, it is preferable to present diverse justifications for different users based on their personal interests. Recent work has shown that incorporating prior knowledge into generation frameworks can greatly improve diversity. Prior knowledge could include story-lines in story generation (Yao et al., 2019), or historical responses in dialogue systems (Weston et al., 2018).

In this work, our goal is to generate convincing and diverse justifications. To address the challenge of lacking ground-truth data about 'good' justifications, we propose a pipeline that can identify justifications from massive corpora of reviews or tips. We extract fine-grained aspects from justifications and build user personas and item profiles consisting of sets of representative aspects. To improve generation quality and diversity, we propose two generation models: (1) a reference-based Seq2Seq model with aspect-planning, which takes previous justifications as a reference and can produce justifications based on different aspects, and (2) an aspect-conditional masked language model that can generate diverse justifications from templates extracted from previous justifications.

Our contributions are threefold:
• To facilitate recommendation justification generation, we propose a pipeline to identify justification candidates and build aspect-based user personas and item profiles from massive corpora of reviews. With this approach, we are able to build large-scale personalized justification datasets. We use these extractive justification segments in the task of explainable recommendation and show that they are better training sources than whole reviews.
• We propose two models based on reference attention, aspect-planning techniques, and a persona-conditional masked language model. We show that adding such personalized information enables the models to generate justifications with high quality and diversity.
• We conduct extensive experiments on two real-world datasets from Yelp and Amazon Clothing. We provide an annotated dataset about 'good' justifications on the Yelp dataset and show that the binary classifier trained on this dataset generalizes well to the Amazon Clothing dataset. We study different decoding strategies and compare their effect on generation performance.

2 Dataset Generation

In this section, we introduce the pipeline to extract high quality justifications from raw user reviews. Specifically, our goal here is to identify review segments that can be used as justifications and build a personalized justification dataset upon them. Our pipeline consists of three steps:
1. Annotating a set of review segments with binary labels, i.e., determining whether they are 'good' or 'bad' justifications.
2. Training a classifier on the annotated subset and applying it to distantly label all the review segments, extracting 'good' justifications for each user and item pair.
3. Applying fine-grained aspect extraction to the extracted justifications, and building user personas and item profiles.

2.1 Identifying Justifications From Reviews

The first step is to extract text segments from reviews that are appropriate to use as justifications. Instead of a complete sentence or phrase, we define each segment as an Elementary Discourse Unit (EDU; Mann and Thompson, 1988), which corresponds to a sequence of clauses. We use the model of Wang et al. (2018) to obtain EDUs from reviews. Recent works have shown that EDUs can improve the performance of document-level summarization (Bhatia et al., 2015) and opinion summarization (Angelidis and Lapata, 2018).

After preprocessing the reviews into EDUs, we analyzed the linguistic differences between recommendation justifications and reviews, and built two rules to filter out segments that are unlikely to be suitable justifications: (1) segments with first-person or third-person pronouns, and (2) segments that are too long or too short. Next, two expert annotators were shown 1,000 of the segments that were not filtered out and asked to determine whether they are 'good' justifications. Labeling was performed iteratively, followed by feedback and discussion, until the quality was aligned between the two annotators. At the end of the process, the inter-annotator agreement for the binary labeling task (good vs. bad), measured by Cohen's kappa (Cohen, 1960), was 0.927 after alignment. Then, the annotators further labeled 600 segments.
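The agreement figure above can be computed directly from the two annotators' label lists; a minimal pure-Python sketch of Cohen's kappa for the binary good/bad task (the toy labels are illustrative, not the actual annotations):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[l] / n) * (freq_b[l] / n)
              for l in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e)

# Toy binary 'good'/'bad' labels (illustrative only).
a = ["good", "good", "bad", "good", "bad", "bad"]
b = ["good", "good", "bad", "good", "bad", "good"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

With perfect agreement the statistic is 1.0; values near 0.927, as reported above, indicate near-perfect agreement.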
Method                   F1     Recall  Precision
BOW-Xgboost              0.559  0.679   0.475
CNN                      0.644  0.596   0.700
LSTM-MaxPool             0.675  0.703   0.650
BERT                     0.747  0.700   0.800
BERT-SA (one epoch)      0.481  0.975   0.320
BERT-SA (three epochs)   0.491  1.000   0.325

Table 2: Performance for classifying review segments as good or bad for recommendation justification.

Example justification segments:

Yelp
  The Tuna is pretty amazing
  Appetizers and pasta are excellent here
  An excellent selection of both sweet and savory crepes
  It was filled with delicious food, fantastic music and dancing

Amazon-Cloth
  The quality of the material is great
  Great shirt, especially for the price.
  The seams and stitching are really nice
  Fit the bill for a Halloween costume.
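The rule-based pre-filter described in Section 2.1 (drop segments containing first- or third-person pronouns, or segments of unsuitable length) can be sketched as follows; the pronoun list and length thresholds are illustrative assumptions, since the paper does not specify exact values:

```python
# Hypothetical pronoun list and length bounds; the paper does not give exact settings.
PRONOUNS = {"i", "me", "my", "mine", "we", "us", "our",
            "he", "him", "his", "she", "her", "they", "them", "their"}
MIN_TOKENS, MAX_TOKENS = 3, 20

def keep_segment(segment):
    """Return True if an EDU segment passes both filtering rules."""
    tokens = segment.lower().split()
    if not (MIN_TOKENS <= len(tokens) <= MAX_TOKENS):
        return False  # rule (2): too long or too short
    if any(t.strip(".,!?") in PRONOUNS for t in tokens):
        return False  # rule (1): first/third-person pronoun present
    return True

print(keep_segment("The seams and stitching are really nice"))  # → True
print(keep_segment("I love this place"))                        # → False
```

Segments that survive this filter are the candidates passed to the annotators and, later, to the distant-labeling classifier evaluated in Table 2.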
[Figure 1: Structure of the Reference-based Seq2Seq model; user, item, and fine-grained aspect (FA) representations are combined and passed through a projection layer to predict output words (e.g. P(food)).]
justifications $J_{u,i} = \{w_1, w_2, \ldots, w_T\}$ that would explain why item $i$ fits user $u$'s interests, where $T$ is the length of the justification.

3.2 Reference-based Seq2Seq Model

Our base model follows the structure of a standard Seq2Seq (Sutskever et al., 2014) model. Our framework, called 'Ref2Seq', views the historical justifications of users and items as references and learns latent personalized features from them. Figure 1 shows the structure of our Reference-based Seq2Seq model. It includes two components: (1) two sequence encoders that learn user and item latent representations by taking previous justifications as references; (2) a sequence decoder that incorporates the representations of users and items to generate personalized justifications.

Sequence Encoders. Our user encoder and item encoder share the same structure, which includes an embedding layer, a two-layer bi-directional GRU (Cho et al., 2014), and a projection layer. The input is a user (or item) reference $D$ consisting of a set of historical justifications. These justifications pass through a word embedding layer, then go through the GRU and yield a sequence of hidden states $e \in \mathbb{R}^{l_s \times l_r \times n}$:

$$E = \mathrm{Embedding}(D), \quad e = \mathrm{GRU}(E) = \overrightarrow{e} + \overleftarrow{e}, \tag{1}$$

where $l_s$ denotes the length of the sequence, $n$ is the hidden size of the encoder GRU, $E \in \mathbb{R}^{l_s \times l_r \times n}$ is the embedded sequence representation, and $\overrightarrow{e}$ and $\overleftarrow{e}$ are the hidden vectors produced by a forward and a backward GRU (respectively).

To combine information from different 'references' (i.e. justifications), the hidden states are then projected via a linear layer:

$$\hat{e} = W_e \cdot e + b_e, \tag{2}$$

where $\hat{e} \in \mathbb{R}^{l_s \times n}$ is the final output of the encoder, and $W_e \in \mathbb{R}^{l_r}$, $b_e \in \mathbb{R}$ are learned parameters.

Sequence Decoder. The decoder is a two-layer GRU that predicts the target words given a start token. The hidden state of the decoder is initialized using the sum of the last hidden states of the user and item encoders. The hidden state at time-step $t$ is updated via the GRU unit based on the previous hidden state and the input word. Specifically:

$$h_0 = e^u_{l_s} + e^i_{l_s}, \quad h_t = \mathrm{GRU}(w_t, h_{t-1}), \tag{3}$$

where $e^u_{l_s}$ and $e^i_{l_s}$ are the last hidden states of the user and item encoder outputs $\hat{e}^u$ and $\hat{e}^i$.

To explore the relation between the reference and the generation, we apply an attention fusion layer to summarize the output of each encoder. For the user and item reference encoders, the attention vector is defined as:

$$a^1_t = \sum_{j=1}^{l_s} \alpha^1_{tj} e_j, \quad \alpha^1_{tj} = \exp\big(\tanh({v^1_\alpha}^\top (W^1_\alpha [e_j; h_t] + b^1_\alpha))\big)/Z, \tag{4}$$

where $a^1_t \in \mathbb{R}^n$ is an attention vector over the sequence encoder at time-step $t$, $\alpha^1_{tj}$ is an attention score over the encoder hidden state $e_j$ and decoder hidden state $h_t$, and $Z$ is a normalization term.

Aspect-Planning Generation. One of the challenges in generating justifications is improving controllability, i.e., directly manipulating the content being generated. Inspired by 'plan-and-write' (Yao et al., 2019), we extend the base model to an Aspect-Planning Ref2Seq (AP-Ref2Seq) model, where we plan a fine-grained aspect before generation. This aspect planning can be considered an extra form of supervision, rather than a hard constraint, that makes justification generation more controllable.
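Equation (4) above scores each encoder state against the current decoder state with an exp(tanh(·)) function and normalizes by Z; a minimal plain-Python sketch over toy vectors (the tiny dimensions and hand-picked weights are illustrative, not the model's actual parameters):

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_fusion(enc_states, h_t, W, b, v):
    """Eq. (4): alpha_tj = exp(tanh(v^T (W [e_j; h_t] + b))) / Z,
    then a_t = sum_j alpha_tj * e_j (the context vector)."""
    raw = []
    for e_j in enc_states:
        concat = e_j + h_t                                # [e_j ; h_t]
        proj = [dot(row, concat) + bi for row, bi in zip(W, b)]
        raw.append(math.exp(math.tanh(dot(v, proj))))     # unnormalized score
    Z = sum(raw)                                          # normalization term Z
    alphas = [s / Z for s in raw]
    dim = len(enc_states[0])
    a_t = [sum(al * e[k] for al, e in zip(alphas, enc_states)) for k in range(dim)]
    return a_t, alphas

# Toy example: two encoder states, hidden size 2 (illustrative numbers).
enc = [[1.0, 0.0], [0.0, 1.0]]
h_t = [0.5, 0.5]
W = [[0.1, 0.0, 0.0, 0.1], [0.0, 0.1, 0.1, 0.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
a_t, alphas = attention_fusion(enc, h_t, W, b, v)
print(alphas)  # → [0.5, 0.5]: these symmetric toy weights give equal attention
```

In the model the same scoring is applied once over the user encoder and once over the item encoder, producing the two context vectors that feed the decoder's output layer.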
When generating the justification for user $u$ and item $i$, we first provide a fine-grained aspect $a$ as a plan. The aspect $a$ is fed into the word embedding layer to obtain the aspect embedding $E_a$. Then, we compute the scores between the embedding of the aspect and the decoder hidden state as:

$$a^2_t = \alpha^2_t E_a, \quad \alpha^2_t = \exp\big(\tanh({v^2_\alpha}^\top (W^2_\alpha [E_a; h_t] + b^2_\alpha))\big)/Z, \tag{5}$$

where $a^2_t \in \mathbb{R}^n$ is an attention vector and $\alpha^2_t$ is an attention score.

The attention vectors $a^1_{ut}$ of user $u$, $a^1_{it}$ of item $i$, and $a^2_t$ of fine-grained aspect $a$ are concatenated with the decoder hidden state at time-step $t$ and projected to obtain the output word distribution $P$. The output probability for word $w$ at time-step $t$ is given by:

$$p(w_t) = \tanh(W_1 [h_t; a^1_{ut}; a^1_{it}; a^2_t] + b_1), \tag{6}$$

where $w_t$ is the target word at time-step $t$. Given the probability $p(w_t)$ at each time-step $t$, the model is trained using a cross-entropy loss against the ground-truth sequence.

3.3 Aspect Conditional Masked Language Model

Though Seq2Seq-based models can achieve high quality output, they often fail to generate diverse content. Recent work in natural language generation (NLG) has tried to combine generation methods with information retrieval techniques to increase generation diversity (Li et al., 2018; Baheti et al., 2018). The basic idea follows the retrieve-and-edit paradigm: first retrieve historical responses as templates, and then edit the templates into new content. Since our data is annotated with fine-grained aspects, it naturally fits this retrieve-and-edit paradigm. Meanwhile, masked language models have shown great performance in language modeling. Recent work (Wang and Cho, 2019; Mansimov et al., 2019) has shown that by sampling from a masked language model (e.g. BERT), it is possible to generate coherent sentences.

Inspired by this work, we extend such an approach into a conditional version: we explore the use of an Aspect Conditional Masked Language Model (ACMLM) to generate diverse personalized justifications. Figure 2 shows the structure of our Aspect Conditional Masked Language Model. For a justification $J_{u,i}$ that user $u$ wrote about item $i$, we adapt the pre-trained BERT model (Devlin et al., 2019) into an encoder-decoder network with (1) an aspect encoder, which encodes the user persona and item profile into latent representations, and (2) a masked language model sequence decoder, which takes in a masked justification and predicts the masked tokens.

Aspect Encoder. Our aspect encoder shares the same WordPiece embeddings (Wu et al., 2016) as BERT. The encoder feeds the intersection of fine-grained aspects from the user persona and item profile, $A_{ui} = \{a_1, \ldots, a_{K'}\}$, into the embedding layer and obtains the aspect embedding $A_{ui} \in \mathbb{R}^{K' \times n}$, where $K'$ is the number of common fine-grained aspects and $n$ is the dimension of the WordPiece embeddings.

Masked Language Model Sequence Decoder. We use the masked language model in the pre-trained BERT model as our sequence decoder and add attention over the aspect encoder's output. As shown in Figure 2, the input to the decoder is a masked justification $J^M_{u,i} = \{w_1, \ldots, w_T\}$ with multiple tokens replaced by [MASK]. The decoder's output $T \in \mathbb{R}^{T \times n}$ is then fed to the attention layer to calculate an attention score with the output of the encoder:

$$a^3_t = \sum_{j=1}^{K'} \alpha^3_{tj} A_j, \quad \alpha^3_{tj} = \exp\big(\tanh({v^3_\alpha}^\top (W^3_\alpha [A_j; T_t] + b^3_\alpha))\big)/Z. \tag{7}$$

The attention vector $a^3_t$ is then concatenated with the decoder hidden state at time-step $t$ and sent to a linear projection layer to obtain the output word distribution $P$. The output probability for word $w$ at time-step $t$ is given by:

$$p(w_t) = \tanh(W_2 [T_t; a^3_t] + b_2), \tag{8}$$

where $w_t$ is the target word at time-step $t$.

Masking Procedure. The original BERT paper applies a flat rate (15%) to decide whether to mask a token. Unlike their approach, we adopt a higher masking rate for fine-grained aspects, since they are more important in justifications. Specifically, if we encounter a fine-grained aspect, we replace it with a [MASK] token 30% of the time; for other words, we replace them with a [MASK] token 15% of the time.

During training, the model only predicts those masked tokens and calculates a cross-entropy loss on them.
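The aspect-biased masking procedure above is straightforward to sketch: each token is masked with probability 0.30 if it is a fine-grained aspect and 0.15 otherwise (the token list and aspect set below are illustrative, and `FixedDraw` replaces random draws to make the demo deterministic):

```python
import random

ASPECT_MASK_RATE = 0.30  # fine-grained aspect tokens
OTHER_MASK_RATE = 0.15   # all other tokens

def mask_justification(tokens, aspect_tokens, rng=random):
    """Replace tokens with [MASK] at an aspect-dependent rate;
    return the masked sequence and the positions to predict."""
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        rate = ASPECT_MASK_RATE if tok in aspect_tokens else OTHER_MASK_RATE
        if rng.random() < rate:
            masked.append("[MASK]")
            targets.append(i)  # loss is computed only on these positions
        else:
            masked.append(tok)
    return masked, targets

class FixedDraw:
    """Deterministic stand-in for random: always returns the same draw."""
    def __init__(self, value):
        self.value = value
    def random(self):
        return self.value

# A draw of 0.2 masks aspect tokens (0.2 < 0.30) but not other tokens (0.2 >= 0.15).
masked, targets = mask_justification(["the", "staff", "are", "friendly"],
                                     aspect_tokens={"staff"}, rng=FixedDraw(0.2))
print(masked)  # → ['the', '[MASK]', 'are', 'friendly']
```

The returned `targets` positions are exactly those on which the training loss is computed.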
[Figure 2: Structure of the Aspect Conditional Masked Language Model; BERT embeds a masked justification, and a projection layer over the decoder outputs predicts the masked tokens (e.g. P(atmosphere), P(service)).]
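At generation time, ACMLM repeatedly masks a position and lets the model fill it in, as illustrated by the 'Iter' example below. A generic mask-predict refinement loop can be sketched as follows; `predict_token` stands in for the masked language model and the deterministic `CyclePositions` sampler is purely illustrative:

```python
import random

def iterative_refine(tokens, predict_token, n_iters, rng=random):
    """Mask-predict refinement: repeatedly pick a position t, mask it,
    and replace it with the model's prediction for that slot."""
    x = list(tokens)
    for _ in range(n_iters):
        t = rng.randrange(len(x))                  # position to resample
        context = x[:t] + ["[MASK]"] + x[t + 1:]   # X with x_t masked
        x[t] = predict_token(context, t)           # x_t replaced by a new sample
    return x

class CyclePositions:
    """Deterministic stand-in for random: visits positions in order."""
    def __init__(self):
        self.i = -1
    def randrange(self, n):
        self.i = (self.i + 1) % n
        return self.i

# Toy 'model' that always proposes a fixed word per position (illustrative).
propose = lambda context, t: ["the", "staff", "is", "friendly"][t]
out = iterative_refine(["universe", "[MASK]", "is", "friendly"], propose,
                       n_iters=4, rng=CyclePositions())
print(out)  # → ['the', 'staff', 'is', 'friendly']
```

With the real model, each refinement step samples from BERT's distribution over the masked slot, so early incoherent tokens (e.g. 'universe' at Iter 0 below) are gradually replaced.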
$X^{i+1} = (x^i_1, \ldots, \tilde{x}^i_t, \ldots, x^i_T)$. After repeating this procedure $N$ times, the final output is considered as the generation output.

Example of iterative refinement:

  Iter 0:  universe [MASK] is extremely friendly and persona ##ble
  Iter 5:  the [MASK] is extremely friendly and persona ##ble
  Iter 10: the [MASK] is extremely friendly and persona ##ble
  Iter 15: the staff are extremely cool and persona ##ble
  Iter 20: the staff are extra kind , persona ##ble

4 Experiments

4.1 Datasets
Dataset          Train      Dev      Test     # Users  # Items  # Aspects
Yelp             1,219,962  115,907  115,907  115,907  51,948   2,041
Amazon Clothing  202,528    57,947   57,947   57,947   50,240   581
Table 8: Comparisons of the generated justifications from different models for three businesses on the Yelp dataset.

Shake Shack
  Ground Truth:     The burger was good
  LexRank:          A great burger and fries.
  Ref2Seq (Review): i love trader joe 's , i love trader joe 's
  Ref2Seq (Tip):    this place is awesome
  Ref2Seq:          this place has some of the best burgers
  Ref2Seq (Top-k):  the fries are amazing
  ACMLM:            breakfast sandwiches are overall very filling

Teharu Sushi
  Ground Truth:     The rolls are pretty great , typical rolls not that many specials
  LexRank:          Sushi ?
  Ref2Seq (Review): the food was good and the service was great
  Ref2Seq (Tip):    love this place
  Ref2Seq:          the sushi is delicious
  Ref2Seq (Top-k):  fresh and delicious sushi
  ACMLM:            overall fun experience with half price sushi

MGM Grand Hotel
  Ground Truth:     Room was very clean comfortable
  LexRank:          Great rooms.
  Ref2Seq (Review): i love this place ! the food is always good and the service is always great
  Ref2Seq (Tip):    come here
  Ref2Seq:          the room was nice
  Ref2Seq (Top-k):  open hotel for hours
  ACMLM:            family style dinner , long time shopping trip to vegas , family dining , cheap lunch
cation generation task. As shown in Table 6, both sampling-based methods, Ref2Seq (Top-k) and ACMLM, achieve higher Distinct-1 and Distinct-2, while their BLEU scores are lower than those of Seq2Seq-based models using beam search. Therefore, we also perform human evaluation to validate the generation quality of our proposed methods.

4.5 Human Evaluation

We conduct human evaluation on three aspects: (1) Relevance measures whether the generated output contains information relevant to an item; (2) Informativeness measures whether the generated justification includes specific information that is helpful to users; and (3) Diversity measures how distinct the generated output is compared with other justifications.

We focus on the Yelp dataset and sample 100 generated examples from each of the five models shown in Table 7. Human annotators are asked to give a score in the range [1,5] (lowest to highest) for each metric. Each example is rated by at least three annotators. The results show that both Ref2Seq (Top-k) and ACMLM achieve higher scores on Diversity and Informativeness compared to other models.

4.6 Qualitative Analysis

Here we study the following two qualitative questions:

RQ1: How do training data and methods affect generation? As Table 8 shows, models trained on reviews and tips tend to generate generic phrases (such as 'i love this place') which often do not include information that helps users make decisions. Other models, trained on the justification datasets, tend to mention concrete information (e.g. different aspects). LexRank tends to generate relevant but short content. Meanwhile, sampling-based models are able to generate more diverse content.

RQ2: How does aspect planning affect generation? To mitigate the trade-off between diversity and relevance, one approach is to add more constraints during generation, such as constrained Beam Search (Anderson et al., 2017). In our work, we extend our base model Ref2Seq by incorporating aspect-planning to guide generation. As shown in Table 9, most planned aspects are present in the generated outputs of AP-Ref2Seq.

Dataset          Aspect    Generated Output
Yelp             dining    the dining room is nice
                 pastry    the pastries were pretty good
                 chicken   the chicken fried rice is the best
                 sandwich  the pulled pork sandwich is the best thing on the menu
Amazon-Clothing  product   great product , fast shippong
                 price     design is nice , good price
                 leather   comfortable leather sneakers . classic
                 walking   sturdy , great city walking shoes

Table 9: Generated justifications from AP-Ref2Seq. The planned aspects are randomly selected from users' personas.

5 Related Work

Explainable Recommendation. There has been a line of work that studies how to improve the
explainability of recommender systems. Catherine and Cohen (2017) learn latent representations of review text to predict ratings. These representations are then used to find the most helpful reviews for a given user and item pair. Another popular direction is to generate text to justify recommendations. Dong et al. (2017) proposed an attribute-to-sequence model that utilizes categorical attributes to generate product reviews. Ni et al. (2017) developed a multi-task learning method that considers collaborative filtering and review generation. Li et al. (2019b) generated tips by considering 'persona' information, which can capture the language style of users and the characteristics of items. However, these works use whole reviews or tips as training examples, which may not be appropriate due to the quality of review text. More recently, Liu et al. (2019) proposed a framework to generate fine-grained explanations for text classification. To obtain labels for human-readable explanations, they constructed a dataset from a website which provides ratings and fine-grained summaries written by users. Unfortunately, most websites do not provide such fine-grained information. In contrast, our work identifies justifications from reviews, uses them as training examples, and shows via extensive experiments that they are a better data source for explainable recommendation.

get sentiment. In this work, we also introduce a conditional masked language model, but consider more fine-grained aspects.

6 Conclusion

In this work, we studied the problem of personalized justification generation. To build high quality justification datasets, we provided an annotated dataset and proposed a pipeline to extract justifications from massive review corpora. To generate convincing and diverse justifications, we developed two models: (1) Ref2Seq, which leverages historical justifications as references during generation; and (2) ACMLM, an aspect conditional model built on a pre-trained masked language model. Our experiments showed that Ref2Seq achieves higher scores (in terms of BLEU) and ACMLM achieves higher diversity scores compared with baselines. Human evaluation showed that reference-based models obtain high relevance scores, while sampling-based methods lead to more diverse and informative outputs. Finally, we showed that aspect-planning is a promising way to guide generation to produce personalized and relevant justifications.

Acknowledgements. This work is partly supported by NSF #1750063. We thank all the reviewers for their constructive suggestions.
References

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In EACL.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res., 22:457–479.

Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, and Shuming Shi. 2018. Generating multiple diverse responses for short-text conversation. In AAAI.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735–1780.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2015. A diversity-promoting objective function for neural conversation models. In HLT-NAACL.

Juncen Li, Robin Jia, He He, and Percy S. Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL-HLT.

Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, and Yang Song. 2019a. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. In ACL.

Piji Li, Zihao Wang, Lidong Bing, and Wai Lam. 2019b. Persona-aware tips generation. In WWW.

Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In SIGIR.

Hui Liu, Qingyu Yin, and William Yang Wang. 2019. Towards explainable NLP: A generative explanation framework for text classification. In ACL.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: toward a functional theory of text.

Elman Mansimov, Alex Wang, and Kyunghyun Cho. 2019. A generalized framework of sequence generation with application to undirected sequence models. ArXiv, abs/1905.12790.

Jianmo Ni, Zachary C. Lipton, Sharad Vikram, and Julian J. McAuley. 2017. Estimating reactions and recommending products with generative models of reviews. In IJCNLP.

Jianmo Ni and Julian McAuley. 2018. Personalized review generation by expanding phrases and attending on aspect-aware representations. In ACL.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.

Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In EMNLP.

Jason Weston, Emily Dinan, and Alexander H. Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In SCAI@EMNLP.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model to sentiment transfer. In IJCAI.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhoujun Li, and Ming Zhou. 2018. Response generation by context-aware prototype editing. In AAAI.

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In AAAI.

Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR.