
Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects

Jianmo Ni, Jiacheng Li, Julian McAuley


University of California, San Diego
{jin018,j9li,jmcauley}@ucsd.edu

Abstract

Several recent works have considered the problem of generating reviews (or 'tips') as a form of explanation as to why a recommendation might match a user's interests. While promising, we demonstrate that existing approaches struggle (in terms of both quality and content) to generate justifications that are relevant to users' decision-making process. We seek to introduce new datasets and methods to address this recommendation justification task. In terms of data, we first propose an 'extractive' approach to identify review segments which justify users' intentions; this approach is then used to distantly label massive review corpora and construct large-scale personalized recommendation justification datasets. In terms of generation, we design two personalized generation models with this data: (1) a reference-based Seq2Seq model with aspect-planning which can generate justifications covering different aspects, and (2) an aspect-conditional masked language model which can generate diverse justifications based on templates extracted from justification histories. We conduct experiments on two real-world datasets which show that our models are capable of generating convincing and diverse justifications.

Review examples:
  I love this little stand! The coconut mocha chiller and caramel macchiato are delicious.
  Wow what a special find. One of the most unique and special date nights my husband and I have had.
Tip examples:
  Great food. Nice ambiance. Gnocchi were very good.
  I can't get enough of this place.
Justification examples:
  The food portions were huge.
  Plain cheese quesadilla is very good and very cheap.

Table 1: In contrast to reviews and tips, we seek to automatically generate recommendation justifications that are more concise, concrete, and helpful for decision making. Examples of justifications from reviews, tips, and our annotated dataset are marked in bold.

1 Introduction

Explaining, or justifying, recommendations to users has the potential to increase their transparency and reliability. However, providing meaningful interpretations remains a difficult task, partly due to the black-box nature of many recommendation models, but also because we simply lack ground-truth datasets specifying what 'good' justifications ought to look like.

Previous work has sought to learn user preferences and writing styles from crowd-sourced reviews (Dong et al., 2017; Ni and McAuley, 2018) to generate explanations in the form of natural language, e.g. by generating synthesized reviews similar to those that users would write about a product. However, a large portion of review text (or text from 'tips') is of little relevance to most users' decision making (e.g. it describes verbose experiences or general endorsements) and may not be appropriate to use as explanations in terms of content and language style. As a result, existing models that learn directly from reviews (or tips) may not capture the crucial information that explains users' purchases. Table 1 shows examples of reviews, tips, and ideal justifications. More recently, there has been work studying the task of tip generation, where tips are concise summaries of reviews (Li et al., 2017). Though tips are concise and some subset of them might be suitable as candidates for recommendation justifications, only a few e-commerce systems provide tips alongside reviews. Even in systems where tips are available, the number of tips is usually far smaller than the number of reviews. These approaches hence suffer from generalizability issues, especially in settings where user interactions are highly sparse.
On the other hand, generating diverse responses is essential in personalized content generation scenarios such as justification generation. Instead of always predicting the most popular reasons, it is preferable to present diverse justifications to different users based on their personal interests. Recent work has shown that incorporating prior knowledge into generation frameworks can greatly improve diversity. Prior knowledge could include story-lines in story generation (Yao et al., 2019), or historical responses in dialogue systems (Weston et al., 2018).

In this work, our goal is to generate convincing and diverse justifications. To address the challenge of lacking ground-truth data about 'good' justifications, we propose a pipeline that can identify justifications from massive corpora of reviews or tips. We extract fine-grained aspects from justifications and build user personas and item profiles consisting of sets of representative aspects. To improve generation quality and diversity, we propose two generation models: (1) a reference-based Seq2Seq model with aspect-planning, which takes previous justifications as a reference and can produce justifications based on different aspects, and (2) an aspect-conditional masked language model that can generate diverse justifications from templates extracted from previous justifications.

Our contributions are threefold:

• To facilitate recommendation justification generation, we propose a pipeline to identify justification candidates and build aspect-based user personas and item profiles from massive corpora of reviews. With this approach, we are able to build large-scale personalized justification datasets. We use these extractive justification segments in the task of explainable recommendation and show that they are better training sources than whole reviews.

• We propose two models based on reference attention, aspect-planning techniques, and a persona-conditional masked language model. We show that adding such personalized information enables the models to generate justifications with high quality and diversity.

• We conduct extensive experiments on two real-world datasets from Yelp and Amazon Clothing. We provide an annotated dataset of 'good' justifications on the Yelp dataset and show that a binary classifier trained on this dataset generalizes well to the Amazon Clothing dataset. We study different decoding strategies and compare their effect on generation performance.

2 Dataset Generation

In this section, we introduce the pipeline we use to extract high-quality justifications from raw user reviews. Specifically, our goal here is to identify review segments that can be used as justifications and to build a personalized justification dataset upon them. Our pipeline consists of three steps:

1. Annotating a set of review segments with binary labels, i.e., determining whether they are 'good' or 'bad' justifications.
2. Training a classifier on the annotated subset and applying it to distantly label all the review segments, extracting 'good' justifications for each user and item pair.
3. Applying fine-grained aspect extraction to the extracted justifications, and building user personas and item profiles.

2.1 Identifying Justifications From Reviews

The first step is to extract text segments from reviews that are appropriate to use as justifications. Instead of a complete sentence or phrase, we define each segment as an Elementary Discourse Unit (EDU; Mann and Thompson, 1988), which corresponds to a sequence of clauses. We use the model of Wang et al. (2018) to obtain EDUs from reviews. Recent works have shown that EDUs can improve the performance of document-level summarization (Bhatia et al., 2015) and opinion summarization (Angelidis and Lapata, 2018).

After preprocessing the reviews into EDUs, we analyzed the linguistic differences between recommendation justifications and reviews, and built two rules to filter out segments that are unlikely to be suitable justifications: (1) segments containing first-person or third-person pronouns, and (2) segments that are too long or too short. Next, two expert annotators were shown 1,000 segments from those not filtered out and asked to determine whether they are 'good' justifications. Labeling was performed iteratively, followed by feedback and discussion, until the quality was aligned between the two annotators. At the end of the process, the inter-annotator agreement for the binary labeling task (good vs. bad), measured by Cohen's kappa (Cohen, 1960), was 0.927 after alignment. The annotators then labeled a further 600 segments.

189
Overall, 24.8% of the segments were labeled good.

Method                  F1     Recall  Precision
BOW-XGBoost             0.559  0.679   0.475
CNN                     0.644  0.596   0.700
LSTM-MaxPool            0.675  0.703   0.650
BERT                    0.747  0.700   0.800
BERT-SA (one epoch)     0.481  0.975   0.320
BERT-SA (three epochs)  0.491  1.000   0.325

Table 2: Performance for classifying review segments as good or bad for recommendation justification.

2.2 Automatic Classification

Our next step is to propagate labels to the complete review corpus. Here we adopt BERT (Devlin et al., 2019) and fine-tune it on our classification task: a [CLS] token is added to the beginning of each segment, and the final hidden state (i.e., the output of BERT) corresponding to this token is fed into a linear layer to obtain the binary prediction. Cross-entropy is used as the training loss.

We split the annotated dataset into Train, Dev, and Test sets with a 0.8/0.1/0.1 ratio, fine-tune the BERT classifier on the Train set, and choose the best model on the Dev set. After three epochs of fine-tuning, BERT achieves an F1-score of 0.80 on the Test set. We compare the performance of BERT with multiple baseline models: (1) an XGBoost model which uses bags-of-words as sentence features; (2) a convolutional neural network (CNN) with three convolution layers and one linear layer; (3) a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) with a max-pooling layer and a linear layer; and (4) a BERT sentiment classifier (BERT-SA) trained on the complete Yelp dataset for one epoch and for three epochs. To obtain pre-trained word embeddings for the CNN and LSTM models, we applied fastText (Bojanowski et al., 2016) to the Yelp review dataset. We set the embedding dimension to 200 and used default values for the other hyper-parameters.

Table 2 presents results for our binary classification task. The BERT classifier has a higher F1-score and precision than the other classifiers. The BERT-SA model after three epochs only achieves an F1-score of 0.491, which confirms the difference between sentiment analysis and our good/bad task: even if a segment has positive sentiment, it might not be suitable as a justification.

Yelp
  The Tuna is pretty amazing
  Appetizers and pasta are excellent here
  An excellent selection of both sweet and savory crepes
  It was filled with delicious food, fantastic music and dancing
Amazon-Cloth
  The quality of the material is great
  Great shirt, especially for the price.
  The seams and stitching are really nice
  Fit the bill for a Halloween costume.

Table 3: Examples of justifications with fine-grained aspects in our annotated dataset. The fine-grained aspects are italic and underlined.

2.3 Fine-grained Aspect Extraction

Finally, we extract the fine-grained aspects that each justification covers. Fine-grained aspects are properties of products that appear in users' opinions. We adopt the method proposed by Zhang et al. (2014) to build a sentiment lexicon which includes a set of fine-grained aspects from the whole dataset. We then use simple rules to determine which aspects appear in each justification: for each aspect, if its singular or plural form appears in the tokenized justification, we consider the aspect to be present in that justification. Table 3 presents a set of examples from our dataset. Each example consists of a justification that a user has written about an item, and multiple fine-grained aspects mentioned in the justification. Note that we only annotated the Yelp dataset; we trained a classifier on it and applied the model to both the Yelp and Amazon Clothing datasets. As shown in Table 3, the trained classifier works well on both datasets.
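The Section 2.2 classifier is a standard BERT fine-tuning setup. A minimal sketch follows, assuming the current Hugging Face transformers API (our implementation in Section 4.3 uses the earlier pytorch-pretrained-BERT package, but the architecture is the same):

```python
# Minimal sketch of the Section 2.2 classifier: BERT's final [CLS] state
# feeds a linear layer for the binary good/bad prediction.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # linear layer on top of [CLS]

segments = ["the food portions were huge",
            "i went there last tuesday with my husband"]
labels = torch.tensor([1, 0])  # 1 = good justification, 0 = bad

batch = tokenizer(segments, padding=True, truncation=True,
                  return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss, as in Section 2.2
out.loss.backward()                  # one optimizer step would follow
```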

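The aspect-matching rule of Section 2.3 is simple enough to sketch directly; the naive "+s" pluralizer below is a simplification of whatever inflection handling the lexicon provides:

```python
# Sketch of the aspect-matching rule: an aspect is considered present if
# its singular or plural form appears in the tokenized justification.
def aspects_in(justification, aspect_lexicon):
    tokens = set(justification.lower().split())
    found = []
    for aspect in aspect_lexicon:
        if aspect in tokens or aspect + "s" in tokens:
            found.append(aspect)
    return found

print(aspects_in("The seams and stitching are really nice",
                 ["seam", "stitching", "fabric"]))  # -> ['seam', 'stitching']
```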
3 Approach

3.1 Problem Definition

For each user u (or item i), we build a justification reference D = {d_1, . . . , d_{l_r}} consisting of justifications that the user has written (or justifications about the item) in the training set, where l_r is the maximum number of justifications. We also obtain a user persona (or item profile) A = {a_1, . . . , a_K} based on the fine-grained aspects that the user's (or item's) previous justifications have covered, where K is the maximum number of aspects.

Given a user u and an item i, as well as their justification references D_u and D_i, u's persona A_u, and i's profile A_i, our target is to predict the justification J_{u,i} = {w_1, w_2, . . . , w_T} that would explain why item i fits user u's interests, where T is the length of the justification.

Figure 1: Structure of the reference-based Seq2Seq model with Aspect Planning.

3.2 Reference-based Seq2Seq Model

Our base model follows the structure of a standard Seq2Seq model (Sutskever et al., 2014). Our framework, called 'Ref2Seq', views the historical justifications of users and items as references and learns latent personalized features from them. Figure 1 shows the structure of the model. It includes two components: (1) two sequence encoders that learn user and item latent representations by taking previous justifications as references; and (2) a sequence decoder that incorporates the representations of users and items to generate personalized justifications.

Sequence Encoders. The user encoder and item encoder share the same structure, which includes an embedding layer, a two-layer bi-directional GRU (Cho et al., 2014), and a projection layer. The input is a user (or item) reference D consisting of a set of historical justifications. These justifications pass through a word embedding layer and then through the GRU, yielding a sequence of hidden states e \in R^{l_s \times l_r \times n}:

    E = \mathrm{Embedding}(D), \quad e = \mathrm{GRU}(E) = \overrightarrow{e} + \overleftarrow{e},    (1)

where l_s denotes the length of the sequence, n is the hidden size of the encoder GRU, E \in R^{l_s \times l_r \times n} is the embedded sequence representation, and \overrightarrow{e} and \overleftarrow{e} are the hidden vectors produced by a forward and a backward GRU, respectively.

To combine information from different 'references' (i.e., justifications), the hidden states are then projected via a linear layer:

    \hat{e} = W_e \cdot e + b_e,    (2)

where \hat{e} \in R^{l_s \times n} is the final output of the encoder, and W_e \in R^{l_r}, b_e \in R are learned parameters.

Sequence Decoder. The decoder is a two-layer GRU that predicts the target words given a start token. Its hidden state is initialized with the sum of the last hidden states of the user and item encoders, and at each time-step it is updated via the GRU unit based on the previous hidden state and the input word. Specifically:

    h_0 = e^u_{l_s} + e^i_{l_s}, \quad h_t = \mathrm{GRU}(w_t, h_{t-1}),    (3)

where e^u_{l_s} and e^i_{l_s} are the last hidden states of the user and item encoder outputs \hat{e}^u and \hat{e}^i.

To explore the relation between the references and the generation, we apply an attention fusion layer to summarize the output of each encoder. For the user and item reference encoders, the attention vector is defined as:

    a^1_t = \sum_{j=1}^{l_s} \alpha^1_{tj} e_j, \quad
    \alpha^1_{tj} = \exp(\tanh(v_\alpha^{1\top}(W_\alpha^1 [e_j; h_t] + b_\alpha^1))) / Z,    (4)

where a^1_t \in R^n is an attention vector over the sequence encoder at time-step t, \alpha^1_{tj} is an attention score over the encoder hidden state e_j and decoder hidden state h_t, and Z is a normalization term.

Aspect-Planning Generation. One of the challenges in generating justifications is improving controllability, i.e., directly manipulating the content being generated. Inspired by 'plan-and-write' (Yao et al., 2019), we extend the base model to an Aspect-Planning Ref2Seq (AP-Ref2Seq) model in which we plan a fine-grained aspect before generation. This aspect planning can be considered an extra form of supervision, rather than a hard constraint, that makes justification generation more controllable.
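A minimal PyTorch sketch of one reference encoder (Eqs. 1-2) and the attention fusion of Eq. (4) is given below. It assumes the l_r references are concatenated into a single token sequence, so the projection acts on the hidden dimension rather than across references as the W_e of Eq. (2) does; hyper-parameters follow Section 4.3.

```python
# Sketch of a Ref2Seq reference encoder and the attention fusion layer.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Eqs. (1)-(2): embed a reference, run a 2-layer BiGRU, project."""
    def __init__(self, vocab_size, n=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n)
        self.gru = nn.GRU(n, n, num_layers=2, bidirectional=True,
                          batch_first=True)
        self.proj = nn.Linear(n, n)  # simplified stand-in for W_e, b_e

    def forward(self, ref_tokens):               # (batch, ls)
        out, _ = self.gru(self.emb(ref_tokens))  # (batch, ls, 2n)
        fwd, bwd = out.chunk(2, dim=-1)          # split the two directions
        return self.proj(fwd + bwd)              # Eq. (1) sum, then project

class AttentionFusion(nn.Module):
    """Eq. (4): score each encoder state against the decoder state h_t."""
    def __init__(self, n=256):
        super().__init__()
        self.W = nn.Linear(2 * n, n)
        self.v = nn.Linear(n, 1, bias=False)

    def forward(self, enc, h_t):                 # enc: (b, ls, n); h_t: (b, n)
        h = h_t.unsqueeze(1).expand(-1, enc.size(1), -1)
        scores = torch.tanh(self.v(self.W(torch.cat([enc, h], dim=-1))))
        alpha = torch.softmax(scores, dim=1)     # the exp(.)/Z of Eq. (4)
        return (alpha * enc).sum(dim=1)          # a_t: (b, n)
```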

When generating the justification for user u and item i, we first provide a fine-grained aspect a as a plan. The aspect a is fed into the word embedding layer to obtain the aspect embedding E_a. Then, we compute the score between the embedding of the aspect and the decoder hidden state as:

    a^2_t = \alpha^2_t E_a, \quad
    \alpha^2_t = \exp(\tanh(v_\alpha^{2\top}(W_\alpha^2 [E_a; h_t] + b_\alpha^2))) / Z,    (5)

where a^2_t \in R^n is an attention vector and \alpha^2_t is an attention score.

The attention vectors a^1_{ut} of user u and a^1_{it} of item i, together with the attention vector a^2_t of the fine-grained aspect a, are concatenated with the decoder hidden state at time-step t and projected to obtain the output word distribution P. The output probability for word w at time-step t is given by:

    p(w_t) = \tanh(W_1 [h_t; a^1_{ut}; a^1_{it}; a^2_t] + b_1),    (6)

where w_t is the target word at time-step t. Given the probability p(w_t) at each time-step t, the model is trained with a cross-entropy loss against the ground-truth sequence.

3.3 Aspect Conditional Masked Language Model

Though Seq2Seq-based models can achieve high-quality output, they often fail to generate diverse content. Recent work in natural language generation (NLG) has combined generation methods with information retrieval techniques to increase generation diversity (Li et al., 2018; Baheti et al., 2018). The basic idea follows the retrieve-and-edit paradigm: first retrieve historical responses as templates, then edit the templates into new content. Since our data is annotated with fine-grained aspects, it naturally fits this paradigm. Meanwhile, masked language models have shown strong performance in language modeling, and recent work (Wang and Cho, 2019; Mansimov et al., 2019) has shown that sampling from a masked language model (e.g. BERT) can generate coherent sentences.

Inspired by this work, we extend such an approach into a conditional version: we explore the use of an Aspect Conditional Masked Language Model (ACMLM) to generate diverse personalized justifications. Figure 2 shows the structure of the model. For a justification J_{u,i} that user u wrote about item i, we adapt the pre-trained BERT model (Devlin et al., 2019) into an encoder-decoder network with (1) an aspect encoder which encodes the user persona and item profile into latent representations, and (2) a masked language model sequence decoder that takes in a masked justification and predicts the masked tokens.

Aspect Encoder. Our aspect encoder shares the same WordPiece embeddings (Wu et al., 2016) as BERT. The encoder feeds the intersection of the fine-grained aspects from the user persona and item profile, A_{ui} = {a_1, . . . , a_{K'}}, into the embedding layer and obtains the aspect embedding A_{ui} \in R^{K' \times n}, where K' is the number of common fine-grained aspects and n is the dimension of the WordPiece embeddings.

Masked Language Model Sequence Decoder. We use the masked language model of the pre-trained BERT model as our sequence decoder and add attention over the aspect encoder's output. As shown in Figure 2, the input to the decoder is a masked justification J^M_{u,i} = {w_1, . . . , w_T} with multiple tokens replaced by [MASK]. The decoder's output T \in R^{T \times n} is then fed to the attention layer to calculate an attention score against the output of the encoder:

    a^3_t = \sum_{j=1}^{K'} \alpha^3_{tj} A_j, \quad
    \alpha^3_{tj} = \exp(\tanh(v_\alpha^{3\top}(W_\alpha^3 [A_j; T_t] + b_\alpha^3))) / Z.    (7)

The attention vector a^3_t is then concatenated with the decoder hidden state at time-step t and sent to a linear projection layer to obtain the output word distribution P. The output probability for word w at time-step t is given by:

    p(w_t) = \tanh(W_2 [T_t; a^3_t] + b_2),    (8)

where w_t is the target word at time-step t.

Masking Procedure. The original BERT paper applies a flat rate (15%) to decide whether to mask a token. Unlike their approach, we adopt a higher masking rate for fine-grained aspects, since they are more important in justifications: if we encounter a fine-grained aspect, we replace it with a [MASK] token 30% of the time, while other words are replaced with a [MASK] token 15% of the time.

During training, the model only predicts the masked tokens, and the cross-entropy loss is calculated on them.
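The masking procedure amounts to a per-token coin flip with an aspect-dependent rate; a minimal sketch:

```python
# Sketch of the ACMLM masking procedure: fine-grained aspect tokens are
# masked at a 30% rate, all other tokens at the standard 15% rate.
import random

def mask_justification(tokens, aspect_set, mask_token="[MASK]"):
    masked, targets = [], []
    for tok in tokens:
        rate = 0.30 if tok in aspect_set else 0.15
        if random.random() < rate:
            masked.append(mask_token)
            targets.append(tok)      # loss is computed only on masked positions
        else:
            masked.append(tok)
            targets.append(None)     # ignored by the cross-entropy loss
    return masked, targets

tokens = "atmosphere and service are top notch".split()
print(mask_justification(tokens, aspect_set={"atmosphere", "service"}))
```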

Figure 2: Structure of the Aspect Conditional Masked Language Model.

Iter 0:  universe [MASK] is extremely friendly and persona ##ble
Iter 5:  the [MASK] is extremely friendly and persona ##ble
Iter 10: the [MASK] is extremely friendly and persona ##ble
Iter 15: the staff are extremely cool and persona ##ble
Iter 20: the staff are extra kind , persona ##ble

Table 4: Examples of the generation output of ACMLM at different iterations.

Generation by Sampling from Masked Templates. We next discuss how to generate justifications from the trained ACMLM, following the sampling strategy of Wang and Cho (2019). Instead of generating from a sequence of all [MASK] tokens, we start from masked templates built from historical justifications about the target item. These masked templates include prior knowledge about the item and can speed up sampling convergence. Table 4 shows an example of the generation process. We initialize the template sequence X^0 as (universe, [MASK], . . . , ##ble) with length T. At each iteration i, a position t_i is sampled uniformly at random from {1, . . . , T}, and the token at t_i (i.e., x^i_{t_i}) of the current sequence X^i is replaced by [MASK]. We then obtain the conditional probability of x_{t_i} as

    p(x_{t_i} | X^i_{\setminus t_i}) = \frac{1}{Z(X^i_{\setminus t_i})} \exp(\mathbb{1}(x_{t_i})^\top f_\theta(X^i_{\setminus t_i})),    (9)

where \mathbb{1}(x_{t_i}) is a one-hot vector with index x_{t_i} set to 1, X^i_{\setminus t_i} is the sequence obtained after replacing the token at position t_i of X^i by [MASK], f_\theta(X^i_{\setminus t_i}) is the output of feeding X^i_{\setminus t_i} into the ACMLM as in Equation (8), and Z is the normalization term. We then sample x̃_{t_i} from Equation (9) and construct the next sequence as X^{i+1} = (x^i_1, . . . , x̃_{t_i}, . . . , x^i_T). After repeating this procedure N times, the final output is taken as the generation output. (We set N proportional to the length T of the initial masked template, to prevent the generation from diverging too much from the original template.)
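This sampling loop is a Gibbs-style procedure; a sketch follows, where acmlm_logits stands in for the forward pass of Eq. (8) (returning a (T, vocab)-shaped tensor) and is an assumed interface rather than part of the released code.

```python
# Sketch of the template-based sampling of Eq. (9): repeatedly pick a
# position uniformly at random, re-mask it, and resample that token from
# the ACMLM's conditional distribution.
import random
import torch

def sample_from_template(template_ids, acmlm_logits, mask_id, n_steps):
    x = list(template_ids)                       # X^0, the masked template
    for _ in range(n_steps):                     # N steps, proportional to T
        t = random.randrange(len(x))             # t_i ~ Uniform{1, ..., T}
        x[t] = mask_id                           # replace x_{t_i} by [MASK]
        probs = torch.softmax(acmlm_logits(x)[t], dim=-1)
        x[t] = torch.multinomial(probs, 1).item()  # sample a new token at t_i
    return x
```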

4 Experiments

4.1 Datasets

With our proposed pipeline (Section 2), we construct two personalized justification datasets from existing review data: Yelp (https://www.yelp.com/dataset/challenge) and Amazon Clothing (http://jmcauley.ucsd.edu/data/amazon). We filter out users with fewer than five justifications. For each user, we randomly hold out two samples from all of their justifications to construct the Dev and Test sets. Table 5 shows the statistics of our two datasets.

Dataset          Train      Dev      Test     # Users  # Items  # Aspects
Yelp             1,219,962  115,907  115,907  115,907  51,948   2,041
Amazon Clothing  202,528    57,947   57,947   57,947   50,240   581

Table 5: Statistics of our datasets.

4.2 Baselines

For automatic evaluation, we consider three baselines. Item-Rand randomly chooses a justification from the item's historical justifications. LexRank (Erkan and Radev, 2004) is a strong unsupervised baseline that is widely used in text summarization; given all historical justifications about an item, LexRank selects one justification as the summary, which we then use as the justification for all users. Attr2Seq (Dong et al., 2017) is a Seq2Seq baseline that uses attributes (i.e. user and item identity) as input.

By default, all models use beam search during generation. Recently, several works have shown that the output of sampling-based decoding is more diverse and better suited to high-entropy tasks (Holtzman et al., 2019). To this end, we also explore another decoding strategy, 'top-k sampling' (Radford et al., 2019), in our experiments, and include a variant of our model: Ref2Seq (Top-k). (At each time-step, the next word is sampled from the top k possible next tokens, according to their probabilities.)

For human evaluation, we include two further baselines: Ref2Seq (Review) and Ref2Seq (Tip), which are the same model as Ref2Seq but trained on the original review and tip data, respectively. Comparison with these two baselines demonstrates that training on our annotated dataset tends to generate text more suitable as justifications.

Yelp
Model            BLEU-3  BLEU-4  Distinct-1  Distinct-2
Item-Rand        0.440   0.150   2.766       20.151
LexRank          2.290   0.920   1.738       8.509
Attr2Seq         7.890   0.000   0.049       0.095
Ref2Seq          4.380   2.450   0.188       1.163
AP-Ref2Seq       3.390   1.830   0.326       2.094
Ref2Seq (Top-k)  1.630   0.700   0.818       11.927
ACMLM            0.700   0.280   1.322       14.319

Amazon Clothing
Model            BLEU-3  BLEU-4  Distinct-1  Distinct-2
Item-Rand        1.620   0.680   2.400       11.853
LexRank          3.480   2.250   2.407       14.956
Attr2Seq         1.720   0.560   0.076       0.352
Ref2Seq          8.780   5.670   0.141       1.240
AP-Ref2Seq       13.910  12.500  0.557       3.661
Ref2Seq (Top-k)  3.960   2.130   0.697       10.858
ACMLM            2.420   1.590   0.942       9.312

Table 6: Performance on automatic evaluation.

4.3 Implementation Details

We use PyTorch (http://pytorch.org/docs/master/index.html) to implement our models. For Ref2Seq and AP-Ref2Seq, we set the hidden size and word embedding size to 256. We apply a dropout rate of 0.5 for the encoder and 0.2 for the decoder. The size of the justification reference l_r is set to 5, and the number of fine-grained aspects K in the user persona and item profile is set to 30. We train the models using Adam with learning rate 2e-4 and stop training either when 20 epochs are reached or when the perplexity on the Dev set stops improving. For ACMLM, we build our model on the BERT implementation from HuggingFace (https://github.com/huggingface/pytorch-pretrained-BERT). We initialize our decoder with the pre-trained 'Bert-base' model and set the maximum sequence length to 30. We train the model for 5 epochs using Adam with learning rate 2e-5. For models using beam search, we set the beam size to 10. For models using top-k sampling, we set k to 5. For ACMLM, we use a burn-in step equal to the length of the initial sequence. Our data and code are available online (https://github.com/nijianmo/recsys_justification.git).
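A single step of the top-k decoding used by Ref2Seq (Top-k) can be sketched as follows, with k = 5 as above:

```python
# Sketch of one 'top-k sampling' step: draw the next word from the k most
# probable tokens, renormalized by their probabilities.
import torch

def top_k_sample(logits, k=5):
    top_logits, top_ids = torch.topk(logits, k)   # k best next tokens
    probs = torch.softmax(top_logits, dim=-1)     # renormalize over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return top_ids[choice].item()

next_word_id = top_k_sample(torch.randn(10000))   # logits over a 10k vocab
```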

4.4 Automatic Evaluation

For automatic evaluation, we use BLEU, Distinct-1, and Distinct-2 (Li et al., 2015) to measure the performance of our models. As shown in Table 6, our reference-based models achieve the highest BLEU scores on both datasets, except for BLEU-3 on Yelp. This confirms that Ref2Seq is able to exploit user and item content to generate the most relevant output, compared with unpersonalized models such as LexRank and with personalized models that do not leverage historical justifications, such as Attr2Seq.

On the other hand, recent works have reported that models achieving higher diversity scores tend to score lower on overlap-based metrics (e.g. BLEU) for open-domain generation tasks (Baheti et al., 2018; Gao et al., 2018). We make a similar observation for our personalized justification generation task. As shown in Table 6, both sampling-based methods, Ref2Seq (Top-k) and ACMLM, achieve higher Distinct-1 and Distinct-2, while their BLEU scores are lower than those of Seq2Seq-based models using beam search. We therefore also perform human evaluation to validate the generation quality of our proposed methods.

4.5 Human Evaluation

We conduct human evaluation on three aspects: (1) Relevance measures whether the generated output contains information relevant to an item; (2) Informativeness measures whether the generated justification includes specific information that is helpful to users; and (3) Diversity measures how distinct the generated output is compared with other justifications.

We focus on the Yelp dataset and sample 100 generated examples from each of the five models shown in Table 7. Human annotators are asked to give a score in the range [1, 5] (lowest to highest) for each metric, and each example is rated by at least three annotators. The results show that both Ref2Seq (Top-k) and ACMLM achieve higher scores on Diversity and Informativeness than the other models.

Model             R     I     D
Ref2Seq (Review)  3.02  2.39  2.10
Ref2Seq (Tip)     3.25  2.35  2.34
Ref2Seq           3.87  3.13  2.96
Ref2Seq (Top-k)   3.95  3.34  3.39
ACMLM             3.23  3.29  3.42

Table 7: Performance on human evaluation, where R, I, and D represent Relevance, Informativeness, and Diversity, respectively.

4.6 Qualitative Analysis

Here we study two qualitative questions.

Shake Shack
  Ground Truth: The burger was good
  LexRank: A great burger and fries.
  Ref2Seq (Review): i love trader joe 's , i love trader joe 's
  Ref2Seq (Tip): this place is awesome
  Ref2Seq: this place has some of the best burgers
  Ref2Seq (Top-k): the fries are amazing
  ACMLM: breakfast sandwiches are overall very filling
Teharu Sushi
  Ground Truth: The rolls are pretty great , typical rolls not that many specials
  LexRank: Sushi ?
  Ref2Seq (Review): the food was good and the service was great
  Ref2Seq (Tip): love this place
  Ref2Seq: the sushi is delicious
  Ref2Seq (Top-k): fresh and delicious sushi
  ACMLM: overall fun experience with half price sushi
MGM Grand Hotel
  Ground Truth: Room was very clean comfortable
  LexRank: Great rooms.
  Ref2Seq (Review): i love this place ! the food is always good and the service is always great
  Ref2Seq (Tip): come here
  Ref2Seq: the room was nice
  Ref2Seq (Top-k): open hotel for hours
  ACMLM: family style dinner , long time shopping trip to vegas , family dining , cheap lunch

Table 8: Comparisons of the generated justifications from different models for three businesses on the Yelp dataset.

RQ1: How do training data and methods affect generation? As Table 8 shows, models trained on reviews and tips tend to generate generic phrases (such as 'i love this place') which often do not include information that helps users make decisions. Models trained on our justification datasets tend to mention concrete information (e.g. different aspects). LexRank tends to generate relevant but short content. Meanwhile, sampling-based models are able to generate more diverse content.

Dataset          Aspect    Generated Output
Yelp             dining    the dining room is nice
                 pastry    the pastries were pretty good
                 chicken   the chicken fried rice is the best
                 sandwich  the pulled pork sandwich is the best thing on the menu
Amazon-Clothing  product   great product , fast shipping
                 price     design is nice , good price
                 leather   comfortable leather sneakers . classic
                 walking   sturdy , great city walking shoes

Table 9: Generated justifications from AP-Ref2Seq. The planned aspects are randomly selected from users' personas.

RQ2: How does aspect planning affect generation? To mitigate the trade-off between diversity and relevance, one approach is to add more constraints during generation, such as constrained beam search (Anderson et al., 2017). In our work, we extend our base model Ref2Seq by incorporating aspect-planning to guide generation. As shown in Table 9, most planned aspects are present in the generated outputs of AP-Ref2Seq.
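For reference, the Distinct-n diversity metrics reported in Table 6 follow Li et al. (2015); the sketch below uses one common formulation (distinct n-grams divided by total generated tokens), which is our assumption about the exact normalization.

```python
# Sketch of the Distinct-n diversity metric over a set of generations.
def distinct_n(generations, n):
    ngrams, total = set(), 0
    for text in generations:
        tokens = text.split()
        total += len(tokens)
        ngrams.update(tuple(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total, 1)

outs = ["the sushi is delicious", "the room was nice"]
print(distinct_n(outs, 1), distinct_n(outs, 2))
```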

5 Related Work

Explainable Recommendation. There has been a line of work studying how to improve the explainability of recommender systems. Catherine and Cohen (2017) learn latent representations of review text to predict ratings; these representations are then used to find the most helpful reviews for a given user and item pair. Another popular direction is to generate text to justify recommendations. Dong et al. (2017) proposed an attribute-to-sequence model that uses categorical attributes to generate product reviews. Ni et al. (2017) developed a multi-task learning method that combines collaborative filtering and review generation. Li et al. (2019b) generated tips by considering 'persona' information, which can capture the language style of users and the characteristics of items. However, these works use whole reviews or tips as training examples, which may not be appropriate given the quality of review text. More recently, Liu et al. (2019) proposed a framework to generate fine-grained explanations for text classification. To obtain labels for human-readable explanations, they constructed a dataset from a website which provides ratings and fine-grained summaries written by users; unfortunately, most websites do not provide such fine-grained information. In contrast, our work identifies justifications from reviews, uses them as training examples, and shows through extensive experiments that they are a better data source for explainable recommendation.

Diversity-aware NLG. Diversity is an important aspect of NLG systems. Recent works have focused on incorporating prior knowledge to improve generation diversity. Yao et al. (2019) proposed a method to incorporate planned story-lines into story generation. Li et al. (2019a) developed an aspect-aware coarse-to-fine review generation method: they predict an aspect for each sentence in the review to capture the content flow; given the aspects, a sequence of sentence sketches is generated, and a decoder fills in the slots of each sketch. In dialogue systems, several works have studied frameworks that extract templates from historical responses, which are then edited to form new responses (Weston et al., 2018; Wu et al., 2018). Similarly, the extract-and-edit paradigm has been studied for style transfer in NLG (Li et al., 2018). Wu et al. (2019) proposed an attribute-aware masked language model for non-parallel sentiment transfer: they first mask out the sentiment tokens and then train a masked language model to infill the masked positions with the target sentiment. In this work, we also introduce a conditional masked language model, but we consider more fine-grained aspects.

6 Conclusion

In this work, we studied the problem of personalized justification generation. To build high-quality justification datasets, we provided an annotated dataset and proposed a pipeline to extract justifications from massive review corpora. To generate convincing and diverse justifications, we developed two models: (1) Ref2Seq, which leverages historical justifications as references during generation; and (2) ACMLM, an aspect-conditional model built on a pre-trained masked language model. Our experiments showed that Ref2Seq achieves higher scores (in terms of BLEU) and ACMLM achieves higher diversity scores compared with baselines. Human evaluation showed that reference-based models obtain high relevance scores and that sampling-based methods lead to more diverse and informative outputs. Finally, we showed that aspect-planning is a promising way to guide generation towards personalized and relevant justifications.

Acknowledgements. This work is partly supported by NSF #1750063. We thank all the reviewers for their constructive suggestions.

References

Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided open vocabulary image captioning with constrained beam search. In EMNLP.

Stefanos Angelidis and Mirella Lapata. 2018. Summarizing opinions: Aspect extraction meets sentiment prediction and they are both weakly supervised. In EMNLP.

Ashutosh Baheti, Alan Ritter, Jiwei Li, and William B. Dolan. 2018. Generating more interesting responses in neural conversation models with distributional constraints. In EMNLP.

Parminder Bhatia, Yangfeng Ji, and Jacob Eisenstein. 2015. Better document-level sentiment analysis from RST discourse parsing. In EMNLP.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Rose Catherine and William W. Cohen. 2017. TransNets: Learning to transform for recommendation. In RecSys.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.

Jacob Willem Cohen. 1960. A coefficient of agreement for nominal scales.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. Learning to generate product reviews from attributes. In EACL.

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457-479.

Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, and Shuming Shi. 2018. Generating multiple diverse responses for short-text conversation. In AAAI.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735-1780.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. CoRR, abs/1904.09751.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and William B. Dolan. 2015. A diversity-promoting objective function for neural conversation models. In HLT-NAACL.

Juncen Li, Robin Jia, He He, and Percy S. Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. In NAACL-HLT.

Junyi Li, Wayne Xin Zhao, Ji-Rong Wen, and Yang Song. 2019a. Generating long and informative reviews with aspect-aware coarse-to-fine decoding. In ACL.

Piji Li, Zihao Wang, Lidong Bing, and Wai Lam. 2019b. Persona-aware tips generation. In WWW.

Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In SIGIR.

Hui Liu, Qingyu Yin, and William Yang Wang. 2019. Towards explainable NLP: A generative explanation framework for text classification. In ACL.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization.

Elman Mansimov, Alex Wang, and Kyunghyun Cho. 2019. A generalized framework of sequence generation with application to undirected sequence models. arXiv, abs/1905.12790.

Jianmo Ni, Zachary C. Lipton, Sharad Vikram, and Julian J. McAuley. 2017. Estimating reactions and recommending products with generative models of reviews. In IJCNLP.

Jianmo Ni and Julian McAuley. 2018. Personalized review generation by expanding phrases and attending on aspect-aware representations. In ACL.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.

Alex Wang and Kyunghyun Cho. 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. CoRR, abs/1902.04094.

Yizhong Wang, Sujian Li, and Jingfeng Yang. 2018. Toward fast and accurate neural discourse segmentation. In EMNLP.

Jason Weston, Emily Dinan, and Alexander H. Miller. 2018. Retrieve and refine: Improved sequence generation models for dialogue. In SCAI@EMNLP.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model to sentiment transfer. In IJCAI.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.

Yu Wu, Furu Wei, Shaohan Huang, Yunli Wang, Zhoujun Li, and Ming Zhou. 2018. Response generation by context-aware prototype editing. In AAAI.

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. In AAAI.

Yongfeng Zhang, Guokun Lai, Min Zhang, Yi Zhang, Yiqun Liu, and Shaoping Ma. 2014. Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR.
